Zero-Inflated Data Analysis Using Graph Neural Networks with Convolution

Jun, Sunghae

doi:10.3390/computers15020104

Department of Data Science, Cheongju University, Cheongju 28503, Republic of Korea

Computers 2026, 15(2), 104; https://doi.org/10.3390/computers15020104

Submission received: 15 January 2026 / Revised: 30 January 2026 / Accepted: 31 January 2026 / Published: 2 February 2026

Abstract

Zero-inflated count data are characterized by an excessive frequency of zeros that cannot be adequately modeled by a single distribution, such as the Poisson or negative binomial. This problem is pervasive in many practical applications, including document–keyword matrices derived from text corpora, where most keyword frequencies are zero. Conventional statistical approaches, such as the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models, explicitly separate a structural zero component from a count component, but they typically assume independent observations and can be unstable when covariates are high-dimensional and sparse. To address these limitations, this paper proposes a graph-based zero-inflated learning framework that combines simple graph convolution (SGC) with zero-inflated count regression heads such as ZIP and ZINB. We first construct an observation graph by connecting similar samples, and then apply SGC to propagate and smooth features over the graph, producing convolutional representations that incorporate neighborhood information while remaining computationally lightweight. The resulting representations are used as covariates in ZIP and ZINB heads, which preserve probabilistic interpretability through maximum likelihood learning. Our experiments on simulated zero-inflated datasets with controlled zero ratios show that the proposed ZIP+SGC and ZINB+SGC consistently reduce prediction errors compared with their non-graph baselines, as measured by mean absolute error and root mean squared error. Overall, the proposed approach provides an efficient and interpretable way to integrate graph neural computation with zero-inflated modeling for sparse count prediction problems.

Keywords: zero-inflated data; graph neural networks; zero-inflated Poisson; zero-inflated negative binomial; convolution; simple graph convolution

1. Introduction

Zero-inflated (ZI) count data arise when the observed number of zeros is substantially larger than what standard count distributions would predict [1,2,3]. This phenomenon has been widely reported in diverse application domains and is particularly common in high-dimensional sparse settings such as document–keyword matrices [4,5,6,7]. In such data, zeros may occur not only because the expected count is small (sampling zeros), but also because a separate mechanism generates structural zeros [8,9,10]. For example, in a document–keyword matrix, each entry represents the frequency of a keyword in a document, and most entries are zero because each document contains only a small subset of the overall terms [2,6,7,11]. Similar ZI patterns are also common in scientific text mining, biomedical measurements, and other sparse event-count environments [1,4,5,12,13,14,15,16,17].

A large body of statistical literature has developed explicit probabilistic models for ZI counts. The representative approaches include the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models, which combine a Bernoulli component for structural zeros with a Poisson or negative binomial component for positive counts [1,8,14,18]. These models improve interpretability by separating the inflation process from the count process, and ZINB further accommodates overdispersion through the variance structure of the negative binomial distribution.

However, classical ZI regression is often challenged by modern data settings [8,9,19,20,21,22]. First, the covariates, such as the keyword frequencies in a document–keyword matrix, can be high-dimensional and sparse. Second, the inflation and count parts may suffer from identifiability problems and optimization instability under strong sparsity. Third, most ZI models treat observations as independent, thus ignoring potentially useful relationships among samples.

In parallel, machine learning has explored flexible predictive models for sparse counts [23,24,25]. Two-part learning strategies, which combine classification of zero versus non-zero with regression for positive values, generalize the hurdle perspective, while likelihood-based deep models estimate distributional parameters using neural networks. Although these methods can handle nonlinearity and large feature spaces, they still typically rely on the independent and identically distributed assumption at the sample level. In many applications, however, samples are not isolated. Text documents, for instance, form natural similarity structures such as shared terminology, overlapping technical domains, and co-occurrence patterns. Exploiting such relational structure can act as a form of information sharing, which is particularly beneficial under severe sparsity and noise.

Graph neural networks (GNNs) provide a general mechanism for representation learning on graphs by aggregating and transforming neighborhood information [26,27,28,29]. Among them, the graph convolutional network (GCN) is a widely used architecture that performs feature mixing via normalized adjacency-based propagation [30,31,32,33]. However, deep GCN stacks can introduce redundant computation and may require careful tuning to avoid over-smoothing. Simple graph convolution (SGC) simplifies GCN by collapsing multiple propagation steps into a linear graph filter followed by a simple predictor, yielding substantial computational benefits while retaining competitive performance in many tasks [30].

The motivations for studying the proposed method in this paper are as follows:

- ZI count data with high-dimensional sparsity often lead to unstable estimation and degraded prediction accuracy in conventional ZI regression.
- Most classical ZIP and ZINB formulations assume independent observations and ignore potentially informative relational structure among samples, such as document similarity.
- Graph-based convolution can enable information sharing across similar samples, but practical ZI modeling benefits from a lightweight and interpretable convolution, such as SGC.

We state the research questions (RQs) of our paper as follows:

- RQ1: Can incorporating graph convolution via SGC improve the predictive performance of classical zero-inflated models such as ZIP and ZINB for sparse zero-inflated count data?
- RQ2: Are the performance gains of graph-enhanced models (ZIP+SGC, ZINB+SGC) stable under different zero-inflation ratios and repeated runs, as measured by MAE and RMSE?

In addition, we formulate the working hypotheses of our research as follows:

- Hypothesis 1: ZIP+SGC achieves lower MAE and RMSE than the baseline ZIP under the same experimental setting.
- Hypothesis 2: ZINB+SGC achieves lower MAE and RMSE than the baseline ZINB under the same experimental setting.

Furthermore, the contributions of this paper can be summarized in three points as follows:

- We propose a unified framework that couples SGC-based graph propagation with ZIP and ZINB heads for zero-inflated count prediction, yielding ZIP+SGC and ZINB+SGC.
- The proposed method is lightweight and interpretable; graph convolution is implemented via a fixed propagation operator, while the ZI heads are trained by standard maximum likelihood estimation (MLE).
- Through controlled simulation experiments under varying zero ratios, we demonstrate consistent reductions in mean absolute error (MAE) and root mean squared error (RMSE) compared with the corresponding non-graph baselines.

To provide a clear roadmap of the proposed approach, we summarize the methodology as follows:

Step 1. Data preparation: Represent the observations as a feature matrix X with a zero-inflated target count y.

Step 2. Graph construction: Build an observation graph by connecting each sample to its k-nearest neighbors using a similarity measure, forming an adjacency matrix A.

Step 3. Normalization and self-loops: Add self-loops and compute a normalized propagation operator to stabilize message passing and prevent numerical scaling issues.

Step 4. SGC propagation: Apply SGC propagation for K steps to obtain graph-smoothed features, capturing relational information among observations.

Step 5. Zero-inflated modeling: Fit probabilistic ZIP and ZINB regression heads using the propagated features, yielding ZIP+SGC and ZINB+SGC, and estimate parameters via maximum likelihood.

Step 6. Evaluation and comparison: Evaluate predictive performance using MAE and RMSE, and compare against baseline ZIP and ZINB without graph propagation.

The remainder of the paper is organized as follows: Section 2 reviews statistical and machine learning approaches to zero-inflated modeling. Section 3 presents the proposed ZI-SGC methodology. Section 4 reports experimental results, and Section 5 concludes with a discussion and future directions.

2. Zero-Inflated Models

2.1. Zero-Inflated Models with Statistics

The statistical approach to zero-inflated models has been established primarily within the generalized linear model (GLM) framework [34,35]. The basic idea is to model the inflation part, which describes the probability of a structural zero, and the count part, which describes the mean of a count distribution, with separate link functions. For example, the structural-zero probability $\pi$, the parameter of a Bernoulli distribution, is typically linked to explanatory variables via a logit link, while the mean $\mu$ of the Poisson distribution is typically linked to explanatory variables via a log link. Because zero-inflated data cannot be explained by a single probability distribution, zero-inflated models combine two probability distributions. The zero-inflated Poisson (ZIP) model combines Bernoulli and Poisson distributions, and the zero-inflated negative binomial (ZINB) model combines Bernoulli and negative binomial distributions [8,9]. In both models, the Bernoulli component governs whether $Y$ is a structural zero, and the Poisson or negative binomial component describes the remaining counts, including sampling zeros [9,18]. In general, statistical zero-inflated models are defined as follows [10]:

$$P(Y = 0) = \pi + (1 - \pi) f(Y = 0), \qquad P(Y > 0) = (1 - \pi) f(Y > 0) \tag{1}$$

In Equation (1), $\pi$ is the probability of a structural zero and $f(\cdot)$ is the probability distribution of the count component, such as the Poisson or negative binomial distribution. The advantages of a statistical approach to zero-inflated models include the interpretability of parameters, MLE and likelihood-based inference, and relatively clear model selection and diagnostics. Furthermore, ZINB can account for overdispersion by leveraging the variance structure of the negative binomial distribution. However, the approach has limitations, including instability with high-dimensional covariates, identifiability issues between the inflation and count parts, and greater convergence difficulty and estimator variability when data are highly sparse.
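As a small numerical illustration of Equation (1), the following sketch evaluates the ZIP probability mass function using only the Python standard library. The function name `zip_pmf` and the values of `pi` and `mu` are illustrative assumptions, not quantities taken from the paper.

```python
import math


def zip_pmf(y: int, pi: float, mu: float) -> float:
    """P(Y = y) for a zero-inflated Poisson with structural-zero
    probability pi and Poisson mean mu (Equation (1) with Poisson f)."""
    poisson = math.exp(-mu) * mu ** y / math.factorial(y)
    if y == 0:
        # Structural zeros plus sampling zeros from the Poisson part.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


# Illustrative values (assumed, not from the paper).
pi, mu = 0.3, 2.0
p_zero = zip_pmf(0, pi, mu)
# Truncated sum over positive counts; the tail beyond 60 is negligible for mu = 2.
p_positive = sum(zip_pmf(y, pi, mu) for y in range(1, 60))

print(f"P(Y = 0) = {p_zero:.4f}, P(Y > 0) = {p_positive:.4f}")
```

With these values, the probability of a zero is larger than the plain Poisson value `exp(-mu)`, which is the excess-zero behavior that the inflation component is meant to capture.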

Bayesian formulations of zero-inflated count models have also been studied as an alternative statistical framework for handling excess zeros and over-dispersion, often providing a principled route to uncertainty quantification through posterior inference [15,16,17]. While the proposed method in this paper is not Bayesian and focuses on combining ZIP and ZINB regression heads with lightweight graph propagation, these Bayesian studies are relevant in positioning the broader landscape of zero-inflated modeling and in motivating the need for approaches that remain both interpretable and computationally efficient. In particular, uncertainty-aware, zero-inflated modeling highlights the practical importance of robust methods when data are sparse and noisy, which is also a key motivation for our graph-enhanced ZI framework.

2.2. Zero-Inflated Models with Machine Learning

Machine learning-based zero-inflated analysis models can be broadly divided into two approaches [6,7,10,17,18,23]. The first generalizes statistical two-part modeling to machine learning. This approach trains a binary classification model that distinguishes between $Y = 0$ and $Y > 0$, and separately trains a regression model that predicts the magnitude of $Y$ under the $Y > 0$ condition. These two parts can be implemented with various learners, such as logistic regression, tree-based models, and neural networks, and regularization and feature selection can improve performance for high-dimensional sparse features. The second is a likelihood-based deep learning approach, which retains a zero-inflated probability distribution but computes its parameters using a neural network. In this case, learning minimizes the negative log-likelihood rather than the error between actual and predicted values, which links the method to the statistical model. This approach provides probabilistic outputs, can learn nonlinear relationships, and easily incorporates complex feature representations. However, when data are highly sparse or have a high zero ratio, model learning can become unstable, so simplification of the inflation part, regularization, and appropriate evaluation criteria must be considered together. In summary, zero-inflated models based on machine learning offer strengths in model flexibility and high-dimensional feature processing, but ignoring the probabilistic structure of zero inflation can compromise interpretability and generalizability. Therefore, this paper combines GNNs with convolution, which reflects the relationships between observations while preserving the probabilistic structure of statistical zero-inflated heads like ZIP and ZINB, to improve prediction performance in sparse zero-inflated environments.

3. Zero-Inflated Data Analysis Using GNNs with Convolution

In this section, we present an efficient method for zero-inflated data analysis with GNNs and convolution. GNNs are neural network models that perform representation learning on graph-structured data consisting of nodes and edges. To improve clarity and ensure that each variable appearing in the equations is defined, we summarize the descriptions and ranges of all symbols used throughout this section in Table 1.

To construct our GNN model, we consider a graph $G = (V, E)$, where $V$ and $E$ denote the sets of nodes (vertices) and edges (connections), respectively [26,36,37]. GNNs are a family of neural network models that learn from graph data [29]. Among GNN models, we consider the graph convolutional network (GCN), a representative architecture that uses graph convolution, a method of normalizing and aggregating neighboring features [30]. Combining zero-inflated (ZI) modeling with simple graph convolution (SGC), we propose the ZI-SGC framework, which performs SGC-based feature propagation to reflect both the excess zeros and the relational structure between observations in zero-inflated count data, and then applies the propagated features to ZIP or ZINB models. The core idea is that transforming $X$ into $X^{(K)}$ through graph convolution denoises and smooths the features, so that the existing zero-inflated regression model can estimate parameters more stably. The SGC used in this paper is a simplified form of K-step propagation that removes the layer-wise nonlinearity of GCN. For $n$ observations ($i = 1, 2, \dots, n$), $y_i$ and $x_i$ are the target and the feature vector, respectively. We define the data $X$ as follows [30]:

$$X = (x_1, x_2, \dots, x_n)^{\top} \in \mathbb{R}^{n \times p} \tag{2}$$

In Equation (2), $p$ is the number of features. Using these data, we build a model that predicts the target from the features and obtain $\hat{y}_i$, the predicted value of $y_i$, and the estimated probability of zero $\hat{P}(y_i = 0)$. To analyze zero-inflated data using the ZI-SGC model, we construct a graph $G$ with each observation as a node. The edges of the graph are based on similarities between observations, such as cosine similarity. For sample $i$, the top-$k$ neighbor set $N_k(i)$ is constructed, and the adjacency matrix $A$ is defined as follows [30]:

$$A_{ij} = \begin{cases} w_{ij} = \mathrm{sim}(x_i, x_j), & j \in N_k(i) \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

In Equation (3), $w_{ij}$ is the weight of the edge connecting nodes $i$ and $j$. A larger $w_{ij}$ indicates that nodes $i$ and $j$ are more similar and more strongly connected, and a value of 0 indicates that the two nodes are not connected. In addition, $\mathrm{sim}(x_i, x_j)$ is a function representing the similarity between two feature vectors $x_i$ and $x_j$. Commonly used similarity functions include cosine similarity and Gaussian similarity. In this paper, we use cosine similarity and construct a k-nearest neighbor (k-NN) graph accordingly. Using the adjacency matrix $A \in \mathbb{R}^{n \times n}$, we build a GNN model with convolution via SGC. SGC propagation is defined on the normalized adjacency matrix with self-loops. First, the adjacency matrix with self-loops is defined as follows [30,32]:

$$\tilde{A} = A + I \tag{4}$$

In Equation (4), $I$ is the identity matrix, which adds a self-loop to each node. We define $\tilde{D}$ as the degree matrix of $\tilde{A}$. In addition, a propagation operator $S$ is constructed by normalizing $\tilde{A}$, as shown in Equation (5) [30,32]:

$$S = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} \tag{5}$$
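As a small worked illustration of Equations (4) and (5), consider a hypothetical three-node path graph with unit edge weights (an assumed toy example, not one of the experimental graphs). Each entry of the propagation operator is $S_{ij} = \tilde{A}_{ij} / \sqrt{\tilde{D}_{ii} \tilde{D}_{jj}}$, where $\tilde{D}_{ii}$ is the $i$-th row sum of $\tilde{A}$:

```latex
A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad
\tilde{A} = A + I = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \end{pmatrix}, \quad
\tilde{D} = \mathrm{diag}(2, 3, 2)

S = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}
  = \begin{pmatrix}
      1/2        & 1/\sqrt{6} & 0          \\
      1/\sqrt{6} & 1/3        & 1/\sqrt{6} \\
      0          & 1/\sqrt{6} & 1/2
    \end{pmatrix}
```

The resulting operator is symmetric, and each node mixes its own feature value with those of its direct neighbors; nodes 1 and 3 are not connected, so the corresponding entries remain zero after one propagation step.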

In general, we choose symmetric normalization for $S$. We compute the degree matrix $\tilde{D}$ as $\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$. SGC smooths the features through K-step propagation, computing $X^{(K)} = S^{K} X$, and then applies a linear predictor [30,32]. As $K$ increases, the propagated features reflect a wider $K$-hop neighborhood. This simplification reduces the computational load while retaining performance competitive with GCN in many tasks. $S^{K}$ acts as a low-pass filter in the graph spectral domain, which promotes feature similarity between adjacent nodes. Therefore, even in sparse, noisy, zero-inflated data, a denoising effect from neighborhood information can be expected. We fit the following zero-inflated count regression using $x_i^{(K)}$, the $i$-th row of $X^{(K)}$ obtained by SGC, as a covariate:

$$\log(\mu_i) = \beta_0 + (x_i^{(K)})^{\top} \beta, \qquad \pi_i = \sigma(\tau_0 + z_i^{\top} \tau) \tag{6}$$

In Equation (6), $\mu_i$ is the mean of the count component and $\pi_i$ is the zero-inflation probability. Also, $\sigma(\cdot)$ is the sigmoid function and $z_i$ is the covariate vector of the inflation part. The likelihood of the traditional zero-inflated model is represented as follows:

$$P(Y_i = 0) = \pi_i + (1 - \pi_i) f(0), \qquad P(Y_i = y_i) = (1 - \pi_i) f(y_i), \quad y_i > 0 \tag{7}$$

In Equation (7), $f(\cdot)$ is a probability distribution for count data, such as the Poisson or negative binomial distribution. Since we estimate the parameter $\theta$ using MLE, we use the following objective function:

$$\hat{\theta} = \arg\min_{\theta} \left\{ - \sum_{i=1}^{n} \log P(y_i \mid x_i^{(K)}, \theta) \right\} \tag{8}$$
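The objective in Equation (8) can be sketched for a ZIP head as follows. This is a minimal illustration rather than the paper's implementation: the function name, the parameter layout of `theta`, and the choice to reuse the propagated features as the inflation covariates $z_i$ are all assumptions made for the example.

```python
from math import lgamma

import numpy as np


def zip_negative_log_likelihood(theta, X_prop, y):
    """Negative log-likelihood of a ZIP head, in the spirit of Equation (8).

    theta is laid out as [beta_0, beta (p values), tau_0, tau (p values)].
    Assumption: the inflation covariates z_i equal the propagated features.
    """
    p = X_prop.shape[1]
    beta0, beta = theta[0], theta[1:p + 1]
    tau0, tau = theta[p + 1], theta[p + 2:]

    mu = np.exp(beta0 + X_prop @ beta)                  # log link, Equation (6)
    pi = 1.0 / (1.0 + np.exp(-(tau0 + X_prop @ tau)))   # logit link, Equation (6)

    # Poisson log-pmf: -mu + y * log(mu) - log(y!)
    log_factorial = np.array([lgamma(v + 1.0) for v in y])
    log_poisson = -mu + y * np.log(mu) - log_factorial

    # Equation (7): zeros mix structural and sampling zeros.
    log_prob = np.where(
        y == 0,
        np.log(pi + (1.0 - pi) * np.exp(-mu)),
        np.log(1.0 - pi) + log_poisson,
    )
    return -np.sum(log_prob)


# Tiny demonstration with all parameters at zero (mu = 1, pi = 0.5).
X_demo = np.zeros((3, 2))
y_demo = np.array([0, 1, 2])
nll_demo = zip_negative_log_likelihood(np.zeros(6), X_demo, y_demo)
```

Because $X^{(K)}$ is fixed before fitting, a function of this form could be handed to any generic numerical optimizer; which optimizer the authors used is not stated in this section.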

The advantage of SGC is that $X^{(K)}$ is computed first, so subsequent learning reduces to optimizing zero-inflated regression heads such as ZIP and ZINB with the objective in Equation (8). Finally, we predict the output using the following expectation and probability in Equation (9):

$$\hat{y}_i = E(Y_i \mid x_i^{(K)}), \qquad \hat{P}(Y_i = 0) \tag{9}$$
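For a ZIP head, the two quantities in Equation (9) have simple closed forms: the expectation is $(1 - \pi_i)\mu_i$ (the same expression holds for ZINB), and the zero probability follows from Equation (7) with $f(0) = e^{-\mu_i}$. The sketch below evaluates both; the function name and the input values are illustrative assumptions.

```python
import math


def zip_point_prediction(mu, pi):
    """Return (E[Y], P(Y = 0)) for a ZIP head with count mean mu
    and zero-inflation probability pi."""
    expected_count = (1.0 - pi) * mu              # structural zeros add nothing to the mean
    prob_zero = pi + (1.0 - pi) * math.exp(-mu)   # Equation (7) with Poisson f(0) = exp(-mu)
    return expected_count, prob_zero


# Illustrative fitted values for one observation (assumed, not from the paper).
y_hat, p_zero_hat = zip_point_prediction(mu=4.0, pi=0.25)
print(f"predicted count = {y_hat:.2f}, predicted P(Y = 0) = {p_zero_hat:.4f}")
```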

To compare the performance of the proposed model with that of existing models, we use the following MAE and RMSE [25,34]:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} | y_i - \hat{y}_i |, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}} \tag{10}$$

In Equation (10), $y_i$ and $\hat{y}_i$ are the actual and predicted values, respectively. We describe our proposed model for zero-inflated data analysis, called ZI-SGC, in Algorithm 1.

Algorithm 1 Zero-Inflated Simple Graph Convolution (ZI-SGC) Model

Input:
    $X \in \mathbb{R}^{n \times p}$ ($n$: number of nodes, $p$: number of features)
    $y \in \mathbb{N}_0^{n}$
    k: k-nearest neighbor (k-NN) parameter
    K: number of propagation steps
    Normalization type: symmetric or random walk
    Count regression head: ZIP or ZINB
Output:
    $\hat{\theta}$: fitted parameters
    $\hat{y}$: predictions
    $\hat{P}(Y = 0)$: probability of zero occurring
    MAE: mean absolute error
    RMSE: root mean squared error
Procedure:
1. Preprocessing
    (1-1) Applying standardization to X depending on data scale
2. Graph Construction
    (2-1) Building k-NN graph
    (2-2) Obtaining adjacency matrix A
3. Self-Loop
    (3-1) Computing $\tilde{A} = A + I$
4. Normalization
    (4-1) Computing $S = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$
5. Convolution of SGC
    (5-1) Computing $X^{(K)} = S^{K} X$
6. Fitting Zero-Inflated Head
    (6-1) Modeling ZIP and ZINB by maximizing log-likelihood
    (6-2) Minimizing objective function on parameter $\theta$
7. Inference and Evaluation
    (7-1) Estimating $\hat{y}$ and $\hat{P}(Y = 0)$
    (7-2) Computing MAE and RMSE
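Steps 2 to 5 of Algorithm 1 can be sketched with NumPy as below. This is a minimal sketch under stated assumptions, not the author's code: the function names are hypothetical, the k-NN adjacency is symmetrized so that $S$ is symmetric, and negative cosine similarities are clipped to zero so that all degrees stay positive. Neither choice is specified in the text.

```python
import numpy as np


def knn_cosine_adjacency(X, k):
    """Weighted k-NN adjacency from cosine similarity (Equation (3))."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / np.maximum(norms, 1e-12)
    sim = X_unit @ X_unit.T
    np.fill_diagonal(sim, -np.inf)             # a node is not its own neighbor
    n = X.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(sim[i])[-k:]    # indices of the k most similar samples
        A[i, neighbors] = sim[i, neighbors]
    # Assumptions: symmetrize the graph and drop negative similarities.
    A = np.maximum(A, A.T)
    return np.clip(A, 0.0, None)


def sgc_propagate(X, A, K):
    """Return X^(K) = S^K X and S, with S from Equations (4) and (5)."""
    A_tilde = A + np.eye(A.shape[0])           # self-loops, Equation (4)
    degree = A_tilde.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    S = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]   # Equation (5)
    X_prop = X.astype(float)
    for _ in range(K):                         # K propagation steps
        X_prop = S @ X_prop
    return X_prop, S


# Small synthetic example (random features, assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
A = knn_cosine_adjacency(X, k=2)
X_prop, S = sgc_propagate(X, A, K=2)
```

Applying `S` repeatedly instead of forming the matrix power keeps the cost proportional to `K` matrix products with the feature matrix, which matters mainly when the graph is stored in sparse form.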

Algorithm 1 covers the entire process of zero-inflated data analysis, from the preprocessing of the original data to the final parameter inference and performance evaluation. Next, the proposed method is summarized as a procedure in Figure 1.

Figure 1 summarizes the zero-inflated data analysis procedure proposed in this paper. First, the input is zero-inflated count data with excessive zeros, consisting of a feature matrix X for each observation and a target count y. Next, a graph is constructed using similarity relationships between observations and expressed as an adjacency matrix A. Self-loops and normalization are then added to ensure stable graph propagation. SGC is applied to the prepared graph, transforming the features of each observation into a propagated representation that reflects neighborhood information. Finally, a zero-inflated regression head is trained using the transformed features as input. In this study, ZIP+SGC and ZINB+SGC are used as representative models. Model performance is evaluated using MAE and RMSE to determine whether combining SGC-based graph convolution improves prediction performance compared to the baseline ZI model.
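The two evaluation metrics in Equation (10) can be computed directly, as in the following standard-library sketch. The toy vectors are assumed values chosen to show how a single large error affects RMSE more than MAE.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, Equation (10)."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, Equation (10)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


# Toy zero-heavy counts and predictions (assumed for illustration).
y_true = [0, 0, 3, 10]
y_pred = [1.0, 0.0, 2.0, 6.0]

mae_value = mae(y_true, y_pred)    # absolute errors 1, 0, 1, 4
rmse_value = rmse(y_true, y_pred)  # squared errors 1, 0, 1, 16
print(f"MAE = {mae_value:.3f}, RMSE = {rmse_value:.3f}")
```

In this toy case the single error of 4 dominates the RMSE, which is consistent with reading RMSE as the metric more sensitive to large count errors.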

4. Experiments and Results

To show the improved performance of our proposed models, we use simulation and real datasets. The simulation data used in this experiment were generated to evaluate the performance of SGC-based graph propagation in a graph-structured zero-inflated count regression setting. The target variable $y$ is generated from a latent signal smoothed over the graph neighborhood, and the features are generated by mixing the signal with noise. We describe the process of generating the simulation data used in this experiment in Algorithm 2.

Algorithm 2 Simulation Data Generation for Zero-Inflated Data Analysis

Input:
    $n$: number of nodes
    $p$: number of features
    $C$: number of communities
    $p_w$: within-community edge probability
    $p_b$: between-community edge probability
    $K_{gen}$: number of propagation steps
    $z_0$: zero ratio
    $\sigma$: noise scale
    $a$: clipping constant
Output:
    $X$: feature matrix, $X \in \mathbb{R}^{n \times p}$
    $y$: response, $y \in \mathbb{N}_0^{n}$
    $S$: propagation operator
Procedure:
1. Stochastic Block Model
    (1-1) Assigning community labels, $c_i \in \{1, 2, \dots, C\}$
    (1-2) Constructing adjacency $A$: for each node $i$, sample candidate neighbors and connect $i \leftrightarrow j$ with probability $p_w$ if $c_i = c_j$, and $p_b$ otherwise
2. Normalization
    (2-1) Adding $\tilde{A} = A + I$, $\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$ ($I$: identity matrix, $\tilde{D}$: degree matrix of $\tilde{A}$)
    (2-2) Defining symmetric normalization $S = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$
3. Latent Signal
    (3-1) Sampling $\mu \sim N(0, I)$
    (3-2) Propagating sample $\mu^{(K_{gen})} = S^{K_{gen}} \mu$
    (3-3) Standardizing $\mu^{(K_{gen})}$ to zero mean and unit variance
4. Feature Construction
    For $j = 1, 2, \dots, p$:
    If $j \le 3$, then $X_j = \mu + \varepsilon_j$, $\varepsilon_j \sim N(0, \sigma^{2} I)$
    Else $X_j \sim N(0, I)$
5. Count Mean
    Defining linear predictor $\eta_i = \beta_0 + \beta_1 \mu_i^{(K_{gen})}$, $\eta_i = \mathrm{clip}(\eta_i, [-a, a])$, $\mu_i = \exp(\eta_i)$
6. Zero-Inflated Sampling
    (6-1) Setting $r = 1 / \alpha$ ($\alpha$: dispersion of the negative binomial count component)
    (6-2) Sampling $\lambda_i \sim \mathrm{Gamma}(r, \mu_i / r)$, then $y_i^{*} \sim \mathrm{Poisson}(\lambda_i)$
    (6-3) Yielding the negative binomial (gamma-Poisson) marginal for $y_i^{*}$, before structural zeros are injected
7. Zero-Inflated Calibration
    (7-1) Computing zero probability $P(0 \mid \mu_i, r)$ and its mean $\bar{p}_0$
    (7-2) Setting structural zero probability $\pi = \frac{z_0 - \bar{p}_0}{1 - \bar{p}_0}$, where $\pi = \min(\max(\pi, 0), 0.95)$
8. Injection of structural zeros
    For each $i$: with probability $\pi$, setting $y_i = 0$; otherwise setting $y_i = y_i^{*}$
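Steps 6 to 8 of Algorithm 2 can be sketched with NumPy as follows. The function name, the example count means, and the dispersion value are assumptions for illustration. The closed form used for the zero probability in step (7-1), $(r / (r + \mu_i))^{r}$, is the standard zero probability of a gamma-Poisson mixture and is not written out in the algorithm itself.

```python
import numpy as np


def sample_zero_inflated_counts(mu, alpha, z0, rng):
    """Sketch of steps 6-8 of Algorithm 2 for a vector of count means mu."""
    r = 1.0 / alpha                              # step (6-1)
    lam = rng.gamma(shape=r, scale=mu / r)       # step (6-2): lambda_i ~ Gamma(r, mu_i / r)
    y_star = rng.poisson(lam)                    # counts before structural zeros

    # Step (7-1): zero probability of the gamma-Poisson marginal (standard identity).
    p0 = (r / (r + mu)) ** r
    p0_bar = p0.mean()

    # Step (7-2): calibrate the structural-zero probability to the target zero ratio.
    pi = (z0 - p0_bar) / (1.0 - p0_bar)
    pi = float(min(max(pi, 0.0), 0.95))

    # Step 8: inject structural zeros with probability pi.
    structural = rng.random(mu.shape[0]) < pi
    y = np.where(structural, 0, y_star)
    return y, pi


# Illustrative count means (assumed; the paper derives them from the latent signal).
rng = np.random.default_rng(42)
mu = np.exp(2.0 + 0.5 * rng.normal(size=20000))
y, pi = sample_zero_inflated_counts(mu, alpha=0.5, z0=0.3, rng=rng)
observed_zero_ratio = float((y == 0).mean())
```

When the clipping in step (7-2) is not active, the overall zero probability is $\pi + (1 - \pi)\bar{p}_0 = z_0$, so the observed zero ratio should land close to the target.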

Based on Algorithm 2, we generated experimental data with zero ratios of 0.2 (20%) and 0.3 (30%). To verify the quality of the generated simulation data, we performed graph visualization on each data set. Figure 2 shows the results of the graph visualization according to the zero ratio.

In Figure 2, we visualize the simulation data generated with the zero ratio of the target variable set to 0.2 and 0.3, respectively, on a graph based on the same Stochastic Block Model (SBM). Figure 2 intentionally shows the same graph topology for zero ratios 0.2 and 0.3 because the graph is constructed from the covariates X (sample similarity) and is treated as an underlying relational structure that does not change across the two conditions. The experimental factor is the zero-inflation level of the target count Y, that is, the probability of structural zeros in the zero-generation mechanism. Holding the graph fixed while varying only the target’s zero-inflation level provides a controlled comparison: it isolates the effect of increasing zero inflation without conflating it with changes in graph connectivity. Therefore, the role of Figure 2 is not to show topology differences across conditions, but to confirm that the same relational structure is used and to contextualize performance comparisons under different zero-inflation regimes. In this visualization, node color represents the community label, and a spring layout is used so that the graph’s connectivity and community structure remain visible. In both settings, the communities are mixed but internally more densely connected, which indicates that the community structure designed in the data generation process was stably formed. This graph provides a structure that enables information diffusion through neighborhoods, which serves as the basis for meaningful SGC propagation. We therefore evaluated the comparative models using the generated simulation data. Next, we present the degree distribution according to the zero ratio in Figure 3.

Figure 3 presents the degree distributions of the simulation graphs generated when the zero ratio of the target variable is set to 0.2 and 0.3, respectively. In this experiment, the zero ratio is achieved by adjusting the structural zero probability $\pi$ of the response variable $Y$, and is not a parameter that changes the graph topology itself. Therefore, as observed in Figure 3, the sparse network characteristics, in which the degree is mainly concentrated between 0 and 3 and relatively large degrees appear only for some nodes, are consistently maintained in both settings. This indicates that the SBM-based connection method in the data generation process operates stably, and that the graph structure, such as connection density and neighborhood size, does not change significantly when the zero ratio changes. In particular, the similarity of the degree distributions suggests that the observed differences in subsequent performance comparisons are not simply due to a denser graph favoring propagation, but rather to the model facing a more challenging prediction problem as the level of zero inflation increases. In other words, the experiments with zero ratios of 0.2 and 0.3 systematically vary zero inflation while keeping the graph structure nearly fixed. This design allows the performance differences between the proposed model and existing models to be isolated more clearly and attributed to SGC propagation.

Furthermore, the fact that the average node degree in the sparse graph is not large suggests that the SGC propagation $S^{K}$ operates by gradually spreading information within a limited neighborhood, rather than mixing a large number of neighbors at once. This aligns with the design of our simulation, which generates latent signals in a smoothed form through neighborhoods while reducing the risk of over-smoothing due to excessive propagation. Consequently, the degree distribution in Figure 3 serves as a structural diagnostic, confirming that the experimental graph is sparse and that changes in the zero ratio do not disrupt the graph topology. Since the results in Figure 2 and Figure 3 confirmed that the simulation data were appropriately generated for the experiments in this paper, we conducted experiments comparing our proposed ZIP+SGC and ZINB+SGC models against the existing representative zero-inflated models, ZIP and ZINB. Table 2 shows the results comparing MAE and RMSE when the zero ratio is 0.2.

Table 2 compares the prediction performance of ZIP, ZINB, and our proposed models, ZIP+SGC and ZINB+SGC, at a zero ratio of 0.2. SGC provides a representation that reflects neighborhood information by transforming features into the form $X^{(K)} = S^{K} X$, using the propagation operator $S$ defined on the graph between observations. This graph smoothing can contribute to recovering meaningful signals from noisy features in zero-inflated sparse count data. Our experimental results show that ZIP+SGC improves performance compared to the original ZIP. Specifically, as shown in Table 2, the MAE decreased from 30.65 for ZIP to 27.85 for ZIP+SGC, and the RMSE decreased from 106.34 to 102.97. This suggests that features generated through SGC-based graph propagation can reduce prediction errors compared to the traditional ZIP model.

The same trend is observed in the models based on the negative binomial distribution. While the MAE of the conventional ZINB was 34.09, the proposed ZINB+SGC reduced it to 29.75, and the RMSE improved from 120.01 to 115.77. In other words, the graph-based representation in this study improved performance not only for ZIP but also for ZINB, which accounts for overdispersion. This suggests that even when zero inflation and overdispersion are both present, the neighborhood signal derived from the graph structure contributes to improved prediction accuracy. Meanwhile, the standard deviation (SD) reflects variability across repeated runs. In Table 2, while the SDs of models incorporating SGC are similar to or slightly larger than those of the baselines, the consistent decrease in mean error supports the notion that graph propagation provides a more advantageous representation on average. In summary, at a zero ratio of 0.2, ZIP+SGC and ZINB+SGC improved both MAE and RMSE compared to their respective baselines, indicating that the SGC-based extension proposed in this study is an effective approach to improving the performance of zero-inflated count models. Next, Table 3 shows the performance comparison results when the zero ratio is 0.3.

Table 3 compares the prediction performance of the existing zero-inflated models and the proposed models at zero ratio = 0.3, in the same format as Table 2. Consistent with the trend observed at zero ratio = 0.2 in Table 2, these results show that models incorporating SGC-based propagation achieve lower prediction errors than their respective baselines, ZIP and ZINB. Specifically, ZIP+SGC reduced the MAE from 28.80 to 27.43 and the RMSE from 107.70 to 105.83 relative to the baseline ZIP. This suggests that, even with the same ZIP head, a feature representation that reflects neighborhood information through graph propagation can contribute to improved prediction accuracy. Furthermore, in the negative binomial-based comparison, the baseline ZINB had an MAE of 29.74 and an RMSE of 112.29, whereas the proposed ZINB+SGC achieved an MAE of 27.41 and an RMSE of 107.17. In particular, the RMSE improvement of ZINB+SGC was relatively large, indicating that SGC also helped mitigate large errors in count data.

The SD represents variability across replicated experiments, and Table 3 shows that models combining SGC exhibit SDs similar to those of the baselines, with slightly smaller MAE SDs and slightly larger RMSE SDs. For example, the MAE SD of ZIP is 7.70, while that of ZIP+SGC is 7.41. This indicates that the performance improvement is not limited to a specific replicate but rather remains relatively stable across replicated experiments. In summary, even when the zero ratio increased to 0.3, ZIP+SGC outperformed ZIP, and the same was true for ZINB+SGC and ZINB. This confirms that the graph-enhanced representation proposed in this study provides consistent performance improvements in a zero-inflated count regression environment. In addition, the SGC-based models maintain superior performance compared to the baselines even when the zero ratio increases from 0.2 to 0.3, which increases prediction difficulty, suggesting the robust applicability of the proposed method. Next, we present the MAE plot in Figure 4 to check the performance difference between the compared models according to the zero ratio.
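The mean and SD columns in the tables summarize per-run errors across replicates. A minimal sketch of that aggregation is given below; it assumes the sample standard deviation (denominator n − 1), since the exact SD convention is not stated, and the two toy runs are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math
import statistics

def mae(y_true, y_pred):
    """Mean absolute error of one run."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error of one run; penalizes large misses more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def summarize(runs):
    """runs: list of (y_true, y_pred) pairs, one pair per replicate."""
    maes = [mae(t, p) for t, p in runs]
    rmses = [rmse(t, p) for t, p in runs]
    return {
        "mae_mean": statistics.mean(maes),
        "mae_sd": statistics.stdev(maes),    # sample SD (assumption)
        "rmse_mean": statistics.mean(rmses),
        "rmse_sd": statistics.stdev(rmses),  # sample SD (assumption)
    }

# Two hypothetical replicates with zero-heavy targets
runs = [
    ([0, 0, 3, 10], [0.5, 0.0, 2.0, 7.0]),
    ([0, 2, 0, 40], [1.0, 1.0, 0.0, 20.0]),
]
summary = summarize(runs)
```

The second toy run contains one large miss on a rare high count, which inflates its RMSE far more than its MAE; the same mechanism explains why the RMSE values in the tables are much larger than the MAE values.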

Figure 4 compares the MAE of the four models (ZIP, ZINB, ZIP+SGC, and ZINB+SGC) as mean ± SD at zero ratios of 0.2 and 0.3. Overall, models incorporating SGC achieve lower MAE values than their respective baselines under both conditions, suggesting that representation via graph propagation consistently improves performance in zero-inflated count prediction. First, at zero ratio = 0.2, ZIP exhibits a lower MAE than ZINB among the baseline models (30.65 versus 34.09 in Table 2). The variants incorporating SGC (ZIP+SGC, ZINB+SGC) achieve additional error reduction for both baselines. Notably, ZIP+SGC exhibits the lowest MAE among the four models (27.85), while ZINB+SGC shows the larger reduction relative to its own baseline (from 34.09 to 29.75), indicating that graph-based smoothing also benefits the ZINB head, which accounts for overdispersion. This is consistent with the quantitative comparison results in Table 2.

Even when the zero ratio is increased to 0.3, the performance rankings are generally maintained, and the superiority of the model based on SGC persists. Specifically, ZIP+SGC exhibits lower MAE than ZIP, and ZINB+SGC exhibits lower MAE than ZINB, confirming the robustness of the graph-enhanced representation even when zero inflation increases prediction difficulty. Furthermore, the error bars of SD in Figure 4 show that the SGC-based model tends to lower the average error while maintaining a similar level of variability to the baseline. This supports the notion that the performance improvement is not limited to a specific iteration but is relatively stable across repeated runs. In summary, Figure 4 visually demonstrates that the proposed models incorporating SGC consistently achieve lower MAE than the baseline even when the zero ratio increases from 0.2 to 0.3. This experimentally supports the core assertion of this study that feature representations incorporating neighborhood information through graph propagation can improve prediction accuracy on sparse and zero-inflated count data. Also, we show the RMSE plot in Figure 5 to compare the performance between existing zero-inflated models and our proposed models according to the zero ratio.

Similar to Figure 4, Figure 5 compares the RMSE of the four models as mean ± SD at zero ratios of 0.2 and 0.3. The overall pattern is similar to the MAE results in Figure 4, and it is visually confirmed that the proposed models incorporating SGC achieve lower RMSE values than the respective baseline models. At zero ratio = 0.2, the baseline models ZIP and ZINB exhibit relatively large RMSEs, while ZIP+SGC and ZINB+SGC show reduced RMSEs. This suggests that feature representations that incorporate neighborhood information through graph propagation contribute to reducing prediction errors. In Tables 2 and 3, ZIP+SGC attains the lowest mean RMSE at both zero ratios (102.97 and 105.83), while ZINB+SGC achieves the larger RMSE reduction relative to its baseline (from 120.01 to 115.77 and from 112.29 to 107.17), suggesting that combining the ZINB head, which accounts for overdispersion, with SGC-based smoothing can help mitigate large errors.

Increasing the zero ratio to 0.3 strengthens zero inflation, making the prediction problem more difficult. However, the SGC-based model maintains superior RMSE compared to the baseline models of ZIP and ZINB. This demonstrates that the proposed method is relatively robust to changes in zero inflation levels. Furthermore, the error bars (SD) in Figure 5 represent variability across repeated runs. The SGC-coupled model exhibits a tendency to lower the average RMSE while maintaining variability similar to the baseline. This supports the idea that the performance improvement is not limited to a specific iteration but is reproducible on average. Consistent with the MAE results in Figure 4, Figure 5 demonstrates that the graph-enhanced representation improves performance in zero-inflated count regression not only in terms of average error but also in terms of RMSE, which is sensitive to large errors. This experimentally supports the notion that our proposed method can simultaneously improve prediction stability and accuracy in sparse and zero-inflated data.

To address the concern that simulation results alone may not generalize, we also conducted an experiment using real data. The dataset is a document–keyword matrix constructed from patent documents in the domain of synthetic data technology. The proposed framework is designed to solve a practical prediction problem that frequently occurs in text mining and patent analytics: modeling and predicting sparse zero-inflated keyword frequencies. In the matrix, each entry is the frequency of a keyword in a document, and the matrix is typically dominated by zeros because each document contains only a small subset of the vocabulary. This produces a challenging zero-inflated count prediction task, where the goal is to predict a target keyword’s count from other sparse keyword signals while exploiting relational structure among documents. The matrix contains 2929 documents and 14 keyword variables. The target (Y) is the frequency of the keyword "synthetic", and the remaining 13 keywords are used as predictors (X): algorithm, analysis, computing, data, deep, distribution, generating, intelligent, learn, machine, network, neural, and statistics. This real dataset exhibits strong sparsity: the overall proportion of zeros in X is approximately 72.4%, and the zero ratio of the target Y is approximately 39.6%. Table 4 shows the performance comparison between the models on the real data.
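The two sparsity figures quoted for the real dataset are simple proportions of zero entries. The sketch below shows how such ratios can be computed for a document–keyword matrix; the five-document matrix is a hypothetical toy example, not the patent data, and the function name `zero_ratio` is ours.

```python
def zero_ratio(values):
    """Proportion of entries equal to zero."""
    values = list(values)
    return sum(1 for v in values if v == 0) / len(values)

# Hypothetical toy document-keyword counts (one row per document).
# Column 0 is the target keyword count; columns 1-3 are predictor keywords.
rows = [
    [0, 2, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 0, 0],
    [4, 1, 0, 0],
    [0, 0, 5, 0],
]
y = [r[0] for r in rows]
X = [r[1:] for r in rows]

zero_ratio_y = zero_ratio(y)                        # zero ratio of the target
zero_ratio_X = zero_ratio(v for r in X for v in r)  # overall zero share in X
```

Applied to the actual matrix, the same two quantities would correspond to the approximately 39.6% (target) and 72.4% (predictors) reported above.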

Although the performance improvement on the real dataset is smaller than in the simulation study, which is expected because real-world relational structure and noise are more complex, the proposed models still show consistent gains. In our experiments, the proposed models achieve the best average MAE among all compared methods (1.2858 for ZINB+SGC and 1.2868 for ZIP+SGC in Table 4). Similarly, RMSE is improved compared to the corresponding baselines. These results suggest that incorporating the lightweight graph convolution SGC can provide measurable benefits for real-world sparse zero-inflated count prediction, even when the magnitude of the improvement is modest.

5. Discussion

The proposed framework targets a common real-world setting of zero-inflated count data, in which zeros dominate because events or terms occur in only a small fraction of observations. This phenomenon is prevalent across many practical domains, including text analytics, business intelligence, and related applications. In such scenarios, a key implication is that prediction robustness can be improved by exploiting relational structure via lightweight graph propagation. As a result, the framework can support downstream decision-making tasks such as identifying influential features, prioritizing items for review, and detecting rare yet important patterns.

Our results indicate that combining SGC with probabilistic zero-inflated heads can yield measurable performance gains while preserving interpretable model components. Compared with purely neural approaches, ZIP and ZINB explicitly separate the zero-generation mechanism from the count-generation mechanism, which facilitates interpreting why zeros occur and how counts vary conditional on being non-zero. Meanwhile, SGC-based convolution provides a computationally efficient way to share information across related observations, which is particularly beneficial when individual samples contain limited signal due to sparsity. Importantly, SGC avoids the complexity of deeper GCN architectures by using a simplified propagation operator, which can be advantageous for stable training and reproducible implementation.
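The separation between the zero-generation and count-generation mechanisms can be made concrete with the mixture form of the likelihood. The sketch below is an illustration under stated assumptions: it uses the standard ZIP mixture, a negative binomial parameterized by mean $μ$ and dispersion $r$ (variance $μ + μ^{2}/r$, one common convention that may differ from the one used in this study), and a single shared $μ$ and $π$ for all observations rather than the per-observation $μ_{i}$ and $π_{i}$ produced by regression. It requires $μ > 0$, $r > 0$, and $0 \leq π < 1$. All function names are ours.

```python
import math

def poisson_log_pmf(y, mu):
    """log P(Y = y) for a Poisson count with mean mu."""
    return -mu + y * math.log(mu) - math.lgamma(y + 1)

def nb_log_pmf(y, mu, r):
    """log P(Y = y) for a negative binomial with mean mu and dispersion r."""
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

def zero_inflated_log_pmf(y, pi, count_log_pmf):
    """Mix a point mass at zero (weight pi) with a count distribution."""
    if y == 0:
        # A zero is either structural or a sampling zero of the count part.
        return math.log(pi + (1.0 - pi) * math.exp(count_log_pmf(0)))
    # A positive count can only come from the count part.
    return math.log(1.0 - pi) + count_log_pmf(y)

def zip_log_pmf(y, mu, pi):
    return zero_inflated_log_pmf(y, pi, lambda k: poisson_log_pmf(k, mu))

def zinb_log_pmf(y, mu, pi, r):
    return zero_inflated_log_pmf(y, pi, lambda k: nb_log_pmf(k, mu, r))

def negative_log_likelihood(ys, log_pmf, **params):
    """Objective minimized by maximum likelihood learning."""
    return -sum(log_pmf(y, **params) for y in ys)

counts = [0, 0, 0, 2, 0, 5, 1, 0]  # hypothetical zero-heavy sample
nll_zip = negative_log_likelihood(counts, zip_log_pmf, mu=2.0, pi=0.4)
nll_zinb = negative_log_likelihood(counts, zinb_log_pmf, mu=2.0, pi=0.4, r=1.5)
```

Because $π$ enters only the zero branch and the $(1 - π)$ weight, a fitted $π$ can be read directly as the probability of a structural zero, which is the interpretability property discussed above.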

While simulation experiments often demonstrate clearer gains under controlled data-generating mechanisms, real-world datasets typically introduce additional challenges, including heterogeneous noise sources, imperfect similarity measures, and complex dependencies that may not be fully captured by a simple k-NN graph. Consequently, performance improvements in real applications may be smaller yet still meaningful, especially when the gains are consistent across repeated train–test splits. This observation is consistent with broader findings in graph-based learning, where the effectiveness of convolution depends strongly on the quality of graph construction and its alignment with the prediction task.
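Since the quality of the k-NN graph is central to this argument, a minimal sketch of one possible construction is shown below. Cosine similarity, the rule of dropping zero-similarity neighbors, and symmetrization by taking the larger weight are all assumptions made for illustration; they are not necessarily the choices used in our experiments, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; all-zero rows (common in sparse counts) get 0."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def knn_adjacency(X, k):
    """Weighted, symmetric k-NN adjacency matrix built from feature rows X."""
    n = len(X)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = [(cosine(X[i], X[j]), j) for j in range(n) if j != i]
        # Highest similarity first; ties broken by node index
        sims.sort(key=lambda s: (-s[0], s[1]))
        for w, j in sims[:k]:
            if w > 0.0:  # skip neighbors that share no signal
                A[i][j] = max(A[i][j], w)
                A[j][i] = max(A[j][i], w)  # keep the graph undirected
    return A

# Four toy observations forming two obvious groups
X = [[1, 0], [2, 0], [0, 1], [0, 3]]
A = knn_adjacency(X, k=1)
```

An observation whose feature row is entirely zero ends up isolated under this rule, which is one concrete way in which an imperfect similarity measure can limit the benefit of propagation on very sparse data.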

This study has several limitations. First, an inaccurately constructed graph can propagate noise and degrade performance. Second, over-smoothing may occur when the propagation depth (K) is too large, leading to less discriminative representations. Third, likelihood-based estimation for ZIP/ZINB can encounter convergence issues in extremely sparse or high-dimensional settings, which may require careful regularization or simplified inflation components. Finally, although the proposed framework is computationally lightweight relative to deeper GNNs, scaling to very large graphs may still require approximate nearest-neighbor search and optimized sparse linear algebra routines.
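The over-smoothing limitation can be illustrated on a small example. The sketch below propagates a single feature over a six-node ring graph and records how far apart the node values remain as $K$ grows. It assumes the symmetric normalization $S = D^{-1/2} (A + I) D^{-1/2}$; the ring graph, the signal, and the `spread` diagnostic (maximum minus minimum node value) are hypothetical choices made only for illustration.

```python
import math

def propagate(adj, x, K):
    """Apply S = D^{-1/2} (A + I) D^{-1/2} to one feature column K times."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    out = list(x)
    for _ in range(K):
        out = [
            sum(a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) * out[j]
                for j in range(n))
            for i in range(n)
        ]
    return out

# Ring graph with 6 nodes; only node 0 carries signal.
n_nodes = 6
ring = [[1.0 if abs(i - j) in (1, n_nodes - 1) else 0.0 for j in range(n_nodes)]
        for i in range(n_nodes)]
signal = [5.0, 0.0, 0.0, 0.0, 0.0, 0.0]

depths = (0, 1, 2, 4, 8, 16)
spread = {}
for K in depths:
    xk = propagate(ring, signal, K)
    spread[K] = max(xk) - min(xk)  # how distinguishable the nodes remain
    print(f"K={K:2d}  spread={spread[K]:.4f}")
```

On this regular graph the spread shrinks steadily toward zero, so for large $K$ every node carries nearly the same value and the feature no longer discriminates between observations, which is why $K$ is typically kept small and selected by validation.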

Overall, our findings complement established work on zero-inflated modeling and recent advances in graph representation learning. Classical ZIP and ZINB provide principled probabilistic tools for handling excess zeros, whereas modern graph methods show that neighborhood aggregation can improve predictive performance when relational information is informative. By integrating these directions, the proposed ZIP+SGC and ZINB+SGC frameworks offer a practical compromise between interpretability and graph-based learning capability. The results support the broader view that lightweight graph propagation can improve prediction in sparse count settings when the constructed graph reflects meaningful similarity, and they motivate further exploration of adaptive graph construction and more expressive graph neural architectures for zero-inflated data.

To further contextualize our findings within the related literature, prior studies on zero-inflated modeling, including ZIP and ZINB formulations and their extensions, have shown that accounting for excess zeros and over-dispersion is crucial for stable inference and prediction in sparse count settings. These findings collectively suggest that robust modeling of zero inflation benefits from both distributional assumptions and structure-aware information sharing, which motivates our ZI-SGC design. In addition, several recent works have emphasized that improving prediction under severe sparsity often requires either enhanced learning strategies for imbalanced or zero-inflated data, such as boosting-based improvements to zero-inflated models, or additional structural information that shares signal across related observations. For example, boosting-enhanced zero-inflated modeling has been reported as an effective strategy for imbalanced settings, while graph-based approaches have been increasingly explored for spatiotemporal or relational count prediction problems. Our results are consistent with these general findings: by incorporating lightweight graph propagation, SGC provides a practical way to leverage relational structure without sacrificing the interpretability of probabilistic ZIP and ZINB heads. At the same time, the modest improvement observed on the real-world document–keyword matrix is also aligned with the broader evidence that performance gains from graph convolution depend strongly on graph construction quality and the alignment between the graph and the prediction task.

Future work may include exploring alternative similarity measures, jointly learning the graph with the prediction model via graph structure learning, extending the framework to spatiotemporal graphs for event-count modeling, and incorporating uncertainty quantification to support decision-making in high-stakes domains.

6. Conclusions

This paper investigated ZI count prediction problems in which excessive zeros and severe sparsity make conventional count regression unstable and less accurate, especially when covariates are high-dimensional. To address these problems, we proposed a graph-based ZI learning framework, called ZI-SGC, which integrates SGC with probabilistic zero-inflated regression heads such as ZIP and ZINB. The proposed approach constructs an observation graph among samples, applies lightweight graph convolution through SGC propagation to obtain neighborhood-aware covariates, and then fits ZIP or ZINB heads via maximum likelihood learning. This design preserves the probabilistic interpretability of ZI models while exploiting relational structure for denoising and signal recovery.

Through controlled simulation experiments with varying zero ratios between 0.2 and 0.3, we consistently observed that incorporating SGC improves predictive accuracy. In particular, both ZIP+SGC and ZINB+SGC achieved lower prediction errors than their corresponding baselines of ZIP and ZINB by MAE and RMSE, and these gains remained stable across repeated runs. These results suggest that graph-enhanced representations can be beneficial in ZI count regression settings, even when the underlying ZI heads remain unchanged.

Despite these promising findings, several limitations should be acknowledged. First, the performance of ZI-SGC depends on graph construction choices such as the similarity measure, the k-NN parameter, and the normalization, as well as the propagation depth K, which may introduce sensitivity and require systematic tuning. Second, SGC employs a fixed linear propagation operator, which yields efficiency and stability; however, more expressive GNN architectures may capture richer relational patterns at the cost of additional complexity. Future work will extend ZI-SGC in several directions. We will explore principled strategies for graph construction and hyperparameter selection, such as robust similarity metrics, adaptive neighborhood size, and validation-based K selection, and investigate more expressive graph neural variants and hybrid heads for improved modeling flexibility. In addition, scalability to large graphs and inductive settings that require generalizing to unseen nodes will be considered to enhance practical utility in large-scale, sparse-count prediction problems.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to the limitations of the ongoing project.

Conflicts of Interest

The author declares no conflicts of interest.

Figure 1. Proposed modeling procedure.

Figure 2. Visualization of simulated datasets according to zero ratios.

Figure 3. Degree distribution of simulated datasets according to zero ratios.

Figure 4. MAE plot according to the zero ratio for model performance comparison.

Figure 5. RMSE plot according to the zero ratio for model performance comparison.

Table 1. Description of all variables used in the equations.

Symbol	Description	Dimension or Range
n	Number of observations	positive integer
p	Number of features	positive integer
X	Feature matrix	$\mathbb{R}^{n \times p}$
$x_{i}$	Feature vector of observation i	$\mathbb{R}^{p}$
y	Target count vector	$\mathbb{N}_{0}^{n}$
$y_{i}$	Target count of observation i	$\mathbb{N}_{0}$
${\hat{y}}_{i}$	Predicted count for observation i	$\mathbb{R}_{\geq 0}$
G=(V,E)	Graph with node set V and edge set E	graph
A	Adjacency matrix of G (possibly weighted)	$\mathbb{R}^{n \times n}$
$w_{i j}$	Edge weight between nodes i and j	$\mathbb{R}_{\geq 0}$
sim(·,·)	Similarity function	$\mathbb{R}^{p} \times \mathbb{R}^{p} \to \mathbb{R}$
k	Number of nearest neighbors in the k-NN graph	positive integer
$N_{k} (i)$	Set of k nearest neighbors of node i	subset of {1,…,n}
I	Identity matrix	$\mathbb{R}^{n \times n}$
$\tilde{A} = A + I$	Adjacency matrix with self-loops	$\mathbb{R}^{n \times n}$
D	Degree matrix of $\tilde{A}$	diagonal $\mathbb{R}^{n \times n}$
S	Normalized propagation operator	$\mathbb{R}^{n \times n}$
K	Number of SGC propagation steps	nonnegative integer
$X^{(K)}$	SGC-convolved features after K steps	$\mathbb{R}^{n \times p}$
$\mu_{i}$	Mean of the count component	$\mathbb{R}_{> 0}$
$\pi_{i}$	Structural zero-inflation probability	[0,1]
$\sigma (\cdot)$	Sigmoid function	$\mathbb{R} \to (0,1)$
$\beta, \gamma$	Regression coefficients of count and inflation parts	vectors
r	Dispersion parameter in ZINB	$\mathbb{R}_{> 0}$

Table 2. Performance comparison between comparative models with zero ratio = 0.2.

Model	MAE Mean	MAE SD	RMSE Mean	RMSE SD
ZIP	30.65	10.25	106.34	126.48
ZIP+SGC	27.85	11.28	102.97	127.32
ZINB	34.09	13.44	120.01	128.54
ZINB+SGC	29.75	13.73	115.77	132.67

Table 3. Performance comparison between comparative models with zero ratio = 0.3.

Model	MAE Mean	MAE SD	RMSE Mean	RMSE SD
ZIP	28.80	7.70	107.70	80.31
ZIP+SGC	27.43	7.41	105.83	80.60
ZINB	29.74	8.04	112.29	78.25
ZINB+SGC	27.41	7.62	107.17	80.64

Table 4. Performance comparison between comparative models with real data.

Model	MAE Mean	MAE SD	RMSE Mean	RMSE SD
ZIP	1.3094	0.0503	1.7954	0.0849
ZIP+SGC	1.2868	0.0501	1.7937	0.0840
ZINB	1.3092	0.0506	1.8085	0.0826
ZINB+SGC	1.2858	0.0458	1.7971	0.0732

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
