Generalized Confidence Intervals for Zero-Inflated Pareto Distribution

Abstract: This paper considers interval estimation for the mean of a Pareto distribution with excess zeros. Three approaches for interval estimation are proposed based on fiducial generalized pivotal quantities (FGPQs). Simulation studies are performed to assess the performance of the proposed methods, using three measures to compare the competing approaches. The advantages and disadvantages of each method are discussed. The methods are illustrated using a real phone call dataset.


Introduction
The Pareto distribution was introduced by Pareto in 1897 [1]. Since then, it has been widely employed in applied fields such as meteorology, economics, and politics. Confidence intervals (CIs) or approximate CIs for the parameters of a Pareto distribution have been discussed by many authors: Asrabadi derived the uniformly minimum variance unbiased estimator (UMVUE) for the 100pth percentile of a Pareto distribution [2], Chen discussed the exact joint confidence region for the parameters of a Pareto distribution [3], and Wu discussed the interval estimation of a Pareto distribution based on a type II censored sample [4].
In practice, the situation becomes more complicated when the data contain a certain proportion of zeros, as zero values are often neglected by default to avoid complicated calculations. For example, income data in economics and network science data often contain an excess of zeros [5]. In such cases, the data, including the zero values, follow a zero-inflated Pareto distribution. Previous work on zero-inflated models, which focused on the Poisson and lognormal distributions, suggests that confidence intervals for a zero-inflated Pareto distribution can also be constructed [6][7][8]. Hasan and Krishnamoorthy proposed confidence intervals for the mean and a percentile based on zero-inflated lognormal data [9]. Waguespack et al. developed confidence intervals for the mean of a zero-inflated Poisson distribution [10]. However, interval estimation for the parameters of a zero-inflated Pareto distribution has not yet been investigated in depth.
To the best of our knowledge, there are no published methods for obtaining confidence intervals for the mean of a zero-inflated Pareto distribution. In this article, we derive three different approaches for constructing fiducial generalized confidence intervals (FGCIs). Since the generalized confidence interval was introduced by Weerahandi [11], it has been widely applied in practical situations where standard solutions do not exist. More details about FGCIs can be found in [12][13][14][15].
This article is organized as follows. In Section 2, we develop three different approaches to construct FGCIs for the mean of the zero-inflated Pareto distribution. In Section 3, we conduct Monte Carlo simulation studies to evaluate the coverage probabilities and other measures of the proposed FGCIs. In Section 4, we provide an example with real data, followed by concluding remarks in Section 5.

Model
Assume that the population of interest contains both zeros and positive observations, such that the probability of a zero response is δ, where 0 < δ < 1, and that the non-zero observations follow a Pareto distribution. Let $X_1, X_2, \ldots, X_n$ be a sample from the population, and let $n_1$ and $n_0$ be the numbers of non-zero and zero observations, respectively. Without loss of generality, we further assume that the non-zero observations come first: $X_i > 0$ for $i = 1, \ldots, n_1$, where $X_i$ follows a Pareto distribution with shape parameter α and scale parameter c, and $X_i = 0$ for $i = n_1 + 1, \ldots, n$. Then $n_0$ follows a binomial distribution with parameters n and δ, and the probability density function of the non-zero observations is
$$f(x; \alpha, c) = \frac{\alpha c^{\alpha}}{x^{\alpha + 1}}, \qquad x \ge c, \quad \alpha > 0, \quad c > 0.$$
The mean of the population, which is finite only for α > 1, is
$$\theta = (1 - \delta)\,\frac{\alpha c}{\alpha - 1}.$$
For convenience, let $\theta_1 = 1 - \delta$, $\theta_2 = \alpha/(\alpha - 1)$, and $\theta_3 = c$; then θ can be written as
$$\theta = \theta_1 \theta_2 \theta_3.$$
Based on the properties of the exponential distribution, we obtain
$$2 n_1 \alpha (S - \ln c) \sim \chi^2_{2}, \qquad 2 n_1 \alpha T \sim \chi^2_{2(n_1 - 1)}, \tag{3}$$
where $S = \ln X_{(1)}$, $X_{(1)}$ is the smallest non-zero observation, $T = \left(\sum_{i=1}^{n_1} \ln X_i - n_1 \ln X_{(1)}\right)/n_1$, and S and T are independent.
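The quantities defined in this section can be illustrated with a short simulation. The following Python sketch is only an illustration under stated assumptions, not the code used in this paper: the function names `simulate_zip`, `summarize`, and `zip_mean` and all parameter values are hypothetical. It draws a zero-inflated Pareto sample, computes $n_0$, $n_1$, s, and t as defined above, and evaluates the mean $(1 - \delta)\alpha c/(\alpha - 1)$, which is finite only when α > 1.

```python
import math
import random

def simulate_zip(n, delta, alpha, c, rng):
    """Draw n observations: zero with probability delta, else Pareto(alpha, c)."""
    # rng.paretovariate(alpha) has scale 1, so multiplying by c gives scale c.
    return [0.0 if rng.random() < delta else c * rng.paretovariate(alpha)
            for _ in range(n)]

def summarize(sample):
    """Return (n0, n1, s, t), with s = ln X_(1) and t = mean(ln X_i) - s."""
    logs = [math.log(x) for x in sample if x > 0]
    n1 = len(logs)
    if n1 < 2:
        raise ValueError("need at least two non-zero observations")
    n0 = len(sample) - n1
    s = min(logs)
    t = sum(logs) / n1 - s
    return n0, n1, s, t

def zip_mean(delta, alpha, c):
    """Population mean (1 - delta) * alpha * c / (alpha - 1)."""
    if alpha <= 1:
        raise ValueError("the mean is finite only for alpha > 1")
    return (1 - delta) * alpha * c / (alpha - 1)

# Hypothetical parameter values chosen only for illustration.
rng = random.Random(1)
sample = simulate_zip(n=200, delta=0.3, alpha=3.0, c=2.0, rng=rng)
n0, n1, s, t = summarize(sample)
print(n0, n1, round(s, 4), round(t, 4), zip_mean(0.3, 3.0, 2.0))
```

Since $\ln X_i - \ln c$ is exponential with rate α for the non-zero observations, s should lie slightly above ln c and t should be close to 1/α for moderately large $n_1$.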

Methods
In this section, we propose three different methods of constructing confidence intervals for the zero-inflated Pareto mean via fiducial inference. Fiducial inference was first introduced by Fisher [16]; more details can be found in [12,13,17].
Let X be a random vector with a distribution indexed by a parameter ξ ∈ Ξ, and let θ = π(ξ) be the parameter of interest. Assume that the data-generating mechanism for X can be expressed as
$$X = G(\xi, E), \tag{4}$$
where the distribution of E is completely known and does not depend on any parameters. Equation (4) can be understood as the equation that was used to generate the data, and it is termed the structural equation. Define the set-valued function
$$Q(x, e) = \{\xi : x = G(\xi, e)\};$$
the function Q(x, e) can be understood as an inverse of the function G. To avoid measurability problems, assume that Q(x, e) is a measurable function of e. Notice that the equation x = G(ξ, e) is satisfied by the ξ and e that were used to generate the observed data x. Hence the event {Q(x, E) ≠ ∅} has happened, and the distribution of E has to be conditioned on this event. The generalized fiducial distribution of ξ is then defined as the conditional distribution
$$Q(x, E^{\ast}) \mid \{Q(x, E^{\ast}) \neq \emptyset\}, \tag{6}$$
where $E^{\ast}$ is an independent copy of E. A random variable having the distribution described in (6) is called a generalized fiducial quantity (GFQ), and the corresponding quantity for θ is denoted $R_\theta(x)$. In the following, three methods of constructing fiducial generalized confidence intervals for the zero-inflated Pareto mean are proposed.

Generalized Fiducial Quantities for Parameters of Pareto Distribution
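Let $U \sim \chi^2_{2}$ and $V \sim \chi^2_{2(n_1 - 1)}$ be independent. From (3), we have
$$S \overset{d}{=} \ln c + \frac{U}{2 n_1 \alpha}, \qquad T \overset{d}{=} \frac{V}{2 n_1 \alpha}.$$
For the observed values s and t, the system of equations
$$s = \ln c + \frac{U}{2 n_1 \alpha}, \qquad t = \frac{V}{2 n_1 \alpha}$$
has the unique solution
$$\alpha = \frac{V}{2 n_1 t}, \qquad c = \exp\left(s - \frac{tU}{V}\right).$$
Hence, for given (s, t), the generalized pivotal quantities (GPQs) for α and c are
$$R_\alpha = \frac{V}{2 n_1 t}, \qquad R_c = \exp\left(s - \frac{tU}{V}\right),$$
respectively. Therefore, the GPQs of $\theta_2 = \alpha/(\alpha - 1)$ and $\theta_3 = c$ are
$$R_{\theta_2} = \frac{R_\alpha}{R_\alpha - 1} = \frac{V}{V - 2 n_1 t}, \qquad R_{\theta_3} = R_c = \exp\left(s - \frac{tU}{V}\right),$$
respectively.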
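A minimal Python sketch of these GPQs is given below. It assumes the pivots $2 n_1 \alpha (S - \ln c) \sim \chi^2_2$ and $2 n_1 \alpha T \sim \chi^2_{2(n_1 - 1)}$, so that solving for α and c gives $V/(2 n_1 t)$ and $\exp(s - tU/V)$. The function name `pareto_gpq_draws` and the input values are hypothetical, and the chi-square draws are obtained from the standard library through the gamma distribution.

```python
import math
import random
import statistics

def pareto_gpq_draws(n1, s, t, m, rng):
    """Return m draws of the GPQs (R_alpha, R_c) for observed n1, s, t."""
    draws = []
    for _ in range(m):
        # Chi-square with k degrees of freedom equals Gamma(shape=k/2, scale=2).
        u = rng.gammavariate(1.0, 2.0)        # U ~ chi-square(2)
        v = rng.gammavariate(n1 - 1.0, 2.0)   # V ~ chi-square(2(n1 - 1))
        r_alpha = v / (2.0 * n1 * t)
        r_c = math.exp(s - t * u / v)
        draws.append((r_alpha, r_c))
    return draws

# Hypothetical observed statistics, roughly what alpha = 3 and c = 2 would give.
rng = random.Random(7)
s_obs = math.log(2.0) + 0.001
draws = pareto_gpq_draws(n1=500, s=s_obs, t=1.0 / 3.0, m=2000, rng=rng)
alpha_median = statistics.median(a for a, _ in draws)
c_median = statistics.median(c for _, c in draws)
print(round(alpha_median, 3), round(c_median, 3))
```

Every draw of the GPQ for c is at most exp(s), which matches the fact that the scale parameter cannot exceed the smallest non-zero observation.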

Proposed FGCI 1
Because the binomial distribution is discrete, it is difficult to obtain an exact GPQ for δ. Tian [7] derived two GPQs for δ, $T_{1\delta}$ and $T_{2\delta}$, based on the beta distribution, which is the conjugate prior of the binomial distribution; here $n_0$ and $n_1$ are the numbers of zero and non-zero observations. Substituting each of these, together with the GPQs of $\theta_2$ and $\theta_3$, into $\theta = \theta_1 \theta_2 \theta_3$ with $\theta_1 = 1 - \delta$ gives the two GPQs for the population mean θ, denoted $T_1$ and $T_2$. Let $T_1(\gamma/2)$ and $T_2(1 - \gamma/2)$ be the 100(γ/2)th and 100(1 − γ/2)th percentiles of $T_1$ and $T_2$, respectively; a 100(1 − γ)% fiducial generalized confidence interval (FGCI) is given as
$$\left[\, T_1(\gamma/2),\; T_2(1 - \gamma/2) \,\right].$$
In practice, FGCI 1 is computed using the following Algorithm 1.
Algorithm 1: For a given sample, determine $n_0$ and $n_1$, and calculate the observed values s and t.

Proposed FGCI 2
Let $N_0 \sim B(n, \delta)$. The binomial distribution is asymptotically normal as the sample size n becomes large; therefore,
$$\frac{N_0 - n\delta}{\sqrt{n\delta(1 - \delta)}} \xrightarrow{d} N(0, 1).$$
Li [18] defined an alternative approximate GPQ for δ, denoted $T_{w\delta}$, whose distribution is free of any unknown parameter. Based on (9), the approximate GPQ for θ, denoted $T_w$, is obtained by using $T_{w\delta}$ as the GPQ for δ, and the approximate 100(1 − γ)% FGCI for the zero-inflated Pareto mean is
$$\left[\, T^{w}_{\gamma/2},\; T^{w}_{1 - \gamma/2} \,\right].$$
The procedure is implemented in the following Algorithm 2.
Algorithm 2: For a given sample, determine $n_0$ and $n_1$, and calculate the observed values s and t.
Let $T^{w}_{q}$ denote the 100q-th percentile of the ordered values of $t_w$. Then $\left[\, T^{w}_{\gamma/2},\; T^{w}_{1 - \gamma/2} \,\right]$ is the 1 − γ FGCI of θ.

Proposed FGCI 3
Hannig [13] proposed five ways of finding an FGPQ for δ and identified the optimal choice through simulation studies. From (10), the approximate GPQ for θ, denoted $T_F$, is obtained by using this choice as the GPQ for δ, and the 100(1 − γ)% FGCI for the mean of the zero-inflated Pareto distribution is
$$\left[\, T^{F}_{\gamma/2},\; T^{F}_{1 - \gamma/2} \,\right].$$
Computational details for FGCI 3 are shown in the following Algorithm 3.

Algorithm 3:
For a given sample, determine $n_0$ and $n_1$, and calculate the observed values s and t.
Let $T^{F}_{q}$ denote the 100q-th percentile of the ordered values of $t_F$. Then $\left[\, T^{F}_{\gamma/2},\; T^{F}_{1 - \gamma/2} \,\right]$ is the 1 − γ FGCI of θ.
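The three algorithms share the same structure: simulate the GPQ of θ many times and take percentiles of the ordered values. The Python sketch below illustrates that structure only. The GPQ for δ is passed in as a function, and the example `beta_delta_gpq`, a Beta($n_0$ + 1/2, $n_1$ + 1/2) draw, is a stand-in assumption and not necessarily any of the three GPQs for δ used in this paper. Treating the mean as infinite when the drawn shape parameter is at most 1 is likewise an assumption of this sketch, and all function names and input values are hypothetical.

```python
import math
import random

def percentile(sorted_vals, q):
    """Return the 100q-th percentile of an already sorted list."""
    idx = min(len(sorted_vals) - 1, max(0, math.ceil(q * len(sorted_vals)) - 1))
    return sorted_vals[idx]

def beta_delta_gpq(n0, n1, rng):
    # ASSUMPTION: stand-in GPQ for delta, not taken from the paper.
    return rng.betavariate(n0 + 0.5, n1 + 0.5)

def fgci_mean(n0, n1, s, t, delta_gpq, gamma=0.05, m=5000, rng=None):
    """Percentile interval for theta from m simulated GPQ values."""
    rng = rng or random.Random()
    values = []
    for _ in range(m):
        u = rng.gammavariate(1.0, 2.0)        # U ~ chi-square(2)
        v = rng.gammavariate(n1 - 1.0, 2.0)   # V ~ chi-square(2(n1 - 1))
        r_alpha = v / (2.0 * n1 * t)          # GPQ for alpha
        r_c = math.exp(s - t * u / v)         # GPQ for c
        r_delta = delta_gpq(n0, n1, rng)      # GPQ for delta (pluggable)
        if r_alpha > 1.0:
            values.append((1.0 - r_delta) * r_alpha / (r_alpha - 1.0) * r_c)
        else:
            # ASSUMPTION: the mean is infinite when alpha <= 1.
            values.append(math.inf)
    values.sort()
    return percentile(values, gamma / 2), percentile(values, 1 - gamma / 2)

# Hypothetical summary statistics, roughly what delta = 0.3, alpha = 3, c = 2 would give.
lo, hi = fgci_mean(60, 140, math.log(2.0) + 0.002, 1.0 / 3.0,
                   beta_delta_gpq, rng=random.Random(0))
print(round(lo, 3), round(hi, 3))
```

Passing the GPQ for δ as an argument keeps the percentile step identical across methods, so a different choice for δ can be swapped in without touching the rest of the procedure.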

Simulation Studies
Because the methods proposed in the preceding sections are approximate, we use three measures to appraise their validity and accuracy through Monte Carlo simulation:
(i) Coverage probability (CP): the proportion of constructed confidence intervals that contain the true value of the parameter of interest.
(ii) Upper error rate (UER): the proportion of intervals for which the true value lies above the upper limit.
(iii) Lower error rate (LER): the proportion of intervals for which the true value lies below the lower limit.
Our simulation study was conducted with six different proportions of zeros, ranging from low to high, combined with different parameter settings and sample sizes. To estimate the coverage probabilities of the FGCIs for the mean, we generated 2500 samples of size n, each consisting of zeros and non-zero observations drawn from a Pareto(α, c) distribution. For each generated sample, we used Algorithms 1-3 with 5000 runs to compute the 95% CIs. The proportion of the 2500 CIs that included the true mean is the Monte Carlo estimate of the coverage probability. The estimated coverage probabilities are reported in Tables 1 and 2. The simulation results show that all proposed methods were conservative for small samples in all scenarios. As the sample size increased, the coverage probabilities of FGCI 1 and FGCI 2 approached the nominal level, with FGCI 2 converging much faster than FGCI 1. The coverage probabilities of FGCI 3 were very sensitive to the parameter values: as the parameter values became large, the coverage became very liberal. All three methods produced fairly balanced tail error rates for larger sample sizes. The proportion of zeros had little effect on the coverage probabilities.
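The Monte Carlo procedure described above can be sketched in Python as follows. This is a scaled-down illustration rather than the simulation code behind Tables 1 and 2: the number of replications and fiducial draws is reduced to keep the run short, the GPQ for δ is a stand-in Beta($n_0$ + 1/2, $n_1$ + 1/2) draw (an assumption, not one of the three proposed GPQs), and the function names and parameter values are hypothetical.

```python
import math
import random

def draw_summary(n, delta, alpha, c, rng):
    """Simulate one zero-inflated Pareto sample and return (n0, n1, s, t)."""
    while True:
        logs = [math.log(c * rng.paretovariate(alpha))
                for _ in range(n) if rng.random() >= delta]
        if len(logs) >= 2:  # at least two non-zeros are needed for V
            break
    n1 = len(logs)
    s = min(logs)
    return n - n1, n1, s, sum(logs) / n1 - s

def fgci(n0, n1, s, t, gamma, m, rng):
    """Percentile interval from m simulated GPQ values of theta."""
    vals = []
    for _ in range(m):
        u = rng.gammavariate(1.0, 2.0)
        v = rng.gammavariate(n1 - 1.0, 2.0)
        r_alpha = v / (2.0 * n1 * t)
        r_delta = rng.betavariate(n0 + 0.5, n1 + 0.5)  # ASSUMPTION: stand-in GPQ
        if r_alpha > 1.0:
            vals.append((1.0 - r_delta) * r_alpha / (r_alpha - 1.0)
                        * math.exp(s - t * u / v))
        else:
            vals.append(math.inf)  # ASSUMPTION: infinite mean when alpha <= 1
    vals.sort()
    lo = vals[max(0, math.ceil(gamma / 2 * m) - 1)]
    hi = vals[min(m - 1, math.ceil((1 - gamma / 2) * m) - 1)]
    return lo, hi

def coverage(n, delta, alpha, c, reps, m, gamma=0.05, seed=0):
    """Return Monte Carlo estimates (CP, UER, LER)."""
    rng = random.Random(seed)
    theta = (1.0 - delta) * alpha * c / (alpha - 1.0)
    above = below = 0
    for _ in range(reps):
        lo, hi = fgci(*draw_summary(n, delta, alpha, c, rng), gamma, m, rng)
        if theta > hi:
            above += 1    # true value above the upper limit
        elif theta < lo:
            below += 1    # true value below the lower limit
    return (reps - above - below) / reps, above / reps, below / reps

# Reduced settings for illustration; the study itself uses 2500 samples and 5000 runs.
cp, uer, ler = coverage(n=50, delta=0.3, alpha=3.0, c=2.0, reps=200, m=500)
print(cp, uer, ler)
```

By construction the three estimates sum to one, so balanced tail error rates correspond to UER and LER each being close to half of 1 − CP.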
It is noteworthy that, in our simulation studies, we simulated a case with a sample size of 37,000 to show that our proposed methods worked properly for the real-data example in Section 4. Furthermore, our simulation results showed that all three methods returned satisfactory outcomes as long as the parameters were small; when the values of parameters became larger, the coverage probabilities for FGCI 3 became liberal.
In conclusion, since FGCI 2 achieved satisfactory coverage probabilities in all scenarios considered, we recommend its use in real-world problems.

Real-Data Example
In this section, we apply our proposed methods to a network science application [5]. Complex network science has recently become a significant tool in many fields, such as sociology, climate informatics, finance, and genetics (see [5] and the references therein).
In this example, we consider the phone call network dataset [5], which is directed and contains 36,595 vertices and 91,826 edges. Figure 1 gives an overall visualization of this dataset.
Many of the prototypes in the network science models exhibited a power law phenomenon, namely, the degree (k) distributions of such networks were usually heavy-tailed with the power law distribution [19].
Thus, the Pareto distribution became a natural continuous approximation in such an application, and has been widely investigated in the analysis of the degree distribution of complex networks [20].
While most studies of degree distributions focus on undirected networks, our example concerns a directed network. Figure 2 shows the in-degree (left panel) and out-degree (right panel) distributions, plotted as survival probabilities on a log-log scale. The approximately straight lines suggest a power law (Pareto) distribution for the data. We also computed parameter estimates for both the in-degree and out-degree: the scale and shape parameters for the in-degree were (1.000, 0.109), and those for the out-degree were (1.000, 0.110). Both the Pareto probability plots and the computed estimates indicate that the data fit the Pareto distribution well.
In an undirected network without isolated vertices, every vertex has a positive degree. In a directed network, however, a vertex may have zero in-degree or zero out-degree. Thus, the zero-inflated Pareto distribution is a natural tool for modeling the degree distribution of directed networks that follow the power law. Tables 3 and 4 present the fitted results for the mean of the degree distribution (in and out) obtained with our proposed methods. We present the confidence intervals produced by each method in the two tables to show the applicability of our proposed methods. As can be seen from the tables, the three methods give consistent results. However, as noted in the simulation analysis, FGCI 2 had consistent coverage probabilities in all scenarios, so we recommend the interval produced by FGCI 2 over the others.
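The log-log survival diagnostic described above can be sketched in Python as follows. The data here are synthetic and are not the phone call dataset; the function names are hypothetical, and the maximum likelihood estimates shown (the smallest positive value for the scale and $n_1 / \sum \ln(x_i/\hat{c})$ for the shape) are one common choice that is assumed for illustration and may differ from the estimates reported in this section.

```python
import math
import random

def loglog_survival(values):
    """Return (log x, log S(x)) points of the empirical survival function."""
    xs = sorted(v for v in values if v > 0)
    n = len(xs)
    # (n - i - 0.5) / n avoids log(0) at the largest observation.
    return [(math.log(x), math.log((n - i - 0.5) / n)) for i, x in enumerate(xs)]

def ols_slope(points):
    """Least-squares slope of y on x for a list of (x, y) points."""
    mx = sum(x for x, _ in points) / len(points)
    my = sum(y for _, y in points) / len(points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    return sxy / sxx

def pareto_mle(values):
    """Maximum likelihood estimates (scale, shape) from the positive values."""
    xs = [v for v in values if v > 0]
    c_hat = min(xs)
    alpha_hat = len(xs) / sum(math.log(x / c_hat) for x in xs)
    return c_hat, alpha_hat

# Synthetic zero-inflated Pareto data (shape 2.5, scale 1, 40% zeros).
rng = random.Random(3)
data = [0.0 if rng.random() < 0.4 else rng.paretovariate(2.5) for _ in range(2000)]
slope = ols_slope(loglog_survival(data))
c_hat, alpha_hat = pareto_mle(data)
print(round(slope, 2), round(c_hat, 4), round(alpha_hat, 2))
```

For Pareto data the survival function is $(c/x)^{\alpha}$, so the points lie close to a straight line with slope −α on the log-log scale, which is the pattern the figure is used to check.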

Conclusions
In this paper, we proposed three approaches to construct fiducial generalized confidence intervals for the mean of the zero-inflated Pareto distribution. Detailed computational algorithms were provided, and we conducted an extensive Monte Carlo simulation on various cases to compare the proposed methods. The results showed that the proposed methods performed satisfactorily in terms of coverage probability, with FGCI 2 performing well in all scenarios. Furthermore, we applied our approaches to a real dataset from network science, where they provided reasonable results. Future work could address confidence intervals for the zero-inflated generalized Pareto distribution.
Author Contributions: Conceptualization, methodology, formal analysis, X.L.; software, validation, writing-original draft preparation, X.W. All authors have read and agreed to the published version of the manuscript.