Abstract
This paper introduces a new flexible probability tool for modeling extreme and zero-inflated count data under different shapes of hazard rate. Many relevant mathematical and statistical properties are derived and analyzed. The new tool can be used to model several kinds of data, such as “asymmetric and left-skewed”, “asymmetric and right-skewed”, “symmetric”, “symmetric and bimodal”, “uniform”, and “right-skewed with a heavy tail” data, among other useful shapes. The failure rate of the new class can take the forms “increasing-constant”, “constant”, “monotonically decreasing”, “bathtub”, “monotonically increasing”, or “J-shaped”. Eight classical estimation techniques—Cramér–von Mises, ordinary least squares, weighted least squares, L-moments, maximum likelihood, Kolmogorov, bootstrapping, and Anderson–Darling left-tail second-order estimation—are considered, described, and applied. Additionally, Bayesian estimation under the squared error loss function is derived and discussed. A comprehensive comparison between the approaches is performed for both simulated and real-life data. Finally, four real datasets are analyzed to demonstrate the flexibility, applicability, and notability of the new class.
Keywords:
survival discretization; Gibbs sampler; Metropolis–Hastings technique; L-moment structure; bootstrapping approach; Kolmogorov method; Bayesian analysis; Markov chain Monte Carlo; extreme and zero-inflated count data
MSC:
62E99; 62E15
1. Introduction
The discretization of existing continuous models has recently drawn significant interest. This is because data must often be recorded on a discrete scale rather than on a continuous analog. Examples include the number of daily COVID-19 deaths, or the number of renal cysts caused by steroid use, among others. In order to analyze such data, new discrete models have been proposed and investigated in the statistical and mathematical literature. For instance, the discrete Weibull (DW) (Nakagawa and Osaki [1]), discrete Rayleigh (DR) (Roy [2]), discrete inverse Weibull (DIW) (Jazi et al. [3]), discrete exponential (DE) (Gomez-Déniz [4]), discrete inverse Rayleigh (DIR) (Hesterberg [5]), discrete exponentiated Weibull (EDW) (Nekoukhou and Bidram [6]), discrete Lindley-II (DLy-II) (Hussain et al. [7]), discrete Lomax (DLx) and discrete Burr XII (DBXII) (Para and Jan [8]), discrete Burr–Hatke (for more details, see El-Morshedy et al. [9]), discrete exponentiated Lindley (EDLy) (see El-Morshedy et al. [10]), the new discrete distribution proposed by Yousof et al. [11], and the discrete inverse Burr (DIB) distribution (see Chesneau et al. [12]), among others.
On the other hand, another statistical approach has recently been followed to define new discrete G families of probability distributions. This approach generates new discrete G families based on existing continuous families. For example, following Bourguignon et al. [13], Aboraya et al. [14] defined and studied the discrete Rayleigh G (DR-G) family of distributions, while Ibrahim et al. [15] proposed a discrete analog of the Weibull G family. Following Steutel and van Harn [16], Eliwa et al. [17] introduced and discussed the discrete Gompertz G (DGz-G) family of distributions. Recently, Yousof et al. [18] presented a new G family of continuous distributions called the exponential generalized G (EG-G) family. The CDF of the EG-G family can be expressed as follows:
where
is the generalized odds ratio argument, refers to the survival function (SF) of any baseline model with the parameter vector , is a scale parameter, and is an additional shape parameter. In this work and following Yousof et al. [18] and Steutel and van Harn [16], we define and study a new discrete analog of the EG-G family called the discrete exponential generalized G (DEG-G) family. We are motivated to introduce the DEG-G family for the following reasons:
- Generating new probability mass functions that can be “asymmetric and left-skewed”, “asymmetric and right-skewed”, “symmetric”, “symmetric and bimodal”, “uniform”, or “right-skewed with a heavy tail”, among other useful shapes. The wide flexibility of the probability mass function (PMF) of any new model allows us to employ the new model for analyzing many different environmental datasets.
- Presenting some new special models with different types of hazard rate functions (HRFs), such as “increasing-constant”, “constant”, “monotonically decreasing”, “bathtub”, “monotonically increasing”, “decreasing–increasing–decreasing”, and “J-shaped”. The more failure rate shapes a distribution can accommodate, the more flexible it is. These shapes facilitate the work of many practitioners who may use the new distribution in statistical modeling and mathematical analysis. For this specific reason, we give the problem of examining the failure rate function a great deal of attention.
- The ranges of the skewness and kurtosis coefficients, together with the diversity of the PMF and failure rate shapes, all contribute to the flexibility of the new distribution. Additionally, the probability distribution’s usability and effectiveness in statistical modeling are crucial in this regard. Examining the novel PMF, we found it to be quite adaptable in these and other respects, which is what motivated us to investigate this probability distribution thoroughly.
- Proposing new discrete models for modeling “over-dispersed”, “equi-dispersed”, and “under-dispersed” real data. As shown in this paper, the new discrete family shows remarkable superiority in modeling these types of data, whether symmetric or asymmetric, and with or without outliers.
- Introducing new discrete models for analyzing extreme and zero-inflated count data.
- Comparing the estimation methods using both simulated and real-life data in order to recommend the best method in each case.
- A zero-inflated model in statistics is a statistical model based on a zero-inflated probability distribution, i.e., a distribution that permits many zero-valued observations. For instance, the number of insurance claims within a population for a particular type of risk would be zero-inflated because of individuals who have not purchased insurance against the risk and are therefore unable to make a claim. The zero-inflated Poisson regression model is often used for modeling and predicting zero-inflated count data; in this paper, however, we are motivated to use the DEG-G family for this purpose.
- In statistical modeling of the bathtub hazard rate count data, the DEG-G family under the Weibull baseline model provides adequate results; hence, the DEG-G family under the Weibull baseline is recommended for modeling the bathtub hazard rate count data (see Section 6.1). Moreover, the same baseline model is also suitable for modeling the monotonically increasing failure rate count data with adequate fitting (see Section 6.2).
- In the case of zero-inflated medical data with a decreasing failure rate and some outliers, the new family is an appropriate choice to deal with this type of data (see Section 6.3).
- In the case of zero-inflated agricultural data with a decreasing–increasing–decreasing failure rate and some outliers, the new class is an appropriate choice for modeling this kind of data (see Section 6.4).
- In fact, we empirically demonstrate that the proposed family of distributions fits four real datasets more accurately than 16 other extended relevant distributions with 3–4 parameters (see Section 7).
- Through simulation experiments and relying on the new class, many of the classical estimation techniques and Bayesian approaches are tested and evaluated, and important conclusions are reached in this regard, including the following:
- The maximum likelihood estimation approach is still the most efficient and most consistent of the classical methods; however, most of the other methods perform well, except for the Kolmogorov estimation method.
- Generally, the Bayesian technique and maximum likelihood estimation method can be recommended for statistical modeling and applications.
- The Kolmogorov estimation method provides the worst results for all real datasets; this problem still needs further investigation to understand its main causes.
Many helpful statistical properties, such as the probability-generating function, central and ordinary moments, moment-generating function, cumulant-generating function, and dispersion index (Disp-Ix), are calculated and statistically examined in this article once the new generator is defined. Some special discrete members, based on the Weibull (W), inverse Weibull (IW), Lomax (Lx), Burr X (BX), inverse Burr X (IBX), log-logistic (LL), Rayleigh (R), inverse Rayleigh (IR), exponential (E), inverse Lindley (ILi), inverse Lomax (ILx), inverse log-logistic (ILL), inverse exponential (IE), and Lindley (Li) distributions, are listed in Table 1. Different classical (non-Bayesian) methods of estimation, including the Cramér–von Mises estimation (CVME), maximum likelihood estimation (MLE), ordinary least squares estimation (OLSE), bootstrapping (Bootst), L-moments (L-mom), Kolmogorov estimation (KE), weighted least squares estimation (WLSE), and Anderson–Darling left-tail second-order estimation (AD2LE), are considered. For more details about these methods, see the works of Chesneau et al. [12], Yousof et al. [18], Aboraya et al. [14], and Ibrahim et al. [15]. Bayesian estimation under the squared error loss function is also considered. The well-known Markov chain Monte Carlo (MCMC) simulations are performed to compare the classical and Bayesian methods. The applicability of the DEG-G family is explained and discussed using four real-life datasets. The DEG-G family under the Weibull baseline provides a more adequate fit than many competitive models, based on the Akaike information criterion (AICR), consistent Akaike information criterion (CAICR), chi-squared (χ²), Kolmogorov–Smirnov (K–S) statistic, and the corresponding p-value (P.V). For more detail about these statistics, see the works of El-Morshedy et al. [9,10], Aboraya et al. [14], and Eliwa et al. [17].
Table 1.
Some new sub-models.
2. The DEG-G Class
Starting with (1) and utilizing the discretization approach, the CDF of the DEG-G family can be formulated as follows:
where ), , and
The corresponding SF to (2) can be derived as follows:
According to Kemp [19] and (3), the PMF of the DEG-G family can be expressed as follows:
where
Based on (3) and (4), the hazard rate function (HRF) can then be proposed as follows:
In Table 1, some members of the DEG-G family are provided. The new PMF in (4) is most tractable when the CDF of the baseline member has a simple analytic expression.
For the W model, we have . Then, based on (3), the PMF of the DEGW model can be expressed as follows:
In Figure 1 and Figure 2, some PMF and HRF plots of the DEGW model are sketched under some selected parameter values. Based on Figure 1, it can be seen that the PMF of the DEGW can be “asymmetric and left-skewed”, “asymmetric and right-skewed”, “symmetric”, “uniform”, or “right-skewed with a heavy tail”, among other useful PMF shapes. Moreover, it can be used as a probability tool to discuss zero-inflated data. According to Figure 2, we can conclude that the HRF of the DEGW can be “increasing-constant”, “constant”, “monotonically decreasing”, “bathtub or decreasing-constant-increasing”, “monotonically increasing”, or “J-shaped”.
Figure 1.
The PMF of the DEGW model.
Figure 2.
The HRF of the DEGW model.
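To make the construction above concrete, the following R sketch implements the survival-discretization device in its generic form: for a continuous survival function S(t), the discrete analog has PMF p(x) = S(x) − S(x + 1), CDF F(x) = 1 − S(x + 1), and HRF h(x) = p(x)/S(x) for x = 0, 1, 2, …. Since the displayed EG-G survival function is not reproduced here, a plain Weibull baseline and the chosen parameter values are stand-in assumptions used only to illustrate how PMF and HRF shapes can be explored.

```r
# Generic survival discretization: any continuous SF can be plugged in.
baseline_sf <- function(t, shape, scale) exp(-(t / scale)^shape)  # stand-in Weibull S(t)

discretize <- function(sf, x, ...) {
  pmf <- sf(x, ...) - sf(x + 1, ...)   # p(x) = S(x) - S(x + 1)
  cdf <- 1 - sf(x + 1, ...)            # F(x) = 1 - S(x + 1)
  hrf <- pmf / sf(x, ...)              # h(x) = p(x) / P(X >= x) = p(x) / S(x)
  data.frame(x = x, pmf = pmf, cdf = cdf, hrf = hrf)
}

x_grid <- 0:30
d <- discretize(baseline_sf, x_grid, shape = 0.7, scale = 5)  # illustrative values
plot(d$x, d$pmf, type = "h", xlab = "x", ylab = "PMF")        # PMF shape
plot(d$x, d$hrf, type = "s", xlab = "x", ylab = "HRF")        # HRF shape
```

Replacing baseline_sf with the continuous EG-G survival function in (1), evaluated for any baseline of Table 1, yields the corresponding DEG-G special model.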
3. Main Properties
3.1. Moments
Theorem 1.
Assume that the non-negative random variable (RV) follows the DEG-G () family; then, the rth moment of the RV can be expressed as follows:
Proof.
Since
Thus,
Using (5), the mean, the second, third, and fourth ordinary moments, and the variance can be written, respectively, as follows:
and
Table 2 lists some numerical results for the first four ordinary moments under the DEGW model. It should be noted that these moments can be evaluated numerically, although they have no closed forms. All numerical results in Table 2 were obtained using the R program; the infinite sums were handled by truncating them at a very large upper limit (10^8), since the terms beyond this point are negligible. Therefore, the reported moments are approximate.
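The truncation device mentioned above can be sketched as follows in R; the PMF below is the discretized Weibull stand-in from the previous sketch (not the exact DEGW PMF), and the truncation point is an assumption chosen so that the ignored tail terms are negligible.

```r
# Approximate the rth ordinary moment E(X^r) = sum_x x^r p(x) by truncating the sum.
pmf_disc <- function(x, shape, scale) {
  exp(-(x / scale)^shape) - exp(-((x + 1) / scale)^shape)   # stand-in discrete PMF
}

raw_moment <- function(r, shape, scale, upper = 1e5) {
  x <- 0:upper                 # truncation point (the paper uses a far larger limit)
  sum(x^r * pmf_disc(x, shape, scale))
}

mu1 <- raw_moment(1, shape = 0.7, scale = 5)
mu2 <- raw_moment(2, shape = 0.7, scale = 5)
c(mean = mu1, variance = mu2 - mu1^2)
```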
Table 2.
, , , and of the DEGW distribution.
3.2. Central Moment and Dispersion Index
The central moment of the RV , i.e., , can be formulated as follows:
Hence, the variance can be expressed as follows:
or
The dispersion index (Disp-Ix) of the DEG-G family can be derived as follows:
Some numerical results for the Disp-Ix are presented in Table 3, with useful comments. The Disp-Ix, also known as the coefficient of dispersion, relative variance, or variance-to-mean ratio (VMR), is a normalized measure of the dispersion of a probability distribution, used in probability theory and statistics to determine whether a set of observed occurrences is clustered or dispersed relative to a reference statistical model; in particular, it can be used to establish whether an observed real dataset can be modeled using a Poisson process. When the Disp-Ix for a real dataset is less (greater) than 1, the dataset is said to be under-dispersed (over-dispersed). Table 3 displays a numerical analysis and the associated computations for the Disp-Ix. The kurtosis and skewness of the RV can be obtained from the common relationships, and Table 3 also reports numerical results for these measures of the DEGW distribution. Based on Table 3, the reported measures vary monotonically with the model parameters; the skewness and kurtosis can each take a wide range of values, and the Disp-Ix can be less than, greater than, or equal to 1, the last case matching the standard, well-known Poisson distribution (see Poisson [20]). Thus, the DEGW distribution could be useful in modeling “under-dispersed”, “equi-dispersed”, or “over-dispersed” count data.
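The following short R sketch shows how the Disp-Ix, skewness, and kurtosis can be computed directly from any PMF supported on 0, 1, 2, …; it reuses the stand-in pmf_disc defined above, so the printed numbers illustrate the computation rather than the exact DEGW values of Table 3.

```r
# Moment-based shape measures of a discrete distribution on 0, 1, 2, ...
shape_measures <- function(pmf, upper = 1e5, ...) {
  x  <- 0:upper
  p  <- pmf(x, ...)
  m1 <- sum(x * p)                      # mean
  v  <- sum((x - m1)^2 * p)             # variance
  sk <- sum((x - m1)^3 * p) / v^1.5     # skewness
  ku <- sum((x - m1)^4 * p) / v^2       # kurtosis
  c(mean = m1, variance = v, skewness = sk, kurtosis = ku, disp_ix = v / m1)
}

shape_measures(pmf_disc, shape = 0.7, scale = 5)  # here disp_ix > 1, i.e., over-dispersion
```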
Table 3.
Numerical results for , , , and of the DEGW distribution.
3.3. Generating Functions
Theorem 2.
Assume that the non-negative RV follows the DEG-G() class. Then, the moment-generating function (MGF) of the RV can be derived as follows:
Proof.
The MGF of the non-negative RV
can be derived as follows:
Successive derivatives of (6) with respect to the argument of the MGF, evaluated at zero, yield the moments about the origin, i.e.,
where
and
The cumulant-generating function (CGF) is the logarithm of the MGF; thus, the rth cumulant can be obtained directly from it. In this context, we can highlight some important mathematical results: the 1st cumulant is the mean, the 2nd cumulant is the variance, and the 3rd cumulant coincides with the 3rd central moment; however, the 4th and higher-order cumulants are not equal to the corresponding central moments. The cumulants can also be expressed in terms of the ordinary moments.
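For reference, the standard moment–cumulant identities (general facts, not specific to the DEG-G family) are:

```latex
\kappa_1=\mu'_1,\qquad
\kappa_2=\mu'_2-(\mu'_1)^2,\qquad
\kappa_3=\mu'_3-3\mu'_2\mu'_1+2(\mu'_1)^3,\qquad
\kappa_4=\mu'_4-4\mu'_3\mu'_1-3(\mu'_2)^2+12\mu'_2(\mu'_1)^2-6(\mu'_1)^4 .
```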
It is possible to derive the probability-generating function (PGF) as follows:
4. Estimation and Inference
In this section, we are concerned with the different estimation methods, including classical methods and Bayesian methods. The classical methods are many and varied, some of which depend on the theory of maximization, and some of which depend on the theory of minimization. In any case, the classical methods on the whole differ from Bayes’ method in their origin and methodology of estimation, as will be explained in detail in theory and practice. The two subsections of this section cover Bayesian and non-Bayesian estimation techniques. Eight non-Bayesian estimation techniques, including the MLE, CVME, OLSE, WLSE, L-mom., KE, Bootst, and AD2LE methods, are taken into consideration in the first subsection. Then, the Bayesian estimation approach under the well-known squared error loss function (SELF) is taken into consideration in the second subsection.
4.1. Classical Estimation Techniques
4.1.1. The Maximum Likelihood Estimation Method
Maximum likelihood estimation (MLE) is a statistical technique for estimating the parameters of a probability distribution that has been assumed given some observed data. This is accomplished by maximizing a likelihood function to make the observed data as probable as possible given the assumed statistical model. The maximum likelihood estimate is the location in the parameter space where the likelihood function is maximized. Maximum likelihood is a popular approach for making statistical inferences, since its rationale is clear and adaptable.
If the likelihood function is differentiable, the derivative test for locating maxima can be applied. In some circumstances, the first-order conditions of the likelihood function can be solved explicitly; for example, the ordinary least squares estimator maximizes the likelihood of the linear regression model. In many cases, however, numerical techniques are essential for finding the maximum of the likelihood function. From the standpoint of Bayesian inference, MLE is typically comparable to maximum a posteriori (MAP) estimation under a uniform prior distribution of the parameters; in frequentist inference, MLE is a specific example of an extremum estimator, with the likelihood as the objective function. If we assume a random sample from the presented class, then the log-likelihood function for the parameter vector can be given as follows:
The log-likelihood can then be maximized via statistical programs such as “R”, or by solving the nonlinear system obtained from it by differentiation. Then, the score vectors
can be easily derived as follows:
and
Setting
and simultaneously solving them produces the MLEs for the DEG-G family parameters. For numerically addressing such problems, the Newton–Raphson algorithm is used.
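As a hedged illustration, the sketch below maximizes the log-likelihood numerically in R, using optim's quasi-Newton routine as a stand-in for an explicit Newton–Raphson implementation; the stand-in PMF pmf_disc, the log-parametrization, and the toy data x are assumptions for illustration, not the paper's exact DEGW likelihood.

```r
# Numerical MLE for a discrete model whose PMF is available as a function.
negloglik <- function(par, x) {
  shape <- exp(par[1]); scale <- exp(par[2])       # log scale keeps parameters positive
  -sum(log(pmf_disc(x, shape, scale) + 1e-300))    # guard against log(0)
}

x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21)                # toy count data
fit <- optim(c(0, 1), negloglik, x = x, method = "BFGS", hessian = TRUE)
exp(fit$par)                                       # MLEs on the original scale
sqrt(diag(solve(fit$hessian)))                     # approximate St.Ers on the log scale
```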
4.1.2. The Cramér–von Mises Estimation Approach
Consider a random sample from the proposed generator. Then, the CVME of the parameter vector () can be obtained by minimizing
with respect to (w.r.t) and, respectively, where and
The three nonlinear equations below are then solved to yield the CVME of the parameters and :
and
where
and
are the first derivatives of the CDF of DEG-G distribution w.r.t , and , respectively.
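A minimal R sketch of the CVM objective follows, assuming the usual form C(Θ) = 1/(12n) + Σᵢ [F(x₍ᵢ₎) − (2i − 1)/(2n)]², minimized numerically rather than by solving the score-type equations above; cdf_disc is the stand-in CDF F(x) = 1 − S(x + 1), and x is the toy data from the MLE sketch.

```r
cdf_disc <- function(x, shape, scale) 1 - exp(-((x + 1) / scale)^shape)  # stand-in CDF

cvm_obj <- function(par, x) {
  shape <- exp(par[1]); scale <- exp(par[2])
  xs <- sort(x); n <- length(xs); i <- seq_len(n)
  1 / (12 * n) + sum((cdf_disc(xs, shape, scale) - (2 * i - 1) / (2 * n))^2)
}

cvme <- optim(c(0, 1), cvm_obj, x = x, method = "BFGS")
exp(cvme$par)   # CVM estimates on the original scale
```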
4.1.3. The Ordinary Least Squares Technique
Geometrically, this is defined as the total of the squared distances between each data point in the set and its corresponding point on the regression surface, which are measured parallel to the axis of the dependent variable. The lower the differences, the better the model fits the data. Particularly in the case of a basic linear regression, when there is only one regressor on the right-hand side of the regression equation, the resultant estimator can be stated by a straightforward formula. If denotes the CDF of the DEG-G family and represents the -ordered random sample, then the can be obtained upon minimizing
with respect to and, respectively, where . The OLSEs are obtained by solving the following nonlinear equations:
and
with respect to the parameters, where , and are defined in (9), (10), and (11), respectively.
4.1.4. The Weighted Least Squares Estimation Method
Ordinary least squares and linear regression can be generalized into weighted least squares (WLS), also known as weighted linear regression (WLR), which incorporates knowledge of the variance of the observations into the regression. Another variation of generalized least squares is WLS. If denotes the CDF of the DEG-G class, and we assume that is the -ordered random sample, then the WLSE can be derived by minimizing the function with respect to , and , where
Furthermore, the WLSEs can be obtained by solving
and
with respect to the parameters, where , and are defined in (9), (10), and (11), respectively.
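Both criteria can be coded in a few lines; in the hedged R sketch below, the OLS and WLS objectives differ only in the weights attached to the squared distances between the fitted CDF and the plotting positions i/(n + 1), with the usual weights (n + 1)²(n + 2)/[i(n − i + 1)] assumed for the WLS case.

```r
ls_obj <- function(par, x, weighted = FALSE) {
  shape <- exp(par[1]); scale <- exp(par[2])
  xs <- sort(x); n <- length(xs); i <- seq_len(n)
  w  <- if (weighted) (n + 1)^2 * (n + 2) / (i * (n - i + 1)) else 1   # WLS vs. OLS weights
  sum(w * (cdf_disc(xs, shape, scale) - i / (n + 1))^2)
}

olse <- optim(c(0, 1), ls_obj, x = x, method = "BFGS")                  # ordinary least squares
wlse <- optim(c(0, 1), ls_obj, x = x, weighted = TRUE, method = "BFGS") # weighted least squares
exp(rbind(OLSE = olse$par, WLSE = wlse$par))
```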
4.1.5. L-Moments Estimation Approach
For a random sample taken from a certain population, the sample’s L-mom can be established and utilized as estimators of the population’s L-mom. The L-mom for the population can be obtained from
The first three L-mom can be expressed as follows:
and
where is the L-mom for the sample. Then, the L-mom estimators of the parameters , and can be obtained by solving the following three equations numerically
and
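A sketch of the sample side of this approach is given below: the first three sample L-moments are computed from the usual probability-weighted moments b0, b1, b2, and the estimates then follow by matching them numerically to their population counterparts (which can be approximated by simulating from the fitted model); the data x are the toy values used earlier.

```r
sample_lmoments <- function(x) {
  xs <- sort(x); n <- length(xs); i <- seq_len(n)
  b0 <- mean(xs)
  b1 <- sum((i - 1) / (n - 1) * xs) / n
  b2 <- sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * xs) / n
  c(l1 = b0, l2 = 2 * b1 - b0, l3 = 6 * b2 - 6 * b1 + b0)  # first three sample L-moments
}

sample_lmoments(x)
```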
4.1.6. The Kolmogorov Estimation Method
The Kolmogorov estimates (KEs) can be obtained by minimizing the following function:
For estimating the parameters, the KEs are obtained by computing the two one-sided distances between the empirical and fitted CDFs, taking their maximum, and then minimizing the resulting function over the parameters. For more detail about the KE approach, see the work of Aguilar et al. [21].
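A hedged R sketch of this min–max criterion is given below: the objective is the largest vertical distance between the empirical CDF and the fitted CDF at the ordered observations, and a derivative-free optimizer is used because the objective is not smooth; cdf_disc and x are the stand-ins defined earlier.

```r
ks_obj <- function(par, x) {
  shape <- exp(par[1]); scale <- exp(par[2])
  xs <- sort(x); n <- length(xs); i <- seq_len(n)
  Fhat <- cdf_disc(xs, shape, scale)
  max(pmax(i / n - Fhat, Fhat - (i - 1) / n))   # D = max(D+, D-)
}

ke <- optim(c(0, 1), ks_obj, x = x, method = "Nelder-Mead")  # derivative-free search
exp(ke$par)   # Kolmogorov estimates
```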
4.1.7. Bootstrapping Technique
Bootst, which is a type of test or metric that mimics the sampling process by using random sampling with replacement, belongs to the larger category of resampling techniques. With Bootst, sample estimates are given accuracy measures such as bias, variance, confidence intervals, and prediction error. Using random sampling, this strategy enables estimation of the sampling distribution of nearly any statistic; the empirical distribution function of the observed data is a common choice for the approximating distribution. When the observations can be assumed to come from an independent and identically distributed population, a number of resamples (of the same size as the observed dataset) can be constructed by sampling the observed data with replacement. The Bootst approach is a powerful statistical procedure that is particularly helpful with small sample sizes: the usual “normal” or “t” approximations cannot generally be relied upon for sample sizes smaller than 40 (see Efron [22] and Hesterberg [5]), whereas Bootst techniques, which rely on resampling rather than on distributional assumptions, perform very well with such samples. With the increased accessibility of computational resources, Bootst has grown in popularity, since a computer is needed for it to be practical. In the application section, we examine this in more detail.
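A minimal nonparametric bootstrap sketch in R follows; it resamples the toy data x with replacement, re-fits by maximum likelihood (reusing negloglik from the MLE sketch), and summarizes the resampled estimates with standard errors and percentile confidence intervals. The number of resamples B is an illustrative assumption.

```r
set.seed(1)
B <- 500
boot_est <- t(replicate(B, {
  xb <- sample(x, replace = TRUE)                               # resample with replacement
  exp(optim(c(0, 1), negloglik, x = xb, method = "BFGS")$par)   # re-fit on the resample
}))
apply(boot_est, 2, sd)                                          # bootstrap standard errors
apply(boot_est, 2, quantile, probs = c(0.025, 0.975))           # 95% percentile intervals
```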
4.1.8. The Anderson–Darling Left-Tail from the Second Order Approach
The AD2LEs of and can be obtained by minimizing
Thus, the AD2LEs can be derived by solving the following nonlinear equations:
and
4.2. Bayesian Estimation
Before we can even begin to discuss how a Bayesian approach might estimate a population parameter, we must first recognize one important distinction between frequentist and Bayesian statisticians. The distinction is whether a statistician views a parameter as a random variable or as an unknowable constant. A Bayes estimator, also known as a Bayes action, is an estimator or decision rule used in estimation theory and decision theory that minimizes the posterior expected value of a loss function (i.e., the posterior expected loss). In other words, it optimizes the utility function’s posterior expectation. Maximum a posteriori estimation is a different approach to constructing an estimator in the context of Bayesian statistics. We can assume the following prior distributions for the parameters , and, where
and
We can also assume that the parameters are independently distributed. The joint prior distribution can be written as follows:
The posterior distribution of the parameters is defined as follows:
where the likelihood = and is the joint prior distribution. Under SELF, the Bayesian estimators of and are the means of their marginal posteriors. These marginal posteriors cannot be obtained in closed form and thus cannot be used directly to derive the Bayesian estimates; hence, numerical approximation is required. Markov chain Monte Carlo (MCMC) methods are a class of algorithms used in statistics for sampling from probability distributions: a sample from the desired distribution is obtained by building a Markov chain whose equilibrium distribution is the target distribution and recording states from the chain, and the distribution of the recorded states approaches the target distribution as the number of steps increases. Several algorithms are available for building such chains—notably the Metropolis–Hastings algorithm. In this work, we suggest using two MCMC approaches, the Gibbs sampler and the Metropolis–Hastings (M–H) algorithm. Since the conditional posteriors of the parameters and cannot be obtained in any standard form, it is advisable to draw samples from the joint posterior of the parameters using a hybrid MCMC scheme. The full conditional posteriors of and can be easily calculated. The simulation algorithm can be summarized in the following steps (an illustrative code sketch is given after the list):
- (1)
- Assume the initial values for , and at the stage.
- (2)
- Consider the elementary values for , and at the stage.
- (3)
- The M–H approach is utilized to derive
- (4)
- To obtain samples of the desired size from the relevant posteriors of interest, repeat Steps 2 and 3 M = 100,000 times.
- (5)
- Obtain the Bayesian estimates of , and using the following formulae:
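The sketch below illustrates the M–H ingredient of the hybrid scheme in R: a random-walk proposal on the log scale, acceptance by the usual M–H ratio, and posterior means of the retained draws as the Bayesian estimates under SELF. The vague gamma priors, the proposal scale, the burn-in length, and the stand-in likelihood pmf_disc are all illustrative assumptions rather than the paper's exact specification.

```r
log_post <- function(par, x) {
  shape <- exp(par[1]); scale <- exp(par[2])
  loglik   <- sum(log(pmf_disc(x, shape, scale) + 1e-300))
  logprior <- dgamma(shape, 0.01, 0.01, log = TRUE) +
              dgamma(scale, 0.01, 0.01, log = TRUE)       # assumed vague gamma priors
  loglik + logprior + sum(par)                            # + Jacobian of the log transform
}

M <- 20000; draws <- matrix(NA, M, 2)
cur <- c(0, 1); lp_cur <- log_post(cur, x)
for (m in 1:M) {
  prop    <- cur + rnorm(2, sd = 0.1)                     # random-walk proposal
  lp_prop <- log_post(prop, x)
  if (log(runif(1)) < lp_prop - lp_cur) { cur <- prop; lp_cur <- lp_prop }
  draws[m, ] <- exp(cur)
}
colMeans(draws[-(1:5000), ])   # posterior means after burn-in = Bayesian estimates under SELF
```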
5. Simulations: Comparing Classical and Bayesian Estimation Methods
For comparing the classical and Bayesian methods, MCMC simulation studies were performed. The results are presented in Table 4, Table 5, Table 6, and Table 7, each reporting results for sample sizes of 50, 100, 200, and 300 under a different set of parameter values. The numerical assessments were performed based on the mean squared errors (MSEs). First, we generated samples from the DEGW model. Based on Table 4, Table 5, Table 6 and Table 7, it should be noted that the performance of all estimation methods improves as the sample size increases. The value marked with “*” is the best estimate in its row across all estimation methods. Generally, the MLE and the Bayesian methods are recommended for statistical modeling and applications; this assessment, as shown in Table 4, Table 5, Table 6 and Table 7, relies mainly on a comprehensive simulation study, and simulation, as is well known, precedes application to real data. Additionally, despite the diversity and abundance of classical methods, the MLE method remains the most efficient and most consistent of the classical methods, although most of the other methods also perform well. In this section, simulation studies are used to evaluate the different estimation methods rather than to compare them; this does not preclude the use of simulation for comparisons between estimation methods, but real data are more often used for such comparisons, which is what prompts us to present four applications for this specific purpose, in addition to four further applications comparing the competing models.
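The kind of Monte Carlo experiment summarized in Tables 4–7 can be sketched as follows: samples are drawn from the discrete model by inverse transform (comparing uniform draws with the fitted CDF), each sample is re-fitted, and the MSE of the estimates is averaged over replications. Only the MLE is shown; the true parameter values, the number of replications, and the stand-in model are assumptions for illustration.

```r
r_disc <- function(n, shape, scale, upper = 1e4) {
  cdf <- cdf_disc(0:upper, shape, scale)
  findInterval(runif(n), cdf)              # inverse-transform sampling on 0, 1, 2, ...
}

mse_mle <- function(n, shape, scale, R = 200) {
  est <- replicate(R, exp(optim(c(0, 1), negloglik,
                                x = r_disc(n, shape, scale), method = "BFGS")$par))
  rowMeans((est - c(shape, scale))^2)      # MSE of each parameter estimate
}

sapply(c(50, 100, 200, 300), mse_mle, shape = 0.7, scale = 5)  # MSEs shrink as n grows
```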
Table 4.
Results of the MSEs where .
Table 5.
Results of the MSEs where .
Table 6.
Results of the MSEs where .
Table 7.
Results of the MSEs where .
6. Comparing Various Estimation Techniques Via Real Data
6.1. Dataset I: Failure Times
This dataset represents the failure times of 50 devices, in weeks. The data observations are available in the work of Bodhisuwan and Sangpoom [23], and were recently analyzed by Eliwa et al. [17], Aboraya et al. [14], and Ibrahim et al. [15]. Table 8 lists the estimates, K–S, and P.V statistics for the failure time data. Based on Table 8, it can be seen that the Bayesian method is the best, with K–S = 0.14712 and P.V = 0.22927, followed by the MLE method, with K–S = 0.163038 and P.V = 0.15266. The KE method provides undesirable or unexpected results (K–S = 0.51000 and P.V < 0.0001), which may be due to the nature of the data used or to other random causes; in any case, these results need further study and analysis. Figure 3 gives the Kaplan–Meier (estimated survival function) plots using the failure time data for the nine estimation methods. The graphical results in Figure 3 confirm and support the numerical results shown in Table 8.
Table 8.
Comparing methods using dataset I.
Figure 3.
Kaplan–Meier plots based on dataset I.
6.2. Dataset II: Failure Times of 15 Electronic Components
This lifetime dataset gives the failure times of 15 electronic components in an accelerated life test (see Lawless [24]). Table 9 gives the estimates, K–S, and P.V statistics for the second failure time dataset. Based on Table 9, it can be noted that the AD2LE method is the best, with K–S = 0.09885 and P.V = 0.99855, followed by the Bayesian method, with K–S = 0.09937 and P.V = 0.99843. Figure 4 gives the Kaplan–Meier plots using the second failure time data for the nine estimation methods. The graphical results in Figure 4 support the results in Table 9. Again, the KE method provided undesirable or unexpected results (K–S = 0.53331 and P.V = 0.00039), which may be due to the nature of the data used or to other random causes; in any case, these results need further study and analysis.
Table 9.
Comparing methods using dataset II.
Figure 4.
Kaplan–Meier plots under dataset II.
6.3. Dataset III: Counts of Kidney Cysts
The kidney cyst count data reflect the numbers of cysts in kidneys subjected to corticosteroids, whose dysmorphogenesis is associated with deregulated expression of known cystogenic molecules, as well as Indian hedgehog (see the works of Chan et al. [25], Eliwa et al. [17], Aboraya et al. [14], and Ibrahim et al. [15]). Table 10 gives the estimates, K–S, and P.V statistics for the kidney dataset. Based on Table 10, we can observe that the AD2LE method is the best, with K–S = 0.09885 and P.V = 0.99855, followed by the CVM method, with K–S = 0.28412 and P.V = 0.86757. Figure 5 gives the Kaplan–Meier plots using the kidney data for all methods. The graphical results in Figure 5 confirm the results of Table 10. Moreover, the KE method provided undesirable or unexpected results (K–S = 134.915 and P.V < 0.0001), which may be due to the nature of the data used or to other random causes; in any case, these results need further study and analysis.
Table 10.
Comparing methods using dataset III.
Figure 5.
Kaplan–Meier plots according to dataset III.
6.4. Dataset IV: Number of European Corn Borer Larvae
These data represent the number of European corn borer larvae in the field (see the works of Bebbington et al. [26], Eliwa et al. [17], and Aboraya et al. [14]). Table 11 gives the estimates, K–S, and P.V statistics for dataset IV. Based on Table 11, the L-mom method is the best, with K–S = 1.34090 and P.V = 0.51148, followed by the CVM method, with K–S = 2.21687 and P.V = 0.36774. Figure 6 gives the Kaplan–Meier plots using the corn borer larvae data for all methods. The plots in Figure 6 confirm the results in Table 11. The KE and Bootst methods provided undesirable or unexpected results (with P.V < 0.0001), which may be due to the nature of the data used or to other random causes; in any case, these results need further study and analysis.
Table 11.
Comparing methods for dataset IV.
Figure 6.
Kaplan–Meier plots for dataset IV.
7. Competitive Models: Comparative Study and Interpretation
We used four real data applications to demonstrate the adaptability, usefulness, and significance of the DEGW distribution. The fitted distributions were analyzed and compared using the log-likelihood, AICR, CAICR, the chi-squared statistic with its degrees of freedom (d.f.) and P.V, and the K–S statistic with its P.V. Table 12 shows the competitive models and their abbreviations.
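For reference, the two information criteria can be computed from the maximized log-likelihood as in the R sketch below; the reported CAICR values in Tables 14, 16, 18 and 20 are consistent with the small-sample corrected form AIC + 2k(k + 1)/(n − k − 1), which is what is assumed here.

```r
ic <- function(loglik, k, n) {
  aic  <- -2 * loglik + 2 * k                      # Akaike information criterion
  caic <- aic + 2 * k * (k + 1) / (n - k - 1)      # small-sample corrected form (assumed)
  c(AICR = aic, CAICR = caic)
}

ic(loglik = -233.467, k = 3, n = 50)   # reproduces the dataset I values (472.93, 473.46)
```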
Table 12.
The competitive models.
7.1. Dataset I: Failure Time Data of 50 Devices
Using dataset I, we compared the fits of the DEGW distribution with some competitive discrete models, such as the EDW, DW, DIW, DLy-II, EDLy, DLL, and DPa. The failure time data are shown in Figure 7, together with the boxplot (left panel), quantile–quantile (Q-Q) plot (middle panel), and total time on test (TTT) plot (right panel). Table 13 displays the MLEs and associated standard errors (St.Ers). Table 14 displays the goodness-of-fit test statistics. The MATHCAD application was used to generate the results of Table 13, Table 14, and all other comparable results in the following subsections. Based on Table 14, the DEGW provides the best fit against all competitive models, with −ℓ = 233.467, AICR = 472.933, CAICR = 473.455, K–S = 0.16304, and P.V = 0.15266. Figure 8 gives the fitted HRF (FHRF), fitted SF (FSF) (also called the Kaplan–Meier SF), and probability–probability (P–P) plots for the failure time data. Based on Table 13, we have mean = 17.7667, variance = 811.8649, skewness = 1.359128, kurtosis = 3.271034, and Disp-Ix = 45.6957.
Figure 7.
Box, Q-Q, and TTT plots for dataset I.
Table 13.
The MLEs (and their corresponding St.Ers) for dataset I.
Table 14.
The goodness-of-fit test statistics for comparing the competitive models for dataset I.
Figure 8.
The FHRF, ESF, and P–P plots for dataset I.
7.2. Dataset II: Failure Times of 15 Electronic Components
We compared the DEGW distribution’s fit to those of some models, including the DGE-II, DEx, DR, DIW, DIR, DLx, DPa, and DBXII, using the electronic components data. Table 15 exhibits the MLEs and St.Ers. The test statistics are presented in Table 16. Based on Table 16, the DEGW provides the best fit compared to all discrete competitive models, with −ℓ = 63.791, AICR = 133.581, CAICR = 135.763, K–S = 0.11998, and P.V = 0.98219. Figure 9 gives the boxplot (left panel), Q-Q plot (middle panel), and TTT plot (right panel) for the second failure time data. Figure 10 gives the FHRF, ESF, and P–P plots for the second failure times. Based on Table 15, we have mean = 4.873615, variance = 133.6553, skewness = 2.890644, kurtosis = 11.67092, and Disp-Ix = 27.42426.
Table 15.
The MLEs (and their corresponding St.Ers) for dataset II.
Table 16.
The goodness-of-fit test statistics for comparing the competitive models for dataset II.
Figure 9.
Boxplot, Q-Q plot, and TTT plot for dataset II.
Figure 10.
The FHRF, ESF, and P–P plots for dataset II.
7.3. Dataset III: Counts of Kidney Data
We compared the DEGW distribution’s fit to those of some competing models, including the DW, DIW, DR, DE, DLi, DLy-II, DLx, and Poisson models, for this dataset. Table 17 shows the MLEs and St.Ers. The goodness-of-fit statistics are provided in Table 18. Based on Table 18, the DEGW provides the best fit against all competitive models, with −ℓ = 167.047, AICR = 340.094, CAICR = 340.321, = 0.35698, and P.V = 0.83653. Figure 11 provides the boxplot, Q-Q plot, and TTT plot for the kidney data. Figure 12 gives the fitted PMF (FPMF), fitted SF (FSF), fitted HRF (FHRF), and fitted CDF (FCDF) plots. Based on Table 17, we have mean = 1.432338, variance = 4.86933, skewness = 2.018928, kurtosis = 7.255506, and Disp-Ix = 3.399567.
Table 17.
The MLEs (and their corresponding St.Ers) for dataset III.
Table 18.
The goodness-of-fit test statistics for comparing the competitive models for dataset III.
Figure 11.
Boxplot, Q-Q plot, and TTT plot for dataset III.
Figure 12.
The FPMF, FSF, FHRF, and FCDF plots for dataset III.
7.4. Dataset IV: European Corn Borer Data
In this section, we compare the DEGW distribution’s fit to those of other rival models, including the DGIW, DIW, DBXII, DIR, DR, NB, DPa, and Poisson models. Table 19 displays the MLEs and St.Ers. The goodness-of-fit statistics are shown in Table 20. Based on Table 20, the DEGW provides the best fit against all competitive models, with −ℓ = 200.956, AICR = 407.912, CAICR = 408.119, = 2.07337, and P.V = 0.35463. Figure 13 gives the boxplot, Q-Q plot, and TTT plot. Figure 14 gives the FPMF, FSF, FHRF, and FCDF plots for the corn borer larvae data. Based on Table 19, we have mean = 1.44422, variance = 2.811083, skewness = 1.329544, kurtosis = 4.446335, and Disp-Ix = 1.946437.
Table 19.
The MLEs (and their corresponding St.Ers) for dataset IV.
Table 20.
The goodness-of-fit test statistics for comparing the competitive models for dataset IV.
Figure 13.
Boxplot, Q-Q plot, and TTT plot for dataset IV.
Figure 14.
The FPMF, FSF, FHRF, and FCDF plots for dataset IV.
8. Concluding Remarks
The discrete exponential generalized G (DEG-G) family is a new discrete variation of the exponential generalized family that we propose in this study. Numerous pertinent DEG-G family features—including the dispersion index, central and ordinary moments, cumulant-generating function, probability-generating function, and moment-generating function—were developed and studied, with numerical illustrations. After the new family was proposed, the DEGW model was introduced and studied in detail. Its skewness and kurtosis can each vary over a wide range, and its dispersion index can be less than, greater than, or equal to one; hence, the DEGW distribution may be helpful in modeling “under-dispersed”, “equi-dispersed”, or “over-dispersed” count data. The probability mass function of the DEGW distribution can be “asymmetric and right-skewed”, “asymmetric and left-skewed”, “symmetric”, “symmetric and bimodal”, “uniform”, or “right-skewed with a heavy tail”. The DEGW distribution’s failure rate can take the forms “increasing-constant”, “constant”, “monotonically decreasing”, “bathtub”, “monotonically increasing”, or “J-shaped”. The DEGW parameters were estimated via various techniques to determine the best estimator for each dataset. A thorough comparison of the various methodologies was conducted for both simulated and real-life data. Finally, four real-life datasets were analyzed, and the following results can be concluded:
- In modeling the asymmetric failure time count data (the data of 50 devices), it can be seen that the Bayesian method is the best method, with K–S = 0.14712 and P.V = 0.22927, followed by the MLE method, with K–S = 0.163038 and P.V = 0.15266. However, for this dataset, the discrete exponential generalized G family provides the best fit under the Weibull baseline, with −ℓ = 233.467, AICR = 472.933, CAICR = 473.455, K–S = 0.16304, and P.V = 0.15266.
- In modeling the asymmetric failure time count data (the data of 15 electronic components), the Anderson–Darling (left-tail second-order) method is the best method, with K–S = 0.09885 and P.V = 0.99855, followed by the Bayesian method, with K–S = 0.09937 and P.V = 0.99843. However, for this dataset, the discrete exponential generalized G family provides the best fit under the Weibull baseline, with −ℓ = 63.791, AICR = 133.581, CAICR = 135.763, K–S = 0.11998, and P.V = 0.98219.
- In modeling the asymmetric counts of kidney cysts, the Anderson–Darling (left-tail second-order) method is the best, with K–S = 0.09885 and P.V = 0.99855, followed by the Cramér–von Mises estimation method, with K–S = 0.28412 and P.V = 0.86757. However, for this dataset, the discrete exponential generalized G family provides the best fit under the Weibull baseline, with −ℓ = 167.047, AICR = 340.094, CAICR = 340.321, = 0.35698, and P.V = 0.83653.
- In modeling the asymmetric European corn borer larvae data, the L-moments method is the best, with K–S = 1.34090 and P.V = 0.51148, followed by the Cramér–von Mises estimation method, with K–S = 2.21687 and P.V = 0.36774. However, for this dataset, the discrete exponential generalized G family provides the best fit under the Weibull baseline, with −ℓ = 200.956, AICR = 407.912, CAICR = 408.119, = 2.07337, and P.V = 0.35463.
Discrete distributions still need more studies and applications, especially with regard to the statistical testing of hypotheses and validation, whether in the case of complete data or in the case of censored data. In this regard, the reader may find a guide in the works of Goual and Yousof [28], Yousof [29], Yadav et al. [30], Yadav et al. [31], and Mansour [32].
Author Contributions
M.S.E.: review and editing, validation, writing the original draft preparation, conceptualization, data curation, formal analysis, project administration, software; M.E.-M.: review and editing, methodology, conceptualization, software; H.M.Y.: review and editing, software, validation, writing the original draft preparation, conceptualization, supervision. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Deputyship for Research and Innovation, Ministry of Education, Saudi Arabia, grant number QU-IF-2-5-3-25110.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The four datasets are available in the paper.
Acknowledgments
The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education, Saudi Arabia for funding this research work through the project number (QU-IF-2-5-3-25110). The authors also thank Qassim University for technical support.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
| RV | Random variable |
| PMF | Probability mass function |
| CDF | Cumulative distribution function |
| EG-G | Exponential generalized G |
| DEG-G | Discrete exponential generalized G |
| exp-G | Exponential G |
| SF | Survival function |
| HRF | Hazard rate function |
| Disp-Ix | Dispersion index |
| MLEs | Maximum likelihood estimations |
| CVME | Cramér–von Mises estimation |
| OLSE | Ordinary least squares estimation |
| Bootst | Bootstrapping |
| KE | Kolmogorov estimation |
| WLSE | Weighted least squares estimation |
| AD2LE | Anderson–Darling method of left-tail second-order estimation |
| MSE | Mean square error |
| St.Ers | Standard errors |
| MCMC | Markov chain Monte Carlo |
| ℓ | Log-likelihood |
| AICR | Akaike information criterion |
| CAICR | Consistent Akaike information criterion |
| BIC | Bayesian information criterion |
| HQIC | Hannan–Quinn information criterion |
| K–S | Kolmogorov–Smirnov |
| P.V | p-value |
| P–P | Probability–probability |
| TTT | Total time on test |
| Q-Q | Quantile–quantile |
| FHRF | Fitted hazard rate function |
| FSF | Fitted survival function |
References
- Nakagawa, T.; Osaki, S. The discrete Weibull distribution. IEEE Trans. Reliab. 1975, 24, 300–301. [Google Scholar] [CrossRef]
- Roy, D. Discrete Rayleigh distribution. IEEE Trans. Relib. 2004, 53, 255–260. [Google Scholar] [CrossRef]
- Jazi, A.M.; Lai, D.C.; Alamatsaz, H.M. Inverse Weibull distribution and estimation of its parameters. Stat. Methodol. 2010, 7, 121–132. [Google Scholar] [CrossRef]
- Gomez-Déniz, E. Another generalization of the geometric distribution. Test 2010, 19, 399–415. [Google Scholar] [CrossRef]
- Hesterberg, T. Bootstrap, Wiley Interdisciplinary Reviews: Computational Statistics. WIREs Comp. Stat. 2011, 3, 497–526. [Google Scholar] [CrossRef]
- Nekoukhou, V.; Bidram, H. The exponentiated discrete Weibull distribution. Stat. Oper. Res. Trans. 2015, 39, 127–146. [Google Scholar]
- Hussain, T.; Aslam, M.; Ahmad, M. A two-parameter discrete Lindley distribution. Rev. Colomb. Estad. 2016, 39, 45–61. [Google Scholar] [CrossRef]
- Para, B.A.; Jan, T.R. On discrete three-parameter Burr type XII and discrete Lomax distributions and their applications to model count data from medical science. Biom. Biostat. Int. J. 2016, 4, 1–15. [Google Scholar]
- El-Morshedy, M.; Eliwa, M.S.; Altun, E. Discrete Burr-Hatke distribution with properties, Estimation Methods and Regression Model. IEEE Access 2020, 8, 74359–74370. [Google Scholar] [CrossRef]
- El-Morshedy, M.; Eliwa, M.S.; Nagy, H. A new two-parameter exponentiated discrete Lindley distribution: Properties, estimation and applications. J. Appl. Stat. 2020, 47, 354–375. [Google Scholar] [CrossRef]
- Yousof, H.M.; Chesneau, C.; Hamedani, G.G.; Ibrahim, M. A new discrete distribution: Properties, characterizations, modeling real count data, Bayesian and non-Bayesian estimations. Statistica 2021, 81, 135–162. [Google Scholar]
- Chesneau, C.; Yousof, H.M.; Hamedani, G.; Ibrahim, M. A New One-parameter Discrete Distribution: The Discrete Inverse Burr Distribution: Characterizations, Properties Applications, Bayesian and Non-Bayesian Estimations. Stat. Optim. Inf. Comput. 2022, 10, 352–371. [Google Scholar] [CrossRef]
- Bourguignon, M.; Silva, R.B.; Cordeiro, G.M. The Weibull-G family of probability distributions. J. Data Sci. 2014, 12, 53–68. [Google Scholar] [CrossRef]
- Aboraya, M.M.; Yousof, H.; Hamedani, G.G.; Ibrahim, M. A new family of discrete distributions with mathematical properties, characterizations, Bayesian and non-Bayesian estimation methods. Mathematics 2020, 8, 1648. [Google Scholar] [CrossRef]
- Ibrahim, M.; Ali, M.M.; Yousof, H.M. The discrete analog of the Weibull G family: Properties, different applications, Bayesian and non-Bayesian estimation methods. Ann. Data Sci. 2021, 1–38. [Google Scholar] [CrossRef]
- Steutel, F.W.; van Harn, K. Infinite Divisibility of Probability Distributions on the Real Line; Marcel Dekker: New York, NY, USA, 2004. [Google Scholar]
- Eliwa, M.S.; Alhussain, Z.A.; El-Morshedy, M. Discrete Gompertz-G family of distributions for over-and under-dispersed data with properties, estimation, and applications. Mathematics 2020, 8, 358. [Google Scholar] [CrossRef]
- Yousof, H.M.; Majumder, M.; Jahanshahi, S.M.A.; Ali, M.M.; Hamedani, G.G. A new Weibull class of distributions: Theory, characterizations and applications. J. Stat. Res. Iran JSRI 2018, 15, 45–82. [Google Scholar] [CrossRef]
- Kemp, A.W. Classes of discrete lifetime distributions. Commun. Stat. Theor. Methods. 2004, 33, 3069–3093. [Google Scholar] [CrossRef]
- Poisson, S.D. Probabilité des Jugements en Matiére Criminelle et en Matiére Civile, Précédées des RéGles Génerales du Calcul des Probabilitiés; Bachelier: Paris, France, 1837; pp. 206–207. [Google Scholar]
- Aguilar, G.A.; Moala, F.A.; Cordeiro, G.M. Zero-Truncated Poisson Exponentiated Gamma Distribution: Application and Estimation Methods. J. Stat. Theory Pract. 2019, 13, 1–20. [Google Scholar] [CrossRef]
- Efron, B. The bootstrap and modern statistics. J. Am. Stat. Assoc. 2000, 95, 1293–1296. [Google Scholar] [CrossRef]
- Bodhisuwan, W.; Sangpoom, S. The discrete weighted Lindley distribution. In Proceedings of the International Conference on Mathematics, Statistics, and Their Applications, Banda Aceh, Indonesia, 4–6 October 2016. [Google Scholar]
- Lawless, J.F. Statistical Models and Methods for Lifetime Data; Wiley: New York, NY, USA, 2003. [Google Scholar]
- Chan, S.; Riley, P.R.; Price, K.L.; McElduff, F.; Winyard, P.J. Corticosteroid-induced kidney dysmorphogenesis is associated with deregulated expression of known cystogenic molecules, as well as Indian hedgehog. Am. J. Physiol.-Ren. Physiol. 2009, 298, 346–356. [Google Scholar] [CrossRef] [PubMed]
- Bebbington, M.; Lai, C.D.; Wellington, M.; Zitikis, R. The discrete additive Weibull distribution: A bathtub-shaped hazard for discontinuous failure data. Reliab. Eng. Syst. Saf. 2012, 106, 37–44. [Google Scholar] [CrossRef]
- Dougherty, E.R. Probability and Statistics for the Engineering, Computing and Physical Sciences; Prentice Hall: Englewood Cliffs, NJ, USA, 1992; pp. 149–152. ISBN 0-13-711995-X. [Google Scholar]
- Goual, H.; Yousof, H.M. Validation of Burr XII inverse Rayleigh model via a modified chi-squared goodness-of-fit test. J. Appl. Stat. 2020, 47, 393–423. [Google Scholar] [CrossRef] [PubMed]
- Yousof, H.M.; Ali, M.M.; Goual, H.; Ibrahim, M. A new reciprocal Rayleigh extension: Properties, copulas, different methods of estimation and modified right censored test for validation. Stat. Transit. New Ser. 2021, 23, 1–23. [Google Scholar] [CrossRef]
- Yadav, A.S.; Goual, H.; Alotaibi, R.M.; Ali, M.M.; Yousof, H.M. Validation of the Topp-Leone-Lomax model via a modified Nikulin-Rao-Robson goodness-of-fit test with different methods of estimation. Symmetry 2020, 12, 57. [Google Scholar] [CrossRef]
- Yadav, A.S.; Shukla, S.; Goual, H.; Saha, M.; Yousof, H.M. Validation of xgamma exponential model via Nikulin-Rao-Robson goodness-of- fit test under complete and censored sample with different methods of estimation. Stat. Optim. Inf. Comput. 2020, 10, 457–483. [Google Scholar] [CrossRef]
- Mansour, M.; Rasekhi, M.; Ibrahim, M.; Aidi, K.; Yousof, H.M.; Elrazik, E.A. A New Parametric Life Distribution with Modified Bagdonavičius–Nikulin Goodness-of-Fit Test for Censored Validation, Properties, Applications, and Different Estimation Methods. Entropy 2020, 22, 592. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).