1. Introduction
The analysis and modeling of count data have received significant attention in recent decades, with particular emphasis on the development of discrete distributions. A widely used model for count data is the Poisson distribution (PD). A key condition for the PD is equal dispersion of the count data, that is, the variance equals the mean. Equal dispersion can be evaluated using a statistical measure known as the index of dispersion (ID), which quantifies the variability of a distribution relative to its mean. The definition of the ID is given below:
Definition 1. The index of dispersion of a distribution, denoted as ID, is defined as follows:
$$\mathrm{ID} = \frac{\mathrm{Var}(Y)}{\mu},$$
where Var(Y) and μ are the variance and the mean of Y, respectively. The ID implies the following:
If $\mathrm{ID} > 1$, then it is over-dispersion.
If $\mathrm{ID} < 1$, then it is under-dispersion.
If $\mathrm{ID} = 1$, then it is equal dispersion.
For more details, see [1].
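As an illustration of Definition 1, the following minimal sketch computes the ID of a sample and classifies it. The function names and the sample are hypothetical, and the population variance (division by n) is an assumption made here for simplicity.

```python
def index_of_dispersion(counts):
    """Return Var(Y) / mean(Y), using the population variance of the sample."""
    n = len(counts)
    mu = sum(counts) / n
    if mu == 0:
        raise ValueError("the ID is undefined when the mean is zero")
    var = sum((c - mu) ** 2 for c in counts) / n
    return var / mu


def classify_dispersion(idx, tol=1e-9):
    """Map an ID value to the three cases of Definition 1."""
    if idx > 1 + tol:
        return "over-dispersion"
    if idx < 1 - tol:
        return "under-dispersion"
    return "equal dispersion"


sample = [0, 0, 1, 1, 2, 8]  # hypothetical counts, for illustration only
idx = index_of_dispersion(sample)
print(idx, classify_dispersion(idx))
```

A single large count among many small ones is enough to push the ID well above 1, which is the typical situation in which the PD fits poorly.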
However, this condition is rarely observed in practice. Count data often show over-dispersion, which calls for modeling alternatives that are more flexible than the PD. The negative binomial distribution (NBD) is a frequently employed alternative for modeling count data, particularly over-dispersed data. The NBD is extensively employed in modeling diverse datasets from the biological, medical, and social sciences, accident statistics, economics, quality control, ecology, and so forth. The study [2] presents the NBD as a mixture of PDs in which the mean of the PD is a random variable following a gamma distribution. The probability mass function (pmf) of the NBD is as follows:
$$P(X = x) = \binom{r + x - 1}{x}\,\theta^{x}\,(1-\theta)^{r}, \quad x = 0, 1, 2, \ldots, \quad r > 0, \ 0 < \theta < 1.$$
For more details about the NBD and its properties, see [1].
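The over-dispersion of the NBD can be checked numerically. The sketch below assumes the parametrization in which the pmf is proportional to $\theta^{x}(1-\theta)^{r}$ times a binomial coefficient, so that the mean is $r\theta/(1-\theta)$ and the ID is $1/(1-\theta)$; the function names and the truncation point `x_max` are choices made for this example only.

```python
import math


def nbd_pmf(x, r, theta):
    """P(X = x) = C(r + x - 1, x) * theta**x * (1 - theta)**r (assumed form)."""
    log_p = (math.lgamma(r + x) - math.lgamma(r) - math.lgamma(x + 1)
             + x * math.log(theta) + r * math.log1p(-theta))
    return math.exp(log_p)


def nbd_mean_and_id(r, theta, x_max=2000):
    """Mean and index of dispersion from a truncated sum over 0..x_max."""
    probs = [nbd_pmf(x, r, theta) for x in range(x_max + 1)]
    mean = sum(x * p for x, p in enumerate(probs))
    second = sum(x * x * p for x, p in enumerate(probs))
    return mean, (second - mean ** 2) / mean


mean, idx = nbd_mean_and_id(r=3.0, theta=0.4)
# Compare with r * theta / (1 - theta) and 1 / (1 - theta)
print(mean, idx)
```

Under this parametrization the ID exceeds 1 for every admissible $\theta$, so the NBD can only represent over-dispersed data.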
Furthermore, count data frequently display an excess of zeros and heterogeneity in variance, making traditional statistical distributions insufficient for modeling purposes. Nevertheless, the NBD is frequently favored over the PD because it provides more flexibility in modeling over-dispersed data, as previously indicated. Many studies have developed alternative models to deal with excess zeros and variability in the dataset. For example, zero-inflated models, hurdle models, and finite mixture models have been proposed to deal with this issue more efficiently. In addition, mixing the PD or NBD with a lifetime distribution is frequently employed for the same purpose. For example, numerous studies have demonstrated that the mixed negative binomial distribution offers a superior fit for count data in comparison to the PD and NBD. Moreover, weighted distributions address the problem by multiplying count distributions by weight functions, as developed in [3,4]. Since then, the concept of weighted distributions has established itself in the literature as a powerful tool for modeling. By allowing probabilities to be adjusted according to weights assigned to each outcome, weighted distributions provide a flexible framework for representing and analyzing complex real-world phenomena. This adaptability has made weighted distributions valuable in fields such as statistics, biostatistics, biomedicine, ecology, survival data analysis, meta-analysis, and intervention data analysis. For discrete distributions, the weighted distribution is defined as follows:
Definition 2. Let X be a random variable with pmf $p(x)$ and let $w(x)$ be a non-negative weighting function such that $E[w(X)]$ exists and is finite. Then, the pmf defined as follows:
$$p_w(x) = \frac{w(x)\,p(x)}{E[w(X)]}$$
is a weighted distribution of $p(x)$. For more information on weighted distributions for discrete random variables, refer to [1]. For example, the modified negative binomial distribution in [5] can be viewed as a weighted geometric distribution.
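Definition 2 can be sketched in a few lines for a pmf with finite support. The helper name `weighted_pmf`, the parent pmf, and the weight $w(x) = x$ below are hypothetical and serve only as an illustration.

```python
import math


def weighted_pmf(pmf, weight):
    """Return w(x) p(x) / E[w(X)] for a pmf given as a {x: p(x)} dictionary."""
    expected_w = sum(weight(x) * p for x, p in pmf.items())
    if not (math.isfinite(expected_w) and expected_w > 0):
        raise ValueError("E[w(X)] must be finite and positive")
    return {x: weight(x) * p / expected_w for x, p in pmf.items()}


base = {0: 0.5, 1: 0.3, 2: 0.2}                # hypothetical parent pmf
size_biased = weighted_pmf(base, lambda x: x)  # weight w(x) = x
print(size_biased)
```

With the weight $w(x) = x$, the mass at zero vanishes and the remaining mass moves toward larger counts, which shows how the choice of weight reshapes the parent distribution.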
Although the standard distributions possess attractive characteristics, they often do not provide the best fit for real-world data that deviate from them. The source of the deviations can be either inflation in low counts or high dispersion. Hence, there is a need to develop new distributions that perform better in these situations. In this study, we employ the idea of conflation as a tool to cope with excess zeros, or low counts in general, and over-dispersed data. The concept of the conflation of probability distributions is presented in [6] and defined as follows:
Definition 3. If $p_1, \ldots, p_n$ are pmfs, then the corresponding conflated distribution is
$$\&(p_1, \ldots, p_n)(x) = \frac{\prod_{i=1}^{n} p_i(x)}{\sum_{y \in A} \prod_{i=1}^{n} p_i(y)}, \quad x \in A, \qquad (1)$$
where A is the intersection of the supports of all the distributions. In terms of random variables, if $X_1, \ldots, X_n$ are independent with pmfs $p_1, \ldots, p_n$, respectively, then the conflation is the conditional distribution of $X_1$ given $X_1 = X_2 = \cdots = X_n$.
Ref. [6] presented conflation as a method for consolidating data from several independent experiments, all of which were designed to measure the same unknown quantity. In other words, a conflated distribution inherits some properties from its components. For $n = 2$, Equation (1) can be viewed as a weighted distribution, where one mass function is the parent distribution and the other one is a weight function. In this sense, conflated distributions are weighted distributions. Hence, one may use conflation to model data with a high excess of low counts and over-dispersion by conflating a distribution that has a decreasing mass function with an over-dispersed distribution.
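The normalized-product form of conflation can be sketched as follows for pmfs with finite support. The function name `conflate` and the two example pmfs are hypothetical; the sketch assumes the pmfs are supplied as dictionaries that list only points of positive mass.

```python
def conflate(*pmfs):
    """Conflation of pmfs given as {x: p(x)} dictionaries (normalized product)."""
    support = set(pmfs[0])
    for pmf in pmfs[1:]:
        support &= set(pmf)            # A = intersection of the supports
    products = {}
    for x in support:
        prod = 1.0
        for pmf in pmfs:
            prod *= pmf[x]
        products[x] = prod
    total = sum(products.values())
    if total <= 0:
        raise ValueError("the pmfs share no point of positive mass")
    return {x: v / total for x, v in sorted(products.items())}


p1 = {0: 0.2, 1: 0.5, 2: 0.3}   # hypothetical pmf supported on {0, 1, 2}
p2 = {1: 0.6, 2: 0.4}           # hypothetical pmf supported on {1, 2}
print(conflate(p1, p2))
```

In this example the point 0 drops out because it lies outside the support of the second pmf, which mirrors the way a conflation with a distribution supported on the positive integers loses the value zero.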
As a result, we introduce a new distribution by combining the NBD and the logarithmic distribution (LD) into a single distribution that reflects the common information between them with minimal loss of information. The pmf of the LD is given by:
$$P(Z = z) = \frac{-\theta^{z}}{z \ln(1-\theta)}, \quad z = 1, 2, \ldots, \quad 0 < \theta < 1.$$
For more details about the logarithmic distribution and its properties, see [1].
The new distribution is called the conflation of negative binomial and logarithmic distributions (CNBLD). The LD has a decreasing pmf, so it can model data with a high frequency of low counts, while the NBD is over-dispersed; their conflation is therefore expected to handle data exhibiting both an excess of low counts and over-dispersion. The LD is not supported at zero, and the CNBLD inherits this property, which limits its application to positive count data. To overcome this problem, the study also presents two modifications of the CNBLD. The first modification shifts the CNBLD one position to the left, resulting in the shifted CNBLD, denoted as SCNBLD. The SCNBLD retains the flexibility of the CNBLD but extends its support to include zero. The second modification conflates a shifted logarithmic distribution with the NBD, resulting in the conflation of a negative binomial shift logarithmic distribution (CNBSLD). The CNBSLD also aims to combine the features of both distributions to provide flexibility and the ability to model a wider range of data.
The remainder of this paper is organized as follows: Section 2 presents the definitions and discusses the graphical representations of the proposed models; Section 3 describes some of the statistical properties of the proposed models, such as moments, log-concavity, index of dispersion, and likelihood ratio stochastic order; Section 4 discusses the estimation of the parameters using the method of moments and the maximum likelihood method and evaluates the accuracy of these estimates by a simulation study; Section 5 outlines the usefulness of the new distribution across several fields, showing its superior performance compared to the existing modified negative binomial distributions employed to fit similar data; finally, Section 6 presents a conclusion.
2. Conflation of Negative Binomial and Logarithmic Distributions
In this section, the conflation of negative binomial logarithmic distributions (CNBLD) and two modified versions of the CNBLD are introduced. The modified versions are named as follows: shifted conflation of negative binomial logarithmic distributions (SCNBLD) and conflation of negative binomial weighted by shift logarithmic distribution (CNBSLD). This section outlines the pmfs, the cumulative distribution functions (cdfs), which are denoted as F, the survival functions (sfs), which are denoted as S, and the hazard rate functions (h) of the CNBLD, SCNBLD, and CNBSLD.
Definition 4. The random variable Y is said to follow the CNBLD with parameters $r > 0$ and $0 < \theta < 1$ if its pmf is given as follows:
$$P(Y = y) = \frac{(r)_y\,\theta^{y}}{C(r,\theta)\; y\; y!}, \quad y = 1, 2, \ldots$$
Here, $C(r,\theta)$ is the normalizing constant that can be expressed as follows:
$$C(r,\theta) = \sum_{y=1}^{\infty} \frac{(r)_y\,\theta^{y}}{y\; y!} = r\theta\; {}_3F_2(1, 1, r+1;\, 2, 2;\, \theta),$$
where $(r)_y = r(r+1)\cdots(r+y-1)$ is the Pochhammer symbol, and ${}_3F_2$ is a generalized hypergeometric function; for more details, see [1,7]. The generalized hypergeometric function is available in popular software such as R, Mathematica, MATLAB, Python, and others. In this paper, we used the genhypergeo(.) function from the hypergeo package in R.
In comparison with (1), in terms of random variables, if X and Z are independent random variables with X following an NBD and Z following an LD, then $P(Y = y) = P(X = y \mid X = Z)$. In the special case when $r = 1$, the CNBLD reduces to the LD.
Remark 1. It should be noted that according to Definition 4, the CNBLD is a weighted negative binomial distribution with an LD as the weight function. In addition, the CNBLD can be considered as a weighted logarithmic distribution with an NBD as the weighting function.
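The weighted-distribution reading of Remark 1 can be explored numerically. The sketch below assumes that the CNBLD pmf is proportional to $(r)_y\,\theta^{y} / (y\, y!)$ on $y = 1, 2, \ldots$, which is the normalized product of an NBD mass function and an LD mass function; the normalizing constant is approximated by a truncated sum instead of a hypergeometric routine, and the names `cnbld_pmf`, `ld_pmf`, and `y_max` are hypothetical.

```python
import math


def cnbld_pmf(r, theta, y_max=2000):
    """CNBLD masses on y = 1..y_max, assuming p(y) proportional to (r)_y theta**y / (y * y!)."""
    weights = {}
    for y in range(1, y_max + 1):
        # (r)_y / y! is evaluated through log-gamma to avoid overflow
        log_w = (math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
                 + y * math.log(theta) - math.log(y))
        weights[y] = math.exp(log_w)
    const = sum(weights.values())      # truncated normalizing constant
    return {y: w / const for y, w in weights.items()}


def ld_pmf(y, theta):
    """Logarithmic distribution: -theta**y / (y * ln(1 - theta))."""
    return -theta ** y / (y * math.log1p(-theta))


pmf = cnbld_pmf(r=1.0, theta=0.5)
print(pmf[1], ld_pmf(1, 0.5))  # the two values agree when r = 1
```

The truncation is harmless for moderate $\theta$ because the terms decay geometrically; for $\theta$ close to 1 a larger `y_max` would be needed.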
Next, the pmfs of the CNBLD are visualized in Figure 1 for different values of the parameters $\theta$ (including $\theta = 0.5$) and r (including $r = 8$). The parameter $\theta$ has an impact on the dispersion of the distribution, while the parameter r is mainly responsible for the shape of the CNBLD. In general, the pmfs of the CNBLD are skewed to the right; however, as r and $\theta$ increase, the distribution becomes less skewed and more symmetric. On the other hand, for smaller values of $\theta$ and r, the pmf of the CNBLD is a decreasing function with a high probability for low y values. However, as $\theta$ and r increase, the function's behavior shifts, initially rising to a peak before decreasing. This change illustrates how larger parameter values introduce greater variability into the distribution.
The cdf, sf, and h of the CNBLD, respectively, are as follows:
Thus, according to the cdf, the sf can be given as follows:
Using the definition of the
, the
can be defined as follows:
Further, the h of the CNBLD can be calculated as follows:
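The cdf, sf, and hazard rate function can also be evaluated numerically from the pmf. The sketch below makes three assumptions that are not taken from the text: the CNBLD pmf is proportional to $(r)_y\,\theta^{y}/(y\,y!)$, the sf is $S(y) = 1 - F(y)$, and the hazard is the discrete hazard $P(Y = y)/P(Y \geq y)$. Other conventions for S and h exist, so the helper names and definitions here are illustrative only.

```python
import math


def cnbld_pmf(r, theta, y_max=2000):
    """Assumed CNBLD masses, proportional to (r)_y theta**y / (y * y!), y >= 1."""
    w = [math.exp(math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
                  + y * math.log(theta)) / y for y in range(1, y_max + 1)]
    total = sum(w)
    return [v / total for v in w]      # index 0 holds P(Y = 1)


def cdf_sf_hazard(r, theta, y, y_max=2000):
    """Return F(y), S(y) = 1 - F(y) and h(y) = P(Y = y) / P(Y >= y)."""
    p = cnbld_pmf(r, theta, y_max)
    cdf = sum(p[:y])                   # P(Y <= y)
    tail = sum(p[y - 1:])              # P(Y >= y), summed directly
    return cdf, 1.0 - cdf, p[y - 1] / tail


for y in (1, 2, 5):
    print(y, cdf_sf_hazard(r=3.0, theta=0.5, y=y))
```

Summing the tail directly, instead of computing it as one minus a cdf, keeps the hazard accurate when the remaining probability is very small.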
Most real-life count data have zero as a possible value. Therefore, the current study modifies the CNBLD in two ways so that zero belongs to the support. The first, more direct method shifts the CNBLD one position to the left. The second method conflates a shifted LD with the NBD. This leads to the following two definitions:
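Definition 5. The random variable Y is said to follow the SCNBLD with parameters $r > 0$ and $0 < \theta < 1$ if its pmf is given by the following:
$$P(Y = y) = \frac{(r)_{y+1}\,\theta^{y+1}}{C(r,\theta)\,(y+1)\,(y+1)!}, \quad y = 0, 1, 2, \ldots,$$
where $C(r,\theta) = \sum_{k=1}^{\infty} (r)_k\,\theta^{k} / (k\; k!)$ is the normalizing constant of the CNBLD.
Consequently, we obtain the cdf, the sf, and the h functions of the SCNBLD, respectively, as follows: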
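Definition 6. The random variable Y is said to follow the CNBSLD with parameters $r > 0$ and $0 < \theta < 1$ if its pmf is given as follows:
$$P(Y = y) = \frac{(r)_y\,\theta^{y}}{K(r,\theta)\,(y+1)\; y!}, \quad y = 0, 1, 2, \ldots,$$
where $K(r,\theta) = \sum_{k=0}^{\infty} (r)_k\,\theta^{k} / ((k+1)\; k!)$ is the normalizing constant.
Remark 2. Note that the CNBSLD is a shifted LD for $r = 1$ and a geometric distribution for $r = 2$, and hence the CNBSLD can be considered as an extension of the two distributions.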
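The two special cases in Remark 2 can be verified numerically. The sketch below assumes that the CNBSLD pmf is proportional to $(r)_y\,\theta^{y} / ((y+1)\, y!)$ on $y = 0, 1, \ldots$ and approximates the normalizing constant by a truncated sum; the function name and the value of $\theta$ are hypothetical.

```python
import math


def cnbsld_pmf(r, theta, y_max=2000):
    """Assumed CNBSLD masses, proportional to (r)_y theta**y / ((y + 1) * y!), y >= 0."""
    w = [math.exp(math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
                  + y * math.log(theta)) / (y + 1) for y in range(y_max + 1)]
    total = sum(w)                      # truncated normalizing constant
    return [v / total for v in w], total


theta = 0.5
pmf_r2, const_r2 = cnbsld_pmf(2.0, theta)
pmf_r1, const_r1 = cnbsld_pmf(1.0, theta)
geometric = [(1 - theta) * theta ** y for y in range(5)]
shifted_ld = [-theta ** (y + 1) / ((y + 1) * math.log1p(-theta)) for y in range(5)]
print(pmf_r2[:5], geometric)    # r = 2: geometric distribution
print(pmf_r1[:5], shifted_ld)   # r = 1: shifted logarithmic distribution
```

For $r = 2$ the factor $(r)_y / ((y+1)\,y!)$ equals 1 for every y, which is why only the geometric term $\theta^{y}$ survives.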
The following theorem can be used to derive the CNBSLD.
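Theorem 1. If X and Z are independent random variables following an NBD with parameters r and $\theta_1$ and a shifted logarithmic distribution with parameter $\theta_2$, then the conditional distribution of X given $X = Z$, that is, $P(X = y \mid X = Z)$ for $y = 0, 1, 2, \ldots$, is the CNBSLD with parameters r and $\theta$, where $\theta = \theta_1 \theta_2$.
Proof. The proof can be obtained directly by calculating the conditional probability. □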
The cdf, sf, and the h of the CNBSLD are, respectively:
Here, ${}_2F_1$ is the Gaussian hypergeometric function (see [1,7] for more information). It is possible to calculate the cdf of the CNBSLD as follows:
where
Thus, according to the cdf, the sf can be given as follows:
Using the definition of the
, the
can be defined as follows:
Further, the h of the CNBSLD can be calculated as follows:
A comparison between the SCNBLD, the CNBSLD, and the NBD can be made by looking at the pmfs for different values of $\theta$ and r. Figure 2 shows the pmfs for several values of r (including $r = 8$) and of $\theta$. All distributions are skewed to the right but tend toward symmetry for large $\theta$ and r. The differences between the distributions decrease significantly as $\theta$ and r increase, and the distributions behave more similarly. In all plots, the pmfs appear identical for relatively large y, where all distributions assign small probabilities; the value of y from which this happens depends on r. For example, Figure 2 contains a panel in which the pmfs of all distributions coincide beyond a certain value of y. In general, as the value of r increases, the value of y beyond which the pmfs coincide increases. The plots show that both r and $\theta$ have a clear influence on the behavior of the different distributions.
3. Some Statistical Properties
In this section, we examine several useful statistical properties of the CNBLD, SCNBLD, and CNBSLD. These include deriving the mean, variance, and probability generating functions for each distribution. In addition, we calculate the index of dispersion (ID) for the CNBLD, SCNBLD, and CNBSLD, which provides information about the variability relative to their means. Furthermore, we discuss the likelihood ratio stochastic order and log-concavity property for the new distributions. The likelihood ratio stochastic order study is extended to the NBD to provide a more comprehensive understanding of the relative behaviors and properties of these distributions.
3.1. Moments and Probability Generating Functions
The moments and the probability generating function of the CNBLD are presented below.
The mean, the variance, and the probability generating function for the CNBLD are given as follows:
The formulas above can be derived as follows:
For the variance, the second moment can be expressed as follows:
Hence, the variance can be shown to be as follows:
The form of the probability generating function follows directly from the pmf of the CNBLD.
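The moments and the probability generating function can be checked by direct summation. The sketch below assumes a CNBLD pmf proportional to $(r)_y\,\theta^{y}/(y\,y!)$ with normalizing constant $C(r,\theta)$. Under that assumption, the binomial series $\sum_{y \geq 1} (r)_y\,\theta^{y}/y! = (1-\theta)^{-r} - 1$ gives the mean $((1-\theta)^{-r} - 1)/C(r,\theta)$, and the probability generating function is $C(r, \theta s)/C(r,\theta)$. These expressions are derived here for illustration and are not quoted from the text; all function names are hypothetical.

```python
import math


def cnbld_weights(r, theta, y_max=2000):
    """Unnormalized masses (r)_y theta**y / (y * y!) for y = 1..y_max (assumed pmf form)."""
    return [math.exp(math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
                     + y * math.log(theta)) / y for y in range(1, y_max + 1)]


def cnbld_summary(r, theta):
    """Normalizing constant, mean and variance by direct (truncated) summation."""
    w = cnbld_weights(r, theta)
    const = sum(w)
    mean = sum(y * v for y, v in enumerate(w, start=1)) / const
    second = sum(y * y * v for y, v in enumerate(w, start=1)) / const
    return const, mean, second - mean ** 2


def cnbld_pgf(s, r, theta):
    """G(s) = C(r, theta * s) / C(r, theta), evaluated here for 0 < s <= 1."""
    return sum(cnbld_weights(r, theta * s)) / sum(cnbld_weights(r, theta))


r, theta = 3.0, 0.4
const, mean, var = cnbld_summary(r, theta)
closed_form_mean = ((1 - theta) ** (-r) - 1) / const
print(mean, closed_form_mean, var, cnbld_pgf(0.5, r, theta))
```

The agreement between the summed mean and the closed-form expression is a quick consistency check on both the pmf and the normalizing constant.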
The following introduces the moments and probability generating function related to the SCNBLD.
The results can be obtained from the fact that an SCNBLD random variable can be written as $Y - 1$, where Y follows a CNBLD.
Finally, the moments and probability generating function of the CNBSLD are as follows:
The mean, the variance, and the probability generating function for the CNBSLD can be given as follows:
They can be obtained as follows:
For the variance, we obtain the following:
The form of the probability generating function follows directly from the form of the pmf of the CNBSLD.
3.2. Index of Dispersion
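In this subsection, we introduce the IDs of the NBD, SCNBLD, and CNBSLD, which are denoted as $\mathrm{ID}_{\mathrm{NBD}}$, $\mathrm{ID}_{\mathrm{SCNBLD}}$, and $\mathrm{ID}_{\mathrm{CNBSLD}}$, respectively. Table 1 reports the three IDs for different values of $\theta$ and r. Since $\mathrm{ID}_{\mathrm{SCNBLD}}$ and $\mathrm{ID}_{\mathrm{CNBSLD}}$ have complicated mathematical formulas, the IDs are calculated for selected values of r and $\theta$.
$\mathrm{ID}_{\mathrm{NBD}}$ is given by $1/(1-\theta)$. This implies that as $\theta$ increases, $\mathrm{ID}_{\mathrm{NBD}}$ increases, indicating higher dispersion with higher $\theta$.
For SCNBLD and CNBSLD, as $\theta$ increases, $\mathrm{ID}_{\mathrm{SCNBLD}}$ and $\mathrm{ID}_{\mathrm{CNBSLD}}$ increase. This means that for a fixed r, the dispersion increases as $\theta$ increases.
For SCNBLD and CNBSLD, as r increases, $\mathrm{ID}_{\mathrm{SCNBLD}}$ and $\mathrm{ID}_{\mathrm{CNBSLD}}$ also increase. This suggests that for a fixed $\theta$, the dispersion increases as r increases.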
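Whether $\mathrm{ID}_{\mathrm{SCNBLD}}$ exceeds $\mathrm{ID}_{\mathrm{CNBSLD}}$ depends on the value of r (see Table 1); when it does, the SCNBLD is more suitable for data with greater dispersion. This suggests that the value of r determines the interchangeability of the two distributions.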
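The three indices can be computed numerically for any pair of parameter values. The sketch below assumes the pmf forms used for illustration throughout: the NBD proportional to $(r)_y\,\theta^{y}/y!$, the CNBSLD proportional to $(r)_y\,\theta^{y}/((y+1)\,y!)$, and the SCNBLD equal to the CNBLD masses $(r)_y\,\theta^{y}/(y\,y!)$, $y \geq 1$, moved one step to the left. The helper names, the truncation point, and the chosen parameter grid are hypothetical and do not reproduce Table 1.

```python
import math


def _id_from_weights(weights):
    """Index of dispersion of the pmf proportional to weights on 0, 1, 2, ..."""
    total = sum(weights)
    mean = sum(i * w for i, w in enumerate(weights)) / total
    second = sum(i * i * w for i, w in enumerate(weights)) / total
    return (second - mean ** 2) / mean


def _nb_kernel(r, y, theta):
    """(r)_y * theta**y / y!, computed on the log scale."""
    return math.exp(math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
                    + y * math.log(theta))


def dispersion_indices(r, theta, y_max=3000):
    """Return (ID_NBD, ID_SCNBLD, ID_CNBSLD) from truncated sums."""
    nbd = [_nb_kernel(r, y, theta) for y in range(y_max)]
    cnbsld = [w / (y + 1) for y, w in enumerate(nbd)]
    # SCNBLD: list position i carries the CNBLD mass at y = i + 1
    scnbld = [_nb_kernel(r, y, theta) / y for y in range(1, y_max + 1)]
    return _id_from_weights(nbd), _id_from_weights(scnbld), _id_from_weights(cnbsld)


for r in (1.0, 2.0, 5.0):
    for theta in (0.3, 0.6):
        print(r, theta, dispersion_indices(r, theta))
```

Under these assumed forms the SCNBLD and the CNBSLD coincide at $r = 1$, where both reduce to the shifted LD, so their IDs agree there and separate as r moves away from 1.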
3.3. Log-Concavity Property
Log-concave probability distributions are essential in various areas, including reliability theory, labor economics, monopoly theory, mechanism design theory, political science, and law. Refer to [8] for additional information.
Definition 7. A discrete random variable X with pmf $p(x)$ is log-concave if $p(x)^2 \geq p(x-1)\,p(x+1)$ for all x.
Theorem 2. The pmf of the CNBLD is log-concave for $r \geq 7$ and log-convex for $r \leq 2$.
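Proof. Write the pmf of the CNBLD as $p(y) \propto (r)_y\,\theta^{y}/(y\; y!)$. For $y \geq 2$, we have
$$\frac{p(y)^2}{p(y-1)\,p(y+1)} = \frac{(r)_y^2}{(r)_{y-1}\,(r)_{y+1}} \cdot \frac{(y-1)(y+1)\,(y-1)!\,(y+1)!}{y^2\,(y!)^2}.$$
Using the property that $(r)_{y+1} = (r+y)\,(r)_y$, we can obtain:
$$\frac{p(y)^2}{p(y-1)\,p(y+1)} = \frac{(r+y-1)\,(y-1)\,(y+1)^2}{(r+y)\; y^3}.$$
As a result, we observe that this ratio increases in r for every $y \geq 2$. Therefore, $r \geq 7$ implies that the ratio is at least $\frac{(y+6)(y-1)(y+1)^2}{(y+7)\,y^3}$ for any y, and this bound is at least 1 if and only if $5y^2 - 7y - 6 \geq 0$. Equivalently, this is true if and only if $y \geq 2$. Thus, $p(y)^2 \geq p(y-1)\,p(y+1)$ for all $y \geq 2$, which indicates that the pmf is log-concave when $r \geq 7$.
For $r \leq 2$, the ratio is at most $\frac{(y-1)(y+1)^3}{(y+2)\,y^3} < 1$, because $(y-1)(y+1)^3 = y^4 + 2y^3 - 2y - 1 < y^4 + 2y^3$, resulting in
$$p(y)^2 \leq p(y-1)\,p(y+1).$$
This completes the proof. □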
The CNBLD offers a flexible alternative to the NBD, with properties that depend on the parameter r. While the NBD is log-concave and unimodal for $r \geq 1$ and log-convex for $r \leq 1$, the CNBLD has similar properties, but with different transition points: it is log-concave and unimodal when $r \geq 7$ and log-convex when $r \leq 2$. Moreover, the transition from log-convex to log-concave in the CNBLD is gradual as r increases from 2 to 7, which widens the range of pmf shapes the model can take. This flexibility makes the CNBLD a useful alternative when the shape of the data is not matched well by the NBD.
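Remark 3. Using a similar argument, we can conclude that $p(y)^2 \geq p(y-1)\,p(y+1)$ if and only if $(r-2)\,y^2 - r\,y - (r-1) \geq 0$. Thus, for $2 < r < 7$, the pmf of the CNBLD is log-concave on the set $\{y : (r-2)\,y^2 - r\,y - (r-1) \geq 0\}$. The transition from log-convexity to log-concavity occurs gradually as r rises from 2 to 7.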
Remark 4. The SCNBLD is log-convex for $r \leq 2$ and log-concave for $r \geq 7$, in contrast to the NBD, which is log-convex for $r \leq 1$ and log-concave for $r \geq 1$. This is because log-concavity does not change with shifting.
Theorem 3. The CNBSLD is log-convex for $r \leq 2$ and log-concave for $r \geq 2$.
Proof. The conclusion is derived from the log-concavity of the NBD, which remains unchanged by truncation and shifting. □
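The log-concavity behavior of the CNBLD can be explored numerically through the ratio of successive masses. The sketch below assumes a CNBLD pmf proportional to $(r)_y\,\theta^{y}/(y\,y!)$, for which $p(y+1)/p(y) = \theta\, y\,(r+y)/(y+1)^2$; the pmf is log-concave exactly when this ratio is non-increasing in y, and $\theta$ cancels from the comparison. The function names, the checked range `y_max`, and the tolerance are hypothetical.

```python
def cnbld_step_ratio(y, r):
    """p(y + 1) / p(y) divided by theta for the assumed CNBLD pmf."""
    return y * (r + y) / (y + 1) ** 2


def concavity_kind(r, y_max=200, tol=1e-12):
    """Classify the pmf on y = 1..y_max as log-concave, log-convex or neither."""
    diffs = [cnbld_step_ratio(y + 1, r) - cnbld_step_ratio(y, r)
             for y in range(1, y_max)]
    if all(d <= tol for d in diffs):
        return "log-concave"
    if all(d >= -tol for d in diffs):
        return "log-convex"
    return "neither"


def first_concave_point(r, y_max=200, tol=1e-12):
    """Smallest y >= 2 with p(y)^2 >= p(y - 1) p(y + 1), or None within the range."""
    for y in range(2, y_max):
        if cnbld_step_ratio(y, r) <= cnbld_step_ratio(y - 1, r) + tol:
            return y
    return None


for r in (1.0, 2.0, 4.0, 7.0, 10.0):
    print(r, concavity_kind(r), first_concave_point(r))
```

For intermediate values such as $r = 4$ the check reports neither property on the whole support, with the log-concavity inequality holding only from some y onward, which is the gradual transition described in the remarks above.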
3.4. Likelihood Ratio Stochastic Order
The likelihood ratio stochastic ordering provides a powerful method for comparing distributions, regardless of whether they belong to the same family with different parameters or are of completely different types. We can determine the ordering relationship between random variables by analyzing the likelihood ratio, which gives us insights into their probabilistic behavior and trends. In this section, we discuss the likelihood ratio stochastic ordering for our new distributions. We also extend this discussion to compare the likelihood ratio stochastic ordering for our new distributions with the NBD.
First, we introduce the definition of the likelihood ratio stochastic order used in this subsection.
Definition 8. Let $Y_1$ and $Y_2$ be two discrete random variables with pmfs $p_1(y)$ and $p_2(y)$, respectively. We say that $Y_1$ is smaller than $Y_2$ in the likelihood ratio stochastic order (denoted by $Y_1 \leq_{lr} Y_2$) if the ratio $p_2(y)/p_1(y)$ is non-decreasing in y over the union of the supports of $Y_1$ and $Y_2$.
The likelihood ratio stochastic order is very strong; it implies the hazard rate stochastic order and other stochastic orders. For more details on the implications and applications of stochastic ordering, see [9]. In this subsection, we refer to the CNBLD with parameters $\theta$ and r as CNBLD($\theta$, r).
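Theorem 4. Let $Y_1$ and $Y_2$ be two random variables following CNBLD($\theta_1$, r) and CNBLD($\theta_2$, r), respectively. If $\theta_1 \leq \theta_2$, then $Y_1 \leq_{lr} Y_2$.
Proof. Let $p_i(y)$ be the pmf of the CNBLD($\theta_i$, r), $i = 1, 2$. Then, we obtain the following:
$$\frac{p_2(y)}{p_1(y)} = c \left(\frac{\theta_2}{\theta_1}\right)^{y}.$$
Here, $c > 0$ is the ratio of the two normalizing constants and does not depend on y. Since the ratio is non-decreasing in y if and only if $\theta_1 \leq \theta_2$, this implies $Y_1 \leq_{lr} Y_2$. □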
Remark 5. For the SCNBLD and the CNBSLD, the following implications hold:
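If $Y_1$ and $Y_2$ are two random variables following SCNBLD($\theta_1$, r) and SCNBLD($\theta_2$, r), respectively, such that $\theta_1 \leq \theta_2$, then $Y_1 \leq_{lr} Y_2$.
If $Y_1$ and $Y_2$ are two random variables following CNBSLD($\theta_1$, r) and CNBSLD($\theta_2$, r), respectively, such that $\theta_1 \leq \theta_2$, then $Y_1 \leq_{lr} Y_2$.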
Proof. The proof is similar to the proof of Theorem 4. □
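Theorem 5. Let $Y_1$ and $Y_2$ be two random variables following CNBLD(θ, $r_1$) and CNBLD(θ, $r_2$), respectively. If $r_1 \leq r_2$, then $Y_1 \leq_{lr} Y_2$.
Proof. Let $p_i(y)$ be the pmf of CNBLD(θ, $r_i$), $i = 1, 2$. Then, we obtain the following:
$$\frac{p_2(y)}{p_1(y)} = c\,\frac{(r_2)_y}{(r_1)_y}.$$
Here, $c > 0$ is the ratio of the two normalizing constants and does not depend on y. Moving from y to $y + 1$ multiplies the ratio by $(r_2 + y)/(r_1 + y)$, so the ratio is non-decreasing in y if and only if $r_1 \leq r_2$; this implies $Y_1 \leq_{lr} Y_2$. □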
Remark 6. For the SCNBLD and the CNBSLD, the following implications hold:
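If $Y_1$ and $Y_2$ are two random variables following SCNBLD(θ, $r_1$) and SCNBLD(θ, $r_2$), respectively, such that $r_1 \leq r_2$, then $Y_1 \leq_{lr} Y_2$.
If $Y_1$ and $Y_2$ are two random variables following CNBSLD(θ, $r_1$) and CNBSLD(θ, $r_2$), respectively, such that $r_1 \leq r_2$, then $Y_1 \leq_{lr} Y_2$.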
Proof. The proof is similar to the proof of Theorem 5. □
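The orderings in Theorems 4 and 5 can be checked numerically by testing whether the log likelihood ratio is non-decreasing. The sketch below assumes a CNBLD pmf proportional to $(r)_y\,\theta^{y}/(y\,y!)$ and works on a truncated range, which does not affect monotonicity because normalization only adds a constant to the log ratio. The function names and parameter values are hypothetical.

```python
import math


def cnbld_log_pmf(r, theta, y_max=60):
    """Log masses of the assumed CNBLD pmf on y = 1..y_max (normalized on that range)."""
    logs = [math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
            + y * math.log(theta) - math.log(y) for y in range(1, y_max + 1)]
    log_total = math.log(sum(math.exp(v) for v in logs))
    return [v - log_total for v in logs]


def is_lr_smaller(log_p1, log_p2, tol=1e-12):
    """True when log p2(y) - log p1(y) is non-decreasing in y, i.e. Y1 <=_lr Y2."""
    log_ratio = [b - a for a, b in zip(log_p1, log_p2)]
    return all(nxt >= cur - tol for cur, nxt in zip(log_ratio, log_ratio[1:]))


small = cnbld_log_pmf(r=3.0, theta=0.3)
print(is_lr_smaller(small, cnbld_log_pmf(r=3.0, theta=0.6)))  # theta grows, r fixed
print(is_lr_smaller(small, cnbld_log_pmf(r=5.0, theta=0.3)))  # r grows, theta fixed
print(is_lr_smaller(cnbld_log_pmf(r=3.0, theta=0.6), small))  # reversed order
```

Working with log masses keeps the comparison stable in the tail, where the individual probabilities are too small for their ratio to be computed reliably.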
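Corollary 1. Let $Y_{ij}$ be a random variable from CNBLD($\theta_i$, $r_j$). If $\theta_1 \leq \theta_2$ and $r_1 \leq r_2$, then we conclude from Theorems 4 and 5 the following:
$$Y_{11} \leq_{lr} Y_{21} \quad \text{and} \quad Y_{21} \leq_{lr} Y_{22}.$$
Hence, the following is given:
$$Y_{11} \leq_{lr} Y_{22}.$$
Theorem 6. Let $Y_1$, $Y_2$, and $Y_3$ be three random variables following SCNBLD, CNBSLD, and NBD, respectively. Then, $Y_1 \leq_{lr} Y_2 \leq_{lr} Y_3$.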
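Proof. Let $p_1$, $p_2$, and $p_3$ denote the pmfs of the SCNBLD, CNBSLD, and NBD random variables $Y_1$, $Y_2$, and $Y_3$, respectively. To prove that $Y_1 \leq_{lr} Y_2$, we examine the following ratio:
$$\frac{p_2(y)}{p_1(y)} = c_1\,\frac{y+1}{r+y},$$
where $c_1 > 0$ does not depend on y. We observe that the term $(y+1)/(r+y)$ is an increasing function of y when $r > 1$. Therefore, $Y_1 \leq_{lr} Y_2$.
Similarly, we need to examine the following ratio:
$$\frac{p_3(y)}{p_2(y)} = c_2\,(y+1).$$
Here, $c_2 > 0$ does not depend on y. We observe that the term $y + 1$ is an increasing function of y. Hence, $Y_2 \leq_{lr} Y_3$.
Since we have shown that $Y_1 \leq_{lr} Y_2$ and $Y_2 \leq_{lr} Y_3$, we conclude that $Y_1 \leq_{lr} Y_2 \leq_{lr} Y_3$. This means that, in the likelihood ratio stochastic order, $Y_3$ is stochastically larger than $Y_2$, and $Y_2$ is stochastically larger than $Y_1$. □
In various fields such as economics, insurance, and risk management, the likelihood ratio stochastic order supports risk analysis and decision-making by identifying which distributions are more or less likely to produce large values.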