Abstract
This paper discusses new approaches to parameter estimation for the gamma distribution based on representative points. In the first part, the existence and uniqueness of gamma mean squared error representative points (MSE-RPs) are discussed theoretically. In the second part, by comparing three types of representative points, we show that gamma MSE-RPs perform well in parameter estimation and simulation. The last part proposes a new Harrell–Davis sample standardization technique. Simulation studies reveal that the standardized samples can be used to improve estimation performance or to generate MSE-RPs. In addition, a real data analysis illustrates that the proposed technique yields efficient estimates for gamma parameters.
Keywords: parameter estimation; gamma distribution; representative points; mean squared error; quantile estimator
MSC: 62F10
1. Introduction
The term representative points (RPs) indicates a set of supporting points with corresponding probabilities, which can be used as the best approximation of a d-dimensional probability distribution. Representative points can be regarded as a discretization of a continuous distribution and are expected to retain as much information as possible. In the univariate case, let X be a population random variable with cumulative distribution function (cdf) $F(x)$. A discrete random variable Z is defined to approximate X, with probability mass function (pmf) $P(Z = z_i) = p_i$ given by a set of supporting points $z = \{z_1, \dots, z_k\}$ ($z_1 < \cdots < z_k$) with probabilities $p_1, \dots, p_k$, where $p_i > 0$ and $\sum_{i=1}^{k} p_i = 1$. In the literature, there are several approaches to choosing the supporting point set z. For example, a set of random samples from $F(x)$ can be viewed as a representative of the distribution; Fang and Wang [1] suggest generating representative points based on the number theoretic method. In 1957, Cox [2] proposed using the mean squared error (MSE) to measure the loss of information incurred by approximating X with Z, where
$$\mathrm{MSE}(z) = E\Big[\min_{1 \le i \le k} (X - z_i)^2\Big] = \int \min_{1 \le i \le k} (x - z_i)^2 \, dF(x). \quad (1)$$
The point set $z$ at which $\mathrm{MSE}(z)$ attains its minimum is called the mean squared error representative points (MSE-RPs) of $F(x)$. MSE-RPs have many good properties and have been applied in fields such as signal compression (Gersho and Gray [3]), numerical integration (Pagès [4,5]), simulation of stochastic differential equations (Gobet et al. [6]; El Amri et al. [7]), statistical simulation (Fang et al. [8], Fang et al. [9]) and clothing standard settings (Fang and He [10]; Flury [11]). To compute MSE-RPs for different distributions, several effective numerical methods have been proposed. The Fang–He algorithm (Fang and He [10]) calculates MSE-RPs by solving a system of non-linear equations; the Lloyd I algorithm (Lloyd [12]), the LBG algorithm (Linde et al. [13]) and the Competitive Learning Vector Quantization algorithm (Pagès [5]) obtain MSE-RPs by iterating over a long training sequence of data; Tarpey’s self-consistency algorithm (Tarpey [14]) applies the idea of the k-means algorithm to generating MSE-RPs; and Chakraborty et al. [15] provide an accelerated algorithm using Newton’s method. When the number of MSE-RPs (k) is large, obtaining MSE-RPs becomes computationally intensive. Fang and He [10] present some discussion on the optimum choice of k.
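The MSE criterion can be illustrated numerically. The following is a minimal sketch, not the authors' code: it approximates the expectation by an average over a simulated gamma sample, and the function name `empirical_mse` and the candidate point sets are hypothetical choices made for this illustration.

```python
import random

def empirical_mse(sample, points):
    """Average squared distance from each observation to its nearest point."""
    return sum(min((x - z) ** 2 for z in points) for x in sample) / len(sample)

random.seed(0)
# random.gammavariate takes a shape and a *scale*; scale = 1 / rate.
sample = [random.gammavariate(2.0, 1.0) for _ in range(5000)]

one_point = [sum(sample) / len(sample)]  # k = 1: the sample mean
two_points = [1.0, 3.5]                  # hypothetical k = 2 candidate set
print(empirical_mse(sample, one_point), empirical_mse(sample, two_points))
```

With a single point the loss is the sample variance; a reasonable second point lowers it, which is the sense in which more representative points retain more information.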
Recently, the properties of MSE-RPs for several distributions have been studied in detail, including the normal distribution (Fang et al. [8]), the mixed normal distribution (Fang et al. [9] and Li et al. [16]), the arcsine distribution (Jiang et al. [17]) and the exponential distribution (Xu et al. [18]). A general relationship between MSE-RPs and the population distribution can be found in the work of Fei [19] and Fang et al. [9]. The study of the gamma distribution’s MSE-RPs (gamma MSE-RPs) can be traced back to Fu [20], who discusses the existence of gamma MSE-RPs and establishes an algorithm for computing these points. Because the gamma distribution is one of the most important distributions in statistics and probability theory, it is worth taking a closer look at gamma MSE-RPs and discovering their merits. The innovations of this paper are as follows:
- New theoretical results prove the uniqueness of gamma MSE-RPs;
- Gamma MSE-RPs are found to outperform other types of representative points in parameter estimation;
- A new standardization technique is proposed to improve the estimation performance of random samples from the gamma distribution.
Our discussion focuses on these three perspectives. Section 2 provides preliminary knowledge of the gamma distribution and of different types of representative points so that readers can follow the content easily. Section 3 gives a theoretical discussion of the existence and uniqueness of gamma MSE-RPs and recommends an algorithm for generating them. Section 4 compares the performance of three typical types of gamma representative points in parameter estimation and simulation. The results demonstrate that gamma MSE-RPs have advantages over the other representative points in many scenarios. Section 5 introduces a new Harrell–Davis standardization technique. Simulation studies show that the standardized samples perform better than random samples in estimation and can be used to generate gamma MSE-RPs. Section 6 provides a real clinical data analysis and illustrates that the standardization technique yields efficient estimates for gamma parameters.
2. Preliminaries
2.1. The Gamma Distribution and Gamma MSE-RPs
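A gamma-distributed random variable with shape parameter a and rate parameter b is denoted $X \sim Ga(a, b)$. The corresponding probability density function (pdf) in the shape-rate parametrization is
$$f(x; a, b) = \frac{b^{a}}{\Gamma(a)}\, x^{a-1} e^{-bx}, \quad x > 0,\ a > 0,\ b > 0, \quad (2)$$
where $\Gamma(\cdot)$ is the gamma function. The mean, variance, skewness and kurtosis of X are
$$E(X) = \frac{a}{b}, \quad \mathrm{Var}(X) = \frac{a}{b^{2}}, \quad \mathrm{Skew}(X) = \frac{2}{\sqrt{a}}, \quad \mathrm{Kurt}(X) = 3 + \frac{6}{a},$$
respectively. Let $z = \{z_1, \dots, z_k\}$ ($z_1 < \cdots < z_k$) be a set of MSE-RPs for $Ga(a, b)$, and define the intervals
$$I_1 = \Big(0, \frac{z_1 + z_2}{2}\Big], \quad I_i = \Big(\frac{z_{i-1} + z_i}{2}, \frac{z_i + z_{i+1}}{2}\Big] \ (i = 2, \dots, k-1), \quad I_k = \Big(\frac{z_{k-1} + z_k}{2}, \infty\Big),$$
with the corresponding probabilities in these intervals as
$$p_i = \int_{I_i} f(x; a, b)\, dx, \quad i = 1, \dots, k. \quad (4)$$
Here $f(x; a, b)$ is the pdf in (2).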
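The interval probabilities attached to a point set can be computed directly from the gamma cdf. The sketch below is an illustration under stated assumptions, not the authors' implementation: the cells are taken to be bounded by the midpoints of adjacent points, and the cdf is evaluated with the standard power series for the regularized lower incomplete gamma function, which is adequate for moderate arguments. The names `gamma_cdf` and `interval_probabilities` are hypothetical.

```python
import math

def gamma_cdf(x, a, b):
    """cdf of Ga(a, b) (shape a, rate b) via the incomplete-gamma power series."""
    if x <= 0.0:
        return 0.0
    if math.isinf(x):
        return 1.0
    t = b * x
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-17 * total:
        term *= t / (a + n)
        total += term
        n += 1
    return min(1.0, total * math.exp(a * math.log(t) - t - math.lgamma(a)))

def interval_probabilities(points, a, b):
    """Probability of each cell bounded by midpoints of adjacent points."""
    z = sorted(points)
    bounds = [0.0] + [(u + v) / 2 for u, v in zip(z, z[1:])] + [math.inf]
    return [gamma_cdf(hi, a, b) - gamma_cdf(lo, a, b)
            for lo, hi in zip(bounds, bounds[1:])]

# Hypothetical three-point set for Ga(2, 1); the probabilities sum to one.
probs = interval_probabilities([0.5, 1.5, 3.0], a=2.0, b=1.0)
print(probs)
```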
2.2. Other Types of Representative Points
In addition to MSE-RPs, two other types of representative points are frequently discussed in the literature: Monte Carlo representative points and number theoretic representative points.
- (A) Monte Carlo representative points (MC-RPs)
MC-RPs are generated by the Monte Carlo method. Consider a random sample $x_1, \dots, x_k$ from the distribution function $F(x)$; this can be treated as a set of MC-RPs, in which each sample point serves as a supporting point with probability $1/k$.
- (B) Number theoretic representative points (NT-RPs)
NT-RPs are determined from the number theoretic method (Fang and Wang [1]). Given the one-dimensional interval $(0, 1)$, it is known that the point set $\{(2i-1)/(2k)\}$ ($i = 1, \dots, k$) is uniformly scattered on this interval. Based on the inverse transformation method, the points
$$z_i = F^{-1}\Big(\frac{2i-1}{2k}\Big), \quad i = 1, \dots, k,$$
are k NT-RPs of $F(x)$. The supporting point set is $\{z_1, \dots, z_k\}$, with probability $p_i = 1/k$ for each point.
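As a small worked case of the inverse transformation step, consider the exponential distribution, i.e., the gamma distribution with shape 1, whose inverse cdf has the closed form $-\ln(1-u)/b$. The sketch below assumes the usual midpoint levels $(2i-1)/(2k)$; the function name `nt_rps_exponential` is hypothetical, and a general shape parameter would need a numerical inverse of the gamma cdf instead.

```python
import math

def nt_rps_exponential(k, rate):
    """NT-RPs of Ga(1, rate): inverse cdf at the levels (2i - 1) / (2k)."""
    levels = [(2 * i - 1) / (2 * k) for i in range(1, k + 1)]
    points = [-math.log(1.0 - u) / rate for u in levels]
    probs = [1.0 / k] * k  # every NT-RP carries the same probability
    return points, probs

points, probs = nt_rps_exponential(k=5, rate=1.0)
print(points)
```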
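2.3. Harrell–Davis Quantile Estimator
In Harrell and Davis [21], a distribution-free quantile estimator is proposed, which consists of a linear combination of the order statistics and admits a jackknife variance estimator. Let $X_1, \dots, X_n$ denote a random sample of size n from $F(x)$; the pth quantile estimator is
$$Q_p = \frac{1}{\beta\{(n+1)p,\ (n+1)(1-p)\}} \int_0^1 F_n^{-1}(y)\, y^{(n+1)p-1} (1-y)^{(n+1)(1-p)-1}\, dy, \quad (6)$$
where $\beta(\cdot, \cdot)$ is the beta function and $F_n$ is the empirical distribution function. That is, $F_n(x) = n^{-1} \sum_{i=1}^{n} I(X_i \le x)$, and $I(A)$ is the indicator function of the set A. This method can be used for sample standardization. More details are discussed in Section 5.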
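The Harrell–Davis estimator is a weighted sum of the order statistics, where the weight of the ith order statistic is the probability that a Beta$((n+1)p, (n+1)(1-p))$ variable falls in $((i-1)/n, i/n]$. The following sketch is an illustration rather than the authors' code: it evaluates the regularized incomplete beta function with a textbook continued fraction (modified Lentz method), and the probability levels $(2i-1)/(2n)$ used in `hd_standardize` are an assumption made here by analogy with NT-RPs. All function names are hypothetical.

```python
import math

def _beta_cf(a, b, x, max_iter=300, eps=3e-16, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def reg_inc_beta(x, a, b):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b

def hd_quantile(sample, p):
    """Harrell-Davis estimate of the p-th quantile (0 < p < 1)."""
    xs = sorted(sample)
    n = len(xs)
    a, b = (n + 1) * p, (n + 1) * (1 - p)
    cdf = [reg_inc_beta(i / n, a, b) for i in range(n + 1)]
    return sum((cdf[i + 1] - cdf[i]) * xs[i] for i in range(n))

def hd_standardize(sample):
    """HD quantile estimates at n equally spaced levels.

    The levels (2i - 1) / (2n) are an assumption made for this sketch.
    """
    n = len(sample)
    return [hd_quantile(sample, (2 * i - 1) / (2 * n)) for i in range(1, n + 1)]

print(hd_standardize([0.8, 2.9, 1.4, 0.3, 5.1]))
```

Because the weights are non-negative and sum to one, every HD quantile estimate lies between the sample minimum and maximum, so the standardized sample is a smoothed version of the ordered data.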
3. The Existence and Uniqueness of Gamma MSE-RPs
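Let $X \sim Ga(a, b)$ be a random variable with pdf $f(x; a, b)$, and let $z = \{z_1, \dots, z_k\}$ be the supporting point set of X. Write $m_0 = 0$, $m_i = (z_i + z_{i+1})/2$ for $i = 1, \dots, k-1$, and $m_k = \infty$. To minimize $\mathrm{MSE}(z)$, taking the partial derivatives of (1) with respect to each $z_i$ and setting them to zero, we have
$$\int_{m_{i-1}}^{m_i} (x - z_i)\, f(x; a, b)\, dx = 0, \quad i = 1, \dots, k, \quad (7)$$
where $f(x; a, b)$ is the pdf of the gamma distribution (2). When $k = 1$, the system of Equation (7) has only one equation,
$$\int_0^{\infty} (x - z_1)\, f(x; a, b)\, dx = 0.$$
Obviously, it has one solution, $z_1 = E(X) = a/b$, which is the only representative point. When $k \ge 2$, MSE-RPs exist if the system of Equation (7) has a solution. After several transformations, which use the identity $x f(x; a, b) = (a/b) f(x; a+1, b)$, (7) becomes
$$z_i = \frac{a}{b} \cdot \frac{F(m_i; a+1, b) - F(m_{i-1}; a+1, b)}{F(m_i; a, b) - F(m_{i-1}; a, b)}, \quad i = 1, \dots, k, \quad (8)$$
where $F(\cdot; a, b)$ is the cdf of $Ga(a, b)$. Theorem 1 shows that the system of Equation (8) has a solution: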
Theorem 1.
- For given , a solution of the equation exists if and only if .
- For given , a solution of the equation exists when , where is the th representative point in the set of gamma MSE-RPs, which has .
- For a given , a solution of the equation exists.
Theorem 1 guarantees the existence of gamma MSE-RPs. Its proof is provided in Appendix A. For the special case $k = 2$, the existence follows from statements 1 and 3 in Theorem 1. Next, we show the uniqueness of gamma MSE-RPs in Theorem 2.
Theorem 2.
Suppose . For any , the set of gamma MSE-RPs is unique if .
The proof of Theorem 2 is provided in Appendix A. As a result, these two theorems guarantee the existence and uniqueness of gamma MSE-RPs. Furthermore, throughout this paper, gamma MSE-RPs are generated based on the self-consistency algorithm [22]. The details of this algorithm are provided in Appendix B.
4. Gamma MSE-RPs in Parameter Estimation and Simulation
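This section compares the performance of gamma MSE-RPs with that of other types of representative points, i.e., NT-RPs and MC-RPs, in terms of parameter estimation and simulation. Recall that the random variable $X \sim Ga(a, b)$ and Z is a discrete approximation of X. The mean, variance, skewness and kurtosis of Z are
$$E(Z) = \sum_{i=1}^{k} z_i p_i, \quad \mathrm{Var}(Z) = \sum_{i=1}^{k} \{z_i - E(Z)\}^2 p_i, \quad \mathrm{Skew}(Z) = \frac{\sum_{i=1}^{k} \{z_i - E(Z)\}^3 p_i}{\mathrm{Var}(Z)^{3/2}}, \quad \mathrm{Kurt}(Z) = \frac{\sum_{i=1}^{k} \{z_i - E(Z)\}^4 p_i}{\mathrm{Var}(Z)^{2}}.$$
By the method of moments, we have
$$\hat{a} = \frac{E(Z)^2}{\mathrm{Var}(Z)}, \quad \hat{b} = \frac{E(Z)}{\mathrm{Var}(Z)}, \quad (12)$$
which are the point estimators of a and b in $Ga(a, b)$. As Z is a discrete approximation of X, it is expected that the moments of Z and the estimates in (12) are close to the moments of X and to a and b, respectively. The following theorem shows some connections between gamma MSE-RPs and the corresponding $Ga(a, b)$.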
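The moment calculations for a discrete approximation are short enough to write out. This is a minimal sketch with hypothetical function names; it uses the shape-rate relations mean $= a/b$ and variance $= a/b^2$, and reports the non-excess kurtosis (fourth standardized moment).

```python
def discrete_moments(points, probs):
    """Mean, variance, skewness and kurtosis of a discrete distribution."""
    mean = sum(p * z for z, p in zip(points, probs))
    var = sum(p * (z - mean) ** 2 for z, p in zip(points, probs))
    skew = sum(p * (z - mean) ** 3 for z, p in zip(points, probs)) / var ** 1.5
    kurt = sum(p * (z - mean) ** 4 for z, p in zip(points, probs)) / var ** 2
    return mean, var, skew, kurt

def rp_moment_estimates(points, probs):
    """Method-of-moments estimates (a_hat, b_hat) from representative points."""
    mean, var, _, _ = discrete_moments(points, probs)
    return mean ** 2 / var, mean / var

# Hypothetical two-point approximation, used only to exercise the formulas.
print(rp_moment_estimates([1.0, 3.0], [0.5, 0.5]))
```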
Theorem 3.
Let with , is a set of gamma MSE-RPs of with corresponding probabilities in (4); then,
The proof of Theorem 3 is provided in Appendix A. Note that Theorem 3 holds not only for the gamma distribution but also for all continuous population distributions. Next, the moments and the estimates in (12) are calculated from MSE-RPs, NT-RPs, and MC-RPs of different gamma distributions. Three typical shapes of gamma distributions are chosen (monotone decreasing, right-skewed and bell-shaped; their pdfs are plotted in Figure 1), and the representative points are set to three sizes ($k = 5, 20, 100$). The first parts of Tables 1–3 summarize the results in different scenarios. The last line of each table presents the moments and parameters of the corresponding gamma distribution. For fixed k, the moments and estimates of MSE-RPs are closer to the true values than those of the other representative points. Moreover, the means of MSE-RPs are almost equal to the means of the corresponding gamma distributions in all scenarios; as k becomes large, the moments and estimates of MSE-RPs converge to the true values much faster than those of the other representative points. These results are consistent with Theorem 3.
Figure 1.
Probability density function for , and .
Table 1.
Summary of results from RPs of in parameter estimation.
Table 2.
Summary of results from RPs of in parameter estimation.
Table 3.
Summary of results from RPs of in parameter estimation.
Next, the comparison focuses on the estimation performance of samples drawn from representative points. We take samples from the three shapes of gamma distributions, as well as from their representative points with different sizes ($k = 5, 20, 100$). For a fixed sample size and a fixed number of repeated samplings in each scenario, the method of moments estimates and the maximum likelihood estimates of a and b are calculated. For each estimator, the average proportional deviation between the estimates and the true parameter is then computed over the repeated samples. The second parts of Tables 1–3 show that MSE-RPs samples have the smallest average proportional deviation in most of the selected scenarios. Tables A1 and A2 in Appendix C give the medians and 95% empirical confidence intervals of these estimates. In this simulation study, we observe that the point estimates of a and b from MSE-RPs samples generally have good accuracy with both the moment and maximum likelihood methods. Meanwhile, when k is large, the estimation performance of MSE-RPs samples is similar to that of samples from the corresponding gamma distribution. It is also worth mentioning that when k is small, the proportional deviations of the method of moments estimates are much smaller than those of the maximum likelihood estimates. That is, when the size of gamma MSE-RPs is small, it is better to estimate parameters using the method of moments.
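The repeated-sampling comparison can be sketched as follows. The exact form of the deviation measure is an assumption here: it is taken to be the mean of |estimate − true value| / true value over the repeated samples. For brevity the sketch draws directly from the gamma distribution and uses only method of moments estimates; sampling from a set of representative points would replace the draw with a weighted draw from the points. All names and settings are hypothetical.

```python
import random

def moment_estimates(sample):
    """Method-of-moments estimates (a_hat, b_hat) for Ga(a, b), rate form."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / n
    return mean ** 2 / var, mean / var

def average_proportional_deviation(estimates, true_value):
    """Mean of |estimate - true| / true over repeated samples (assumed form)."""
    return sum(abs(e - true_value) for e in estimates) / (len(estimates) * true_value)

def simulate_apd(a, b, n, n_rep, seed=0):
    """Average proportional deviations of a_hat and b_hat over n_rep samples."""
    rng = random.Random(seed)
    fits = [moment_estimates([rng.gammavariate(a, 1.0 / b) for _ in range(n)])
            for _ in range(n_rep)]
    return (average_proportional_deviation([f[0] for f in fits], a),
            average_proportional_deviation([f[1] for f in fits], b))

print(simulate_apd(a=2.0, b=1.0, n=200, n_rep=50))
```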
5. Generating MSE-RPs from Harrell–Davis Standardized Samples
This section discusses how to generate MSE-RPs from a gamma-distributed sample. A commonly used approach has two steps as follows:
- Calculate the maximum likelihood estimates (MLEs) for a and b, namely $\hat{a}$ and $\hat{b}$, based on the sample dataset;
- Generate MSE-RPs from the gamma distribution with the estimated parameters, i.e., $Ga(\hat{a}, \hat{b})$.
The representativeness of MSE-RPs depends on the estimates of the gamma parameters: more accurate estimates produce better representativeness. However, if a random sample does not represent the population well, the estimates may deviate substantially from the true parameters, and the MSE-RPs generated from them are then poor representatives of the population distribution. This usually occurs when the sample size is small or medium. Next, we introduce a new Harrell–Davis (HD) standardization technique that can reduce the effect of randomness in samples. This technique transforms a random sample into a set of HD quantile estimators and then treats these estimators as a new “sample”. Recall that a set of quantiles with equal probability is a set of NT-RPs for the population; a similar idea is utilized for sample standardization.
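Definition 1 (HD standardized sample). Let $x = \{x_1, \dots, x_n\}$ be a set of sample data from a gamma distribution; the set $y = \{y_1, \dots, y_n\}$ is called the HD standardized sample of x, where $y_i$ is the $p_i$th HD quantile estimator defined in (6), for equally spaced probability levels $p_i$ ($i = 1, \dots, n$).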
Note that the HD standardized sample is not a random sample because its elements are not independent. However, since the quantile estimators are equiprobable, the standardized sample is treated as an arbitrarily selected sample, which can be used to calculate MLEs for a and b. A new approach to generating MSE-RPs is proposed as follows:
- Obtain the HD standardized sample;
- Calculate the MLEs for a and b based on the HD standardized sample;
- Generate MSE-RPs from the gamma distribution with these estimated parameters.
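The maximum likelihood step in the list above has no closed form for the shape parameter. The sketch below is one standard way to carry it out, not necessarily the authors' procedure: the shape estimate solves $\ln a - \psi(a) = \ln \bar{x} - \overline{\ln x}$ by bisection, with the digamma function $\psi$ approximated by its recurrence and asymptotic expansion, and the rate estimate is $\hat{a}/\bar{x}$. The input can be a random sample or an HD standardized sample; the function names are hypothetical.

```python
import math
import random

def digamma(x):
    """Digamma function via recurrence plus an asymptotic expansion."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def gamma_mle(sample):
    """MLEs (a_hat, b_hat) of the shape and rate of a gamma distribution."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.log(mean) - sum(math.log(x) for x in sample) / n
    if s <= 0.0:
        raise ValueError("sample must contain at least two distinct values")

    def score(a):
        # Decreasing in a; its root is the shape MLE.
        return math.log(a) - digamma(a) - s

    lo, hi = 1e-6, 1.0
    while score(hi) > 0.0:  # expand the bracket until the root is enclosed
        hi *= 2.0
    for _ in range(100):    # plain bisection
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    a_hat = 0.5 * (lo + hi)
    return a_hat, a_hat / mean

random.seed(2)
demo = [random.gammavariate(3.0, 1.0 / 2.0) for _ in range(2000)]  # Ga(3, 2)
print(gamma_mle(demo))
```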
Next, a simulation study is provided to show the good performance of HD standardized samples in parameter estimation. Consider three gamma distributions and three different sample sizes; in each scenario, a number of random samples are generated and their HD standardized samples are obtained. The MLEs are calculated for each sample and each standardized sample and are summarized in Table 4. The means of the estimates from HD standardized samples are closer to the true values in most scenarios. Moreover, the estimates from HD standardized samples appear to have smaller standard deviations than those from random samples. Based on these results, we conclude that HD standardized samples outperform random samples in terms of estimation accuracy and stability. Therefore, it is recommended to use the new three-step approach to generate MSE-RPs. A comparison between the MSE-RPs generated from random samples and those generated from HD standardized samples is also provided. The estimates in Table 4 are used to generate gamma MSE-RPs, and Table 5 summarizes the results for a selected distribution, sample size and number of MSE-RPs. It shows that the moments of gamma MSE-RPs from HD standardized samples are close to the moments of the original gamma distribution. Meanwhile, the method of moments estimates in (12) are obtained. The estimates from HD standardized samples are more accurate than those from random samples. This conclusion holds generally under the conditions examined in the simulation.
Table 4.
Mean (Standard deviation) of MLEs from samples and HD standardized samples.
Table 5.
Summary of results for MSE-RPs from the estimated gamma distributions.
It is noteworthy that the HD standardization technique can also be applied in resampling. Consider another simulation study with the same settings as Table 4. We resample from each sample and each standardized sample and calculate the MLEs. The means and standard deviations of the resampled MLEs are summarized in Table 6. The estimates from standardized samples generally have better accuracy and smaller standard deviations when resampling.
Table 6.
Mean (Standard deviation) of resampled MLEs from samples and HD standardized samples.
6. Real Data Illustration
In this section, we consider a real-world dataset and illustrate the HD standardization technique proposed in the previous section. In this clinical study, 97 Swiss females aged 70–74 inclusive at the time of diagnosis of dementia (a form of mental disorder) were studied for survival times (in years) by Elandt-Johnson and Johnson [23]. These data were analyzed by Ozonur and Paul [24] using the likelihood ratio test and the score test, with p-values of 0.233 and 0.140, respectively, both greater than 0.05. Both tests suggest that the two-parameter gamma distribution adequately fits the dementia data.
Point estimates (MLEs) and bootstrap interval estimates [25] are calculated from the original sample data and from the corresponding HD standardized sample. The approximate $(1 - \alpha)$ bootstrap percentile interval is defined as
$$\big[\hat{\theta}^{*}_{(\alpha/2)},\ \hat{\theta}^{*}_{(1-\alpha/2)}\big],$$
where $\hat{\theta}^{*}_{(q)}$ denotes the empirical q-quantile of the bootstrap replications of the estimate.
In practice, we resample the original data 1000 times to obtain 1000 replications of the parameter estimate (i.e., $\hat{a}$ and $\hat{b}$ for the gamma distribution) with $\alpha = 0.05$. These estimates are sorted; the 25th value is used as the lower bound and the 975th value as the upper bound. MLEs and confidence intervals are computed from both the HD standardized sample and the original sample data; the confidence intervals based on the HD standardized sample are shorter than those based on the original sample data.
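The percentile interval described above amounts to sorting the bootstrap replications and reading off two order statistics. A minimal sketch follows; the function name, the seed handling and the use of the sample mean as the statistic are illustrative assumptions, and any estimator (such as a gamma MLE) can be passed in its place.

```python
import random

def bootstrap_percentile_interval(data, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Approximate (1 - alpha) bootstrap percentile interval for a statistic."""
    rng = random.Random(seed)
    n = len(data)
    reps = sorted(statistic([rng.choice(data) for _ in range(n)])
                  for _ in range(n_boot))
    # With n_boot = 1000 and alpha = 0.05 these are the 25th and 975th values.
    lower_rank = max(round(n_boot * alpha / 2), 1)
    upper_rank = max(round(n_boot * (1 - alpha / 2)), 1)
    return reps[lower_rank - 1], reps[upper_rank - 1]

def sample_mean(values):
    return sum(values) / len(values)

data = [float(v) for v in range(1, 101)]  # hypothetical data, not the dementia set
print(bootstrap_percentile_interval(data, sample_mean))
```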
7. Concluding Remarks
In the first part of this paper, the existence and uniqueness of gamma MSE-RPs are proved using two different approaches. An effective algorithm is recommended for the generation of gamma MSE-RPs. The second part of this paper compares gamma MSE-RPs with other representative points in terms of parameter estimation and simulation. The moments and estimates based on gamma MSE-RPs are the closest to the true values in different scenarios. In addition, samples from gamma MSE-RPs show good general estimation accuracy. The last part of this paper introduces the new HD standardization technique. When a gamma-distributed sample is at hand, we recommend first transforming it into the HD standardized sample and then using it to estimate gamma parameters or generate MSE-RPs.
In future work, we would like to study whether the MSE-RPs of other distributions can also perform well in parameter estimation. It would also be interesting to explain, through a theoretical demonstration, how the HD standardization technique reduces the randomness in samples.
Author Contributions
Conceptualization, X.K.; Methodology, S.W.; Validation, M.Z.; Supervision, H.Y. All authors have read and agreed to the published version of the manuscript.
Funding
This work was partially supported by the Guangdong Provincial Key Laboratory of Interdisciplinary Research and Application for Data Science, BNU-HKBU United International College (UIC), project code 2022B1212010006 and in part by Guangdong Higher Education Upgrading Plan (2021–2025) R0400001-22.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Acknowledgments
The authors thank the Editor, Associate Editor and referees for their constructive comments leading to significant improvement of this paper.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Proofs of Theorems
Proof of Theorem 1.
The proofs of the three points in this theorem are provided as follows. Without loss of generality, consider a gamma distribution with $b = 1$ (i.e., $Ga(a, 1)$) in all proofs.
Proof of point 1. Let
there is
Because and
Hence . In addition,
we have
Combine (A1) with the condition is continuous for and ; point 1 of Theorem 1 is proved.
Proof of point 3.
Let
therefore
Since , and
we have
Next, we show that for , is firstly monotone-increasing and then monotone-decreasing. Derive by to obtain
Let
we have
Note that in Equation (A4), , as long as
must exist that satisfies
Therefore, () satisfies
which means that is firstly monotone-increasing and then monotone-decreasing. In addition, we have
and . Thus, the function must cross the x-axis and the solution exists. One more step:
and in the neighborhood domain; furthermore,
We find that is a monotone-increasing function of .
Proof of point 2. Proving point 2 is complicated. Here, we provide the proof of a special case when . Let
thus:
Deriving by , we have
Let
we have
For , (A7) can be simplified to
set , we have
Therefore, crosses the x-axis twice for .
As and , combined with the facts and , we know that, for , is first monotone-increasing and then monotone-decreasing. In addition, and ; we conclude when . Next, consider
therefore, the solution exists if
We find that
is exactly (A2). From the analysis in the proof of point 3, we conclude that the solution exists when . □
Proof of Theorem 2.
Proof of Theorem 3.
Appendix B. Self-Consistency Algorithm for Generating Gamma MSE-RPs
The self-consistency algorithm [22] has the following steps:
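1. Let $z^{(0)} = \{z_1^{(0)}, \dots, z_k^{(0)}\}$ be the initial set.
2. Compute the conditional expectations using the system of equations
$$z_i^{(1)} = \frac{\int_{I_i^{(0)}} x f(x; a, b)\, dx}{\int_{I_i^{(0)}} f(x; a, b)\, dx}, \quad i = 1, \dots, k,$$
where the intervals $I_i^{(0)}$ are bounded by the midpoints of adjacent points of $z^{(0)}$, and compare the distance between $z_i^{(0)}$ and $z_i^{(1)}$ for each $i$. If the distances are not all smaller than the pre-defined error tolerance, proceed to the next step.
3. Repeat step 2 with the updated set in place of the initial set, obtaining the corresponding $z^{(2)}, z^{(3)}, \dots$, until convergence is reached.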
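The iteration above can be sketched for the gamma distribution as follows. This is an illustration under stated assumptions rather than the implementation of [22]: the conditional mean of each cell is evaluated with the identity $x f(x; a, b) = (a/b) f(x; a+1, b)$, the gamma cdf is computed with a power series that is adequate for moderate arguments and small k, the starting set is an arbitrary even spread around the mean, and convergence is declared when the largest change in any point falls below `tol`. All names are hypothetical.

```python
import math

def gamma_cdf(x, a, b):
    """cdf of Ga(a, b) (shape a, rate b) via the incomplete-gamma power series."""
    if x <= 0.0:
        return 0.0
    if math.isinf(x):
        return 1.0
    t = b * x
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-17 * total:
        term *= t / (a + n)
        total += term
        n += 1
    return min(1.0, total * math.exp(a * math.log(t) - t - math.lgamma(a)))

def gamma_mse_rps(k, a, b, tol=1e-10, max_iter=10000):
    """Fixed-point (Lloyd-type) iteration for k MSE-RPs of Ga(a, b)."""
    mean = a / b
    z = [mean * (2 * i + 1) / k for i in range(k)]  # assumed starting set
    probs = [1.0 / k] * k
    for _ in range(max_iter):
        # Cell boundaries are the midpoints of adjacent points.
        bounds = [0.0] + [(u + v) / 2 for u, v in zip(z, z[1:])] + [math.inf]
        f0 = [gamma_cdf(m, a, b) for m in bounds]
        f1 = [gamma_cdf(m, a + 1.0, b) for m in bounds]
        probs = [f0[i + 1] - f0[i] for i in range(k)]
        # Conditional mean of X on each cell.
        new_z = [mean * (f1[i + 1] - f1[i]) / probs[i] for i in range(k)]
        shift = max(abs(u - v) for u, v in zip(z, new_z))
        z = new_z
        if shift < tol:
            break
    return z, probs

points, probs = gamma_mse_rps(k=5, a=2.0, b=1.0)
print(points)
print(probs)
```

Each update replaces every point by the conditional mean of its cell, so the probability-weighted mean of the points equals $a/b$ after every iteration.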
Appendix C. Median Estimates and Confidence Intervals of a and b
Table A1.
Median estimates and confidence intervals of a and b (method of moments).
| k | RP | | | | | | |
|---|---|---|---|---|---|---|---|
| 5 | MSE | | | | | | |
| | NT | | | | | | |
| | MC | | | | | | |
| 20 | MSE | | | | | | |
| | NT | | | | | | |
| | MC | | | | | | |
| 100 | MSE | | | | | | |
| | NT | | | | | | |
| | MC | | | | | | |
Table A2.
Median estimates and confidence intervals of a and b (MLEs).
| k | RP | | | | | | |
|---|---|---|---|---|---|---|---|
| 5 | MSE | | | | | | |
| | NT | | | | | | |
| | MC | | | | | | |
| 20 | MSE | | | | | | |
| | NT | | | | | | |
| | MC | | | | | | |
| 100 | MSE | | | | | | |
| | NT | | | | | | |
| | MC | | | | | | |
References
1. Fang, K.T.; Wang, Y. Number-Theoretic Methods in Statistics; Chapman and Hall: London, UK, 1994.
2. Cox, D.R. Note on grouping. J. Am. Stat. Assoc. 1957, 52, 543–547.
3. Gersho, A.; Gray, R.M. Vector Quantization and Signal Compression; Springer Science & Business Media: New York, NY, USA, 2012; Volume 159.
4. Pagès, G. A space quantization method for numerical integration. J. Comput. Appl. Math. 1998, 89, 1–38.
5. Pagès, G. Introduction to vector quantization and its applications for numerics. ESAIM Proc. Surv. 2015, 48, 29–79.
6. Gobet, E.; Pagès, G.; Pham, H.; Printems, J. Discretization and simulation of the Zakai equation. SIAM J. Numer. Anal. 2006, 44, 2505–2538.
7. El Amri, M.R.; Helbert, C.; Lepreux, O.; Zuniga, M.M.; Prieur, C.; Sinoquet, D. Data-driven stochastic inversion via functional quantization. Stat. Comput. 2020, 30, 525–541.
8. Fang, K.T.; Zhou, M.; Wang, W.J. Applications of the representative points in statistical simulations. Sci. China Math. 2014, 57, 2609–2620.
9. Fang, K.T.; He, P.; Yang, J. Set of representative points of statistical distributions and their applications. Sci. Sin. Math. 2020, 50, 1–20.
10. Fang, K.T.; He, S.D. The problem of selecting a specified number of representative points from a normal population. Acta Math. Appl. Sin. 1984, 7, 293–306.
11. Flury, B.A. Principal points. Biometrika 1990, 77, 33–41.
12. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
13. Linde, Y.; Buzo, A.; Gray, R. An algorithm for vector quantizer design. IEEE Trans. Commun. 1980, 28, 84–95.
14. Tarpey, T. Self-consistency algorithms. J. Comput. Graph. Stat. 1999, 8, 889–905.
15. Chakraborty, S.; Roychowdhury, M.K.; Sifuentes, J. High precision numerical computation of principal points for univariate distributions. Sankhya B 2021, 83, 558–584.
16. Li, Y.N.; Fang, K.T.; He, P.; Peng, H. Representative points from a mixture of two normal distributions. Mathematics 2022, 10, 3952.
17. Jiang, J.J.; He, P.; Fang, K.T. An interesting property of the arcsine distribution and its applications. Stat. Probab. Lett. 2015, 105, 88–95.
18. Xu, L.H.; Fang, K.T.; He, P. Properties and generation of representative points of the exponential distribution. Stat. Pap. 2022, 63, 197–223.
19. Fei, R.C. Statistical relationship between the representative points and the population. J. Wuxi Inst. Light Ind. 1991, 10, 78–83.
20. Fu, H.H. The problem of selecting a specified number of representative points from a gamma population. J. Min. Sci. Technol. 1985, 107–116.
21. Harrell, F.E.; Davis, C.E. A new distribution-free quantile estimator. Biometrika 1982, 69, 635–640.
22. Stampfer, E.; Stadlober, E. Methods for estimating principal points. Commun. Stat.-Simul. Comput. 2002, 31, 261–277.
23. Elandt-Johnson, R.; Johnson, N. Survival Models and Data Analysis; Wiley Series in Probability and Statistics: New York, NY, USA, 1999.
24. Ozonur, D.; Paul, S. Goodness of fit tests of the two-parameter gamma distribution against the three-parameter generalized gamma distribution. Commun. Stat.-Simul. Comput. 2020, 51, 687–697.
25. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; Chapman and Hall/CRC: New York, NY, USA, 1994.
26. Trushkin, A. Sufficient conditions for uniqueness of a locally optimal quantizer for a class of convex error weighting functions. IEEE Trans. Inf. Theory 1982, 28, 187–198.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).