Abstract
In this article we introduce an extension of the Akash distribution. We use the slash methodology to make the kurtosis of the Akash distribution more flexible. We study the general probability density function of this new model, some properties, moments, skewness and kurtosis coefficients. Statistical inference is performed using the methods of moments and maximum likelihood via the EM algorithm. A simulation study is carried out to observe the behavior of the maximum likelihood estimator. An application to a real data set with high kurtosis is considered, where it is shown that the new distribution fits better than other extensions of the Akash distribution.
MSC:
62E10; 62F10
1. Introduction
The slash distribution is an extension of the normal distribution. It is defined as the ratio of two independent random variables: one following a standard normal distribution and the other following a power of the uniform distribution. Thus, a random variable S has a slash distribution if
$$S = \frac{Z}{U^{1/q}}, \qquad (1)$$
where $Z \sim N(0,1)$, $U \sim U(0,1)$, $Z$ is independent of $U$ and $q > 0$; its representation can be seen in Johnson et al. [1]. This distribution has heavier tails than the normal distribution, that is, a higher kurtosis. Its properties are studied in detail in Rogers and Tukey [2] and Mosteller and Tukey [3]. Kafadar [4] discusses maximum likelihood estimation of the location and scale parameters. Wang and Genton [5] present a multivariate version of the slash distribution as well as a multivariate skew version. The slash distribution is further extended by Gómez and Venegas [6] through the incorporation of the multivariate elliptical distributions. This methodology for increasing the weight of the tails has also been used in distributions with positive support. To name a few, we mention the works of Olmos et al. [7] in the half-normal and Rivera et al. [8] in the Rayleigh model. Based on the work of Rivera et al. [8], the scale mixture of Rayleigh (SMR) model is proposed. We say that with and if the probability density function (pdf) of Y is
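Also, a distribution needed in the development of this paper is the gamma distribution, whose pdf is given by
$$g(x; a, b) = \frac{b^{a}}{\Gamma(a)}\, x^{a-1} e^{-bx}, \quad x > 0, \qquad (3)$$
where $a, b > 0$ are the shape and rate parameters. Its corresponding cumulative distribution function (cdf) is denoted by
$$G(x; a, b) = \int_0^x g(t; a, b)\, dt. \qquad (4)$$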
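Shanker [9] introduced the Akash distribution and applied it to real lifetime data sets from medical science and engineering. We say that a random variable (r.v.) Y has an Akash model (AK) with shape parameter $\theta$ if its pdf is
$$f_Y(y; \theta) = \frac{\theta^3}{\theta^2 + 2}\,(1 + y^2)\, e^{-\theta y}, \quad y > 0, \qquad (5)$$
where $\theta > 0$ and we denote it by $Y \sim \mathrm{AK}(\theta)$. The parameter $\theta$ is a shape parameter, and if we add a scale parameter $\sigma$ the pdf is given by
$$f_Y(y; \sigma, \theta) = \frac{\theta^3}{\sigma(\theta^2 + 2)}\left(1 + \frac{y^2}{\sigma^2}\right) e^{-\theta y/\sigma}, \quad y > 0, \qquad (6)$$
where $\sigma > 0$ is a scale parameter and $\theta > 0$ is a shape parameter. We denote it by $Y \sim \mathrm{AK}(\sigma, \theta)$.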
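Extensions of the AK distribution are proposed by Shanker and Shukla [10,11], among others. Both extensions add one parameter, and we will compare them with the new distribution. The two-parameter Akash distribution (TPAD) introduced by Shanker and Shukla [10] has the following pdf:
$$f_Y(y; \theta, \alpha) = \frac{\theta^3}{\alpha\theta^2 + 2}\,(\alpha + y^2)\, e^{-\theta y}, \quad y > 0, \qquad (7)$$
where $\theta, \alpha > 0$ and we denote it by $Y \sim \mathrm{TPAD}(\theta, \alpha)$.
The power Akash distribution (PAD), introduced by Shanker and Shukla [11], has the following pdf:
$$f_Y(y; \theta, \alpha) = \frac{\alpha\theta^3}{\theta^2 + 2}\,(1 + y^{2\alpha})\, y^{\alpha - 1} e^{-\theta y^{\alpha}}, \quad y > 0, \qquad (8)$$
where $\theta, \alpha > 0$ and we denote it by $Y \sim \mathrm{PAD}(\theta, \alpha)$.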
The main motivation of this work is to introduce an extended version of the AK distribution given in Equation (6), making use of the slash methodology, in order to obtain a new distribution with greater kurtosis that is able to accommodate outliers. Pronounced fluctuations are very frequent in the data sets encountered in disciplines as diverse as economic and actuarial sciences and environmental and earth sciences, among others. Thus, heavy-tailed models are necessary for better modelling in the presence of extreme values. For example, the normal distribution does not perform well in modelling data sets with extreme observations, so we must resort to heavy-tailed distributions. In problems in which the r.v. involved has a high kurtosis, the probability that a rare event occurs can be highly underestimated if a model without heavy tails is used; a heavy-tailed model avoids this problem. In economics, practical examples of rare events are pandemics and the 2008–2009 financial crisis, to name a few. In geology, a rare event might be a mega earthquake or a sudden eruption of a volcano that has been dormant for centuries.
The paper is structured as follows: in Section 2 we deliver our proposal and present its properties. In Section 3, we perform inference using the method of moments and maximum likelihood via the EM algorithm and a simulation study is also carried out. In Section 4, we apply the distribution to a real data set and compare it with other extensions of the AK distribution. Finally, in Section 5, we provide the main conclusions.
2. New Density and Its Properties
In this section, we introduce the representation, density and properties of the new distribution.
2.1. Representation
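The representation of this new distribution is given by
$$X = \frac{Y}{Z}, \qquad (9)$$
where $Y \sim \mathrm{AK}(\theta)$, $Z \sim \mathrm{Beta}(q, 1)$, Y and Z are independent r.v.’s with $\theta > 0$ and $q > 0$. We name the distribution of X slash AK (SAK) and denote it by $X \sim \mathrm{SAK}(\theta, q)$.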
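The stochastic representation lends itself directly to simulation. The following is a minimal Python sketch, assuming that Equation (9) is the quotient $X = Y/Z$ with $Y \sim \mathrm{AK}(\theta)$ and $Z \sim \mathrm{Beta}(q, 1)$ independent. It also uses the fact that the AK pdf is a two-component mixture of an exponential and a gamma density. The function names `sample_akash` and `sample_sak` are illustrative and do not come from the paper.

```python
import random

def sample_akash(theta, rng):
    """Draw one AK(theta) variate using its two-component mixture form."""
    # theta^3/(theta^2+2) * (1+y^2) * exp(-theta*y) is a mixture of
    # Exp(rate=theta) with weight theta^2/(theta^2+2) and
    # Gamma(shape=3, rate=theta) with weight 2/(theta^2+2).
    if rng.random() < theta**2 / (theta**2 + 2.0):
        return rng.expovariate(theta)
    return rng.gammavariate(3.0, 1.0 / theta)  # gammavariate takes a scale, not a rate

def sample_sak(theta, q, n, seed=None):
    """Draw n variates X = Y / Z with Y ~ AK(theta), Z ~ Beta(q, 1) (assumed representation)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        y = sample_akash(theta, rng)
        # Inverse-cdf draw from Beta(q, 1): its cdf is z^q, so Z = U^(1/q).
        z = (1.0 - rng.random()) ** (1.0 / q)
        out.append(y / z)
    return out

xs = sample_sak(theta=1.0, q=5.0, n=50_000, seed=1)
# For q > 1 the sample mean should be close to E(Y) * q / (q - 1).
print(sum(xs) / len(xs))
```

For very small values of q the power `U ** (1 / q)` can underflow to zero, so a production sampler would need to guard that case.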
2.2. Density Function
The following proposition gives the pdf of the SAK distribution, which is obtained from the representation given in Equation (9).
Proposition 1.
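Let $X \sim \mathrm{SAK}(\theta, q)$. Then, the pdf of X is given by
$$f_X(x; \theta, q) = \frac{q\, \theta^{2-q}}{(\theta^2 + 2)\, x^{q+1}} \left[ \Gamma(q+1)\, G(\theta x; q+1, 1) + \frac{\Gamma(q+3)}{\theta^2}\, G(\theta x; q+3, 1) \right],$$
where $x > 0$ and $G(\cdot\,; a, b)$ is the cdf of the gamma distribution with shape $a$ and rate $b$ given in Equation (4).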
Proof.
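Using the representation given in Equation (9) and the transformation $X = Y/Z$, $V = Z$, whose Jacobian is $v$, the joint pdf of $(X, V)$ is
$$f_{X,V}(x, v) = \frac{q\, \theta^3}{\theta^2 + 2}\, v^{q} (1 + x^2 v^2)\, e^{-\theta x v}, \quad x > 0, \; 0 < v < 1.$$
Then, marginalizing with respect to V, we obtain the pdf of X:
$$f_X(x; \theta, q) = \frac{q\, \theta^3}{\theta^2 + 2} \int_0^1 v^{q} (1 + x^2 v^2)\, e^{-\theta x v}\, dv.$$
With the change of variable $t = \theta x v$ and using Equation (4), the result is obtained. □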
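As a numerical check on the density, the sketch below evaluates the SAK pdf in two ways: through a closed form written with the gamma cdf, and through direct quadrature of the mixing integral over (0, 1). Both are derived here under the assumption that $X = Y/Z$ with $Y \sim \mathrm{AK}(\theta)$ and $Z \sim \mathrm{Beta}(q, 1)$, so the closed form may differ in layout from the expression printed in the paper. The helper `reg_lower_gamma` is a simple series implementation of the gamma cdf with rate 1, written to keep the example free of external libraries.

```python
import math

def reg_lower_gamma(a, x):
    """Regularised lower incomplete gamma P(a, x), i.e. the Gamma(shape=a, rate=1) cdf."""
    if x <= 0.0:
        return 0.0
    if x > a + 40.0 * math.sqrt(a) + 40.0:
        return 1.0  # far in the upper tail; the remainder is below double precision
    # Series: P(a, x) = exp(-x) x^a * sum_{n>=0} x^n / Gamma(a + n + 1)
    term = math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
    total, n = term, 0
    while term > 1e-17 * total and n < 10_000:
        n += 1
        term *= x / (a + n)
        total += term
    return min(total, 1.0)

def sak_pdf(x, theta, q):
    """Closed-form SAK(theta, q) pdf under the assumed representation X = Y/Z."""
    if x <= 0.0:
        return 0.0
    t = theta * x
    bracket = (math.gamma(q + 1.0) * reg_lower_gamma(q + 1.0, t)
               + math.gamma(q + 3.0) / theta**2 * reg_lower_gamma(q + 3.0, t))
    return q * theta ** (2.0 - q) / ((theta**2 + 2.0) * x ** (q + 1.0)) * bracket

def sak_pdf_quadrature(x, theta, q, m=2000):
    """Same pdf by Simpson's rule on the integral over z in (0, 1); m must be even."""
    def integrand(z):
        return z**q * (1.0 + (x * z) ** 2) * math.exp(-theta * x * z)
    h = 1.0 / m
    s = integrand(0.0) + integrand(1.0)
    s += sum((4 if j % 2 else 2) * integrand(j * h) for j in range(1, m))
    return q * theta**3 / (theta**2 + 2.0) * s * h / 3.0

print(sak_pdf(0.7, 1.5, 2.0), sak_pdf_quadrature(0.7, 1.5, 2.0))
```

The quadrature version is less accurate for q < 1, where the integrand is not smooth at zero, so the closed form is the safer choice in that range.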
Observation 1.
As the parameter q decreases, Table 1 and Figure 1 illustrate that the weight of the right tail increases.
Table 1.
Tail comparisons of the AK and SAK distributions.
Figure 1.
Left side: examples of the SAK() (in black), SAK() (in blue), SAK() (in red). Right side: examples of the SAK() (in black), SAK() (in blue), SAK() (in red).
In particular, Table 1 compares the tail probabilities $P(X > x)$ of the AK and SAK distributions for different values of x.
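A tail comparison of this kind can be reproduced numerically. The sketch below is one possible way to do it, assuming $X = Y/Z$ with $Y \sim \mathrm{AK}(\theta)$ and $Z \sim \mathrm{Beta}(q, 1)$. Conditioning on Y gives $P(Z < Y/x \mid Y) = \min\{1, (Y/x)^q\}$, so that $P(X > x) = P(Y > x) + \int_0^x (y/x)^q f_Y(y)\, dy$. The parameter values in the loop are illustrative and are not those of Table 1.

```python
import math

def akash_pdf(y, theta):
    """pdf of AK(theta)."""
    return theta**3 / (theta**2 + 2.0) * (1.0 + y * y) * math.exp(-theta * y)

def akash_sf(y, theta):
    """P(Y > y) for Y ~ AK(theta)."""
    return (1.0 + theta * y * (theta * y + 2.0) / (theta**2 + 2.0)) * math.exp(-theta * y)

def sak_sf(x, theta, q, m=2000):
    """P(X > x) for X = Y/Z: S_Y(x) + int_0^x (y/x)^q f_Y(y) dy, by Simpson's rule (m even)."""
    if x <= 0.0:
        return 1.0
    h = x / m
    def g(y):
        return (y / x) ** q * akash_pdf(y, theta)
    s = g(0.0) + g(x) + sum((4 if j % 2 else 2) * g(j * h) for j in range(1, m))
    return akash_sf(x, theta) + s * h / 3.0

# Illustrative comparison: AK(1) against SAK(1, 5) and SAK(1, 1).
for x in (2.0, 5.0, 10.0, 20.0):
    print(x, akash_sf(x, 1.0), sak_sf(x, 1.0, 5.0), sak_sf(x, 1.0, 1.0))
```

In this sketch the SAK tail probability always exceeds the AK one, and the gap widens as q decreases, which is the behavior described in Observation 1.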
2.3. Properties
The following Proposition gives the cdf in closed form. It depends on G, which is the cdf of the gamma distribution given in Equation (4).
Proposition 2.
Proof.
The result is obtained from a direct application of the definition of a cdf. □
2.3.1. Reliability Analysis
The reliability function and the hazard function of the SAK distribution are provided in Corollary 1.
Corollary 1.
The reliability and hazard functions of the $\mathrm{SAK}(\theta, q)$ model are given by
- 1.
- 2.
where .
In Figure 2, we present the hazard function of the SAK model for several values of $\theta$ and q.
Figure 2.
Hazard function of the SAK() distribution (in black), SAK() distribution (in blue), SAK() distribution (in red).
2.3.2. Right Tail of the SAK Distribution
According to Rolski et al. [12], a distribution with cdf $F$ has a heavy right tail if
$$\limsup_{x \to \infty} \left( \frac{-\log(1 - F(x))}{x} \right) = 0.$$
The following result shows that the SAK distribution is heavy-tailed.
Proposition 3.
The r.v. $X \sim \mathrm{SAK}(\theta, q)$ is heavy-tailed.
Proof.
Applying L’Hospital’s rule twice we have,
□
The following Proposition shows that the SAK distribution can be represented as a scale mixture between the AK and Beta distributions.
Proposition 4.
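If $X \mid Z = z$ follows the AK distribution of Equation (6) with scale parameter $z^{-1}$ and shape parameter $\theta$, and $Z \sim \mathrm{Beta}(q, 1)$, then $X \sim \mathrm{SAK}(\theta, q)$.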
Proof.
The joint density of X and Z is given by
The marginal distribution of X is obtained as
□
The following proposition shows that the AK model is a limiting case of the SAK distribution for $q \to \infty$.
Proposition 5.
Let $X \sim \mathrm{SAK}(\theta, q)$ and $Y \sim \mathrm{AK}(\theta)$. If $q \to \infty$, then X converges in law to Y.
Proof.
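Using the representation $X = Y/Z$ we analyze the convergence of this quotient, where $Y \sim \mathrm{AK}(\theta)$ and $Z \sim \mathrm{Beta}(q, 1)$. In the $\mathrm{Beta}(q, 1)$ distribution we have $E(Z) = \frac{q}{q+1}$ and $Var(Z) = \frac{q}{(q+1)^2 (q+2)}$. Then, applying Chebyshev’s inequality for Z, we have, for every $\epsilon > 0$,
$$P(|Z - E(Z)| > \epsilon) \le \frac{Var(Z)}{\epsilon^2} = \frac{q}{\epsilon^2 (q+1)^2 (q+2)}. \qquad (12)$$
If $q \to \infty$ then the right-hand side of Equation (12) tends to zero, i.e., $Z - E(Z)$ converges in probability to 0. Also, $E(Z) \to 1$ as $q \to \infty$, so we have
$$Z \xrightarrow{P} 1, \quad q \to \infty.$$
As $Y \sim \mathrm{AK}(\theta)$, applying Slutsky’s Lemma to $X = Y/Z$, we obtain
$$X \xrightarrow{d} Y \sim \mathrm{AK}(\theta), \quad q \to \infty.$$
Thus, when q is large enough, X converges in law to an $\mathrm{AK}(\theta)$ distribution. □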
2.3.3. Moments
In this subsection we obtain the moments of the SAK distribution. To achieve this aim, we first introduce the following lemma.
Lemma 1.
Let with . For , exists if and only if and in this case
Proof.
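The r-th moment of the r.v. $Y \sim \mathrm{AK}(\theta)$ is given by Shanker [9], which is $E(Y^r) = \frac{r!\,(\theta^2 + (r+1)(r+2))}{\theta^r (\theta^2 + 2)}$. Then, calculating the r-th moment of $\sigma Y$, where $\sigma$ is a scale parameter, the result is obtained. □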
Proposition 6 presents the moments of the SAK distribution.
Proposition 6.
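Let $X \sim \mathrm{SAK}(\theta, q)$ with θ, $q > 0$. For $r = 1, 2, \ldots$ and $q > r$, $\mu_r = E(X^r)$ is given by
$$\mu_r = \frac{q\, r!\, (\theta^2 + (r+1)(r+2))}{(q - r)\, \theta^r (\theta^2 + 2)}.$$
Proof.
Using the representation given in Proposition 4 and Lemma 1, we obtain
$$E(X^r) = E[E(X^r \mid Z)] = \frac{r!\, (\theta^2 + (r+1)(r+2))}{\theta^r (\theta^2 + 2)} \int_0^1 q\, z^{q - r - 1}\, dz.$$
Solving the integral, which is finite only for $q > r$, gives the result. □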
From Proposition 6, we can obtain expressions for the non-central moments, , and the variance of , , which are presented in Corollary 2.
Corollary 2.
Let with θ and . The noncentral moments and the variance of X, , are obtained
where
Remark 1.
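Note that when $q \to \infty$, $Var(X) \to \frac{\theta^4 + 16\theta^2 + 12}{\theta^2 (\theta^2 + 2)^2}$, which is the variance of an $\mathrm{AK}(\theta)$ distribution.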
The next Corollary presents the skewness coefficient, , of a model.
Corollary 3.
Let , with and . Then the skewness coefficient of X is:
Proof.
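Recall that the skewness coefficient is
$$\frac{E[(X - E(X))^3]}{[Var(X)]^{3/2}} = \frac{\mu_3 - 3\mu_1\mu_2 + 2\mu_1^3}{(\mu_2 - \mu_1^2)^{3/2}},$$
where the non-central moments $\mu_1$, $\mu_2$ and $\mu_3$ were given in Corollary 2. □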
Also, the kurtosis coefficient, , of a distribution is given in the following Corollary.
Corollary 4.
Let with and . The kurtosis coefficient of X is
where , , and
Proof.
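Recall that the kurtosis coefficient is
$$\frac{E[(X - E(X))^4]}{[Var(X)]^{2}} = \frac{\mu_4 - 4\mu_1\mu_3 + 6\mu_1^2\mu_2 - 3\mu_1^4}{(\mu_2 - \mu_1^2)^{2}},$$
where the non-central moments $\mu_1$, $\mu_2$, $\mu_3$, and $\mu_4$ were given in Corollary 2. □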
Remark 2.
Note that the skewness coefficient of the model can be written as
From here, it is straightforward to check that
On the other hand, the kurtosis coefficient of the model can be written as
Therefore, it is simple to check that
Note that the skewness and kurtosis coefficients of the $\mathrm{SAK}(\theta, q)$ model coincide with those of the AK(θ) distribution as $q \to \infty$ (see Shanker [9]).
The results in Table 2 indicate that the skewness and kurtosis coefficients depend on both parameters $\theta$ and q. Moreover, as the value of q decreases, the skewness and kurtosis coefficients tend to increase. Conversely, as the value of q increases, the skewness and kurtosis coefficients approach those of the AK($\theta$) distribution (Proposition 5).
Table 2.
Skewness and kurtosis of the SAK distribution for various values of the shape parameters.
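The moment-based coefficients can be evaluated with a few lines of code. The sketch below assumes the product form $E(X^r) = E(Y^r)\, E(Z^{-r})$ that follows from $X = Y/Z$ with independent $Y \sim \mathrm{AK}(\theta)$ and $Z \sim \mathrm{Beta}(q, 1)$, together with Shanker's expression for the AK moments. The function names are illustrative, and the values printed are not taken from Table 2.

```python
import math

def sak_raw_moment(r, theta, q):
    """E(X^r) = E(Y^r) * q / (q - r) for integer r >= 1; infinite unless q > r."""
    if q <= r:
        return math.inf
    ak = math.factorial(r) * (theta**2 + (r + 1) * (r + 2)) / (theta**r * (theta**2 + 2.0))
    return q / (q - r) * ak

def sak_skewness_kurtosis(theta, q):
    """Skewness and kurtosis from the first four raw moments; requires q > 4."""
    if q <= 4:
        raise ValueError("the kurtosis coefficient needs q > 4")
    m1, m2, m3, m4 = (sak_raw_moment(r, theta, q) for r in (1, 2, 3, 4))
    var = m2 - m1**2
    skew = (m3 - 3.0 * m1 * m2 + 2.0 * m1**3) / var**1.5
    kurt = (m4 - 4.0 * m1 * m3 + 6.0 * m1**2 * m2 - 3.0 * m1**4) / var**2
    return skew, kurt

# Both coefficients shrink towards the AK(theta) values as q grows.
for q in (5.0, 10.0, 50.0, 1e6):
    print(q, sak_skewness_kurtosis(1.0, q))
```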
3. Inference
In this section, our focus is on examining the estimation of parameters using the method of moments and maximum likelihood (ML) through the EM algorithm. Additionally, we conduct simulations to analyze the effectiveness of ML estimators in situations with limited data samples.
3.1. Method of Moment Estimators
Let $X_1, \ldots, X_n$ be a random sample from $X \sim \mathrm{SAK}(\theta, q)$. Let $\overline{X}$ and $\overline{X^2}$ be the first two sample moments.
Proposition 7.
Given a random sample $X_1, \ldots, X_n$ from $X \sim \mathrm{SAK}(\theta, q)$ with $q > 2$, the method of moments estimators of θ and q are given by
where Equation (16) must be solved numerically to obtain $\hat{\theta}_M$. Then $\hat{\theta}_M$ is substituted into Equation (15) to obtain $\hat{q}_M$.
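The two-step procedure (a one-dimensional root search for θ followed by a closed-form value for q) can be sketched as follows. This sketch does not reproduce Equations (15) and (16) verbatim. It matches the first two moments implied by $X = Y/Z$, namely $E(X) = \frac{q}{q-1} E(Y)$ and $E(X^2) = \frac{q}{q-2} E(Y^2)$ with $Y \sim \mathrm{AK}(\theta)$, and uses plain bisection in place of a library root finder. All function names are illustrative.

```python
def _bisect(f, lo, hi, iters=200):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    flo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sak_moment_estimates(m1, m2):
    """Match E(X) and E(X^2) of SAK(theta, q); only solutions with q > 2 are sought."""
    def ak_mean(t):      # E(Y) for Y ~ AK(t)
        return (t**2 + 6.0) / (t * (t**2 + 2.0))
    def ak_second(t):    # E(Y^2) for Y ~ AK(t)
        return 2.0 * (t**2 + 12.0) / (t**2 * (t**2 + 2.0))
    def a(t):            # equals q / (q - 1) at the solution; increasing in t
        return m1 / ak_mean(t)
    # q > 2 is equivalent to 1 < a(theta) < 2, which brackets theta.
    t_lo = _bisect(lambda t: a(t) - 1.0, 1e-8, 1e8)
    t_hi = _bisect(lambda t: a(t) - 2.0, 1e-8, 1e8)
    def g(t):            # second-moment equation, using q / (q - 2) = a / (2 - a)
        return ak_second(t) * a(t) / (2.0 - a(t)) - m2
    eps = 1e-9 * (t_hi - t_lo)
    if g(t_lo + eps) > 0:
        raise ValueError("second sample moment too small for a SAK fit with q > 2")
    theta = _bisect(g, t_lo + eps, t_hi - eps)
    q = a(theta) / (a(theta) - 1.0)
    return theta, q

# Exact moments of SAK(1, 5) are 35/12 and 130/9, so this should return values near (1, 5).
print(sak_moment_estimates(35.0 / 12.0, 130.0 / 9.0))
```

With heavy-tailed data the second sample moment is unstable, so estimates obtained this way are best treated as starting values for likelihood-based fitting.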
3.2. ML Estimation
Let $X_1, \ldots, X_n$ be a random sample from $X \sim \mathrm{SAK}(\theta, q)$. Then the log-likelihood function is
where Taking partial derivatives in in relation to and q and equaling those equations to zero, we obtain
where , and Solving this system of equations to find the ML estimates numerically may be a difficult task due to the functions it involves. However, an EM algorithm can be implemented (see Dempster et al. [13]) to obtain the ML estimates. The following subsection is dedicated to achieving this goal.
3.3. EM Algorithm
A different way to represent the SAK model is provided through a stochastic approach.
where and , represent non-observable variables. This representation can be used for an alternative estimation procedure based on the EM algorithm (Dempster et al. [13]). In this context, the observed data are given by , where . The vectors and are the latent variables and the vector are the complete data. Note that the joint distribution of is given by
Therefore, up to a constant that does not depend on the vector of parameters , the complete log-likelihood function for the model is given by
With this, the expected value of , given the observed data, is
where and . Note that
where , is the cdf for the gamma model and TG denotes the gamma distribution with shape a and rate b truncated in the interval (0,1). Therefore, using properties of conditional expectations, we have and by (19) such expectations are simple to be computed. In a similar manner, we can compute , obtaining as results
Therefore, the kth iteration of the algorithm comprises the following steps:
- E-step: given and , for compute and using Equations (20) and (21), respectively.
- M1-step: update as
- M2-step: update as the solution for the non-linear equation
The E, M1 and M2 steps are iterated until convergence is achieved, that is, until the difference between the estimates obtained in two consecutive iterations is smaller than a predetermined value. Note that the M1-step is obtained in closed form, whereas the solution for $\theta$ in the M2-step can be obtained, for instance, with the uniroot function of R [14].
In the following subsection we run some simulations to study the behavior of the ML estimators.
3.4. Simulation Study
In this subsection, we conduct a brief simulation study using R software 4.3.2 to evaluate the performance of the ML estimators obtained through the EM algorithm for the SAK model. To generate the data, we consider three values for $\theta$ (0.5, 3, and 10), three values for q (0.5, 1, and 2), and five sample sizes (30, 50, 100, 200, and 500). For each combination of $\theta$, q, and n, we draw 1000 replicates and calculate the ML estimators. The initial value to start the EM algorithm is based on , the estimation of $\theta$ obtained from the AK model (with scale fixed at 1) and . In addition, for each replicate we estimate the standard errors based on the observed information matrix. Table 3 reports the estimated bias (bias), the mean of the standard errors (SE), the root of the empirical mean squared error (RMSE), and the coverage probability (CP) of the 95% confidence intervals based on the asymptotic distribution of the ML estimators. Table 3 shows that the performance of the estimators improves as the sample size n increases.
Table 3.
Estimated bias, SE, RMSE and CP of the ML estimators of the parameters of the SAK distribution for different sample sizes.
4. Application
In this section, we analyze a real data set and show, based on model selection criteria, that the SAK distribution can be more appropriate than other commonly used distributions for modelling these heavy right-tailed data. The data correspond to plasma beta-carotene levels (ng/mL) of 314 patients. This data set contains 14 variables and is available online at http://Lib.stat.cmu.edu/datasets/PlasmaRetinol (accessed on 31 October 2023). In this study, we consider the variable Betaplasma. The medical interest in this variable comes from the fact that low levels of plasma beta-carotene may be associated with a higher risk of developing certain types of cancer. In Table 4, we present some descriptive statistics, including the sample skewness and the sample kurtosis. We may observe high kurtosis in this data set.
Table 4.
Summary for betaplasma data.
The moment estimates for the parameters of the SAK distribution are and . These estimates are useful starting values, required to implement maximum likelihood estimation using numerical methods. The ML estimates for the parameters of the AK, TPAD, PAD, SMR, and SAK models are displayed in Table 5. For each distribution, we present the maximum of the log-likelihood function. It can be seen that the SAK model presented a larger value of log-likelihood than the other models.
Table 5.
ML estimates for AK, TPAD, PAD, SMR and SAK models (standard errors are in parentheses).
In order to compare the fit of the distributions, we considered the usual Akaike information criterion (AIC), introduced by Akaike [15], and the Bayesian information criterion (BIC), proposed by Schwarz [16]. It is known that AIC = $2k - 2\hat{\ell}$ and BIC = $k \log(n) - 2\hat{\ell}$, where k is the number of parameters in the model, n is the sample size and $\hat{\ell}$ is the maximized value of the log-likelihood function. Table 6 shows the AIC and BIC for each model, indicating that the SAK distribution leads to a better fit than the other distributions. Figure 3 presents the histogram for the data together with the fitted densities.
Table 6.
AIC and BIC criteria for fitted models.
Figure 3.
Betaplasma: histogram and fitted pdf for AK, TPAD, PAD, SMR and SAK distributions.
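For reference, the two information criteria used in the comparison can be computed from a maximized log-likelihood as in the short sketch below. The log-likelihood value in the usage lines is a made-up placeholder, not a result from Table 5.

```python
import math

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 * maximized log-likelihood."""
    return 2.0 * k - 2.0 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k * log(n) - 2 * maximized log-likelihood."""
    return k * math.log(n) - 2.0 * loglik

# Hypothetical two-parameter model fitted to n = 314 observations.
placeholder_loglik = -1900.0
print(aic(placeholder_loglik, k=2), bic(placeholder_loglik, k=2, n=314))
```

Smaller values indicate a better trade-off between fit and number of parameters, and BIC penalizes extra parameters more heavily than AIC whenever n exceeds 7.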
In our analysis, we also calculated the quantile residuals (QR). If the model is suitable for the data, the QR should approximately follow a standard normal distribution, as explained in Dunn and Smyth [17]. To check this assumption, conventional normality tests such as the Anderson-Darling (AD), Cramér-von Mises (CVM), and Shapiro-Wilk (SW) tests can be used. Figure 4 shows the quantile residuals of the PAD, SMR, and SAK distributions through a qqplot. The QR results for the AK and TPAD models are just as unsatisfactory as those of the PAD distribution. Considering the outcomes of all three tests, it seems that the SAK model offers a better fit for the data set.
Figure 4.
The qqplots of the quantile residuals for the fitted models and p-values of the AD, CVM and SW tests.
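For a continuous model, the quantile residual of an observation $x_i$ is $\Phi^{-1}(F(x_i; \hat{\theta}, \hat{q}))$, where $\Phi$ is the standard normal cdf. The sketch below shows one way to compute such residuals for the SAK model. It assumes $X = Y/Z$ with $Y \sim \mathrm{AK}(\theta)$ and $Z \sim \mathrm{Beta}(q, 1)$, which gives $F_X(x) = F_Y(x) - \int_0^x (y/x)^q f_Y(y)\, dy$, and evaluates the integral by Simpson's rule rather than through the gamma cdf. The function names are illustrative.

```python
import math
from statistics import NormalDist

def akash_pdf(y, theta):
    """pdf of AK(theta)."""
    return theta**3 / (theta**2 + 2.0) * (1.0 + y * y) * math.exp(-theta * y)

def akash_cdf(y, theta):
    """cdf of AK(theta)."""
    return 1.0 - (1.0 + theta * y * (theta * y + 2.0) / (theta**2 + 2.0)) * math.exp(-theta * y)

def sak_cdf(x, theta, q, m=2000):
    """cdf of X = Y/Z under the assumed representation; m must be even."""
    if x <= 0.0:
        return 0.0
    h = x / m
    def g(y):
        return (y / x) ** q * akash_pdf(y, theta)
    s = g(0.0) + g(x) + sum((4 if j % 2 else 2) * g(j * h) for j in range(1, m))
    return akash_cdf(x, theta) - s * h / 3.0

def quantile_residuals(xs, theta, q, m=2000):
    """Quantile residuals Phi^{-1}(F(x)) for positive observations xs."""
    normal = NormalDist()
    residuals = []
    for x in xs:
        p = sak_cdf(x, theta, q, m)
        p = min(max(p, 1e-12), 1.0 - 1e-12)   # keep inv_cdf inside its open domain
        residuals.append(normal.inv_cdf(p))
    return residuals

print(quantile_residuals([0.2, 1.0, 5.0, 40.0], theta=2.0, q=3.0))
```

The resulting residuals can then be passed to any normality test or plotted against normal quantiles, as done in Figure 4.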
5. Discussion
This paper presents an extended version of the AK model based on the slash methodology. Some properties of this new distribution are derived. It is also compared with two other distributions using a real data set. Estimation is performed through ML via the EM algorithm. The new SAK distribution is an alternative to fit heavy-tailed right-skewed data. Additional features of the SAK distribution are:
- The distribution has two stochastic representations, one of them based on the quotient of two independent r.v.’s and another based on a scale mixture between the AK and Beta distributions.
- The pdf, cdf and hazard function of the SAK distribution are explicit and are represented by the cdf of the gamma model.
- The proposed model has a heavy right tail.
- The model contains the AK distribution as a limit, that is, when the parameter q tends to infinity in the distribution SAK, the AK distribution is obtained.
- The moments and the skewness and kurtosis coefficients have explicit forms.
- In the application, observing the AIC and BIC and the AD, CVM and SW statistical tests, we may conclude that the SAK distribution fits the Betaplasma data set better than the PAD and SMR distributions, which are also extensions of the AK distribution.
Author Contributions
Conceptualization, H.W.G.; software, D.I.G.; methodology, Y.M.G., L.F.-L. and D.I.G.; formal analysis, L.F.-L. and H.W.G.; investigation, Y.M.G., D.I.G. and H.W.G.; writing—original draft preparation, Y.M.G., D.I.G. and H.W.G.; writing—review and editing, L.F.-L.; validation, Y.M.G. and L.F.-L.; resources, L.F.-L. All authors have read and agreed to the published version of the manuscript.
Funding
The research was partially funded by Proyecto Puente-UA and Universidad del Bío-Bío.
Data Availability Statement
Publicly available datasets were analyzed in this study. This data can be found here: [http://Lib.stat.cmu.edu/datasets/PlasmaRetinol].
Conflicts of Interest
The authors declare no conflict of interest.
References
- Johnson, N.L.; Kotz, S.; Balakrishnan, N. Continuous Univariate Distributions, 2nd ed.; Wiley: New York, NY, USA, 1995; Volume 1.
- Rogers, W.H.; Tukey, J.W. Understanding some long-tailed symmetrical distributions. Statist. Neerlandica 1972, 26, 211–226.
- Mosteller, F.; Tukey, J.W. Data Analysis and Regression; Addison-Wesley: Reading, MA, USA, 1977.
- Kafadar, K. A biweight approach to the one-sample problem. J. Am. Statist. Assoc. 1982, 77, 416–424.
- Wang, J.; Genton, M.G. The multivariate skew-slash distribution. J. Stat. Plan. Inference 2006, 136, 209–220.
- Gómez, H.W.; Venegas, O. Erratum to: A new family of slash-distributions with elliptical contours [Statist. Probab. Lett. 77 (2007) 717–725]. Stat. Probab. Lett. 2008, 78, 2273–2274.
- Olmos, N.M.; Varela, H.; Gómez, H.W.; Bolfarine, H. An extension of the half-normal distribution. Stat. Pap. 2012, 53, 875–886.
- Rivera, P.A.; Barranco-Chamorro, I.; Gallardo, D.I.; Gómez, H.W. Scale Mixture of Rayleigh Distribution. Mathematics 2020, 8, 1842.
- Shanker, R. Akash Distribution and Its Applications. Int. J. Probab. Stat. 2015, 4, 65–75.
- Shanker, R.; Shukla, K.K. On Two-Parameter Akash Distribution. Biom. Biostat. Int. J. 2017, 6, 00178.
- Shanker, R.; Shukla, K.K. Power Akash Distribution and Its Application. J. Appl. Quant. Methods 2017, 12, 1–10.
- Rolski, T.; Schmidli, H.; Schmidt, V.; Teugels, J. Stochastic Processes for Insurance and Finance; John Wiley & Sons: Hoboken, NJ, USA, 1999.
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. Ser. B 1977, 39, 1–38.
- R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2023.
- Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
- Schwarz, G. Estimating the dimension of a model. Ann. Statist. 1978, 6, 461–464.
- Dunn, P.K.; Smyth, G.K. Randomized Quantile Residuals. J. Comput. Graph. Stat. 1996, 5, 236–244.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).