Abstract
We present a new, more flexible class of distributions with support on the interval (0,1) that has heavier tails than other distributions on the same support, such as the Beta, Kumaraswamy, and Unitary Weibull distributions. The new class, which we call the Generalized Unitary Weibull distribution, is obtained as a transformation of two independent Weibull random variables. A particular case yields a two-parameter alternative to the Beta, Kumaraswamy, and Unitary Weibull distributions, which we call the type 2 Unitary Weibull distribution. We provide its probability density function, cumulative distribution function, survival function, hazard rate, and several properties needed for inference. A simulation study based on maximum likelihood estimation examines the behavior of the parameter estimates. Real data sets illustrate the flexibility of the new distribution in comparison with other distributions known in the literature. Finally, a quantile regression application shows that the proposed distribution fits better than competing distributions in that setting.
1. Introduction
There are various probability distributions with support on (0,1). One of the most used is the Beta distribution, a family of continuous probability distributions defined on the interval (0,1) with two positive shape parameters, normally denoted by α and β.
In Bayesian inference, the Beta distribution is generally used as the conjugate prior for the Bernoulli, binomial, negative binomial, and geometric distributions. For example, it can be used in Bayesian analysis to describe prior knowledge about a probability of success. It is also commonly used to model data in the form of percentages and proportions.
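The conjugacy mentioned above can be illustrated with a minimal sketch. It assumes a Beta prior on a success probability and Bernoulli (or binomial) observations; the function names and the example counts are illustrative and do not come from the paper.

```python
def beta_posterior(a, b, successes, failures):
    """Return the Beta posterior parameters after observing Bernoulli outcomes.

    With a Beta(a, b) prior on the success probability, the posterior is
    Beta(a + successes, b + failures).
    """
    return a + successes, b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical data: uniform prior Beta(1, 1), then 7 successes and 3 failures.
a_post, b_post = beta_posterior(1.0, 1.0, successes=7, failures=3)
print(a_post, b_post, beta_mean(a_post, b_post))
```

The update only adds counts to the prior parameters, which is why the Beta family is convenient for modeling proportions in this setting.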
The usual formulation of the Beta distribution is also known as the type I Beta distribution, whose density function is provided by:
where , are shape parameters, with . We denote this by writing .
A distribution similar to the Beta is the Kumaraswamy distribution [], but it is simpler in the sense that its cumulative distribution function and its quantiles have closed-form expressions, so simulations can be obtained by inverting the cumulative distribution function. Its density is defined by:
where , are the shape parameters, with . We denote this by writing .
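As a sketch of the simulation advantage described above, the following uses the standard closed forms of the Kumaraswamy cdf, F(x) = 1 - (1 - x^a)^b, and its inverse. The parameter names `a` and `b` are placeholders for the two shape parameters, since the paper's own symbols are not visible in this text.

```python
import random


def kw_cdf(x, a, b):
    """Kumaraswamy cdf on (0, 1) with shape parameters a, b > 0."""
    return 1.0 - (1.0 - x ** a) ** b


def kw_quantile(u, a, b):
    """Closed-form inverse of kw_cdf for u in [0, 1)."""
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)


def kw_sample(n, a, b, seed=0):
    """Draw n values by inverse-transform sampling."""
    rng = random.Random(seed)
    return [kw_quantile(rng.random(), a, b) for _ in range(n)]


draws = kw_sample(5, a=2.0, b=5.0)
print(draws)
```

No rejection step or special function is needed, which is the practical simplification over the Beta distribution.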
Mazucheli et al. [] introduce the Unitary Weibull distribution and present some inferential procedures for it. Mazucheli et al. [] further propose this unitary version of the Weibull distribution as an alternative to the Kumaraswamy distribution for modeling quantiles conditional on covariates. The stochastic representation of a Unitary Weibull distribution is provided by , with , denoted by , which has a density function provided by:
In this paper, the Generalized Unitary Weibull distribution of a random variable Y is presented based on a transformation of two independent random variables with distributions and , denoted by . In particular, we study the case and , which we call the type 2 Unitary Weibull distribution, denoted by , where , .
The article is organized as follows. In Section 2, we provide the stochastic representation and the pdf of a random variable with distribution, together with some properties and the distribution as a particular case. The cumulative distribution function (cdf), quantiles, reliability and hazard rate functions, moments, and skewness and kurtosis coefficients are also provided, along with further statistical properties and the canonical type 2 Unitary Weibull distribution. In Section 3, we discuss maximum likelihood estimation and study the parameter estimates through simulation. In Section 4, the , , , and distributions are fitted to real data sets and a quantile regression application is presented. In Section 5, a discussion and the main conclusions are presented.
2. The Generalized Unitary Weibull Family of Distributions
A random variable Y has a Generalized Unitary Weibull distribution, of parameters , , , and , denoted by , if its stochastic representation is provided by:
where and , , and are independent random variables. Its density function is presented below.
2.1. Density Function
Proposition 1.
Let then the density function of Y is:
where , , , α, and .
Proof.
Using the stochastic representation provided in (4), we have that:
are independent random variables and, using the Jacobian of the transformation, it follows that:
Hence,
Therefore,
where . □
Now, we provide some elementary properties.
Proposition 2.
Let then:
- 1.
- If then , where U denotes the uniform distribution in (0,1).
- 2.
- If , and then is symmetric.
- 3.
- If then .
Proof.
Let , whose density is represented in proposition 1.
- 1.
- The result is obtained by replacing in the distribution of Y then .
- 2.
- If and then:Then .
- 3.
- The result follows from plugging into the distribution of Y.
□
2.2. Density Function of the Unitary Weibull Distribution Type 2
Definition 1.
Setting and in (5), the density function of Y is provided by:
which we call the type 2 Unitary Weibull distribution, denoted by .
Figure 1 below shows the pdf of the distribution compared to the distribution. For some parameter values the two densities are very similar, while for others they differ considerably.
Figure 1.
pdf for and different values of .
Figure 2 shows the pdfs of the distribution for and different values of .
Figure 2.
pdf for and different values of .
Proposition 3.
Let . Then, the cdf of Y is provided by:
Proof.
Performing the change of variable and expanding the integral, we obtain the result. □
Corollary 1.
Let , then the quantile function of Y is provided by:
Proof.
Solving for t in provides the result. □
In Figure 3, we graphically illustrate the behavior of the Cumulative distribution function of the distribution for different values of and .
Figure 3.
Cdf of for different values and .
2.3. The Reliability, Hazard Rate Functions and Increasing Failure Rate
Two important measures of reliability are the reliability function and the hazard (failure) rate function. The reliability function of a random variable Y is defined by R(y) = 1 − F(y), where F denotes the cdf of Y. The hazard rate function is defined by h(y) = f(y)/R(y), where f denotes the pdf of Y. For the distribution , as a direct consequence of Proposition 3, both reliability measures can be expressed in closed form; the hazard rate is provided in Proposition 4.
Table 1 shows that the distribution captures outlying values better than the , , and distributions, since its reliability is higher.
Table 1.
Reliability function comparison for the , , , and distributions.
Proposition 4.
Let . Then, the hazard rate function of Y is provided by:
Proof.
Replacing and provides the result. □
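The two reliability measures follow mechanically from any pdf and cdf pair. The sketch below shows this with the Kumaraswamy distribution as a stand-in, because its closed forms are standard; it is not the proposed distribution, and the parameter names `a` and `b` are placeholders.

```python
def kw_pdf(y, a, b):
    """Kumaraswamy pdf on (0, 1), used here only as a stand-in model."""
    return a * b * y ** (a - 1.0) * (1.0 - y ** a) ** (b - 1.0)


def kw_cdf(y, a, b):
    """Kumaraswamy cdf on (0, 1)."""
    return 1.0 - (1.0 - y ** a) ** b


def reliability(cdf, y, *params):
    """R(y) = 1 - F(y)."""
    return 1.0 - cdf(y, *params)


def hazard(pdf, cdf, y, *params):
    """h(y) = f(y) / R(y)."""
    return pdf(y, *params) / reliability(cdf, y, *params)


# With a = b = 1 the stand-in reduces to the uniform law, where h(y) = 1 / (1 - y).
print(hazard(kw_pdf, kw_cdf, 0.5, 1.0, 1.0))
```

Passing the pdf and cdf as arguments keeps the helpers reusable for any unit-interval model once its density and distribution function are coded.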
Figure 4 shows the hazard rate function of the distribution for different values of and . Looking at the graphical representation, it is clear that it presents a wide variety of forms. Therefore, the new family of distributions is flexible enough to model real data sets.
Figure 4.
The hazard rate functions for the distribution.
Next, we consider the Increasing Failure Rate (IFR) property, which holds when the failure rate function provided in (15) is increasing; this is verified through the sign of its derivative.
Proposition 5.
Let Y have distribution . Then for any θ and the random variable Y has Increasing Failure Rate (IFR).
Proof.
The first derivative of h provided in (15) can be written as follows
It is clear that since , and , which implies the result. □
2.4. Moments
The following statement shows the moments for the distribution. Essentially, these moments are expressed as a numerical integral (the problem of obtaining a closed analytic expression remains open).
Definition 2.
Let . Hence, for we define:
Proposition 6.
Let then:
Proof.
□
In particular, for we have:
Remark 1.
This Proposition allows us to reaffirm that, for and any value of the parameter θ, the density is symmetric (case ).
Remark 2.
From definition 2, the skewness and kurtosis coefficients can be obtained through:
and
respectively, which do not present a closed expression, so they must be obtained using numerical methods.
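One way to carry out this numerical step is to integrate the raw moments and then apply the standard moment formulas for skewness and kurtosis. The sketch below uses a hand-written composite Simpson rule and checks it on the Beta(2,2) density 6y(1 − y), which is an assumption chosen only because its skewness (0) and kurtosis (15/7) are known; a density that is unbounded at 0 or 1 would need a more careful quadrature.

```python
def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0


def raw_moment(pdf, r):
    """E[Y^r] for a density supported on (0, 1)."""
    return simpson(lambda y: y ** r * pdf(y), 0.0, 1.0)


def skewness_kurtosis(pdf):
    """Skewness and (non-excess) kurtosis computed from the first four raw moments."""
    m1, m2, m3, m4 = (raw_moment(pdf, r) for r in (1, 2, 3, 4))
    var = m2 - m1 ** 2
    skew = (m3 - 3.0 * m1 * m2 + 2.0 * m1 ** 3) / var ** 1.5
    kurt = (m4 - 4.0 * m1 * m3 + 6.0 * m1 ** 2 * m2 - 3.0 * m1 ** 4) / var ** 2
    return skew, kurt


def beta22(y):
    """Stand-in test density: Beta(2, 2)."""
    return 6.0 * y * (1.0 - y)


print(skewness_kurtosis(beta22))
```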
Corollary 2.
Let , then:
Proof.
Figure 5 and Table 2 graphically and numerically show the behavior of the asymmetry and kurtosis coefficients of the distribution and are consistent with what is represented in corollary 2. That is, the value of the asymmetry coefficient, given a value of the parameter , is the same for as for , but with the opposite sign. For example: and . Similarly, the value of the kurtosis coefficient, given a value of the parameter , is the same for as it is for . For example: .
Figure 5.
Plots of the skewness (left) and kurtosis of the distribution (right).
Table 2.
Skewness and kurtosis values of the model with different values of and .
2.5. Some Statistical Properties
2.5.1. Entropy of
The entropy can be obtained using the density function of Y; specifically, the following expression is obtained:
Let Y be a random variable with distribution. Then, the entropy of Y is provided by:
Table 3 shows the entropy values of the distribution for different values of the parameters and .
Table 3.
Entropy values for the distribution () for different values of and .
2.5.2. Mean Residual Life
An important reliability quantity for positive random variables is the mean residual life, which is defined as .
For the case that , then the mean residual life of Y is obtained by replacing:
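For a variable supported on (0,1), the mean residual life at t is commonly written as E(Y − t | Y > t), which equals the integral of the reliability function from t to 1 divided by R(t). The sketch below evaluates that ratio numerically for any cdf; the uniform cdf is used as a check because its mean residual life is (1 − t)/2. The function names are illustrative.

```python
def simpson(f, lo, hi, n=1000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0


def mean_residual_life(cdf, t, upper=1.0):
    """Integral of R(y) over (t, upper), divided by R(t), for t below upper."""
    def surv(y):
        return 1.0 - cdf(y)
    return simpson(surv, t, upper) / surv(t)


# Check on the uniform distribution on (0, 1): the result should be (1 - t) / 2.
print(mean_residual_life(lambda y: y, 0.2))
```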
2.5.3. Incomplete Moments
The r-th incomplete moment of is defined as:
If , then the r-th incomplete moment of Y is provided by:
An interesting application of the first incomplete moment is that the mean deviation about the mean of Y can be directly obtained, specifically by means of the relation (see []):
where .
2.5.4. Lorenz Curve and the Gini Index
The Lorenz curve and Gini coefficient are tools used in the field of economics to measure income inequality in a society.
The Lorenz curve (see []), , can also be obtained from the quantile function of Y; specifically, the following closed-form expression is obtained:
where .
If , then the Lorenz curve is provided by:
where
The Gini index (see []) is the measure of inequality associated with the Lorenz curve. For the random variable Y, the Gini index is defined by:
In the next result, an analytical expression is provided for .
Proposition 7.
Let , then the Gini index of Y is provided by:
Figure 6 shows the Lorenz curve using the distribution for different values of the parameters and .
Figure 6.
Lorenz Curve for different values of and .
It can be observed that, as increases, inequality as measured by the Gini index decreases, and, as increases, inequality as measured by the Gini index increases.
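When a closed form is not at hand, both quantities can be approximated directly from a quantile function Q, using the usual definitions L(p) = (1/μ) ∫₀ᵖ Q(u) du and G = 1 − 2 ∫₀¹ L(p) dp. The sketch below is a generic numerical version under those definitions; it is checked on the uniform quantile function Q(u) = u, for which L(p) = p² and G = 1/3. A quantile function that diverges at an endpoint would need a different quadrature.

```python
def simpson(f, lo, hi, n=200):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0


def lorenz(quantile, p):
    """L(p): share of the total held by the lowest fraction p."""
    mean = simpson(quantile, 0.0, 1.0)
    return simpson(quantile, 0.0, p) / mean


def gini(quantile):
    """Gini index as one minus twice the area under the Lorenz curve."""
    return 1.0 - 2.0 * simpson(lambda p: lorenz(quantile, p), 0.0, 1.0)


def uniform_quantile(u):
    """Stand-in quantile function of the uniform distribution on (0, 1)."""
    return u


print(lorenz(uniform_quantile, 0.5), gini(uniform_quantile))
```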
2.6. Canonical Type 2 Unitary Weibull Distribution
Let , so that ; then the distribution of Y is called the canonical type 2 Unitary Weibull distribution, denoted by , and its density function has the following expression:
Its most important properties are:
- 1.
- The cdf of Y is provided by:
- 2.
- The quantile function of Y is:
- 3.
- The r-th moment of Y has the following expression: where . In particular, for we have:
- 4.
- The kurtosis coefficient is provided by the following expression. Figure 7 shows the behavior of the kurtosis of the canonical distribution for different values of .
- 5.
- The Lorenz curve of Y is:
- 6.
- The expression for the Gini index of Y is provided by: Figure 8 shows the Lorenz curve and the Gini index of the canonical distribution for different values of ; the Gini index increases with the parameter .
- 7.
- The entropy of Y is: Figure 9 shows the graph of the entropy of the canonical distribution for different values of .
Figure 7.
Kurtosis of the canonical distribution for different values of .
Figure 8.
Lorenz curve and Gini index of the canonical distribution for different values of .
Figure 9.
Graph of the entropy of the canonical distribution for different values of .
3. Inference
In this section, we discuss statistical inference for the parameters of the model .
3.1. Maximum Likelihood Estimate
We now discuss maximum likelihood estimation. Given a random sample from the distribution , the logarithm of the likelihood function can be written as:
Therefore, the maximum likelihood equations are provided by:
The solutions to the equations can be obtained using numerical procedures such as the Newton–Raphson procedure.
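The Newton–Raphson iteration can be sketched on a simpler model. The example below deliberately uses a one-parameter stand-in, the power-function density a·y^(a−1) on (0,1), and not the proposed distribution, because its score n/a + Σ log yᵢ and its closed-form MLE −n/Σ log yᵢ make the iteration easy to verify. All names are illustrative.

```python
import math
import random


def draw_power_sample(n, a, seed=0):
    """Inverse-transform draws from the stand-in density a * y**(a - 1) on (0, 1)."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], which keeps log(y) finite.
    return [(1.0 - rng.random()) ** (1.0 / a) for _ in range(n)]


def newton_mle_power(sample, a0=1.0, tol=1e-10, max_iter=100):
    """Solve the score equation n / a + sum(log y) = 0 by Newton-Raphson."""
    n = len(sample)
    s = sum(math.log(y) for y in sample)
    a = a0
    for _ in range(max_iter):
        score = n / a + s
        hessian = -n / a ** 2
        a_new = a - score / hessian
        if a_new <= 0.0:
            a_new = a / 2.0  # step left the parameter space: shrink instead
        if abs(a_new - a) < tol:
            return a_new
        a = a_new
    return a


sample = draw_power_sample(500, a=3.0, seed=7)
print(newton_mle_power(sample))
```

For a two-parameter model the same scheme applies with the score vector and the Hessian matrix in place of the scalar derivatives, and in practice a general-purpose optimizer is often used instead.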
3.2. Simulation Study
We use the Monte Carlo method to generate random numbers from the distribution .
Table 4 presents a simulation study of 1000 samples of size , and 200 for different values of the parameters and . Random values are first generated from , and substituting them into the quantile function for given and yields random values from the distribution . The table shows that, as the sample size increases, the parameter estimates approach the true parameter values, while the standard deviations and the average lengths of the confidence intervals decrease. This behavior agrees with the consistency of the estimators. Finally, the empirical coverage is close to the nominal 95% confidence level, as expected.
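The structure of such a study can be sketched as follows. The sketch again uses the one-parameter power-function density a·y^(a−1) on (0,1) as a stand-in for the proposed model, with its closed-form MLE and a Wald interval based on the Fisher information n/a²; the sample sizes, seed, and function names are assumptions made for illustration.

```python
import math
import random


def simulate_power(n, a, rng):
    """Inverse-transform draws: if U ~ U(0,1), then U**(1/a) has cdf y**a."""
    return [(1.0 - rng.random()) ** (1.0 / a) for _ in range(n)]


def mle_power(sample):
    """Closed-form MLE of the stand-in model: -n / sum(log y)."""
    return -len(sample) / sum(math.log(y) for y in sample)


def monte_carlo(a_true, n, reps=1000, seed=1):
    """Mean, SD, mean CI length, and empirical coverage of the 95% Wald interval."""
    rng = random.Random(seed)
    estimates, lengths, covered = [], [], 0
    for _ in range(reps):
        a_hat = mle_power(simulate_power(n, a_true, rng))
        se = a_hat / math.sqrt(n)  # from the Fisher information n / a**2
        lo, hi = a_hat - 1.96 * se, a_hat + 1.96 * se
        estimates.append(a_hat)
        lengths.append(hi - lo)
        covered += lo <= a_true <= hi
    mean = sum(estimates) / reps
    sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (reps - 1))
    return {"mean": mean, "sd": sd,
            "ci_length": sum(lengths) / reps, "coverage": covered / reps}


for n in (50, 200):
    print(n, monte_carlo(3.0, n))
```

Comparing the rows for increasing n gives the same kind of evidence described for Table 4: shrinking standard deviations and interval lengths, with coverage near the nominal level.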
Table 4.
Simulation of 1000 iterations of the model ).
4. Analysis of Real Data
4.1. Example 1: Application to Medical Data
In this example, we compute the MLEs of to fit the , , , and models to a real data set. The data can be found in Daniel's Biostatistics textbook (see [], p. 475) and correspond to a study carried out by Slemenda et al. [], which investigates the effects of lateral bone mineral density (LBMD) on spinal osteoarthritis in 66 women aged 34–87 years. Some descriptive statistics are shown in Table 5. Table 6 shows the MLEs for the models: , , , and . Using the Akaike information criterion (AIC) [], the Bayesian information criterion (BIC) [], the Kolmogorov–Smirnov (KS) test, and the approximate goodness-of-fit statistics of Chen and Balakrishnan [] (W* and A*), we see that model best fits the data. The advantage of the model is more evident for the more extreme observations; see Figure 10 (right). Figure 11 and Figure 12 show that the distribution fits the data better than the , , and distributions.
Table 5.
Summary statistics for the LBMD data set.
Table 6.
Parameter estimates for the , , , and distributions.
Figure 10.
Histogram for LBMD data with Densities (solid line), (dashed line), (dotted line), and (dashed dotted line) (left) and tails (right).
Figure 11.
QQ plots for the LBMD data set: (a), (b), (c), and (d).
Figure 12.
Comparison of cumulative distributions for the LBMD data set for (blue line), (red line), (green line), and (orange line).
Observing Table 6, we see that the AIC and BIC values of the proposed model are lower than those of its competitors; the statistics , A*, and W* likewise indicate the best fit of the distribution in comparison with the distributions , , and .
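For reference, the two information criteria used in this comparison are computed from the maximized log-likelihood ℓ, the number of parameters k, and the sample size n as AIC = 2k − 2ℓ and BIC = k log n − 2ℓ. The sketch below applies them to hypothetical log-likelihood values; the numbers and model labels are invented for illustration and are not taken from Table 6.

```python
import math


def aic(loglik, k):
    """Akaike information criterion: 2k - 2 * loglik."""
    return 2.0 * k - 2.0 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion: k * log(n) - 2 * loglik."""
    return k * math.log(n) - 2.0 * loglik


# Hypothetical maximized log-likelihoods for two-parameter models, n = 66.
fits = {"model_A": 61.2, "model_B": 58.9}
for name, loglik in fits.items():
    print(name, round(aic(loglik, 2), 2), round(bic(loglik, 2, 66), 2))
```

Lower values indicate the preferred model; when all candidates have the same number of parameters, as here, both criteria rank the models by log-likelihood alone.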
4.2. Example 2: An Application to Environment Data
In this section, we compute the MLEs of to fit the , , , and models to a real environmental data set. The data can be found at https://dga.mop.gob.cl/servicioshidrometeorologicos/Paginas/default.aspx (accessed on 1 December 2022) and correspond to the fluviometric and meteorological data recorded at monitoring stations from Arica to Tierra del Fuego. The site also provides official statistical reports on hydrometeorological variables and water quality obtained from the National Hydrometric Network; the data analyzed here are the percentage of dissolved oxygen in a lake. Some descriptive statistics are shown in Table 7. Table 8 shows the MLEs for the models: , , , and . From the Akaike (AIC) and Bayesian (BIC) information criteria, we see that the model best fits the data. Figure 13 shows that the model fits the data better than the , , and models.
Table 7.
Summary statistics for environment data set of the percentage of dissolved oxygen.
Table 8.
Parameter estimates for the distributions , , , and .
Figure 13.
Histogram for percent dissolved oxygen data (left) with densities of (solid line), (dashed line), (dotted line), and (dashed line) and tails (right).
The QQ plots of the data for the distribution and for the , , and distributions, each fitted with the maximum likelihood estimates of its parameters, are shown in Figure 14.
Figure 14.
QQ plots for the data set: (a), (b), (c), and (d).
4.3. Example 3: An Application to Quantile Regression
4.3.1. One-Dimensional Quantile Regression
Translating this concept of quantile to the regression line, we obtain the linear quantile regression (see []). If we assume that:
with and that the conditional expected value is not necessarily zero, but the -th quantile of the error with respect to the return variable is zero , so the -th quantile of with respect to X can be written as:
The estimators of and are obtained by:
where and . To estimate the parameters, the objective function above must be minimized.
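A minimal sketch of this minimization target follows, assuming the objective is the standard check (pinball) loss ρ_τ(u) = u(τ − 1{u < 0}) summed over the residuals of a linear predictor. The names `b0` and `b1` stand for the intercept and slope and are placeholders, since the paper's symbols are not visible in this text.

```python
def check_loss(u, tau):
    """Check loss: tau * u for u >= 0 and (tau - 1) * u for u < 0."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def quantile_objective(b0, b1, xs, ys, tau):
    """Sum of check losses of the residuals y - b0 - b1 * x."""
    return sum(check_loss(y - b0 - b1 * x, tau) for x, y in zip(xs, ys))


# Intercept-only illustration: with tau = 0.5 the minimizer is the sample median.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]
xs = [0.0] * len(ys)
best = min(ys, key=lambda b0: quantile_objective(b0, 0.0, xs, ys, 0.5))
print(best)
```

The outlier at 100 does not move the minimizer away from the median, which is the robustness property that motivates quantile regression.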
4.3.2. Quantile Regression Unitary Weibull Type 2
In this case, in the regression equation:
where the response variable , it is possible to reparameterize the distribution in terms of its quantile. One way to obtain the quantile function of Y is the following:
Let , then and substituting into the density function of Y, we obtain:
then the cdf of Y is:
then , where is the quantile parameter. Considering known, and are estimated by maximum likelihood.
4.3.3. An Application of Quantile Regression to Prater's Gas Mileage Data
To illustrate this, we consider Prater's gas mileage data, investigated by Simas et al. [], using the same regression equation as above with temperature as the covariate. Table 9 shows summary statistics of these data. Table 10 shows the maximum likelihood estimates of the regression coefficients and their standard errors for the , , and distributions.
Table 9.
Summary statistics for data set of the temperature and yield.
Table 10.
Parameter estimates and standard errors of the quantile regression coefficients for the , , and models, for the data set and the 0.5 quantile.
Looking at Table 11, Figure 15, and Figure 16, we see that the distribution fits better than the and distributions in quantile regression when the response variable has high kurtosis.
Table 11.
AIC and BIC values for the models , , and of the Temperature and Yield.
Figure 15.
Quantile regression for Yield and Temperature data with density (left) and density (right).
Figure 16.
Quantile regression for Yield and Temperature data with density (left) and density (right).
5. Discussion
In this work, we have introduced a new family of distributions with support on the interval (0,1) and with heavier tails than some similar distributions in the literature. The new family is defined through its stochastic representation, a transformation of two independent random variables with two-parameter Weibull distributions. We provide its density and reliability functions, together with several statistical properties of interest. For inference, the parameters of the new model are estimated by maximum likelihood, and information criteria are used to select the best model and to assess the goodness of fit of the new distribution relative to similar distributions. A Monte Carlo simulation study was carried out to evaluate empirically the performance of the maximum likelihood estimators, including the coverage probabilities and the mean lengths of the confidence intervals based on their asymptotic normality. The simulation study indicated consistent performance of these estimators. Finally, three illustrations with real data were presented: one with medical data, one with environmental data, and one involving quantile regression. In these analyses, the proposed model fitted the data better than the competing models considered.
Author Contributions
Data curation, J.R.; formal analysis, J.R., M.A.R., P.L.C. and J.A.; investigation, J.R., M.A.R. and P.L.C.; methodology, J.R., M.A.R., P.L.C. and J.A.; writing—original draft, J.R., M.A.R., P.L.C. and J.A.; writing—review and editing, M.A.R., P.L.C. and J.A.; Funding Acquisition, J.R., M.A.R. and J.A. All authors have read and agreed to the published version of the manuscript.
Funding
The research of J.R., M.A.R. and J.A. was supported by the Universidad de Antofagasta through Proyecto Semillero UA 2022.
Data Availability Statement
The analyzed data are available at the URL and in the references given in the article.
Acknowledgments
The authors acknowledge the support of the Universidad de Antofagasta. The research of J. Reyes, M. Rojas and J. Arrué was supported by Proyecto Semillero UA 2022.
Conflicts of Interest
No potential conflict of interest was reported by the authors.
References
- Kumaraswamy, P. A generalized probability density function for double-bounded random processes. J. Hydrol. 1980, 46, 79–88.
- Mazucheli, J.; Menezes, A.F.B.; Ghitany, M.E. The unit-Weibull distribution and associated inference. J. Appl. Probab. Stat. 2018, 13, 1–22.
- Mazucheli, J.; Menezes, A.F.B.; Fernandes, L.B.; de Oliveira, R.P.; Ghitany, M.E. The unit-Weibull distribution as an alternative to the Kumaraswamy distribution for the modeling of quantiles conditional on covariates. J. Appl. Stat. 2019, 47, 954–974.
- Butler, R.J.; McDonald, J.B. Using incomplete moments to measure inequality. J. Econom. 1989, 42, 109–119.
- Gastwirth, J.L. The Estimation of the Lorenz Curve and Gini Index. Rev. Econ. Stat. 1972, 54, 306–316.
- Daniel, W.W. Biostatistics: A Foundation for Analysis in the Health Sciences, 9th ed.; John Wiley and Sons, Inc.: Hoboken, NJ, USA, 2005.
- Slemenda, C.W.; Turner, C.H.; Peacock, M.; Christian, J.C.; Sorbel, J.; Hui, S.L.; Johnston, C.C. The genetics of proximal femur geometry, distribution of bone mass and bone mineral density. Osteoporos. Int. 1996, 6, 178–182.
- Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
- Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
- Chen, G.; Balakrishnan, N. A General Purpose Approximate Goodness-of-Fit Test. J. Qual. Technol. 1995, 27, 154–161.
- Buchinsky, M. Quantile regression, Box-Cox transformation model, and the U.S. wage structure. J. Econom. 1995, 65, 109–154.
- Simas, A.B.; Barreto-Souza, W.; Rocha, A.V. Improved Estimators for a General Class of Beta Regression Models. Comput. Stat. Data Anal. 2010, 54, 348–366.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).