Abstract
In this article, we study an extension of the sinh Cauchy model in order to obtain asymmetric bimodality. The behavior of the distribution may be either unimodal or bimodal. We calculate its cumulative distribution function and use it to carry out quantile regression. We calculate the maximum likelihood estimators and carry out a simulation study. Two applications are analyzed based on real data to illustrate the flexibility of the distribution for modeling unimodal and bimodal data.
1. Introduction
Continuous data encountered in practice are frequently bimodal and cannot be modeled adequately by the usual unimodal distributions. It is therefore of interest to investigate distributions that are more flexible in their modes and that will be useful for professionals working in different areas of knowledge.
In unimodal distributions, the flexibility is based on the asymmetry and kurtosis of the data. In this context, Azzalini [1] introduced the skew-normal (SN) distribution, with asymmetry parameter $\lambda$. It has a probability density function (pdf) given by
$$f(z; \lambda) = 2\,\phi(z)\,\Phi(\lambda z), \quad z \in \mathbb{R},\ \lambda \in \mathbb{R},$$
where $\phi$ and $\Phi$ denote, respectively, the density and cumulative distribution functions of the standard normal distribution. This is denoted as $Z \sim \mathrm{SN}(\lambda)$. SN(0) becomes the standard normal distribution.
Bimodal distributions generated from skew distributions can be found in Ma and Genton [2], Kim [3], Lin et al. [4,5], Elal-Olivero et al. [6], Arnold et al. [7], Arnold et al. [8], and Venegas et al. [9], among others. The importance of studying these distributions is based on the fact that they do not have identifiability problems and can be used as alternative parametric models to replace the use of mixtures of distributions that present estimation problems from either the classical or the Bayesian point of view (see McLachlan and Peel [10]; Marin et al. [11]). One difficulty with these distributions is that in general, there is no closed-form expression for their cumulative distribution function (cdf). This makes it more difficult to generate data from these distributions for simulation studies or to carry out quantile regression. Additionally, many such bimodal distributions have complicated expressions for a general quantile (say, the q-th).
A variety of bimodal data sets and appropriate models have been presented by many authors. For example, Cobb et al. [12] used the quartic exponential density presented by Fisher [13] to model crude birth rates data; Rao et al. [14] used a bimodal distribution to analyze fish length data; Famoye et al. [15] used the beta-normal distribution to analyze egg diameter data; Everitt and Hand [16] discussed some mixture distributions for modeling bimodal data; Chatterjee et al. [17] and Weisberg [18] presented two bimodal data sets on the eruption and interruption times of the Old Faithful geyser; Bansal et al. [19] discussed the bimodality of quantum dot size distribution; Famoye et al. [15] cited a variety of bimodal distributions that arise from different areas of science. On the other hand, the sinh Cauchy (SC) distribution is given by
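$$f(x; \mu, \sigma, \lambda) = \frac{\lambda \cosh\left(\frac{x-\mu}{\sigma}\right)}{\sigma \pi \left[1 + \lambda^{2} \sinh^{2}\left(\frac{x-\mu}{\sigma}\right)\right]},$$
where $x \in \mathbb{R}$, $\mu \in \mathbb{R}$, $\sigma, \lambda > 0$, $\mu$ is a location parameter, $\sigma$ is a scale parameter, and $\lambda$ is a shape parameter that preserves symmetry. The SC distribution produces unimodal and bimodal densities. The disadvantage of the SC distribution is that it is symmetric, which limits it to modeling only symmetric bimodal data. The main objective of this article is therefore to study a bimodal skew-symmetric model with closed-form cdf, in order to apply it to quantile regression. To do this, we used an extension of the SC distribution that we call the gamma–sinh Cauchy (GSC) distribution, which presents flexibility in its modes and also has a closed-form expression for its cdf. The GSC distribution belongs to the gamma-G generator family introduced by Zografos and Balakrishnan [20]. For any baseline cdf $G(x; \boldsymbol{\xi})$, $x \in \mathbb{R}$, they defined the gamma-G generator by the pdf and cdf given by
$$f(x; \alpha, \boldsymbol{\xi}) = \frac{1}{\Gamma(\alpha)} \left\{-\log\left[1 - G(x; \boldsymbol{\xi})\right]\right\}^{\alpha - 1} g(x; \boldsymbol{\xi}) \tag{2}$$
and
$$F(x; \alpha, \boldsymbol{\xi}) = \frac{\gamma\left(\alpha, -\log\left[1 - G(x; \boldsymbol{\xi})\right]\right)}{\Gamma(\alpha)}, \tag{3}$$
respectively, where $\alpha > 0$ is a skewness parameter, $\boldsymbol{\xi}$ is a vector of parameters, $g(x; \boldsymbol{\xi}) = dG(x; \boldsymbol{\xi})/dx$, $\gamma(\alpha, y) = \int_{0}^{y} t^{\alpha - 1} e^{-t}\,dt$ is the incomplete gamma function, and $\Gamma(\alpha)$ is the usual gamma function. We remark that in the literature, there are many models that can accommodate bimodal distributions. However, in only a few of them do the parameters have an interpretation in terms of measures of central tendency (mean, median, for instance) or a general q-th quantile. As we will show in Section 2, the main advantage of the GSC is that the location parameter $\mu$ represents the respective q-th quantile under a certain restriction over $\alpha$, which is very convenient for the use of this model in a quantile regression framework.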
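As a quick sanity check on the generator (a worked step that is not part of the original text), the following shows that a unit shape parameter returns the baseline distribution. The notation is an assumption made for this sketch: $\alpha$ is the gamma shape (skewness) parameter of the generator, $G$ and $g$ are the baseline cdf and pdf, and $\gamma(\cdot,\cdot)$ is the lower incomplete gamma function.

```latex
\begin{align*}
\gamma(1, y) &= \int_{0}^{y} e^{-t}\,dt = 1 - e^{-y}, \\
F(x)\big|_{\alpha = 1} &= \frac{\gamma\bigl(1, -\log[1 - G(x)]\bigr)}{\Gamma(1)}
   = 1 - \exp\bigl(\log[1 - G(x)]\bigr) = G(x), \\
f(x)\big|_{\alpha = 1} &= \frac{1}{\Gamma(1)}\,\bigl\{-\log[1 - G(x)]\bigr\}^{0}\, g(x) = g(x).
\end{align*}
```

The baseline is therefore nested in the gamma-G family at $\alpha = 1$, which is consistent with the SC distribution appearing as a particular case of the GSC distribution.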
The paper is organized as follows. Section 2 develops the GSC distribution, its basic properties, and quantile regression. In Section 3, we perform a small-scale simulation study of the maximum likelihood (ML) estimators for parameters. Two applications to real data are discussed in Section 4, which illustrate the usefulness of the proposed model. Finally, conclusions are given in Section 5.
2. Gamma–Sinh Cauchy Distribution
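The GSC distribution is obtained considering G in (2) as the cdf of the SC distribution, namely $G(x) = \frac{1}{2} + \frac{1}{\pi}\arctan\left[\lambda \sinh\left(\frac{x-\mu}{\sigma}\right)\right]$. The pdf can be written as
$$f(x; \mu, \sigma, \lambda, \alpha) = \frac{\lambda \cosh\left(\frac{x-\mu}{\sigma}\right)}{\sigma \pi\, \Gamma(\alpha) \left[1 + \lambda^{2} \sinh^{2}\left(\frac{x-\mu}{\sigma}\right)\right]} \left\{-\log\left[\frac{1}{2} - \frac{1}{\pi}\arctan\left(\lambda \sinh\left(\frac{x-\mu}{\sigma}\right)\right)\right]\right\}^{\alpha - 1}, \tag{4}$$
where $x \in \mathbb{R}$, and $\alpha > 0$ is an asymmetry parameter. We denote this by $X \sim \mathrm{GSC}(\mu, \sigma, \lambda, \alpha)$.
The cdf is given by
$$F(x; \mu, \sigma, \lambda, \alpha) = \frac{1}{\Gamma(\alpha)}\, \gamma\left(\alpha, -\log\left[\frac{1}{2} - \frac{1}{\pi}\arctan\left(\lambda \sinh\left(\frac{x-\mu}{\sigma}\right)\right)\right]\right). \tag{5}$$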
Particular cases:
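- $\mathrm{GSC}(\mu, \sigma, \lambda, 1) \equiv \mathrm{SC}(\mu, \sigma, \lambda)$, SC distribution,
- $\mathrm{GSC}(\mu, \sigma, 1, 1)$, hyperbolic secant distribution (Talacko [21]).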
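The density, the cdf, and the two particular cases can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the SC cdf to be 1/2 + arctan(λ sinh z)/π with z = (x − μ)/σ, writes `alpha` for the gamma shape parameter, and evaluates the regularized incomplete gamma function with a plain power series because the Python standard library has no such function. All function names are hypothetical.

```python
import math


def gamma_cdf(a, x, tol=1e-16, max_terms=10_000):
    """Gamma(shape=a, scale=1) cdf via the power series of the lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a                      # n = 0 term of sum_n x^n / (a (a+1) ... (a+n))
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < tol * total:
            break
    return min(1.0, total * math.exp(a * math.log(x) - x - math.lgamma(a)))


def sc_survival(z, lam):
    """1 - G_SC at standardized z; atan2(1, s)/pi equals 1/2 - arctan(s)/pi."""
    return math.atan2(1.0, lam * math.sinh(z)) / math.pi


def gsc_pdf(x, mu=0.0, sigma=1.0, lam=1.0, alpha=1.0):
    """GSC density: SC density times (-log(1 - G_SC))^(alpha - 1) / Gamma(alpha)."""
    z = (x - mu) / sigma
    s = lam * math.sinh(z)
    sc_density = lam * math.cosh(z) / (sigma * math.pi * (1.0 + s * s))
    h = -math.log(sc_survival(z, lam))
    return sc_density * h ** (alpha - 1.0) / math.gamma(alpha)


def gsc_cdf(x, mu=0.0, sigma=1.0, lam=1.0, alpha=1.0):
    """GSC cdf: gamma cdf evaluated at -log(1 - G_SC(x))."""
    z = (x - mu) / sigma
    return gamma_cdf(alpha, -math.log(sc_survival(z, lam)))


x = 0.7
# alpha = lam = 1: the two printed values should agree (hyperbolic secant density)
print(gsc_pdf(x), 1.0 / (math.pi * math.cosh(x)))
# alpha = 1: the cdf at the location parameter is the SC median probability
print(gsc_cdf(0.0, lam=2.0, alpha=1.0))
```

Using `atan2` for the SC survival function is a numerical convenience only: it avoids the cancellation in 1/2 − arctan(s)/π when s is large.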
The following proposition states conditions for the symmetry of the GSC distribution.
Proposition 1.
The density of the $\mathrm{GSC}(\mu, \sigma, \lambda, \alpha)$ model is symmetric if and only if $\alpha = 1$.
Proof.
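Without loss of generality, we consider $\mu = 0$ and $\sigma = 1$. For $\alpha = 1$, the density of the $\mathrm{GSC}(0, 1, \lambda, 1)$ model is
$$f(x; 0, 1, \lambda, 1) = \frac{\lambda \cosh(x)}{\pi\left[1 + \lambda^{2}\sinh^{2}(x)\right]}.$$
This function is clearly even because $\cosh(x)$ and $\sinh^{2}(x)$ are even. To prove the converse, we will argue by contradiction. Let $\alpha \neq 1$ be such that the density is symmetric, i.e., $f(x; 0, 1, \lambda, \alpha) = f(-x; 0, 1, \lambda, \alpha)$, $\forall x \in \mathbb{R}$. Since the SC factor of the density is even, this implies that
$$\left\{-\log\left[\frac{1}{2} - \frac{1}{\pi}\arctan(\lambda \sinh(x))\right]\right\}^{\alpha - 1} = \left\{-\log\left[\frac{1}{2} + \frac{1}{\pi}\arctan(\lambda \sinh(x))\right]\right\}^{\alpha - 1}.$$
From the latter equality, the injectivity of $t \mapsto t^{\alpha - 1}$ on $(0, \infty)$ for $\alpha \neq 1$, and the fact that the logarithmic function is injective, we find that $\arctan(\lambda \sinh(x)) = 0$, $\forall x \in \mathbb{R}$, which implies that $\lambda = 0$, producing a contradiction. □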
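Proposition 1 can also be illustrated numerically. The sketch below is a check rather than a proof: it assumes the standardized GSC density (μ = 0, σ = 1) with SC cdf 1/2 + arctan(λ sinh x)/π, writes `alpha` for the gamma shape parameter, and measures the largest gap between f(x) and f(−x) on a grid. The function names are hypothetical.

```python
import math


def gsc_pdf(x, lam, alpha):
    """Standardized GSC density (mu = 0, sigma = 1) under the assumed pdf form."""
    s = lam * math.sinh(x)
    w = math.atan2(1.0, s) / math.pi             # equals 1/2 - arctan(s)/pi
    sc_density = lam * math.cosh(x) / (math.pi * (1.0 + s * s))
    return sc_density * (-math.log(w)) ** (alpha - 1.0) / math.gamma(alpha)


def max_asymmetry(lam, alpha, half_width=5.0, steps=500):
    """Largest |f(x) - f(-x)| over an evenly spaced grid on (0, half_width]."""
    grid = (half_width * k / steps for k in range(1, steps + 1))
    return max(abs(gsc_pdf(x, lam, alpha) - gsc_pdf(-x, lam, alpha)) for x in grid)


# The gap should be numerically zero only for alpha = 1
for alpha in (0.5, 1.0, 2.0):
    print(alpha, max_asymmetry(lam=0.5, alpha=alpha))
```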
The unimodal and bimodal regions for the $\mathrm{GSC}(0, 1, \lambda, \alpha)$ model are illustrated in Figure 1. We can see that for all $\alpha > 0$, there is $\lambda$ such that the $\mathrm{GSC}(0, 1, \lambda, \alpha)$ density is bimodal. Figure 2 shows the density function for some values of the parameters $\lambda$ and $\alpha$, considering the location and scale parameters fixed at 0 and 1, respectively. The distribution assumes symmetric unimodal and bimodal shapes and asymmetric unimodal and bimodal shapes. Figure 3 shows the skewness and kurtosis coefficients for the GSC model under different values of $\lambda$ and $\alpha$ (such coefficients do not depend on $\mu$ and $\sigma$). As Figure 3 illustrates, the model can assume positive and negative values for the skewness coefficient and can also accommodate kurtosis coefficients lower than, equal to, and greater than that of the normal model (<3, =3 and >3, respectively).
Figure 1.
Unimodal and bimodal regions for the $\mathrm{GSC}(0, 1, \lambda, \alpha)$ model.
Figure 2.
Plots for the gamma–sinh Cauchy (GSC) model for different values of the parameters with , (a) (b) , and (c,d) .
Figure 3.
Skewness and kurtosis coefficients for the GSC model with different values for $\lambda$ and $\alpha$.
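The unimodal and bimodal regions can be explored with a simple grid search. The sketch below is illustrative only: it assumes the standardized GSC density with SC cdf 1/2 + arctan(λ sinh x)/π, writes `alpha` for the gamma shape parameter, and counts strict local maxima on a symmetric grid. Under that assumed SC form, a short calculation for the symmetric case (alpha = 1) places the modes at sinh²(x) = (1 − 2λ²)/λ², so bimodality there requires λ < 1/√2; the grid search reproduces this. The function names, grid width, and step are hypothetical choices.

```python
import math


def gsc_pdf(x, lam, alpha):
    """Standardized GSC density (mu = 0, sigma = 1) under the assumed pdf form."""
    s = lam * math.sinh(x)
    w = math.atan2(1.0, s) / math.pi             # equals 1/2 - arctan(s)/pi
    sc_density = lam * math.cosh(x) / (math.pi * (1.0 + s * s))
    return sc_density * (-math.log(w)) ** (alpha - 1.0) / math.gamma(alpha)


def count_modes(lam, alpha, half_width=15.0, n=3000):
    """Count strict local maxima of the density on a grid symmetric about zero."""
    h = half_width / n
    # (i - n) * h hits 0 exactly at i = n and keeps the grid exactly symmetric
    dens = [gsc_pdf((i - n) * h, lam, alpha) for i in range(2 * n + 1)]
    return sum(
        1
        for i in range(1, 2 * n)
        if dens[i] > dens[i - 1] and dens[i] > dens[i + 1]
    )


for lam in (0.2, 0.6, 0.8, 1.0):
    for alpha in (0.5, 1.0, 2.0):
        print(f"lam={lam}, alpha={alpha}: {count_modes(lam, alpha)} mode(s)")
```

A grid search of this kind can miss modes that are closer together than the grid step, so it is a rough diagnostic rather than a substitute for analysing the derivative of the density.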
The GSC Model for Quantile Regression
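From Equation (5), and because $\sinh(0) = 0$ gives $-\log(1/2) = \log 2$, it follows that the cdf of the GSC distribution evaluated in $\mu$ is given by
$$F(\mu; \mu, \sigma, \lambda, \alpha) = G(\log 2; \alpha), \tag{6}$$
where $G(\cdot; a)$ corresponds to the cdf of the gamma distribution with shape and scale parameters a and 1, respectively. Note that $F(\mu; \mu, \sigma, \lambda, \alpha)$ depends only on $\alpha$ (and not on $\sigma$ or $\lambda$). As $G(\log 2; \alpha)$ is a continuous and strictly decreasing function of $\alpha$, ranging from 1 (as $\alpha \to 0$) to 0 (as $\alpha \to \infty$), for any $q \in (0, 1)$ the equation $G(\log 2; \alpha) = q$ has a unique solution for $\alpha$, say $\alpha(q)$, which can also be written as
$$\gamma(\alpha(q), \log 2) = q\, \Gamma(\alpha(q)). \tag{7}$$
For example, $q = 0.5$ gives $\alpha(0.5) = 1$, because $G(\log 2; 1) = 1 - e^{-\log 2} = 1/2$. Equation (7) can be solved numerically. For instance, in R the uniroot function can be used. Table 1 shows some values for $\alpha(q)$ with different values for q.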
Table 1.
Some values for $\alpha(q)$ in terms of q.
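The root-finding step can be sketched as follows. This is a stand-in for the R `uniroot` call mentioned in the text, not the authors' code: it uses plain bisection, writes `alpha` for the gamma shape parameter, and evaluates the Gamma(shape, 1) cdf at log 2 with a power series, since the Python standard library has no incomplete gamma function. The bracket `[1e-8, 200]` and the function names are assumptions.

```python
import math

LOG2 = math.log(2.0)


def gamma_cdf(a, x, tol=1e-16, max_terms=10_000):
    """Gamma(shape=a, scale=1) cdf via the power series of the lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a                      # n = 0 term of sum_n x^n / (a (a+1) ... (a+n))
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < tol * total:
            break
    return min(1.0, total * math.exp(a * math.log(x) - x - math.lgamma(a)))


def alpha_of_q(q, lo=1e-8, hi=200.0, iters=200):
    """Solve gamma_cdf(alpha, log 2) = q for alpha by bisection."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # The cdf at log 2 decreases in the shape, so a value above q
        # means the shape is still too small.
        if gamma_cdf(mid, LOG2) > q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for q in (0.10, 0.25, 0.50, 0.75, 0.90):
    print(q, alpha_of_q(q))
```

Quantiles below the median map to shape values above 1 and quantiles above the median to values below 1, with the median itself corresponding to the symmetric SC case.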
For this reason, for a fixed q, if we take $\alpha = \alpha(q)$ satisfying (7), by (6) the parameter $\mu$ directly represents the q-th quantile, allowing regression on this quantile to be performed conveniently. Under this setting, a set of available p-covariates, say $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})^{\top}$, for $i = 1, \ldots, n$, can be introduced as follows:
$$\mu_i = \mathbf{x}_i^{\top} \boldsymbol{\beta}, \quad i = 1, \ldots, n,$$
where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^{\top}$ is the vector of regression coefficients.
This is a convenient property of the GSC distribution because it provides a simple way of performing quantile regression in a model that can be unimodal or bimodal, depending only on parameter $\lambda$ (because $\alpha$ is considered as fixed in this setting).
As far as we know, there is no model in the literature that is parameterized conveniently in terms of the q-th quantile and can also be unimodal or bimodal. Figure 1 shows that for any fixed $\alpha$ in the GSC, there is an interval for $\lambda$ where the distribution is unimodal and an interval where the distribution is bimodal.
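The quantile interpretation of the location parameter can be checked by simulation. The sketch below rests on several assumptions that are not taken from the original text: an identity link between the location and the linear predictor, made-up coefficients and covariates, and draws generated with `random.gammavariate` pushed through the SC quantile function (which is equivalent in distribution to inverting the GSC cdf). Here `alpha` denotes the gamma shape parameter and all function names are hypothetical.

```python
import math
import random

LOG2 = math.log(2.0)


def gamma_cdf(a, x, tol=1e-16, max_terms=10_000):
    """Gamma(shape=a, scale=1) cdf via the power series of the lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < tol * total:
            break
    return min(1.0, total * math.exp(a * math.log(x) - x - math.lgamma(a)))


def alpha_of_q(q, lo=1e-8, hi=200.0, iters=200):
    """Shape value that makes the location parameter the q-th quantile (bisection)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, LOG2) > q:    # cdf at log 2 decreases in the shape
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def rgsc(mu, sigma, lam, alpha, rng):
    """One GSC draw: a Gamma(alpha, 1) variate mapped through the SC quantile function."""
    y = rng.gammavariate(alpha, 1.0)
    return mu + sigma * math.asinh(math.tan(math.pi * (0.5 - math.exp(-y))) / lam)


def fraction_below(mu, sigma, lam, alpha, n, rng):
    """Monte Carlo estimate of P(X <= mu)."""
    return sum(rgsc(mu, sigma, lam, alpha, rng) <= mu for _ in range(n)) / n


q = 0.25
alpha_q = alpha_of_q(q)
beta = (1.0, 0.5)          # hypothetical regression coefficients
x_i = (1.0, 2.0)           # hypothetical covariate row; the leading 1.0 is an intercept
mu_i = sum(xj * bj for xj, bj in zip(x_i, beta))

rng = random.Random(2019)
frac = fraction_below(mu_i, sigma=1.5, lam=0.3, alpha=alpha_q, n=20_000, rng=rng)
print(f"target q = {q}, empirical P(Y <= mu_i) = {frac:.4f}")
```

The empirical fraction should be close to q whatever values are chosen for the scale and for λ, which is the property that makes the parameterization convenient for regression.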
3. ML Estimation for the GSC Distribution
3.1. ML Estimation
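Consider $X_1, \ldots, X_n$ as a random sample of size n from the $\mathrm{GSC}(\mu, \sigma, \lambda, \alpha)$ distribution. Hence, the log-likelihood function for $\boldsymbol{\theta} = (\mu, \sigma, \lambda, \alpha)$ is given by
$$\ell(\boldsymbol{\theta}) = n \log\lambda - n\log\sigma - n\log\pi - n\log\Gamma(\alpha) + \sum_{i=1}^{n} \log\cosh(z_i) - \sum_{i=1}^{n} \log\left[1 + \lambda^{2}\sinh^{2}(z_i)\right] + (\alpha - 1)\sum_{i=1}^{n} \log h_i, \tag{8}$$
where $z_i = (x_i - \mu)/\sigma$, $w_i = \frac{1}{2} - \frac{1}{\pi}\arctan(\lambda\sinh(z_i))$, and $h_i = -\log w_i$. To compute the ML estimate of $\boldsymbol{\theta}$, (8) must be maximized. That is, we have to solve the following system of equations: $\partial\ell/\partial\mu = 0$, $\partial\ell/\partial\sigma = 0$, $\partial\ell/\partial\lambda = 0$, and $\partial\ell/\partial\alpha = 0$. More precisely, we have to solve
$$\sum_{i=1}^{n} d_i = 0, \qquad \sum_{i=1}^{n} z_i d_i = -n, \qquad \frac{n}{\lambda} - 2\lambda\sum_{i=1}^{n} \frac{\sinh^{2}(z_i)}{1 + \lambda^{2}\sinh^{2}(z_i)} + (\alpha - 1)\sum_{i=1}^{n} \frac{\sinh(z_i)}{\pi\, w_i h_i \left[1 + \lambda^{2}\sinh^{2}(z_i)\right]} = 0, \qquad \sum_{i=1}^{n} \log h_i = n\,\psi(\alpha),$$
where $d_i = \tanh(z_i) - \frac{2\lambda^{2}\sinh(z_i)\cosh(z_i)}{1 + \lambda^{2}\sinh^{2}(z_i)} + \frac{(\alpha - 1)\lambda\cosh(z_i)}{\pi\, w_i h_i\left[1 + \lambda^{2}\sinh^{2}(z_i)\right]}$, and $\psi$ is the digamma function. The system of equations given above can be solved using numerical procedures such as the Newton–Raphson procedure. An alternative is to use numerical derivatives, for instance through the numDeriv package in the R software (R Core Team [22]).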
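The log-likelihood and a numerical score vector can be sketched as below. This is an illustration under assumptions, not the authors' implementation: it uses the GSC density with SC cdf 1/2 + arctan(λ sinh z)/π, writes `alpha` for the gamma shape parameter, and replaces the numerical derivatives that a package such as numDeriv would provide in R with simple central finite differences. The data values and function names are hypothetical.

```python
import math


def gsc_loglik(theta, data):
    """GSC log-likelihood for theta = (mu, sigma, lam, alpha)."""
    mu, sigma, lam, alpha = theta
    n = len(data)
    total = n * (math.log(lam) - math.log(sigma) - math.log(math.pi) - math.lgamma(alpha))
    for x in data:
        z = (x - mu) / sigma
        s = lam * math.sinh(z)
        w = math.atan2(1.0, s) / math.pi         # equals 1/2 - arctan(s)/pi
        total += (
            math.log(math.cosh(z))
            - math.log1p(s * s)
            + (alpha - 1.0) * math.log(-math.log(w))
        )
    return total


def num_grad(f, theta, h=1e-6):
    """Central finite-difference gradient of f at theta."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[j] += h
        dn[j] -= h
        grad.append((f(up) - f(dn)) / (2.0 * h))
    return grad


data = [-1.4, -0.2, 0.1, 0.9, 2.3]              # made-up observations
theta0 = [0.2, 1.3, 0.7, 1.8]                   # arbitrary evaluation point
print(gsc_loglik(theta0, data))
print(num_grad(lambda t: gsc_loglik(t, data), theta0))
```

In practice the gradient would be handed to a quasi-Newton optimizer with the positivity constraints on the scale and shape parameters enforced, for example by optimizing over their logarithms.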
3.2. Simulation Study
In this section, we present a brief simulation study to assess the performance of the ML estimators in the GSC model. To draw values from the model, we used the inversion method. If $U \sim U(0, 1)$, then
$$X = \mu + \sigma\,\operatorname{arcsinh}\left\{\frac{1}{\lambda}\tan\left[\pi\left(\frac{1}{2} - \exp\left\{-G^{-1}(U; \alpha)\right\}\right)\right]\right\} \sim \mathrm{GSC}(\mu, \sigma, \lambda, \alpha),$$
where $G^{-1}(\cdot; \alpha)$ is the inverse of the cdf of the gamma distribution with shape and rate parameters equal to $\alpha$ and 1, respectively. In all scenarios, we considered and , three values for (, and ), and three values for (, and 2). We also considered three sample sizes: 100, 200, and 500. For each combination of the parameters, we drew 1,000 samples and computed the ML estimates. Table 2 summarizes the results considering the average of the bias (bias), the root of the estimated mean squared error (RMSE), and the 95% coverage probability (CP). Note that in all cases, the bias and RMSE decreased when the sample size was increased, suggesting that the estimators are consistent. Finally, we also remark that the coverage probability converged to the nominal value used for the construction of the confidence intervals when the sample size was increased, suggesting that the asymptotic normality of the ML estimators is a reasonable approximation for large sample sizes.
Table 2.
Simulation study for the GSC model.
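The inversion method used to generate the samples can be sketched as follows. The code is an illustration under assumptions: it takes the GSC quantile function to be the SC quantile function applied to 1 − exp(−y), where y is the Gamma(alpha, 1) quantile of u, writes `alpha` for the gamma shape parameter, and obtains the gamma quantile by bisection on a series-based cdf because the Python standard library offers no gamma quantile function. All function names are hypothetical.

```python
import math
import random


def gamma_cdf(a, x, tol=1e-16, max_terms=10_000):
    """Gamma(shape=a, scale=1) cdf via the power series of the lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < tol * total:
            break
    return min(1.0, total * math.exp(a * math.log(x) - x - math.lgamma(a)))


def gamma_ppf(a, u, iters=200):
    """Gamma(shape=a, scale=1) quantile of u in (0, 1), found by bisection."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(a, hi) < u:         # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(a, mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gsc_ppf(u, mu, sigma, lam, alpha):
    """GSC quantile function: gamma quantile pushed through the SC quantile function."""
    y = gamma_ppf(alpha, u)
    return mu + sigma * math.asinh(math.tan(math.pi * (0.5 - math.exp(-y))) / lam)


def gsc_cdf(x, mu, sigma, lam, alpha):
    """GSC cdf, used here only to check the round trip."""
    s = lam * math.sinh((x - mu) / sigma)
    w = math.atan2(1.0, s) / math.pi             # equals 1/2 - arctan(s)/pi
    return gamma_cdf(alpha, -math.log(w))


rng = random.Random(123)
sample = [gsc_ppf(rng.random(), 0.0, 1.0, 0.4, 1.7) for _ in range(5)]
print(sample)
for u in (0.05, 0.5, 0.95):
    x = gsc_ppf(u, 1.0, 2.0, 0.4, 1.7)
    print(u, gsc_cdf(x, 1.0, 2.0, 0.4, 1.7))     # should return u up to rounding
```

Bisection is slow compared with a dedicated gamma quantile routine, so for a study with many replicates a library implementation of the gamma quantile function would be the natural replacement.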
4. Applications
In this section, we carry out two applications to real data, the first using the GSC model without covariates and the second applying quantile regression to uni- and bimodal data.
4.1. Application 1: Without Covariates
The first application uses a data set consisting of 1150 heights measured at 1 micron intervals along the drum of a roller (i.e., parallel to the axis of the roller). This was part of an extensive study of roller surface roughness. The data are available for download at http://lib.stat.cmu.edu/jasadata/laslett. Table 3 presents summary statistics for the data set, including the sample asymmetry and kurtosis coefficients.
Table 3.
Descriptive statistics for the data set.
We fitted these data using the SN model, the exponentiated sinh Cauchy (ESC) model (see Cooray [23]), and the GSC model. A summary of these fits is presented in Table 4. Based on the Akaike information criterion (AIC), the GSC provided the best fit for the height data set. Figure 4 shows plots of the density functions for the fitted models using the ML estimates for the SN, ESC, and GSC distributions.
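For reference, the comparison relies on the usual definition of the AIC, which the text does not restate. The expression below is the standard one; the symbols $k$ (number of estimated parameters) and $\hat{\boldsymbol{\theta}}$ (ML estimate) are notation introduced here.

```latex
\begin{equation*}
\mathrm{AIC} = 2k - 2\,\ell(\hat{\boldsymbol{\theta}}),
\end{equation*}
```

For the GSC model without covariates, $k = 4$ (location, scale, and the two shape parameters), and the model with the smallest AIC is preferred.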
Table 4.
Maximum likelihood (ML) estimates for the GSC, exponentiated sinh Cauchy (ESC), and skew-normal (SN) models for the roller data set.
Figure 4.
Fitted models for roller data set.
4.2. Application 2: Quantile Regression for Bimodal Data
The second application we consider is the Australian Institute of Sport (ais) data set available in the sn package in R. This data set contains measurements on 102 male and 100 female athletes collected at the Australian Institute of Sport. The linear model considered is
$$\mathrm{Bfat}_i = \beta_0 + \beta_1\,\mathrm{bmi}_i + \beta_2\,\mathrm{lbm}_i + \epsilon_i, \quad i = 1, \ldots, 202, \tag{9}$$
where $\mathrm{Bfat}_i$ is the body fat percentage for the i-th athlete and $\mathrm{bmi}_i$ and $\mathrm{lbm}_i$ are the covariates body mass index and lean body mass for the i-th athlete, respectively. In addition, $\epsilon_i \sim \mathrm{GSC}(0, \sigma, \lambda, \alpha(q))$, $\alpha(q)$ satisfies Equation (7), and $q$ is the fixed quantile that is being modeled. This data set was also analyzed in Martínez-Flórez et al. [24] using a bimodal regression model. However, the authors modeled the mean of the distribution. In our approach, we model the , and percentiles of the distribution, which provides a more informative scenario to explain body fat in terms of the body mass index and lean body mass. Our approach is compared with the skewed Laplace (SKL) and skewed Student-t (SKT) models discussed in Galarza et al. [25], where the authors proposed a flexible class of models in a quantile regression context. Table 5 shows the AIC for those models considering different quantiles. We also present the p-value for the Kolmogorov–Smirnov (K–S) test of the hypothesis that the respective quantile residuals came from the standard normal distribution. P-values greater than 5% suggest that, at this significance level, the standard normal assumption is reasonable for those residuals, in which case the model would be appropriate for this data set. Note that based on the AIC, the GSC presents a better fit for this data set, except for the median regression ($q = 0.5$). On the other hand, based on the p-value for the K–S test applied to the quantile residuals, we conclude that GSC, SKL, and SKT are appropriate models for and (the GSC and SKT provide a better fit based on the AIC). However, for and , GSC provides the better fit because its p-values are clearly greater than 0.05. Finally, for no model seems appropriate, but based on the p-values, the GSC provides a better fit than the SKL and SKT distributions.
Table 5.
AIC and p-value for K–S test in the ais data set for the GSC, skewed Laplace (SKL), and skewed Student-t (SKT) models and different quantiles.
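The residual check described above can be sketched as follows. The code is illustrative and rests on assumptions: quantile residuals are taken to be the standard normal quantiles of the fitted GSC cdf values, the K–S p-value uses the classical asymptotic Kolmogorov series rather than an exact small-sample distribution, `alpha` denotes the gamma shape parameter, and the demonstration uses simulated data with made-up parameter values in place of the ais data and the ML estimates. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def gamma_cdf(a, x, tol=1e-16, max_terms=10_000):
    """Gamma(shape=a, scale=1) cdf via the power series of the lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < tol * total:
            break
    return min(1.0, total * math.exp(a * math.log(x) - x - math.lgamma(a)))


def gsc_cdf(x, mu, sigma, lam, alpha):
    s = lam * math.sinh((x - mu) / sigma)
    w = math.atan2(1.0, s) / math.pi             # equals 1/2 - arctan(s)/pi
    return gamma_cdf(alpha, -math.log(w))


def quantile_residuals(y, mu, sigma, lam, alpha):
    """Standard normal quantiles of the fitted cdf values, one per observation."""
    std = NormalDist()
    out = []
    for yi, mi in zip(y, mu):
        u = min(max(gsc_cdf(yi, mi, sigma, lam, alpha), 1e-12), 1.0 - 1e-12)
        out.append(std.inv_cdf(u))
    return out


def ks_statistic_normal(res):
    """Two-sided Kolmogorov-Smirnov distance from the standard normal cdf."""
    std = NormalDist()
    r = sorted(res)
    n = len(r)
    d = 0.0
    for i, v in enumerate(r):
        c = std.cdf(v)
        d = max(d, (i + 1) / n - c, c - i / n)
    return d


def ks_pvalue_asymptotic(d, n, terms=100):
    """Asymptotic p-value: 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 n d^2)."""
    x = math.sqrt(n) * d
    if x < 0.2:                                  # the series is numerically 1 here
        return 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x) for k in range(1, terms + 1))
    return min(1.0, max(0.0, p))


# Simulated illustration with hypothetical "true" parameters (not the ais fit)
rng = random.Random(42)
sigma, lam, alpha = 1.2, 0.4, 1.5
covariate = [rng.uniform(0.0, 4.0) for _ in range(2000)]
mu = [1.0 + 0.5 * c for c in covariate]
y = [
    m + sigma * math.asinh(math.tan(math.pi * (0.5 - math.exp(-rng.gammavariate(alpha, 1.0)))) / lam)
    for m in mu
]
res = quantile_residuals(y, mu, sigma, lam, alpha)
d_stat = ks_statistic_normal(res)
print(d_stat, ks_pvalue_asymptotic(d_stat, len(res)))
```

Because the residuals here are computed with the parameters that generated the data, the K–S distance should be small; with estimated parameters the nominal p-value is only approximate.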
Figure 5 shows the regression coefficients for the quantile regression presented in Equation (9) and their respective 95% confidence intervals. Note that body mass index and lean body mass are significant in explaining all the quantiles modeled.
Figure 5.
Estimates for regression coefficients (and 95% confidence intervals) for variables bmi (left panel) and lbm (right panel) in different quantile regression models with quantiles equal to , and and response variable Bfat.
Figure 6 shows the profile density for the q-th quantile of body fat percentage for and . Note that the distribution of the quantile is unimodal, and the distribution of the quantile is bimodal.
Figure 6.
Distribution for and quantiles of body fat percentage considering body mass index and lean body mass equal to 22.96 and 64.87, respectively. Curves in black, red, and green represent the density functions estimated by the GSC, SKL, and SKT models, respectively.
5. Final Comments
This paper proposes the GSC distribution, which is flexible in its modes and contains the SC distribution as a special case. We implemented ML estimation in a quantile regression setting, in which both covariates were significant for all the quantiles modeled. In the applications to real data, the GSC model compared favorably with the competing models: it attained the lowest AIC in the first application and in all but the median regression in the second. Some further characteristics of the GSC distribution are:
- The GSC distribution contains the SC and hyperbolic secant models as special cases.
- The GSC distribution presents great flexibility in its modes, as can be observed in Figure 1.
- The proposed model has a closed-form expression for its cdf.
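- In the two applications, the GSC model fits better than the competing models in most of the comparisons; the exception, according to the AIC, is the median regression in the second application.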
Author Contributions
All of the authors contributed significantly to this research article.
Funding
The research of Yolanda M. Gómez was supported by proyecto DIUDA programa de inserción No. 22367 of the Universidad de Atacama. The research of Héctor W. Gómez was supported by Grant SEMILLERO UA-2019 (Chile).
Acknowledgments
The authors would like to thank the editor and the anonymous referees for their comments and suggestions, which significantly improved our manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Azzalini, A. A class of distributions which includes the normal ones. Scand. J. Stat. 1985, 12, 171–178.
2. Ma, Y.; Genton, M.G. Flexible class of skew-symmetric distributions. Scand. J. Stat. 2004, 31, 459–468.
3. Kim, H.J. On a class of two-piece skew-normal distributions. Statistics 2005, 39, 537–553.
4. Lin, T.I.; Lee, J.C.; Hsieh, W.J. Robust mixture models using the skew-t distribution. Stat. Comput. 2007, 17, 81–92.
5. Lin, T.I.; Lee, J.C.; Yen, S.Y. Finite mixture modeling using the skew-normal distribution. Stat. Sin. 2007, 17, 909–927.
6. Elal-Olivero, D.; Gómez, H.W.; Quintana, F.A. Bayesian Modeling using a class of Bimodal skew-Elliptical distributions. J. Stat. Plan. Infer. 2009, 139, 1484–1492.
7. Arnold, B.C.; Gómez, H.W.; Salinas, H.S. On multiple constraint skewed models. Statistics 2009, 43, 279–293.
8. Arnold, B.C.; Gómez, H.W.; Salinas, H.S. A doubly skewed normal distribution. Statistics 2015, 49, 842–858.
9. Venegas, O.; Salinas, H.S.; Gallardo, D.I.; Bolfarine, H.; Gómez, H.W. Bimodality based on the generalized skew-normal distribution. J. Stat. Comput. Simul. 2018, 88, 156–181.
10. McLachlan, G.J.; Peel, D. Finite Mixture Models; Wiley Interscience: New York, NY, USA, 2000.
11. Marin, J.M.; Mengersen, K.; Robert, C. Bayesian modeling and inference on mixtures of distributions. Handbook Stat. 2005, 25, 459–503.
12. Cobb, L.; Koppstein, P.; Chen, N.H. Estimation and moment recursion relations for multimodal distributions of the exponential families. J. Am. Stat. Assoc. 1983, 78, 124–130.
13. Fisher, R.A. On the Mathematical Foundations of Theoretical Statistics. Philos. Trans. R. Soc. London Ser. A 1922, 222, 309–368.
14. Rao, K.S.; Narayana, J.L.; Sastry, V.P. A bimodal distribution. Bull. Calcutta Math. Soc. 1988, 80, 238–240.
15. Famoye, F.; Lee, C.; Eugene, N. Beta-normal distribution: Bimodality properties and application. J. Modern Appl. Stat. Method 2004, 3, 85–103.
16. Everitt, B.S.; Hand, D.J. Finite Mixture Distributions; Chapman & Hall: London, UK, 1981.
17. Chatterjee, S.; Handcock, M.S.; Simonoff, J.S. A Casebook for a First Course in Statistics and Data Analysis; John Wiley & Sons: New York, NY, USA, 1995.
18. Weisberg, S. Applied Linear Regression; John Wiley & Sons: New York, NY, USA, 2005.
19. Bansal, B.; Gokhale, M.R.; Bhattacharya, A.; Arora, B.M. InAs/InP quantum dots with bimodal size distribution: Two evolution pathways. J. Appl. Phys. 2007, 101, 1–6.
20. Zografos, K.; Balakrishnan, N. On families of beta- and generalized gamma-generated distributions and associated inference. Stat. Methodol. 2009, 6, 344–362.
21. Talacko, J. Perks’ distributions and their role in the theory of Wiener’s stochastic variables. Trabajos de Estadistica 1956, 7, 159–174.
22. R Core Team. R: A Language and Environment for Statistical Computing. Available online: http://www.R-project.org (accessed on 26 May 2019).
23. Cooray, K. Exponentiated Sinh Cauchy distribution with applications. Commun. Stat. Theor. Method 2013, 42, 3838–3852.
24. Martínez-Flórez, G.; Salinas, H.S.; Bolfarine, H. Bimodal regression model. Colombian J. Stat. 2017, 40, 65–83.
25. Galarza, C.E.; Lachos, V.H.; Barbosa, C.; Castro, L.M. Robust quantile regression using a generalized class of skewed distributions. Stat 2017, 6, 113–130.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).