An Asymmetric Bimodal Double Regression Model

In this paper, we introduce an extension of the sinh Cauchy distribution including a double regression model for both the quantile and scale parameters. This model can assume different shapes: unimodal or bimodal, symmetric or asymmetric. We discuss some properties of the model and perform a simulation study in order to assess the performance of the maximum likelihood estimators in finite samples. A real data application is also presented.


Introduction
A wide range of phenomena can be defined more appropriately using probability distributions. This description is very useful because of the properties associated with a distribution: expectation, shape, range, etc. However, fitting data can be difficult when their distribution is bimodal, which occurs commonly in practice: in astrophysics, the metallicity of the globular cluster system in the Milky Way (see [1]); in ecology, the tree cover of moist savanna and tropical forest ecosystems (see [2]); in genetics, gene expression measurements (see [3]). Other practical examples of bimodality in data can be seen in [4][5][6]. In the literature, there are many proposals discussing bimodal distributions; e.g., the works of [7][8][9][10][11][12]. Bimodal data can be fitted by a mixture of two unimodal distributions. When the mixture is created from the same model, the main difficulty is the non-identifiability of the proposed mixture model. The traditional example is the mixture of normals. Alternatively, the most workable practical method is to use a distribution which already has bimodal properties. For the latter situation, we introduce the gamma-sinh Cauchy (GSC) distribution, proposed by [13]. We note that the initials GSC can be found in the literature as an acronym for Generalized Skew-Cauchy, and readers should be aware of this when reviewing the literature. This model has uni/bimodal properties. However, unlike the distributions discussed in the works mentioned above, in this model, one of the parameters can be interpreted as the q-th quantile under certain conditions. This is very convenient, because it allows covariates to be introduced into non-homogeneous populations directly. The probability density function (pdf) for the GSC distribution is given by where x, µ ∈ R, σ, λ and φ > 0. The corresponding cumulative distribution function (cdf) is given by where G(·; φ) denotes the cdf of the gamma distribution with shape and scale parameters equal to φ and 1 respectively. The GSC can be asymmetric or symmetric and, as the main advantage, its cdf has a closed-form expression which can be generated quickly in many different softwares. This is useful for generating random data, besides defining quantiles. Regression models seek to describe the behavior of a variable of interest (or response) from covariables (explanatory variables). In general, a function called a link function links a characteristic of the response variable to the explanatory variables through parameters estimated from observed data. In our case, the response variable is bimodal and described by the GSC distribution, while the relationship between the response and explanatory variables is through the quantile. This type of relation is known as quantile regression. The literature on parametric models in the context of quantile regressions has increased considerably in recent years. For instance, for responses in the unit interval, see the works of [14,15]; for responses in the positive line see the work of [16]; and for responses in the real line see the works of [17,18].
The main advantage of quantile regression is that it is more robust against outliers. The advantage is that we can have a more informative approach to the response than simply modeling some specific measure of the population, such as the mean or median. The applicability of this method can be seen in ecology [19], econometrics [20], environmetrics [21], and medicine [22], for instance. In general terms, the distributions are not parametrized directly in terms of a general quantile q ∈ (0, 1) or any specific quantile, except for some particular cases. For instance, for the N(θ, σ 2 ) model, θ represents the mean and the median of the distribution, but the q-th quantile is given by θ + z q σ, where z q is the q-th of the standard normal distribution. On the other hand, which quantile is of interest depends on the research. In some cases, small quantiles will be of interest, whereas in other contexts, large quantiles will be the focus. For instance, in a nutrition context, larger quantiles of weight are of interest because they enable the nutritionist to define which patients are at higher risk; on this basis, they can define special treatments for such patients.
In view of the importance of bimodal distributions, the chief object of this paper is to build on the quantile regression structure in the GSC distribution. Double regression has been studied quite extensively in the literature; for example, the authors of [23] consider a regression structure for both components based on a new parameterization indexed by mean and dispersion parameters; in [24], a regression model is proposed that is useful for situations where the variable of interest is continuous and restricted to the positive real line and is related to other variables through the mean and precision parameters; and in [25], a new parameterization of the gamma distribution is used that is indexed by mode and precision parameters.
The paper is organized as follows. In Section 2, we develop the GSC regression model with its properties. In Section 3, we perform a small-scale simulation and evaluate the point estimation. An application to real data, which illustrates the usefulness of the proposed model, is discussed in Section 4. Finally, conclusions are given in Section 5.

The GSC Regression Model
Gómez et al. [13] show that F(µ; µ, σ, λ, φ) = G(log(2), φ) and then, for a fixed q ∈ (0, 1) such as the parameter µ represents the q-th quantile of the distribution. This equation can be solved numerically. Henceforth, we use the notation GSC q (µ, σ, λ) to refer to a random variable with GSC distribution where φ is fixed as in (1), µ is the q-th quantile of the distribution, σ is a scale parameter and λ is a shape parameter. Figure 1 shows the relation between q and φ and the regions where the model is unimodal or bimodal, depending on the modeled quantile q and the parameter λ. Note that, for q ∈ (0, 1), we have (λ 1 , λ 2 ) ∈ R 2 + such that the model is unimodal for λ 1 and bimodal for λ 2 . In the next proposition, we enunciate a property related to the unimodality of the model.
Proof. Since µ and σ are location and scale parameters, without a loss of generality, we can consider µ = 0 and σ = 1. Deriving the pdf of the GSC 0.5 (µ = 0, σ = 1, λ) model in relation to x, we obtain that .
In this case, are the two modes of the distribution.
Suppose now that we are interested in modeling the q-th quantile of the distribution for a non-homogeneous population. We assume that, for a fixed q ∈ (0, 1), the q-th quantile of the distribution µ and the scale parameter σ satisfy the following functional relations: where β 1 (q) = (β 11 (q), . . . , β 1p 1 (q)) and β 2 (q) = (β 21 (q), . . . , β 2p 2 (q)) are vectors of unknown regression coefficients such that β 1 (q) , β 2 (q) ∈ R p 1 +p 2 , with p 1 + p 2 < n; and x 1i = (x 11i , . . . , x 1p 1 i ) and x 2i = (x 21i , . . . , x 2p 2 i ) are the observations of the known regressors p 1 and p 2 . Note that the vector x 1i is linked with the µ i (q) parameter using the identity link; then, interpretations for the covariates in x 1i can be performed using the same idea as an ordinary linear regression. For instance, if the jth covariate 1 ≤ j ≤ p 1 is a quantitative variable, then, fixing the rest of the covariates, the q-th quantile of the distribution increases by β 1j (q) units when x 1j is increased to x 1j + 1. Similarly, for the regression coefficients related to the scale parameter, after fixing the rest of the covariates, the scale of the distribution for the q-th quantile of the distribution is increased by exp(β 2k (q)) units when x 2k is increased to x 2k + 1, 1 ≤ k ≤ p 2 . We highlight that in [13], a regression structure was assumed only for µ i (q). However, the assumption that all the observations have the same scale parameter could be unrealistic, as each observation could have its own scale. For this reason, it seems reasonable to assume this double regression structure. In this setting, the log-likelihood function for ψ(q) = (β 1 (q), β 2 (q), λ), up to a constant, is given by The maximum likelihood (ML) estimator of ψ(q), say ψ(q), is obtained by maximizing (ψ(q)) in relation to ψ(q). For this model, such a maximization procedure does not provide a closer form, meaning that numerical procedures need to be implemented. Specifically, we use the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method; see [26] (p. 199). This procedure is implemented in the R software [27]. The programs are available on request. Finally, under regularity conditions, ψ(q) satisfies that where N p (0, I p ) denotes the standard multivariate distribution and H( ψ) denotes the estimated Hessian matrix of the log-likelihood function in relation to ψ.

Simulation Study
In this section, we present a simulation study to evaluate the performance of ML estimates in finite samples. The computational procedure is implemented using R software [27]. Values of the GSC distribution were drawn using inverse transform sampling. We considered a scheme with two covariates, both simulated from the uniform distribution between −2 and 2. We considered combinations of values for the quantile q : 0.10, 0.50 and 0.75; vectors for β 1  We also considered three sample sizes: 100, 200 and 500. Based on 5000 replicates, we compute the mean of the estimated bias for each estimator (bias), the mean of the estimated standard errors (SE), the root of the estimated mean squared error (RMSE), and the 95% coverage probabilities (CP). Table 1 summarizes the results. From Table 1, it can be observed that the bias, SE, and RMSE for all the parameters tend to approach zero when the sample size is increased, showing that the ML estimates obtained are asymptotically consistent. On the other hand, the CP values are closer to the nominal values used in their construction (95%), suggesting that the asymptotic distribution in Equation (3) is reasonable, even in finite samples.

Application
To illustrate the GSC double regression model, we consider the Australian data set available in the package sn in R [28], which includes data on 202 athletes collected at the Australian Institute of Sport. Codes were performed in [27] and are available upon request. Our main aim is to explain the body fat percentage (Bfat) in terms of the body mass index (bmi) and the lean body mass (lbm). Particularly, we consider Bfat i (q) ∼ GSC(µ i (q), σ i (q), λ, φ(q)), where φ(q) satisfies (1), q ∈ (0, 1) and for i = 1, . . . , 202, we have that In other words, the bmi and lbm explain both the q-th quantile of Bfat and the scale of the distribution. The same structure of covariates was considered in [13], but without modeling the scale parameter; i.e., considering β 22 (q) = β 23 (q) = 0. We refer to those models as GSC and GSC 0 for the cases where σ is modeled and not modeled, respectively. We considered q ∈ {0.1, 0.25, 0.5, 0.75, 0.9}. Our approach is compared with the skewed Laplace (SKL) model in [29]. Table 2 shows the Akaike information criterion (AIC [30]) for the three models. We also present the statistic for the likelihood ratio test (LRT) to test H 0 : β 22 (q) = β 23 (q) = 0 versus H 0 : β 22 (q) = 0 or β 23 (q) = 0, for all the quantiles considered. In addition, we also compute the quantile residuals [31] for the GSC model. If the model is correctly specified, the residual should be a random sample from the standard normal distribution. We checked this assumption with the traditional Kolmogorov-Smirnov test. Note that the GSC presents the lowest AIC for the quantiles up to the median, and GSC 0 presents the lowest AIC for the rest of the quantiles. This is explained because, according to the LRT, the coefficients related to the bmi and lbm variables are not significant (under any common level of significance) for modeling the scale parameter for q = 0.75 and q = 0.9, while they are significant for the rest of the quantiles. Finally, based on the quantile residuals, the GSC double regression model seems to be appropriate for modeling all quantiles, except the largest. Figure 2 also shows the regression coefficients in terms of the quantiles and their respective 95% confidence intervals. Note that β 12 (q) and β 13 (q) are significant (based on 5% significance) for all the quantiles considered; i.e., the bmi and lbm variables are relevant for explaining the different quantiles of Bfat. Specifically, we can obtain the following interpretations for β 12 (q) and β 13  We highlight the large difference between the interpretations for athletes in the lowest 10% and highest 10% of Bfat.
Finally, Figure 3 shows the different estimated pdf values for the Bfat under different scenarios for bmi and lbm. We note that these estimated pdf values assumed different shapes-unimodal, bimodal symmetric, and bimodal asymmetric-justifying the use of the double regression GSC model in this example. Finally, Figure 4 shows the pairs (λ, q) for the five quantiles modeled, identifying the unimodal and bimodal cases.

Conclusions
In this paper, we present a new extension of the GSC model, introducing a double regression structure to model both the quantile and the scale of the distribution. This structure produces a competitive model for modeling heterogeneous populations with different shapes: unimodal symmetric, unimodal asymmetric, bimodal symmetric, and bimodal asymmetric. The illustration with a real data set shows that the model provides better performance than other proposals in the literature. A limitation of the model is that the shape parameter is common for all the observations. Further extensions should include covariates in this parameter also.