A New Quantile Regression for Modeling Bounded Data under a Unit Birnbaum–Saunders Distribution with Applications in Medicine and Politics

: Quantile regression provides a framework for modeling the relationship between a response variable and covariates using the quantile function. This work proposes a regression model for continuous variables bounded to the unit interval based on the unit Birnbaum–Saunders distribution as an alternative to the existing quantile regression models. By parameterizing the unit Birnbaum–Saunders distribution in terms of its quantile function allows us to model the effect of covariates across the entire response distribution, rather than only at the mean. Our proposal, especially useful for modeling quantiles using covariates, in general outperforms the other competing models available in the literature. These ﬁndings are supported by Monte Carlo simulations and applications using two real data sets. An R package, including parameter estimation, model checking as well as density, cumulative distribution, quantile and random number generating functions of the unit Birnbaum–Saunders distribution was developed and can be readily used to assess the suitability of our proposal.


Introduction
Recently, researchers are interested in deriving distributions with bounded support. Some of these distributions are the unit Birnbaum-Saunders (UBS) [1], unit gamma [2], unit Gompertz [3], unit inverse-Gaussian [4], unit Lindley [5,6], unit Weibull (UWEI) [7][8][9], and trapezoidal beta [10] models, among others. This interest is due to several natural and anthropogenic phenomena are measured as indexes, percentages, proportions, rates and ratios, which are bounded on a certain interval, usually the unit interval. The need for modeling and analyzing bounded data occurs in many fields of real life such as in medicine [8], politics [11], and psychology [12]. Hence, to statistically describe this kind of data, distributions with bounded support are needed.
A series of distributions with support on the unit interval have been proposed in the literature based on transformations of the cumulative distribution function (CDF). Among them, we can mention the log-Bilai [13], log-shifted Gompertz [14], unit Gompertz [3], unit inverse-Gaussian [4], unit Lindley [5], and UWEI [8] distributions.
In order to assess the influence of one or more covariates on the mean of the distribution of a response variable bounded on the unit interval, the following distributions can be used: beta [15], beta rectangular [16], log-Bilai [13], log-Lindley [17], log-weighted exponential [18], quasi-beta [19], simplex [20], unit gamma [2], and unit Lindley [5,6,21]. To investigate the relationship between the response mode and covariates, we can consider the beta [22], Kumaraswamy (KUMA), unit gamma, and unit Gompertz [23] distributions. 3 (1) and where 0 < x < 1, α > 0 is the shape parameter and Φ is the CDF of the standard normal distribution. It is noteworthy that δ = exp(−β) is a scale parameter and is also the median of the distribution of X, since F(δ; α, β) = 0.5. In addition, the r-th moment of X is given by By inverting the CDF defined in (2), we have the QF stated as where 0 < τ < 1 and Φ −1 is the QF of the standard normal distribution. Now, by solving the linear equation Q(τ; α, β) − µ = 0 in β obtained from (3), we have where h(α, τ) = −(1/2){2 + [αΦ −1 (1 − τ)] 2 − αΦ −1 (1 − τ) 4 + αΦ −1 (1 − τ)}, and g −1 is the inverse function of g, which is assumed to be a quantile link function being strictly increasing, twice differentiable and mapping (0, 1) into R. Hence, we can parameterize (1) and (2), as a function of µ, by means of and where 0 < x < 1, 0 < µ < 1, α > 0, and τ is fixed. Figure 1 shows some possible shapes of the PDF of the UBS distribution for selected values of the parameters µ, α and τ. The behaviors of the mean, variance, coefficient of skewness and coefficient of kurtosis are shown in Figure 2.
An advantage of the parameterization stated in (5) and (6) is that µ is the τ-th quantile and, consequently, the interpretation of this parameterization becomes more interesting in applications. Furthermore, it is possible to consider that the parameter µ depends on covariates; see Section 3.   The standard ML method can be used to estimate the UBS parameters by maximizing the corresponding log-likelihood function. Let x = (x 1 , . . . , x n ) be a realization of the random sample X = (X 1 , . . . , X n ), where X 1 , . . . , X n are independent and identically distributed random variables according to (1). Using (5), the corresponding log-likelihood function, with θ = (µ, α), is written as x, τ)/∂α, and h (α, τ) = ∂h(α, τ)/∂α, we get the associated score vector with coordinates stated as By solving numerically the system of non-linear equations in µ and α, formed by the coordinates of U(θ; x, τ) defined in (7) and (8), we have the ML estimate of θ. The asymptotic standard errors (SEs) of the corresponding estimators are obtained by inverting the Hessian matrix of the log-likelihood function. This Hessian matrix is generated by taking the second derivative of (7), with respect to µ and α.
For τ = 0.5, we have that while µ must be found numerically taking α = ξ( µ) in the expression defined in (7).

The UBS Quantile Regression Model
In this section, considering the parameterized PDF stated in (5), we formulate the UBS quantile regression model. Let X 1 , . . . , X n be n independent random variables, where each X i , for i = 1, . . . , n, follows the PDF given in (5) with unknown quantile parameter µ i , unknown shape parameter α, and τ ∈ (0, 1) is assumed as fixed, that is, X i ∼ UBS(µ i , α; τ). Here, the UBS quantile regression model is defined imposing that the quantile µ i of X i satisfies the functional relation expressed by where δ = (δ 0 , . . . , δ p−1 ) is a p-dimensional vector of unknown regression coefficients, with p < n and n being the sample size, z i = (1, z i1 , . . . , z i(p−1) ) denotes the observations on p known covariates, and g is given as in (4). There are several possibilities for the link function g stated in (9). For instance, the most useful well-known link functions are: which are the inverse CDF of the logistic, standard normal, minimum extreme-value, maximum extreme-value and Cauchy distributions, respectively. Let X 1 , . . . , X n be n independent random variables such that X i ∼ UBS(µ i , α; τ) denotes a UBS distributed random variable and consider that to ensure that the predicted quantiles lie within the unit interval. For fixed τ ∈ (0, 1), let now θ = (δ , α) be the vector of p + 1 unknown parameters to be estimated. The ML estimate θ = ( δ , α) of θ is obtained by the maximizing the log-likelihood function stated in (7). It is not possible to derive analytical solution for the ML estimate θ so that it must be calculated numerically using some optimization algorithm such as Newton-Raphson and quasi-Newton. Based on [15], we suggest to use, as an initial guess for δ, the ordinary least squares estimate of this parameter vector obtained from the linear regression of the transformed responses g(x 1 ), . . . , g(x n ) on Z, that is, Under mild regularity conditions and when the sample size n is large, the asymptotic distribution of the ML estimator θ = ( δ , α) is approximately multivariate normal (of dimension p + 1) with mean vector θ = (δ , α) and variance covariance matrix ∂θ ∂θ is the expected Fisher information matrix. Note that there is no closed form expression for the matrix K(θ). Nevertheless, as shown in [52], the estimated observed Fisher information matrix ∂θ ∂θ θ= θ is a consistent estimator of the expected Fisher information matrix K(θ). Therefore, for large n, we can replace K(θ) by J( θ).
Let θ r be the r-th component of θ. The asymptotic 100 × (1 − γ)% confidence interval for θ r is given by [ θ r ± z γ/2 SE( θ r )], with r = 1, . . . , p + 1, where z γ/2 is the 100 × γ/2 upper quantile of the standard normal distribution and SE( θ r ) is the estimated asymptotic SE of θ r . Note that SE( θ r ) is the square root of the r-th diagonal element of the matrix J −1 ( θ).
Due to the direct interpretation of the parameters in terms of odds, in this paper, we consider only the logit link. When µ i is the τ-th quantile, for 0 < τ < 1, the interpretations are straightforward. In addition, a strictly positive link function relating the shape parameter β with covariates w i , not necessarily equal to x i , can be considered. Of course, other link functions might be explored.

A Monte Carlo Simulation Study
In this section, we present the results of the Monte Carlo simulations used to assess the empirical bias and root mean-squared error (RMSE) of the ML estimators of the UBS quantile regression parameters. We also present the coverage probability (CP) of the 95% confidence interval (CP 95% ) based on asymptotic normality of the ML estimators. We consider sample sizes n = 20, 50, 100, 200, 300; τ = 0.10, 0.25, 0.50, 0.75, 0.90; α = 0.5, 1.0, 2.0, and two regression frameworks formulated as For each combination of (n, τ, α) and both regression frameworks (the covariate(s) remained constant throughout the simulations), B = 5000 pseudo-random samples were simulated using the SAS Data-Step, while parameter estimates were obtained by the quasi-Newton method in SAS PROC NLMIXED. The values of the response variable, given n, τ, α and the covariate(s), are generated from the quantile function stated as The empirical bias, RMSE and CP are calculated, respectively, by , where = α, δ 0 , δ 1 , or δ 2 , I is the indicator function, and SE( i ) is the corresponding estimated SE. Tables 1-3 and Tables 4-6 report the results for the first and second regression structures respectively. These tables reveal a low bias in the estimation of α and δ for all scenarios. The empirical RMSE is also low and quickly tends to zero as the sample size increases. Higher values of bias and RMSE are observed as the quantiles are distant from τ = 0.5, either from the left or right. Still, for all scenarios, the CP tends to the nominal confidence coefficient as the sample size increases. In summary, the simulations reveal that the UBS quantitative regression model has its parameters well estimated according to the metrics considered and can be an alternative to other models available in the literature. Table 1. Empirical bias, RMSE and 95% CP for the true values δ 0 = 1.0, δ 1 = 2.0 and α = 0.5 of the indicated quantile level (τ), sample size (n) and parameter (δ 0 , δ 1 , α) with simulated data.   Table 3. Empirical bias, RMSE and 95% CP for the true values δ 0 = 1.0, δ 1 = 2.0 and α = 2.0 of the indicated quantile level (τ), sample size (n) and parameter (δ 0 , δ 1 , α) with simulated data.  Table 4. Empirical bias, RMSE and 95% CP for the true values δ 0 = 1.0, δ 1 = 1.0, δ 2 = 2.0 and α = 0.5 of the indicated quantile level (τ), sample size (n) and parameter (δ 0 , δ 1 , α) with simulated data.

Empirical Application with Data from Political Sciences
In this and the following section, for illustrative purposes, we apply the UBS quantile regression model in the analysis of two real data sets. In addition to the UBS model, we also consider quantile regression models based on the KUMA, LEEG, UBUR, UCHE, UHN, unit logistic (ULOG) and UWEI, distributions. The PDF, CDF and QF for these distributions are presented in Appendix A.
The objective of these sections is not to present all numerous approaches to variable selection, regression diagnostics, link function selection or parameter interpretation, but rather suggest the use of the UBS quantile regression as an alternative to other models available in the literature.
In this first application, the response variable indicates the proportion of voters in the 2014 Brazil presidential election, received by the elected president, Dilma Rousseff, in 74 municipalities of the state of Sergipe. This state is located in the Northeast Region of Brazil and is formed by 75 municipalities. The data set is available in the library baquantreg [53] from R and was considered the human development index (HDI) in 2010 as covariate. We call this set "vote data" and assume the regression structure for µ i formulated as logit(µ i ) = δ 0 + δ 1 HDI i , for i = 1, . . . , 74. Table 7 reports the ML estimates and SEs for the indicated models. Note the difference between the estimates of δ 0 and δ 1 in the models. The rate of change in the conditional quantile, expressed by the estimated regression coefficients, is illustrated in Figure 3. Observe how the rate of change in the proportion of votes depends on the quantile. As also shown in Figure 3, the first plot on the left, a higher HDI indicates less proportion of votes received by the elected president. For the other models, the ML estimates were obtained from the unitquantreg package, which is under development. Table 8 presents the values of the Akaike information criterion used to compare various competing models. These values indicate that the UBS quantile regression model is the best one. This better performance, of course, is not a rule and depends on the data under analysis, but this fact highlights the importance of our new proposal for such data. In order to assess the goodness of fit of the UBS quantile regression model to the vote data, in Figure 4 are shown the half-normal plots with simulated envelopes for the Cox-Snell and randomized quantile residuals. These figures indicate a good fit of the UBS quantile regression model to the τ-th proportion of votes, for τ ∈ (0.10, 0.25, 0.50, 0.75, 0.90).    Figure 4. Half-normal plots with envelopes of Cox-Snell (first row) and randomized quantile (second row) residuals for the indicated τ quantile level with vote data.

Empirical Application with Data from Sports Medicine
In this second application, we consider a data set taken from [54], also available at http: //www.leg.ufpr.br/doku.php/publications:papercompanions:multquasibeta (accessed on 9 April 2021). These data are related to the body fat percentage, which was measured at five regions: android, arms, gynoid, legs, and trunk. The fat percentages at android, arms, gynoid, legs, and trunk correspond to the five response variables considered in the original study where these data were collected. The data set contains 298 observations and the covariates are: age (in years), body mass index (in kg/m 2 ), gender (female or male) and IPAQ (sedentary (S), insufficiently active (I), or active (A)). As described in [55], the IPAQ is a questionnaire that allows the estimation of weekly time spent on physical activities of moderate and strong intensity, in different contexts of daily life, such as: housework, leisure, transportation, and work, as well as the time spent in passive activities performed on the seating position.
We analyze the response variable body fat percentage at arms. We call this set as "arm data". The results of the analyses for the other four responses are available under request from the authors. For the response variable analyzed here, we assume µ i is given by logit(µ i ) = δ 0 + δ 1 z i1 + δ 2 z i2 + δ 3 z i3 + δ 4 z i4 + δ 5 z i5 , for i = 1, . . . , 298, where z i1 : age; z i2 : body mass index; z i3 : 0 for female, 1 for male; z i4 : 0 for IPAQ = S, 1 for IPAQ = I; and z i5 : 0 for IPAQ = S, 1 for IPAQ = A. Table 9 reports the ML estimates and SEs for the indicated models. Note the difference between the estimates of δ j , for j = 0, 1, . . . , 5, in the models. The rate of change in the conditional quantile, expressed by the estimated regression coefficients, is illustrated in Figure 5. Note how the rate of change in body fat percentage at arms depends on the quantile. Table 10 presents the values of the Akaike information criterion used to compare various models. These values indicate that the UBS quantile regression model is the best one. In order to assess if this model fits the data well, in Figure 6 are shown the halfnormal plots with simulated envelopes for the residuals. This figure displays a good fit of the UBS quantile regression model to the τ-th percentage of body fat at arms, for τ ∈ (0.10, 0.25, 0.50, 0.75, 0.90).   Theoretical quantiles (τ = 0.90) Figure 6. Half-normal plots with envelopes of Cox-Snell (first row) and randomized quantile (second row) residuals for the indicated τ quantile level with arm data.

Conclusions, Limitations and Future Investigation
Although the quantile regression methodology appeared in 1978 [34], few works consider it from a strictly parametric point of view. Some of these recently published works used distributions for responses restricted to the zero-one interval.
In this paper, we extended the unit Birnbaum-Saunders distribution by parameterizing its scale parameter in terms of a quantile that can depend on covariates. This extension employed a quantile parameterization that permitted us to state a setting mimicking the generalized linear models, giving wide flexibility in the formulation. We estimated the model parameter with the maximum likelihood method and used Cox-Snell and randomized quantile residuals to assess the adequacy of our formulation to the data, as well as the Akaike information criterion for model selection.
We conducted a Monte Carlo simulation study to empirically evaluate the statistical performance of the maximum likelihood estimators of the unit Birnbaum-Saunders quantile regression parameters. We also studied coverage probabilities of the confidence intervals of the corresponding parameters based on the asymptotic normality of these estimators. This simulation study reported the good statistical performance of such estimators.
Two data analyses were conducted related to political sciences and sports medicine with seven competing models to the unit Birnbaum-Saunders quantile regression. Both of these analyses reported a suitable performance of the new regression quantile model and superior to all of the competing models, giving evidence that the unit Birnbaum-Saunders distribution is an excellent alternative for quantile modeling and for dealing with bounded data into the unit interval. Such results reported that the unit Birnbaum-Saunders quantile regression model may be a new setting for analyzing this type of data. The new approach may be a good addition to the tools of statisticians and diverse practitioners interested in the modeling of quantiles.
A limitation of our approach is that the explanatory variables may condition quantiles and also the shape parameter. Thus, the effect of this parameter is a topic to be studied in a future work based on the line presented in [56,57] when jointly modeling two parameters. Other topics of future research are related to studying how multivariate, spatial, temporal structures could be added into the quantile regression framework [58][59][60][61]. In addition, Tobit and Cobb-Douglas type frameworks may be analyzed in the thematic of this investigation [62,63]. Moreover, censored observation might be studied in the present context [64].
The authors are working on these and other issues associated with the present investigation and the corresponding findings are expected to be reported in future works.

Acknowledgments:
The authors would also like to thank the Editors and Reviewers for their constructive comments which led to improve the presentation of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.