Mathematics
  • Article
  • Open Access

21 April 2022

Vasicek Quantile and Mean Regression Models for Bounded Data: New Formulation, Mathematical Derivations, and Numerical Applications

1 Department of Statistics, Universidade Estadual de Maringá, Maringá 87020-900, Brazil
2 Department of Measurement and Evaluation, Artvin Coruh University, Artvin 08100, Turkey
3 School of Industrial Engineering, Pontificia Universidad Católica de Valparaíso, Valparaíso 2362807, Chile
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Machine Learning and Statistical Modeling with Applications in Real-World Data and Artificial Intelligence

Abstract

The Vasicek distribution is a two-parameter probability model with bounded support on the open unit interval. This distribution allows for different and flexible shapes and plays an important role in many statistical applications, especially for modeling default rates in the field of finance. Although its probability density function resembles some well-known distributions, such as the beta and Kumaraswamy models, the Vasicek distribution has not been considered to analyze data on the unit interval, especially when we have, in addition to a response variable, one or more covariates. In this paper, we propose to estimate quantiles or means, conditional on covariates, assuming that the response variable is Vasicek distributed. Through appropriate link functions, two Vasicek regression models for data on the unit interval are formulated: one considers a quantile parameterization and another one its original parameterization. Monte Carlo simulations are provided to assess the statistical properties of the maximum likelihood estimators, as well as the coverage probability. An R package developed by the authors, named vasicekreg, makes available the results of the present investigation. Applications with two real data sets are conducted for illustrative purposes: in one of them, the unit Vasicek quantile regression outperforms the models based on the Johnson-SB, Kumaraswamy, unit-logistic, and unit-Weibull distributions, whereas in the second one, the unit Vasicek mean regression outperforms the fits obtained by the beta and simplex distributions. Our investigation suggests that unit Vasicek quantile and mean regressions can be of practical usage as alternatives to some well-known models for analyzing data on the unit interval.

1. Introduction

Modeling of bounded data has been extensively discussed in the literature to infer some statistical indicators such as mortality rates, recovery rates, as well as economic and risk measures. In diverse areas of knowledge, researchers are faced with continuous variables expressed as indices, proportions, rates, and ratios, among others. When there is an interest in explaining the relationship between these variables and some covariates, it is necessary to use a regression model that, in general, offers a summary of the mean of the response variable conditioned to the covariates. However, since the mean is clearly affected by extreme values in the data, inference from mean regression models is also affected by this situation.
When the response variable is skew-distributed, a robust modeling approach is more suitable than mean regression; some works related to skew distributions are presented in [1,2,3]. For this reason, quantile regression [4] can be proposed as an alternative.
If the interest is in evaluating the effect of covariates on different quantiles of the distribution of the response variable, including the median, the quantile regression models are also appropriate. For a parametric analysis, the model is constructed considering a distribution supported in the same range as the response variable.
Parametric quantile regression models for variables bounded on the unit interval, that is, on (0, 1), have often appeared recently. These models are formulated by first parameterizing the baseline distribution in terms of a quantile. Then, a regression-based functional form through a link function is introduced. For example, models based on the arcsecant hyperbolic Weibull [5], Lambert-uniform [6], unit-Birnbaum–Saunders [7,8,9,10], unit-Burr-XII [11], exponentiated arcsecant hyperbolic normal [12], unit-Chen [13], log-extended exponential-geometric [14], Johnson-t [15], power Johnson SB [16], L-logistic [17], unit-Weibull [18,19,20], generalized Johnson SB [21], and Kumaraswamy [22] distributions have been postulated. These formulations have been proposed to model the conditional quantiles of a bounded response variable and are well known in the literature [23].
Although many other quantile regression models for bounded data are already available, none of them is a panacea for the diverse problems that a practitioner confronts. In this sense, it is important to propose and evaluate new alternatives. For example, the Vasicek (VASI) distribution, proposed in [24], is an alternative probability distribution bounded on the (0, 1) interval. This distribution has two parameters, both defined on that interval, and one of them is equal to the model mean. Hence, unlike other models whose parameters do not allow a direct interpretation, the VASI distribution has the advantage that the mean is one of its parameters. This also permits us to model the distribution mean in the presence of covariates. The VASI distribution is typically employed to model economic data on the (0, 1) interval. The probability density function (PDF) of such a distribution can take different forms, such as U-shaped, J-shaped, or uniform [25]. Note that the literature on the VASI distribution is rather scarce.
Based on the above, the objective of our investigation is to introduce an alternative to the parametric quantile regression models for skew-distributed response variables restricted to the (0, 1) interval by considering the VASI distribution. In doing so, we extend the range of applications of this distribution.
This paper is organized as follows. Section 2 introduces the VASI distribution parameterized either in terms of its quantile or its mean. Directions regarding the maximum likelihood (ML) estimation method in both quantile and mean parameterizations are presented in Section 3. In this section, diagnostics based on residuals are also considered. In Section 4, we provide two Monte Carlo simulation analyses to assess some statistical properties of the ML estimators, such as bias and mean squared error, in a regression-based functional form through the logit link function. The potential and competitiveness of the proposed models are shown in Section 5 through two applications involving real-world data. Section 6 presents some final considerations about the proposed model. Finally, in Appendix A, we characterize other distributions that appear in this paper.

2. The Vasicek Distribution

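In this section, we introduce the VASI distribution parameterized either in terms of its quantile or its mean. A random variable $Y$ with bounded support on the $(0, 1)$ interval is VASI distributed [24] if its PDF, cumulative distribution function (CDF), and quantile function (QF) are written, respectively, as
$$f(y; \alpha, \theta) = \sqrt{\frac{1-\theta}{\theta}} \exp\left\{ \frac{1}{2} \left[ \Phi^{-1}(y)^2 - \left( \frac{\Phi^{-1}(y)\sqrt{1-\theta} - \Phi^{-1}(\alpha)}{\sqrt{\theta}} \right)^2 \right] \right\}, \quad F(y; \alpha, \theta) = \Phi\left( \frac{\Phi^{-1}(y)\sqrt{1-\theta} - \Phi^{-1}(\alpha)}{\sqrt{\theta}} \right), \tag{1}$$
$$Q(\tau; \alpha, \theta) = F^{-1}(\tau; \alpha, \theta) = \Phi\left( \frac{\Phi^{-1}(\alpha) + \Phi^{-1}(\tau)\sqrt{\theta}}{\sqrt{1-\theta}} \right), \tag{2}$$
where $0 < y, \alpha, \theta, \tau < 1$, whereas $\Phi$ and $\Phi^{-1}$ denote, respectively, the standard normal CDF and QF.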
Note that $\theta$ is the shape parameter of the VASI distribution, and its mean and variance are, respectively, $E(Y) = \alpha$ and $\mathrm{Var}(Y) = \Phi_2\left(\Phi^{-1}(\alpha), \Phi^{-1}(\alpha), \theta\right) - \alpha^2$, where $\Phi_2$ is the bivariate standard normal CDF with correlation $\theta$ [26] (p. 900) calculated from the double integral stated as
$$\Phi_2(a, b, c) = \frac{1}{2\pi\sqrt{1-c^2}} \int_{-\infty}^{a} \int_{-\infty}^{b} \exp\left\{ -\frac{x^2 - 2cxy + y^2}{2(1-c^2)} \right\} \mathrm{d}y\, \mathrm{d}x.$$
As presented in [27], the PDF defined in (1) is unimodal, with mode $\Phi\left[\Phi^{-1}(\alpha)\sqrt{1-\theta}/(1-2\theta)\right]$, when $\theta < 0.5$; monotone, when $\theta = 0.5$; and U-shaped, when $\theta > 0.5$. Figure 1 shows the PDF defined in (1) for different values of $\theta$ and $\alpha$. Observe that the distribution possesses a symmetry property, that is, $F(y; \alpha, \theta) = 1 - F(1-y; 1-\alpha, \theta)$.
Figure 1. Plots of the PDF stated in (1) for different combinations of θ and α .
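The functions in (1) and (2) are easy to check numerically. The following is a minimal base-R sketch; the names dvasi, pvasi, and qvasi are illustrative choices for this example and are not claimed to be the interface of the vasicekreg package. It verifies that the density integrates to one, that the mean equals α, and that the mode formula for θ < 0.5 agrees with a numerical maximization.

```r
# Vasicek PDF, CDF and QF in the original (alpha, theta) parameterization.
# dvasi, pvasi and qvasi are illustrative names, not a package API.
dvasi <- function(y, alpha, theta) {
  qy <- qnorm(y)
  z  <- (qy * sqrt(1 - theta) - qnorm(alpha)) / sqrt(theta)
  sqrt((1 - theta) / theta) * exp(0.5 * (qy^2 - z^2))
}
pvasi <- function(y, alpha, theta) {
  pnorm((qnorm(y) * sqrt(1 - theta) - qnorm(alpha)) / sqrt(theta))
}
qvasi <- function(tau, alpha, theta) {
  pnorm((qnorm(alpha) + qnorm(tau) * sqrt(theta)) / sqrt(1 - theta))
}

alpha <- 0.3; theta <- 0.2   # arbitrary illustrative values with theta < 0.5

# Total probability and mean by numerical integration.
area   <- integrate(dvasi, 0, 1, alpha = alpha, theta = theta)$value
mean_y <- integrate(function(y) y * dvasi(y, alpha, theta), 0, 1)$value

# Mode: closed form for theta < 0.5 versus numerical maximization of the PDF.
mode_formula <- pnorm(qnorm(alpha) * sqrt(1 - theta) / (1 - 2 * theta))
mode_numeric <- optimize(dvasi, c(0, 1), alpha = alpha, theta = theta,
                         maximum = TRUE)$maximum

print(c(area = area, mean = mean_y,
        mode_formula = mode_formula, mode_numeric = mode_numeric))
```

The QF inverts the CDF, so pvasi(qvasi(tau, alpha, theta), alpha, theta) returns tau up to rounding.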
Note that we can assess the effect of covariates on the mean of the distribution of $Y$ through some appropriate link function. Moreover, from the QF formulated in (2), we can re-parameterize $\alpha$ in terms of the $\tau$-th quantile, for $\tau \in (0, 1)$, considering a one-to-one transformation $(\alpha, \theta) \mapsto (\mu, \theta)$, as
$$\alpha = h^{-1}(\mu) = \Phi\left( \Phi^{-1}(\mu)\sqrt{1-\theta} - \Phi^{-1}(\tau)\sqrt{\theta} \right), \tag{3}$$
where $\mu$ is, for a fixed and known value $\tau$, the $\tau$-th quantile of the distribution of $Y$. Figure 2 shows the PDF stated in (1) re-parameterized and assuming different values for $\mu$, $\theta$, and $\tau$. We consider the first decile ($\tau = 0.10$), the first quartile ($\tau = 0.25$), the median ($\tau = 0.50$), the third quartile ($\tau = 0.75$), and the ninth decile ($\tau = 0.90$) for $\tau$. The dotted vertical lines in Figure 2 indicate where the parameter $\alpha$ is located according to the values of $\tau$. Observe that $F(y; \mu, \theta, \tau) = 1 - F(1-y; 1-\mu, \theta, 1-\tau)$.
Figure 2. Plots of the PDF stated in (1) with quantile parameterization for different combinations of θ , μ , and τ .
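As a quick illustration of the quantile parameterization, the sketch below maps (μ, θ, τ) to α through the transformation in (3) and confirms that the CDF evaluated at μ returns τ. The function names alpha_from_mu and pvasiq are hypothetical helpers introduced only for this example.

```r
# Map the tau-th quantile mu to the mean parameter alpha, as in (3).
alpha_from_mu <- function(mu, theta, tau) {
  pnorm(qnorm(mu) * sqrt(1 - theta) - qnorm(tau) * sqrt(theta))
}

# CDF under the quantile parameterization (mu, theta) for a fixed tau.
pvasiq <- function(y, mu, theta, tau) {
  alpha <- alpha_from_mu(mu, theta, tau)
  pnorm((qnorm(y) * sqrt(1 - theta) - qnorm(alpha)) / sqrt(theta))
}

mu <- 0.4; theta <- 0.3                 # arbitrary illustrative values
taus <- c(0.10, 0.25, 0.50, 0.75, 0.90)

alphas   <- sapply(taus, function(tau) alpha_from_mu(mu, theta, tau))
cdf_at_mu <- sapply(taus, function(tau) pvasiq(mu, mu, theta, tau))

# Each row: quantile level, implied mean alpha, and F(mu), which should equal tau.
print(cbind(tau = taus, alpha = alphas, F_at_mu = cdf_at_mu))
```

For a fixed μ, the implied mean α decreases as τ increases, which is what the dotted vertical lines in Figure 2 display.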
In summary, assuming that $Y$ is VASI distributed, we can fit regression models for estimating both the mean and the quantile of the response variable, for a fixed $\tau$-level. Its desirable properties may distinguish it from some other distributions proposed in the literature for data on the $(0, 1)$ interval.

3. Maximum Likelihood Estimation and Residuals Analysis

In this section, we describe the ML estimation considering the parameterizations $(\alpha, \theta)$, $(\mu, \theta)$, $(\alpha(x), \theta(z))$, and $(\mu(x), \theta(z))$, where $x$ and $z$ are values of the covariate vectors. Some directions on residual analysis are also provided here.
Let $Y = (Y_1, \ldots, Y_n)^\top$ be a random sample of size $n$ from a VASI distribution with unknown parameters $\alpha$ and $\theta$ and consider that $y = (y_1, \ldots, y_n)^\top$ are its observed values. Based on the PDF stated in (1), the corresponding log-likelihood function is proportional to
$$\ell(\alpha, \theta; y) \propto \frac{n}{2} \log\left( \frac{1-\theta}{\theta} \right) - \frac{1}{2\theta} \sum_{i=1}^{n} \left[ \Phi^{-1}(y_i)\sqrt{1-\theta} - \Phi^{-1}(\alpha) \right]^2. \tag{4}$$
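The ML estimates of $\alpha$ and $\theta$ are obtained by maximizing the log-likelihood function stated in (4), that is, by solving the score equations $U_\alpha = \partial \ell(\alpha, \theta; y)/\partial \alpha = 0$ and $U_\theta = \partial \ell(\alpha, \theta; y)/\partial \theta = 0$ written, respectively, as
$$U_\alpha = \frac{1}{\phi\left(\Phi^{-1}(\alpha)\right)\theta} \sum_{i=1}^{n} \left[ \Phi^{-1}(y_i)\sqrt{1-\theta} - \Phi^{-1}(\alpha) \right], \tag{5}$$
$$U_\theta = \frac{n}{2\theta(\theta-1)} + \frac{1}{2\theta\sqrt{1-\theta}} \sum_{i=1}^{n} \left[ \Phi^{-1}(y_i)\sqrt{1-\theta} - \Phi^{-1}(\alpha) \right] \Phi^{-1}(y_i) + \frac{1}{2\theta^2} \sum_{i=1}^{n} \left[ \Phi^{-1}(y_i)\sqrt{1-\theta} - \Phi^{-1}(\alpha) \right]^2, \tag{6}$$
where $\phi$ denotes the standard normal PDF.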
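By solving the equation $U_\alpha = 0$, obtained from (5), in $\alpha$, we have
$$\hat{\alpha} = \Phi\left( \frac{\sqrt{1-\hat{\theta}}}{n} \sum_{i=1}^{n} \Phi^{-1}(y_i) \right) = \xi(\hat{\theta}),$$
while $\hat{\theta}$ is calculated numerically considering $\alpha = \xi(\theta)$ in the formula stated in (6). The estimated asymptotic standard errors (SEs) of $\hat{\alpha}$ and $\hat{\theta}$ are computed from the corresponding inverse Hessian matrix. Such a matrix is obtained from the second-order derivatives of the expression given in (4) with respect to $\alpha$ and $\theta$. Note that
$$\frac{\mathrm{d}}{\mathrm{d}\alpha} \Phi^{-1}(\alpha) = \frac{1}{\phi\left(\Phi^{-1}(\alpha)\right)}, \quad \frac{\mathrm{d}}{\mathrm{d}\alpha} \phi\left(\Phi^{-1}(\alpha)\right) = -\frac{\Phi^{-1}(\alpha)}{\sqrt{2\pi}\,\phi\left(\Phi^{-1}(\alpha)\right)} \exp\left\{ -\frac{1}{2} \Phi^{-1}(\alpha)^2 \right\} = -\Phi^{-1}(\alpha), \quad \frac{\mathrm{d}^2}{\mathrm{d}\alpha^2} \Phi^{-1}(\alpha) = -\frac{\mathrm{d}}{\mathrm{d}\alpha} \phi\left(\Phi^{-1}(\alpha)\right) \frac{1}{\phi\left[\Phi^{-1}(\alpha)\right]^2} = \frac{\Phi^{-1}(\alpha)}{\phi\left[\Phi^{-1}(\alpha)\right]^2}.$$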
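The two-step estimation described above can be sketched in base R as follows: α is replaced by ξ(θ) in the log-likelihood (4), and the resulting one-dimensional profile function is maximized numerically over θ. The data are simulated from the QF in (2), and the true values α = 0.3 and θ = 0.2 are arbitrary choices for illustration. The one-dimensional optimizer optimize is used here for convenience; this is not necessarily the routine used by the authors.

```r
set.seed(1)
n <- 500; alpha0 <- 0.3; theta0 <- 0.2   # illustrative true values

# Simulate from the Vasicek QF in (2) with uniform draws.
y <- pnorm((qnorm(alpha0) + qnorm(runif(n)) * sqrt(theta0)) / sqrt(1 - theta0))
q <- qnorm(y)

# Closed-form maximizer of alpha for a given theta: alpha = xi(theta).
xi <- function(theta) pnorm(sqrt(1 - theta) * mean(q))

# Log-likelihood (4), up to an additive constant.
loglik <- function(alpha, theta) {
  n / 2 * log((1 - theta) / theta) -
    sum((q * sqrt(1 - theta) - qnorm(alpha))^2) / (2 * theta)
}

# Profile log-likelihood in theta, maximized numerically on (0, 1).
prof <- optimize(function(th) loglik(xi(th), th),
                 interval = c(1e-4, 1 - 1e-4), maximum = TRUE)
theta_hat <- prof$maximum
alpha_hat <- xi(theta_hat)

print(c(alpha_hat = alpha_hat, theta_hat = theta_hat))
```

Profiling reduces the two-parameter problem to a search over a bounded interval, which avoids choosing starting values for α.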
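Now, by considering the re-parameterization given in (3), the log-likelihood function is proportional to
$$\ell(\mu, \theta; y, \tau) \propto \frac{n}{2} \log\left( \frac{1-\theta}{\theta} \right) - \frac{1}{2\theta} \sum_{i=1}^{n} \left\{ \sqrt{1-\theta} \left[ \Phi^{-1}(y_i) - \Phi^{-1}(\mu) \right] + \sqrt{\theta}\, \Phi^{-1}(\tau) \right\}^2,$$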
such that
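$$U_\mu = U_\alpha \Big|_{\alpha = h^{-1}(\mu)} \frac{\partial}{\partial \mu} \Phi\left( \Phi^{-1}(\mu)\sqrt{1-\theta} - \sqrt{\theta}\, \Phi^{-1}(\tau) \right)$$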
and
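$$\frac{\partial \ell(\mu, \theta; y, \tau)}{\partial \theta} = \left[ U_\theta + U_\alpha \frac{\partial}{\partial \theta} \Phi\left( \Phi^{-1}(\mu)\sqrt{1-\theta} - \sqrt{\theta}\, \Phi^{-1}(\tau) \right) \right]_{\alpha = h^{-1}(\mu)}$$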
are the score equations for μ and θ , respectively.
Note that, together with $y_i$, for $i \in \{1, \ldots, n\}$, we can also observe covariate vectors $x_i$ and $z_i$. Then, we are interested in evaluating the effects of these covariates on $(\alpha, \theta)$ or $(\mu, \theta)$. For ML estimation, we have the observations $y_1, \ldots, y_n$ from $n$ VASI distributed independent random variables $Y_1, \ldots, Y_n$. Now, assume the equations expressed as
$$g_1(\omega_i) = \eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$$
and
$$g_2(\theta_i) = \zeta_i = \delta_0 + \delta_1 z_{i1} + \cdots + \delta_q z_{iq},$$
which link both $\eta_i$ and $\zeta_i$ with a linear combination of the values of the covariates $x_i = (1, x_{i1}, \ldots, x_{ip})^\top$ and $z_i = (1, z_{i1}, \ldots, z_{iq})^\top$, respectively. Note that $\omega_i$ can be the mean or the $\tau$-th quantile of $Y$. We assume that $g_1$ and $g_2$ are strictly monotonic, twice differentiable functions that map the mean $\alpha_i$ or the $\tau$-th quantile $\mu_i$ and $\theta_i$ to the real line. Suitable choices of $g_1$ and $g_2$ are the inverse CDF of the logistic (logit link), standard normal (probit link), minimum extreme value (complementary log–log link), maximum extreme value (log–log link), and Cauchy (Cauchit link) distributions. It is important to note that $x$ and $z$ can be identical, or one can be a subset of the other.
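To make the role of the link functions concrete, the sketch below uses make.link from base R's stats package, which provides the logit, probit, complementary log–log, and Cauchit links (the log–log link is not among its built-in options, so it is omitted here). For each link it returns the inverse link, which maps a linear predictor η to the unit interval, and the derivative ∂ω/∂η, which enters the score equations. The value η = 1.5 is an arbitrary illustration.

```r
eta   <- 1.5   # arbitrary value of the linear predictor
links <- c("logit", "probit", "cloglog", "cauchit")

# For each link: omega = g^{-1}(eta) and the derivative d omega / d eta.
tab <- sapply(links, function(nm) {
  lk <- make.link(nm)
  c(omega = lk$linkinv(eta), d_omega_d_eta = lk$mu.eta(eta))
})
print(round(tab, 4))

# Round trip: applying g to g^{-1}(eta) recovers eta.
round_trip <- sapply(links, function(nm) {
  lk <- make.link(nm)
  lk$linkfun(lk$linkinv(eta))
})
print(round_trip)
```

All four links send the real line to (0, 1), so any of them keeps the modeled mean, quantile, or shape parameter inside its admissible range.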
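To obtain the parameter estimates, we need the first-order and second-order partial derivatives of the logarithm of the likelihood function. For the $i$-th response, namely $y_i$, with $i \in \{1, \ldots, n\}$ and, as mentioned, $\omega$ being the mean or the $\tau$-th quantile of $Y$, the log-likelihood function is given by $\ell_i = \ell_i(\Theta) = \log f(y_i; \Theta, x_i, z_i, \tau)$, such that the score equations are stated as
$$\frac{\partial \ell_i}{\partial \beta_j} = \frac{\partial \ell_i}{\partial \omega_i} \frac{\partial \omega_i}{\partial \eta_i} \frac{\partial \eta_i}{\partial \beta_j}, \quad j \in \{0, \ldots, p\}, \qquad \frac{\partial \ell_i}{\partial \delta_j} = \frac{\partial \ell_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \zeta_i} \frac{\partial \zeta_i}{\partial \delta_j}, \quad j \in \{0, \ldots, q\},$$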
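where $\Theta = (\beta^\top, \delta^\top)^\top$, $\beta = (\beta_0, \ldots, \beta_p)^\top$ is a $(p+1) \times 1$ parameter vector associated with an $n \times (p+1)$ covariate matrix $X$ and $\delta = (\delta_0, \ldots, \delta_q)^\top$ is a $(q+1) \times 1$ parameter vector associated with an $n \times (q+1)$ covariate matrix $Z$. Considering the full log-likelihood function $\ell$, we have the score vectors for $\beta$ and $\delta$, written, respectively, as
$$\frac{\partial \ell}{\partial \beta} = X^\top \mathrm{diag}(W_\omega)\, \dot{\ell}_\omega, \quad \frac{\partial \ell}{\partial \delta} = Z^\top \mathrm{diag}(W_\delta)\, \dot{\ell}_\delta,$$
with “diag” denoting an $n \times n$ diagonal matrix and
$$W_\omega = \left( \frac{\partial \omega_1}{\partial \eta_1}, \ldots, \frac{\partial \omega_n}{\partial \eta_n} \right)^\top, \quad \dot{\ell}_\omega = \left( \frac{\partial \ell_1}{\partial \omega_1}, \ldots, \frac{\partial \ell_n}{\partial \omega_n} \right)^\top, \quad W_\delta = \left( \frac{\partial \theta_1}{\partial \zeta_1}, \ldots, \frac{\partial \theta_n}{\partial \zeta_n} \right)^\top, \quad \dot{\ell}_\delta = \left( \frac{\partial \ell_1}{\partial \theta_1}, \ldots, \frac{\partial \ell_n}{\partial \theta_n} \right)^\top.$$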
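For the Hessian matrix, we have the expressions stated as
$$\frac{\partial^2 \ell}{\partial \beta\, \partial \beta^\top} = X^\top \mathrm{diag}(\ddot{\ell}_{\omega\omega})\, \mathrm{diag}(W_\omega^2)\, X, \quad \frac{\partial^2 \ell}{\partial \delta\, \partial \delta^\top} = Z^\top \mathrm{diag}(\ddot{\ell}_{\delta\delta})\, \mathrm{diag}(W_\delta^2)\, Z, \quad \frac{\partial^2 \ell}{\partial \beta\, \partial \delta^\top} = X^\top \mathrm{diag}(\ddot{\ell}_{\omega\delta})\, \mathrm{diag}(W_\omega)\, \mathrm{diag}(W_\delta)\, Z,$$
where
$$\ddot{\ell}_{\omega\omega} = \left( \frac{\partial^2 \ell_1}{\partial \omega_1^2}, \ldots, \frac{\partial^2 \ell_n}{\partial \omega_n^2} \right)^\top, \quad \ddot{\ell}_{\delta\delta} = \left( \frac{\partial^2 \ell_1}{\partial \theta_1^2}, \ldots, \frac{\partial^2 \ell_n}{\partial \theta_n^2} \right)^\top, \quad \ddot{\ell}_{\omega\delta} = \left( \frac{\partial^2 \ell_1}{\partial \omega_1\, \partial \theta_1}, \ldots, \frac{\partial^2 \ell_n}{\partial \omega_n\, \partial \theta_n} \right)^\top.$$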
The ML estimates $(\hat{\beta}_0, \ldots, \hat{\beta}_p)$ and $(\hat{\delta}_0, \ldots, \hat{\delta}_q)$ can be obtained, for instance, through general-purpose optimization algorithms such as Nelder–Mead, quasi-Newton, and conjugate-gradient, available in the optim function of R [28]. As an alternative, to model $(\alpha, \theta)$ or $(\mu, \theta)$, conditional on covariates, we create two generalized additive models for location, scale, and shape (GAMLSS) frameworks that can be used directly in the gamlss function of the gamlss package [29,30,31] of R. An advantage of having a GAMLSS structure is that all the parameters of the distribution can be modeled as linear, nonlinear, or smooth functions of covariates. In addition, all the resources of the gamlss package for the statistical modeling process, such as model selection and diagnostics, become available.
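As a minimal sketch of the general-purpose route, the code below fits a VASI quantile regression with logit links for both the quantile and the shape parameter by passing the negative log-likelihood to optim with the quasi-Newton (BFGS) method. The data are simulated, the true coefficients (0.5, 1.0) and θ = 0.25 are arbitrary illustrative values, and nll is a hypothetical helper written for this example; it is not the implementation inside vasicekreg.

```r
set.seed(123)
n <- 300; tau <- 0.25; theta0 <- 0.25
X <- cbind(1, rnorm(n))            # intercept and one covariate
beta0 <- c(0.5, 1.0)               # illustrative true coefficients

# Simulate responses whose tau-th conditional quantile is plogis(X %*% beta0).
mu0    <- plogis(drop(X %*% beta0))
alpha0 <- pnorm(qnorm(mu0) * sqrt(1 - theta0) - qnorm(tau) * sqrt(theta0))
y      <- pnorm((qnorm(alpha0) + qnorm(runif(n)) * sqrt(theta0)) / sqrt(1 - theta0))

# Negative log-likelihood; par = (beta_0, beta_1, logit(theta)).
nll <- function(par) {
  mu    <- plogis(drop(X %*% par[1:2]))
  theta <- plogis(par[3])
  res   <- sqrt(1 - theta) * (qnorm(y) - qnorm(mu)) + sqrt(theta) * qnorm(tau)
  -(n / 2 * log((1 - theta) / theta) - sum(res^2) / (2 * theta))
}

fit <- optim(c(0, 0, 0), nll, method = "BFGS", hessian = TRUE)
est <- fit$par
se  <- sqrt(diag(solve(fit$hessian)))   # SEs from the observed information

print(cbind(estimate = est, se = se))
print(c(theta_hat = plogis(est[3])))
```

Working on the logit scale for θ keeps the optimizer unconstrained; the Hessian returned by optim is then the observed information for (β, logit θ).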
Considering some regularity conditions ([32] pp. 118–119), note that
$$\hat{\Theta} \overset{\cdot}{\sim} N_{p+q+2}\left( \Theta, I(\Theta)^{-1} \right), \tag{7}$$
with $I(\Theta)$ being the expected information matrix that can be obtained as
$$I(\Theta) = E\left[ -\frac{\partial^2 \ell(\Theta)}{\partial \Theta\, \partial \Theta^\top} \right]. \tag{8}$$
Observe that approximate confidence intervals may be obtained from the expression given in (7), while the information matrix defined in (8) can be approximated by the observed information matrix defined by
$$J(\Theta) = -\frac{\partial^2 \ell(\Theta)}{\partial \Theta\, \partial \Theta^\top}, \tag{9}$$
whose elements may be calculated from the results presented above, evaluated at $\Theta = \hat{\Theta}$.
To test the hypotheses $H_0: \Theta = \Theta_0$ versus $H_1: \Theta \neq \Theta_0$, with $\Theta = (\beta^\top, \delta^\top)^\top$, as mentioned, we may employ the Wald and likelihood ratio tests. The Wald [33] and likelihood ratio statistics considering the observed information matrix [34] are, respectively, stated as
$$W = (\hat{\Theta} - \Theta_0)^\top J(\hat{\Theta}) (\hat{\Theta} - \Theta_0), \tag{10}$$
$$L = -2\left[ \ell(\Theta_0) - \ell(\hat{\Theta}) \right]. \tag{11}$$
As $n \to \infty$, these statistics converge in distribution to a random variable with $\chi^2$ distribution and $r$ degrees of freedom (DF), namely $\chi^2_r$, with $r$ being the number of parameters under $H_0$, which is rejected, at a significance level $\varrho$, if the statistic computed according to (10) or (11) is greater than $\chi^2_{r, 1-\varrho}$, which represents the $100(1-\varrho)$th percentile of the $\chi^2_r$ distribution.
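The statistics in (10) and (11) can be computed directly from an optim fit, as in the following sketch. Data are simulated under the null value Θ0, the model uses logit links with θ handled on the logit scale, and J is taken as the numerically differentiated Hessian of the negative log-likelihood at the estimate. All numerical values and the helper loglik are assumptions made for illustration.

```r
set.seed(10)
n <- 200; tau <- 0.5
X <- cbind(1, rnorm(n))
Theta0 <- c(0.5, 1.0, qlogis(0.25))   # null value: (beta_0, beta_1, logit(theta))

# Log-likelihood of the quantile parameterization, up to a constant.
loglik <- function(par, y) {
  mu    <- plogis(drop(X %*% par[1:2]))
  theta <- plogis(par[3])
  res   <- sqrt(1 - theta) * (qnorm(y) - qnorm(mu)) + sqrt(theta) * qnorm(tau)
  length(y) / 2 * log((1 - theta) / theta) - sum(res^2) / (2 * theta)
}

# Generate data under H0 from the QF.
mu0 <- plogis(drop(X %*% Theta0[1:2])); th0 <- plogis(Theta0[3])
a0  <- pnorm(qnorm(mu0) * sqrt(1 - th0) - qnorm(tau) * sqrt(th0))
y   <- pnorm((qnorm(a0) + qnorm(runif(n)) * sqrt(th0)) / sqrt(1 - th0))

# Unrestricted ML fit; the Hessian of -loglik is the observed information J.
fit <- optim(Theta0, function(p) -loglik(p, y), method = "BFGS", hessian = TRUE)

d    <- fit$par - Theta0
W    <- drop(t(d) %*% fit$hessian %*% d)                 # Wald statistic (10)
L    <- -2 * (loglik(Theta0, y) - loglik(fit$par, y))    # LR statistic (11)
crit <- qchisq(0.95, df = length(Theta0))                # chi-square critical value

print(c(Wald = W, LR = L, critical = crit))
```

Because the optimizer starts at Θ0, the maximized log-likelihood cannot fall below its value at Θ0, so the likelihood ratio statistic is non-negative by construction.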
Diagnostics and goodness-of-fit are important because they reveal the strengths and weaknesses of a proposed model. As pointed out in [35]: “Goodness-of-fit is concerned with assessing the validity of models involving statistical distributions, an essential and sometimes forgotten aspect of the modeling exercise. One can only speculate on how many wrong decisions are made due to the use of an incorrect model”. Within the gamlss package, we may utilize the fitted normalized quantile residual [36] defined as
$$\hat{r}_i = \Phi^{-1}\left( F(y_i; \hat{\omega}_i, \hat{\theta}_i) \right), \quad i \in \{1, \ldots, n\}. \tag{12}$$
If the model is correctly specified, the residual $\hat{r}_i$ stated in (12) has an approximate standard normal distribution. Alternatively, defining $\hat{r}_i = -\log\left( 1 - F(y_i; \hat{\omega}_i, \hat{\theta}_i) \right)$, we have the estimated Cox–Snell residual [37]. An important property of the Cox–Snell residual is that, if the model selected fits the data, $\hat{r}_i$ follows the standard exponential distribution.
Note that, from a gamlss object, we can use these residuals to construct theoretical quantile versus empirical quantile (QQ) plots with simulated envelopes [38]. The main advantage of this simulation technique is its ease of interpretation without imposing any assumption on the residual distribution [39].
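Both residuals are simple transformations of the fitted CDF. The sketch below computes them for simulated data; for brevity, the true parameter values stand in for the fitted ones (an assumption of this example, since in practice the estimates from the fitted model would be plugged in), and pvasi is an illustrative helper rather than a package function.

```r
set.seed(42)
n <- 1000; alpha <- 0.3; theta <- 0.2   # stand-ins for fitted values

# Vasicek CDF in the (alpha, theta) parameterization.
pvasi <- function(y, alpha, theta) {
  pnorm((qnorm(y) * sqrt(1 - theta) - qnorm(alpha)) / sqrt(theta))
}

# Simulate responses from the QF.
y <- pnorm((qnorm(alpha) + qnorm(runif(n)) * sqrt(theta)) / sqrt(1 - theta))

Fy   <- pvasi(y, alpha, theta)
r_q  <- qnorm(Fy)        # normalized quantile residuals, as in (12)
r_cs <- -log(1 - Fy)     # Cox-Snell residuals

# Under a correct model: r_q is roughly N(0, 1), r_cs roughly Exp(1).
print(c(mean_q = mean(r_q), sd_q = sd(r_q), mean_cs = mean(r_cs)))
```

A QQ plot of r_q against standard normal quantiles, or of r_cs against standard exponential quantiles, then gives the graphical check described above.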

4. Simulation Studies

In this section, we report the results of two Monte Carlo simulation analyses employed to assess the bias and root mean-squared error (RMSE) of the ML estimators of the VASI quantile regression coefficients. In addition, we provide the coverage probability (CP) of the 95% confidence interval (CP95%) using the asymptotic normality of such ML estimators.
Here, we assume $\tau \in \{0.10, 0.25, 0.50, 0.75, 0.90\}$; sample sizes $n \in \{20, 50, 80, 110, 140, 170, 200\}$; and $\theta \in \{0.25, 0.50, 0.75\}$, inserted into two regression frameworks formulated as: (i) $\mathrm{logit}(\mu_i) = \delta_0 + \delta_1 z_{i1}$ for $\delta_0 = 1.0$, $\delta_1 = 2.0$ and $z_{i1} \sim N(\mu = 0, \sigma = 1)$; and (ii) $\mathrm{logit}(\mu_i) = \delta_0 + \delta_1 z_{i1} + \delta_2 z_{i2}$ for $\delta_0 = 1.0$, $\delta_1 = 1.0$, $\delta_2 = 0.5$, $z_{i1} \sim \mathrm{Bernoulli}(p = 0.5)$ and $z_{i2} \sim N(\mu = 0, \sigma = 1)$. For each $(\tau, n, \theta)$ and both regressions, with the covariates being kept constant in the simulations, $M = 10{,}000$ replicates were generated using the SAS Data-Step, while the estimates were calculated with the quasi-Newton method of PROC NLMIXED of SAS [40]. Response values, for $(\tau, n, \theta)$ and the covariates, are obtained from the QF given by
$$y_i = \Phi\left( \frac{\Phi^{-1}(\alpha_i) + \Phi^{-1}(u_i)\sqrt{\theta}}{\sqrt{1-\theta}} \right),$$
where $u_i \sim U(0, 1)$ and $\alpha_i = h^{-1}(\mu_i) = \Phi\left[ \Phi^{-1}(\mu_i)\sqrt{1-\theta} - \Phi^{-1}(\tau)\sqrt{\theta} \right]$.
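The bias, RMSE, and CP were calculated, respectively, as
$$\mathrm{Bias}(\hat{\varsigma}) = \frac{1}{M} \sum_{i=1}^{M} (\hat{\varsigma}_i - \varsigma), \quad \mathrm{RMSE}(\hat{\varsigma}) = \left[ \frac{1}{M} \sum_{i=1}^{M} (\hat{\varsigma}_i - \varsigma)^2 \right]^{1/2},$$
and $\mathrm{CP}_{95\%}(\hat{\varsigma}) = (1/M) \sum_{i=1}^{M} \mathbb{1}\left[ \varsigma \in \hat{\varsigma}_i \pm 1.96 \times \widehat{\mathrm{SE}}(\hat{\varsigma}_i) \right]$, where $\varsigma = \theta$, $\delta_0$, $\delta_1$, or $\delta_2$; $\mathbb{1}$ is the indicator function; and $\widehat{\mathrm{SE}}(\hat{\varsigma}_i)$ is the corresponding estimated SE.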
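A scaled-down version of such an experiment can be sketched in base R. This is not the SAS setup used in the study: it uses optim instead of PROC NLMIXED, only M = 200 replicates, one scenario (n = 100, τ = 0.25, θ = 0.25), illustrative coefficients (0.5, 1.0), and it reports the shape parameter on the logit scale, all of which are assumptions of this example.

```r
set.seed(2022)
M <- 200; n <- 100; tau <- 0.25; theta <- 0.25
delta <- c(0.5, 1.0)                       # illustrative true coefficients
X <- cbind(1, rnorm(n))                    # covariates held fixed across replicates

mu    <- plogis(drop(X %*% delta))
alpha <- pnorm(qnorm(mu) * sqrt(1 - theta) - qnorm(tau) * sqrt(theta))
truth <- c(delta, qlogis(theta))           # third entry is logit(theta)

# Negative log-likelihood; par = (delta_0, delta_1, logit(theta)).
nll <- function(par, y) {
  m   <- plogis(drop(X %*% par[1:2]))
  th  <- plogis(par[3])
  res <- sqrt(1 - th) * (qnorm(y) - qnorm(m)) + sqrt(th) * qnorm(tau)
  -(n / 2 * log((1 - th) / th) - sum(res^2) / (2 * th))
}

est <- se <- matrix(NA_real_, M, 3)
for (m in seq_len(M)) {
  # Draw responses from the QF with uniform variates.
  y   <- pnorm((qnorm(alpha) + qnorm(runif(n)) * sqrt(theta)) / sqrt(1 - theta))
  fit <- optim(c(0, 0, 0), nll, y = y, method = "BFGS", hessian = TRUE)
  est[m, ] <- fit$par
  se[m, ]  <- sqrt(diag(solve(fit$hessian)))
}

err  <- sweep(est, 2, truth)               # estimate minus true value
bias <- colMeans(err)
rmse <- sqrt(colMeans(err^2))
cp   <- colMeans(abs(err) <= 1.96 * se)    # coverage of the 95% Wald intervals

print(rbind(bias = bias, rmse = rmse, cp95 = cp))
```

Increasing M and looping over the grids of τ, n, and θ would reproduce the structure of the full study, at a correspondingly higher computational cost.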
Tables 1–3 and Tables 4–6 present the results for the first and second regression models, respectively. These tables report a small bias when estimating $\theta$ and $\delta$ for all settings considered in this study. In addition, the estimated RMSE is small and quickly approaches zero as the sample size $n$ increases. Larger values of the bias and RMSE are detected for quantile levels away from $\tau = 0.5$, on either side, that is, for $\tau \in \{0.1, 0.25, 0.75, 0.9\}$. Moreover, for all scenarios, the CPs tend towards the nominal confidence coefficient of 95% as $n$ increases from $n = 20$ to $n = 200$. The Monte Carlo simulation results for the mean regression with the same regression structures indicate that the coefficients are also well estimated according to the bias, RMSE, and CP: once again, the bias and RMSE decrease as the sample size increases, whereas the CPs approach the 95% confidence coefficient, as expected. These results are not shown here and can be obtained from the authors upon request.
Table 1. Empirical bias, RMSE, and 95% CP (true values: $\delta_0 = 1.00$, $\delta_1 = 2.00$, and $\theta = 0.25$) with simulated data.
Table 2. Empirical bias, RMSE, and 95% CP (true values: $\delta_0 = 1.00$, $\delta_1 = 2.00$, and $\theta = 0.50$) with simulated data.
Table 3. Empirical bias, RMSE, and 95% CP (true values: $\delta_0 = 1.00$, $\delta_1 = 2.00$, and $\theta = 0.75$) with simulated data.
Table 4. Empirical bias, RMSE, and 95% CP (true values: $\delta_0 = 1.00$, $\delta_1 = 1.00$, $\delta_2 = 0.50$, and $\theta = 0.25$) with simulated data.
Table 5. Empirical bias, RMSE, and 95% CP (true values: $\delta_0 = 1.00$, $\delta_1 = 1.00$, $\delta_2 = 0.50$, and $\theta = 0.50$) with simulated data.
Table 6. Empirical bias, RMSE, and 95% CP (true values: $\delta_0 = 1.00$, $\delta_1 = 1.00$, $\delta_2 = 0.50$, and $\theta = 0.75$) with simulated data.
Observe that the empirical results based on the Monte Carlo simulations for large samples are coherent with the asymptotic theoretical results presented at the end of Section 3 and just before the diagnostic analysis.

5. Applications

In the next subsections, we present two real applications in order to show the potential of the proposed quantile and mean regression models, taking the VASI distribution as a baseline. For comparison purposes, in addition to the VASI quantile regression model, we also consider the Johnson SB (JOSB), Kumaraswamy (KUMA), unit-logistic (ULOG), and unit-Weibull (UWEI) quantile regression models, whose PDF, CDF, and QF are defined in Appendix A. The VASI mean regression model is compared with the beta and simplex regression models. Note that this section does not seek to provide a complete analysis covering link function and variable selection, regression diagnostics, or parameter interpretation. Rather, we suggest the VASI regression model as an alternative to other models available in the literature.

5.1. The VASI Quantile Regression Model

This application considers a real data set first reported in [41], which was also analyzed in [7]. This data set contains 298 observations about the body fat proportion of patients in a public hospital located in Curitiba, Paraná, Brazil. The fat proportions at android, arms, gynecoid, legs, and body form five response variables, and the data set also contains two continuous and two categorical covariates. The continuous covariates are the age (in years) and body mass index (bmi) (in kg/m²) of the individuals, while the categorical covariates are gender (female or male) and ipaq (sedentary (S), insufficiently active (I), or active (A)). The ipaq is a questionnaire that permits the estimation of weekly time spent on physical activities of moderate and strong intensity, in different aspects of daily life, such as transportation, leisure, housework, and work, as well as the time spent in passive activities [42].
The sample contains information from 150 women and 148 men. Patients have a mean age of 46 years with a standard deviation of 19.88 years, while the average BMI is 24.72 kg/m² with a standard deviation of 3.15 kg/m². The ipaq questionnaire classified individuals as follows: 76 individuals as insufficiently active, 60 as sedentary, and 162 as active. According to [41], the data set has one outlier, individual #158, with the following characteristics: female, 49 years old, with BMI = 29.3 kg/m² and fat proportion in the arms equal to 0.196; in other words, this patient has a high BMI but a low fat proportion in the arms.
For the response variable “proportion of fat in the arms”, we assume that $\mu_i$ is modeled as
$$\mathrm{logit}(\mu_i) = \delta_0 + \delta_1 x_{i1} + \delta_2 x_{i2} + \delta_3 x_{i3} + \delta_4 x_{i4} + \delta_5 x_{i5}, \quad i \in \{1, \ldots, 298\},$$
where
  • $x_{i1}$ is (age$_i$ − 46.00), with 46.00 being the average age;
  • $x_{i2}$ is (BMI$_i$ − 24.72), with 24.72 being the average BMI;
  • $x_{i3}$ is an indicator covariate, in which $x_{i3} = 0$ for female or $x_{i3} = 1$ for male;
  • $x_{i4}$ is an indicator covariate, in which $x_{i4} = 0$ for ipaq = S or $x_{i4} = 1$ for ipaq = I;
  • $x_{i5}$ is an indicator covariate, in which $x_{i5} = 0$ for ipaq = S or $x_{i5} = 1$ for ipaq = A.
Table 7 presents the values of the ML estimates and their corresponding SEs for the indicated models. The estimate $\hat{\delta}_0$ corresponds to a female aged 46 years, with BMI = 24.72 kg/m² and ipaq = S. The parameter estimates for $\delta_1$ and $\delta_2$ indicate that the arm fat proportion is larger for older individuals and for individuals with larger BMI. In contrast, the estimates of $\delta_3$, $\delta_4$, and $\delta_5$ are negative, indicating that this proportion is smaller for men and for insufficiently active and active individuals, respectively. Consequently, the fat proportion is larger for women and sedentary individuals. As expected, $\hat{\theta}$ does not vary with the quantiles, since it does not depend on covariates. From the variation in the parameter estimates for different values of $\tau$, we can conclude that the estimated arm fat proportion depends on the quantiles. This conditional quantile variation rate, expressed by the estimated regression coefficients, is illustrated in Figure 3. We can also see that, from a statistical point of view, all covariates are significant at the 5% level, since their respective confidence intervals do not contain zero.
Table 7. ML estimates and SEs for the body fat proportion data.
Figure 3. Estimated quantile process plot for $\delta_i$, with $i \in \{0, 1, 2, 3, 4, 5\}$, and $\theta$.
To support this inference, the model’s appropriateness should be assessed. With this aim, we provide the QQ plots with envelopes considering the Cox–Snell and quantile residuals in Figure 4, from which we can conclude that the model presented a good fit, since all the observations are inside the simulated envelopes. Furthermore, Table 8 reports the values of the Akaike information criterion (AIC) employed to compare the models. By this criterion, the quantile regression model based on the VASI distribution presented the best fit, as it had the smallest value. Note that, when the quantile level increases, the value of the AIC decreases. The largest estimated log-likelihood values have been obtained at $\tau = 0.90$, that is, the left-skewed distribution of the response variable has been captured successfully by this quantile level.
Figure 4. QQ plots with simulated envelopes of Cox–Snell (first row) and normalized quantile (second row) residuals for the listed τ quantile with fat proportion in the arms.
Table 8. The Akaike information criterion for the body fat proportion data.
ML estimates for the model parameters may be calculated using the vasicekreg package by means of the following R codes:
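library(gamlss)
library(vasicekreg)
data(bodyfat, package = "vasicekreg")
# Fit the VASI quantile regression at each quantile level.
fit <- lapply(c(0.10, 0.25, 0.50, 0.75, 0.90), function(Tau)
{
   # The quantile level is passed through the global variable tau.
   tau <<- Tau
   gamlss(arms ~ age + sex2 + ipaq1 + ipaq2, data = bodyfat, trace = FALSE,
       family = VASIQ(mu.link = "logit", sigma.link = "logit"))
})
# Coefficient estimates, one column per quantile level.
sapply(fit, coef)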
Probit, complementary log–log, and cauchit link functions could also be used through the argument mu.link, and sigma may be modeled as a function of covariates through the argument sigma.formula. The vasicekreg package is available online at https://cloud.r-project.org/web/packages/vasicekreg/index.html (accessed on 24 March 2022) and through it we can also consider, beyond quantile regression, mean regression and all the functionality of the gamlss package. For the other fitted models, the corresponding ML estimates were calculated with the unitquantreg package, which is under development and available online at https://github.com/AndrMenezes/unitquantreg (accessed on 24 March 2022).

5.2. The VASI Mean Regression Model

From Section 2, we observe that E ( Y ) = α , with Y VASI distributed. Hence, we can easily assess the effect of covariates on the mean of the response variable. Thus, in addition to the quantile regression model, we also propose a regression model considering the mean of the VASI distribution. As an alternative to the VASI regression model for the mean, the beta [43] and simplex [44] regression models are also considered.
In what follows, we illustrate the potential of the VASI mean regression model considering a political data set extracted from the baquantreg package of R [45]. The response corresponds to the proportion of votes in the 2014 presidential election in Brazil, obtained by the elected president, Dilma Rousseff, in the Minas Gerais and Piaui states. First, for each state, we considered the Human Development Index (HDI) in 2010, centered at the mean, as covariate and a regression structure given by
$$\mathrm{logit}(\alpha_i) = \delta_0 + \delta_1 \mathrm{HDI}_i^{+}, \quad \text{for } i \in \{1, \ldots, n\},$$
where $n = 843$ $(222)$ for Minas Gerais (Piaui) and $\mathrm{HDI}_i^{+} = (\mathrm{HDI}_i - 0.6479)$.
Table 9 reports the ML estimates, estimated SEs, and the values of AIC for the three fitted models. The estimate of $\delta_0$ indicates that the logit of the vote proportion for the state of Minas Gerais is 0.6012 and for the state of Piaui is 0.7885 when HDI is 0.6479. Furthermore, since $\hat{\delta}_1$ is negative, there is an inverse relation between the vote proportion and the HDI covariate for both Brazilian states. Hence, we can conclude that, when HDI increases, a smaller proportion of votes has been received in both Brazilian states. This effect is more accentuated in the state of Minas Gerais, since its estimate is smaller. The AIC values indicate that the VASI mean regression model is the best when compared to the beta and simplex models.
Table 9. ML estimates (with SEs in parentheses) and AIC values for the vote data.
We also consider a second mean regression structure, where “state” is a covariate, and we insert covariates into the parameter $\theta$, as
$$\mathrm{logit}(\alpha_i) = \delta_0 + \delta_1 \mathrm{state}_i + \delta_2 \mathrm{HDI}_i^{+},$$
$$\mathrm{logit}(\theta_i) = \beta_0 + \beta_1 \mathrm{state}_i, \quad i \in \{1, \ldots, 1065\}.$$
Table 10 reports the ML estimates, estimated SEs, and the AIC values. The estimate of $\delta_0$ indicates that the logit of the proportion of votes is 0.5984 for the reference state (Minas Gerais) when HDI is at its mean value. The estimate of $\delta_1$ indicates the change in the logit of the proportion of votes for the other state relative to the reference state, for a fixed HDI. From $\hat{\delta}_2$, we can conclude that the proportion of votes is smaller for larger HDI values. The parameter estimate $\hat{\beta}_0$ indicates the logit of the shape parameter of the distribution for the reference state. The parameter estimate $\hat{\beta}_1$ indicates the change in the logit of the shape parameter relative to the reference state. Based on the AIC values, we can conclude that the VASI regression model considering the mean is the best fit to the data, when compared to the beta and simplex models.
Table 10. ML estimates (with SEs in parentheses) and AIC values for the vote data.
The results displayed in Table 10 can be obtained from R codes as follows:
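library(gamlss)
library(vasicekreg)
# votesmgpi: the vote data set described above, assumed to be already loaded.
# Same mean and shape structure for the three families.
vasi <- gamlss(percvotes ~ state + hdi, sigma.formula = ~ state,
               data = votesmgpi, family = VASIM(), trace = FALSE)
beta <- gamlss(percvotes ~ state + hdi, sigma.formula = ~ state,
               data = votesmgpi, family = BE(), trace = FALSE)
simp <- gamlss(percvotes ~ state + hdi, sigma.formula = ~ state,
               data = votesmgpi, family = SIMPLEX(), trace = FALSE)
# Compare the fits by AIC.
GAIC(vasi, beta, simp)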
    DF    AIC
vasi  5 -1887.173
beta  5 -1882.968
simp  5 -1877.755
Once a VASI model is fitted, it is important to assess its adequacy by examining the residuals. Through the plot function, applied to objects vasi, beta, or simp, we have available four plots for checking the normalized quantile residuals. The four plots are residuals against the fitted values of the parameter α , residuals against an index, a kernel density estimate of the residuals, and the QQ plot of the residuals.

6. Concluding Remarks

There are various alternative classes of regression models for proportions and rates. Despite its flexibility and the variety of shapes it can take, the Vasicek distribution has been neglected in the analysis of data on the unit interval, both in the presence and in the absence of covariates. In this paper, we presented two regression models based on the Vasicek distribution for response variables restricted to the unit interval. The first was formulated by re-parameterizing the parameter α of the Vasicek distribution in terms of a quantile. The second keeps the original parameterization, which is stated in terms of the distribution mean.
In both models proposed, the covariates were related to the quantile or mean of the response variable by the logit link function. Through a Monte Carlo simulation study, we found that, for both models, the parameters are well estimated in terms of bias and root mean squared error. It was also possible to observe that the coverage probabilities tend towards the nominal confidence coefficient as the sample size increases.
The Vasicek quantile regression model was applied to a real-world data set in which the response variable was the fat proportion measured in the arms. When compared with four other models from the literature, namely those based on the Johnson SB, Kumaraswamy, unit-logistic, and unit-Weibull distributions, the Vasicek quantile regression gave the best fit to the data according to the Akaike information criterion.
The Vasicek regression model for the mean of the response variable was applied to a politics data set and compared with the beta and simplex regression models. The model was built with covariates in the parameter representing the mean and also in the shape parameter of the distribution. By the criteria considered, the Vasicek regression model offered the best fit to the data. Overall, the results indicate that the Vasicek distribution has great potential for analyzing data restricted to the unit interval in the presence of covariates. The analysis using both regression models (quantile and mean) was facilitated by the R package vasicekreg, which was implemented by the authors of this paper.

Author Contributions

Conceptualization, M.Ç.K.; Data curation, J.M. and B.A.; Formal analysis, B.A. and V.L.; Investigation, J.M., B.A., M.Ç.K. and V.L.; Methodology, J.M., B.A., M.Ç.K. and V.L.; Writing – original draft, J.M., B.A. and M.Ç.K.; Writing – review and editing, V.L. All authors have read and agreed to the published version of the manuscript.

Funding

The research of B. Alves is partially supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brazil (CAPES) – Finance Code 001. The research of V. Leiva was partially funded by FONDECYT, project grant number 1200525 from the National Agency for Research and Development (ANID) of the Chilean government under the Ministry of Science, Technology, Knowledge and Innovation.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data and codes are available from the authors upon request.

Acknowledgments

The authors would like to thank the Editors and Reviewers for their constructive comments, which helped to improve the presentation of the manuscript.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Appendix A. Other Distributions

In this appendix, we characterize the other distributions considered in the numerical examples.
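  • The ULOG distribution [46] is obtained from the transformation
    $$Y = \frac{\exp\left(\frac{X - \alpha}{\theta}\right)}{1 + \exp\left(\frac{X - \alpha}{\theta}\right)},$$
    where $X \sim \mathrm{Log}(0, 1)$ denotes a random variable with standard logistic distribution [47]. The associated PDF, CDF, and QF are written, respectively, as
    $$f(y; \alpha, \theta) = \frac{\theta \exp(\alpha) \left(\frac{y}{1 - y}\right)^{\theta - 1}}{(1 - y)^{2} \left[1 + \exp(\alpha) \left(\frac{y}{1 - y}\right)^{\theta}\right]^{2}},$$
    $$F(y; \alpha, \theta) = \frac{\exp(\alpha) \left(\frac{y}{1 - y}\right)^{\theta}}{1 + \exp(\alpha) \left(\frac{y}{1 - y}\right)^{\theta}}$$
    and
    $$Q(\tau; \alpha, \theta) = \frac{\exp\left(-\frac{\alpha}{\theta}\right) \left(\frac{\tau}{1 - \tau}\right)^{\frac{1}{\theta}}}{1 + \exp\left(-\frac{\alpha}{\theta}\right) \left(\frac{\tau}{1 - \tau}\right)^{\frac{1}{\theta}}},$$
    where $0 < y, \tau < 1$, $\alpha > 0$, and $\theta > 0$. For quantile regression, the parameter $\alpha$ defined in (A3) must be re-parametrized as
    $$\alpha = h^{-1}(\mu) = \log\left(\frac{\tau}{1 - \tau}\right) - \theta \log\left(\frac{\mu}{1 - \mu}\right).$$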
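  • The JOSB distribution [48] can be obtained from the transformation
    $$Y = \frac{\exp\left(\frac{X - \alpha}{\theta}\right)}{1 + \exp\left(\frac{X - \alpha}{\theta}\right)},$$
    where $X \sim \mathrm{N}(0, 1)$ denotes a standard normal distributed random variable. The corresponding PDF, CDF, and QF are written, respectively, as
    $$f(y; \alpha, \theta) = \frac{\theta}{\sqrt{2\pi}} \, \frac{1}{y (1 - y)} \exp\left\{-\frac{1}{2} \left[\alpha + \theta \log\left(\frac{y}{1 - y}\right)\right]^{2}\right\},$$
    $$F(y; \alpha, \theta) = \Phi\left[\alpha + \theta \log\left(\frac{y}{1 - y}\right)\right]$$
    and
    $$Q(\tau; \alpha, \theta) = \frac{\exp\left[\frac{\Phi^{-1}(\tau) - \alpha}{\theta}\right]}{1 + \exp\left[\frac{\Phi^{-1}(\tau) - \alpha}{\theta}\right]},$$
    where $0 < y, \tau < 1$, $\alpha \in \mathbb{R}$, and $\theta > 0$. For quantile regression, the parameter $\alpha$ defined in (A6) must be re-parametrized as
    $$\alpha = h^{-1}(\mu) = \Phi^{-1}(\tau) - \theta \log\left(\frac{\mu}{1 - \mu}\right).$$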
  • The KUMA distribution [49] can be obtained from the transformation $Y = \exp(-X)$, where $X \sim \mathrm{EE}(\alpha, \theta)$ denotes a random variable with exponentiated-exponential distribution. The associated PDF, CDF, and QF are written, respectively, as
    $$f(y; \alpha, \theta) = \alpha \theta \, y^{\theta - 1} \left(1 - y^{\theta}\right)^{\alpha - 1},$$
    $$F(y; \alpha, \theta) = 1 - \left(1 - y^{\theta}\right)^{\alpha}$$
    and
    $$Q(\tau; \alpha, \theta) = \left[1 - (1 - \tau)^{\frac{1}{\alpha}}\right]^{\frac{1}{\theta}},$$
    where $0 < y, \tau < 1$, and $\alpha > 0$ and $\theta > 0$ are shape parameters. For quantile regression, the parameter $\alpha$ defined in (A9) must be re-parametrized as
    $$\alpha = h^{-1}(\mu) = \frac{\log(1 - \tau)}{\log\left(1 - \mu^{\theta}\right)}.$$
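  • The UWEI distribution [18,19] is obtained from the transformation $Y = \exp(-X)$, where $X \sim \mathrm{WEI}(\alpha, \theta)$ denotes a random variable with Weibull distribution [50]. The associated PDF, CDF, and QF are written, respectively, as
    $$f(y; \alpha, \theta) = \frac{\alpha \theta}{y} \left[-\log(y)\right]^{\theta - 1} \exp\left\{-\alpha \left[-\log(y)\right]^{\theta}\right\},$$
    $$F(y; \alpha, \theta) = \exp\left\{-\alpha \left[-\log(y)\right]^{\theta}\right\}$$
    and
    $$Q(\tau; \alpha, \theta) = \exp\left\{-\left[-\frac{\log(\tau)}{\alpha}\right]^{\frac{1}{\theta}}\right\},$$
    where $0 < y, \tau < 1$, $\alpha > 0$, and $\theta > 0$. For quantile regression, the parameter $\alpha$ defined in (A11) must be re-parametrized as
    $$\alpha = h^{-1}(\mu) = -\frac{\log(\tau)}{\left[-\log(\mu)\right]^{\theta}}.$$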
  • The beta distribution parametrized in terms of its mean and dispersion parameters was given in [43]. The corresponding PDF is written as
    $$f(y; \alpha, \theta) = \frac{\Gamma(\theta)}{\Gamma(\theta \alpha) \, \Gamma\left((1 - \alpha) \theta\right)} \, y^{\alpha \theta - 1} (1 - y)^{(1 - \alpha) \theta - 1},$$
    where $0 < y < 1$, $\Gamma$ is the gamma function, $0 < \alpha < 1$ is the mean, and $\theta > 0$.
  • The simplex distribution [44,51] can be obtained from the inverse Gaussian distribution. The corresponding PDF is written as
    $$f(y; \alpha, \theta) = \left\{2 \pi \theta^{2} \left[y (1 - y)\right]^{3}\right\}^{-\frac{1}{2}} \exp\left\{-\frac{(y - \alpha)^{2}}{2 \theta^{2} \, y (1 - y) \, \alpha^{2} (1 - \alpha)^{2}}\right\},$$
    where $0 < y < 1$, the parameter $\alpha$ is the mean, and the parameter $\theta$ is the dispersion parameter.
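The quantile re-parameterizations listed in this appendix can be checked numerically: substituting α = h⁻¹(μ) into the quantile function at level τ should return μ. The base R sketch below does this for the ULOG, JOSB, KUMA, and UWEI distributions, writing the formulas under the reading that is consistent with the stated CDFs (in particular, the minus signs in the exponents). The function names and the values of μ, θ, and τ are illustrative choices made for this example.

```r
# Quantile functions of Appendix A (signs chosen to agree with the CDFs).
q_ulog <- function(tau, alpha, theta) {
  w <- exp(-alpha / theta) * (tau / (1 - tau))^(1 / theta)
  w / (1 + w)
}
q_josb <- function(tau, alpha, theta) plogis((qnorm(tau) - alpha) / theta)
q_kuma <- function(tau, alpha, theta) (1 - (1 - tau)^(1 / alpha))^(1 / theta)
q_uwei <- function(tau, alpha, theta) exp(-(-log(tau) / alpha)^(1 / theta))

# Re-parameterizations alpha = h^{-1}(mu) for a fixed quantile level tau.
a_ulog <- function(mu, theta, tau) qlogis(tau) - theta * qlogis(mu)
a_josb <- function(mu, theta, tau) qnorm(tau) - theta * qlogis(mu)
a_kuma <- function(mu, theta, tau) log(1 - tau) / log(1 - mu^theta)
a_uwei <- function(mu, theta, tau) -log(tau) / (-log(mu))^theta

# Illustrative values: the median (tau = 0.5) is set to mu = 0.35.
mu <- 0.35
theta <- 1.7
tau <- 0.5

# Plug alpha = h^{-1}(mu) back into each quantile function.
recovered <- c(
  ULOG = q_ulog(tau, a_ulog(mu, theta, tau), theta),
  JOSB = q_josb(tau, a_josb(mu, theta, tau), theta),
  KUMA = q_kuma(tau, a_kuma(mu, theta, tau), theta),
  UWEI = q_uwei(tau, a_uwei(mu, theta, tau), theta)
)
print(recovered)  # each entry should equal mu up to rounding error
```

This is the property that lets μ be modeled directly as the τ-th conditional quantile through a link function, while θ remains a free shape parameter.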

References

  1. Yu, K.; Zhang, J. A three-parameter asymmetric Laplace distribution and its extension. Commun. Stat. Theory Methods 2005, 34, 1867–1879.
  2. Geraci, M.; Bottai, M. Quantile regression for longitudinal data using the asymmetric Laplace distribution. Biostatistics 2007, 8, 140–154.
  3. Taylor, J.W. Forecasting value at risk and expected shortfall using a semiparametric approach based on the asymmetric Laplace distribution. J. Bus. Econ. Stat. 2019, 37, 121–133.
  4. Koenker, R.; Bassett, G.J. Regression quantiles. Econ. J. Econ. Soc. 1978, 46, 33–50.
  5. Korkmaz, M.Ç.; Chesneau, C.; Korkmaz, Z.S. A new alternative quantile regression model for the bounded response with educational measurements applications of OECD countries. J. Appl. Stat. 2022, in press.
  6. Iriarte, Y.A.; de Castro, M.; Gómez, H.W. An alternative one-parameter distribution for bounded data modeling generated from the Lambert transformation. Symmetry 2021, 13, 1190.
  7. Mazucheli, J.; Leiva, V.; Alves, B.; Menezes, A.F.B. A new quantile regression for modeling bounded data under a unit Birnbaum–Saunders distribution with applications in medicine and politics. Symmetry 2021, 13, 682.
  8. Mazucheli, J.; Menezes, A.F.B.; Dey, S. The unit-Birnbaum-Saunders distribution with applications. Chil. J. Stat. 2018, 9, 47–57.
  9. Sanchez, L.; Leiva, V.; Galea, M.; Saulo, H. Birnbaum-Saunders quantile regression and its diagnostics with application to economic data. Appl. Stoch. Model. Bus. Ind. 2021, 37, 53–73.
  10. Sanchez, L.; Leiva, V.; Galea, M.; Saulo, H. Birnbaum-Saunders quantile regression models with application to spatial data. Mathematics 2021, 8, 1000.
  11. Korkmaz, M.Ç.; Chesneau, C. On the unit Burr-XII distribution with the quantile regression modeling and applications. Comput. Appl. Math. 2021, 40, 29.
  12. Korkmaz, M.Ç.; Chesneau, C.; Korkmaz, Z.S. On the arcsecant hyperbolic normal distribution. Properties, quantile regression modeling and applications. Symmetry 2021, 13, 117.
  13. Korkmaz, M.Ç.; Emrah, A.; Chesneau, C.; Yousof, H.M. On the unit-Chen distribution with associated quantile regression and applications. Math. Slovaca, 2022; in press.
  14. Jodrá, P.; Jiménez-Gamero, M.D. A quantile regression model for bounded responses based on the exponential-geometric distribution. REVSTAT Stat. J. 2020, 4, 415–436.
  15. Lemonte, A.J.; Moreno-Arenas, G. On a heavy-tailed parametric quantile regression model for limited range response variables. Comput. Stat. 2020, 35, 379–398.
  16. Cancho, V.G.; Bazán, J.L.; Dey, D.K. A new class of regression model for a bounded response with application in the study of the incidence rate of colorectal cancer. Stat. Methods Med. Res. 2020, 29, 2015–2033.
  17. Paz, R.F.; Balakrishnan, N.; Bazán, J.L. L-logistic regression models: Prior sensitivity analysis, robustness to outliers and applications. Braz. J. Probab. Stat. 2019, 33, 455–479.
  18. Mazucheli, J.; Menezes, A.F.B.; Fernandes, L.B.; de Oliveira, R.P.; Ghitany, M.E. The unit-Weibull distribution as an alternative to the Kumaraswamy distribution for the modeling of quantiles conditional on covariates. J. Appl. Stat. 2020, 47, 954–974.
  19. Mazucheli, J.; Menezes, A.F.B.; Ghitany, M.E. The unit-Weibull distribution and associated inference. J. Appl. Probab. Stat. 2018, 13, 1–22.
  20. Sanchez, L.; Leiva, V.; Marchant, C.; Saulo, H.; Sarabia, J.M. A new quantile regression model and its diagnostic analytics for a Weibull distributed response with applications. Mathematics 2021, 9, 2768.
  21. Lemonte, A.J.; Bazán, J.L. New class of Johnson distributions and its associated regression model for rates and proportions. Biometr. J. 2016, 58, 727–746.
  22. Mitnik, P.A.; Baek, S. The Kumaraswamy distribution: Median-dispersion re-parameterizations for regression modeling and simulation-based estimation. Stat. Pap. 2013, 54, 177–192.
  23. Mazucheli, J.; Alves, B.; Menezes, A.F.B.; Leiva, V. An overview on parametric quantile regression models and their computational implementation with applications to biomedical problems including COVID-19 data. Comp. Meth. Prog. Biomed. 2022; in press.
  24. Vasicek, O.A. Probability of Loss on Loan Portfolio; KMV Corporation: San Francisco, CA, USA, 1987.
  25. Fischer, M.; Hösle, S. Beyond beta and Vasicek: A comparative analysis of continuous distributions on (0,1). Int. J. Stat. Adv. Theory Appl. 2018, 2, 143–179.
  26. SAS. SAS/IML® 9.3 User’s Guide; SAS Institute Inc.: Cary, NC, USA, 2011.
  27. Vasicek, O.A. The distribution of loan portfolio value. Risk 2002, 15, 160–162.
  28. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2020.
  29. Rigby, R.A.; Stasinopoulos, D.M. Generalised additive models for location scale and shape. J. R. Stat. Soc. C 2005, 54, 507–554.
  30. Rigby, R.A.; Stasinopoulos, M.D.; Heller, G.Z.; De Bastiani, F. Distributions for Modeling Location, Scale, and Shape: Using GAMLSS in R; CRC Press: New York, NY, USA, 2019.
  31. Rigby, R.A.; Stasinopoulos, D.M.; Voudouris, V. Discussion: A comparison of GAMLSS with quantile regression. Stat. Model. 2013, 13, 335–348.
  32. Davison, A. Statistical Models; Cambridge University Press: Cambridge, UK, 2003.
  33. Wald, A. Sequential Analysis; Wiley: New York, NY, USA, 1947.
  34. Wilks, S.S. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Stat. 1938, 9, 60–62.
  35. Rayner, J.C.W.; Thas, O.; Best, D.J. Smooth Tests of Goodness of Fit Using R; Wiley: Singapore, 2009.
  36. Dunn, P.K.; Smyth, G.K. Randomized quantile residuals. J. Comput. Graph. Stat. 1996, 5, 236–244.
  37. Cox, D.R.; Snell, E.J. A general definition of residuals. J. R. Stat. Soc. B 1968, 30, 248–265.
  38. Moral, R.A.; Hinde, J.; Demétrio, C.G.B. Half-normal plots and overdispersed models in R: The hnp package. J. Stat. Softw. 2017, 81, 1–23.
  39. Zhao, Y.; Lee, A.H.; Yau, K.K.W.; McLachlan, G.J. Assessing the adequacy of Weibull survival models: A simulated envelope approach. J. Appl. Stat. 2011, 38, 2089–2097.
  40. SAS. SAS/STAT® 15.1 User’s Guide; The NLMIXED Procedure; SAS Institute Inc.: Cary, NC, USA, 2018; pp. 5148–5234.
  41. Petterle, R.R.; Bonat, W.H.; Scarpin, C.T.; Jonasson, T.; Borba, V.Z.C. Multivariate quasi-beta regression models for continuous bounded data. Int. J. Biostat. 2020, 1, 39–53.
  42. Benedetti, T.R.B.; Antunes, P.d.C.; Rodriguez-Añez, C.R.; Mazo, G.Z.; Petroski, É.L. Reproducibility and validity of the International Physical Activity Questionnaire (IPAQ) in elderly men. Rev. Bras. Med. Esporte 2007, 13, 11–16.
  43. Ferrari, S.; Cribari Neto, F. Beta regression for modelling rates and proportions. J. Appl. Stat. 2004, 31, 799–815.
  44. Song, P.X.K.; Tan, M. Marginal models for longitudinal continuous proportional data. Biometrics 2000, 56, 496–502.
  45. Santos, B. Baquantreg: Bayesian Quantile Regression Methods. R Package Version 0.1. 2015. Available online: https://rdrr.io/github/brsantos/baquantreg (accessed on 24 March 2022).
  46. Tadikamalla, P.R.; Johnson, N.L. Systems of frequency curves generated by transformations of logistic variables. Biometrika 1982, 69, 461–465.
  47. Balakrishnan, N. Handbook of the Logistic Distribution; Marcel Dekker: New York, NY, USA, 1992.
  48. Johnson, N.L. Systems of frequency curves generated by methods of translation. Biometrika 1949, 36, 149–176.
  49. Kumaraswamy, P. A generalized probability density function for double-bounded random processes. J. Hydrol. 1980, 46, 79–88.
  50. Weibull, W.A. A statistical distribution of wide applicability. J. Appl. Mech. 1951, 18, 293–297.
  51. Barndorff-Nielsen, O.E.; Jørgensen, B. Some parametric models on the simplex. J. Multivar. Anal. 1991, 39, 106–116.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
