1. Introduction
Frequently, in real life, we find continuous data on the real line that are asymmetrical; these data cannot be modeled by known symmetric distributions as the normal, Student-t, Cauchy, Laplace, and logistic distributions. It is therefore more interesting to propose more flexible models that will be useful for modeling highly skewed data which arises in several areas.
In this context, the seminal work in Azzalini [
1] introduces a skew-symmetric family of distributions, where this last is established by using a symmetric distribution as a kernel. When this last follows a normal distribution, it rises the well-know skew-normal (SN) distribution. The SN distribution has a skewness parameter which makes possible to have a reasonable model for a skewed distribution. Furthermore, the SN distributions include the normal distribution and possesses several properties which coincide or are similar to the ones of the normal distribution (Azzalini [
1,
2]). However, the SN distribution is limited in terms of flexibility, that is, for moderate values of the skewness parameter nearly all the mass accumulates either on the positive or negative real line, as determined by the sign of the skewness parameter. In such cases, the SN distribution closely resembles the half-normal density, with a nearly linear shape in the side with smaller mass (Arellano-Valle et al. [
3]).
Another alternative to model skewed data is using the family of power-symmetric distributions (see Pewsey et al. [
4]) of which the most widely used is the power-normal (PN) distribution. Some references where this family is discussed are Lehmann [
5], Durrans [
6], Gupta and Gupta [
7], Castillo et al. [
8], among others. In a series of papers by Martínez-Flórez et al. ([
9,
10,
11,
12,
13]) extensions and applications of the PN distribution can be found.
An unification of the SN and PN distributions was proposed by Martínez-Flórez et al. [
11], namely the power skew-normal (PSN) which is a generalization of the SN and PN distributions. Even though sample information about the SN distribution has been widely studied, there is not the same scope for the PSN distribution, which being a generalization of the first one, has characteristics of interest such as: (i) the SN and PN distributions as particular cases, and (ii) the PSN distribution provides greater range for skewness and kurtosis coefficients compared with the SN distribution (see
Table 1), being more flexible to model highly skewed data, which arises frequently in many practical situations. However, the expectation and variance of the PSN distribution cannot be expressed in closed form (have complicated forms), which makes these distributions unsuitable for regression modeling (Martínez-Flórez et al. [
11]). Fortunately, the cumulative distribution function (cdf) of the PSN distribution has a simple form that depends on Owen’s T function (to be defined in the next section). This facilitates the calculation of the quantile function (inverse of the cdf), allowing its utilization in the quantile regression (QR) framework. Quantile regression quantifies the association of the explanatory variables with a given quantile of a dependent variable. In this study, we propose a quantile linear regression model based on the PSN distribution, adopting a new parametrization of this model indexed by the quantile, precision and shape parameters. In particular, for this work, inference is conducted via maximum likelihood.
The rest of the paper proceeds as follows. In
Section 2, we introduce a new parameterization of the PSN distribution that is indexed by the location, precision and shape parameters and its association with a quantile regression model. In addition, elements related to the maximum likelihood (ML) method are presented as well.
Section 3 presents local influence measures under three different perturbation schemes, whereas in
Section 4 a real data analysis is conducted in order to show the applicability of our proposed reparametrized PSN (RPSN) based QR model. Final section summarizes the contributions of the paper.
2. A PSN Distribution Parameterized by Its Quantile Parameter, and Its Associated Quantile Regression Model
In this section, we briefly study the PSN distribution based on Martínez-Flórez et al. [
11]. We introduce a RPSN distribution which is characterized by its quantile, which allows us to use this distribution in the context of QR models.
The probability density function (pdf) of the PSN distribution is given by
where
,
and
and
denote the pdf and cdf of the (standard) skew normal model given by
where
and
denote the pdf and cdf of the standard normal distribution and
is the Owen’s
T function defined as
Moreover, the cdf of the PSN model is given by
Note that
and
corresponds to the very well known SN and PN models, respectively. The main advantage of the PSN model is that provides greater range for skewness and kurtosis coefficients compared with the SN and PN models.
Table 1 shows the range for those coefficients.
The
r-th moment of the distribution depends on the expected value of
,
, where
Y have beta distribution with shape parameters
and 1, respectively. For this reason, some interesting characteristic of the model, such as mean and variance, have cumbersome forms. On the other hand, quantiles of the model also need to be computed numerically since non-closed form are available for the distribution. For this reason, non-interpretation and useful reparametrizations can be performed for this model. Besides, as the Owen’s T function satisfies
, we note that
For this reason, if we consider the restriction
we have that
with
representing directly the
-th quantile of the distribution. For a fixed
and considering
as in (
1), we have a flexible model for quantile regression. This parametrization has not been proposed in the statistical literature. Hence, we can rewrite the PSN distribution according to the parameters
and
, whose cumulative distribution function is now given by
where the quantile
is assumed to be known. Hereafter, we use the notation
to indicate that
Y is a random variable following a restricted PSN distribution with quantile parameter
, precision parameter
, and shape parameter
.
Figure 1 shows the density function for the RPSN model with location and scale parameters fixed at 0 and 1, respectively. Note that in all the curves, the zero represents the specified quantile
. We also note that the curves are not necessarily symmetric for
(the median case).
Let
be the
n independent random variables, where each
, follows the
distribution with quantile parameter
, precision parameter
, and shape parameter
. Suppose that, for a given
, the location, precision and shape parameters for the RPSN satisfy the following functional relations
where
,
, are vectors of unknown regression coefficients which are assumed to be functionally independent,
, with
,
are the linear predictors, and
, are observations on
,
and
known regressors, for
. Moreover covariate matrices
are assumed to have rank
, for
. Link functions
,
and
in (
2) must be strictly monotone and at least twice differentiable, and
is also required to be a positive function. Such functions also satisfy that
,
and
, with
being the inverse function of
.
The log-likelihood function for
has the form
, where
The
score vector of the model is given by
where
,
,
,
,
, for
, with
. Such elements are specified in the
Appendix A.1 Section.
The Hessian for the model is
where
, for
, with
. Such elements are detailed in the
Appendix A.1.
The ML estimators , and of , and , respectively, can be obtained by solving simultaneously the nonlinear system of equations , where denotes a vector of zeros with dimension r. Unfortunately, it is not possible to obtain analytical expressions for the ML estimators above, so numerical methods for solving nonlinear equations system are required.
3. Local Influence
Global influence is related to case deletion, i.e, the effect of dropping a case from the dataset Cook [
14]. The likelihood distance (LD) is defined as LD
, where
is the ML estimate of
under a perturbed model related to
, a perturbation vector. Cook [
14] studied the LD
around the non-perturbed vector
such as
. The normal curvature for
at the direction of the orthonormal vector
is defined as
, where
is the Hessian of
evaluated at
and
and both,
and
are evaluated at
. Hence,
is the largest eigenvalue of
and
the corresponding orthonormal eigenvector. The index
plot of the matrix
suggests how to perturb the model (or data) to obtain large changes in the estimates of
.
For three common perturbation schemes we compute the matrix
where
.
3.1. Case Weights Perturbation
For this case, the perturbed log-likelihood function is defined as
, where
is defined in (
3) and
, for
. In this case,
and
3.2. Case Response Perturbation
We consider now an additive perturbation on the
ith response (say
by making
, where
and
is a scale factor. An usual consideration for such scale factor is
, with
denoting the sample standard deviation of
Y. Note that
. Therefore, under the scheme of response perturbation, the log-likelihood function is given by
, where
and
3.3. Case Continuous Covariate Perturbation
Consider an additive perturbation on a particular continuous covariate including on the quantile parameter, namely
, for
, by making
, where
is a scale factor. Again, a usual consideration is
, with
the sample standard deviation for
. Note that
. Then, under the scheme of response perturbation, the log-likelihood function is given by
, where
and
, with
a vector of dimension
with zeros, except in the
t-th element where is a one. Finally
4. Real Data Analysis
In this section, we present an application to 202 Australian athletes from the Australian Institute of Sport. Such data were discussed in Cook and Weisberg [
15]. In order to exemplify the proposed model, we consider the following quantile regression model:
, where
and
Here, the response bmi represents the body mass index, while the covariates lbm and sex represent the lean body mass and sex of the athletes, respectively. Note that
is not modeled by covariates and
was not included in the scale parameter because in preliminary analysis we found the coefficient related to such term was not significant (to any
). This same problem was illustrated in Galarza et al. [
16] with a class of skew distributions (SKD), but considering a regression scheme only in the quantile parameter. For comparison purpose, we considered the skewed normal (SKN) and skewed Student-t (SKT) models, that are models belonging to the SKD class. Additionally, we also considered the Gamma-Sinh Cauchy (GSC) model, including covariates only in the quantile parameter.
Table 2 shows the Akaike Information Criterion (AIC, Akaike; [
17]) for the referred models. Note that, except for
, the RPSN-QR model attached the minimum AIC for the considered quantiles.
Table 2 displays the MLEs with corresponding standard errors (SE) for the fitted proposed model for each
and
. Note that we have a positive relationship between the response variable (bmi) and lbm in all quantiles. We also observe that the quantile intercepts increases as
increases. Regarding the parameter
, the greater
, the greater the estimate of
.
Figure 2 shows point estimates and
confidence intervals (CIs) for model parameters under the RPSN-QR model for different quantiles. It can be seen that as
increases the coefficient of lean body mass and the coefficient of gender become larger. Moreover, bmi and lbm are significant in explaining all the quantile modeled in
.
Figure 3 presented the estimated quantiles
and
for the bmi in terms of lbm and the sex of the athlete.
We also present in
Table 4 the
p-value to validate the normality hypothesis based on the Kolmogorov–Smirnov (KS; Kolmogorov, [
18]) for the quantile residuals (Dunn and Smyth, [
19]) using different quantile
of such residuals. In all cases, the KS test did not reject the null hypothesis of normality. Therefore, the RPSN is appropriated to model all the quantile in this problem.
We also performed a local influence analysis.
Figure 4 shows such analysis under the three perturbation schemes discussed in
Section 3 for
. The
Appendix A.2 shows the analysis for other quantiles
and
. Note that observations 75, 162 and 178 are detected as potentially influent for all the mentioned quantiles and the observation 53 appears for the quantile
.
To check the impact on the inference of possible influential cases, we consider the relative change (RC), which is computed by removing the possible influential cases for each parameter and its SE as
where
is any component of the vector
, where
and
denote the ML estimate of
and its corresponding SE, respectively, after dropping the
i-th observation.
Table 5 shows such RC for the non-intercept regression coefficients when observations 53, 75, 162 and 178 are removed. Note that the RC is greater for the estimated parameters than its estimated SE. However, the significance of
and
is maintained whereas
is not significant with a 5%. More combinations of dropped observations are presented in the
Appendix A.2.
5. Concluding Remarks
Extending the quantile regression methods to include asymmetric response variables on the real line is promising area of research. In this paper, we have introduced a novel flexible parametric quantile regression model for asymmetric response variables, which can be very useful in modeling response variables on the real line at different quantiles. The proposed quantile regression model was built based on PSN distribution using a new parameterization of this distribution that is indexed by quantile, precision and shape parameters, in which a function of any quantile of the response variable is given by a linear predictor that is defined by regression parameters and explanatory variables. We consider a frequentist approach to estimate the model parameters, and the maximum likelihood inference is employed to estimate the model parameters. An application using a real dataset was presented and discussed. Results of the application showed that the model is adequate; it elaborately showed which covariates influence the response at different levels of quantiles. Finally, there are many possible extensions of the current work, for instance, mixtures of RPSN regression models in order to accommodate multimodality, a semi-parametric component to include a functional covariate to model nonlinearity of the response, and measurement errors, among others. An in-depth investigation of these topics is beyond the scope of this work, and will be considered elsewhere.