A Parametric Quantile Regression Model for Asymmetric Response Variables on the Real Line

Diego I. Gallardo; Marcelo Bourguignon; Christian E. Galarza; Héctor W. Gómez

doi:10.3390/sym12121938

¹ Departamento de Matemática, Facultad de Ingeniería, Universidad de Atacama, Copiapó 1530000, Chile

² Departamento de Estatística, Universidade Federal do Rio Grande do Norte, Natal, RN 59000-000, Brazil

³ Departamento de Matemáticas, Escuela Superior Politécnica del Litoral, ESPOL, Guayaquil 090150, Ecuador

⁴ Departamento de Matemáticas, Facultad de Ciencias Básicas, Universidad de Antofagasta, Antofagasta 1240000, Chile

Symmetry 2020, 12(12), 1938; https://doi.org/10.3390/sym12121938

This article belongs to the Special Issue Symmetric and Asymmetric Distributions: Theoretical Developments and Applications II

Abstract

In this paper, we introduce a novel parametric quantile regression model for asymmetric response variables, where the response variable follows a power skew-normal distribution. By considering a new convenient parametrization, this distribution becomes very useful for modeling different quantiles of a response variable on the real line. The maximum likelihood method is employed to estimate the model parameters. In addition, we present a local influence study under different perturbation settings. Some numerical results on the behavior of the estimators in finite samples are illustrated. In order to illustrate the practical potential of our model, we apply it to a real dataset.

Keywords: asymmetric data; parametric inference; quantile regression model; skew-normal distributions

1. Introduction

Frequently, in real life, we find continuous data on the real line that are asymmetric; these data cannot be modeled adequately by well-known symmetric distributions such as the normal, Student-t, Cauchy, Laplace, and logistic distributions. It is therefore of interest to propose more flexible models for the highly skewed data that arise in several areas.

In this context, the seminal work of Azzalini [] introduces a skew-symmetric family of distributions, which is built by using a symmetric distribution as a kernel. When the kernel is a normal distribution, it gives rise to the well-known skew-normal (SN) distribution. The SN distribution has a skewness parameter, which makes it a reasonable model for skewed data. Furthermore, the SN family includes the normal distribution and possesses several properties which coincide with or are similar to those of the normal distribution (Azzalini [,]). However, the SN distribution is limited in terms of flexibility: for moderate values of the skewness parameter, nearly all the mass accumulates either on the positive or on the negative real line, as determined by the sign of the skewness parameter. In such cases, the SN distribution closely resembles the half-normal density, with a nearly linear shape on the side with smaller mass (Arellano-Valle et al. []).

Another alternative for modeling skewed data is the family of power-symmetric distributions (see Pewsey et al. []), of which the most widely used is the power-normal (PN) distribution. Some references where this family is discussed are Lehmann [], Durrans [], Gupta and Gupta [], and Castillo et al. [], among others. Extensions and applications of the PN distribution can be found in a series of papers by Martínez-Flórez et al. ([,,,,]).

A unification of the SN and PN distributions was proposed by Martínez-Flórez et al. [], namely the power skew-normal (PSN) distribution, which generalizes both. Even though the SN distribution has been widely studied, the same is not true of the PSN distribution, which, being a generalization of the former, has characteristics of interest such as: (i) it contains the SN and PN distributions as particular cases, and (ii) it provides a greater range for the skewness and kurtosis coefficients than the SN distribution (see Table 1), making it more flexible for modeling highly skewed data, which arise frequently in practice. However, the expectation and variance of the PSN distribution cannot be expressed in closed form, which makes this distribution unsuitable for regression modeling of the mean (Martínez-Flórez et al. []). Fortunately, the cumulative distribution function (cdf) of the PSN distribution has a simple form that depends on Owen’s T function (to be defined in the next section). This facilitates the calculation of the quantile function (the inverse of the cdf), allowing its use in the quantile regression (QR) framework. Quantile regression quantifies the association of the explanatory variables with a given quantile of a dependent variable. In this study, we propose a quantile linear regression model based on the PSN distribution, adopting a new parametrization of this model indexed by the quantile, precision and shape parameters. In this work, inference is conducted via maximum likelihood.

Table 1. Range for skewness and kurtosis coefficients for SN, PN and PSN models.

The rest of the paper proceeds as follows. In Section 2, we introduce a new parameterization of the PSN distribution that is indexed by the location, precision and shape parameters, together with its associated quantile regression model. Elements related to the maximum likelihood (ML) method are presented as well. Section 3 presents local influence measures under three different perturbation schemes, whereas in Section 4 a real data analysis is conducted in order to show the applicability of our proposed reparametrized PSN (RPSN) based QR model. The final section summarizes the contributions of the paper.

2. A PSN Distribution Parameterized by Its Quantile Parameter, and Its Associated Quantile Regression Model

In this section, we briefly study the PSN distribution based on Martínez-Flórez et al. []. We introduce an RPSN distribution which is characterized by its quantile, which allows us to use this distribution in the context of QR models.

The probability density function (pdf) of the PSN distribution is given by

\begin{matrix} f (y; θ) = \frac{α}{σ} ϕ_{λ} (z) {[Φ_{λ} (z)]}^{α - 1}, \end{matrix}

where $θ = {(μ, σ, λ, α)}^{⊤}$, $z = (y - μ) / σ$, and $ϕ_{λ} (\cdot)$ and $Φ_{λ} (\cdot)$ denote the pdf and cdf of the (standard) skew-normal model, given by

ϕ_{λ} (y) = 2 ϕ (y) Φ (λ y) and Φ_{λ} (y) = \int_{- \infty}^{y} ϕ_{λ} (t) d t = Φ (y) - 2 T (y, λ),

where $ϕ (\cdot)$ and $Φ (\cdot)$ denote the pdf and cdf of the standard normal distribution and $T (\cdot, \cdot)$ is Owen’s T function, defined as

T (y, λ) = \frac{1}{2 π} \int_{0}^{λ} \frac{e^{- \frac{1}{2} y^{2} (1 + t^{2})}}{1 + t^{2}} d t .

Moreover, the cdf of the PSN model is given by

\begin{matrix} F (y; θ) = {[Φ_{λ} (z)]}^{α} = {[Φ (z) - 2 T (z, λ)]}^{α} . \end{matrix}
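The pdf and cdf above can be evaluated numerically with nothing beyond the standard library. The following is a minimal sketch, not the authors' implementation: Owen’s T function is approximated by a composite Simpson rule (an assumption made here for self-containment), and all function names are hypothetical. Evaluation far in a thin tail, where $Φ (z)$ and $2 T (z, λ)$ nearly cancel, is not handled carefully.

```python
import math

def norm_pdf(x):
    """Standard normal pdf, phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    """Standard normal cdf, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def owens_t(h, a, n=400):
    """Owen's T(h, a) by composite Simpson's rule on [0, a]; n must be even."""
    if a == 0.0:
        return 0.0
    step = a / n  # negative when a < 0, which gives T(h, -a) = -T(h, a)

    def integrand(t):
        return math.exp(-0.5 * h * h * (1.0 + t * t)) / (1.0 + t * t)

    total = integrand(0.0) + integrand(a)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * integrand(k * step)
    return total * step / 3.0 / (2.0 * math.pi)

def sn_pdf(z, lam):
    """Standard skew-normal pdf: 2 phi(z) Phi(lam z)."""
    return 2.0 * norm_pdf(z) * norm_cdf(lam * z)

def sn_cdf(z, lam):
    """Standard skew-normal cdf: Phi(z) - 2 T(z, lam)."""
    # max() guards against tiny negative round-off in the far thin tail.
    return max(norm_cdf(z) - 2.0 * owens_t(z, lam), 0.0)

def psn_pdf(y, mu, sigma, lam, alpha):
    """PSN pdf: (alpha / sigma) phi_lam(z) [Phi_lam(z)]^(alpha - 1)."""
    z = (y - mu) / sigma
    return alpha / sigma * sn_pdf(z, lam) * sn_cdf(z, lam) ** (alpha - 1.0)

def psn_cdf(y, mu, sigma, lam, alpha):
    """PSN cdf: [Phi_lam(z)]^alpha."""
    z = (y - mu) / sigma
    return sn_cdf(z, lam) ** alpha
```

With `alpha = 1` and `lam = 0` both functions reduce to the normal pdf and cdf, which gives a quick sanity check of the implementation.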

Note that $α = 1$ and $λ = 0$ correspond to the well-known SN and PN models, respectively. The main advantage of the PSN model is that it provides a greater range for the skewness and kurtosis coefficients than the SN and PN models. Table 1 shows the range of those coefficients.

The r-th moment of the distribution depends on the expected value of ${[Φ_{λ}^{- 1} (Y)]}^{s}$, $s = 1, \dots, r$, where Y has a beta distribution with shape parameters $α$ and 1, respectively. For this reason, some interesting characteristics of the model, such as the mean and variance, have cumbersome forms. On the other hand, quantiles of the model also need to be computed numerically, since no closed form is available for them. For this reason, no interpretable and useful reparametrization of this model has been available. However, as Owen’s T function satisfies $T (0, λ) = {(2 π)}^{- 1} \arctan (λ)$, we note that

\begin{matrix} F (μ; θ) = {[\frac{1}{2} - \frac{1}{π} arctan (λ)]}^{α} . \end{matrix}

For this reason, if we consider the restriction

\begin{matrix} α = α (λ, τ) = \frac{log (τ)}{log (\frac{1}{2} - \frac{1}{π} arctan (λ))}, \end{matrix}

(1)

we have that $F (μ; θ) = τ$, with $μ$ representing directly the $τ$-th quantile of the distribution. For a fixed $τ$ and considering $α (λ, τ)$ as in (1), we have a flexible model for quantile regression. This parametrization has not been proposed in the statistical literature. Hence, we can rewrite the PSN distribution according to the parameters $μ, σ$ and $λ$, whose cumulative distribution function is now given by

\begin{matrix} F (y; μ, σ, λ) = {[Φ (z) - 2 T (z, λ)]}^{\frac{log (τ)}{log (\frac{1}{2} - \frac{1}{π} arctan (λ))}}, \end{matrix}

where the quantile level $τ \in (0, 1)$ is assumed to be known. Hereafter, we use the notation $Y \sim RPSN (μ, σ, λ)$ to indicate that Y is a random variable following a reparametrized PSN distribution with quantile parameter $μ$, precision parameter $σ$, and shape parameter $λ$. Figure 1 shows the density function for the RPSN model with location and scale parameters fixed at 0 and 1, respectively. Note that in all the curves, zero is the specified $τ$-th quantile. We also note that the curves are not necessarily symmetric for $τ = 0.5$ (the median case).
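As a quick numerical check of restriction (1), the sketch below computes $α (λ, τ)$ and confirms that the cdf evaluated at $μ$ returns $τ$. It relies only on the identity $Φ_{λ} (0) = \frac{1}{2} - \frac{1}{π} \arctan (λ)$ stated above; the function names are hypothetical and chosen for illustration.

```python
import math

def alpha_restriction(lam, tau):
    """Power parameter alpha(lam, tau) from restriction (1)."""
    base = 0.5 - math.atan(lam) / math.pi  # Phi_lam(0), always in (0, 1)
    return math.log(tau) / math.log(base)

def rpsn_cdf_at_mu(lam, tau):
    """F(mu) = [Phi_lam(0)]^alpha(lam, tau); should equal tau."""
    base = 0.5 - math.atan(lam) / math.pi
    return base ** alpha_restriction(lam, tau)

# Check the quantile property over a small grid of shapes and levels.
for lam in (-5.0, -0.5, 0.0, 1.5, 5.0):
    for tau in (0.1, 0.5, 0.9):
        assert abs(rpsn_cdf_at_mu(lam, tau) - tau) < 1e-12
```

Because both logarithms in (1) are negative, $α (λ, τ)$ is positive for every $λ$ and every $τ \in (0, 1)$. For $λ = 0$ and $τ = 0.5$ the restriction gives $α = 1$, so the RPSN model reduces to the normal distribution with median $μ$.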

Figure 1. Pdf of the RPSN$(μ = 0, σ = 1, λ)$ model for different values of $λ$: $τ = 0.1$ (left panel); $τ = 0.5$ (center panel); $τ = 0.9$ (right panel). Values of $λ$ are: $-5$ (black line), $-1.5$ (red line), $-0.5$ (blue line), 0 (green line), $0.5$ (orange line), $1.5$ (magenta line) and 5 (purple line).

Let $Y_{1}, \dots, Y_{n}$ be n independent random variables, where each $Y_{i}$, $i = 1, \dots, n$, follows the RPSN distribution with quantile parameter $μ_{i}$, precision parameter $σ_{i}$, and shape parameter $λ_{i}$. Suppose that, for a given $τ \in (0, 1)$, the location, precision and shape parameters of the RPSN satisfy the following functional relations

\begin{matrix} g_{1} (μ_{i} (τ)) & = η_{i 1} (τ) = x_{i 1}^{⊤} β_{1} (τ), \\ g_{2} (σ_{i} (τ)) & = η_{i 2} (τ) = x_{i 2}^{⊤} β_{2} (τ) and \\ g_{3} (λ_{i} (τ)) & = η_{i 3} (τ) = x_{i 3}^{⊤} β_{3} (τ), \end{matrix}

(2)

where $β_{j} (τ) = {(β_{j 1} (τ), \dots, β_{j p_{j}} (τ))}^{⊤}$, $j = 1, 2, 3$, are vectors of unknown regression coefficients which are assumed to be functionally independent, $β_{j} (τ) \in R^{p_{j}}$, with $p_{1} + p_{2} + p_{3} < n$, $η_{i j} (τ)$ are the linear predictors, and $x_{i j} = {(x_{i j 1}, \dots, x_{i j p_{j}})}^{⊤}$ are observations on $p_{1}$, $p_{2}$ and $p_{3}$ known regressors, for $i = 1, \dots, n$. Moreover, the covariate matrices $X_{j} = {(x_{1 j}, \dots, x_{n j})}^{⊤}$ are assumed to have rank $p_{j}$, for $j = 1, 2, 3$. The link functions $g_{1} : R \to R$, $g_{2} : R^{+} \to R$ and $g_{3} : R \to R$ in (2) must be strictly monotone and at least twice differentiable, and $g_{2}^{- 1}$ is also required to be a positive function. Such functions also satisfy $μ_{i} = g_{1}^{- 1} (x_{i 1}^{⊤} β_{1})$, $σ_{i} = g_{2}^{- 1} (x_{i 2}^{⊤} β_{2})$ and $λ_{i} = g_{3}^{- 1} (x_{i 3}^{⊤} β_{3})$, with $g_{j}^{- 1} (\cdot)$ being the inverse function of $g_{j} (\cdot)$.

The log-likelihood function for $θ = θ (τ) = (β_{1} (τ), β_{2} (τ), β_{3} (τ))$ has the form $ℓ (θ) = \sum_{i = 1}^{n} ℓ_{i}$, where

\begin{matrix} ℓ_{i} = ℓ (z_{i}, μ_{i}, σ_{i}, λ_{i}) & = & log α (λ_{i}, τ) - log (σ_{i}) + log [ϕ_{λ_{i}} (z_{i})] + [α (λ_{i}, τ) - 1] log [Φ_{λ_{i}} (z_{i})] . \end{matrix}

(3)
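The contribution $ℓ_{i}$ in (3) translates directly into code. The sketch below is an illustration under stated assumptions rather than the authors' software: Owen’s T function is approximated by Simpson's rule, the helper names are hypothetical, and observations lying far in a thin tail (where $Φ (z) - 2 T (z, λ)$ loses precision) would need a more careful evaluation.

```python
import math

def _norm_cdf(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def _owens_t(h, a, n=400):
    """Owen's T(h, a) by composite Simpson's rule on [0, a]; n must be even."""
    if a == 0.0:
        return 0.0
    step = a / n

    def integrand(t):
        return math.exp(-0.5 * h * h * (1.0 + t * t)) / (1.0 + t * t)

    total = integrand(0.0) + integrand(a)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * integrand(k * step)
    return total * step / 3.0 / (2.0 * math.pi)

def rpsn_loglik_i(y, mu, sigma, lam, tau):
    """Log-likelihood contribution of one observation, as in (3)."""
    z = (y - mu) / sigma
    alpha = math.log(tau) / math.log(0.5 - math.atan(lam) / math.pi)
    # log phi_lam(z) = log 2 + log phi(z) + log Phi(lam z)
    log_sn_pdf = (math.log(2.0) - 0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
                  + math.log(_norm_cdf(lam * z)))
    # log Phi_lam(z) = log[Phi(z) - 2 T(z, lam)]
    log_sn_cdf = math.log(_norm_cdf(z) - 2.0 * _owens_t(z, lam))
    return (math.log(alpha) - math.log(sigma) + log_sn_pdf
            + (alpha - 1.0) * log_sn_cdf)

def rpsn_loglik(ys, mus, sigmas, lams, tau):
    """Total log-likelihood: the sum of the n individual contributions."""
    return sum(rpsn_loglik_i(y, m, s, l, tau)
               for y, m, s, l in zip(ys, mus, sigmas, lams))
```

In a regression setting, `mus`, `sigmas` and `lams` would be obtained from the linear predictors in (2) through the inverse link functions before calling `rpsn_loglik`.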

The $(p_{1} + p_{2} + p_{3}) \times 1$ score vector of the model is given by

\begin{matrix} \dot{ℓ} (θ) = (\begin{matrix} \frac{\partial ℓ (θ)}{\partial β_{1}} \\ \frac{\partial ℓ (θ)}{\partial β_{2}} \\ \frac{\partial ℓ (θ)}{\partial β_{3}} \end{matrix}) = (\begin{matrix} X_{1}^{⊤} W_{β_{1}}^{1 / 2} {\dot{ℓ}}_{μ} \\ X_{2}^{⊤} W_{β_{2}}^{1 / 2} {\dot{ℓ}}_{σ} \\ X_{3}^{⊤} W_{β_{3}}^{1 / 2} {\dot{ℓ}}_{λ} \end{matrix}), \end{matrix}

(4)

where $W_{β_{j}} = diag (w_{β_{j 1}}, \dots, w_{β_{j n}})$, $w_{β_{1 i}} = {(\partial μ_{i} / \partial η_{i 1})}^{2}$, $w_{β_{2 i}} = {(\partial σ_{i} / \partial η_{i 2})}^{2}$, $w_{β_{3 i}} = {(\partial λ_{i} / \partial η_{i 3})}^{2}$, ${\dot{ℓ}}_{ξ} = ({\dot{ℓ}}_{ξ_{1}}, \dots, {\dot{ℓ}}_{ξ_{n}})$, for $ξ \in \{μ, σ, λ\}$, with ${\dot{ℓ}}_{ξ_{i}} = \partial ℓ (μ_{i}, σ_{i}, λ_{i}) / \partial ξ_{i}$. Such elements are specified in Appendix A.1.

The Hessian for the model is

\begin{matrix} H (θ) & = (\begin{matrix} H_{β_{1} β_{1}} & H_{β_{1} β_{2}} & H_{β_{1} β_{3}} \\ \cdot & H_{β_{2} β_{2}} & H_{β_{2} β_{3}} \\ \cdot & \cdot & H_{β_{3} β_{3}} \end{matrix}) \\ = (\begin{matrix} X_{1}^{⊤} {\ddot{ℓ}}_{μ μ} W_{β_{1}} X_{1} & X_{1}^{⊤} {\ddot{ℓ}}_{μ σ} W_{β_{1}}^{1 / 2} W_{β_{2}}^{1 / 2} X_{2} & X_{1}^{⊤} {\ddot{ℓ}}_{μ λ} W_{β_{1}}^{1 / 2} W_{β_{3}}^{1 / 2} X_{3} \\ \cdot & X_{2}^{⊤} {\ddot{ℓ}}_{σ σ} W_{β_{2}} X_{2} & X_{2}^{⊤} {\ddot{ℓ}}_{σ λ} W_{β_{2}}^{1 / 2} W_{β_{3}}^{1 / 2} X_{3} \\ \cdot & \cdot & X_{3}^{⊤} {\ddot{ℓ}}_{λ λ} W_{β_{3}} X_{3} \end{matrix}), \end{matrix}

(5)

where ${\ddot{ℓ}}_{ξ ξ^{'}} = diag ({\ddot{ℓ}}_{ξ_{1} ξ_{1}^{'}}, \dots, {\ddot{ℓ}}_{ξ_{n} ξ_{n}^{'}})$, for $ξ, ξ^{'} \in \{μ, σ, λ\}$, with ${\ddot{ℓ}}_{ξ_{i} ξ_{i}^{'}} = \partial^{2} ℓ_{i} / \partial ξ_{i} \partial ξ_{i}^{'}$. Such elements are detailed in Appendix A.1.

The ML estimators ${\hat{β}}_{1} (τ)$, ${\hat{β}}_{2} (τ)$ and ${\hat{β}}_{3} (τ)$ of $β_{1} (τ)$, $β_{2} (τ)$ and $β_{3} (τ)$, respectively, can be obtained by simultaneously solving the nonlinear system of equations $\dot{ℓ} (θ) = 0_{p_{1} + p_{2} + p_{3}}$, where $0_{r}$ denotes a vector of zeros with dimension r. Unfortunately, it is not possible to obtain analytical expressions for the ML estimators above, so numerical methods for solving systems of nonlinear equations are required.
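The text does not state which numerical method is used. One standard option, given the score in (4) and the Hessian in (5), is the Newton–Raphson iteration sketched below, where $θ^{(0)}$ is a user-supplied starting value (an assumption of this sketch) and the iteration stops when successive iterates are sufficiently close:

```latex
θ^{(k + 1)} = θ^{(k)} - {[H (θ^{(k)})]}^{- 1} \dot{ℓ} (θ^{(k)}), \qquad k = 0, 1, 2, \dots
```

Quasi-Newton routines that approximate $H (θ)$ from successive score evaluations are a common alternative when the Hessian is expensive or numerically unstable.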

3. Local Influence

Global influence is related to case deletion, i.e., the effect of dropping a case from the dataset (Cook []). Local influence, in contrast, assesses the effect of small perturbations of the model or the data. The likelihood distance (LD) is defined as LD$(ω) = 2 [ℓ (\hat{θ}) - ℓ (\hat{θ} (ω), ω)]$, where $\hat{θ} (ω)$ is the ML estimate of $θ$ under a perturbed model related to $ω = {(ω_{1}, \dots, ω_{n})}^{⊤}$, a perturbation vector. Cook [] studied LD$(ω)$ around the non-perturbed vector $ω_{0}$, for which $\hat{θ} (ω_{0}) = \hat{θ}$. The normal curvature for $\hat{θ}$ in the direction of a unit vector $d$ ($| | d | | = 1$) is defined as $C_{d} (\hat{θ}) = 2 | d^{⊤} Δ_{ω}^{⊤} {\ddot{ℓ}}_{\hat{θ} \hat{θ}}^{- 1} Δ_{ω} d |$, where ${\ddot{ℓ}}_{\hat{θ} \hat{θ}}$ is the Hessian of $ℓ (θ)$ evaluated at $θ = \hat{θ}$ and $Δ_{ω} = \partial^{2} ℓ (θ, ω) / \partial θ \partial ω^{⊤}$, with both $Δ_{ω}$ and ${\ddot{ℓ}}_{\hat{θ} \hat{θ}}$ evaluated at $θ = \hat{θ}$ and $ω = ω_{0}$. Hence, $C_{d_{\max}}$ is the largest eigenvalue of $B = Δ_{ω_{0}}^{⊤} {\ddot{ℓ}}_{\hat{θ} \hat{θ}}^{- 1} Δ_{ω_{0}}$ and $d_{\max}$ the corresponding unit eigenvector. The index plot of $d_{\max}$ suggests how to perturb the model (or data) to obtain large changes in the estimates of $θ$.
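Once the symmetric matrix $B$ has been assembled, the direction $d_{\max}$ can be approximated by power iteration, which converges to the eigenvector whose eigenvalue is largest in modulus. The sketch below is a generic illustration with a hypothetical function name; it assumes $B$ is supplied as a list of rows and that the starting vector is not orthogonal to $d_{\max}$.

```python
import math

def dominant_eigenpair(B, iters=1000, tol=1e-12):
    """Power iteration for a symmetric matrix B given as a list of rows.

    Returns (eigenvalue, unit eigenvector) for the eigenvalue of largest
    modulus, using the Rayleigh quotient as the eigenvalue estimate.
    """
    n = len(B)
    d = [1.0 / math.sqrt(n)] * n  # starting direction (assumed non-orthogonal)
    eig = 0.0
    for _ in range(iters):
        w = [sum(B[i][j] * d[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # d lies in the null space of B; nothing more to do
        new_d = [x / norm for x in w]
        Bd = [sum(B[i][j] * new_d[j] for j in range(n)) for i in range(n)]
        new_eig = sum(new_d[i] * Bd[i] for i in range(n))
        converged = abs(new_eig - eig) < tol
        d, eig = new_d, new_eig
        if converged:
            break
    return eig, d

# Toy 2 x 2 example (illustrative numbers, not from the athletes data).
eigenvalue, direction = dominant_eigenpair([[2.0, 1.0], [1.0, 2.0]])
```

Plotting the absolute entries of the returned direction against the observation index gives the index plot described above; entries that stand out point to cases whose perturbation changes the estimates the most.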

For three common perturbation schemes we compute the matrix

Δ_{ω} = \frac{\partial^{2} ℓ (θ, ω)}{\partial θ \partial ω} = {(\begin{matrix} Δ_{ω, β_{1}}^{⊤} & Δ_{ω, β_{2}}^{⊤} & Δ_{ω, β_{3}}^{⊤} \end{matrix})}^{⊤},

where $Δ_{ω, β_{j}}^{⊤} = \frac{\partial^{2} ℓ (θ, ω)}{\partial β_{j} \partial ω}$, for $j = 1, 2, 3$.

3.1. Case Weights Perturbation

For this case, the perturbed log-likelihood function is defined as $ℓ (θ, ω) = \sum_{i = 1}^{n} ω_{i} ℓ_{i}$, where $ℓ_{i} = ℓ (z_{i}, μ_{i}, σ_{i}, λ_{i})$ is defined in (3) and $0 \leq ω_{i} \leq 1$, for $i = 1, \dots, n$. In this case, $ω_{0} = (1, \dots, 1)$ and

Δ_{ω} = {(\begin{matrix} X_{1}^{⊤} W_{β_{1}}^{1 / 2} {\dot{ℓ}}_{μ}^{⊤} & X_{2}^{⊤} W_{β_{2}}^{1 / 2} {\dot{ℓ}}_{σ}^{⊤} & X_{3}^{⊤} W_{β_{3}}^{1 / 2} {\dot{ℓ}}_{λ}^{⊤} \end{matrix})}^{⊤} .

3.2. Case Response Perturbation

We now consider an additive perturbation on the i-th response, say $y_{i} (ω_{i})$, by making $y_{i} (ω_{i}) = y_{i} + ω_{i} S_{Y_{i}}$, where $ω_{i} \in R$ and $S_{Y_{i}}$ is a scale factor. A usual choice for such scale factor is $S_{Y_{i}} = S_{Y}$, with $S_{Y}$ denoting the sample standard deviation of Y. Note that $ω_{0} = (0, \dots, 0)$. Therefore, under the scheme of response perturbation, the log-likelihood function is given by $ℓ (θ, ω) = \sum_{i = 1}^{n} ℓ (z_{i} (ω_{i}), μ_{i}, σ_{i}, λ_{i})$, where $z_{i} (ω_{i}) = (y_{i} (ω_{i}) - μ_{i}) / σ_{i}$ and

Δ_{ω} = S_{Y} {(\begin{matrix} X_{1}^{⊤} W_{β_{1}}^{1 / 2} {\ddot{ℓ}}_{μ μ}^{⊤} & X_{2}^{⊤} W_{β_{2}}^{1 / 2} {\ddot{ℓ}}_{μ σ}^{⊤} & X_{3}^{⊤} W_{β_{3}}^{1 / 2} {\ddot{ℓ}}_{μ λ}^{⊤} \end{matrix})}^{⊤} |_{_{z_{i} = z_{i} (ω_{i})}} .

3.3. Case Continuous Covariate Perturbation

Consider an additive perturbation on a particular continuous covariate included in the quantile parameter, namely $x_{t}$, for $t \in \{1, \dots, p_{1}\}$, by making $x_{i t} (ω_{i}) = x_{i t} + ω_{i} S_{X_{i t}}$, where $S_{X_{i t}}$ is a scale factor. Again, a usual choice is $S_{X_{i t}} = S_{X_{t}}$, with $S_{X_{t}}$ the sample standard deviation of $X_{t}$. Note that $ω_{0} = (0, \dots, 0)$. Then, under the scheme of covariate perturbation, the log-likelihood function is given by $ℓ (θ, ω) = \sum_{i = 1}^{n} ℓ (z_{i}, μ_{i} (ω_{i}), σ_{i}, λ_{i})$, where $μ_{i} (ω_{i}) = g_{1}^{- 1} (x_{i 1}^{⊤} (ω_{i}) β_{1})$ and $x_{i 1}^{⊤} (ω_{i}) = x_{i 1}^{⊤} + ω_{i} S_{X_{i t}} J_{t}$, with $J_{t}$ a vector of dimension $p_{1}$ with zeros, except in the t-th element, which is equal to one. Finally,

Δ_{ω} = S_{X_{t}} {(\begin{matrix} diag (J_{t}) X_{1}^{⊤} W_{β_{1}}^{1 / 2} {\ddot{ℓ}}_{μ μ}^{⊤} & diag (J_{t}) X_{2}^{⊤} W_{β_{2}}^{1 / 2} {\ddot{ℓ}}_{μ σ}^{⊤} & diag (J_{t}) X_{3}^{⊤} W_{β_{3}}^{1 / 2} {\ddot{ℓ}}_{μ λ}^{⊤} \end{matrix})}^{⊤} |_{_{z_{i} = z_{i} (ω_{i})}} .

4. Real Data Analysis

In this section, we present an application to data on 202 Australian athletes from the Australian Institute of Sport. Such data were discussed in Cook and Weisberg []. In order to exemplify the proposed model, we consider the following quantile regression model: ${bmi}_{i} (τ) = μ_{i} (τ) + σ_{i} (τ) ϵ_{i} (τ)$, where $ϵ_{i} (τ) \sim PSN (0, 1, λ, τ)$ and

\begin{matrix} μ_{i} (τ) & = β_{10} (τ) + β_{11} (τ) {lbm}_{i} + β_{12} (τ) {sex}_{i} \\ σ_{i} (τ) & = β_{20} (τ) + β_{21} (τ) {lbm}_{i} . \end{matrix}

Here, the response bmi represents the body mass index, while the covariates lbm and sex represent the lean body mass and sex of the athletes, respectively. Note that $λ$ is not modeled by covariates and sex was not included in the scale parameter because, in a preliminary analysis, we found that the coefficient related to this term was not significant (for any $τ \in (0, 1)$). This same problem was illustrated in Galarza et al. [] with a class of skew distributions (SKD), but considering a regression structure only in the quantile parameter. For comparison purposes, we considered the skewed normal (SKN) and skewed Student-t (SKT) models, which belong to the SKD class. Additionally, we also considered the Gamma-Sinh Cauchy (GSC) model, including covariates only in the quantile parameter. Table 2 shows the Akaike Information Criterion (AIC; Akaike []) for the referred models. Note that, except for $τ = 0.25$, the RPSN-QR model attained the minimum AIC for the considered quantiles.

Table 2. AIC criterion for different models parameterized in terms of the quantile.

Table 3 displays the MLEs with corresponding standard errors (SE) for the fitted proposed model for $τ = 0.10, 0.50$ and $0.90$. Note that we have a positive relationship between the response variable (bmi) and lbm in all quantiles. We also observe that the quantile intercept increases as $τ$ increases. Regarding the parameter $λ$, the greater $τ$, the greater the estimate of $λ$.

Figure 2 shows point estimates and $95 \%$ confidence intervals (CIs) for the model parameters under the RPSN-QR model for different quantiles. It can be seen that, as $τ$ increases, the coefficients of lean body mass and sex become larger. Moreover, lbm and sex are significant in explaining all the quantiles modeled through $μ_{i}$. Figure 3 presents the estimated quantiles $0.10, 0.25, 0.50, 0.75$ and $0.90$ of bmi in terms of lbm and the sex of the athlete.

Figure 2. Athletes dataset: Point estimates (center line) and 95% confidence intervals (CIs) for model parameters under RPSN-QR model.

Figure 3. Data analysis: Fitted RPSN-QR model lines for the response (left panel for males, center panel for females) and scale parameter (right panel) over the grid $τ \in \{0.10, 0.25, 0.50, 0.75, 0.90\}$.

We also present in Table 4 the p-values of the Kolmogorov–Smirnov test (KS; Kolmogorov, []) of the normality hypothesis for the quantile residuals (Dunn and Smyth, []) for different quantiles $τ$. In all cases, the KS test did not reject the null hypothesis of normality. Therefore, the RPSN model is appropriate for modeling all the quantiles in this problem.
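For a continuous response, the quantile residual of Dunn and Smyth is $r_{i} = Φ^{- 1} (F (y_{i}; \hat{θ}))$, which is approximately standard normal when the fitted model is adequate. The sketch below illustrates the transformation for any fitted cdf passed as a callable; the function name is hypothetical and, to keep the example short, a normal cdf stands in for the fitted RPSN cdf.

```python
from statistics import NormalDist

def quantile_residuals(ys, cdf):
    """Quantile residuals r_i = Phi^{-1}(F(y_i)) for a continuous response.

    `cdf` is any callable returning F(y) strictly inside (0, 1);
    NormalDist.inv_cdf raises StatisticsError outside that range.
    """
    std_normal = NormalDist()
    return [std_normal.inv_cdf(cdf(y)) for y in ys]

# Stand-in example: if the data really follow N(10, 2^2), the residuals
# are simply the standardized observations.
fitted = NormalDist(mu=10.0, sigma=2.0)
residuals = quantile_residuals([8.0, 10.0, 12.0], fitted.cdf)
```

The resulting residuals can then be passed to any normality test, such as the KS test used here.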

Table 4. p-values of the K-S normality test for residuals under our RPSN-QR model for the athletes dataset for different quantiles $τ$.

We also performed a local influence analysis. Figure 4 shows such analysis under the three perturbation schemes discussed in Section 3 for $τ = 0.5$. Appendix A.2 shows the analysis for the other quantiles, $τ = 0.1, 0.25, 0.75$ and $0.9$. Note that observations 75, 162 and 178 are detected as potentially influential for all the mentioned quantiles, and observation 53 appears for the quantile $0.9$.

Figure 4. Index plots for $C_{i} ({\hat{β}}_{1})$ (left), $C_{i} ({\hat{β}}_{2})$ (center) and $C_{i} ({\hat{β}}_{3})$ (right) under the weight perturbation (upper), response perturbation (center) and covariate perturbation (lower) schemes for the RPSN model for $τ = 0.5$.

To check the impact on the inference of possible influential cases, we consider the relative change (RC), which is computed by removing the possible influential cases for each parameter and its SE as

R C_{θ_{j (i)}} = 100 % \times |\frac{{\hat{θ}}_{j} - {\hat{θ}}_{j (i)}}{{\hat{θ}}_{j}}| and R C_{S E (θ_{j (i)})} = 100 % \times |\frac{S E ({\hat{θ}}_{j}) - S E ({\hat{θ}}_{j (i)})}{S E ({\hat{θ}}_{j})}|,

where $θ_{j}$ is any component of the vector $θ = θ (τ)$, and ${\hat{θ}}_{j (i)}$ and $S E ({\hat{θ}}_{j (i)})$ denote the ML estimate of $θ_{j}$ and its corresponding SE, respectively, after dropping the i-th observation. Table 5 shows such RCs for the non-intercept regression coefficients when observations 53, 75, 162 and 178 are removed. Note that the RC is greater for the estimated parameters than for their estimated SEs. However, the significance of $β_{11} (τ)$ and $β_{12} (τ)$ is maintained, whereas $β_{21} (τ)$ is not significant at the 5% level. More combinations of dropped observations are presented in Appendix A.2.
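The relative change defined above is a one-line computation. The sketch below uses a hypothetical function name and made-up numbers purely to illustrate the formula; they are not estimates from the athletes dataset.

```python
def relative_change(full, dropped):
    """RC in percent: 100 * |(full - dropped) / full|.

    `full` is the estimate (or SE) from all observations and `dropped`
    the same quantity after removing the candidate influential cases.
    """
    return 100.0 * abs((full - dropped) / full)

# Illustrative values only: an estimate moving from 2.0 to 1.5.
rc_estimate = relative_change(2.0, 1.5)
```

The same function applies to the standard errors by passing $S E ({\hat{θ}}_{j})$ and $S E ({\hat{θ}}_{j (i)})$ instead of the point estimates.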

Table 5. Relative changes (RC) (in %) in ML estimates and their corresponding SE’s for the indicated parameter and respective p-values for the athletes dataset when observations 53, 75, 162 and 178 are dropped.

5. Concluding Remarks

Extending quantile regression methods to include asymmetric response variables on the real line is a promising area of research. In this paper, we have introduced a novel flexible parametric quantile regression model for asymmetric response variables, which can be very useful for modeling response variables on the real line at different quantiles. The proposed quantile regression model was built on the PSN distribution using a new parameterization of this distribution that is indexed by quantile, precision and shape parameters, in which a function of any quantile of the response variable is given by a linear predictor that is defined by regression parameters and explanatory variables. We adopted a frequentist approach, estimating the model parameters by maximum likelihood. An application using a real dataset was presented and discussed. Results of the application showed that the model is adequate; it showed in detail which covariates influence the response at different quantile levels. Finally, there are many possible extensions of the current work, for instance, mixtures of RPSN regression models in order to accommodate multimodality, a semi-parametric component to include a functional covariate to model nonlinearity of the response, and measurement errors, among others. An in-depth investigation of these topics is beyond the scope of this work and will be considered elsewhere.

Author Contributions

Conceptualization, D.I.G., M.B. and C.E.G.; Formal analysis, D.I.G., M.B., C.E.G. and H.W.G.; Investigation, D.I.G., M.B., C.E.G. and H.W.G.; Methodology, D.I.G., M.B. and C.E.G.; Software, D.I.G. and M.B.; Supervision, C.E.G. and H.W.G.; Validation, D.I.G. and M.B. All authors have read and agreed to the published version of the manuscript.

Funding

The research of H.W. Gómez was supported by Grant PUENTE UA, Chile.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Details for Score and Hessian

For the score vector in Equation (4), the elements of the form $\frac{\partial ℓ_{i}}{\partial ξ_{i}}$, with $ξ \in \{μ, σ, λ\}$, are given by

\begin{matrix} \frac{\partial ℓ_{i}}{\partial μ_{i}} & = - \frac{1}{σ_{i}} \{λ_{i} m_{0} (λ_{i} z_{i}) - z_{i} + [α (λ_{i}, τ) - 1] m_{λ_{i}} (z_{i})\}, \\ \frac{\partial ℓ_{i}}{\partial σ_{i}} & = - \frac{z_{i}}{σ_{i}} \{λ_{i} m_{0} (λ_{i} z_{i}) - z_{i} + [α (λ_{i}, τ) - 1] m_{λ_{i}} (z_{i})\} - \frac{1}{σ_{i}}, \\ \frac{\partial ℓ_{i}}{\partial λ_{i}} & = log (τ) {[α (λ_{i}, τ) (λ_{i}^{2} + 1) (\frac{π}{2} - \arctan (λ_{i}))]}^{- 1} (1 + log Φ_{λ_{i}} (z_{i})) \\ + z_{i} m_{0} (λ_{i} z_{i}) - [α (λ_{i}, τ) - 1] \frac{m_{0} (λ_{i} z_{i})}{(1 + λ_{i}^{2})}, \end{matrix}

where $m_{λ} (z) = ϕ_{λ} (z) / Φ_{λ} (z)$.

For the Hessian in Equation (5), the elements of the form $\partial^{2} ℓ_{i} / \partial ξ_{i} \partial ξ_{i}^{'}$, with $ξ, ξ^{'} \in \{μ, σ, λ\}$, are given by

\begin{matrix} \frac{\partial^{2} ℓ_{i}}{\partial μ_{i}^{2}} & = \frac{1}{σ_{i}^{2}} \{λ_{i}^{2} m_{0}^{^{'}} (λ_{i} z_{i}) - 1 + [α (λ_{i}, τ) - 1] m_{λ_{i}}^{^{'}} (z_{i})\} \\ \frac{\partial^{2} ℓ_{i}}{\partial μ_{i} \partial σ_{i}} & = \frac{z_{i}}{σ_{i}} \{λ_{i}^{2} m_{0}^{^{'}} (λ_{i} z_{i}) - 1 + [α (λ_{i}, τ) - 1] m_{λ_{i}}^{^{'}} (z_{i})\} \\ \frac{\partial^{2} ℓ_{i}}{\partial μ_{i} \partial λ_{i}} & = - \frac{1}{σ_{i}} {\frac{log (τ) m_{λ} (z_{i})}{(1 + λ_{i}^{2}) (\frac{π}{2} - \arctan (λ_{i})) (1 + log Φ_{λ_{i}} (z_{i}))} + m (λ_{i} z_{i}) + λ_{i} z_{i} m_{0}^{^{'}} (λ_{i} z_{i}) \\ - \frac{[α (λ_{i}, τ) - 1] λ_{i} m_{0}^{^{'}} (λ_{i} z_{i})}{(1 + λ_{i}^{2})}} \\ \frac{\partial^{2} ℓ_{i}}{\partial σ_{i}^{2}} & = \frac{z_{i}}{σ_{i}^{2}} \{λ_{i}^{2} m_{0}^{^{'}} (λ_{i} z_{i}) - 1 + [α (λ_{i}, τ) - 1] m_{λ_{i}}^{^{'}} (z_{i})\} + \frac{1}{σ_{i}^{2}} \\ \frac{\partial^{2} ℓ_{i}}{\partial σ_{i} \partial λ_{i}} & = - \frac{z_{i}}{σ_{i}} {\frac{log (τ) m_{λ} (z_{i})}{(1 + λ_{i}^{2}) (\frac{π}{2} - \arctan (λ_{i})) (1 + log Φ_{λ_{i}} (z_{i}))} + m (λ_{i} z_{i}) + λ_{i} z_{i} m_{0}^{^{'}} (λ_{i} z_{i}) \\ - \frac{[α (λ_{i}, τ) - 1] λ_{i} m_{0}^{^{'}} (λ_{i} z_{i})}{(1 + λ_{i}^{2})}} \\ \frac{\partial^{2} ℓ_{i}}{\partial λ_{i}^{2}} & = \frac{{log}^{2} (τ) [(1 - π λ + 2 λ_{i} \arctan (λ_{i})) log (\frac{1}{2} - \frac{1}{π} \arctan (λ_{i})) + 1] (1 + log Φ_{λ_{i}} (z_{i}))}{{[α (λ_{i}, τ) (1 + λ_{i}^{2}) (\frac{π}{2} - \arctan (λ_{i}))]}^{2}} \\ - \frac{log (τ) m_{0} (λ_{i} z_{i}) m_{λ_{i}} (z_{i})}{α (λ_{i}, τ) {(1 + λ_{i}^{2})}^{2} (\frac{π}{2} - \arctan (λ_{i})) (1 + log Φ_{λ_{i}} (z_{i}))} + z_{i}^{2} m_{0}^{^{'}} (λ_{i} z_{i}) \\ - \frac{m (λ_{i} z_{i})}{log (τ) α^{2} (λ_{i}, τ) {(1 + λ_{i}^{2})}^{2} (\frac{π}{2} - \arctan (λ_{i}))} \\ - \frac{[α (λ_{i}, τ) - 1]}{(1 + λ_{i}^{2})} \{z_{i} m^{^{'}} (λ_{i} z_{i}) - 2 \frac{λ_{i} m (λ_{i} z_{i})}{(1 + λ_{i}^{2})}\}, \end{matrix}

where $m_{λ}^{'} (z) = λ m_{0} (λ z) m_{λ} (z) - z m_{λ} (z) - m_{λ}^{2} (z)$.

Appendix A.2. Local Influence

In this section, we present additional information for the local influence analysis of the athletes dataset discussed in Section 4.

Figure A1. Index plots for $C_{i} ({\hat{β}}_{1})$ (left), $C_{i} ({\hat{β}}_{2})$ (center) and $C_{i} ({\hat{β}}_{3})$ (right) under the weight perturbation (upper), response perturbation (center) and covariate perturbation (lower) schemes for the RPSN model for $τ = 0.1$.

Figure A2. Index plots for $C_{i} ({\hat{β}}_{1})$ (left), $C_{i} ({\hat{β}}_{2})$ (center) and $C_{i} ({\hat{β}}_{3})$ (right) under the weight perturbation (upper), response perturbation (center) and covariate perturbation (lower) schemes for the RPSN model for $τ = 0.25$.

Figure A3. Index plots for $C_{i} ({\hat{β}}_{1})$ (left), $C_{i} ({\hat{β}}_{2})$ (center) and $C_{i} ({\hat{β}}_{3})$ (right) under the weight perturbation (upper), response perturbation (center) and covariate perturbation (lower) schemes for the RPSN model for $τ = 0.75$.

Figure A4. Index plots for $C_{i} ({\hat{β}}_{1})$ (left), $C_{i} ({\hat{β}}_{2})$ (center) and $C_{i} ({\hat{β}}_{3})$ (right) under the weight perturbation (upper), response perturbation (center) and covariate perturbation (lower) schemes for the RPSN model for $τ = 0.9$.

Table A1. RCs (in %) in ML estimates and their corresponding SEs for the indicated parameter and respective p-values for the athletes dataset when observations 75 and 178 are dropped separately.

Dropped cases	Measure	Parameter	$τ = 0.10$	$τ = 0.25$	$τ = 0.50$	$τ = 0.75$	$τ = 0.90$
75	RC	$β_{11} (τ)$	5.31	7.22	10.82	16.2	22.57
75	RC $_{S E}$	$β_{11} (τ)$	0.23	0.20	0.17	0.11	0.04
75	p-value	$β_{11} (τ)$	<0.0001	<0.0001	<0.0001	<0.0001	<0.0001
75	RC	$β_{12} (τ)$	1.82	5.03	10.02	16.09	22.08
75	RC $_{S E}$	$β_{12} (τ)$	0.15	0.05	0.08	0.07	0.17
75	p-value	$β_{12} (τ)$	<0.0001	<0.0001	<0.0001	<0.0001	<0.0001
75	RC	$β_{21} (τ)$	6.77	9.84	14.27	19.20	23.84
75	RC $_{S E}$	$β_{21} (τ)$	0.65	0.93	1.05	0.71	0.33
75	p-value	$β_{21} (τ)$	0.0118	0.0105	0.0095	0.0086	0.0078
178	RC	$β_{11} (τ)$	0.72	2.62	6.30	11.88	18.50
178	RC $_{S E}$	$β_{11} (τ)$	0.17	0.15	0.12	0.07	0.00
178	p-value	$β_{11} (τ)$	<0.0001	<0.0001	<0.0001	<0.0001	<0.0001
178	RC	$β_{12} (τ)$	0.12	3.36	8.60	14.88	21.06
178	RC $_{S E}$	$β_{12} (τ)$	0.13	0.06	0.09	0.07	0.18
178	p-value	$β_{12} (τ)$	<0.0001	<0.0001	<0.0001	<0.0001	<0.0001
178	RC	$β_{21} (τ)$	22.91	25.43	29.09	33.17	37.01
178	RC $_{S E}$	$β_{21} (τ)$	0.75	0.47	0.31	0.61	1.61
178	p-value	$β_{21} (τ)$	0.0449	0.0418	0.0393	0.0371	0.0352

Table A2. RCs (in %) in ML estimates and their corresponding SEs for the indicated parameter and respective p-values for the athletes dataset when observations {75, 178} and {75, 162, 178} are dropped separately.

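Dropped cases	Measure	Parameter	$τ = 0.10$	$τ = 0.25$	$τ = 0.50$	$τ = 0.75$	$τ = 0.90$
75 and 178	RC	$β_{11} (τ)$	6.30	8.16	11.69	17.01	23.29
75 and 178	RC $_{S E}$	$β_{11} (τ)$	0.41	0.39	0.34	0.28	0.19
75 and 178	p-value	$β_{11} (τ)$	<0.0001	<0.0001	<0.0001	<0.0001	<0.0001
75 and 178	RC	$β_{12} (τ)$	1.75	5.32	10.58	16.84	22.97
75 and 178	RC $_{S E}$	$β_{12} (τ)$	0.29	0.08	0.03	0.18	0.27
75 and 178	p-value	$β_{12} (τ)$	<0.0001	<0.0001	<0.0001	<0.0001	<0.0001
75 and 178	RC	$β_{21} (τ)$	31	33.34	36.67	40.38	43.87
75 and 178	RC $_{S E}$	$β_{21} (τ)$	0.04	0.27	0.42	0.11	0.91
75 and 178	p-value	$β_{21} (τ)$	0.0674	0.0633	0.0600	0.0572	0.0546
75, 162 and 178	RC	$β_{11} (τ)$	5.43	7.27	10.80	16.13	22.45
75, 162 and 178	RC $_{S E}$	$β_{11} (τ)$	0.57	0.54	0.50	0.43	0.34
75, 162 and 178	p-value	$β_{11} (τ)$	<0.0001	<0.0001	<0.0001	<0.0001	<0.0001
75, 162 and 178	RC	$β_{12} (τ)$	1.36	5.12	10.53	16.91	23.14
75, 162 and 178	RC $_{S E}$	$β_{12} (τ)$	0.39	0.18	0.12	0.26	0.35
75, 162 and 178	p-value	$β_{12} (τ)$	<0.0001	<0.0001	<0.0001	<0.0001	<0.0001
75, 162 and 178	RC	$β_{21} (τ)$	43.37	45.46	48.35	51.53	54.51
75, 162 and 178	RC $_{S E}$	$β_{21} (τ)$	0.29	0.61	0.77	0.46	0.56
75, 162 and 178	p-value	$β_{21} (τ)$	0.1300	0.1251	0.1212	0.1178	0.1149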

References

Azzalini, A. A class of distributions which includes the normal ones. Scand. J. Stat. 1985, 12, 171–178.
Azzalini, A. Further results on a class of distributions which includes the normal ones. Statistica 1986, 46, 199–208.
Arellano-Valle, R.B.; Gómez, H.W.; Quintana, F.A. A New Class of Skew-Normal Distributions. Commun. Stat. Theory Methods 2004, 33, 1465–1480.
Pewsey, A.; Gómez, H.W.; Bolfarine, H. Likelihood-based inference for power distributions. Test 2012, 21, 775–789.
Lehmann, E.L. The power of rank tests. Ann. Math. Statist. 1953, 24, 23–43.
Durrans, S.R. Distributions of fractional order statistics in hydrology. Water Resour. Res. 1992, 28, 1649–1655.
Gupta, D.; Gupta, R.C. Analyzing skewed data by power normal model. Test 2008, 17, 197–210.
Castillo, N.O.; Gallardo, D.I.; Bolfarine, H.; Gómez, H.W. Truncated power-normal distribution with application to non-negative measurements. Entropy 2018, 20, 433.
Martínez-Flórez, G.; Arnold, B.C.; Bolfarine, H.; Gómez, H.W. The alpha-power tobit model. Commun. Stat. Theory Methods 2013, 42, 633–643.
Martínez-Flórez, G.; Bolfarine, H.; Gómez, H.W. Doubly censored power-normal regression models with inflation. Test 2015, 24, 265–286.
Martínez-Flórez, G.; Bolfarine, H.; Gómez, H.W. Skew-normal alpha-power model. Statistics 2014, 48, 1414–1428.
Martínez-Flórez, G.; Bolfarine, H.; Gómez, H.W. The log alpha-power asymmetric distribution with application to air pollution. Environmetrics 2014, 25, 44–56.
Martínez-Flórez, G.; Bolfarine, H.; Gómez, H.W. Asymmetric regression models with limited responses with an application to antibody response to vaccine. Biom. J. 2013, 55, 156–172.
Cook, R.D. Detection of influential observation in linear regression. Technometrics 1977, 19, 15–18.
Cook, R.D.; Weisberg, S. An Introduction to Regression Graphics; Wiley: New York, NY, USA, 1994.
Galarza, C.E.; Lachos, V.H.; Barbosa, C.; Castro, L.M. Robust quantile regression using a generalized class of skewed distributions. Stat 2017, 6, 113–130.
Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
Kolmogorov, A.N. Sulla determinazione empirica di una legge di distribuzione. Giorn. Ist. Ital. Attuar. 1933, 4, 83–91.
Dunn, P.; Smyth, G. Randomized quantile residuals. J. Comput. Graph. Stat. 1996, 5, 236–244.

Figure 1. Pdf for the RPSN $(μ = 0, σ = 1, λ)$ for different values of $λ$: $τ = 0.1$ (left panel); $τ = 0.5$ (center panel); $τ = 0.9$ (right panel). Values for $λ$ are: −5 (black line), −1.5 (red line), −0.5 (blue line), 0 (green line), 0.5 (orange line), 1.5 (magenta line) and 5 (purple line).

Figure 2. Athletes dataset: Point estimates (center line) and 95% confidence intervals (CIs) for model parameters under RPSN-QR model.

Figure 3. Data analysis: Fitted RPSN-QR model lines for the response (left panel for males, center panel for females) and scale parameter (right panel) over the grid $τ = \{0.10, 0.25, 0.50, 0.75, 0.90\}$.

Figure 4. Index plots for $C_{i} ({\hat{β}}_{1})$ (left), $C_{i} ({\hat{β}}_{2})$ (center) and $C_{i} ({\hat{β}}_{3})$ (right) under the weight perturbation (upper), response perturbation (center) and covariate perturbation (lower) schemes for the RPSN model for $τ = 0.5$.

Table 1. Range for skewness and kurtosis coefficients for SN, PN and PSN models.

Coefficient	SN	PN	PSN
Skewness	(−0.9953, 0.9953)	[−0.6115, 0.9007]	[−1.6476, 0.9953)
Kurtosis	[3, 3.8692)	[1.7170, 4.3556]	[1.4672, 5.4386]

Table 2. AIC values for different models parameterized in terms of the quantile.

$τ$	SKN	SKT	GSC	RPSN ($σ$ constant)	RPSN (modeling $σ$)
0.10	1097.74	817.77	803.08	808.64	801.37
0.25	1084.46	803.90	801.96	811.08	803.11
0.50	1095.56	810.99	854.38	815.79	806.76
0.75	1151.40	854.57	861.37	824.56	814.01
0.90	1220.96	914.43	865.16	838.78	825.95
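
As a reading aid for Table 2, the following minimal Python sketch picks the model with the smallest AIC at each quantile level. The values are transcribed from the table; the dictionary layout, the ASCII model labels and the helper name `best_model` are illustrative assumptions and are not part of the original analysis.

```python
# AIC values transcribed from Table 2 (keys: tau; list order follows `models`).
models = ["SKN", "SKT", "GSC", "RPSN (sigma constant)", "RPSN (modeling sigma)"]
aic = {
    0.10: [1097.74, 817.77, 803.08, 808.64, 801.37],
    0.25: [1084.46, 803.90, 801.96, 811.08, 803.11],
    0.50: [1095.56, 810.99, 854.38, 815.79, 806.76],
    0.75: [1151.40, 854.57, 861.37, 824.56, 814.01],
    0.90: [1220.96, 914.43, 865.16, 838.78, 825.95],
}


def best_model(values, names):
    """Return the (name, AIC) pair with the smallest AIC (hypothetical helper)."""
    idx = min(range(len(values)), key=values.__getitem__)
    return names[idx], values[idx]


# Smaller AIC indicates a better trade-off between fit and complexity.
best = {tau: best_model(vals, models) for tau, vals in aic.items()}
for tau, (name, value) in best.items():
    print(f"tau = {tau:.2f}: {name} (AIC = {value:.2f})")
```

With these values, the RPSN model with modeled $σ$ attains the lowest AIC at four of the five quantile levels; at $τ = 0.25$ the GSC model is lowest (801.96 versus 803.11).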

Table 3. Estimates and SE for parameters in athletes dataset in RPSN-quantile regression (QR) model for different values of $τ$.

	$τ = 0.10$			$τ = 0.50$			$τ = 0.90$
Parameter	Est.	SE	p-Value	Est.	SE	p-Value	Est.	SE	p-Value
$β_{10} (τ)$	6.4642	1.1552	-	6.7727	1.0867	-	6.1798	1.0859	-
$β_{11} (τ)$	2.3077	0.3742	<0.0001	2.5008	0.3728	<0.0001	2.9324	0.3695	<0.0001
$β_{12} (τ)$	0.2037	0.0157	<0.0001	0.2299	0.0147	<0.0001	0.2728	0.0151	<0.0001
$β_{20} (τ)$	0.7633	0.7952	-	0.2252	0.3996	-	−0.6261	0.2700	-
$β_{21} (τ)$	0.0096	0.0036	0.0040	0.0108	0.0037	0.0017	0.0125	0.0035	0.0002
$β_{30} (τ)$	−1.1940	1.4984	-	−0.8381	0.5984	-	−0.4916	0.2588	-

Table 4. p-values of the K-S normality test for the residuals under the RPSN-QR model for the athletes dataset at different quantile levels $τ$.

$τ$	0.10	0.15	0.20	0.25	0.30	0.35	0.40	0.45	0.50
p-value	0.995	0.996	0.991	0.976	0.951	0.914	0.864	0.853	0.837
$τ$	0.55	0.60	0.65	0.70	0.75	0.80	0.85	0.90
p-value	0.765	0.810	0.777	0.683	0.604	0.524	0.394	0.191

Table 5. Relative changes (RCs) (in %) in ML estimates and their corresponding SEs for the indicated parameter and respective p-values for the athletes dataset when observations 53, 75, 162 and 178 are dropped.

Dropped cases	Parameter	Measure	$τ = 0.10$	$τ = 0.25$	$τ = 0.50$	$τ = 0.75$	$τ = 0.90$
53, 75, 162 and 178	$β_{11} (τ)$	RC	7.20	9.06	12.56	17.82	24.05
		RCSE	0.74	0.74	0.69	0.63	0.52
		p-value	<0.0001	<0.0001	<0.0001	<0.0001	<0.0001
	$β_{12} (τ)$	RC	1.98	5.63	10.93	17.22	23.38
		RCSE	0.47	0.26	0.20	0.34	0.42
		p-value	<0.0001	<0.0001	<0.0001	<0.0001	<0.0001
	$β_{21} (τ)$	RC	34.55	36.83	40.05	43.61	46.96
		RCSE	0.23	0.53	0.69	0.36	0.66
		p-value	0.0806	0.0762	0.0727	0.0697	0.0670
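
To make the RC columns concrete, here is a minimal Python sketch of a relative-change computation. It assumes the usual definition, the absolute difference between the full-data estimate and the estimate after dropping the cases, divided by the full-data estimate and expressed in percent. The function name and the numbers below are hypothetical and are not taken from the athletes dataset.

```python
def relative_change(full_estimate, reduced_estimate):
    """Relative change in %, assuming RC = |(full - reduced) / full| * 100."""
    if full_estimate == 0:
        raise ValueError("relative change is undefined for a zero full-data estimate")
    return abs((full_estimate - reduced_estimate) / full_estimate) * 100.0


# Hypothetical estimates for illustration only (not values from the paper).
full = 2.0      # estimate from the complete sample
reduced = 1.5   # estimate after dropping the flagged observations
rc = relative_change(full, reduced)
print(f"RC = {rc:.2f}%")
```

Under the same assumption, the function can be applied to the standard errors in the same way to obtain the RCSE rows.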

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
