A New Class of Estimators Based on a General Relative Loss Function

Abstract: Motivated by the relative loss estimator of the median, we propose a new class of estimators for linear quantile models using a general relative loss function defined by the Box–Cox transformation function. The proposed method is very flexible. It includes a traditional quantile regression and median regression under the relative loss as special cases. Compared to the traditional linear quantile estimator, the proposed estimator has smaller variance and hence is more efficient in making statistical inferences. We show that, in theory, the proposed estimator is consistent and asymptotically normal under appropriate conditions. Extensive simulation studies were conducted, demonstrating good performance of the proposed method. An application of the proposed method in a prostate cancer study is provided.


Introduction
In contrast to mean-based regression, which mainly gives an overall quantification of the central covariate effect, quantile regression can directly model a series of quantiles (from lower to higher) of the response variable to deliver a global evaluation of the covariate effect [1,2]. A major advantage of quantile regression is that no assumptions about the distribution of the response are required, which makes it practical, robust and amenable to skewed response distributions [3]. Additionally, quantile regression methods can help to handle cases of heteroscedasticity [4]. Nowadays, quantile regressions have been widely used in many fields, with numerous methodologies established covering linear, nonlinear and longitudinal quantile regressions [5,6], as well as applications in survival analysis [7]. Application studies show that quantile regression allows adjustment for potential confounders, calculation of interaction terms and variable selection, while being more robust to statistical outliers and yielding much more information about the underlying associations [1,3]. However, the computation of quantile regressions is relatively complex and somewhat unique, especially compared with ordinary least squares for mean-based linear (or nonlinear) regressions. Take the 0.5 quantile estimation as an example: the quantile regression minimizes the sum of weighted absolute residuals instead of squared residuals [8]. One drawback of quantile regressions is that the estimation efficiency fluctuates a lot across quantiles and is relatively low at the tails [9]. A traditional quantile regression is typically based on minimizing a check loss function [1], but often the relative quantile loss can be more relevant than the check loss function and hence might be used to gain more efficiency for inference. So far, to the best of our knowledge, there have been no systematic studies on other types of loss functions, such as the relative error loss, for quantile regression.
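As a small aside illustrating the check-loss computation for the 0.5 quantile mentioned above (all variable names below are our own): minimizing the sum of absolute residuals over a constant recovers the sample median, which is the simplest instance of check-loss regression.

```python
import numpy as np

# For tau = 0.5, the check loss rho_0.5(u) = |u|/2, so the m minimizing
# sum_i rho_0.5(y_i - m) is the sample median. Illustrative sketch only.
rng = np.random.default_rng(0)
y = rng.exponential(size=1001)   # skewed data; odd n, so the median is a data point

grid = np.sort(y)                # the minimizer can be taken at a data point
loss = [np.abs(y - m).sum() for m in grid]
m_hat = grid[int(np.argmin(loss))]
print(m_hat, np.median(y))       # the two coincide
```

The grid search over data points is deliberately naive; it only serves to show that the check loss at τ = 0.5 targets the median rather than the mean.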
In many practical applications, the magnitude of the relative error, rather than the absolute error, is of major concern. In general, the relative error is more relevant when the range of predicted values is large and that of the predictors is small. Narula and Wellington (1977) proposed an estimation approach for linear models by minimizing the sum of relative errors [10], but provided no theoretical results. Khoshgoftaar et al. (1992) studied the asymptotic properties of the estimators obtained by minimizing both the squared relative loss and the absolute relative loss under nonlinear regression models [11], and provided a comparative study of the two. Later, Chen et al. (2010) applied the least absolute relative error loss [12] to the linear regression model and proposed to estimate the model parameters by minimizing the sum of the absolute relative errors, where Y_i = log(Y_i*) and they assumed log(ε_i*) has a mean of 1. Motivated by [12], Yang and Ye (2013) established the connection between relative error estimators and M-estimation under a linear model [13]. However, these works only consider the absolute relative loss, not related to any quantile estimates of the model distribution. None of these studies discussed how to apply the relative error to a general quantile regression.
In this article, we propose a general class of relative loss functions via a Box–Cox transformation [14,15]. The proposed loss function includes the absolute loss and the relative loss as special cases. We also show that the proposed loss function is convex, scale-free and elicitable. We apply the proposed loss function to a linear quantile regression [1] and prove that the estimates of the regression coefficients are consistent and asymptotically normal. Through numerical studies on two concrete examples whose theoretical solutions can be derived explicitly, we show that the proposed method is feasible and verify that the numerical results exhibit the expected theoretical properties. We also apply the proposed method to a prostate cancer study, showing that our method provides more accurate statistical inferences on quantile estimates compared to the regular quantile regression, especially in the region of tail quantiles.
The rest of this article is organized as follows. In Section 2, we introduce the model and propose the estimation procedure. In Section 3, we establish the consistency and asymptotic normality for the parameter estimates under certain regularity conditions. In Section 4, we examine the finite-sample properties using two simple simulations. An application example is given in Section 5 with a prostate cancer study. We conclude with a brief summary and remarks in Section 6.

Model and Methods
Let Y_i be the response of interest and let X_i be a (p + 1)-dimensional covariate with the first element being 1. Consider the linear quantile regression model

Y_i = X_i^T β_0(τ) + ϵ_i(τ), i = 1, . . . , n, (1)

where β_0(τ) is a (p + 1)-dimensional coefficient vector for some τ ∈ (0, 1), and ϵ_i(τ) is a random error whose τth quantile, conditional on X_i, equals 0. In model (1), Y_i could also be replaced by any other reasonable monotone transformation, but considering that a linear relationship in the transformed model may not be linear in the original scale, one may need to transform the result back to the original measurement scale for interpretation.
Through an exponential transformation, model (1) can be rewritten as

T_i = exp{X_i^T β_0(τ)} ε_i(τ), (2)

where T_i = exp(Y_i) > 0 and ε_i(τ) = exp{ϵ_i(τ)} = exp{Y_i − X_i^T β_0(τ)}. Denote by H(·) a monotone function on the real line; then, according to the equivariance property of quantiles, Q_τ{H(ε_i)|X_i} = H{Q_τ(ε_i|X_i)} holds for any random error ε_i, where Q_τ(ε|X) is the τth conditional quantile of ε given X. This characteristic enables us to consider quantile regressions under a proper transformation of the random error so as to enhance the estimation efficiency.
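The equivariance property Q_τ{H(ε)} = H{Q_τ(ε)} can be checked numerically. The snippet below is a small illustrative sketch; the log-normal error, the Box–Cox exponent 0.5 and the sample size are arbitrary choices of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 0.75

# A positive random error playing the role of eps_i (log-normal, our choice).
eps = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)

# Any monotone increasing H commutes with quantiles: Q_tau{H(eps)} = H{Q_tau(eps)}.
H = lambda u: (u**0.5 - 1.0) / 0.5   # Box-Cox transformation with gamma = 0.5

q_then_H = H(np.quantile(eps, tau))  # quantile first, then transform
H_then_q = np.quantile(H(eps), tau)  # transform first, then quantile
print(q_then_H, H_then_q)            # the two agree up to sampling noise
```

The tiny discrepancy between the two printed values comes only from the interpolation used by the empirical quantile, not from the property itself.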
To this end, we propose to conduct the model regression using a general class of relative loss functions, which leads to an objective function of the form

W_n(β; γ, τ) = Σ_{i=1}^n ρ_τ [ H_γ{ε̂_i(β)} − H_γ{1/ε̂_i(β)} ], (3)

where ε̂_i(β) = exp{Y_i − X_i^T β}, ρ_τ(u) = u{τ − I(u < 0)} is the traditional quantile loss function, and H_γ(·) is the Box–Cox transformation with H_γ(u) = (u^γ − 1)/γ for γ ≠ 0 and H_γ(u) = log(u) for γ = 0 (see Figure 1). The objective function (3) can be further simplified as

W_n(β; γ, τ) = Σ_{i=1}^n ρ_τ [ V_γ{ε̂_i(β)} ], (4)

where V_γ(u) = (u^γ − u^{−γ})/γ if γ ≠ 0 and V_γ(u) = 2 log(u) otherwise is a general relative loss function. In particular, if γ = 0, (4) reduces (up to a factor of 2) to the objective function of the traditional quantile regression, and if γ = 1 and τ = 0.5, (4) reduces to (1/2) Σ_{i=1}^n { |T_i − exp(X_i^T β)|/T_i + |T_i − exp(X_i^T β)|/exp(X_i^T β) }, which is, up to a constant factor, exactly the objective function of the least absolute relative error in [12]. The proposed framework is very flexible: it allows us to adapt the quantile regression to either the absolute or the relative loss, or anywhere in between, by tuning the parameter γ. Furthermore, the function V_γ(u) in (4) guarantees that the proposed criterion function is convex (see Lemma A1), scale-free, and elicitable (see Definition 2 in [16]). Therefore, given γ, the minimizer of W_n(β; γ, τ) with respect to β, denoted as β̂_n(γ; τ), can be obtained conveniently using classical algorithms, such as the Nelder–Mead simplex method recommended by Yin and Cai [17].
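To make the estimation procedure concrete, here is a minimal sketch of the general relative loss objective and its minimization by the Nelder–Mead simplex method, as suggested in the text. The toy data, the helper names (`check_loss`, `V`, `W_n`) and the choice γ = 1, τ = 0.5 are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(u, tau):
    # Traditional quantile (check) loss: rho_tau(u) = u * (tau - I(u < 0)).
    return u * (tau - (u < 0))

def V(u, gamma):
    # General relative loss: (u^gamma - u^-gamma)/gamma, with limit 2*log(u) at gamma = 0.
    if gamma == 0.0:
        return 2.0 * np.log(u)
    return (u**gamma - u**(-gamma)) / gamma

def W_n(beta, X, Y, gamma, tau):
    # Objective of form (4): residuals on the exponential scale, eps_i = exp(Y_i - X_i' beta).
    eps = np.exp(Y - X @ beta)
    return np.sum(check_loss(V(eps, gamma), tau))

# Toy data: Y = 1 + 2*x + noise whose tau-quantile is 0 at tau = 0.5.
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=n)

tau, gamma = 0.5, 1.0
fit = minimize(W_n, x0=np.zeros(2), args=(X, Y, gamma, tau), method="Nelder-Mead")
print(fit.x)  # close to the true coefficients [1, 2]
```

Setting `gamma = 0.0` in the same code recovers (twice) the traditional check-loss objective, which is the γ = 0 special case discussed above.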
Condition (C1) is a standard regularity condition, and condition (C2) ensures the consistency and asymptotic normality; it can be viewed as a generalized version of the zero-mean and zero-median assumptions underlying least squares estimation and least absolute deviation regression, respectively. Condition (C3) guarantees the identifiability of the adaptive parameter and the regression parameters. Condition (C4) ensures the asymptotic normality of the estimates of the regression coefficients, analogous to the finite second moment condition for the least squares estimator in linear regression.

Proof of Theorem 2.
To prove the asymptotic normality, we first approximate E{W_n(β; γ, τ) − W_n(β_0(τ); γ, τ)} for every β in a neighborhood of β_0(τ). By the proof of Theorem 1, we know Thus, in probability as n → ∞. Let θ = √n{β − β_0(τ)}; then the above equation is equivalent to Similar to the proof of Lemma B.4 in [20], using the arguments of VC-subgraph classes we can show in probability as n → ∞. Then, according to (10) and a Taylor expansion of the first term, (11) holds.

Adaptive Criteria for Choosing γ
The loss function defined in (4) provides a flexible framework that allows us to conduct quantile regression adaptively to the practical scenario. However, how to choose the tuning parameter γ of the proposed loss function adaptively to real data is a challenging but important problem. Reasonable criteria for choosing the tuning parameter need to be explored, so that the optimal γ can be selected using data-driven techniques.
According to Lemma A1, we conclude that γ̂(τ) = arg min_{γ ∈ [0, Γ]} W_n(β̂, γ; τ) ≡ 0. That is, the proposed quantile regression approach is equivalent to the traditional quantile regression under criterion (I). Thus, the optimal γ̂(τ) selected by criterion (I) provides the best model fit at each quantile in the sense of regular quantile regression. Criterion (II) instead selects the γ(τ) that reshapes the distribution of the residuals to achieve the best estimation efficiency; the optimal γ(τ) selected by criterion (II) is denoted as γ*(τ). According to the asymptotic covariance matrix derived in Theorem 2, the value of γ* depends on τ as well as on the distribution of ε. For a fixed τ and a specified distribution of ε, minimizing the variance of β̂_n(γ; τ) is equivalent to maximizing {J_γ(τ) + f_ε(1)}²/B(γ; τ), and thus γ*(τ) = arg max_{γ ∈ [0, Γ]} {J_γ(τ) + f_ε(1)}²/B(γ; τ). In the sequel, we mainly discuss the performance of criterion (II).
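A data-driven version of criterion (II) can be sketched as a grid search over γ that minimizes a bootstrap estimate of the variance of β̂_n(γ; τ). Everything below (the function names, the grid, the bootstrap size `B`) is an illustrative assumption of ours, not the paper's algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def V(u, gamma):
    # General relative loss with the gamma -> 0 limit 2*log(u).
    return 2.0 * np.log(u) if gamma == 0.0 else (u**gamma - u**(-gamma)) / gamma

def W_n(beta, X, Y, gamma, tau):
    # Check loss applied to the relative-loss-transformed residuals.
    v = V(np.exp(Y - X @ beta), gamma)
    return np.sum(v * (tau - (v < 0)))

def fit(X, Y, gamma, tau):
    # Minimize the objective for a fixed gamma (Nelder-Mead, as in the text).
    return minimize(W_n, np.zeros(X.shape[1]), args=(X, Y, gamma, tau),
                    method="Nelder-Mead").x

def select_gamma(X, Y, tau, grid=(0.0, 1.0, 2.0), B=20, seed=0):
    # Criterion (II), sketched: pick the grid value whose bootstrap
    # coefficient variance (summed over coefficients) is smallest.
    rng = np.random.default_rng(seed)
    n = len(Y)
    best = None
    for gamma in grid:
        boots = np.array([fit(X[idx], Y[idx], gamma, tau)
                          for idx in (rng.integers(0, n, n) for _ in range(B))])
        total_var = boots.var(axis=0).sum()
        if best is None or total_var < best[1]:
            best = (gamma, total_var)
    return best[0]

rng = np.random.default_rng(2)
n = 150
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Y = X @ np.array([0.0, 1.0]) + rng.normal(size=n)
gamma_star = select_gamma(X, Y, tau=0.5)
print(gamma_star)  # one value from the grid
```

A coarser grid and a small `B` keep the sketch fast; in practice one would refine both.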

Remark 2.
One can also define other criteria, such as selecting γ*(τ) by minimizing the sum of the standardized W_n(β, γ; τ) and the standardized variance of β̂_n(γ; τ); if we denote this criterion as (III), then the selected γ̂(τ) will lie between 0 and γ*(τ). In practice, it is hard to assess whether one criterion is better than another; the criterion that best fits practical needs is the best choice.
To illustrate the feasibility of criterion (II) and show the existence of an optimal γ(τ), we next give two concrete examples with specified quantiles and distributions of ε. Example 1. Assume that ε_i follows the standard log-normal distribution and is independent of X_i, i.e., ϵ_i ∼ N(0, 1); then γ*(τ) = 1.12, 1.13, 1.14 for τ = 0.25, 0.5, 0.75, respectively.

Simulation Study
For the numerical implementation of the proposed method, we first obtained β̂_n(γ; τ) by minimizing (4) for each fixed γ ∈ [0, Γ] using the Nelder–Mead simplex algorithm, and then tuned and selected γ(τ) based on a criterion. We finally obtained the adapted value γ̂(τ) and the corresponding coefficient β̂_n(γ̂; τ). To verify the theoretical properties of the proposed method, we conducted simulation studies under two simple scenarios with finite samples. Scenario 1. We considered a simple univariate case, Y_i = α + βX_i + ϵ_i, where α = 0, β = 1, X_i follows the standard normal distribution, and ϵ_i = ϵ_i* − Φ^{−1}(τ), with ϵ_i* following the standard normal distribution and Φ^{−1}(τ) denoting the τ-quantile of the standard normal distribution. We set the sample sizes to n = 200, 400 and the number of replications to 500.
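The Scenario 1 data-generating process can be sketched as follows; the seed, the sample size and the use of `scipy.stats.norm.ppf` for Φ^{−1} are our own choices.

```python
import numpy as np
from scipy.stats import norm

# Scenario 1: Y_i = alpha + beta*X_i + eps_i with alpha = 0, beta = 1,
# and eps_i = eps*_i - Phi^{-1}(tau), so the tau-quantile of eps_i is 0.
rng = np.random.default_rng(3)
tau, n = 0.25, 400
x = rng.normal(size=n)
eps_star = rng.normal(size=n)
eps = eps_star - norm.ppf(tau)   # shift so that Q_tau(eps) = 0
Y = 0.0 + 1.0 * x + eps

# Sanity check: the empirical tau-quantile of eps should be near 0.
print(np.quantile(eps, tau))
```

The shift by Φ^{−1}(τ) is what makes the τth conditional quantile of the error exactly zero, so that β_0(τ) = (−Φ^{−1}(τ), 1) plays the role of the true coefficient in model (1).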
For Scenario 1, β_0 is one-dimensional, so we can plot the three-dimensional surface of W_n(β, γ; τ = 0.5); see Figure 3, which shows that W_n(β, γ; τ) is locally convex with respect to both γ and β. As illustrated in Example 1, if ϵ_i follows the standard normal distribution, the theoretical value of γ*(τ) is around 1.08 by criterion (II). Table 1 summarizes the simulation results at different quantiles with various adaptive criteria. The results show that the estimated regression coefficients have small biases, and the biases demonstrate a clear trend of asymptotic consistency in all settings. According to the results in Table 1 and the box-plots in Figures 4 and 5, we also see that the adaptive parameter γ̂(0.5) selected by criteria (I), (II) and (III) converges to 0.0, 1.13 and 0.8, respectively. The values of γ̂(τ) by criterion (I) are all equal to 0, indicating that the estimates by criterion (I) are equivalent to those by the traditional quantile regression. Compared to the estimate by the traditional quantile regression, the proposed estimate under the adaptive criterion (II) enhances the estimation efficiency of the coefficients by 5∼20%. Additionally, from the histogram in Figure 6 and the quantile–quantile (Q–Q) plot [21,22] in Figure 7, we see that the empirical distribution of the estimators closely follows a normal distribution.
The simulation results are presented in Table 2. Under criterion (I), the estimated values of γ*(τ) are again all equal to 0, and the corresponding estimated coefficients at different quantiles have small biases and reasonable coverage probabilities. Under the adaptive criterion (II), the estimated values of γ*(τ) tend to converge to Γ = 2. In addition, the regression coefficients are all estimated accurately. Overall, the simulation results in Table 2 match well with the theoretical properties derived in Example 2 in Section 4. Specifically, the estimation efficiency of the coefficients using the proposed method with criterion (II) increases by 60∼100% over the traditional quantile regression.
Est., the empirical mean of the estimated coefficients; SD, the sample standard deviation of the estimated coefficients; SE, the empirical mean of the estimated standard errors based on the bootstrap method; CP, the coverage probability of the 95% confidence intervals.

Application
We applied the proposed method to a prostate cancer study [23]; the prostate cancer data contain the medical records of 97 male patients who received a radical prostatectomy. A description of the data is summarized in Table 3. The response variable of interest is the level of prostate-specific antigen (PSA), and there are eight predictor variables: the log of cancer volume, the log of prostate weight, age, the log of the amount of benign prostatic hyperplasia, seminal vesicle invasion, the log of capsular penetration, the Gleason score, and the percentage of Gleason scores of 4 or 5. The research goal is to study the effects of the predictor variables on the level of PSA.
We considered a linear quantile model for the association between PSA and the predictor variables. For convenience, we first standardized the predictors as well as the dependent variable, and then included the intercept in the model. We took 200 bootstrap samples for the variance estimation of the coefficients, and selected the adaptive parameter using criterion (II). We obtained the estimate γ̂(τ) = 2.0 at all grid points τ ∈ {0.1, 0.2, . . . , 0.8, 0.9}. We then acquired the estimated regression coefficients with 95% confidence intervals (CIs) at the 0.25, 0.50 and 0.75 quantiles, which are summarized in Table 4, where the results are also compared with those of the traditional quantile regression. As shown in the table, there is not much difference between the coefficients estimated by the traditional quantile regression and by the proposed method with criterion (II). However, referring to the values of the 'CI ratio' (defined as the ratio of the length of the 95% CI by the traditional quantile regression over that by the proposed method using criterion (II)), we can see that the lengths of the 95% CIs by the proposed method are overall significantly smaller than those by the traditional quantile regression. Figure 8 presents the estimated coefficient curves with 95% confidence bands. The figure shows that the proposed method with criterion (II) provides smoother estimates and more stable and narrower confidence bands than the regular quantile regression, especially in the region of tail quantiles. From the results in both Table 4 and Figure 8, we conclude that the cancer volume, the prostate weight and the seminal vesicle invasion are significantly associated with the level of PSA, which is as expected since a high PSA level is generally regarded as strong evidence of prostate cancer.
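The bootstrap machinery behind the 'CI ratio' can be sketched as follows, on synthetic data rather than the study data. The functions `fit` and `ci_length`, the comparison of γ = 0 (traditional quantile regression) against a fixed γ = 1, and the percentile bootstrap CIs are our own illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def V(u, gamma):
    # General relative loss with the gamma -> 0 limit 2*log(u).
    return 2.0 * np.log(u) if gamma == 0.0 else (u**gamma - u**(-gamma)) / gamma

def W_n(beta, X, Y, gamma, tau):
    v = V(np.exp(Y - X @ beta), gamma)
    return np.sum(v * (tau - (v < 0)))

def fit(X, Y, gamma, tau):
    return minimize(W_n, np.zeros(X.shape[1]), args=(X, Y, gamma, tau),
                    method="Nelder-Mead").x

def ci_length(X, Y, gamma, tau, B=50, seed=0):
    # Percentile bootstrap 95% CI length for the slope coefficient.
    rng = np.random.default_rng(seed)
    n = len(Y)
    slopes = [fit(X[i], Y[i], gamma, tau)[1]
              for i in (rng.integers(0, n, n) for _ in range(B))]
    lo, hi = np.percentile(slopes, [2.5, 97.5])
    return hi - lo

rng = np.random.default_rng(4)
n = 150
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Y = x + rng.normal(scale=0.5, size=n)

tau = 0.5
ratio = ci_length(X, Y, 0.0, tau) / ci_length(X, Y, 1.0, tau)
print(ratio)  # analogue of the 'CI ratio': > 1 means gamma = 1 yields shorter CIs
```

On real data one would replace the fixed γ = 1 with the γ̂(τ) selected by criterion (II), and use a larger bootstrap size than `B = 50`.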
By contrast, the effects of the amount of benign prostatic hyperplasia, capsular penetration, the Gleason score and the percentage of Gleason scores of 4 or 5 on the level of PSA are statistically insignificant. In Figure 8, the solid lines present the estimated coefficients by the proposed method with criterion (II), and the dash-dot lines those by the traditional quantile regression. The shaded areas give the 95% confidence bands of the estimated curves by the proposed method with criterion (II); the 95% confidence bands of the curves by the regular quantile regression are shown between the lower and upper dash-dot lines.

Discussion
We propose a general class of loss functions for conducting quantile regression, which naturally unifies the absolute and the relative error criteria. The consistency and asymptotic normality of the resulting estimators are established. Numerical studies demonstrate the good performance of the proposed method with finite samples. Although our proposal is based on quantile regression, similar ideas can also be extended to other statistical frameworks such as M-estimation. Further research includes extending the proposed procedure to censored data, see, e.g., [7,24,25], or longitudinal data [6,26], and even to cases with a diverging number of parameters or ultra-high dimensionality [27].