Robust Relative Error Estimation

Hirose, Kei; Masuda, Hiroki

doi:10.3390/e20090632

Open AccessArticle

Robust Relative Error Estimation

by

Kei Hirose

^1,2,*

and

Hiroki Masuda

³

¹

Institute of Mathematics for Industry, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka 819-0395, Japan

²

RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan

³

Faculty of Mathematics, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka 819-0395, Japan

^*

Author to whom correspondence should be addressed.

Entropy 2018, 20(9), 632; https://doi.org/10.3390/e20090632

Submission received: 11 July 2018 / Revised: 15 August 2018 / Accepted: 20 August 2018 / Published: 24 August 2018

(This article belongs to the Special Issue New Developments in Statistical Information Theory Based on Entropy and Divergence Measures)

Download

Browse Figures

Versions Notes

Abstract

:

Relative error estimation has been recently used in regression analysis. A crucial issue of the existing relative error estimation procedures is that they are sensitive to outliers. To address this issue, we employ the

γ

-likelihood function, which is constructed through

γ

-cross entropy with keeping the original statistical model in use. The estimating equation has a redescending property, a desirable property in robust statistics, for a broad class of noise distributions. To find a minimizer of the negative

γ

-likelihood function, a majorize-minimization (MM) algorithm is constructed. The proposed algorithm is guaranteed to decrease the negative

γ

-likelihood function at each iteration. We also derive asymptotic normality of the corresponding estimator together with a simple consistent estimator of the asymptotic covariance matrix, so that we can readily construct approximate confidence sets. Monte Carlo simulation is conducted to investigate the effectiveness of the proposed procedure. Real data analysis illustrates the usefulness of our proposed procedure.

Keywords:

γ-divergence; relative error estimation; robust estimation

1. Introduction

In regression analysis, many analysts use the (penalized) least squares estimation, which aims at minimizing the mean squared prediction error [1]. On the other hand, the relative (percentage) error is often more useful and/or adequate than the mean squared error. For example, in econometrics, the comparison of prediction performance between different stock prices with different units should be made by relative error; we refer to [2,3] among others. Additionally, the prediction error of photovoltaic power production or electricity consumption is evaluated by not only mean squared error but also relative error (see, e.g., [4]). We refer to [5] regarding the usefulness and importance of the relative error.

In relative error estimation, we minimize a loss function based on the relative error. An advantage of using such a loss function is that it is scale free or unit free. Recently, several researchers have proposed various loss functions based on relative error [2,3,6,7,8,9]. Some of these procedures have been extended to the nonparameteric model [10] and random effect model [11]. The relative error estimation via the

L_{1}

regularization, including the least absolute shrinkage and operator (lasso; [12]), and the group lasso [13], have also been proposed by several authors [14,15,16], to allow for the analysis of high-dimensional data.

In practice, a response variable

y (> 0)

can turn out to be extremely large or close to zero. For example, the electricity consumption of a company may be low during holidays and high on exceptionally hot days. These responses may often be considered to be outliers, to which the relative error estimator is sensitive because the loss function diverges when

y \to \infty

or

y \to 0

. Therefore, a relative error estimation that is robust against outliers must be considered. Recently, Chen et al. [8] discussed the robustness of various relative error estimation procedures by investigating the corresponding distributions, and concluded that the distribution of least product relative error estimation (LPRE) proposed by [8] has heavier tails than others, implying that the LPRE might be more robust than others in practical applications. However, our numerical experiments show that the LPRE is not as robust as expected, so that the robustification of the LPRE is yet to be investigated from the both theoretical and practical viewpoints.

To achieve a relative error estimation that is robust against outliers, this paper employs the

γ

-likelihood function for regression analysis by Kawashima and Fujisawa [17], which is constructed by the

γ

-cross entropy [18]. The estimating equation is shown to have a redescending property, a desirable property in robust statistics literature [19]. To find a minimizer of the negative

γ

-likelihood function, we construct a majorize-minimization (MM) algorithm. The loss function of our algorithm at each iteration is shown to be convex, although the original negative

γ

-likelihood function is nonconvex. Our algorithm is guaranteed to decrease the objective function at each iteration. Moreover, we derive the asymptotic normality of the corresponding estimator together with a simple consistent estimator of the asymptotic covariance matrix, which enables us to straightforwardly create approximate confidence sets. Monte Carlo simulation is conducted to investigate the performance of our proposed procedure. An analysis of electricity consumption data is presented to illustrate the usefulness of our procedure. Supplemental material includes our R package rree (robust relative error estimation), which implements our algorithm, along with a sample program of the rree function.

The reminder of this paper is organized as follows: Section 2 reviews several relative error estimation procedures. In Section 3, we propose a relative error estimation that is robust against outliers via the

γ

-likelihood function. Section 4 presents theoretical properties: the redescending property of our method and the asymptotic distribution of the estimator, the proof of the latter being deferred to Appendix A. In Section 5, the MM algorithm is constructed to find the minimizer of the negative

γ

-likelihood function. Section 6 investigates the effectiveness of our proposed procedure via Monte Carlo simulations. Section 7 presents the analysis on electricity consumption data. Finally, concluding remarks are given in Section 8.

2. Relative Error Estimation

Suppose that

x_{i}

=

{(x_{i 1}, \dots, x_{i p})}^{T}

(i = 1, . . ., n)

are predictors and

y = {(y_{1}, . . ., y_{n})}^{T}

is a vector of positive responses. Consider the multiplicative regression model

y_{i} = exp (x_{i}^{T} β) ε_{i} = exp (\sum_{j = 1}^{p} x_{i j} β_{j}) ε_{i}, (i = 1, \dots, n),

(1)

where

β = {(β_{1}, \dots, β_{p})}^{T}

is a p-dimensional coefficient vector, and

ε_{i}

are positive random variables. Predictors

x_{i} \in R^{p}

may be random and serially dependent, while we often set

x_{i 1} = 1

, that is, incorporate the intercept in the exponent. The parameter space

B \subset R^{p}

of

β

is a bounded convex domain such that

β_{0} \in B

. We implicitly assume that the model is correctly specified, so that there exists a true parameter

β_{0} = (β_{1, 0}, \dots, β_{p, 0}) \in B

. We want to estimate

β_{0}

from a sample

{(x_{i}, y_{i})

,

i = 1, \dots, n}

.

We first remark that the condition

x_{i 1} = 1

ensures that the model (1) is scale-free regarding variables

ε_{i}

, which is an essentially different nature from the linear regression model

y_{i} = x_{i}^{T} β + ε_{i}

. Specifically, multiplying a positive constant

σ

to

ε_{i}

results in the translation of the intercept in the exponent:

y_{i} = exp (x_{i}^{T} β) σ ε_{i} = exp (log σ + x_{i}^{T} β) ε_{i}

so that the change from

ε_{i}

to

σ ε_{i}

is equivalent to that from

β_{1}

to

β_{1} + log σ

. See Remark 1 on the distribution of

ε_{1}

.

To provide a simple expression of the loss functions based on the relative error, we write

t_{i} = t_{i} (β) = exp (x_{i}^{T} β), (i = 1, \dots, n) .

Chen et al. [6,8] pointed out that the loss criterion for relative error may depend on

| (y_{i} - t_{i}) / y_{i} |

and / or

| (y_{i} - t_{i}) / t_{i} |

. These authors also proposed general relative error (GRE) criteria, defined as

G (β) = \sum_{i = 1}^{n} g (|\frac{y_{i} - t_{i}}{y_{i}}|, |\frac{y_{i} - t_{i}}{t_{i}}|),

(2)

where

g : [0, \infty) \times [0, \infty) \to [0, \infty)

. Most of the loss functions based on the relative error are included in the GRE. Park and Stefanski [2] considered a loss function

g (a, b) = a^{2}

. It may highly depend on a small

y_{i}

because it includes

1 / y_{i}^{2}

terms, and then the estimator can be numerically unstable. Consistency and asymptotic normality may not be established under general regularity conditions [8]. The loss functions based on

g (a, b) = max {a, b}

[3] and

g (a, b) = a + b

(least absolute relative error estimation, [6]) can have desirable asymptotic properties [3,6]. However, the minimization of the loss function can be challenging, in particular for high-dimensional data, when the function is nonsmooth or nonconvex.

In practice, the following two criteria would be useful:

Least product relative error estimation (LPRE) Chen et al. [8] proposed the LPRE given by $g (a, b) = a b$ . The LPRE tries to minimize the product $| 1 - t_{i} / y_{i} | \times | 1 - y_{i} / t_{i} |$ , not necessarily both terms at once.
Least squared-sum relative error estimation (LSRE) Chen et al. [8] considered the LSRE given by $g (a, b) = a^{2} + b^{2}$ . The LSRE aims to minimize both $| 1 - t_{i} / y_{i} |$ and $| 1 - y_{i} / t_{i} |$ through sum of squares ${(1 - t_{i} / y_{i})}^{2} + {(1 - y_{i} / t_{i})}^{2}$ .

The loss functions of LPRE and LSRE are smooth and convex, and also possess desirable asymptotic properties [8]. The above-described GRE criteria and their properties are summarized in Table 1. Particularly, the “convexity” in the case of

g (a, b) = a + b

holds when

ε_{i} > 0

,

ε_{i} \neq 1

, and

\sum_{i = 1}^{n} x_{i} x_{i}^{T}

is positive definite, since the Hessian matrix of the corresponding

G (β)

is

\sum_{i = 1}^{n} | ε_{i} - ε_{i}^{- 1} | x_{i} x_{i}^{T}

a.s.

Although not essential, we assume that the variables

ε_{i}

in Equation (1) are i.i.d. with common density function h. As in Chen et al. [8], we consider the following class of h associated with g:

h (ε) : = \frac{C (g)}{ε} exp \{- ρ (ε)\} I_{+} (ε),

(3)

where

ρ (ε) = ρ (ε; g) : = g (|1 - \frac{1}{ε}|, | 1 - ε |),

and

C (g)

is a normalizing constant (

\int h (ε) d ε = 1

) and

I_{+}

denotes the indicator function of set

(0, \infty)

. Furthermore, we assume the symmetry property

g (a, b) = g (b, a)

,

a, b \geq 0

, from which it follows that

ε_{1} \sim ε_{1}^{- 1}

. The latter property is necessary for a score function to be associated with the gradient of a GRE loss function, hence being a martingale with respect to a suitable filtration, which often entails estimation efficiency. Indeed, the asymmetry of

g (a, b)

(i.e.,

g (a, b) \neq g (b, a)

) may produce a substantial bias in the estimation [3]. The entire set of our regularity conditions will be shown in Section 4.3. The conditions therein concerning g are easily verified for both LPRE and LSRE.

In this paper, we implicitly suppose that

i = 1, \dots, n

denote “time” indices. As usual, in order to deal with cases of non-random and random predictors in a unified manner, we employ the partial-likelihood framework. Specifically, in the expression of the joint density (with the obvious notation for the densities)

\begin{matrix} f (x_{1}, \dots, x_{n}, y_{1}, \dots, y_{n}) \\ = \{f (x_{1}) \prod_{i = 2}^{n} f (x_{i} | x_{1} `, \dots, x_{i - 1}, y_{1}, \dots, y_{i - 1})\} \{f (y_{1} | x_{1}) \prod_{i = 2}^{n} f (y_{i} | x_{1}, \dots, x_{i}, y_{1}, \dots, y_{i - 1})\}, \end{matrix}

we ignore the first product

{\dots}

and only look at the second one

{\dots}

, which is defined as the partial likelihood. We further assume that the ith-stage noise

ε_{i}

is independent of

(x_{1}, \dots, x_{i}, y_{1}, \dots, y_{i - 1})

, so that, in view of Equation (1), we have

f (y_{i} | x_{1}, \dots, x_{i}, y_{1}, \dots, y_{i - 1}) = f (y_{i} | x_{i}), i = 1, \dots, n .

The density function of response y given

x_{i}

is

\begin{matrix} f (y | x_{i}; β) & = & exp (- x_{i}^{T} β) h (y exp (- x_{i}^{T} β)) \end{matrix}

(4)

\begin{matrix} = & \frac{1}{t_{i}} h (\frac{y}{t_{i}}) \end{matrix}

(5)

From Equation (3), we see that the maximum likelihood estimator (MLE) based on the error distribution in Equation (5) is obtained by the minimization of Equation (2). For example, the density functions of LPRE and LSRE are

\begin{matrix} L P R E : & f (y | x_{i}) = \frac{1}{2 K_{0} (2)} y^{- 1} exp (- \frac{y}{t_{i}} - \frac{t_{i}}{y}), (y > 0), \\ L S R E : & f (y | x_{i}) = C_{L S R E} y^{- 1} exp \{- {(1 - \frac{t_{i}}{y})}^{2} - {(1 - \frac{y}{t_{i}})}^{2}\}, (y > 0), \end{matrix}

(6)

where

K_{ν} (z)

denotes the modified Bessel function of third kind with index

ν \in R

:

\begin{matrix} K_{ν} (z) = \frac{z^{ν}}{2^{ν + 1}} \int_{0}^{\infty} t^{- ν - 1} exp (- t - \frac{z^{2}}{4 t}) d t \end{matrix}

and

C_{L S R E}

is a constant term. Constant terms are numerically computed as

K_{0} (2) \approx 0.1139

and

C_{L S R E} \approx 0.911411

. Density (6) is a special case of the generalized inverse Gaussian distribution (see, e.g., [20]).

Remark 1.

We assume that the noise density h is fully specified in the sense that, given g, the density h does not involve any unknown quantity. However, this is never essential. For example, for the LPRE defined by Equation (6), we could naturally incorporate one more parameter

σ > 0

into h, the resulting form of

h (ε)

being

ε \mapsto \frac{1}{2 K_{0} (σ)} ε^{- 1} exp \{- \frac{σ}{2} (ε + \frac{1}{ε})\} I_{+} (ε) .

Then, we can verify that the distributional equivalence

ε_{1} \sim ε_{1}^{- 1}

holds whatever the value of σ is. Particularly, the estimation of parameter σ does make statistical sense and, indeed, it is possible to deduce the asymptotic normality of the joint maximum-(partial-) likelihood estimator of

(β, σ)

. In this paper, we do not pay attention to such a possible additional parameter, but instead regard it (whenever it exists) as a nuisance parameter, as in the noise variance in the least-squares estimation of a linear regression model.

3. Robust Estimation via $γ$ -Likelihood

In practice, outliers can often be observed. For example, the electricity consumption data can have the outliers on extremely hot days. The estimation methods via GRE criteria, including LPRE and LSRE, are not robust against outliers, because the corresponding density functions are not generally heavy-tailed. Therefore, a relative error estimation method that is robust against the outliers is needed. To achieve this, we consider minimizing the negative

γ

-(partial-)likelihood function based on the

γ

-cross entropy [17].

We now define the negative

γ

-(partial-)likelihood function by

\begin{matrix} ℓ_{γ, n} (β) = - \frac{1}{γ} log \{\frac{1}{n} \sum_{i = 1}^{n} f {(y_{i} | x_{i}; β)}^{γ}\} + \frac{1}{1 + γ} log \{\frac{1}{n} \sum_{i = 1}^{n} \int_{0}^{\infty} f {(y | x_{i}; β)}^{1 + γ} d y\}, \end{matrix}

(7)

where

γ > 0

is a parameter that controls the degrees of robustness;

γ \to 0

corresponds to the negative log-likelihood function, and robustness is enhanced as

γ

increases. On the other hand, a too large

γ

can decrease the efficiency of the estimator [18]. In practice, the value of

γ

may be selected by a cross-validation based on

γ

-cross entropy (see, e.g., [18,21]). We refer to Kawashima and Fujisawa [22] for more recent observations on comparison of the

γ

-divergences between Fujisawa and Eguchi [18] and Kawashima and Fujisawa [17].

There are several likelihood functions which yield robust estimation. Examples include the

L_{q}

-likelihood [23], and the likelihood based on the density power divergence [24], referred to as

β

-likelihood. It is shown that the

γ

-likelihood, the

L_{q}

-likelihood, and the

β

-likelihood are closely related. The negative

β

-likelihood function

ℓ_{α, n} (β)

and the negative

L_{q}

-likelihood function

ℓ_{q, n} (β)

are, respectively, expressed as

\begin{matrix} ℓ_{α, n} (β) & = & - \frac{1}{α} \frac{1}{n} \sum_{i = 1}^{n} f {(y_{i} | x_{i}; β)}^{α} + \frac{1}{1 + α} \frac{1}{n} \sum_{i = 1}^{n} \int_{0}^{\infty} f {(y | x_{i}; β)}^{1 + α} d y, \end{matrix}

(8)

\begin{matrix} ℓ_{q, n} (β) & = & - \sum_{i = 1}^{n} \frac{f {(y_{i} | x_{i}; β)}^{1 - q} - 1}{1 - q} . \end{matrix}

(9)

The difference between

γ

-likelihood and

β

-likelihood is just the existence of the logarithm on

ℓ_{γ, n} (β)

. Furthermore, substituting

q = 1 - α

into Equation (9) gives us

\begin{matrix} ℓ_{q, n} (β) & = & - \frac{1}{α} \sum_{i = 1}^{n} f {(y_{i} | x_{i}; β)}^{α} + const . \end{matrix}

Therefore, the minimization of the negative

L_{q}

-likelihood function is equivalent to minimization of the negative

β

-likelihood function without second term in the right side of Equation (8). Note that the

γ

-likelihood has the redescending property, a desirable property in robust statistics literature, as shown in Section 4.2. Moreover, it is known that the

γ

-likelihood is the essentially unique divergence that is robust against heavy contamination (see [18] for details). On the other hand, we have not shown whether the

L_{q}

-likelihood and/or the

β

-likelihood have the redescending property or not.

The integration

\int f {(y | x_{i}; β)}^{1 + γ} d y

in the second term on the right-hand side of Equation (7) is

\begin{matrix} \int_{0}^{\infty} f {(y | x_{i}; β)}^{1 + γ} d y & = \frac{1}{t_{i}^{1 + γ}} \int_{0}^{\infty} {\{h (\frac{y}{t_{i}})\}}^{1 + γ} d y = : t_{i}^{- γ} C (γ, h), \end{matrix}

where

C (γ, h) : = \int_{0}^{\infty} h {(v)}^{1 + γ} d v

(10)

is a constant term, which is assumed to be finite. Then, Equation (7) is expressed as

\begin{matrix} ℓ_{γ, n} (β) & = & \underset{= : ℓ_{1} (β)}{\underset{︸}{- \frac{1}{γ} log \{\sum_{i = 1}^{n} f {(y_{i} | x_{i}; β)}^{γ}\}}} + \underset{= : ℓ_{2} (β)}{\underset{︸}{\frac{1}{1 + γ} log \{\sum_{i = 1}^{n} t_{i}^{- γ}\}}} + C_{0} (γ, h), \end{matrix}

(11)

where

C_{0} (γ, h)

is a constant term free from

β

. We define the maximum

γ

-likelihood estimator to be any element such that

{\hat{β}}_{γ} \in argmin ℓ_{γ, n} .

4. Theoretical Properties

4.1. Technical Assumptions

Let

\overset{p}{\to}

denote the convergence in probability.

Assumption A1 (Stability of the predictor).

There exists a probability measure

π (d x)

on the state space

X

of the predictors and positive constants

δ, δ^{'} > 0

such that

\frac{1}{n} \sum_{i = 1}^{n} {| x_{i} |}^{3} exp (δ^{'} {| x_{i} |}^{1 + δ}) = O_{p} (1),

and that

\frac{1}{n} \sum_{i = 1}^{n} η (x_{i}) \overset{p}{\to} \int_{X} η (x) π (d x), n \to \infty,

where the limit is finite for any measurable η satisfying that

sup_{x \in R^{p}} \frac{| η (x) |}{{(1 + | x |}^{3}) exp (δ^{'} {| x |}^{1 + δ})} < \infty .

Assumption A2 (Noise structure).

The a.s. positive i.i.d. random variables

ε_{1}, ε_{2}, \dots

have a common positive density h of the form (3):

h (ε) = \frac{C (g)}{ε} exp \{- ρ (ε)\} I_{+} (ε),

for which the following conditions hold.

Function $g : [0, \infty) \times [0, \infty) \to [0, \infty)$ is three times continuously differentiable on $(0, \infty)$ and satisfies that

$g (a, b) = g (b, a), a, b \geq 0 .$
There exist constants $κ_{0}, κ_{\infty} > 0$ , and $c > 1$ such that

$\frac{1}{c} (ε^{- κ_{0}} \lor ε^{κ_{\infty}}) \leq ρ (ε) \leq c (ε^{- κ_{0}} \lor ε^{κ_{\infty}})$

for every $ε > 0$ .
There exist constants $c_{0}, c_{\infty} \geq 0$ such that

$sup_{ε > 0} {(ε^{- c_{0}} \lor ε^{c_{\infty}})}^{- 1} max_{k = 1, 2, 3} |\partial_{ε}^{k} ρ (ε)| < \infty .$

Here and in the sequel, for a variable a, we denote by

\partial_{a}^{k}

the kth-order partial differentiation with respect to a.

Assumption 1 is necessary to identify the large-sample stochastic limits of the several key quantities in the proofs: without them, we will not be able to deduce an explicit asymptotic normality result. Assumption 2 holds for many cases, including the LPRE and the LSRE (i.e.,

g (a, b) = a b

and

a^{2} + b^{2}

), while excluding

g (a, b) = a^{2}

and

g (a, b) = b^{2}

. The smoothness condition on h on

(0, \infty)

is not essential and could be weakened in light of the M-estimation theory ([25], Chapter 5). Under these assumptions, we can deduce the following statements.

h is three times continuously differentiable on $(0, \infty)$ , and for each $α > 0$ ,

$\int_{0}^{\infty} h^{α} (ε) d ε < \infty and max_{k = 0, 1, 2, 3} sup_{ε > 0} |\partial_{ε}^{k} \{h {(ε)}^{α}\}| < \infty .$
For each $γ > 0$ and $α > 0$ (recall that the value of $γ > 0$ is given),

$lim_{ε ↓ 0} h {(ε)}^{γ} {|u_{h} (ε)|}^{α} = lim_{ε ↑ \infty} h {(ε)}^{γ} {|u_{h} (ε)|}^{α} = 0,$

(12)

where

$u_{h} (z) : = 1 + z \partial_{z} log h (z) = 1 + z \frac{h^{'} (z)}{h (z)} .$

The verifications are straightforward hence omitted.

Finally, we impose the following assumption:

Assumption A3 (Identifiability).

We have

β = β_{0}

if

ρ (e^{- x^{T} β} y) = ρ (e^{- x^{T} β_{0}} y) π (d x) \otimes λ_{+} (d y) - a . e . (x, y),

where

λ_{+}

denotes the Lebesgue measure on

(0, \infty)

.

4.2. Redescending Property

The estimating function based on the negative

γ

-likelihood function is given by

\sum_{i = 1}^{n} ψ (y_{i} | x_{i}; β) = 0 .

In our model, we consider not only too large

y_{i}

s but also too small

y_{i}

s as outliers: the estimating equation is said to have the redescending property if

\begin{matrix} lim_{y \to \infty} ψ (y | x; β_{0}) = lim_{y \to + 0} ψ (y | x; β_{0}) = 0 \end{matrix}

for each

x

. The redescending property is known as a desirable property in robust statistics literature [19]. Here, we show the proposed procedure has the redescending property.

The estimating equation based on the negative

γ

-likelihood function is

\begin{matrix} - \frac{\sum_{i = 1}^{n} f {(y_{i} | x_{i}; β)}^{γ} s (y_{i} | x_{i}; β)}{\sum_{j = 1}^{n} f {(y_{j} | x_{j}; β)}^{γ}} + \frac{\partial}{\partial β} ℓ_{2} (β) = 0, \end{matrix}

where

s (y | x; β) = \frac{\partial log f (y | x; β)}{\partial β} .

We have expression

\begin{matrix} ψ (y | x; β) = f {(y | x; β)}^{γ} \{s (y | x; β) - \frac{\partial}{\partial β} ℓ_{2} (β)\} . \end{matrix}

Note that

\frac{\partial}{\partial β} ℓ_{2} (β)

is free from y. For each

(x, β)

, direct computations give the estimate

|ψ (y | x; β)| \leq C (x; β) h {(exp (- x^{T} β) y)}^{γ} |u_{h} (exp (- x^{T} β) y)|

(13)

for some constant

C (x; β)

free from y. Hence, Equation (12) combined with the inequality (13) leads to the redescending property.

4.3. Asymptotic Distribution

Recall Equation (10) for the definition of

C (γ, h)

and let

\begin{matrix} C_{1} (γ, h) & : = \int_{0}^{\infty} ε h {(ε)}^{γ} h^{'} (ε) d ε, \\ C_{2} (γ, h) & : = \int_{0}^{\infty} u_{h} {(ε)}^{2} h {(ε)}^{2 γ + 1} d ε, \\ Π_{k} (γ) & : = \int x^{\otimes k} exp (- γ x^{T} β_{0}) π (d x), k = 0, 1, 2, \end{matrix}

where

x^{\otimes 0} : = 1 \in R

,

x^{\otimes 1} : = x \in R^{p}

, and

x^{\otimes 2} : = x x^{T} \in R^{p} \otimes R^{p}

; Assumptions 1 and 2 ensure that all these quantities are finite for each

γ > 0

. Moreover,

\begin{matrix} H_{γ}^{'} (β_{0}) & : = \int \int f {(y | x; β_{0})}^{γ + 1} d y π (d x) = C (γ, h) Π_{0} (γ), \\ H_{γ}^{''} (β_{0}) & : = \int \int f {(y | x; β_{0})}^{γ + 1} s (y | x; β_{0}) d y π (d x) = - \{C (γ, h) + C_{1} (γ, h)\} Π_{1} (γ), \\ Δ_{γ} (β_{0}) & : = C {(γ, h)}^{2} C_{2} (γ, h) Π_{0} {(γ)}^{2} Π_{2} (2 γ) \\ + {\{C (γ, h) + C_{1} (γ, h)\}}^{2} C (2 γ, h) Π_{0} (2 γ) Π_{1} {(γ)}^{\otimes 2} \\ - 2 C (γ, h) \{C (γ, h) + C_{1} (γ, h)\} \{C (2 γ, h) + C_{1} (2 γ, h)\} Π_{0} (γ) Π_{1} (2 γ) Π_{1} {(γ)}^{T}, \end{matrix}

(14)

\begin{matrix} J_{γ} (β_{0}) & : = C (γ, h) C_{2} (γ / 2, h) Π_{0} (γ) Π_{2} (γ) - {\{C (γ, h) + C_{1} (γ, h)\}}^{2} Π_{1} {(γ)}^{\otimes 2} . \end{matrix}

(15)

We are assuming that density h and tuning parameter

γ

are given a priori, hence we can (numerically) compute constants

C (γ, h)

,

C_{1} (γ, h)

, and

C_{2} (γ, h)

. In the following, we often omit “

(β_{0})

” from the notation.

Let

\overset{L}{\to}

denote the convergence in distribution.

Theorem 1.

Under Assumptions 1–3, we have

\sqrt{n} ({\hat{β}}_{γ} - β_{0}) \overset{L}{\to} N_{p} (0, J_{γ}^{- 1} Δ_{γ} J_{γ}^{- 1}) .

(16)

The asymptotic covariance matrix can be consistently estimated through expressions (14) and (15) with quantities

Π_{k} (γ)

therein replaced by the empirical estimates:

{\hat{Π}}_{k, n} (γ) : = \frac{1}{n} \sum_{i = 1}^{n} x_{i}^{\otimes k} exp (- γ x_{i}^{T} {\hat{β}}_{γ}) \overset{p}{\to} Π_{k} (γ), k = 0, 1, 2 .

(17)

The proof of Theorem 1 will be given in Appendix A. Note that, for

γ \to 0

, we have

C (γ, h) \to 1

,

C_{1} (γ, h) \to - 1

, and

C_{2} (γ, h) \to \int_{0}^{\infty} u_{h} {(ε)}^{2} h (ε) d ε

, which in particular entails

H_{γ}^{'} \to 1

and

H_{γ}^{″} \to 0

. Then, both

Δ_{γ}

and

J_{γ}

tend to the Fisher information matrix

I_{0} : = \int \int s {(y | x; β_{0})}^{\otimes 2} f (y | x, β_{0}) π (d x) d y = \int_{0}^{\infty} u_{h} {(ε)}^{2} h (ε) d ε \int x^{\otimes 2} π (d x)

as

γ \to 0

, so that the asymptotic distribution

N_{p} (0, J_{γ}^{- 1} Δ_{γ} J_{γ}^{- 1})

becomes

N_{p} (0, I_{0}^{- 1})

, the usual one of the MLE.

We also note that, without details, we could deduce a density-power divergence (also known as the

β

-divergence [26]) counterpart to Theorem 1 similarly but with slightly lesser computation cost; in that case, we consider the objective function

ℓ_{α, n} (β)

defined by Equation (8) instead of the

γ

-(partial-)likelihood (7). See Basu et al. [24] and Jone et al. [21] for details of the density-power divergence.

5. Algorithm

Even if the GRE criterion in Equation (2) is a convex function, the negative

γ

-likelihood function is nonconvex. Therefore, it is difficult to find a global minimum. Here, we derive the MM (majorize-minimization) algorithm to obtain a local minimum. The MM algorithm monotonically decreases the objective function at each iteration. We refer to Hunter and Lange [27] for a concise account of the MM algorithm.

Let

β^{(t)}

be the value of the parameter at the tth iteration. The negative

γ

-likelihood function in Equation (11) consists of two nonconvex functions,

ℓ_{1} (β)

and

ℓ_{2} (β)

. The majorization functions of

ℓ_{j} (β)

, say

{\tilde{ℓ}}_{j} (β | β^{(t)})

(

j = 1, 2

), are constructed so that the optimization of

{min}_{β} {\tilde{ℓ}}_{j} (β | β^{(t)})

is much easier than that of

{min}_{β} ℓ_{j} (β)

. The majorization functions must satisfy the following inequalities:

\begin{matrix} {\tilde{ℓ}}_{j} (β | β^{(t)}) & \geq & ℓ_{j} (β), \end{matrix}

(18)

\begin{matrix} {\tilde{ℓ}}_{j} (β^{(t)} | β^{(t)}) & = & ℓ_{j} (β^{(t)}) . \end{matrix}

(19)

Here, we construct majorization functions

{\tilde{ℓ}}_{j} (β | β^{(t)})

for

j = 1, 2

.

5.1. Majorization Function for $ℓ_{1} (β)$

Let

\begin{matrix} w_{i}^{(t)} & = & \frac{f {(y_{i} | x_{i}; β^{(t)})}^{γ}}{\sum_{j = 1}^{n} f {(y_{j} | x_{j}; β^{(t)})}^{γ}}, \end{matrix}

(20)

\begin{matrix} r_{i}^{(t)} & = & \sum_{j = 1}^{n} f {(y_{j} | x_{j}; β^{(t)})}^{γ} \frac{f {(y_{i} | x_{i}; β)}^{γ}}{f {(y_{i} | x_{i}; β^{(t)})}^{γ}} . \end{matrix}

(21)

Obviously,

\sum_{i = 1}^{n} w_{i}^{(t)} = 1

and

w_{i}^{(t)} r_{i}^{(t)} = f {(y_{i} | x_{i}; β)}^{γ}

. Applying Jensen’s inequality to

y = - log x

, we obtain inequality

\begin{matrix} - log (\sum_{i = 1}^{n} w_{i}^{(t)} r_{i}^{(t)}) \leq - \sum_{i = 1}^{n} w_{i}^{(t)} log r_{i}^{(t)} . \end{matrix}

(22)

Substituting Equation (20) and Equation (21) into Equation (22) gives

\begin{matrix} ℓ_{1} (β) \leq - \sum_{i = 1}^{n} w_{i}^{(t)} log f (y_{i} | x_{i}; β) + C, \end{matrix}

where

C = \frac{1}{γ} \sum_{i} w_{i}^{(t)} log w_{i}^{(t)}

. Denoting

{\tilde{ℓ}}_{1} (β | β^{(t)}) = - \sum_{i = 1}^{n} w_{i}^{(t)} log f (y_{i} | x_{i}; β) + C,

(23)

we observe that Equation (23) satisfies Equation (18) and Equation (19). It is shown that

{\tilde{ℓ}}_{1} (β | β^{(t)})

is a convex function if the original relative error loss function is convex. Particularly, the majorization functions

{\tilde{ℓ}}_{1} (β | β^{(t)})

based on LPRE and LSRE are both convex.

5.2. Majorization Function for $ℓ_{2} (β)$

Let

θ_{i} = - γ x_{i}^{T} β

. We view

ℓ_{2} (β)

as a function of

θ = {(θ_{1}, \dots, θ_{n})}^{T}

. Let

\begin{matrix} s (θ) : = log (\sum_{i = 1}^{n} t_{i}^{- γ}) = log (\sum_{i = 1}^{n} exp (θ_{i})) . \end{matrix}

(24)

By taking the derivative of

s (θ)

with respect to

θ

, we have

\begin{matrix} \frac{\partial s (θ)}{\partial θ_{i}} = π_{i}, \frac{\partial^{2} s (θ)}{\partial θ_{j} \partial θ_{i}} = π_{i} δ_{i j} - π_{i} π_{j}, \end{matrix}

where

π_{i} = exp (θ_{i}) / {\sum_{k = 1}^{n} exp (θ_{k})}

. Note that

\sum_{i = 1}^{n} π_{i} = 1

for any

θ

.

The Taylor expansion of

s (θ)

at

θ = θ^{(t)}

is expressed as

\begin{matrix} s (θ) & = & s (θ^{(t)}) + π^{(t) T} (θ - θ^{(t)}) + \frac{1}{2} {(θ - θ^{(t)})}^{T} \frac{\partial^{2} s (θ^{*})}{\partial θ \partial θ^{T}} (θ - θ^{(t)}), \end{matrix}

(25)

where

π^{(t)} = {(π_{1}^{(t)}, \dots, π_{n}^{(t)})}^{T}

and

θ^{*}

is an n-dimensional vector located between

θ

and

θ^{(t)}

. We define an

n \times n

matrix

B

as follows:

B : = \frac{1}{2} (I - \frac{1}{n} 1 1^{T}) .

It follows from [28] that, in the matrix sense,

\frac{\partial^{2} s (θ)}{\partial θ \partial θ^{T}} \leq B

(26)

for any

θ

. Combining Equation (25) and Equation (26), we have

\begin{matrix} s (θ) & \leq & s (θ^{(t)}) + π^{(t) T} (θ - θ^{(t)}) + \frac{1}{2} {(θ - θ^{(t)})}^{T} B (θ - θ^{(t)}) . \end{matrix}

(27)

Substituting Equation (24) into Equation (27) gives

\begin{matrix} log \{\sum_{i = 1}^{n} exp (- γ x_{i}^{T} β)\} & \leq & log \{\sum_{i = 1}^{n} exp (- γ x_{i}^{T} β^{(t)})\} - γ π^{(t) T} X (β - β^{(t)}) \\ + \frac{γ^{2}}{2} {(β - β^{(t)})}^{T} X^{T} B X (β - β^{(t)}), \end{matrix}

where

X = {(x_{1}, \dots, x_{n})}^{T}

. The majorization function of

ℓ_{2} (β)

is then constructed by

\begin{matrix} {\tilde{ℓ}}_{2} (β | β^{(t)}) & = & \frac{γ^{2}}{2 (1 + γ)} β^{T} X^{T} B X β - \frac{γ}{1 + γ} β^{T} (X^{T} π^{(t)} + γ X^{T} B X β^{(t)}) + C, \end{matrix}

(28)

where C is a constant term free from

β

. We observe that

{\tilde{ℓ}}_{2} (β | β^{(t)})

in Equation (28) satisfies Equation (18) and Equation (). It is shown that

{\tilde{ℓ}}_{2} (β | β^{(t)})

is a convex function because

X^{T} B X

is positive semi-definite.

5.3. MM Algorithm for Robust Relative Error Estimation

In Section 5.1 and Section 5.2, we have constructed the majorization functions for both

ℓ_{1} (β)

and

ℓ_{2} (β)

. The MM algorithm based on these majorization functions is detailed in Algorithm 1. The majorization function

{\tilde{ℓ}}_{1} (β | β^{(t)}) + {\tilde{ℓ}}_{2} (β | β^{(t)})

is convex if the original relative error loss function is convex. Particularly, the majorization functions of LPRE and LSRE are both convex.

Algorithm 1 Algorithm of robust relative error estimation.

1:: $t \leftarrow 0$
2:: Set an initial value of parameter vector $β^{(0)}$ .
3:: while $β^{(t)}$ is converged do
4:: Update the weights by Equation (20)
5:: Update $β$ by

$β^{(t + 1)} \leftarrow \arg min_{β} {{\tilde{ℓ}}_{1} (β | β^{(t)}) + {\tilde{ℓ}}_{2} (β | β^{(t)})},$

where ${\tilde{ℓ}}_{1} (β | β^{(t)})$ and ${\tilde{ℓ}}_{2} (β | β^{(t)})$ are given by Equation (23) and Equation (28), respectively.
6:: $t \leftarrow t + 1$
7:: end while

Remark 2.

Instead of the MM algorithm, one can directly use the quasi-Newton method, such as the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm to minimize the negative γ-likelihood function. In our experience, the BFGS algorithm is faster than the MM algorithm but is more sensitive to an initial value than the MM algorithm. The strengths of BFGS and MM algorithms would be shared by using the following hybrid algorithm:

We first conduct the MM algorithm with a small number of iterations.
Then, the BFGS algorithm is conducted. We use the estimate obtained by the MM algorithm as an initial value of the BFGS algorithm.

The stableness of the MM algorithm is investigated through the real data analysis in Section 7.

Remark 3.

To deal with high-dimensional data, we often use the

L_{1}

regulzarization, such as the lasso [12], elastic net [29], and Smoothly Clipped Absolute Deviation (SCAD) [30]. In robust relative error estimation, the loss function based on the lasso is expressed as

ℓ_{γ, n} (β) + λ \sum_{j = 1}^{p} | β_{j} |,

(29)

where

λ > 0

is a regularization parameter. However, the loss function in Equation (29) is non-convex and non-differentiable. Instead of directly minimizing the non-convex loss function in Equation (29), we may use the MM algorithm; the following convex loss function is minimized at each iteration:

{\tilde{ℓ}}_{1} (β | β^{(t)}) + {\tilde{ℓ}}_{2} (β | β^{(t)}) + λ \sum_{j = 1}^{p} | β_{j} | .

(30)

The minimization of Equation (30) can be realized by the alternating direction method of multipliers algorithm [14] or the coordinate descent algorithm with quadratic approximation of

{\tilde{ℓ}}_{1} (β | β^{(t)}) + {\tilde{ℓ}}_{2} (β | β^{(t)})

[31].

6. Monte Carlo Simulation

6.1. Setting

We consider the following two simulation models as follows:

\begin{matrix} Model 1 : & β_{0} = {(1, 1, 1)}^{T}, \\ Model 2 : & β_{0} = {\underset{6}{\underset{︸}{(0.5, \dots, 0.5}}, \underset{45}{\underset{︸}{0, \dots, 0)}}}^{T} . \end{matrix}

The number of observations is set to be

n = 200

. For each model, we generate T=10,000 datasets of predictors

x_{i}

(

i = 1, \dots, n

) according to

N (0, (1 - ρ) I + ρ 1 1^{T})

. Here, we consider the case of

ρ = 0.0

and

ρ = 0.6

. Responses

y_{i}

are generated from the mixture distribution

\begin{matrix} (1 - δ) f (y | x_{i}; β_{0}) + δ q (y) (i = 1, \dots, n), \end{matrix}

where

f (y | x; β_{0})

is a density function corresponding to the LPRE defined as Equation (6),

q (y)

is a density function of distribution of outliers, and

δ

(

0 \leq δ < 1

) is an outlier ratio. The outlier ratio is set to be

δ = 0,

0.05,

0.1,

and

0.2

in this simulation. We assume that

q (y)

follows a log-normal distribution (pdf:

q (y) = 1 / (\sqrt{2 π} y σ) exp {- {(log y - μ)}^{2} / (2 σ^{2})}

) with

(μ, σ) = (\pm 5, 1)

. When

μ = 5

, the outliers take extremely large values. On the other hand, when

μ = - 5

, the data values of outliers are nearly zero.

6.2. Investigation of Relative Prediction Error and Mean Squared Error of the Estimator

To investigate the performance of our proposed procedure, we use the relative prediction error (RPE) and the mean square error (MSE) for the tth dataset, defined as

\begin{matrix} R P E (t) & = & \sum_{i = 1}^{n} \frac{{[y_{i}^{new} (t) - exp {x_{i} {(t)}^{T} \hat{β} (t)}]}^{2}}{y_{i}^{new} (t) exp {x_{i} {(t)}^{T} \hat{β} (t)}}, \end{matrix}

(31)

\begin{matrix} M S E (t) & = & ∥ \hat{β} (t) - β_{0} ∥^{2}, \end{matrix}

(32)

respectively, where

\hat{β} (t)

is an estimator obtained from the dataset

{(x_{i} (t), y_{i} (t)); i = 1, \dots, n}

, and

y_{i}^{new} (t)

is an observation from

y_{i}^{new} (t) | x_{i} (t)

. Here,

y_{i}^{new} (t) | x_{i} (t)

follows a distribution of

f (y | x_{i} (t); β_{0})

and is independent of

y_{i} (t) | x_{i} (t)

. Figure 1 shows the median and error bar of {RPE(1), ⋯, RPE(T)} and {MSE(1), ⋯, MSE(T)}. The error bars are delineated by the 25th and 75th percentiles.

We observe the following tendencies from the results in Figure 1:

As the outlier ratio increases, the performance becomes worse in all cases. Interestingly, the length of the error bar of RPE increases as the outlier ratio increases.
The proposed method becomes robust against outliers as the value of $γ$ increases. We observe that a too large $γ$ , such as $γ = 10$ , leads to extremely poor RPE and MSE because most observations are regarded as outliers. Therefore, the not too large $γ$ , such as the $γ = 0.5$ used here, generally results in better estimation accuracy than the MLE.
The cases of $ρ = 0.6$ , where the predictors are correlated, are worse than those of $ρ = 0$ . Particularly, when $γ = 0$ , the value of RPE of $ρ = 0.6$ becomes large on the large outlier ratio. However, increasing $γ$ has led to better estimation performance uniformly.
The results for different simulation models on the same value of $γ$ are generally different, which implies the appropriate value of $γ$ may change according to the data generating mechanisms.

6.3. Investigation of Asymptotic Distribution

The asymptotic distribution is derived under the assumption that the true distribution of

y | x_{i}

follows

f (y | x_{i}; β_{0})

, that is,

δ = 0

. However, we expect that, when

γ

is sufficiently large and

δ

is moderate, the asymptotic distribution may approximate the true distribution well, a point underlined by Fujisawa and Eguchi ([18], Theorem 5.1) in the case of i.i.d. data. We investigate whether the asymptotic distribution given by Equation (16) appropriately works when there exist outliers.

The asymptotic covariance matrix in Equation (16) depends on

C (γ, h)

,

C_{1} (γ, h)

, and

C_{2} (γ, h)

. For the LPRE, simple calculations provide

\begin{matrix} C (γ, h) & = & \int_{0}^{\infty} h {(x)}^{1 + γ} d x = \frac{K_{γ} (2 + 2 γ)}{2^{γ} K_{0} {(2)}^{1 + γ}}, \\ C_{1} (γ, h) & = & \int_{0}^{\infty} x h {(x)}^{γ} h^{'} (x) d x = - \frac{K_{γ} (2 + 2 γ)}{(1 + γ) 2^{γ} K_{0} {(2)}^{1 + γ}}, \\ C_{2} (γ, h) & = & \int_{0}^{\infty} u {(x)}^{2} h {(x)}^{2 γ + 1} d x \\ = & \frac{2^{1 - 2 γ}}{{(2 γ + 1)}^{2} K_{0} {(2)}^{2 γ + 1}} \{γ (2 γ + 1) K_{2 γ - 2} (4 γ + 2) + (1 + γ + 2 γ^{2}) K_{2 γ - 1} (4 γ + 2)\} . \end{matrix}

The Bessel function of third kind,

K_{\cdot} (\cdot)

, can be numerically computed, and then we obtain the values of

C (γ, h)

,

C_{1} (γ, h)

, and

C_{2} (γ, h)

.

Let

z = {(z_{1}, \dots, z_{p})}^{T}

be

z : = \sqrt{n} {\{diag (J_{γ}^{- 1} Δ_{γ} J_{γ}^{- 1})\}}^{- \frac{1}{2}} ({\hat{β}}_{γ} - β_{0}) .

Equation (16) implies that

z_{j} \overset{L}{\to} N (0, 1), (j = 1, \dots, p) .

We expect that the histogram of

z_{j}

obtained by the simulation would approximate the density function of the standard normal distribution when there are no (or a few) outliers. When there exists a significant number of outliers, the asymptotic distribution of

z_{j}

may not be

N (0, 1)

but is expected to be close to

N (0, 1)

for large

γ

. Figure 2 shows the histograms of T=10,000 samples of

z_{2}

along with the density function of the standard normal distribution for

μ = 5

in Model 1.

When there are no outliers, the distribution of

z_{2}

is close to the standard normal distribution whatever the value of

γ

is selected. When the outlier ratio is large, the histogram of

z_{2}

is far from the density function of

N (0, 1)

for a small

γ

. However, when the value of

γ

is large, the histogram of

z_{2}

is close to the density function of

N (0, 1)

, which implies the asymptotic distribution in Equation (16) appropriately approximates the distribution of estimators even when there exist outliers. We observe that the result of the asymptotic distributions for other

z_{j}

s shows a similar tendency to that of

z_{2}

.

7. Real Data Analysis

We apply the proposed method to electricity consumption data from the UCI (University of California, Irvine) Machine Learning repository [32]. The dataset consists of 370 household electricity consumption observations from January 2011 to December 2014. The electricity consumption is in kWh at 15-minute intervals. We consider the problem of prediction of the electricity consumption for next day by using past electricity consumption. The prediction of the day ahead electricity consumption is needed when we trade electricity on markets, such as the European Power Exchange (EPEX) day ahead market (https://www.epexspot.com/en/market-data/dayaheadauction) and the Japan Power Exchange (JEPX) day ahead market (http://www.jepx.org/english/index.html). In the JEPX market, when the prediction value of electricity consumption

{\hat{y}}_{t}

is smaller than actual electricity consumption

y_{t}

, the price of the amount of

y_{t} - {\hat{y}}_{t}

becomes “imbalance price”, which is usually higher than the ordinary price. For details, please refer to Sioshansi and Pfaffenberger [33].

To investigate the effectiveness of the proposed procedure, we choose one household that includes small positive values of electricity consumption. The consumption data for 25 December 2014 were deleted because the corresponding data values are zero. We predict the electricity consumption from January 2012 to December 2014 (the data in 2011 are used only for estimating the parameter). The actual electricity consumption data from January 2012 to December 2014 are depicted in Figure 3.

We observe that several data values are close to zero from Figure 3. Particularly, from October to December 2014, several spikes exist that attain nearly zero values. In this case, the estimation accuracy is poor with ordinary GRE criteria, as shown in our numerical simulation in the previous section.

We assume the multiplicative regression model in Equation (1) to predict electricity consumption. Let

y_{t}

denote the electricity consumption at t (

t = 1, \dots, T

). The number of observations is

T = (365 \times 3 + 366 - 1) \times 96 =

146,160. Here, 96 is the number of measurements in one day because electricity demand is expressed in 15-minute intervals. We define

x_{t}

as

x_{t} = {(y_{t - d}, \dots, y_{t - d q})}^{T}

, where

d = 96

. In our model, the electricity consumption at t is explained by the electricity consumption of the past q days for the same period. We set

q = 5

for data analysis and use past

n = 100

days of observations to estimate the model.

The model parameters are estimated by robust LPRE. The values of

γ

are set to be regular sequences from 0 to 0.1, with increments of 0.01. To minimize the negative

γ

-likelihood function, we apply our proposed MM algorithm. As the electricity consumption pattern on weekdays is known to be completely different from that on weekends, we make predictions for weekdays and weekends separately. The results of the relative prediction error are depicted in Figure 4.

The relative prediction error is large when

γ = 0

(i.e., ordinary LPRE estimation). The minimum value of relative prediction error is 0.049 and the corresponding value of

γ

is

γ = 0.03

. When we set a too large value of

γ

, efficiency decreases and the relative prediction error might increase.

Figure 5 shows the prediction value when

γ = 0

. We observe that there exist several extremely large prediction values (e.g., 8 July 2013 and 6 November 2014) due to the model parameters, which are heavily affected by the nearly zero values of electricity consumption.

Figure 6 shows the prediction values when

γ = 0.03

. Extremely large prediction values are not observed and the prediction values are similar to the actual electricity demand in Figure 3. Therefore, our proposed procedure is robust against outliers.

Additionally, we apply the Yule–Walker method, one of the most popular estimation procedures in the autoregressive (AR) model. Note that the Yule–Walker method does not regard a small positive value of

y_{t}

as an outlier, so that we do not have to conduct the robust AR model for this dataset. The relative prediction error of the Yule–Walker is 0.123, which is larger than that of our proposed method (0.049).

Furthermore, to investigate the stableness of the MM algorithm described in Section 5, we also apply the BFGS method to obtain the minimizer of the negative

γ

-likelihood function. The optim function in R is used to implement the BFGS method. With the BFGS method, relative prediction errors diverge when

γ \geq 0.03

. Consequently, the MM algorithm is more stable than the BFGS algorithm for this dataset.

8. Discussion

We proposed a relative error estimation procedure that is robust against outliers. The proposed procedure is based on the

γ

-likelihood function, which is constructed by

γ

-cross entropy [18]. We showed that the proposed method has the redescending property, a desirable property in robust statistics literature. The asymptotic normality of the corresponding estimator together with a simple consistent estimator of the asymptotic covariance matrix are derived, which allows the construction of approximate confidence sets. Besides the theoretical results, we have constructed an efficient algorithm, in which we minimize a convex loss function at each iteration. The proposed algorithm monotonically decreases the objective function at each iteration.

Our simulation results showed that the proposed method performed better than the ordinary relative error estimation procedures in terms of prediction accuracy. Furthermore, the asymptotic distribution of the estimator yielded a good approximation, with an appropriate value of

γ

, even when outliers existed. The proposed method was applied to electricity consumption data, which included small positive values. Although the ordinary LPRE was sensitive to small positive values, our method was able to appropriately eliminate the negative effect of these values.

In practice, variable selection is one of the most important topics in regression analysis. The ordinary AIC (Akaike information criterion, Akaike [34]) cannot be directly applied to our proposed method because the AIC aims at minimizing the Kullback–Leibler divergence, whereas our method aims at minimizing the

γ

-divergence. As a future research topic, it would be interesting to derive the model selection criterion for evaluating a model estimated by the

γ

-likelihood method.

High-dimensional data analysis is also an important topic in statistics. In particular, the sparse estimation, such as the lasso [12], is a standard tool to deal with high-dimensional data. As shown in Remark 3, our method may be extended to

L_{1}

regularization. An important point in the regularization procedure is the selection of a regularization parameter. Hao et al. [14] suggested using the BIC (Bayesian information criterion)-type criterion of Wang et al. [35,36] for the ordinary LPRE estimator. It would also be interesting to consider the problem of regularization parameter selection in high-dimensional robust relative error estimation.

In regression analysis, we may formulate two types of

γ

-likelihood functions: Fujisawa and Eguchi’s formulation [18] and Kawashima and Fujisawa’s formulation [17]. [22] reported that the difference of performance occurs when the outlier ratio depends on the explanatory variable. In multiplicative regression model in Equation (1), the responses

y_{i}

highly depend on the exploratory variables

x_{i}

compared with the ordinary linear regression model because

y_{i}

is an exponential function of

x_{i j}

. As a result, the comparison of the above two formulations of the

γ

-likelihood functions would be important from both theoretical and practical points of view.

Supplementary Materials

The following are available online at https://www.mdpi.com/1099-4300/20/9/632/s1.

Author Contributions

K.H. proposed the algorithm, made the R package rree, conducted the simulation study, and analyzed the real data; H.M. derived the asymptotics; K.H. and H.M. wrote the paper.

Funding

This research was funded by the Japan Society for the Promotion of Science KAKENHI 15K15949, and the Center of Innovation Program (COI) from Japan Science and Technology Agency (JST), Japan (K.H.), and JST CREST Grant Number JPMJCR14D7 (H.M.)

Acknowledgments

The authors would like to thank anonymous reviewers for the constructive and helpful comments that improved the quality of the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Theorem 1

All of the asymptotics will be taken under

n \to \infty

. We write

a_{n} ≲ b_{n}

if there exists a universal constant

c > 0

such that

a_{n} \leq c b_{n}

for every n large enough. For any random functions

X_{n}

and

X_{0}

on

\bar{B}

, we denote

X_{n} (β) \overset{p}{⇉} X_{0} (β)

if

{sup}_{β \in \bar{B}} | X_{n} (β) - X_{0} (β) | \overset{p}{\to} 0

; below, we will simply write

{sup}_{β}

for

{sup}_{β \in \bar{B}}

.

First, we state a preliminary lemma, which will be repeatedly used in the sequel.

Lemma A1.

Let

η (x; β)

and

ζ (x, y; β)

be vector-valued measurable functions satisfying that

\begin{matrix} sup_{β} max_{k \in {0, 1}} |\partial_{β}^{k} η (x; β)| & \leq \bar{η} (x), \\ sup_{β} max_{k \in {0, 1}} |\partial_{β}^{k} ζ (x, y; β)| & \leq \bar{ζ} (x, y), \end{matrix}

for some

\bar{η}

and

\bar{ζ}

such that

\bar{η} + \int_{0}^{\infty} \bar{ζ} (\cdot, y) d y \in ⋂_{q > 0} L^{q} (π) .

Then,

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{n} η (x_{i}; β) & \overset{p}{⇉} \int_{X} η (x; β) π (d x), \end{matrix}

(A1)

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{n} ζ (x_{i}, y_{i}; β) & \overset{p}{⇉} \int_{0}^{\infty} \int_{X} ζ (x, y; β) f (y | x, β_{0}) π (d x) d y . \end{matrix}

(A2)

Proof.

Equation (A1) is a special case of Equation (A2), hence we only show the latter. Observe that

\begin{matrix} sup_{β} |\frac{1}{n} \sum_{i = 1}^{n} ζ (x_{i}, y_{i}; β) - \int \int ζ (x, y; β) f (y | x, β_{0}) π (d x) d y| \\ \leq \frac{1}{\sqrt{n}} sup_{β} |\sum_{i = 1}^{n} \frac{1}{\sqrt{n}} (ζ (x_{i}, y_{i}; β) - \int ζ (x_{i}, y; β) f (y | x_{i}, β_{0}) d y)| \\ + sup_{β} |\frac{1}{n} \sum_{i = 1}^{n} \int ζ (x_{i}, y; β) f (y | x_{i}, β_{0}) d y - \int \int ζ (x, y; β) f (y | x, β_{0}) π (d x) d y| \\ = : \frac{1}{\sqrt{n}} sup_{β} |M_{n} (β)| + sup_{β} |C_{n} (β)| . \end{matrix}

For the first term, let us recall the Sobolev inequality ([37], Section 10.2):

E (sup_{β} {|M_{n} (β)|}^{q}) ≲ sup_{β} E \{| M_{n} {(β) |}^{q}\} + sup_{β} E \{| \partial_{β} M_{n} {(β) |}^{q}\}

(A7)

for

q > p

. The summands of

M_{n} (β)

trivially form a martingale difference array with respect to the filtration

F_{i} : = σ (x_{j}; j \leq i)

,

i \in N

: since we are assuming that the conditional distribution of

y_{i}

given

{(x_{i}, x_{i - 1}, x_{i - 2}, \dots), (y_{i - 1}, y_{i - 2}, \dots)}

equals that given

x_{i}

(Section 2 and Section 3), each summand of

M_{n} (β)

equals

\frac{1}{\sqrt{n}} (ζ (x_{i}, y_{i}; β) - E {ζ (x_{i}, y_{i}; β) | F_{i - 1}})

. Hence, by means of the Burkholder’s inequality for martingales, we obtain, for

q > p \lor 2

,

\begin{matrix} sup_{β} E \{| M_{n} {(β) |}^{q}\} & ≲ sup_{β} \frac{1}{n} \sum_{i = 1}^{n} E \{{|(ζ (x_{i}, y_{i}; β) - \int ζ (x_{i}, y; β) f (y | x_{i}, β_{0}) d y)|}^{q}\} < \infty . \end{matrix}

We can take the same route for the summands of

\partial_{β} M_{n} (β)

to conclude that

{sup}_{β} E \{| \partial_{β} M_{n} {(β) |}^{q}\} < \infty

. These estimates combined with Equation (A3) then lead to the conclusion that

sup_{β} |M_{n} (β)| = O_{p} (1) .

As for the other term, we have

C_{n} (β) \overset{p}{\to} 0

for each

β

and also

sup_{n} E (sup_{β} |\partial_{β} C_{n} (β)|) < \infty .

The latter implies the tightness of the family

{C_{n} (β)}_{n}

of continuous random functions on the compact set

\bar{B}

, thereby entailing that

C_{n} (β) \overset{p}{⇉} 0

. The proof is complete. □

Appendix A.1. Consistency

Let

f_{i} (β) : = f (y_{i} | x_{i}; β)

for brevity and

\begin{matrix} A_{γ, n} (β) & : = \frac{1}{n} \sum_{i = 1}^{n} f_{i} {(β)}^{γ}, \\ {\bar{A}}_{γ, n} (β) & : = \frac{1}{n} \sum_{i = 1}^{n} \int f {(y | x_{i}; β)}^{γ + 1} d y = C (γ, h) \frac{1}{n} \sum_{i = 1}^{n} exp (- γ x_{i}^{T} β) . \end{matrix}

By means of Lemma A1, we have

\begin{matrix} A_{γ, n} (β) & \overset{p}{⇉} A_{γ} (β) : = \int \int f {(y | x; β)}^{γ} f (y | x, β_{0}) π (d x) d y, \\ {\bar{A}}_{γ, n} (β) & \overset{p}{⇉} {\bar{A}}_{γ} (β) : = \int \int f {(y | x; β)}^{γ + 1} π (d x) d y = C (γ, h) \int exp (- γ x^{T} β) π (d x) . \end{matrix}

Since

{inf}_{β} \{A_{γ} (β) \land {\bar{A}}_{γ} (β)\} > 0

, we see that taking the logarithm preserves the uniformity of the convergence in probability: for the

γ

-likelihood function (7), it holds that

ℓ_{γ, n} (β) \overset{p}{⇉} ℓ_{γ, 0} (β) : = - \frac{1}{γ} log \{A_{γ} (β)\} + \frac{1}{1 + γ} log \{{\bar{A}}_{γ} (β)\} .

(A13)

The limit equals the

γ

-cross entropy from

g (\cdot | \cdot) = f (\cdot | \cdot; β_{0})

to

f (\cdot | \cdot; β)

. We have

ℓ_{γ, 0} (β) \geq ℓ_{γ, 0} (β_{0})

, the equality holding if and only if

f (\cdot | \cdot; β_{0}) = f (\cdot | \cdot; β)

(see [17], Theorem 1). By Equation (4), the latter condition is equivalent to

ρ (e^{- x^{T} β} y) = ρ (e^{- x^{T} β_{0}} y)

, followed by

β = β_{0}

from Assumption 3. This, combined with Equation (A4) and the argmin theorem (cf. [25], Chapter 5), concludes the consistency

{\hat{β}}_{γ} \overset{p}{\to} β_{0}

. Note that we do not need Assumption 3 if

ℓ_{γ, n}

is a.s. convex, which generally may not be the case for

γ > 0

.

Appendix A.2. Asymptotic Normality

First, we note that Assumption 2 ensures that, for every

α > 0

, there corresponds a function

{\bar{F}}_{α} \in L^{1} (f (y | x, β_{0}) π (d x) d y)

such that

max_{k = 0, 1, 2, 3} sup_{β} |\partial_{β}^{k} \{f {(y | x, β)}^{α}\}| \leq {\bar{F}}_{α} (x, y) .

This estimate will enable us to interchange the order of

\partial_{β}

and the

d y

-Lebesgue integration, repeatedly used below without mention.

Let

s_{i} (β) = s (y_{i} | x_{i}; β)

, and

\begin{matrix} S_{γ, n} (β) & : = \frac{1}{n} \sum_{i = 1}^{n} f_{i} {(β)}^{γ} s_{i} (β), \\ {\bar{S}}_{γ, n} (β) & : = \frac{1}{n} \sum_{i = 1}^{n} \int f {(y | x_{i}; β)}^{γ + 1} s (y | x_{i}; β) d y . \end{matrix}

Then, the

γ

-likelihood equation

\partial_{β} ℓ_{γ, n} (β) = 0

is equivalent to

Ψ_{γ, n} (β) : = {\bar{A}}_{γ, n} (β) S_{γ, n} (β) - A_{γ, n} (β) {\bar{S}}_{γ, n} (β) = 0 .

By the consistency of

{\hat{β}}_{γ}

, we have

P ({\hat{β}}_{γ} \in B) \to 1

; hence

P {Ψ_{γ, n} ({\hat{β}}_{γ}) = 0} \to 1

as well, for

B

is open. Therefore, virtually defining

{\hat{β}}_{γ}

to be

β_{0} \in B

if

Ψ_{γ, n} ({\hat{β}}_{γ}) = 0

has no root, we may and do proceed as if

Ψ_{γ, n} ({\hat{β}}_{γ}) = 0

a.s. Because of the Taylor expansion

(- \int_{0}^{1} \partial_{β} Ψ_{γ, n} (β_{0} + s ({\hat{β}}_{n} - β_{0})) d s) \sqrt{n} ({\hat{β}}_{γ} - β_{0}) = \sqrt{n} Ψ_{γ, n} (β_{0})

to conclude Equation (16), it suffices to show that (recall the definitions (14) and (15))

\begin{matrix} \sqrt{n} Ψ_{γ, n} (β_{0}) & \overset{L}{\to} N_{p} (0, Δ_{γ}), \end{matrix}

(A5)

\begin{matrix} - \partial_{β} Ψ_{γ, n} ({\hat{β}}_{n}^{'}) & \overset{p}{\to} J_{γ} for every {\hat{β}}_{n}^{'} \overset{p}{\to} β_{0} . \end{matrix}

(A6)

First, we prove Equation (A5). By direct computations and Lemma A1, we see that

\begin{matrix} \sqrt{n} Ψ_{γ, n} & = {\bar{A}}_{γ, n} \sqrt{n} (S_{γ, n} - {\bar{S}}_{γ, n}) - \sqrt{n} (A_{γ, n} - {\bar{A}}_{γ, n}) {\bar{S}}_{γ, n} \\ = \sum_{i = 1}^{n} \frac{1}{\sqrt{n}} \{H_{γ}^{'} (f_{i}^{γ} s_{i} - \int f {(y | x_{i}; β_{0})}^{γ + 1} s (y | x_{i}; β_{0}) d y) - (f_{i}^{γ} - \int f {(y | x_{i}; β_{0})}^{γ + 1} d y) H_{γ}^{″}\} \\ = : \sum_{i = 1}^{n} χ_{γ, i} . \end{matrix}

The sequence

{(χ_{γ, i})}_{i \leq n}

is an

(F_{j})

-martingale-difference array. It is easy to verify the Lapunov condition:

\exists α > 0, sup_{n} sup_{i \leq n} E ({|χ_{γ, i}|}^{2 + α}) < \infty .

Hence, the martingale central limit theorem concludes Equation (A5) if we show the following convergence of the quadratic characteristic:

\frac{1}{n} \sum_{i = 1}^{n} E (χ_{γ, i}^{\otimes 2} | F_{i - 1}) \overset{p}{\to} Δ_{γ} .

This follows upon observation that

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{n} E (χ_{γ, i}^{\otimes 2} | F_{i - 1}) \\ = {(H_{γ}^{'})}^{2} \frac{1}{n} \sum_{i = 1}^{n} var (f_{j}^{γ} s_{j} | F_{i - 1}) + {(H_{γ}^{''})}^{\otimes 2} \frac{1}{n} \sum_{i = 1}^{n} var (f_{j}^{γ} | F_{i - 1}) - 2 H_{γ}^{'} \frac{1}{n} \sum_{i = 1}^{n} cov (f_{j}^{γ} s_{j}, f_{j}^{γ} | F_{i - 1}) H_{γ}^{''} \\ = {(H_{γ}^{'})}^{2} \{\int \int f {(y | x; β_{0})}^{2 γ + 1} s {(y | x; β_{0})}^{\otimes 2} d y π (d x) - {(H_{γ}^{″})}^{\otimes 2}\} \\ + {(H_{γ}^{''})}^{\otimes 2} \{\int \int f {(y | x; β_{0})}^{2 γ + 1} d y π (d x) - {(H_{γ}^{'})}^{2}\} \\ - 2 H_{γ}^{'} \{\int \int f {(y | x; β_{0})}^{2 γ + 1} s (y | x; β_{0}) d y π (d x) - H_{γ}^{'} H_{γ}^{″}\} {(H_{γ}^{″})}^{T} + o_{p} (1) \\ = {(H_{γ}^{'})}^{2} \int \int f {(y | x; β_{0})}^{2 γ + 1} s {(y | x; β_{0})}^{\otimes 2} d y π (d x) \\ + {(H_{γ}^{″})}^{\otimes 2} \int \int f {(y | x; β_{0})}^{2 γ + 1} d y π (d x) - 2 H_{γ}^{'} \int \int f {(y | x; β_{0})}^{2 γ + 1} s (y | x; β_{0}) d y π (d x) {(H_{γ}^{''})}^{T} + o_{p} (1) \\ = Δ_{γ} + o_{p} (1) \end{matrix}

invoke the expression (4) for the last equality.

Next, we show Equation (A6). Under the present regularity condition, we can deduce that

sup_{β} | \partial_{β}^{2} Ψ_{γ, n} (β) | = O_{p} (1) .

It therefore suffices to verify that

- \partial_{β} Ψ_{γ, n} (β_{0}) \overset{p}{\to} J_{γ} (β_{0}) = J_{γ}

. This follows from a direct computation of

- \partial_{β} Ψ_{γ, n} (β_{0})

, combined with the applications of Lemma A1 (note that

{\bar{A}}_{γ, n}

and

A_{γ, n}

have the same limit in probability):

\begin{matrix} - \partial_{β} Ψ_{γ, n} (β_{0}) & = A_{γ, n} (\frac{1}{n} \sum_{i = 1}^{n} \int f {(y | x_{i}; β_{0})}^{γ + 1} s {(y | x_{i}; β_{0})}^{\otimes 2} d y) - {\bar{S}}_{γ, n} S_{γ, n}^{T} \\ + γ (S_{γ, n} {\bar{S}}_{γ, n}^{T} - {\bar{S}}_{γ, n} S_{γ, n}^{T}) \\ + γ \{A_{γ, n} (\frac{1}{n} \sum_{i = 1}^{n} \int f {(y | x_{i}; β_{0})}^{γ + 1} s {(y | x_{i}; β_{0})}^{\otimes 2} d y) - {\bar{A}}_{γ, n} (\frac{1}{n} \sum_{i = 1}^{n} f_{i}^{γ} s_{i}^{\otimes 2})\} \\ + \{A_{γ, n} (\frac{1}{n} \sum_{i = 1}^{n} \int f {(y | x_{i}; β_{0})}^{γ + 1} \partial_{β} s (y | x_{i}; β_{0}) d y) - {\bar{A}}_{γ, n} (\frac{1}{n} \sum_{i = 1}^{n} f_{i}^{γ} \partial_{β} s_{i})\} \\ = A_{γ, n} (\frac{1}{n} \sum_{i = 1}^{n} \int f {(y | x_{i}; β_{0})}^{γ + 1} s {(y | x_{i}; β_{0})}^{\otimes 2} d y) - {\bar{S}}_{γ, n} S_{γ, n}^{T} + o_{p} (1) \\ = H_{γ}^{'} \int \int f {(y | x; β_{0})}^{γ + 1} s {(y | x; β_{0})}^{\otimes 2} d y π (d x) - {(H_{γ}^{″})}^{\otimes 2} + o_{p} (1) \\ = J_{γ} + o_{p} (1) . \end{matrix}

Appendix A.3. Consistent Estimator of the Asymptotic Covariance Matrix

Thanks to the stability assumptions on the sequence

x_{1}, x_{2}, \dots

, we have

\frac{1}{n} \sum_{i = 1}^{n} x_{i}^{\otimes k} exp (- γ x_{i}^{T} β_{0}) \overset{p}{\to} Π_{k} (γ), k = 0, 1, 2 .

Moreover, for

δ, δ^{'} > 0

given in Assumption 1, we have

\begin{matrix} |\frac{1}{n} \sum_{i = 1}^{n} x_{i}^{\otimes k} exp (- γ x_{i}^{T} β_{0}) - \frac{1}{n} \sum_{i = 1}^{n} x_{i}^{\otimes k} exp (- γ x_{i}^{T} {\hat{β}}_{γ})| \\ ≲ (\frac{1}{n} \sum_{i = 1}^{n} {| x_{i} |}^{k + 1} exp (δ^{'} {| x_{i} |}^{1 + δ})) |{\hat{β}}_{γ} - β_{0}| = O_{p} (1) |{\hat{β}}_{γ} - β_{0}| \overset{p}{\to} 0 . \end{matrix}

These observations are enough to conclude Equation (17).

References

Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning, 2nd ed.; Springer Series in Statistics; Springer: New York, NY, USA, 2009. [Google Scholar]
Park, H.; Stefanski, L.A. Relative-error prediction. Stat. Probab. Lett. 1998, 40, 227–236. [Google Scholar] [CrossRef]
Ye, J.; Price Models and the Value Relevance of Accounting Information. SSRN Electronic Journal 2007. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1003067 (accessed on 20 August 2018).
Van der Meer, D.W.; Widén, J.; Munkhammar, J. Review on probabilistic forecasting of photovoltaic power production and electricity consumption. Renew. Sust. Energ. Rev. 2018, 81, 1484–1512. [Google Scholar] [CrossRef]
Mount, J. Relative error distributions, without the heavy tail theatrics. 2016. Available online: http://www.win-vector.com/blog/2016/09/relative-error-distributions-without-the-heavy-tail-theatrics/ (accessed on 20 August 2018).
Chen, K.; Guo, S.; Lin, Y.; Ying, Z. Least Absolute Relative Error Estimation. J. Am. Stat. Assoc. 2010, 105, 1104–1112. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Li, Z.; Lin, Y.; Zhou, G.; Zhou, W. Empirical likelihood for least absolute relative error regression. TEST 2013, 23, 86–99. [Google Scholar] [CrossRef]
Chen, K.; Lin, Y.; Wang, Z.; Ying, Z. Least product relative error estimation. J. Multivariate Anal. 2016, 144, 91–98. [Google Scholar] [CrossRef] [Green Version]
Ding, H.; Wang, Z.; Wu, Y. A relative error-based estimation with an increasing number of parameters. Commun. Stat. Theory Methods 2017, 47, 196–209. [Google Scholar] [CrossRef]
Demongeot, J.; Hamie, A.; Laksaci, A.; Rachdi, M. Relative-error prediction in nonparametric functional statistics: Theory and practice. J. Multivariate Anal. 2016, 146, 261–268. [Google Scholar] [CrossRef]
Wang, Z.; Chen, Z.; Chen, Z. H-relative error estimation for multiplicative regression model with random effect. Comput. Stat. 2018, 33, 623–638. [Google Scholar] [CrossRef]
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Methodol. 1996, 58, 267–288. [Google Scholar]
Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Series B Stat. Methodol. 2006, 68, 49–67. [Google Scholar] [CrossRef] [Green Version]
Hao, M.; Lin, Y.; Zhao, X. A relative error-based approach for variable selection. Comput. Stat. Data Anal. 2016, 103, 250–262. [Google Scholar] [CrossRef]
Liu, X.; Lin, Y.; Wang, Z. Group variable selection for relative error regression. J. Stat. Plan. Inference 2016, 175, 40–50. [Google Scholar] [CrossRef]
Xia, X.; Liu, Z.; Yang, H. Regularized estimation for the least absolute relative error models with a diverging number of covariates. Comput. Stat. Data Anal. 2016, 96, 104–119. [Google Scholar] [CrossRef]
Kawashima, T.; Fujisawa, H. Robust and Sparse Regression via γ-Divergence. Entropy 2017, 19, 608. [Google Scholar] [CrossRef]
Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivariate Anal. 2008, 99, 2053–2081. [Google Scholar] [CrossRef]
Maronna, R.; Martin, D.; Yohai, V. Robust Statistics; John Wiley & Sons: Chichester, UK, 2006. [Google Scholar]
Koudou, A.E.; Ley, C. Characterizations of GIG laws: A survey. Probab. Surv. 2014, 11, 161–176. [Google Scholar] [CrossRef]
Jones, M.C.; Hjort, N.L.; Harris, I.R.; Basu, A. A comparison of related density-based minimum divergence estimators. Biometrika 2001, 88, 865–873. [Google Scholar] [CrossRef]
Kawashima, T.; Fujisawa, H. On Difference between Two Types of γ-divergence for Regression. 2018. Available online: https://arxiv.org/abs/1805.06144 (accessed on 20 August 2018).
Ferrari, D.; Yang, Y. Maximum Lq-likelihood estimation. Ann. Stat. 2010, 38, 753–783. [Google Scholar] [CrossRef]
Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M.C. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85, 549–559. [Google Scholar] [CrossRef]
Van der Vaart, A.W. Asymptotic Statistics; Vol. 3, Cambridge Series in Statistical and Probabilistic Mathematics; Cambridge University Press: Cambridge, UK, 1998. [Google Scholar]
Eguchi, S.; Kano, Y. Robustifing maximum likelihood estimation by psi-divergence. ISM Research Memorandam. 2001. Available online: https://www.researchgate.net/profile/Shinto Eguchi/publication/228561230Robustifing maximum likelihood estimation by psi-divergence/links545d65910cf2c1a63bfa63e6.pdf (accessed on 20 August 2018).
Hunter, D.R.; Lange, K. A tutorial on MM algorithms. Am. Stat. 2004, 58, 30–37. [Google Scholar] [CrossRef]
Böhning, D. Multinomial logistic regression algorithm. Ann. Inst. Stat. Math. 1992, 44, 197–200. [Google Scholar] [CrossRef]
Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Series B Stat. Methodol. 2005, 67, 301–320. [Google Scholar] [CrossRef]
Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
Friedman, J.; Hastie, T.; Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 2010, 33, 1. [Google Scholar] [CrossRef] [PubMed]
Dheeru, D.; Karra Taniskidou, E. UCI Machine Learning Repository. 2017. Available online: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014 (accessed on 20 August 2018).
Sioshansi, F.P.; Pfaffenberger, W. Electricity Market Reform: An International Perspective; Elsevier: Oxford, UK, 2006. [Google Scholar]
Akaike, H. A new look at the statistical model identification. IEEE Trans. Automat. Contr. 1974, 19, 716–723. [Google Scholar] [CrossRef]
Wang, H.; Li, R.; Tsai, C.L. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 2007, 94, 553–568. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Wang, H.; Li, B.; Leng, C. Shrinkage tuning parameter selection with a diverging number of parameters. J. R. Stat. Soc. Series B Stat. Methodol. 2009, 71, 671–683. [Google Scholar] [CrossRef] [Green Version]
Friedman, A. Stochastic Differential Equations and Applications; Dover Publications: New York, NY, USA, 2006. [Google Scholar]

Figure 1. Median and error bar of relative prediction error (RPE) in Equation (31) and mean squared error (MSE) of

β

in Equation (32) when parameters of the log-normal distribution (distribution of outliers) are

(μ, σ) = (\pm 5, 1)

. The error bars are delineated by 25th and 75th percentiles.

Figure 1. Median and error bar of relative prediction error (RPE) in Equation (31) and mean squared error (MSE) of

β

in Equation (32) when parameters of the log-normal distribution (distribution of outliers) are

(μ, σ) = (\pm 5, 1)

. The error bars are delineated by 25th and 75th percentiles.

Figure 2. Histograms of T = 100,00 samples of

z_{2}

along with the density function of standard normal distribution for

μ = 5

in Model 1.

Figure 2. Histograms of T = 100,00 samples of

z_{2}

along with the density function of standard normal distribution for

μ = 5

in Model 1.

Figure 3. Electricity consumption from January 2012 to December 2014 for one of the 370 households.

Figure 4. Relative prediction error for various values of

γ

for household electricity consumption data.

Figure 4. Relative prediction error for various values of

γ

for household electricity consumption data.

Figure 5. Prediction value based on least product relative error (LPRE) loss for household electricity consumption data.

Figure 6. Prediction value based on the proposed method with

γ = 0.03

for household electricity consumption data.

Figure 6. Prediction value based on the proposed method with

γ = 0.03

for household electricity consumption data.

Table 1. Several examples of general relative error (GRE) criteria and their properties. “Likelihood” in the second column means the existence of a likelihood function that corresponds to the loss function. The properties of “Convexity” and “Smoothness” in the last two columns respectively indicate those with respect to

β

of the corresponding loss function.

Table 1. Several examples of general relative error (GRE) criteria and their properties. “Likelihood” in the second column means the existence of a likelihood function that corresponds to the loss function. The properties of “Convexity” and “Smoothness” in the last two columns respectively indicate those with respect to

β

of the corresponding loss function.

$g (a, b)$	Likelihood	Convexity	Smoothness
$a^{2}$			√
$a + b$	√	√
$max {a, b}$	√
$a b$	√	√	√
$a^{2} + b^{2}$	√	√	√

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Hirose, K.; Masuda, H. Robust Relative Error Estimation. Entropy 2018, 20, 632. https://doi.org/10.3390/e20090632

AMA Style

Hirose K, Masuda H. Robust Relative Error Estimation. Entropy. 2018; 20(9):632. https://doi.org/10.3390/e20090632

Chicago/Turabian Style

Hirose, Kei, and Hiroki Masuda. 2018. "Robust Relative Error Estimation" Entropy 20, no. 9: 632. https://doi.org/10.3390/e20090632

APA Style

Hirose, K., & Masuda, H. (2018). Robust Relative Error Estimation. Entropy, 20(9), 632. https://doi.org/10.3390/e20090632

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Robust Relative Error Estimation

Abstract

1. Introduction

2. Relative Error Estimation

3. Robust Estimation via $γ$ -Likelihood

4. Theoretical Properties

4.1. Technical Assumptions

4.2. Redescending Property

4.3. Asymptotic Distribution

5. Algorithm

5.1. Majorization Function for $ℓ_{1} (β)$

5.2. Majorization Function for $ℓ_{2} (β)$

5.3. MM Algorithm for Robust Relative Error Estimation

6. Monte Carlo Simulation

6.1. Setting

6.2. Investigation of Relative Prediction Error and Mean Squared Error of the Estimator

6.3. Investigation of Asymptotic Distribution

7. Real Data Analysis

8. Discussion

Supplementary Materials

Author Contributions

Funding

Acknowledgments

Conflicts of Interest

Appendix A. Proof of Theorem 1

Appendix A.1. Consistency

Appendix A.2. Asymptotic Normality

Appendix A.3. Consistent Estimator of the Asymptotic Covariance Matrix

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Robust Relative Error Estimation

Abstract

1. Introduction

2. Relative Error Estimation

3. Robust Estimation via γ -Likelihood

4. Theoretical Properties

4.1. Technical Assumptions

4.2. Redescending Property

4.3. Asymptotic Distribution

5. Algorithm

5.1. Majorization Function for ℓ 1 ( β )

5.2. Majorization Function for ℓ 2 ( β )

5.3. MM Algorithm for Robust Relative Error Estimation

6. Monte Carlo Simulation

6.1. Setting

6.2. Investigation of Relative Prediction Error and Mean Squared Error of the Estimator

6.3. Investigation of Asymptotic Distribution

7. Real Data Analysis

8. Discussion

Supplementary Materials

Author Contributions

Funding

Acknowledgments

Conflicts of Interest

Appendix A. Proof of Theorem 1

Appendix A.1. Consistency

Appendix A.2. Asymptotic Normality

Appendix A.3. Consistent Estimator of the Asymptotic Covariance Matrix

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3. Robust Estimation via $γ$ -Likelihood

5.1. Majorization Function for $ℓ_{1} (β)$

5.2. Majorization Function for $ℓ_{2} (β)$