1. Introduction
In this study, we focused on estimating the mean $m$ of a real random variable $X$, supposing that $X_1, \dots, X_n$ are independent and identically distributed samples drawn from $X$. It is well known that the empirical mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the most popular estimator of $m$, and its theoretical properties have been thoroughly studied [1].
However, recent works have concentrated on the performance of estimators when the distribution is heavy-tailed (the second or fourth moment of the distribution does not exist), a setting that is becoming more and more common in many research fields (see, e.g., Embrechts, Klüppelberg, and Mikosch [2]). When the data have a heavy tail, traditional methods such as the empirical mean perform poorly, and appropriate robust estimators are required, which has driven research on M-estimators (generalizations of the maximum likelihood estimator) for the correction of outliers (Huber [3]).
There has been renewed interest in the area of robust statistics over the last several decades. Nemirovsky and Yudin [4], Hsu and Sabato [5], and Jerrum et al. [6] proposed various forms of the median-of-means (MOM) estimator to handle data in different situations. These estimators divide the data into several groups of equal size, calculate the empirical mean within each group, and finally take the median of these group means as the MOM estimate, which reduces the impact of heavy-tailed data. Tukey and McLaughlin [7] and Huber and Ronchetti [8] tried to improve the performance of the empirical mean by truncating $X$ (they call it the truncated mean): the part of the sample containing the maximum and minimum values is removed, depending on a trimming parameter, and the remaining values are averaged to improve robustness. Catoni [9] and Audibert and Catoni [10] studied the properties of M-estimation for regression problems. Related works on robust techniques in various fields are summarized in Bartlett and Mendelson [11], Maronna [12], and Bubeck and Lugosi [13].
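The grouping scheme described above can be sketched in a few lines. This is a minimal illustration of the median-of-means idea, not the exact procedures of [4–6]:

```python
import statistics

def median_of_means(sample, k):
    """Median-of-means: split the sample into k blocks of equal size,
    average within each block, and return the median of the block means."""
    n = (len(sample) // k) * k   # drop the remainder so all blocks are equal
    block = n // k
    means = [sum(sample[i:i + block]) / block for i in range(0, n, block)]
    return statistics.median(means)

# A single gross outlier corrupts at most one block, so it moves the
# median of the block means very little.
data = [1.0] * 99 + [1000.0]
robust = median_of_means(data, 5)  # every block mean is 1.0 except one
```

Because the outlier is confined to one of the five blocks, the median of the block means stays at the uncontaminated value, whereas the empirical mean of `data` is dragged to roughly 11.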
Recently, Catoni [14] modified the empirical mean into a new robust estimator. It is easy to observe that the empirical mean is the solution $\hat{m}$ of the following estimation equation:
$$\sum_{i=1}^{n} (X_i - \hat{m}) = 0. \quad (1)$$
If we change the form of Equation (1) to
$$\sum_{i=1}^{n} \psi\big(\alpha (X_i - \hat{m}_\alpha)\big) = 0, \quad (2)$$
the solution of (2) is called Catoni's mean estimator, where $\psi$ is a non-decreasing differentiable truncation function such that for any $x \in \mathbb{R}$,
$$-\log\Big(1 - x + \frac{x^2}{2}\Big) \le \psi(x) \le \log\Big(1 + x + \frac{x^2}{2}\Big),$$
and $\alpha > 0$ is a parameter that ensures the existence of the estimator. We denote Catoni's mean estimator by $\hat{m}_\alpha$. The main purpose of the truncation function is to make $\psi(x)$ grow more slowly than $x$, so that the effect of outliers due to heavy tails in $X$ is diminished. Although $\psi$ is not the derivative of an explicit error function, it can still be considered an influence function in robust theory.
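Since the left-hand side of (2) is non-increasing in the estimate, Catoni's estimator can be computed by simple root-finding. The sketch below uses the narrowest admissible $\psi$ and a hypothetical choice of $\alpha$; it is an illustration, not the tuned procedure of [14]:

```python
import math
import random

def psi(x):
    # Narrowest admissible influence function: grows only logarithmically
    # in the tails, so a single outlier has bounded impact.
    if x >= 0:
        return math.log(1.0 + x + x * x / 2.0)
    return -math.log(1.0 - x + x * x / 2.0)

def catoni_mean(sample, alpha, tol=1e-10):
    """Solve sum_i psi(alpha * (x_i - theta)) = 0 for theta by bisection;
    the left-hand side is strictly decreasing in theta."""
    def score(theta):
        return sum(psi(alpha * (x - theta)) for x in sample)
    lo, hi = min(sample), max(sample)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(2000)]
data.append(500.0)  # one gross outlier
emp = sum(data) / len(data)          # empirical mean, dragged by the outlier
cat = catoni_mean(data, alpha=0.05)  # Catoni's estimate stays near 0
```

The outlier shifts the empirical mean by about $500/n \approx 0.25$, while its contribution to the estimating equation is truncated to $\psi(\alpha \cdot 500) \approx \log(\alpha^2 \cdot 500^2/2)$, so the Catoni estimate moves only slightly.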
Under the mild assumption that the variance $v$ of the distribution exists, and choosing the parameter $\alpha$ to optimize the bounds, Catoni [14] obtained the following performance guarantee for $\hat{m}_\alpha$.
Theorem 1. Let $X_1, \dots, X_n$ be independent, identically distributed random variables drawn from $X$. We assume that the mean $m$ and variance $v$ of $X$ exist. For any $\varepsilon \in (0, 1/2)$ and positive integer $n$ such that $n > 2\log(1/\varepsilon)$, Catoni's mean estimator with parameter
$$\alpha = \sqrt{\frac{2\log(1/\varepsilon)}{n\left(v + \frac{2v\log(1/\varepsilon)}{n - 2\log(1/\varepsilon)}\right)}}$$
satisfies, with probability at least $1 - 2\varepsilon$,
$$|\hat{m}_\alpha - m| \le \sqrt{\frac{2v\log(1/\varepsilon)}{n - 2\log(1/\varepsilon)}}. \quad (3)$$
Moreover, if we choose $\alpha = \sqrt{2\log(1/\varepsilon)/(nv)}$, which does not depend on the optimization above, and assume $n \ge 4\log(1/\varepsilon)$, then
$$|\hat{m}_\alpha - m| \le 2\sqrt{\frac{v\log(1/\varepsilon)}{n}}. \quad (4)$$
The method of Catoni [14] has been widely promoted as a robust estimator by Brownlees, Joly, and Lugosi [15], Minsker [16], and Wang et al. [17]. We point out here that the parameter $\alpha$ in (3) solves the equation obtained by setting the derivative of the deviation bound of Catoni's estimator with respect to $\alpha$ equal to 0. When $v = 0$, the deviation of Catoni's estimator is 0, and no specific $\alpha$ is needed. This also holds for Theorem 2.
The main contribution of this article is to improve Catoni's estimator under a third-moment condition; we name the result the third-moment Catoni estimator. Starting from an adjustment of the truncation function, denoted by $\widetilde{\psi}$ in our work, as Figure 1 shows, the influence function with the third moment stays closer to the true value than Catoni's original one. We obtain a more precise exponential moment upper bound, which leads to a better error bound.
At the same time, our estimator performs better for samples drawn from the t-distribution, which is common in many fields of research (see Jones and Faddy [18]). As a special case of a heavy-tailed distribution, the t-distribution with more than three degrees of freedom has a finite third moment, which satisfies our assumptions about the distribution. We demonstrate the superiority of our estimator in a Monte Carlo simulation. We also examine the performance of the proposed estimator under a skewed normal distribution to evaluate its adaptability to other distributions.
The rest of the article is organized as follows. In Section 2, we introduce the main result on the third-moment Catoni estimator. A Monte Carlo simulation is provided in Section 3 to compare the performance of the proposed estimator with Catoni's estimator for the t-distribution. Section 4 examines the performance of the proposed estimator on real data.
2. Main Result
Let $X_1, \dots, X_n$ denote an i.i.d. sample drawn from the distribution of $X$. Let $m$, $v$, and $s$ be the mean, variance, and third central moment of $X$, respectively; that is, $v = \mathbb{E}[(X - m)^2]$ and $s = \mathbb{E}[(X - m)^3]$.
The influence function $\widetilde{\psi}$ here should be chosen wider than Catoni's original function in order to obtain a more accurate exponential moment. In this study, we assumed that
$$\widetilde{\psi}(x) \le \log\Big(1 + x + \frac{x^2}{2} + \frac{x^3}{6}\Big) \ \text{for } x \ge 0, \qquad \widetilde{\psi}(x) \ge -\log\Big(1 - x + \frac{x^2}{2} - \frac{x^3}{6}\Big) \ \text{for } x \le 0. \quad (5)$$
Our mean estimator $\widetilde{m}_\alpha$ is the unique solution of $\widetilde{r}_\alpha(\theta) = 0$, where
$$\widetilde{r}_\alpha(\theta) = \frac{1}{n\alpha} \sum_{i=1}^{n} \widetilde{\psi}\big(\alpha (X_i - \theta)\big).$$
Next, we present our main result, which bounds the deviation $|\widetilde{m}_\alpha - m|$ with an appropriate choice of the negative parameter $\alpha$:
Theorem 2. Let $X_1, \dots, X_n$ be independent, identically distributed random variables with finite mean $m$, variance $v$, and third central moment $s$. For any $\delta \in (0, 1)$, the error between the estimator $\widetilde{m}_\alpha$ and the mean $m$ satisfies, with probability at least $1 - 2\delta$, the bound (7), where $p$, $q$, and the discriminant $\Delta$ are those of the cubic equation derived in the proof.

Under some technical assumptions that are stated in the following corollary, we have the following upper bound on the probability of the exponential tail:
Corollary 1. Let $X_1, \dots, X_n$ be independent, identically distributed random variables with finite mean $m$, variance $v$, and third central moment $s$. Then, for any $\delta \in (0, 1)$ and under the stated assumptions on $n$ and $s$, the exponential tail bound (8) holds.

Remark 1. With the assumption that $n$ is a positive integer satisfying the stated conditions, and under the stated assumption on $s$, we obtain a better estimator bias than the bound (4) in Catoni's result.

Remark 2. When the sample is small, our result remains valid provided $s$ is small. Consider the following example: let $X_1, \dots, X_n$ be independent, identically distributed random variables drawn from $X$; for suitable values of the mean, variance, and third central moment satisfying our assumptions, the stated bound still holds even for moderate $n$.

For the convenience of the proof, we first present the following lemma (the Cardano formula); refer to Høyrup [19] for more details.
Lemma 1. For any general cubic equation of the form $y^3 + py + q = 0$, one of the roots over the field of real numbers has the form
$$y = \sqrt[3]{-\frac{q}{2} + \sqrt{\Delta}} + \sqrt[3]{-\frac{q}{2} - \sqrt{\Delta}},$$
where the discriminant of the root is $\Delta = \frac{q^2}{4} + \frac{p^3}{27}$. When $\Delta > 0$, the cubic equation has one real root; when $\Delta < 0$, the cubic equation has three real roots.

Proof of Theorem 2. Due to inequality (5) on $\widetilde{\psi}$, we have an exponential moment inequality, denoted (10), for $\widetilde{r}_\alpha(\theta)$ and all $\theta \in \mathbb{R}$. With a brief calculation, we have $\mathbb{E}[(X - \theta)^2] = v + (m - \theta)^2$ and $\mathbb{E}[(X - \theta)^3] = s + 3v(m - \theta) + (m - \theta)^3$; so, inequality (10) can be bounded by the following term:
This bound is finite whenever $X$ has a finite third moment $s$. From the Markov inequality, we can obtain that for any $\delta \in (0, 1)$ and $\theta \in \mathbb{R}$, the deviation inequality (12) holds. We can therefore control the estimator $\widetilde{m}_\alpha$ by the roots of the following cubic equation:
Equation (13) above can be regarded as a cubic equation in $\theta$. To solve (13), we first convert it into a standard-form cubic equation in one variable by a linear change of variable, obtaining equations of the form $y^3 + py + q = 0$. For any $\delta \in (0, 1)$, according to Lemma 1, since $p$ is always positive, the discriminant $\Delta = \frac{q^2}{4} + \frac{p^3}{27}$ is always greater than 0. In this case, our equation has one real root and two complex roots, which means we can control $\widetilde{m}_\alpha$ by the real root of (13) as follows:
where $\Delta$, $p$, and $q$ are the same as above. Since $\widetilde{r}_\alpha$ is a non-increasing function, the formula above implies that the upper bound on $\widetilde{m}_\alpha$ holds with probability at least $1 - \delta$; similarly, the corresponding lower bound holds with probability at least $1 - \delta$. Then, by choosing the parameter $\alpha$ to optimize the bound, we can derive the performance of the estimator $\widetilde{m}_\alpha$ for the bias of the mean $m$; that is, with probability at least $1 - 2\delta$, we have (15).
The proof of Theorem 2 is completed. □
Proof of Corollary 1. In fact, the right-hand side of (7) can be bounded as in (16) without restricting the sign of $s$. Under the stated assumption on $n$, which is weaker than Catoni's, (16) can be bounded by (17). Moreover, under the additional assumption on $s$, (17) admits a further upper bound; then, (8) holds. □
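As a numerical sanity check on Lemma 1, the real root given by the Cardano formula in the $\Delta > 0$ case can be evaluated directly. This is a minimal sketch, not part of the original proofs:

```python
import math

def cardano_real_root(p, q):
    """The real root of y^3 + p*y + q = 0 given by the Cardano formula,
    in the Delta > 0 case (exactly one real root), which is the case
    used in the proof of Theorem 2."""
    delta = (q / 2.0) ** 2 + (p / 3.0) ** 3
    assert delta > 0, "this formula assumes one real root"

    def cbrt(t):
        # Real cube root, valid for negative arguments as well.
        return math.copysign(abs(t) ** (1.0 / 3.0), t)

    return cbrt(-q / 2.0 + math.sqrt(delta)) + cbrt(-q / 2.0 - math.sqrt(delta))

root = cardano_real_root(p=1.0, q=-2.0)  # y^3 + y - 2 = 0 has the root y = 1
```

Note that $p > 0$ forces $\Delta = \frac{q^2}{4} + \frac{p^3}{27} > 0$, which is exactly the situation exploited in the proof of Theorem 2.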
3. Simulation
In this section, we assess the performance of the estimator with respect to the t-distribution through a Monte Carlo simulation, focusing on its performance in regression. Our data were simulated from a linear model with errors generated from a t-distribution and regressed using the proposed estimator; we measured the loss of the regression by minimization of the $L_2$ norm.
The details of the simulation are as follows: we considered $n$ independent, identically distributed pairs of real random variables $(X_i, Y_i)$, $i = 1, \dots, n$, where the $X_i$ take their values in $\mathbb{R}^3$ while the $Y_i$ take values in $\mathbb{R}$, and the explanatory variables $X_i$ are drawn from a multivariate normal distribution with mean 0 and covariance equal to the three-dimensional identity matrix. The response variable $Y$ is generated as follows:
$$Y = X^{\top} \theta^{*} + \epsilon, \quad (18)$$
where the parameter vector $\theta^{*} \in \mathbb{R}^3$ is fixed in advance, and $\epsilon$ is an error term with zero mean and unit variance drawn from a Student t-distribution. Our main goal was to estimate the parameter $\theta^{*}$ by minimizing the $L_2$ risk $\mathbb{E}(Y - X^{\top}\theta)^2$, and we then defined the least-squares estimator, the classical Catoni mean estimator, and the third-moment Catoni estimator, each as the root of the corresponding estimating equation. Here, $\alpha$ is the widest choice defined in Catoni's result, with the tuning parameter set as in Brownlees's work, and $\widetilde{\psi}$ was set as above. The measures for the performance of the estimators are as follows:
The simulation experiments were repeated with sample sizes ranging from 50 to 1000 and with degrees of freedom of the t-distribution ranging from 1 to 7. Each sample-size experiment was replicated 1000 times, and for each replication, we evaluated the performance of the regression on an independent i.i.d. evaluation sample. We used the following quantity, called the excess risk, to evaluate the performance of the regression:
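For concreteness, the excess risk of a slope estimate has a closed form in a simplified one-dimensional version of model (18). The slope value, sample size, and degrees of freedom below are hypothetical, and only the ordinary least-squares fit is shown:

```python
import math
import random

random.seed(1)

def student_t(df):
    # Draw from Student's t: a standard normal over sqrt(chi-square / df).
    z = random.gauss(0.0, 1.0)
    chi2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

# Simplified 1-D analogue of model (18): Y = theta* X + eps, eps ~ t(3).
theta_star = 2.0
n = 500
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [theta_star * x + student_t(3) for x in xs]

# Ordinary least-squares slope (no intercept) as the baseline estimator.
theta_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Excess L2 risk of an estimate theta relative to theta*: for this model,
# E(Y - theta X)^2 - E(Y - theta* X)^2 = (theta - theta*)^2 * E[X^2],
# and E[X^2] = 1 here.
excess = (theta_ls - theta_star) ** 2
```

The same excess-risk formula is what the tables below report, with the least-squares slope replaced by the robust estimates.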
Figure 3 displays the excess risk of the three estimators, for a fixed sample size, as the degrees of freedom of the t-distribution range from 1 to 7; we can see that the proposed estimator performs better than the other estimators, indicating more stability against outliers.
The results of the Monte Carlo simulation, including the performance of the estimators for different $n$, are presented in Table 1. We also compare the performance of the proposed estimator with the other estimators under various risk measures in Table 2, for a fixed sample size and fixed degrees of freedom. In the tables, $LS$ represents the general least-squares regression; $C$ and $C_3$ denote the original Catoni estimator and our third-moment Catoni estimator, respectively; and ER, RB, and SMSE represent the excess risk, the relative bias, and the square root of the mean squared error, respectively.
We can see from the tables that when the distribution has a heavy tail, our estimator performs better in most cases than the other two estimators, and the excess risk of the estimator decreases as the sample size increases. At the same time, as the degrees of freedom of the t-distribution rise, the tail of the t-distribution becomes thinner and closer to that of the normal distribution, and the excess risk of all procedures improves significantly; the proposed estimator also performs well under the other risk measures.
We also examined the performance of the third-moment Catoni estimator under a skewed normal distribution in Table 3; the model still follows (18), where the error term follows a skewed normal distribution with varying shape parameter and the other settings unchanged. We can conclude from the table that the bias of the improved estimator is still smaller than that of the original one. However, the deviation of the estimator did not change significantly as the shape parameter varied. We suppose that this results from the tail behavior of the skew normal distribution: its fourth moment exists, which conflicts with the usual assumption that the fourth moment of a heavy-tailed distribution does not exist. At the same time, neither Catoni's estimator nor our estimator performed better than the estimator obtained by $L_1$ regression.
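The skewed normal errors used in Table 3 can be simulated via the standard stochastic representation of the skew normal distribution. The shape value below is hypothetical, and the draw is centered so that the error term has mean zero, as the model requires:

```python
import math
import random

def skew_normal(shape):
    """Draw from the skew normal SN(0, 1, shape) via the representation
    X = d*|Z1| + sqrt(1 - d^2)*Z2 with d = shape / sqrt(1 + shape^2)."""
    d = shape / math.sqrt(1.0 + shape * shape)
    z1 = random.gauss(0.0, 1.0)
    z2 = random.gauss(0.0, 1.0)
    return d * abs(z1) + math.sqrt(1.0 - d * d) * z2

def centered_skew_normal(shape):
    # Subtract the known mean d * sqrt(2/pi) so the error has mean zero.
    d = shape / math.sqrt(1.0 + shape * shape)
    return skew_normal(shape) - d * math.sqrt(2.0 / math.pi)

random.seed(2)
draws = [centered_skew_normal(4.0) for _ in range(20000)]
mean = sum(draws) / len(draws)  # close to zero by construction
```

Unlike the t-distribution, all moments of the skew normal exist, which is consistent with the lack of a clear advantage for the robust estimators observed in Table 3.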
4. Empirical Analysis
In this section, we applied the proposed procedure to the dataset "tumor cell resistance to death," an artificial dataset consisting of two types of tumor cells, A and B, for which the experiment records their resistance to different doses of experimental drugs. The explanatory variable $x$ here is the dose of the drug, and the response variable $y$ is a score representing the resistance to death, ranging from 0 to 4. These data are available in the R lqr package; Galarza et al. [20] studied these data using quantile regression.
In Figure 4, Figure 5, Figure 6 and Figure 7, we display the QQ plots and log-QQ plots of the scores for cell A and cell B. It can be seen that the score distributions of both cells lack normality, whereas the log-scores are approximately normal. In addition, the boxplots and beeswarm plots in Figure 8 and Figure 9 show that both cell A and cell B have heavy tails, which allows us to focus on the following regression model:
where $x$ and $y$ are defined before. Our focus was on estimating the intercept and slope parameters as the solution of the following estimating equation:
Letting $\hat{\theta}$ denote this solution, the Catoni regression estimator of the two parameters has the following form:
Moreover, we compared the proposed estimator with the classical OLS estimator in Figure 10 and Figure 11. The residual plots are shown in Figure 12, Figure 13, Figure 14 and Figure 15, from which we can conclude that the residuals of the third-moment Catoni regression are distributed more uniformly. Furthermore, the mean squared errors of the third-moment Catoni regression and the OLS regression were 0.1120 and 0.1255 for cell A and 0.2268 and 0.2335 for cell B, respectively, which indicates that the proposed method provides a better fit.
5. Discussion
Estimating the mean of random variables is a classical issue in statistics [1] and has been well studied in classical statistics; however, with the discovery of heavy-tailed distributions in many research fields, their presence has become an important challenge in statistics. When the data have heavy tails, traditional estimators such as the empirical mean usually perform poorly. Therefore, finding an appropriate robust procedure is a well-known problem and has aroused great interest. A new estimator based on reconstructing the structure of the empirical mean was proposed by Catoni, which has excellent theoretical properties with respect to bias.
Catoni's estimator relies on the existence of the variance $v$ of the random variable. It is therefore an interesting question whether the estimator performs better under an additional moment condition. In this study, we assumed that the third central moment $s$ of the data exists and obtained a more accurate upper bound on the exponential moment, which yields an estimator with a smaller bias. To a certain extent, this assumption reduces robustness to outliers, but its effect is minimal for heavy-tailed distributions whose fourth moment does not exist. In future work, we have the following goals. First, we believe that our method can be applied as an improved mean estimator in any relevant model, as long as the third moment of the distribution exists, given its good theoretical properties and wide applicability. Second, it would be interesting to compare the bias bound of the proposed estimator with the minimax bound. Finally, the estimation of the variance in regression models is very important in statistical inference. The deviation of the estimator from the true value given in our main theoretical results can be regarded as a confidence interval based on a known variance; therefore, the proposed estimator is not directly suitable for variance estimation, but it is an interesting question how a proper variance estimator would affect the bias of our estimator. We will consider variance estimation under heavy-tailed distributions in later work.