A New Estimator: Median of the Distribution of the Mean in Robustness

García-Pérez, Alfonso

doi:10.3390/math11122694


Departamento de Estadística, I.O. y C.N., Universidad Nacional de Educación a Distancia (UNED), Paseo Senda del Rey 9, 28040 Madrid, Spain

Mathematics 2023, 11(12), 2694; https://doi.org/10.3390/math11122694

Submission received: 17 May 2023 / Revised: 6 June 2023 / Accepted: 12 June 2023 / Published: 14 June 2023

(This article belongs to the Special Issue Advances in Statistics: Theory, Methodology, Applications and Data Analysis)


Abstract

In some statistical methods, the statistical information is provided in terms of the values taken by classical estimators, such as the sample mean and sample variance. These estimations are then combined in a second stage, usually in a classical manner, into a single value such as a weighted mean. Moreover, in many applied studies, the results are given in these terms, i.e., as summary data. In all of these cases, the individual observations are unknown; therefore, the usual robust estimators cannot be computed from them to replace the classical non-robust estimations. In this paper, the use of the median of the distribution

F_{\bar{x}}

of the sample mean is proposed, assuming a location-scale contaminated normal model, where the parameters of

F_{\bar{x}}

are estimated with the classical estimations provided in the first stage. The estimator so defined is called median of the distribution of the mean,

M d M

. This new estimator is applied in Mendelian randomization, defining the new robust inverse weighted estimator, RIVW.

Keywords:

robust statistics; von Mises expansions; saddlepoint approximations; Mendelian randomization

MSC:

62F35; 62E17; 62P99

1. Introduction

In the application of some statistical methods, such as clinical trials, the results are usually described in terms of the values taken by classical estimators, such as the sample mean and sample variance. These results are combined, in a second stage, as a weighted mean in a meta-analysis. The same occurs in its alternative, Mendelian randomization, one of the main topics in causal inference.

Moreover, many applied studies report their results only in these terms, i.e., as summary data. The individual observations are then unknown, so robust estimators cannot be computed from them to replace the classical non-robust estimations.

In this paper, a solution to this problem is proposed that corrects, if necessary, the given classical estimations: although the individual observations are unknown, the mechanism that generates the data is known, because it is the assumed model.

Focusing on the mean estimation problem, the optimal estimator (uniformly minimum variance unbiased estimator) is the sample mean, when no outliers exist in the sample, and the normal distribution

N (μ, σ^{2})

is assumed as the model, with

μ

and

σ^{2}

being the usual parameters of the normal distribution, population mean, and variance. Assume that a proportion

ϵ

of outliers exists in the sample, i.e., a contaminated normal model (see [1], p. 2)

(1 - ϵ) N (μ, σ^{2}) + ϵ N (g_{1} μ, g_{2}^{2} σ^{2})

where most of the data are from a

N (μ, σ^{2})

, and a small part of them,

ϵ

, are from a normal model with more dispersion and a different location,

N (g_{1} μ, g_{2}^{2} σ^{2})

, where

g_{1}

is a contamination parameter that affects the location, and

g_{2}

is a contamination parameter that affects the scale. The optimality of the sample mean is lost because the optimal procedure and its properties heavily depend on the assumed probability model ([2], p. 2). This is the reason why classical statistics rests, basically, on the normal model and on the sample mean.

Additionally, under a contaminated normal model, the robustness of the sample mean is lost [1,3]. Under this model, the sample mean is not the maximum likelihood estimator [4], and even the normality of the sample mean is not guaranteed [5].

In this paper, a new estimator for a location–scale contaminated normal model is proposed, avoiding the extreme sensitivity of the sample mean but coinciding with it when no outliers are present in the data. The median of the distribution

F_{\bar{x}}

of the sample mean is proposed as a new estimator, where the parameters of

F_{\bar{x}}

are estimated with the classical estimations described in previous studies. This estimator is called the median of the distribution of the mean,

M d M

.

The two reasons why this new estimator relies on the distribution of the sample mean are that, first, the classical estimations are given in terms of the classical mean (and classical variance) and, second, this new procedure extends the classical one in the sense that if no outliers are present, this new estimator is the classical sample mean, i.e., with this method, the classical estimation is extended to the case in which outliers are present.

Another estimator somewhat related to

M d M

is the median of the means estimator

M o M

. However, this estimator is ultimately one of the sample means and, hence, is not robust (see [6]).

With the

M d M

, robustness is obtained, together with optimality when there are no outliers. Hence, this approach provides a new perspective on the dilemma between optimality and robustness.

Because the exact sample distribution of

\bar{x}

under a mixture distribution is not known, here it is estimated in a closed form with the von Mises (VOM) plus saddlepoint (SAD) method, a technique used by the author in several studies (see, for instance, [7,8]) but in another context. With this approximation, the estimator introduced in this paper can also be extended to other more general models than the normal mixture considered here.

The rest of the paper is structured as follows. In Section 2, the VOM+SAD approximation for the distribution of the sample mean is obtained under a location–scale contaminated normal model. The definition and some properties of the new location estimator are considered in Section 3, a scale estimator based on these ideas is defined in Section 4, and an example of the application of the new estimator is given in Section 5. These ideas are applied to Mendelian randomization in Section 6. Some conclusions are outlined in Section 7.

2. VOM+SAD Approximation of Sample Mean Distribution

Because the new estimator depends on the distribution of the sample mean, the approximation to this distribution must be very accurate, especially when the sample sizes considered are very small. For this situation, using a von Mises expansion ([9], p. 215, or [10], p. 578) that depends on Hampel’s influence function [11] is highly recommended.

Although, in the end, the obtained results are applied to the mixture of normals model considered previously, they are stated for more general models, F, G, and H, which indicates possible future extensions of this method.

The final approximation is called VOM+SAD and was previously obtained by the author in the context of spatial data (see [7,8]). Following the ideas developed in those two papers, considering the tail probability functional, initially, the approximation obtained is

P_{F} {T_{n} > t} ≃ P_{G} {T_{n} > t} + \int TAIF (x; t; T_{n}, G) d F (x)

which allows the approximation of the distribution of

T_{n}

when the observations follow model F by the distribution of

T_{n}

when the variables of

T_{n}

follow model G (pivotal distribution).

This approximation depends on the tail area influence function, TAIF, defined in [12].

Restricting this approximation to M estimators with a monotonic decreasing score function

ψ

(see [1], p. 46) and using the Lugannani and Rice formula ([13], or [14] p. 77, or [1] p. 314) to obtain a saddlepoint approximation for the TAIF, as the approximation given in [15] (p. 94), for M estimators, the VOM+SAD approximation obtained is

P_{F} {T_{n} > t} ≃ P_{G} {T_{n} > t} + \int \frac{ϕ (s)}{r_{1}} n^{1 / 2} (\frac{e^{z_{0} ψ (x, t)}}{\int e^{z_{0} ψ (y, t)} d G (y)} - 1) d F (x) .

In the case of a location–scale mixture normal model, the framework considered in this paper, i.e., assuming that

Z_{i} \equiv (1 - ϵ) N (μ, σ^{2}) + ϵ N (g_{1} μ, g_{2}^{2} σ^{2})

, the VOM+SAD approximation is

P_{F} {T_{n} > t} ≃ P_{G} {T_{n} > t} + ϵ \frac{ϕ (s)}{r_{1}} \sqrt{n} (\frac{\int e^{z_{0} ψ (x, t)} d H (x)}{\int e^{z_{0} ψ (y, t)} d G (y)} - 1)

(1)

where

G = N (μ, σ^{2})

, and

H = N (g_{1} μ, g_{2}^{2} σ^{2})

.

VOM+SAD Approximation for the Distribution of the Sample Mean

In the particular case of the sample mean, the score function is

ψ (x, t) = x - t

. Remember that in the VOM+SAD approximation, the saddlepoint is computed under

G = N (μ, σ^{2})

. Under this pivotal distribution, it is

K (λ, t) = log \int e^{λ (y - t)} \frac{1}{σ \sqrt{2 π}} e^{- \frac{1}{2 σ^{2}} {(y - μ)}^{2}} d y = \frac{σ^{2} λ^{2}}{2} + λ (μ - t) .

Hence, from the saddlepoint equation

K^{'} (z_{0}, t) = 0

the saddlepoint

z_{0} = (t - μ) / σ^{2}

is obtained.

Additionally,

K (z_{0}, t) = - {(t - μ)}^{2} / (2 σ^{2})

,

ϕ (s) = ϕ (\sqrt{n} (t - μ) / σ)

,

r_{1} = (t - μ) / σ

, and

K^{″} (λ, t) = σ^{2}

. The leading term is

P_{G} {T_{n} > t} = 1 - Φ (\sqrt{n} (t - μ) / σ)

, and the quotient in the last term on the right-hand side of (1) is

\frac{\int e^{z_{0} ψ (x, t)} d H (x)}{\int e^{z_{0} ψ (y, t)} d G (y)} = exp \{(g_{1} - 1) μ z_{0} + \frac{1}{2} (g_{2}^{2} - 1) σ^{2} z_{0}^{2}\} .

Hence, the VOM+SAD approximation (1) is

P_{F} {\bar{x} > t} ≃ 1 - Φ (\sqrt{n} σ z_{0}) + ϵ \frac{\sqrt{n}}{σ z_{0}} ϕ (\sqrt{n} σ z_{0}) (e^{(g_{1} - 1) μ z_{0} + \frac{1}{2} (g_{2}^{2} - 1) σ^{2} z_{0}^{2}} - 1) .

If distributions F and G are not close enough, intermediate distributions can be considered, as in [16,17,18], to obtain a more accurate approximation.
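The closed-form approximation for the sample mean derived in this section can be evaluated numerically. The following is only a minimal sketch in Python (standard library only): the function name `vom_sad_tail` and its argument names are illustrative assumptions, not part of the paper, and the removable singularity at z_0 = 0 is replaced by its limit.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: cdf is Phi, pdf is phi


def vom_sad_tail(t, mu, sigma, n, eps, g1, g2):
    """Approximate P_F{x_bar > t} under (1-eps) N(mu, sigma^2) + eps N(g1*mu, g2^2*sigma^2)."""
    z0 = (t - mu) / sigma**2                  # saddlepoint under the pivotal G = N(mu, sigma^2)
    s = math.sqrt(n) * sigma * z0             # equals sqrt(n) (t - mu) / sigma
    # Exponent (g1-1) mu z0 + (1/2)(g2^2-1) sigma^2 z0^2: log-ratio of the two normal
    # moment generating functions evaluated at z0.
    a = (g1 - 1.0) * mu
    b = 0.5 * (g2**2 - 1.0) * sigma**2
    if z0 == 0.0:
        ratio = a                             # limit of (exp(a z + b z^2) - 1) / z as z -> 0
    else:
        ratio = math.expm1(a * z0 + b * z0**2) / z0
    correction = eps * math.sqrt(n) / sigma * STD_NORMAL.pdf(s) * ratio
    return 1.0 - STD_NORMAL.cdf(s) + correction


# Illustrative call with hypothetical parameter values.
print(vom_sad_tail(2.1, mu=2.0, sigma=1.0, n=11, eps=0.05, g1=1.2, g2=1.0))
```

Being a first-order approximation in ϵ, the returned value is not guaranteed to stay inside [0, 1] when ϵ or the contamination constants are large.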

3. Estimator Median of the Distribution of the Mean

If the previous distribution of the mean is

F_{\bar{x}} (x) = 1 - P_{F} {\bar{x} > x}

the median of this distribution, i.e.,

F_{\bar{x}}^{- 1} (1 / 2)

, is called the median of the distribution of the mean,

M d M

, i.e., this estimator is the solution of

F_{\bar{x}} (M d M) = \frac{1}{2} .

The parameters of

F_{\bar{x}}

are estimated with the classical estimations, the sample mean

\bar{x}

and the sample variance

s^{2}

.
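The defining equation of the estimator has no closed-form solution, so it has to be solved numerically. The sketch below uses plain bisection, which is a choice made here for illustration and is not stated in the paper; the names `tail`, `quantile` and `mdm` are hypothetical, and the code is not claimed to reproduce the numerical values reported in the examples.

```python
import math
from statistics import NormalDist

PHI = NormalDist()  # standard normal


def tail(t, mu, sigma, n, eps, g1, g2):
    """VOM+SAD approximation of P_F{x_bar > t} from Section 2."""
    z0 = (t - mu) / sigma**2
    s = math.sqrt(n) * sigma * z0
    a, b = (g1 - 1.0) * mu, 0.5 * (g2**2 - 1.0) * sigma**2
    ratio = a if z0 == 0.0 else math.expm1(a * z0 + b * z0**2) / z0
    return 1.0 - PHI.cdf(s) + eps * math.sqrt(n) / sigma * PHI.pdf(s) * ratio


def quantile(p, mu, sigma, n, eps, g1, g2, width=8.0, tol=1e-10):
    """Solve F_xbar(q) = p, i.e. tail(q) = 1 - p, by bisection.

    Assumes a single sign change inside mu +/- width * sigma / sqrt(n);
    this is an assumption of the sketch, not a proven property.
    """
    def f(t):
        return tail(t, mu, sigma, n, eps, g1, g2) - (1.0 - p)

    lo = mu - width * sigma / math.sqrt(n)
    hi = mu + width * sigma / math.sqrt(n)
    if f(lo) * f(hi) > 0:
        raise ValueError("root not bracketed; widen the search interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def mdm(x_bar, s, n, eps, g1, g2):
    """Median of the distribution of the mean, with mu and sigma replaced by x_bar and s."""
    return quantile(0.5, x_bar, s, n, eps, g1, g2)
```

With ϵ = 0 the function returns the plugged-in sample mean, in line with the property that the estimator coincides with the classical one when there are no outliers; for instance, `mdm(2.0, 1.0, 11, eps=0.05, g1=1.2, g2=1.0)` returns a value slightly above 2.0.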

Figure 1, Figure 2 and Figure 3 show that as contamination parameters

ϵ

,

g_{1}

, or

g_{2}

increase, the difference between

M d M

and

\bar{x}

, i.e.,

z_{0}

, increases.

The main reason for the definition of

M d M

is that the median is more robust than the sample mean and, hence, the influence of possible outliers, not knowing the individual observations, as assumed here, should be lower with the median of the distribution of the mean than with the sample mean, used in this distribution as an estimator of the location parameter. Furthermore, in the case without outliers, this estimator is equal to the classical sample mean.

As a limitation, observe that

M d M

is also sensitive if outliers already affect the sample mean or sample variance used in the estimation of the location or scale parameter

μ

or

σ^{2}

. Nevertheless, with

M d M

, this sensitivity is lower.

One way to check the behavior of

M d M

with respect to

\bar{x}

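in a simple numerical example is to run the R commands

> n <- 11
> from_H <- rbinom(n, 1, 0.2)   # 1 marks an observation drawn from the contaminating N(3*2, 1)
> x <- ifelse(from_H == 1, rnorm(n, 3*2, 1), rnorm(n, 2, 1))
> mean(x)
> median(x)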

in which a random sample of size

n = 11

is drawn from the normal mixture

0.8 N (2, 1) + 0.2 N (3 \cdot 2, 1)

, i.e., a sample where

ϵ = 0.2, g_{1} = 3

and

g_{2} = 1

.

Finally, in future research, other robust estimators could be considered, such as the trimmed mean of the distribution of the sample mean.

4. Dispersion Estimator

With the ideas developed in this paper, a natural dispersion estimator is the interquartile range of the distribution of the sample mean,

F_{\bar{x}}^{- 1} (3 / 4) - F_{\bar{x}}^{- 1} (1 / 4) .
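As a quick sanity check of this definition, consider the case without contamination. Assuming ϵ = 0, the distribution of the sample mean is exactly normal, and the dispersion estimator reduces to a multiple of the standard error:

```latex
\epsilon = 0:\quad F_{\bar{x}} = N(\mu, \sigma^{2}/n)
\;\Longrightarrow\;
F_{\bar{x}}^{-1}(3/4) - F_{\bar{x}}^{-1}(1/4)
  = 2\,\Phi^{-1}(3/4)\,\frac{\sigma}{\sqrt{n}}
  \approx 1.349\,\frac{\sigma}{\sqrt{n}} .
```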

5. Example

In most applied papers, only the final values of the estimators are given. These estimators are usually the classical sample mean and sample variance, and the individual observations from which they were obtained are not included; therefore, there is no opportunity to robustify these values using the usual robust techniques.

For this reason, a large number of examples could serve as an illustration of the estimator defined in this paper. Next, let us consider just one.

Example 1.

One of these studies is [19], where some vertebral column and thorax fossils of Neanderthals were re-evaluated using their vertebrae because, as stated by the author, errors probably occurred in the reconstruction and the samples were wrongly classified. He mentions ([19], p. 23) a misclassification of 7/33, which can be considered as the value of the contamination parameter ϵ.

Because modern humans and Neanderthals have very similar vertebrae, no difference in the mean is assumed, hence using a distortion factor

g_{1} = 1

. On the other hand, Neanderthals are slightly stockier than modern humans, with the dispersion of the latter being larger, so it is assumed that

g_{2} = 1.5

.

In Table 2 in [19], classical acceptance confidence intervals are provided for several vertebrae of 28 modern humans. They are based on the classical mean and variance, as the author states in this table. From the table, with respect to vertebra T1, the remains of Kebara 2 and La Ferrassie can be considered as modern humans instead of Neanderthals, because they lie inside the confidence interval. The same happens with vertebra T7 but not with vertebra T5.

From these classical intervals, for vertebra T1, the classical sample mean and standard deviation are

\bar{x} = 16.6

and

S = 3.61

, respectively. In this case, the estimator median of the distribution of the mean takes the value

M d M = 15.034

, obtaining the new robust acceptance confidence interval equal to

[13.63, 16.43]

, which does not contain the remains. The conclusion is then that these remains are Neanderthals and not modern humans, as they were wrongly considered with the classical estimators.

With respect to vertebra T5,

M d M = 17.54

, and the new robust acceptance confidence interval is

[15.84, 19.24]

, with neither the classical nor the robust interval including the remains, confirming that they are Neanderthals.

Finally, for vertebra T7,

M d M = 19.43

, and the new robust acceptance confidence interval is

[17.73, 21.13]

. Both this robust interval and the previous classical confidence interval include the remains of La Ferrassie, which are hence classified as modern humans and not Neanderthals.

6. Robust Inverse-Weighted Estimator RIVW in Mendelian Randomization

Another field for the class of problems considered in this paper is randomized clinical trials (CTs). In each of these CTs, the sample mean and sample variance are the usual final result. These are usually combined, in a classical way, as a weighted mean in a meta-analysis. In CTs, the relationship of a variable X (called cause) with another variable Y (called effect) is analyzed, but reverse causality may exist, complete randomization may be lacking or, more importantly, confounders may be present.

Moreover, CTs are expensive and take a long time. With Mendelian randomization (MR), a method that has received renewed interest in recent years, CTs are imitated because, in any person, all genetic material, including DNA markers, is randomly allocated from their parents. By chance, some people receive more DNA markers related to variable X and others receive fewer. MR uses genetic variants (usually single-nucleotide polymorphisms (SNPs)) as instrumental variables Z.

Mathematically, MR is used to avoid possible biases in the regression of Y on X due to these three causes just mentioned. Formally, MR leads us to a two-step linear regression process; first, for every genetic variant

Z_{j}, j = 1, \dots, L

, a linear regression of X on

Z_{j}

is performed, where, for individuals,

i = 1, \dots, n_{j}

is

X_{i} ∣ Z_{i j} = β_{X_{0}} + β_{X_{j}} Z_{i j} + e_{X_{i j}}

from which the fitted values

{\hat{X}}_{i}

are obtained and used in a second regression of Y on these

\hat{X}

, obtaining finally [20]

Y_{i} ∣ Z_{i j} = β_{Y_{0}} + (β \cdot β_{X_{j}} + α_{j}) Z_{i j} + e_{Y_{i j}} = β_{Y_{0}} + β_{Y_{j}} Z_{i j} + e_{Y_{i j}}

where

β_{X_{j}}

and

β_{Y_{j}}

represent the association of

Z_{j}

with the exposure and the outcome (only through X), respectively. The parameter

β \cdot β_{X_{j}}

represents the effect of

Z_{j}

on Y through X, where

β

is the causal effect of X on Y that is being estimated. Moreover,

α_{j}

represents the association between

Z_{j}

and Y not through the exposure of interest. Finally, the error terms

e_{X_{i j}}

and

e_{Y_{i j}}

are assumed to be independent because independent samples are assumed to be used to fit the two previous regression models.

In MR, the standard estimator of the parameter of interest

β

, the slope in the linear regression of Y on X, is the classical two-stage least squares estimator

{\hat{β}}_{R_{j}} = \frac{{\hat{β}}_{Y_{j}}}{{\hat{β}}_{X_{j}}}

which is the quotient of the slope of the regression of Y on

Z_{j}

,

{\hat{β}}_{Y_{j}}

, and the slope estimator of the regression of X on

Z_{j}

,

{\hat{β}}_{X_{j}}

. These classical estimations, one for each value of the instrumental variable

Z_{j}

, are combined with the classical inverse-variance weighted (IVW) estimator

IVW = \sum_{j = 1}^{L} ω_{j} {\hat{β}}_{R_{j}} / (\sum_{j = 1}^{L} ω_{j})

where

ω_{j} = 1 / v a r ({\hat{β}}_{R_{j}})

, which is used to weight the

{\hat{β}}_{R_{j}}

estimators, assuming that the L genetic variants are mutually independent. In this way, a single causal effect estimate from L genetic instruments is obtained.
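The IVW combination is a plain weighted mean, which the following minimal sketch makes explicit. The function name `ivw` and all numerical values are hypothetical and chosen only for illustration; the variances of the ratio estimates are taken as given inputs, since their computation is not specified here. The second call shows how a single aberrant ratio estimate can move the combined value arbitrarily far.

```python
def ivw(ratio_estimates, variances):
    """Inverse-variance weighted mean: sum(w_j * b_j) / sum(w_j), with w_j = 1 / var_j."""
    weights = [1.0 / v for v in variances]
    return sum(w * b for w, b in zip(weights, ratio_estimates)) / sum(weights)


# Hypothetical ratio estimates with equal variances.
clean = ivw([2.0, 2.2, 1.8, 2.1], [0.5, 0.5, 0.5, 0.5])
# Same data, but the last estimate is replaced by an aberrant value.
spoiled = ivw([2.0, 2.2, 1.8, 2100.0], [0.5, 0.5, 0.5, 0.5])
print(clean, spoiled)
```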

This classic and widely used estimator is not robust because it has a

0 %

breakdown point, being a weighted mean (see, for instance, [21]).

In this section, the robustification of the classical estimator IVW is obtained, first, by replacing estimators

{\hat{β}}_{R_{j}}

with the median of the distribution of the mean,

M d M_{j}

estimators and, second, by replacing the weights

ω_{j}

with

v_{j}

, the inverse of the new dispersion estimator,

v_{j} = \frac{1}{F_{\bar{x}}^{- 1} (3 / 4) - F_{\bar{x}}^{- 1} (1 / 4)}

defining the new estimator, based on the

{\hat{β}}_{R_{j}}

distribution, as

RIVW = \frac{\sum_{j = 1}^{L} v_{j} M d M_{j}}{\sum_{j = 1}^{L} v_{j}} .

6.1. Distribution of ${\hat{β}}_{R_{j}}$ Estimator

In this section, an approximation for the distribution of

{\hat{β}}_{R_{j}}

is obtained for each genetic variant

Z_{j}, j = 1, \dots, L

, i.e., j is fixed. Moreover, because of the usual regression assumptions,

Z_{i j}

is not random in the two previous linear regressions, i.e., in the estimator

{\hat{β}}_{R_{j}}

.

Hence, with

μ_{X_{i}}

denoting the constant

μ_{X_{i}} = β_{X_{0}} + β_{X_{j}} Z_{i j}

avoiding the j in the notation of

μ_{X_{i}}

to simplify it, and with

μ_{Y_{i}}

being the constant

μ_{Y_{i}} = β_{Y_{0}} + β_{Y_{j}} Z_{i j},

and assuming no outliers in the sample, the variable

X_{i} ∣ Z_{i j}

follows a normal distribution

X_{i} ∣ Z_{i j} \equiv N (μ_{X_{i}}, σ_{X_{i}}^{2})

and

Y_{i} ∣ Z_{i j} \equiv N (μ_{Y_{i}}, σ_{Y_{i}}^{2}) .

The estimator

{\hat{β}}_{R_{j}}

is equal to

{\hat{β}}_{R_{j}} = \frac{{\hat{β}}_{Y_{j}}}{{\hat{β}}_{X_{j}}}

and, considering standardized data, i.e., that

{\hat{β}}_{R_{j}}

is computed as a correlations quotient,

{\hat{β}}_{R_{j}} = \frac{\sum_{i = 1}^{n_{j}} Y_{i} Z_{i j}}{\sum_{i = 1}^{n_{j}} X_{i} Z_{i j}}

its tail distribution is

\begin{matrix} P \{{\hat{β}}_{R_{j}} > a\} & = & P \{\frac{\sum_{i = 1}^{n_{j}} Y_{i} Z_{i j}}{\sum_{i = 1}^{n_{j}} X_{i} Z_{i j}} > a\} \\ & = & P \{\sum_{i = 1}^{n_{j}} Y_{i} Z_{i j} - a \sum_{i = 1}^{n_{j}} X_{i} Z_{i j} > 0\} \\ & = & P \{\sum_{i = 1}^{n_{j}} (Y_{i} - a X_{i}) Z_{i j} > 0\} . \end{matrix}

Letting

W_{i}

(removing the j if there is no risk of confusion) denote the random variable

W_{i} = W_{i j} = (Y_{i} - a X_{i}) Z_{i j}

,

i = 1, \dots, n_{j}

, where a and

Z_{i j}

are not random, the aim is to compute the distribution of the sample mean of the variables

W_{i}

at 0, i.e.,

P \{{\hat{β}}_{R_{j}} > a\} = P \{\sum_{i = 1}^{n_{j}} W_{i} > 0\} = P \{\bar{W} > 0\}

where the

W_{i}

are independent but not identically distributed because

W_{i} ∣ Z_{i j} \equiv N (μ_{i}, σ_{i}^{2}), i = 1, \dots, n_{j}

where

μ_{i} = (μ_{Y_{i}} - a \cdot μ_{X_{i}}) Z_{i j}

and

σ_{i}^{2} = V ((Y_{i} - a \cdot X_{i}) Z_{i j})

which depends on

σ_{X_{i}}^{2}

and

σ_{Y_{i}}^{2}

. The values of these parameters are taken from previous studies, following the median of the distribution of the mean method.
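The dependence of this variance on the two regression variances can be written out explicitly. Assuming, as for the error terms above, that the exposure and outcome observations come from independent samples, and recalling that a and the instrument values are not random:

```latex
\sigma_{i}^{2} = V\big((Y_{i} - a X_{i})\, Z_{ij}\big)
             = Z_{ij}^{2}\,\big(\sigma_{Y_{i}}^{2} + a^{2}\,\sigma_{X_{i}}^{2}\big) .
```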

If the data contain no outliers, it will be

W_{i} ∣ Z_{i j} \equiv N (μ_{i}, σ_{i}^{2})

but, as usual, a proportion

ϵ

of outliers in the data is assumed, i.e., the model for the observations

W_{i}

is

F_{i} = (1 - ϵ) N (μ_{i}, σ_{i}^{2}) + ϵ N (g_{i 1} μ_{i}, g_{i 2}^{2} σ_{i}^{2})

where the contamination constants

g_{i 1}

and

g_{i 2}

are assumed to depend on

i = 1, \dots, n_{j}

.

To compute the distribution of

\bar{W}

under models

F_{i}

, assuming that the sample sizes

n_{j}

are small, a von Mises approximation (VOM), based on a von Mises expansion, is used to obtain an accurate approximation with small sample sizes.

6.2. VOM Approximation of the Distribution

In general, to approximate the tail probability of statistic

T_{n}

under a vector of model distributions

F = (F_{1}, \dots, F_{n})

, knowing its tail distribution under the vector of model distributions

G = (G_{1}, \dots, G_{n})

(called pivotal distributions), the von Mises expansion of the tail probability of

T_{n} (X_{1}, X_{2}, \dots, X_{n})

at

F

is used ([10], Section 2, or [22], Theorem 2.1, or [17], Corollary 2),

P_{F} {T_{n} (X_{1}, X_{2}, \dots, X_{n}) > t} = P_{F_{1}, \dots, F_{n}} {T_{n} (X_{1}, X_{2}, \dots, X_{n}) > t}

= P_{G} {T_{n} (X_{1}, X_{2}, \dots, X_{n}) > t} + \sum_{i = 1}^{n} \int_{X} {TAIF}_{i} (x; t; T_{n}, G) d F_{i} (x) + R e m

where the sample space

X \subset R^{m}

,

R e m = \frac{1}{2} \int_{X} \int_{X} T_{G_{F}}^{(2)} (x_{1}, x_{2}) d [F (x_{1}) - G (x_{1})] d [F (x_{2}) - G (x_{2})]

where

T_{G_{F}}^{(2)}

is the second derivative of the tail probability functional at the mixture distribution

G_{F} = (1 - λ) G + λ F

, for some

λ \in [0, 1]

; and

{TAIF}_{i}

is the ith (multivariate) partial tail area influence function of

T_{n}

at

G = (G_{1}, \dots, G_{n})

in relation to

G_{i}

,

i = 1, \dots, n

, introduced in [17], Definition 1,

{TAIF}_{i} (x; t; T_{n}, G) = \frac{\partial}{\partial ϵ} P_{G_{i}^{ϵ, x}} {T_{n} (X_{1}, \dots, X_{n}) > t} ∣_{ϵ = 0}

in those

x \in X

where the right-hand side exists. In the computation of

{TAIF}_{i}

, only

G_{i}

is contaminated; the other distributions remain fixed,

i = 1, \dots, n

.

In general,

R e m

is close to 0, and the von Mises approximation (VOM) is defined as

\begin{matrix} P_{F} {T_{n} (X_{1}, X_{2}, \dots, X_{n}) > t} & ≃ & P_{G} {T_{n} (X_{1}, X_{2}, \dots, X_{n}) > t} \\ & & + \sum_{i = 1}^{n} \int_{X} {TAIF}_{i} (x; t; T_{n}, G) d F_{i} (x) . \end{matrix}

(2)

Moreover, if

F

is a mixture distribution,

F = (1 - ϵ) G + ϵ H

,

R e m = O (ϵ^{2})

([23], p. 77). Additionally, because of the partial influence functions properties ([22], p. 3) that are valid for the partial tail area influence functions defined in [17], for any

T_{n}

it will be

\int_{X} {TAIF}_{i} (x; t; T_{n}, G) d G_{i} (x) = 0,

(3)

i.e., the integral, with respect to a given model, of the

{TAIF}_{i}

that depends on this model is equal to 0. Hence,

P_{F} {T_{n} (X_{1}, X_{2}, \dots, X_{n}) > t} = P_{G} {T_{n} (X_{1}, X_{2}, \dots, X_{n}) > t}

+ (1 - ϵ) \sum_{i = 1}^{n} \int_{X} {TAIF}_{i} (x; t; T_{n}, G) d G_{i} (x)

+ ϵ \sum_{i = 1}^{n} \int_{X} {TAIF}_{i} (x; t; T_{n}, G) d H_{i} (x) + O (ϵ^{2})

= P_{G} {T_{n} (X_{1}, X_{2}, \dots, X_{n}) > t} + 0 + ϵ \sum_{i = 1}^{n} \int_{X} {TAIF}_{i} (x; t; T_{n}, G) d H_{i} (x) + O (ϵ^{2})

i.e., the VOM approximation is

P_{F} {T_{n} (X_{1}, X_{2}, \dots, X_{n}) > t} ≃ P_{G} {T_{n} (X_{1}, X_{2}, \dots, X_{n}) > t}

+ ϵ \sum_{i = 1}^{n} \int_{X} {TAIF}_{i} (x; t; T_{n}, G) d H_{i} (x) .

Moreover, because of Proposition 1 in [17],

\begin{matrix} {TAIF}_{i} (x; t; T_{n}, G) & = & - P_{G_{1}, \dots, G_{n}} {T_{n} (X_{1}, X_{2}, \dots, X_{n}) > t} \\ & & + P_{G_{1}, \dots, G_{i - 1}, G_{i + 1}, \dots, G_{n}} {T_{n} (X_{1}, \dots, X_{i - 1}, x, X_{i + 1}, \dots, X_{n}) > t} \end{matrix}

and the VOM approximation of the tail probability

P_{F} {T_{n} (X_{1}, X_{2}, \dots, X_{n}) > t}

can also be expressed as

\begin{matrix} P_{F} {T_{n} (X_{1}, X_{2}, \dots, X_{n}) > t} & ≃ & (1 - n) P_{G} {T_{n} (X_{1}, X_{2}, \dots, X_{n}) > t} \\ & & + \int_{X} P_{G_{2}, \dots, G_{n}} {T_{n} (x, X_{2}, \dots, X_{n}) > t} d F_{1} (x) \\ & & + \int_{X} P_{G_{1}, G_{3}, \dots, G_{n}} {T_{n} (X_{1}, x, \dots, X_{n}) > t} d F_{2} (x) + \dots \\ & & + \int_{X} P_{G_{1}, \dots, G_{n - 1}} {T_{n} (X_{1}, \dots, X_{n - 1}, x) > t} d F_{n} (x) \end{matrix}

(4)

which allows an approximation of the tail probability

P_{F} {T_{n} (X_{1}, X_{2}, \dots, X_{n}) > t}

under models

F = (F_{1}, \dots, F_{n})

, knowing the value of this tail probability under near models

G = (G_{1}, \dots, G_{n})

.

In the particular case that

T_{n} (X_{1}, X_{2}, \dots, X_{n}) = \bar{W}

, the VOM approximation for the tail of

\bar{W}

can be expressed as (see (2) with

n = n_{j}, j = 1, \dots, L

and

t = 0

now)

P_{F} \{\bar{W} > 0\} ≃ P_{G} {W_{1} + \dots + W_{n_{j}} > 0} + \sum_{i = 1}^{n_{j}} \int_{R} {TAIF}_{i} (x; 0; \bar{W}, G) d F_{i} (x)

or, see (4),

P_{F} \{\bar{W} > 0\} ≃ P_{G} {W_{1} + \dots + W_{n_{j}} > 0}

+ \sum_{i = 1}^{n_{j}} \int_{R} [- P_{G} {W_{1} + \dots + W_{n_{j}} > 0}

+ P_{G_{1}, \dots, G_{i - 1}, G_{i + 1}, \dots, G_{n_{j}}} {W_{1} + \dots + W_{i - 1} + x + W_{i + 1} + \dots + W_{n_{j}} > 0}] d F_{i} (x)

or

\begin{matrix} P_{F} \{\bar{W} > 0\} & ≃ & (1 - n_{j}) P_{G} {W_{1} + \dots + W_{n_{j}} > 0} \\ & & + \int_{R} P_{G_{2}, \dots, G_{n_{j}}} {x + W_{2} + \dots + W_{n_{j}} > 0} d F_{1} (x) \\ & & + \int_{R} P_{G_{1}, G_{3}, \dots, G_{n_{j}}} {W_{1} + x + W_{3} + \dots + W_{n_{j}} > 0} d F_{2} (x) + \dots \\ & & + \int_{R} P_{G_{1}, \dots, G_{i - 1}, G_{i + 1}, \dots, G_{n_{j}}} {W_{1} + \dots + W_{i - 1} + x + W_{i + 1} + \dots + W_{n_{j}} > 0} d F_{i} (x) \\ & & + \dots + \int_{R} P_{G_{1}, \dots, G_{n_{j} - 1}} {W_{1} + \dots + W_{n_{j} - 1} + x > 0} d F_{n_{j}} (x) . \end{matrix}

If the model assumed for the observations

W_{i}

is

F_{i} = (1 - ϵ) N (μ_{i}, σ_{i}^{2}) + ϵ N (g_{i 1} μ_{i}, g_{i 2}^{2} σ_{i}^{2})

and the contaminating distribution is denoted by

G_{i}^{g_{i 1}, g_{i 2}} \equiv N (g_{i 1} μ_{i}, g_{i 2}^{2} σ_{i}^{2})

, and by

G_{i} \equiv N (μ_{i}, σ_{i}^{2})

the pivotal distribution,

i = 1, \dots, n_{j}

, i.e.,

F_{i} = (1 - ϵ) G_{i} + ϵ G_{i}^{g_{i 1}, g_{i 2}},

the generic component of the last approximation is

\int_{R} P_{G_{1}, \dots, G_{i - 1}, G_{i + 1}, \dots, G_{n_{j}}} {W_{1} + \dots + W_{i - 1} + x + W_{i + 1} + \dots + W_{n_{j}} > 0} d F_{i} (x) =

\int_{R} P_{G_{1}, \dots, G_{i - 1}, G_{i + 1}, \dots, G_{n_{j}}} {W_{1} + \dots + W_{i - 1} + W_{i + 1} + \dots + W_{n_{j}} > - x} d F_{i} (x)

= \int_{R} [1 - Φ (\frac{- x - μ_{- i}}{σ_{- i}})] d F_{i} (x)

where

Φ

is the cumulative distribution function of a standard normal distribution,

μ_{- i} = μ_{1} + \dots + μ_{i - 1} + μ_{i + 1} + \dots + μ_{n_{j}}

and

σ_{- i}^{2} = σ_{1}^{2} + \dots + σ_{i - 1}^{2} + σ_{i + 1}^{2} + \dots + σ_{n_{j}}^{2} .

If

μ_{s} = μ_{1} + \dots + μ_{n_{j}} = μ_{- i} + μ_{i}

and

σ_{s}^{2} = σ_{1}^{2} + \dots + σ_{n_{j}}^{2} = σ_{- i}^{2} + σ_{i}^{2}

then,

P_{G} {W_{1} + \dots + W_{n_{j}} > 0} = 1 - Φ (\frac{- μ_{s}}{σ_{s}})

and

P_{F} \{{\hat{β}}_{R_{j}} > a\} = P_{F} \{\sum_{i = 1}^{n_{j}} W_{i} > 0\} = P_{F} \{\bar{W} > 0\}

≃ 1 - Φ (\frac{- μ_{s}}{σ_{s}}) + \sum_{i = 1}^{n_{j}} \int_{R} [Φ (\frac{- μ_{s}}{σ_{s}}) - Φ (\frac{- x - μ_{- i}}{σ_{- i}})] d F_{i} (x) .

(5)

Because

F_{i}

is a normal mixture

F_{i} = (1 - ϵ) G_{i} + ϵ G_{i}^{g_{i 1}, g_{i 2}}

the VOM approximation (5) is

P_{F} \{\bar{W} > 0\} ≃ 1 - Φ (\frac{- μ_{s}}{σ_{s}}) + ϵ \sum_{i = 1}^{n_{j}} \int_{R} [Φ (\frac{- μ_{s}}{σ_{s}}) - Φ (\frac{- x - μ_{- i}}{σ_{- i}})] d G_{i}^{g_{i 1}, g_{i 2}} (x) .

Moreover, because of property (3) for the partial influence functions mentioned before, it is

\int_{R} [Φ (\frac{- μ_{s}}{σ_{s}}) - Φ (\frac{- x - μ_{- i}}{σ_{- i}})] d G_{i} (x) = 0

or

\int_{R} Φ (\frac{- x - μ_{- i}}{σ_{- i}}) d G_{i} (x) = Φ (\frac{- μ_{s}}{σ_{s}}) .
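This identity can be checked numerically. The sketch below compares a trapezoidal approximation of the integral with the closed form on the right-hand side; the function names, the quadrature settings and the parameter values used in the call are arbitrary choices made only for illustration.

```python
import math
from statistics import NormalDist


def lhs(mu_i, sigma_i, mu_rest, sigma_rest, steps=20000, span=10.0):
    """Trapezoidal approximation of the integral of Phi((-x - mu_rest)/sigma_rest) dG_i(x)."""
    g_i = NormalDist(mu_i, sigma_i)       # pivotal distribution of W_i
    std = NormalDist()                    # standard normal
    lo, hi = mu_i - span * sigma_i, mu_i + span * sigma_i
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        w = 0.5 if k in (0, steps) else 1.0   # trapezoid end-point weights
        total += w * std.cdf((-x - mu_rest) / sigma_rest) * g_i.pdf(x)
    return total * h


def rhs(mu_i, sigma_i, mu_rest, sigma_rest):
    """Closed form Phi(-mu_s / sigma_s) with mu_s and sigma_s^2 the full sums."""
    mu_s = mu_rest + mu_i
    sigma_s = math.sqrt(sigma_rest**2 + sigma_i**2)
    return NormalDist().cdf(-mu_s / sigma_s)


print(lhs(0.3, 1.2, -0.5, 2.0), rhs(0.3, 1.2, -0.5, 2.0))
```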

Hence, making the change of variable

(x + μ_{- i}) / σ_{- i} = y

, it is

\int_{R} Φ (\frac{- x - μ_{- i}}{σ_{- i}}) d G_{i}^{g_{i 1}, g_{i 2}} (x) = Φ (\frac{- μ_{s}^{g_{i 1}}}{σ_{s}^{g_{i 2}}})

where

μ_{s}^{g_{i 1}} = μ_{1} + \dots + μ_{i - 1} + g_{i 1} μ_{i} + μ_{i + 1} + \dots + μ_{n_{j}}

and

σ_{s}^{g_{i 2}} = \sqrt{σ_{1}^{2} + \dots + σ_{i - 1}^{2} + g_{i 2} σ_{i}^{2} + σ_{i + 1}^{2} + \dots + σ_{n_{j}}^{2}} .

Then, the VOM approximation to the distribution of

{\hat{β}}_{R_{j}}

is

P \{{\hat{β}}_{R_{j}} > a\} = 1 - Φ (\frac{- μ_{s}}{σ_{s}}) + ϵ \sum_{i = 1}^{n_{j}} [Φ (\frac{- μ_{s}}{σ_{s}}) - Φ (\frac{- μ_{s}^{g_{i 1}}}{σ_{s}^{g_{i 2}}})] .
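A minimal sketch of how this final expression could be evaluated is given next. The function name `vom_tail_ratio` and its list-based interface are assumptions made here for illustration; the means and variances of the W_i are taken as inputs, and the contaminated scale follows the definition of σ_s^{g_{i2}} exactly as printed above.

```python
import math
from statistics import NormalDist


def vom_tail_ratio(mu, sigma2, eps, g1, g2):
    """Evaluate 1 - Phi(-mu_s/sigma_s) + eps * sum_i [Phi(-mu_s/sigma_s) - Phi(-mu_s^g/sigma_s^g)].

    mu[i], sigma2[i]: mean and variance of W_i (both depend on the threshold a).
    g1[i], g2[i]: contamination constants of observation i.
    """
    Phi = NormalDist().cdf
    mu_s = sum(mu)
    var_s = sum(sigma2)
    base = Phi(-mu_s / math.sqrt(var_s))
    correction = 0.0
    for m, v, c1, c2 in zip(mu, sigma2, g1, g2):
        mu_g = mu_s + (c1 - 1.0) * m      # mu_i replaced by g_i1 * mu_i
        var_g = var_s + (c2 - 1.0) * v    # sigma_i^2 replaced by g_i2 * sigma_i^2, as printed
        correction += base - Phi(-mu_g / math.sqrt(var_g))
    return 1.0 - base + eps * correction
```

Solving `vom_tail_ratio(...) = 1/2` in the threshold a, for instance by bisection, would then give the corresponding median.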

Example 2.

In a study [24], whether low-density lipoprotein cholesterol (LDL-C) is a cause of coronary artery disease (CAD) was analyzed considering 28 DNA markers:

     DNA markers   X                            Y
     SNP           exposure.beta  exposure.se   outcome.beta  outcome.se
 1   snp_1          0.0260        0.004          0.0677       0.0286
 2   snp_2         -0.0440        0.004         -0.1625       0.0300
 ...
27   snp_27         0.0090        0.003          0.0000       0.0255
28   snp_28        -0.0360        0.007          0.0198       0.0647

Usually,

Z_{i} \equiv B (2, 0.5)

is assumed to be an instrumental variable to mimic biallelic SNPs in Hardy–Weinberg equilibrium. A value

I V W = 2.834214

was obtained.
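For the four rows of the summary table that are printed here, the ratio estimates (outcome slope divided by exposure slope) can be computed directly. This is only an illustrative sketch: the variable names are hypothetical, and because the remaining 24 rows are not shown, the IVW value above cannot be recomputed from these rows alone.

```python
# name -> (exposure.beta, outcome.beta), copied from the printed rows of the table
rows = {
    "snp_1": (0.0260, 0.0677),
    "snp_2": (-0.0440, -0.1625),
    "snp_27": (0.0090, 0.0000),
    "snp_28": (-0.0360, 0.0198),
}

# Ratio estimate for each genetic variant: outcome slope / exposure slope.
ratio = {name: beta_y / beta_x for name, (beta_x, beta_y) in rows.items()}

for name, value in ratio.items():
    print(f"{name}: {value:.4f}")
```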

With the method proposed in this paper, considering sample sizes of

n = 37

,

n_{1} = 17

,

n_{2} = 10

, and

n_{3} = 10

, and contamination parameters

ϵ = 0.05

,

g_{i 1} = 1

, and

g_{i 2} = 1.5

, the following is obtained for the first DNA marker:

μ_{s} = μ_{1} + \dots + μ_{n_{j}} = 30 \times (0.0677 - a \times 0.0260)

σ_{i}^{2} = 0.0286 \times 37 = 1.0582

σ_{s} = \sqrt{1.0582 \times 37} = 6.257268

μ_{s}^{g_{i 1}} = μ_{s}

and

σ_{s}^{g_{i 2}} = \sqrt{σ_{1}^{2} + \dots + σ_{i - 1}^{2} + g_{i 2} σ_{i}^{2} + σ_{i + 1}^{2} + \dots + σ_{n_{j}}^{2}}

= \sqrt{1.0582 \times 36 + 1.5 \times 1.0582} = 6.299405

M d M_{1} = 2.59

v_{1} = \frac{1}{F_{\bar{x}}^{- 1} (3 / 4) - F_{\bar{x}}^{- 1} (1 / 4)} = \frac{1}{8.08 - (- 2.877)} = 0.0912 .
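The arithmetic of the scale quantities and of the weight for the first marker can be retraced with a few lines. The sketch only repeats the numbers printed above; the variable names are illustrative.

```python
import math

n = 37
sigma_i2 = 0.0286 * n                                        # 1.0582
sigma_s = math.sqrt(sigma_i2 * n)                            # about 6.2573
sigma_s_g = math.sqrt(sigma_i2 * (n - 1) + 1.5 * sigma_i2)   # about 6.2994
v_1 = 1.0 / (8.08 - (-2.877))                                # about 0.09127

print(sigma_i2, sigma_s, sigma_s_g, v_1)
```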

For all the 28 DNA markers, we have

\begin{matrix} M d M_{j} & 2.59 & 3.70 & 2.78 & 2.71 & 4.93 & \dots \\ v_{j} & 0.091 & 0.151 & 0.128 & 0.088 & 0.068 & \dots \end{matrix}

which are combined in the new robust estimate

R I V W = 2.042703 .
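The combination step itself is a weighted mean, as the following sketch shows. It uses only the five pairs printed above, so the result is a partial illustration and is not expected to equal the value obtained from all 28 markers; the function name `rivw` is hypothetical.

```python
def rivw(mdm_values, v_weights):
    """Weighted mean sum(v_j * MdM_j) / sum(v_j)."""
    return sum(v * m for v, m in zip(v_weights, mdm_values)) / sum(v_weights)


# First five (MdM_j, v_j) pairs as printed; the remaining 23 are not listed.
mdm_listed = [2.59, 3.70, 2.78, 2.71, 4.93]
v_listed = [0.091, 0.151, 0.128, 0.088, 0.068]

partial = rivw(mdm_listed, v_listed)
print(partial)
```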

7. Conclusions

In this paper, a new method for estimating the parameters in a location–scale contamination model is introduced, in the case where individual observations are not available and, therefore, applying the usual robust methods is not possible, i.e., in summary data problems.

For the location problem, a new estimator was defined that is equal to the usual sample mean when no outliers exist and corrects the classical estimations when outliers exist.

This new estimator was applied to one of the most used estimators in Mendelian randomization, the inverse-variance weighted estimator (IVW), defining a new estimator, the robust inverse weighted estimator (RIVW).

Funding

This study was partially supported by grant PID2021-124933NB-I00 from the Ministerio de Ciencia e Innovación (Spain).

Data Availability Statement

Not applicable.

Acknowledgments

The author is very grateful to the referees and to the assistant editor for their kind and professional remarks.

Conflicts of Interest

The author declares no conflict of interest.

References

Huber, P.J.; Ronchetti, E.M. Robust Statistics, 2nd ed.; John Wiley & Sons: New York, NY, USA, 2009.
Lehmann, E.L. Theory of Point Estimation; John Wiley & Sons: New York, NY, USA, 1983.
Hampel, F.R.; Ronchetti, E.M.; Rousseeuw, P.J.; Stahel, W.A. Robust Statistics. The Approach Based on Influence Functions; John Wiley & Sons: New York, NY, USA, 1986.
Basford, K.E.; McLachlan, G.J. Likelihood estimation with normal mixture models. Appl. Statist. 1985, 34, 282–289.
Berckmoes, B.; Molenberghs, G. On the asymptotic behavior of the contaminated sample mean. Math. Methods Stat. 2018, 27, 312–323.
Rodríguez, D.; Valdora, M. The breakdown point of the median of means tournament. Stat. Probab. Lett. 2019, 153, 108–112.
García-Pérez, A. Saddlepoint approximations for the distribution of some robust estimators of the variogram. Metrika 2020, 83, 69–91.
García-Pérez, A. New robust cross-variogram estimators and approximations for their distributions based on saddlepoint techniques. Mathematics 2021, 9, 762.
Serfling, R.J. Approximation Theorems of Mathematical Statistics; John Wiley & Sons: New York, NY, USA, 1980.
Withers, C.S. Expansions for the distribution and quantiles of a regular functional of the empirical distribution with applications to nonparametric confidence intervals. Ann. Stat. 1983, 11, 577–587.
Hampel, F.R. The Influence Curve and its role in robust estimation. J. Am. Statist. Assoc. 1974, 69, 383–393.
Field, C.A.; Ronchetti, E. A tail area influence function and its application to testing. Seq. Anal. 1985, 4, 19–41.
Lugannani, R.; Rice, S. Saddle point approximation for the distribution of the sum of independent random variables. Adv. Appl. Probab. 1980, 12, 475–490.
Jensen, J.L. Saddlepoint Approximations; Clarendon Press: Oxford, UK, 1995.
Daniels, H.E. Saddlepoint approximations for estimating equations. Biometrika 1983, 70, 89–96.
García-Pérez, A. Another look at the Tail Area Influence Function. Metrika 2011, 73, 77–92.
García-Pérez, A. A linear approximation to the power function of a test. Metrika 2012, 75, 855–875.
García-Pérez, A. A von Mises approximation to the small sample distribution of the trimmed mean. Metrika 2016, 79, 369–388.
Gómez-Olivencia, A. The presacral spine of the La Ferrassie 1 Neandertal: A revised inventory. Bull. Mém. Soc. Anthropol. Paris 2013, 25, 19–38.
Hartwig, F.P.; Davey Smith, G.; Bowden, J. Robust inference in summary data Mendelian randomization via the zero modal pleiotropy assumption. Int. J. Epidemiol. 2017, 46, 1985–1998.
Slob, E.A.W.; Burgess, S. A comparison of robust Mendelian randomization methods using summary data. Genet. Epidemiol. 2020, 44, 313–329.
Pires, A.M.; Branco, J.A. Partial influence functions. J. Multivar. Anal. 2002, 83, 451–468.
Ronchetti, E. Accurate and robust inference. Econom. Stat. 2020, 14, 74–88.
Waterworth, D.M.; Ricketts, S.L.; Song, K.; Chen, L.; Zhao, J.H.; Ripatti, S.; Aulchenko, Y.S.; Zhang, W.; Yuan, X.; Lim, N.; et al. Genetic Variants Influencing Circulating Lipid Levels and Risk of Coronary Artery Disease. Arterioscler. Thromb. Vasc. Biol. 2011, 30, 2264–2276.

Figure 1. Differences between $MdM$ and $\bar{x}$ as $ϵ$ increases.

Figure 2. Differences between $MdM$ and $\bar{x}$ as $g_{1}$ increases.

Figure 3. Differences between $MdM$ and $\bar{x}$ as $g_{2}$ increases.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
