Article

A New Estimator: Median of the Distribution of the Mean in Robustness

by
Alfonso García-Pérez
Departamento de Estadística, I.O. y C.N., Universidad Nacional de Educación a Distancia (UNED), Paseo Senda del Rey 9, 28040 Madrid, Spain
Mathematics 2023, 11(12), 2694; https://doi.org/10.3390/math11122694
Submission received: 17 May 2023 / Revised: 6 June 2023 / Accepted: 12 June 2023 / Published: 14 June 2023

Abstract: In some statistical methods, the available information consists of the values taken by classical estimators, such as the sample mean and sample variance. In a second stage, these estimates are usually combined in a classical manner into a single value, such as a weighted mean. Moreover, many applied studies report their results in these terms, i.e., as summary data. In all of these cases, the individual observations are unknown; therefore, it is not possible to compute the usual robust estimators from them in order to replace the classical non-robust estimates with robust ones. In this paper, the use of the median of the distribution F_x̄ of the sample mean is proposed, assuming a location–scale contaminated normal model, where the parameters of F_x̄ are estimated with the classical estimates provided in the first stage. The estimator so defined is called the median of the distribution of the mean, MdM. This new estimator is applied in Mendelian randomization, defining the new robust inverse-weighted estimator, RIVW.

1. Introduction

In the application of some statistical methods, such as clinical trials, the results are usually described in terms of the values taken by classical estimators, such as the sample mean and sample variance. These results are combined, in a second stage, as a weighted mean in a meta-analysis. The same occurs in its alternative, Mendelian randomization, one of the main topics in causal inference.
Moreover, many applied studies describe their results in these terms, i.e., as summary data; the individual observations are not available, so robust estimators cannot be computed from them to replace the classical non-robust estimates with robust ones.
In this paper, a solution to this problem is proposed in which the given classical estimates are corrected when necessary: although the individual observations are unknown, the mechanism that generates the data is known, because it is the assumed model.
Focusing on the mean estimation problem, the optimal estimator (the uniformly minimum variance unbiased estimator) is the sample mean when no outliers exist in the sample and the normal distribution N(μ, σ²) is assumed as the model, with μ and σ² being the usual parameters of the normal distribution, the population mean and variance. Assume that a proportion ε of outliers exists in the sample, i.e., a contaminated normal model (see [1], p. 2)

$$ (1-\epsilon)\, N(\mu, \sigma^2) + \epsilon\, N(g_1 \mu,\, g_2^2 \sigma^2) , $$
where most of the data come from a N(μ, σ²) and a small proportion of them, ε, come from a normal model with a different location and more dispersion, N(g₁μ, g₂²σ²), where g₁ is a contamination parameter that affects the location and g₂ is a contamination parameter that affects the scale. The optimality of the sample mean is then lost because the optimal procedure and its properties heavily depend on the assumed probability model ([2], p. 2). This is the reason why classical statistics rests, basically, on the normal model and on the sample mean.
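As a concrete illustration, the model above can be sampled by choosing, for each observation, which component it comes from. A minimal Python sketch (function and parameter names are illustrative, not from the paper):

```python
import random

def rcontam(n, mu, sigma, eps, g1, g2, rng=random):
    """Draw n observations from (1 - eps) N(mu, sigma^2) + eps N(g1*mu, g2^2*sigma^2).

    Each observation comes from the contaminating component with probability eps,
    which is how a mixture model generates data."""
    return [rng.gauss(g1 * mu, g2 * sigma) if rng.random() < eps
            else rng.gauss(mu, sigma)
            for _ in range(n)]

random.seed(1)
sample = rcontam(1000, mu=2.0, sigma=1.0, eps=0.2, g1=3.0, g2=1.0)
# The mixture mean is (1 - eps)*mu + eps*g1*mu = 0.8*2 + 0.2*6 = 2.8, so the
# sample mean is pulled away from the central location mu = 2 by the outliers.
print(sum(sample) / len(sample))
```

With ε = 0.2 and g₁ = 3, the sample mean drifts toward 2.8, while the sample median stays closer to μ = 2; this robustness gap is what the MdM estimator exploits.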
Additionally, under a contaminated normal model, the robustness of the sample mean is lost [1,3]. Under this model, the sample mean is not the maximum likelihood estimator [4], and even the normality of the sample mean is not guaranteed [5].
In this paper, a new estimator for a location–scale contaminated normal model is proposed, avoiding the extreme sensitivity of the sample mean but coinciding with it when no outliers are present in the data. The median of the distribution F_x̄ of the sample mean is proposed as a new estimator, where the parameters of F_x̄ are estimated with the classical estimates described in previous studies. This estimator is called the median of the distribution of the mean, MdM.
The two reasons why this new estimator relies on the distribution of the sample mean are that, first, the classical estimates are given in terms of the classical mean (and classical variance) and, second, this new procedure extends the classical one in the sense that, if no outliers are present, this new estimator is the classical sample mean; i.e., with this method, the classical estimation is extended to the case in which outliers are present.
Another estimator somewhat related to MdM is the median of means estimator, MoM. However, this estimator is, in the end, one of the sample means and, hence, is not robust (see [6]).
With the MdM, robustness and optimality are obtained if there are no outliers. Hence, with this approach, a new vision of the dilemma between optimality and robustness is provided.
Because the exact sampling distribution of x̄ under a mixture distribution is not known, it is approximated here in closed form with the von Mises (VOM) plus saddlepoint (SAD) method, a technique used by the author in several studies (see, for instance, [7,8]) but in another context. With this approximation, the estimator introduced in this paper can also be extended to models more general than the normal mixture considered here.
The rest of the paper is structured as follows. In Section 2, the VOM+SAD approximation for the distribution of the sample mean is obtained under a location–scale contaminated normal model. The definition and some properties of this new location estimator are considered in Section 3; a scale estimator based on these ideas is defined in Section 4; and an example of the application of this new estimator is considered in Section 5. These ideas are applied to Mendelian randomization in Section 6. Some conclusions are outlined in Section 7.

2. VOM+SAD Approximation of Sample Mean Distribution

Because the new estimator depends on the distribution of the sample mean, the approximation to this distribution must be very accurate, especially when the sample sizes considered are very small. For this situation, using a von Mises expansion ([9], p. 215, or [10], p. 578) that depends on Hampel's influence function [11] is highly recommended.
Although, in the end, the obtained results will be applied to the mixture of normals model considered previously, they refer to more general models, F, G, and H, which indicates possible future extensions of this method.
The final approximation is called VOM+SAD and was previously obtained by the author in the context of spatial data (see [7,8]). Following the ideas developed in those two papers and considering the tail probability functional, the approximation initially obtained is

$$ P_F\{T_n > t\} \approx P_G\{T_n > t\} + \int \mathrm{TAIF}(x;\, t;\, T_n,\, G)\, dF(x) , $$
which allows the approximation of the distribution of T n when the observations follow model F by the distribution of T n when the variables of T n follow model G (pivotal distribution).
This approximation depends on the tail area influence function, TAIF, defined in [12].
Restricting this approximation to M estimators with a monotonic decreasing score function ψ (see [1], p. 46) and using the Lugannani and Rice formula ([13], or [14], p. 77, or [1], p. 314) to obtain a saddlepoint approximation for the TAIF, such as the approximation given in [15] (p. 94) for M estimators, the VOM+SAD approximation obtained is

$$ P_F\{T_n > t\} \approx P_G\{T_n > t\} + \frac{\phi(s)}{r_1\, n^{1/2}} \int \left( \frac{e^{z_0 \psi(x,t)}}{\int e^{z_0 \psi(y,t)}\, dG(y)} - 1 \right) dF(x) . $$
In the case of a location–scale mixture normal model, the framework considered in this paper, i.e., assuming that Z_i ∼ (1−ε)N(μ, σ²) + εN(g₁μ, g₂²σ²), the VOM+SAD approximation is

$$ P_F\{T_n > t\} \approx P_G\{T_n > t\} + \epsilon\, \frac{\phi(s)}{r_1 \sqrt{n}} \left( \frac{\int e^{z_0 \psi(x,t)}\, dH(x)}{\int e^{z_0 \psi(y,t)}\, dG(y)} - 1 \right) , \qquad (1) $$

where G = N(μ, σ²) and H = N(g₁μ, g₂²σ²).

VOM+SAD Approximation for the Distribution of the Sample Mean

In the particular case of the sample mean, the score function is ψ(x, t) = x − t. Remember that in the VOM+SAD approximation, the saddlepoint is computed under G = N(μ, σ²). Under this pivotal distribution,

$$ K(\lambda, t) = \log \int e^{\lambda (y-t)}\, \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(y-\mu)^2}\, dy = \frac{\sigma^2 \lambda^2}{2} + \lambda (\mu - t) . $$

Hence, from the saddlepoint equation K′(z₀, t) = 0, the saddlepoint z₀ = (t − μ)/σ² is obtained.
Additionally, K(z₀, t) = −(t − μ)²/(2σ²), φ(s) = φ(√n (t − μ)/σ), r₁ = (t − μ)/σ, and K″(λ, t) = σ². The leading term is P_G{T_n > t} = 1 − Φ(√n (t − μ)/σ), and the quotient in the last term on the right side of (1) is

$$ \frac{\int e^{z_0 \psi(x,t)}\, dH(x)}{\int e^{z_0 \psi(y,t)}\, dG(y)} = \exp\left\{ (g_1 - 1)\, \mu z_0 + \tfrac{1}{2} (g_2^2 - 1)\, \sigma^2 z_0^2 \right\} . $$

Hence, the VOM+SAD approximation (1) is

$$ P_F\{\bar{x} > t\} \approx 1 - \Phi(\sqrt{n}\, \sigma z_0) + \frac{\epsilon}{\sqrt{n}\, \sigma z_0}\, \phi(\sqrt{n}\, \sigma z_0) \left( e^{(g_1 - 1) \mu z_0 + \frac{1}{2} (g_2^2 - 1) \sigma^2 z_0^2} - 1 \right) . $$
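The closed-form approximation above is straightforward to evaluate numerically. The following Python sketch (standard library only; the function names are mine, and the z₀ = 0 case is handled by its analytic limit) codes it directly:

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def tail_mean(t, n, mu, sigma, eps, g1, g2):
    """VOM+SAD approximation of P_F{xbar > t} under the contaminated model
    F = (1 - eps) N(mu, sigma^2) + eps N(g1*mu, g2^2*sigma^2)."""
    z0 = (t - mu) / sigma**2                 # saddlepoint under G = N(mu, sigma^2)
    a = math.sqrt(n) * sigma * z0
    if z0 == 0.0:
        # analytic limit of the correction term as z0 -> 0
        return 0.5 + eps * phi(0.0) * (g1 - 1.0) * mu / (math.sqrt(n) * sigma)
    ratio = math.exp((g1 - 1.0) * mu * z0 + 0.5 * (g2**2 - 1.0) * (sigma * z0)**2)
    return 1.0 - Phi(a) + eps * phi(a) / a * (ratio - 1.0)

# With eps = 0 the approximation reduces to the exact normal tail of the mean,
# 1 - Phi(sqrt(n) (t - mu) / sigma).
print(tail_mean(2.5, 11, 2.0, 1.0, 0.0, 3.0, 1.5))
print(tail_mean(2.5, 11, 2.0, 1.0, 0.2, 3.0, 1.5))   # contamination fattens this upper tail
```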
If distributions F and G are not close enough, intermediate distributions can be considered, as in [16,17,18], to obtain a more accurate approximation.

3. Estimator Median of the Distribution of the Mean

If the previous distribution of the mean is

$$ F_{\bar{x}}(x) = 1 - P_F\{\bar{x} > x\} , $$

the median of this distribution, i.e., F_x̄⁻¹(1/2), is called the median of the distribution of the mean, MdM; i.e., this estimator is the solution of

$$ F_{\bar{x}}(MdM) = \frac{1}{2} . $$
The parameters of F_x̄ are estimated with the classical estimates, the sample mean x̄ and the sample variance s².
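Since F_x̄ is available in closed form from the VOM+SAD approximation of Section 2, MdM can be computed by simple one-dimensional root finding. A self-contained Python sketch (all names are illustrative; in practice μ and σ are replaced by the reported x̄ and s):

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def tail_mean(t, n, mu, sigma, eps, g1, g2):
    """VOM+SAD approximation of P_F{xbar > t} (Section 2)."""
    z0 = (t - mu) / sigma**2
    a = math.sqrt(n) * sigma * z0
    if z0 == 0.0:
        return 0.5 + eps * phi(0.0) * (g1 - 1.0) * mu / (math.sqrt(n) * sigma)
    ratio = math.exp((g1 - 1.0) * mu * z0 + 0.5 * (g2**2 - 1.0) * (sigma * z0)**2)
    return 1.0 - Phi(a) + eps * phi(a) / a * (ratio - 1.0)

def MdM(n, mu, sigma, eps, g1, g2):
    """Median of the distribution of the mean: solves F_xbar(m) = 1/2,
    i.e. tail_mean(m) = 1/2, by bisection on a wide bracket."""
    lo, hi = mu - 10.0 * sigma, mu + 10.0 * sigma
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tail_mean(mid, n, mu, sigma, eps, g1, g2) > 0.5:
            lo = mid          # more than half the mass lies above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# With no outliers (eps = 0), MdM coincides with the classical sample mean.
print(MdM(11, 2.0, 1.0, 0.0, 3.0, 1.5))   # -> 2.0
```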
Figure 1, Figure 2 and Figure 3 show that, as the contamination parameters ε, g₁, or g₂ increase, the difference between MdM and x̄ (i.e., z₀) increases.
The main reason for the definition of MdM is that the median is more robust than the sample mean; hence, the influence of possible outliers (whose individual values are unknown, as assumed here) should be lower with the median of the distribution of the mean than with the sample mean, which is used in this distribution as an estimator of the location parameter. Furthermore, in the case without outliers, this estimator is equal to the classical sample mean.
As a limitation, observe that MdM is also sensitive to outliers that already affect the sample mean or sample variance used in the estimation of the location parameter μ or the scale parameter σ². Nevertheless, with MdM, this sensitivity is lower.
One way to check the behavior of MdM with respect to x̄ in a simple numerical example is to run the R sentences

> k <- rbinom(11, 1, 0.2)   # component indicator: 1 with probability eps = 0.2
> x <- (1 - k) * rnorm(11, 2, 1) + k * rnorm(11, 3 * 2, 1)
> mean(x)
> median(x)

in which we consider a random sample of n = 11 observations from the normal mixture 0.8 N(2, 1) + 0.2 N(3·2, 1), i.e., a sample where ε = 0.2, g₁ = 3, and g₂ = 1. (Note that each observation must be drawn from one of the two components; the weighted sum 0.8·rnorm(11, 2, 1) + 0.2·rnorm(11, 6, 1) would not generate a sample from the mixture.)
Finally, in future research, other robust estimators could be considered, such as the trimmed mean of the distribution of the sample mean.

4. Dispersion Estimator

With the ideas developed in this paper, a dispersion estimator should be the interquartile range of the distribution of the mean,

$$ F_{\bar{x}}^{-1}(3/4) - F_{\bar{x}}^{-1}(1/4) . $$
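The quartiles of F_x̄ can be obtained with the same root-finding device used for the median. A self-contained Python sketch (assuming the VOM+SAD tail of Section 2; function names are illustrative):

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def tail_mean(t, n, mu, sigma, eps, g1, g2):
    """VOM+SAD approximation of P_F{xbar > t} (Section 2)."""
    z0 = (t - mu) / sigma**2
    a = math.sqrt(n) * sigma * z0
    if z0 == 0.0:
        return 0.5 + eps * phi(0.0) * (g1 - 1.0) * mu / (math.sqrt(n) * sigma)
    ratio = math.exp((g1 - 1.0) * mu * z0 + 0.5 * (g2**2 - 1.0) * (sigma * z0)**2)
    return 1.0 - Phi(a) + eps * phi(a) / a * (ratio - 1.0)

def quantile(p, n, mu, sigma, eps, g1, g2):
    """F_xbar^{-1}(p): bisection on F_xbar(t) = 1 - tail_mean(t) = p."""
    lo, hi = mu - 10.0 * sigma, mu + 10.0 * sigma
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tail_mean(mid, n, mu, sigma, eps, g1, g2) > 1.0 - p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def dispersion(n, mu, sigma, eps, g1, g2):
    """Interquartile-range dispersion estimator F^{-1}(3/4) - F^{-1}(1/4)."""
    args = (n, mu, sigma, eps, g1, g2)
    return quantile(0.75, *args) - quantile(0.25, *args)

# Without contamination this is the IQR of N(mu, sigma^2/n),
# i.e. 2 * 0.6744898 * sigma / sqrt(n).
print(dispersion(11, 2.0, 1.0, 0.0, 1.0, 1.0))
```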

5. Example

In most applied papers, only the final values of the estimators used are reported. These estimators are usually the classical sample mean and sample variance, and the individual observations from which they were obtained are not included; therefore, there is no opportunity to robustify these values using robust techniques.
For this reason, a large number of examples could serve as an illustration of the estimator defined in this paper. Next, let us consider just one.
Example 1.
One of these studies is [19], in which the vertebral columns and thoraxes of some Neanderthal fossils were re-evaluated using their vertebrae because, as the author states, errors probably occurred in the reconstruction and some samples were wrongly classified. He mentions ([19], p. 23) a misclassification of 7/33, which can be considered as the value of the contamination parameter ε.
Because modern humans and Neanderthals have very similar vertebrae, no difference in the mean is assumed, using, hence, a distortion factor g₁ = 1. On the other hand, Neanderthals are slightly stockier than modern humans, so the contaminating component is assumed to have a larger dispersion, with g₂ = 1.5.
In Table 2 in [19], classical acceptance confidence intervals are provided for several vertebrae of 28 modern humans. They are based on the classical mean and variance, as the author states in this table. From the table, with respect to vertebra T1, the remains of Kebara 2 and La Ferrassie could be considered as modern humans instead of Neanderthals because they are inside the confidence interval. The same happens with vertebra T7 but not with vertebra T5.
From these classical intervals, for vertebra T1, the classical sample mean and standard deviation are x̄ = 16.6 and S = 3.61, respectively. In this case, the estimator median of the distribution of the mean takes the value MdM = 15.034, yielding a new robust acceptance confidence interval equal to [13.63, 16.43], which does not contain the remains; the conclusion is then that these remains are Neanderthal and not modern human, as they were wrongly considered to be with the classical estimators.
With respect to vertebra T5, MdM = 17.54, and the new robust acceptance confidence interval is [15.84, 19.24], with neither the classical nor the robust interval including the remains, confirming that they are Neanderthal.
Finally, for vertebra T7, MdM = 19.43, and the new robust acceptance confidence interval is [17.73, 21.13]; both this robust interval and the previous classical one include the remains of La Ferrassie, which are hence classified as modern human and not Neanderthal.

6. Robust Inverse-Weighted Estimator RIVW in Mendelian Randomization

Another field for the class of problems considered in this paper is randomized clinical trials (CTs). In each of these CTs, the sample mean and sample variance are the usual final results. These are usually combined, in a classical way, as a weighted mean in a meta-analysis. In CTs, the relationship of a variable X (called the cause) with another variable Y (called the effect) is analyzed, but reverse causality may exist, randomization may be incomplete, or, more importantly, confounders may be present.
Moreover, CTs are expensive and take a long time. With Mendelian randomization (MR), a method that has received renewed interest in recent years, CTs are imitated because, in any person, all genetic material is randomly allocated from their parents, including DNA markers. At random, some people receive more DNA markers related to variable X and others fewer. MR uses genetic variants (usually single-nucleotide polymorphisms (SNPs)) as instrumental variables Z.
Mathematically, MR is used to avoid possible biases in the regression of Y on X due to the three causes just mentioned. Formally, MR leads to a two-step linear regression process: first, for every genetic variant Z_j, j = 1, …, L, a linear regression of X on Z_j is performed, where, for individuals i = 1, …, n_j,
$$ X_i \mid Z_{ij} = \beta_{X0} + \beta_{Xj} Z_{ij} + e_{Xij} , $$
from which the fitted values X̂_i are obtained and used in a second regression of Y on these X̂, finally obtaining [20]

$$ Y_i \mid Z_{ij} = \beta_{Y0} + (\beta \cdot \beta_{Xj} + \alpha_j)\, Z_{ij} + e_{Yij} = \beta_{Y0} + \beta_{Yj} Z_{ij} + e_{Yij} , $$
where β_Xj and β_Yj represent the association of Z_j with the exposure and with the outcome (only through X), respectively. The parameter β·β_Xj represents the effect of Z_j on Y through X, where β is the causal effect of X on Y that is being estimated. Moreover, α_j represents the association between Z_j and Y not through the exposure of interest. Finally, the error terms e_Xij and e_Yij are assumed to be independent because independent samples are assumed to be used to fit the two previous regression models.
In MR, the standard estimator of the parameter of interest β, the slope in the linear regression of Y on X, is the classical two-stage least squares estimator

$$ \hat{\beta}_{Rj} = \frac{\hat{\beta}_{Yj}}{\hat{\beta}_{Xj}} , $$

which is the quotient of the slope estimator of the regression of Y on Z_j, β̂_Yj, and the slope estimator of the regression of X on Z_j, β̂_Xj. These classical estimates, one for each value of the instrumental variable Z_j, are combined with the classical inverse-variance weighted (IVW) estimator

$$ \mathrm{IVW} = \frac{\sum_{j=1}^{L} \omega_j\, \hat{\beta}_{Rj}}{\sum_{j=1}^{L} \omega_j} , $$

where the weights ω_j = 1/var(β̂_Rj) are used to weight the β̂_Rj estimators, assuming that the L genetic variants are mutually independent. In this way, a single causal effect estimate is obtained from the L genetic instruments.
This classical and widely used estimator is not robust: being a weighted mean, it has a 0% breakdown point; see, for instance, [21].
In this section, a robustification of the classical IVW estimator is obtained, first, by replacing the estimators β̂_Rj with median of the distribution of the mean estimators MdM_j and, second, by replacing the weights ω_j with v_j, the inverse of the new dispersion estimator,

$$ v_j = \frac{1}{F_{\bar{x}}^{-1}(3/4) - F_{\bar{x}}^{-1}(1/4)} , $$

defining the new estimator, based on the distribution of β̂_Rj, as

$$ \mathrm{RIVW} = \frac{\sum_{j=1}^{L} v_j\, MdM_j}{\sum_{j=1}^{L} v_j} . $$
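Both IVW and RIVW are precision-weighted means, so the combination step itself is a one-liner. A Python sketch, illustrated with the first three (MdM_j, v_j) pairs reported in Example 2 below (the paper combines all 28 markers, so the value computed here is only illustrative):

```python
def weighted_mean(estimates, weights):
    """Precision-weighted mean: sum_j w_j * b_j / sum_j w_j.

    With weights 1/var(beta_Rj) this is the classical IVW; with the robust
    pairs (MdM_j, v_j) it is the RIVW."""
    return sum(w * b for w, b in zip(weights, estimates)) / sum(weights)

# First three robust pairs from Example 2 (illustrative subset of the 28 markers)
mdm = [2.59, 3.70, 2.78]
v = [0.091, 0.151, 0.128]
print(weighted_mean(mdm, v))
```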

6.1. Distribution of the β̂_Rj Estimator

In this section, an approximation of the distribution of β̂_Rj is obtained for each genetic variant Z_j, j = 1, …, L; i.e., j is fixed. Moreover, because of the usual regression assumptions, Z_ij is not random in the two previous linear regressions, i.e., in the estimator β̂_Rj.
Hence, with μ_Xi denoting the constant

$$ \mu_{X_i} = \beta_{X0} + \beta_{Xj} Z_{ij} , $$

omitting the j in the notation of μ_Xi to simplify it, and with μ_Yi being the constant

$$ \mu_{Y_i} = \beta_{Y0} + \beta_{Yj} Z_{ij} , $$

and assuming no outliers in the sample, the variable X_i | Z_ij follows a normal distribution,

$$ X_i \mid Z_{ij} \sim N(\mu_{X_i}, \sigma_{X_i}^2) , $$

and

$$ Y_i \mid Z_{ij} \sim N(\mu_{Y_i}, \sigma_{Y_i}^2) . $$
The estimator β̂_Rj is equal to

$$ \hat{\beta}_{Rj} = \frac{\hat{\beta}_{Yj}}{\hat{\beta}_{Xj}} $$

and, considering standardized data, i.e., computing β̂_Rj as a quotient of correlations,

$$ \hat{\beta}_{Rj} = \frac{\sum_{i=1}^{n_j} Y_i Z_{ij}}{\sum_{i=1}^{n_j} X_i Z_{ij}} , $$

its tail distribution is

$$
\begin{aligned}
P\{\hat{\beta}_{Rj} > a\} &= P\left\{ \frac{\sum_{i=1}^{n_j} Y_i Z_{ij}}{\sum_{i=1}^{n_j} X_i Z_{ij}} > a \right\}
= P\left\{ \sum_{i=1}^{n_j} Y_i Z_{ij} - a \sum_{i=1}^{n_j} X_i Z_{ij} > 0 \right\} \\
&= P\left\{ \sum_{i=1}^{n_j} (Y_i - a X_i)\, Z_{ij} > 0 \right\} .
\end{aligned}
$$

Letting W_i (omitting the j when there is no risk of confusion) denote the random variable W_i = W_ij = (Y_i − a X_i) Z_ij, i = 1, …, n_j, where a and Z_ij are not random, the aim is to compute the distribution of the sample mean of the variables W_i at 0, i.e.,

$$ P\{\hat{\beta}_{Rj} > a\} = P\left\{ \sum_{i=1}^{n_j} W_i > 0 \right\} = P\{\bar{W} > 0\} , $$
where the W_i are independent but not identically distributed because

$$ W_i \mid Z_{ij} \sim N(\mu_i, \sigma_i^2), \quad i = 1, \ldots, n_j , $$

where

$$ \mu_i = (\mu_{Y_i} - a \cdot \mu_{X_i})\, Z_{ij} $$

and

$$ \sigma_i^2 = V\big( (Y_i - a \cdot X_i)\, Z_{ij} \big) , $$

which depends on σ_Xi² and σ_Yi². The values of these parameters are obtained from previous studies, following the median of the distribution of the mean method.
If the data contain no outliers, it will be

$$ W_i \mid Z_{ij} \sim N(\mu_i, \sigma_i^2) , $$

but, as usual, a proportion ε of outliers in the data is assumed, i.e., the following model for the observations W_i:

$$ F_i = (1-\epsilon)\, N(\mu_i, \sigma_i^2) + \epsilon\, N(g_{i1} \mu_i,\, g_{i2}^2 \sigma_i^2) , $$

where the contamination constants g_{i1} and g_{i2} are allowed to depend on i = 1, …, n_j.
To compute the distribution of W̄ under the models F_i, assuming that the sample sizes n_j are small, a von Mises approximation (VOM), based on a von Mises expansion, is used because it is accurate even for small sample sizes.

6.2. VOM Approximation of the Distribution

In general, to approximate the tail probability of a statistic T_n under a vector of model distributions F = (F₁, …, F_n), knowing its tail distribution under the vector of model distributions G = (G₁, …, G_n) (called pivotal distributions), the von Mises expansion of the tail probability of T_n(X₁, X₂, …, X_n) at F is used ([10], Section 2, or [22], Theorem 2.1, or [17], Corollary 2),

$$
\begin{aligned}
P_F\{T_n(X_1, \ldots, X_n) > t\} &= P_{F_1, \ldots, F_n}\{T_n(X_1, \ldots, X_n) > t\} \\
&= P_G\{T_n(X_1, \ldots, X_n) > t\} + \sum_{i=1}^{n} \int_{\mathcal{X}} \mathrm{TAIF}_i(x;\, t;\, T_n,\, G)\, dF_i(x) + Rem ,
\end{aligned}
$$

where the sample space is 𝒳 ⊂ ℝ^m,

$$ Rem = \frac{1}{2} \int_{\mathcal{X}} \int_{\mathcal{X}} T''_{G_F}(x_1, x_2)\, d[F(x_1) - G(x_1)]\, d[F(x_2) - G(x_2)] , $$

T″_{G_F} is the second derivative of the tail probability functional at the mixture distribution G_F = (1 − λ)G + λF for some λ ∈ [0, 1], and TAIF_i is the ith (multivariate) partial tail area influence function of T_n at G = (G₁, …, G_n) with respect to G_i, i = 1, …, n, introduced in [17], Definition 1,

$$ \mathrm{TAIF}_i(x;\, t;\, T_n,\, G) = \left. \frac{\partial}{\partial \epsilon}\, P_{G_i^{\epsilon, x}}\{T_n(X_1, \ldots, X_n) > t\} \right|_{\epsilon = 0} $$

in those x ∈ 𝒳 where the right-hand side exists. In the computation of TAIF_i, only G_i is contaminated; the other distributions remain fixed, i = 1, …, n.
In general, Rem is close to 0, and the von Mises approximation (VOM) is defined as

$$ P_F\{T_n(X_1, \ldots, X_n) > t\} \approx P_G\{T_n(X_1, \ldots, X_n) > t\} + \sum_{i=1}^{n} \int_{\mathcal{X}} \mathrm{TAIF}_i(x;\, t;\, T_n,\, G)\, dF_i(x) . \qquad (2) $$

Moreover, if F is a mixture distribution, F = (1 − ε)G + εH, then Rem = O(ε²) ([23], p. 77). Additionally, because of the properties of partial influence functions ([22], p. 3), which are also valid for the partial tail area influence functions defined in [17], for any T_n it holds that

$$ \int_{\mathcal{X}} \mathrm{TAIF}_i(x;\, t;\, T_n,\, G)\, dG_i(x) = 0 ; \qquad (3) $$
i.e., the integral of the TAIF_i, with respect to the model on which it depends, is equal to 0. Hence,

$$
\begin{aligned}
P_F\{T_n(X_1, \ldots, X_n) > t\} ={}& P_G\{T_n(X_1, \ldots, X_n) > t\} + (1-\epsilon) \sum_{i=1}^{n} \int_{\mathcal{X}} \mathrm{TAIF}_i(x;\, t;\, T_n,\, G)\, dG_i(x) \\
&+ \epsilon \sum_{i=1}^{n} \int_{\mathcal{X}} \mathrm{TAIF}_i(x;\, t;\, T_n,\, G)\, dH_i(x) + O(\epsilon^2) \\
={}& P_G\{T_n(X_1, \ldots, X_n) > t\} + 0 + \epsilon \sum_{i=1}^{n} \int_{\mathcal{X}} \mathrm{TAIF}_i(x;\, t;\, T_n,\, G)\, dH_i(x) + O(\epsilon^2) ;
\end{aligned}
$$

i.e., the VOM approximation is

$$ P_F\{T_n(X_1, \ldots, X_n) > t\} \approx P_G\{T_n(X_1, \ldots, X_n) > t\} + \epsilon \sum_{i=1}^{n} \int_{\mathcal{X}} \mathrm{TAIF}_i(x;\, t;\, T_n,\, G)\, dH_i(x) . $$
Moreover, because of Proposition 1 in [17],

$$ \mathrm{TAIF}_i(x;\, t;\, T_n,\, G) = -P_{G_1, \ldots, G_n}\{T_n(X_1, \ldots, X_n) > t\} + P_{G_1, \ldots, G_{i-1}, G_{i+1}, \ldots, G_n}\{T_n(X_1, \ldots, X_{i-1}, x, X_{i+1}, \ldots, X_n) > t\} , \qquad (4) $$

and the VOM approximation of the tail probability P_F{T_n(X₁, …, X_n) > t} can also be expressed as

$$
\begin{aligned}
P_F\{T_n(X_1, \ldots, X_n) > t\} \approx{}& (1-n)\, P_G\{T_n(X_1, \ldots, X_n) > t\} \\
&+ \int_{\mathcal{X}} P_{G_2, \ldots, G_n}\{T_n(x, X_2, \ldots, X_n) > t\}\, dF_1(x) \\
&+ \int_{\mathcal{X}} P_{G_1, G_3, \ldots, G_n}\{T_n(X_1, x, X_3, \ldots, X_n) > t\}\, dF_2(x) + \cdots \\
&+ \int_{\mathcal{X}} P_{G_1, \ldots, G_{n-1}}\{T_n(X_1, \ldots, X_{n-1}, x) > t\}\, dF_n(x) ,
\end{aligned}
$$

which allows an approximation of the tail probability P_F{T_n(X₁, …, X_n) > t} under the models F = (F₁, …, F_n), knowing the value of this tail probability under the near models G = (G₁, …, G_n).
In the particular case T_n(X₁, …, X_n) = W̄, the VOM approximation for the tail of W̄ can be expressed as (see (2), now with n = n_j, j = 1, …, L, and t = 0)

$$ P_F\{\bar{W} > 0\} \approx P_G\{W_1 + \cdots + W_{n_j} > 0\} + \sum_{i=1}^{n_j} \int_{\mathbb{R}} \mathrm{TAIF}_i(x;\, 0;\, \bar{W},\, G)\, dF_i(x) $$

or, from (4),

$$
\begin{aligned}
P_F\{\bar{W} > 0\} \approx{}& P_G\{W_1 + \cdots + W_{n_j} > 0\} \\
&+ \sum_{i=1}^{n_j} \int_{\mathbb{R}} \big[ -P_G\{W_1 + \cdots + W_{n_j} > 0\} \\
&\qquad + P_{G_1, \ldots, G_{i-1}, G_{i+1}, \ldots, G_{n_j}}\{W_1 + \cdots + W_{i-1} + x + W_{i+1} + \cdots + W_{n_j} > 0\} \big]\, dF_i(x)
\end{aligned}
$$

or

$$
\begin{aligned}
P_F\{\bar{W} > 0\} \approx{}& (1 - n_j)\, P_G\{W_1 + \cdots + W_{n_j} > 0\} \\
&+ \int_{\mathbb{R}} P_{G_2, \ldots, G_{n_j}}\{x + W_2 + \cdots + W_{n_j} > 0\}\, dF_1(x) \\
&+ \int_{\mathbb{R}} P_{G_1, G_3, \ldots, G_{n_j}}\{W_1 + x + W_3 + \cdots + W_{n_j} > 0\}\, dF_2(x) + \cdots \\
&+ \int_{\mathbb{R}} P_{G_1, \ldots, G_{i-1}, G_{i+1}, \ldots, G_{n_j}}\{W_1 + \cdots + W_{i-1} + x + W_{i+1} + \cdots + W_{n_j} > 0\}\, dF_i(x) + \cdots \\
&+ \int_{\mathbb{R}} P_{G_1, \ldots, G_{n_j - 1}}\{W_1 + \cdots + W_{n_j - 1} + x > 0\}\, dF_{n_j}(x) .
\end{aligned}
$$
If the model assumed for the observations W_i is

$$ F_i = (1-\epsilon)\, N(\mu_i, \sigma_i^2) + \epsilon\, N(g_{i1} \mu_i,\, g_{i2}^2 \sigma_i^2) , $$

and G_i^{g_{i1}, g_{i2}} ≡ N(g_{i1}μ_i, g_{i2}²σ_i²) denotes the contaminating distribution and G_i ≡ N(μ_i, σ_i²) the pivotal distribution, i = 1, …, n_j, i.e.,

$$ F_i = (1-\epsilon)\, G_i + \epsilon\, G_i^{g_{i1}, g_{i2}} , $$

the generic component of this last approximation is

$$
\begin{aligned}
&\int_{\mathbb{R}} P_{G_1, \ldots, G_{i-1}, G_{i+1}, \ldots, G_{n_j}}\{W_1 + \cdots + W_{i-1} + x + W_{i+1} + \cdots + W_{n_j} > 0\}\, dF_i(x) \\
&\quad = \int_{\mathbb{R}} P_{G_1, \ldots, G_{i-1}, G_{i+1}, \ldots, G_{n_j}}\{W_1 + \cdots + W_{i-1} + W_{i+1} + \cdots + W_{n_j} > -x\}\, dF_i(x) \\
&\quad = \int_{\mathbb{R}} \left[ 1 - \Phi\!\left( \frac{-x - \mu_{(-i)}}{\sigma_{(-i)}} \right) \right] dF_i(x) ,
\end{aligned}
$$
where Φ is the cumulative distribution function of the standard normal distribution,

$$ \mu_{(-i)} = \mu_1 + \cdots + \mu_{i-1} + \mu_{i+1} + \cdots + \mu_{n_j} $$

and

$$ \sigma_{(-i)}^2 = \sigma_1^2 + \cdots + \sigma_{i-1}^2 + \sigma_{i+1}^2 + \cdots + \sigma_{n_j}^2 . $$
If

$$ \mu_s = \mu_1 + \cdots + \mu_{n_j} = \mu_i + \mu_{(-i)} $$

and

$$ \sigma_s^2 = \sigma_1^2 + \cdots + \sigma_{n_j}^2 = \sigma_i^2 + \sigma_{(-i)}^2 , $$

then

$$ P_G\{W_1 + \cdots + W_{n_j} > 0\} = 1 - \Phi\!\left( -\frac{\mu_s}{\sigma_s} \right) $$

and

$$ P_F\{\hat{\beta}_{Rj} > a\} = P_F\left\{ \sum_{i=1}^{n_j} W_i > 0 \right\} = P_F\{\bar{W} > 0\} \approx 1 - \Phi\!\left( -\frac{\mu_s}{\sigma_s} \right) + \sum_{i=1}^{n_j} \int_{\mathbb{R}} \left[ \Phi\!\left( -\frac{\mu_s}{\sigma_s} \right) - \Phi\!\left( \frac{-x - \mu_{(-i)}}{\sigma_{(-i)}} \right) \right] dF_i(x) . \qquad (5) $$
Because F_i is a normal mixture,

$$ F_i = (1-\epsilon)\, G_i + \epsilon\, G_i^{g_{i1}, g_{i2}} , $$

the VOM approximation (5) becomes

$$ P_F\{\hat{\beta}_{Rj} > a\} \approx 1 - \Phi\!\left( -\frac{\mu_s}{\sigma_s} \right) + \epsilon \sum_{i=1}^{n_j} \int_{\mathbb{R}} \left[ \Phi\!\left( -\frac{\mu_s}{\sigma_s} \right) - \Phi\!\left( \frac{-x - \mu_{(-i)}}{\sigma_{(-i)}} \right) \right] dG_i^{g_{i1}, g_{i2}}(x) . $$
Moreover, because of property (3) for the partial influence functions mentioned before, it is

$$ \int_{\mathbb{R}} \left[ \Phi\!\left( -\frac{\mu_s}{\sigma_s} \right) - \Phi\!\left( \frac{-x - \mu_{(-i)}}{\sigma_{(-i)}} \right) \right] dG_i(x) = 0 , $$

or

$$ \int_{\mathbb{R}} \Phi\!\left( \frac{-x - \mu_{(-i)}}{\sigma_{(-i)}} \right) dG_i(x) = \Phi\!\left( -\frac{\mu_s}{\sigma_s} \right) . $$

Hence, making the change of variable (x + μ_{(-i)})/σ_{(-i)} = y, it is

$$ \int_{\mathbb{R}} \Phi\!\left( \frac{-x - \mu_{(-i)}}{\sigma_{(-i)}} \right) dG_i^{g_{i1}, g_{i2}}(x) = \Phi\!\left( -\frac{\mu_s^{g_{i1}}}{\sigma_s^{g_{i2}}} \right) , $$
where

$$ \mu_s^{g_{i1}} = \mu_1 + \cdots + \mu_{i-1} + g_{i1} \mu_i + \mu_{i+1} + \cdots + \mu_{n_j} $$

and

$$ \left( \sigma_s^{g_{i2}} \right)^2 = \sigma_1^2 + \cdots + \sigma_{i-1}^2 + g_{i2}^2 \sigma_i^2 + \sigma_{i+1}^2 + \cdots + \sigma_{n_j}^2 . $$
Then, the VOM approximation to the distribution of β̂_Rj is

$$ P\{\hat{\beta}_{Rj} > a\} \approx 1 - \Phi\!\left( -\frac{\mu_s}{\sigma_s} \right) + \epsilon \sum_{i=1}^{n_j} \left[ \Phi\!\left( -\frac{\mu_s}{\sigma_s} \right) - \Phi\!\left( -\frac{\mu_s^{g_{i1}}}{\sigma_s^{g_{i2}}} \right) \right] . $$
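The final approximation can be evaluated directly from the summary quantities μ_i and σ_i². A self-contained Python sketch (argument names are illustrative; the contamination constants are taken equal across i for brevity, as in Example 2):

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def tail_beta(mu, sig2, eps, g1, g2):
    """VOM approximation of P{beta_Rj > a} for one genetic variant, where
    mu[i] and sig2[i] are the mean and variance of W_i = (Y_i - a X_i) Z_ij
    for the chosen value of a."""
    mu_s = sum(mu)
    sigma_s = math.sqrt(sum(sig2))
    lead = 1.0 - Phi(-mu_s / sigma_s)
    corr = 0.0
    for mu_i, s2_i in zip(mu, sig2):
        mu_g = mu_s - mu_i + g1 * mu_i                       # mu_s with mu_i scaled by g1
        sigma_g = math.sqrt(sum(sig2) - s2_i + g2**2 * s2_i) # sigma_s with sigma_i^2 scaled by g2^2
        corr += Phi(-mu_s / sigma_s) - Phi(-mu_g / sigma_g)
    return lead + eps * corr

# With g1 = g2 = 1 the contamination has no effect, the correction vanishes,
# and the tail is exactly 1 - Phi(-mu_s / sigma_s).
print(tail_beta([1.0] * 5, [1.0] * 5, 0.3, 1.0, 1.0))
```

MdM_j is then the median of this distribution, obtained by solving P{β̂_Rj > a} = 1/2 in a, e.g., by bisection as in Section 3.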
Example 2.
In a study [24], whether low-density lipoprotein cholesterol (LDL-C, the exposure X) is a cause of coronary artery disease (CAD, the outcome Y) was analyzed considering 28 DNA markers:

     SNP      exposure.beta  exposure.se  outcome.beta  outcome.se
  1  snp_1         0.0260       0.004         0.0677      0.0286
  2  snp_2        -0.0440       0.004        -0.1625      0.0300
     ...........................................................
 27  snp_27        0.0090       0.003         0.0000      0.0255
 28  snp_28       -0.0360       0.007         0.0198      0.0647
Usually, Z_i ∼ B(2, 0.5) is assumed as the instrumental variable to mimic biallelic SNPs in Hardy–Weinberg equilibrium. The value

$$ \mathrm{IVW} = 2.834214 $$

was obtained.
With the method proposed in this paper, considering sample sizes of n = 37, n₁ = 17, n₂ = 10, and n₃ = 10, and contamination parameters ε = 0.05, g_{i1} = 1, and g_{i2} = 1.5, the following is obtained for the first DNA marker:

$$ \mu_s = \mu_1 + \cdots + \mu_{n_j} = 30 \times (0.0677 - a \times 0.0260) , $$

$$ \sigma_i^2 = 0.0286 \times 37 = 1.0582 , $$

$$ \sigma_s = \sqrt{1.0582 \times 37} = 6.257268 , $$

$$ \mu_s^{g_{i1}} = \mu_s , $$

and

$$ \sigma_s^{g_{i2}} = \sqrt{1.0582 \times 36 + 1.5 \times 1.0582} = 6.299405 , $$

from which

$$ MdM_1 = 2.59 \quad \text{and} \quad v_1 = \frac{1}{F_{\bar{x}}^{-1}(3/4) - F_{\bar{x}}^{-1}(1/4)} = \frac{1}{8.08 - (-2.877)} = 0.0912 . $$
For all the 28 DNA markers, we have

  MdM_j   2.59    3.70    2.78    2.71    4.93
  v_j     0.091   0.151   0.128   0.088   0.068
which are combined into the new robust estimate

$$ \mathrm{RIVW} = 2.042703 . $$

7. Conclusions

In this paper, a new method for estimating the parameters of a location–scale contamination model is introduced for the case in which the individual observations are not available and, therefore, applying the usual robust methods is not possible, i.e., for summary data problems.
For the location problem, a new estimator was defined that is equal to the usual sample mean when no outliers exist and that corrects the classical estimates when outliers exist.
This new estimator was applied to one of the most widely used estimators in Mendelian randomization, the inverse-variance weighted estimator (IVW), defining a new robust inverse-weighted estimator (RIVW).

Funding

This study was partially supported by grant PID2021-124933NB-I00 from the Ministerio de Ciencia e Innovación (Spain).

Data Availability Statement

Not applicable.

Acknowledgments

The author is very grateful to the referees and to the assistant editor for their kind and professional remarks.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Huber, P.J.; Ronchetti, E.M. Robust Statistics, 2nd ed.; John Wiley & Sons: New York, NY, USA, 2009.
2. Lehmann, E.L. Theory of Point Estimation; John Wiley & Sons: New York, NY, USA, 1983.
3. Hampel, F.R.; Ronchetti, E.M.; Rousseeuw, P.J.; Stahel, W.A. Robust Statistics. The Approach Based on Influence Functions; John Wiley & Sons: New York, NY, USA, 1986.
4. Basford, K.E.; McLachlan, G.J. Likelihood estimation with normal mixture models. Appl. Stat. 1985, 34, 282–289.
5. Berckmoes, B.; Molenberghs, G. On the asymptotic behavior of the contaminated sample mean. Math. Methods Stat. 2018, 27, 312–323.
6. Rodríguez, D.; Valdora, M. The breakdown point of the median of means tournament. Stat. Probab. Lett. 2019, 153, 108–112.
7. García-Pérez, A. Saddlepoint approximations for the distribution of some robust estimators of the variogram. Metrika 2020, 83, 69–91.
8. García-Pérez, A. New robust cross-variogram estimators and approximations for their distributions based on saddlepoint techniques. Mathematics 2021, 9, 762.
9. Serfling, R.J. Approximation Theorems of Mathematical Statistics; John Wiley & Sons: New York, NY, USA, 1980.
10. Withers, C.S. Expansions for the distribution and quantiles of a regular functional of the empirical distribution with applications to nonparametric confidence intervals. Ann. Stat. 1983, 11, 577–587.
11. Hampel, F.R. The influence curve and its role in robust estimation. J. Am. Stat. Assoc. 1974, 69, 383–393.
12. Field, C.A.; Ronchetti, E. A tail area influence function and its application to testing. Seq. Anal. 1985, 4, 19–41.
13. Lugannani, R.; Rice, S. Saddle point approximation for the distribution of the sum of independent random variables. Adv. Appl. Probab. 1980, 12, 475–490.
14. Jensen, J.L. Saddlepoint Approximations; Clarendon Press: Oxford, UK, 1995.
15. Daniels, H.E. Saddlepoint approximations for estimating equations. Biometrika 1983, 70, 89–96.
16. García-Pérez, A. Another look at the tail area influence function. Metrika 2011, 73, 77–92.
17. García-Pérez, A. A linear approximation to the power function of a test. Metrika 2012, 75, 855–875.
18. García-Pérez, A. A von Mises approximation to the small sample distribution of the trimmed mean. Metrika 2016, 79, 369–388.
19. Gómez-Olivencia, A. The presacral spine of the La Ferrassie 1 Neandertal: A revised inventory. Bull. Mém. Soc. Anthropol. Paris 2013, 25, 19–38.
20. Hartwig, F.P.; Smith, G.D.; Bowden, J. Robust inference in summary data Mendelian randomization via the zero modal pleiotropy assumption. Int. J. Epidemiol. 2017, 46, 1985–1998.
21. Slob, E.A.W.; Burgess, S. A comparison of robust Mendelian randomization methods using summary data. Genet. Epidemiol. 2020, 44, 313–329.
22. Pires, A.M.; Branco, J.A. Partial influence functions. J. Multivar. Anal. 2002, 83, 451–468.
23. Ronchetti, E. Accurate and robust inference. Econom. Stat. 2020, 14, 74–88.
24. Waterworth, D.M.; Ricketts, S.L.; Song, K.; Chen, L.; Zhao, J.H.; Ripatti, S.; Aulchenko, Y.S.; Zhang, W.; Yuan, X.; Lim, N.; et al. Genetic variants influencing circulating lipid levels and risk of coronary artery disease. Arterioscler. Thromb. Vasc. Biol. 2010, 30, 2264–2276.
Figure 1. Differences between MdM and x̄ as ε increases.
Figure 2. Differences between MdM and x̄ as g₁ increases.
Figure 3. Differences between MdM and x̄ as g₂ increases.
