Abstract
In some statistical methods, the statistical information is provided in terms of the values taken by classical estimators, such as the sample mean and sample variance. These estimates are used in a second stage, usually in a classical manner, to be combined into a single value, such as a weighted mean. Moreover, in many applied studies, the results are given in these terms, i.e., as summary data. In all of these cases, the individual observations are unknown; therefore, the usual robust estimators cannot be computed from them to replace the classical non-robust estimates. In this paper, the use of the median of the distribution of the sample mean is proposed, assuming a location-scale contaminated normal model, where the parameters of this distribution are estimated with the classical estimates provided in the first stage. The estimator so defined is called the median of the distribution of the mean. This new estimator is applied in Mendelian randomization, defining the new robust inverse weighted estimator, RIVW.
Keywords: robust statistics; von Mises expansions; saddlepoint approximations; Mendelian randomization
MSC: 62F35; 62E17; 62P99
1. Introduction
In the application of some statistical methods, such as clinical trials, the results are usually described in terms of the values taken by classical estimators, such as the sample mean and sample variance. These results are combined, in a second stage, as a weighted mean in a meta-analysis. The same occurs in its alternative, Mendelian randomization, one of the main topics in causal inference.
Moreover, in many applied studies, the results are reported only in these terms, i.e., as summary data. Because the individual observations are unknown, robust estimators cannot be computed from them to replace the classical non-robust estimates.
In this paper, a solution to this problem is proposed that corrects, if necessary, the given classical estimates: although the individual observations are unknown, the mechanism that generates the data is known, because it is the assumed model.
Focusing on the mean estimation problem, the optimal estimator (uniformly minimum variance unbiased estimator) is the sample mean when no outliers exist in the sample and the normal distribution is assumed as the model, with μ and σ² being the usual parameters of the normal distribution, the population mean and variance. Assume now that a proportion ϵ of outliers exists in the sample, i.e., a contaminated normal model (see [1], p. 2) in which most of the data come from the assumed normal model and a small part of them, a proportion ϵ, come from a normal model with more dispersion and a different location, governed by one contamination parameter that affects the location and another that affects the scale. The optimality of the sample mean is then lost because the optimal procedure and its properties depend heavily on the assumed probability model ([2], p. 2). This is the reason why classical statistics rests, basically, on the normal model and on the sample mean.
Additionally, under a contaminated normal model, the robustness of the sample mean is lost [1,3]. Under this model, the sample mean is not the maximum likelihood estimator [4], and even the normality of the sample mean is not guaranteed [5].
In this paper, a new estimator for a location–scale contaminated normal model is proposed, avoiding the extreme sensitivity of the sample mean but coinciding with it when no outliers are present in the data. The median of the distribution of the sample mean is proposed as a new estimator, where the parameters of this distribution are estimated with the classical estimates reported in previous studies. This estimator is called the median of the distribution of the mean.
This new estimator relies on the distribution of the sample mean for two reasons. First, the classical estimates are given in terms of the classical mean (and classical variance). Second, the new procedure extends the classical one: if no outliers are present, the new estimator is the classical sample mean, so the classical estimate is extended to the case in which outliers are present.
Another, somewhat related, estimator is the median of means estimator. However, this estimator is ultimately one of the sample means and, hence, is not robust (see [6]).
With the new estimator, robustness is obtained, together with optimality if there are no outliers. Hence, this approach offers a new view of the dilemma between optimality and robustness.
Because the exact distribution of the sample mean under a mixture distribution is not known, it is approximated here in closed form with the von Mises (VOM) plus saddlepoint (SAD) method, a technique used by the author in several studies (see, for instance, [7,8]) but in another context. With this approximation, the estimator introduced in this paper can also be extended to models more general than the normal mixture considered here.
The rest of the paper is structured as follows. In Section 2, the VOM+SAD approximation for the distribution of the sample mean is obtained under a location–scale contaminated normal model. The definition and some properties of the new location estimator are considered in Section 3. A scale estimator based on these ideas is defined in Section 4, and an example of the application of the new estimator is given in Section 5. These ideas are applied to Mendelian randomization in Section 6. Some conclusions are outlined in Section 7.
2. VOM+SAD Approximation of Sample Mean Distribution
Because the new estimator depends on the distribution of the sample mean, the approximation to this distribution must be very accurate, especially when the sample sizes considered are very small. For this situation, using a von Mises expansion ([9], p. 215, or [10], p. 578) that depends on Hampel’s influence function [11] is highly recommended.
Although, in the end, the results obtained are applied to the mixture of normals model considered previously, they are stated for more general models, F, G, and H, which indicates possible future extensions of this method.
The final approximation is called VOM+SAD and was previously obtained by the author in the context of spatial data (see [7,8]). Following the ideas developed in those two papers, considering the tail probability functional, initially, the approximation obtained is
which allows the approximation of the distribution of when the observations follow model F by the distribution of when the variables of follow model G (pivotal distribution).
This approximation depends on the tail area influence function, TAIF, defined in [12].
Restricting this approximation to M estimators with a monotonically decreasing score function (see [1], p. 46) and using the Lugannani and Rice formula ([13], or [14] p. 77, or [1] p. 314) to obtain a saddlepoint approximation for the TAIF, as in the approximation given in [15] (p. 94) for M estimators, the VOM+SAD approximation obtained is
In the case of a location–scale mixture normal model, the framework considered in this paper, i.e., assuming that , the VOM+SAD approximation is
where , and .
VOM+SAD Approximation for the Distribution of the Sample Mean
In the particular case of the sample mean, the score function is . Remember that in the VOM+SAD approximation, the saddlepoint is computed under . Under this pivotal distribution, it is
Hence, from the saddlepoint equation the saddlepoint is obtained.
Additionally, , , , and . The leading term is , and the quotient in the last term on the right-hand side of (1) is
Hence, the VOM+SAD approximation (1) is
If distributions F and G are not close enough, intermediate distributions can be considered, as in [16,17,18], to obtain a more accurate approximation.
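As a sanity check on the saddlepoint step, it may help to work through the one case where everything is available in closed form. The sketch below assumes, for illustration only, that the observations are independent N(μ, σ²); this choice and the symbols K (cumulant generating function), ŝ (saddlepoint), w, u, and t are introduced here and are not taken from the derivation above. In that case, the Lugannani and Rice formula for the upper tail of the sample mean reduces to the exact normal tail:

```latex
\begin{align*}
K(s) &= \mu s + \tfrac{1}{2}\sigma^{2}s^{2}, \qquad
K'(\hat s) = t \;\Longrightarrow\; \hat s = \frac{t-\mu}{\sigma^{2}},\\
w &= \operatorname{sign}(\hat s)\sqrt{2n\{\hat s\,t - K(\hat s)\}} = \frac{\sqrt{n}\,(t-\mu)}{\sigma}, \qquad
u = \hat s\sqrt{n\,K''(\hat s)} = \frac{\sqrt{n}\,(t-\mu)}{\sigma},\\
P(\bar X > t) &\approx 1 - \Phi(w) + \phi(w)\Big(\frac{1}{u} - \frac{1}{w}\Big)
 = 1 - \Phi\!\Big(\frac{\sqrt{n}\,(t-\mu)}{\sigma}\Big), \qquad t \neq \mu .
\end{align*}
```

Because w = u in the normal case, the correction term vanishes and the formula is exact; for non-normal models the two quantities generally differ and the correction term contributes.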
3. Estimator Median of the Distribution of the Mean
If the previous distribution of the mean is
the median of this distribution, i.e., , is called the median of the distribution of the mean, , i.e., this estimator is the solution of
The parameters of this distribution are estimated with the classical estimates, the sample mean and the sample variance.
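The plug-in idea behind this definition can be illustrated with a simulation. The following is a minimal sketch, not the method of this paper: Monte Carlo simulation is swapped in for the VOM+SAD approximation, and the parameterization of the contaminating component (location g times the mean, standard deviation k times the standard deviation), the function name mdm_monte_carlo, and all argument names are assumptions made for illustration.

```r
# Monte Carlo sketch of the median of the distribution of the sample mean.
# Assumed (hypothetical) model for each observation:
#   (1 - eps) * N(mu, sigma^2) + eps * N(g * mu, (k * sigma)^2)
# mu and sigma are replaced by the reported summary values xbar and s.
mdm_monte_carlo <- function(xbar, s, n, eps, g = 1, k = 1,
                            n_sim = 20000, seed = 1) {
  set.seed(seed)
  # TRUE where an observation is drawn from the contaminating component
  outlier <- matrix(runif(n_sim * n) < eps, nrow = n_sim, ncol = n)
  z <- matrix(rnorm(n_sim * n), nrow = n_sim, ncol = n)
  # Each observation comes from exactly one of the two components
  x <- ifelse(outlier, g * xbar + k * s * z, xbar + s * z)
  # One simulated sample mean per row; return the median of their distribution
  median(rowMeans(x))
}

# Hypothetical summary values: with eps = 0 the result should be close to xbar
mdm_clean <- mdm_monte_carlo(xbar = 10, s = 2, n = 11, eps = 0)
mdm_contaminated <- mdm_monte_carlo(xbar = 10, s = 2, n = 11, eps = 0.2, g = 3)
```

With no contamination, the simulated median agrees with the supplied sample mean up to Monte Carlo error, which matches the property that the estimator coincides with the sample mean when there are no outliers.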
Figure 1, Figure 2 and Figure 3 show that, as each of the contamination parameters increases, the difference between the median of the distribution of the mean and the sample mean increases.
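Figure 1. Difference between the median of the distribution of the mean and the sample mean as a contamination parameter increases.
Figure 2. Difference between the median of the distribution of the mean and the sample mean as a contamination parameter increases.
Figure 3. Difference between the median of the distribution of the mean and the sample mean as a contamination parameter increases.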
The main reason for defining this estimator is that the median is more robust than the mean. Hence, when the individual observations are unknown, as assumed here, the influence of possible outliers should be lower on the median of the distribution of the mean than on the sample mean, which is used in this distribution as an estimator of the location parameter. Furthermore, in the case without outliers, this estimator is equal to the classical sample mean.
As a limitation, observe that the new estimator is also sensitive if outliers already affect the sample mean or sample variance used to estimate the location or scale parameter. Nevertheless, with the new estimator, this sensitivity is lower.
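One way to compare the behavior of the sample mean and the median in a simple numerical example is to run the R statements
> # each of the 11 observations comes from the contaminating component with probability 0.2
> is_outlier <- rbinom(11, 1, 0.2) == 1
> x <- ifelse(is_outlier, rnorm(11, 3*2, 1), rnorm(11, 2, 1))
> mean(x)
> median(x)
in which a random sample of size 11 is drawn from a normal mixture where 80% of the observations are expected to come from a N(2, 1) distribution and 20% from a N(3·2, 1) distribution.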
Finally, in future research, other robust estimators could be considered, such as the trimmed mean of the distribution of the sample mean.
4. Dispersion Estimator
With the ideas developed in this paper, a dispersion estimator should be
5. Example
In most applied papers, only the final values of the estimators used are given. Moreover, these estimators are usually the classical sample mean and sample variance, and the individual observations from which they were obtained are not included; therefore, there is no opportunity to robustify these values using robust techniques.
For this reason, a large number of examples could serve as an illustration of the estimator defined in this paper. Next, let us consider just one.
Example 1.
One of these studies is [19], where the vertebral column and thorax of some Neanderthal fossils were re-evaluated using their vertebrae because, as stated by the author, errors probably occurred in the reconstruction and the samples were wrongly classified. He mentions ([19], p. 23) a misclassification of 7/33, which can be taken as the value of the contamination parameter ϵ.
Because modern humans and Neanderthals have very similar vertebrae, no difference in the mean is assumed, so the location is taken to be uncontaminated. On the other hand, Neanderthals are slightly stockier than modern humans, with the dispersion of the latter being larger, which is reflected in the scale contamination parameter.
In Table 2 in [19], classical acceptance confidence intervals are provided for several vertebrae of 28 modern humans. They are based on the classical mean and variance, as the author states in this table. From the table, with respect to vertebra T1, the remains of Kebara 2 and La Ferrassie can be considered as modern humans instead of Neanderthals, because they are inside the confidence interval. The same happens with vertebra T7 but not with vertebra T5.
From these classical intervals, for vertebra T1, the classical sample mean and standard deviation are and , respectively. In this case, the estimator median of the distribution of the mean takes the value , obtaining the new robust acceptance confidence interval equal to , which does not contain the remains, concluding then, that these remains are Neanderthals and not modern humans, as they were wrongly considered with the classical estimators.
With respect to vertebra T5, , and the new robust acceptance confidence interval is , with neither the classical nor the robust interval including the remains, confirming that they are Neanderthals.
Finally, for vertebra T7, , and the new robust acceptance confidence interval is . Both this robust interval and the previous classical confidence interval include the remains of La Ferrassie, which are hence classified as modern human and not Neanderthal.
6. Robust Inverse-Weighted Estimator RIVW in Mendelian Randomization
Another field for the class of problems considered in this paper is randomized clinical trials (CTs). In each of these CTs, the sample mean and sample variance are the usual final result. These are usually combined, in a classical way, as a weighted mean in a meta-analysis. In CTs, the relationship of a variable X (called cause) with another variable Y (called effect) is analyzed, but reverse causality may exist, or a lack of complete randomization or, more importantly, confounders may be present.
Moreover, CTs are expensive and take a long time. With Mendelian randomization (MR), a method that has received renewed interest in recent years, CTs are imitated because, in any person, all genetic material is randomly allocated from their parents, including DNA markers. Randomly, some people receive more DNA markers related to variable X and others receive fewer. MR uses genetic variants (usually single-nucleotide polymorphisms (SNPs)) as instrumental variables Z.
Mathematically, MR is used to avoid possible biases in the regression of Y on X due to these three causes just mentioned. Formally, MR leads us to a two-step linear regression process; first, for every genetic variant , a linear regression of X on is performed, where, for individuals, is
from which the fitted values are obtained and used in a second regression of Y on these , obtaining finally [20]
where and represent the association of with the exposure and the outcome (only through X), respectively. The parameter represents the effect of on Y through X, where is the causal effect of X on Y that is being estimated. Moreover, represents the association between and Y not through the exposure of interest. Finally, the error terms and are assumed to be independent because independent samples are assumed to be used to fit the two previous regression models.
In MR, the standard estimator of the parameter of interest , the slope in the linear regression of Y on X, is the classical two-stage least squares estimator
which is the quotient of the slope of the regression of Y on , , and the slope estimator of the regression of X on , . These classical estimations, one for each value of the instrumental variable , are combined with the classical inverse-variance weighted (IVW) estimator
where , which is used to weight the estimators, assuming that the L genetic variants are mutually independent. In this way, a single causal effect estimate from L genetic instruments is obtained.
This classical and widely used estimator is not robust: being a weighted mean, it has a breakdown point of zero; see, for instance, [21].
In this section, the classical IVW estimator is robustified, first, by replacing the per-variant estimators with their median of the distribution of the mean counterparts and, second, by replacing the weights with the inverse of the new dispersion estimator,
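The IVW combination and its sensitivity to a single bad value can be sketched numerically. The block below is a minimal illustration, not the computation of this paper: it uses only the four variants visible in the table of Example 2 (the remaining rows are elided there), so the resulting number is not the value reported for the 28 markers. The first-order weights (squared exposure association divided by the outcome variance) and the helper name ivw_estimate are assumptions.

```r
# Summary data: the four rows shown in the table of Example 2
snp <- data.frame(
  exposure.beta = c(0.0260, -0.0440, 0.0090, -0.0360),
  exposure.se   = c(0.004, 0.004, 0.003, 0.007),
  outcome.beta  = c(0.0677, -0.1625, 0.0000, 0.0198),
  outcome.se    = c(0.0286, 0.0300, 0.0255, 0.0647)
)

# Hypothetical helper: classical IVW with first-order weights (assumption)
ivw_estimate <- function(bx, by, se_y) {
  ratio <- by / bx       # per-variant ratio estimates
  w <- (bx / se_y)^2     # inverse of the first-order variance of each ratio
  sum(w * ratio) / sum(w)
}

ivw <- with(snp, ivw_estimate(exposure.beta, outcome.beta, outcome.se))

# Sensitivity check: corrupt a single outcome association and recompute
corrupted <- snp
corrupted$outcome.beta[1] <- 100
ivw_bad <- with(corrupted, ivw_estimate(exposure.beta, outcome.beta, outcome.se))
```

A single corrupted summary value is enough to move the weighted mean as far as desired, which is the behavior that motivates replacing both the per-variant estimates and the weights with robust counterparts.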
defining the new estimator, based on the distribution, as
6.1. Distribution of Estimator
In this section, an approximation for the distribution of is obtained for each genetic variant , i.e., j is fixed. Moreover, because of the usual regression assumptions, is not random in the two previous linear regressions, i.e., in the estimator .
Hence, with denoting the constant
avoiding the j in the notation of to simplify it, and with being the constant
and assuming no outliers in the sample, the variable follows a normal distribution
and
The estimator is equal to
and, considering standardized data, i.e., that is computed as a correlations quotient,
its tail distribution is
Letting (removing the j if there is no risk of confusion) denote the random variable , , where a and are not random, the aim is to compute the distribution of the sample mean of the variables at 0, i.e.,
where is independent but not identically distributed because
where
and
which depends on and . The values of these parameters are given from previous studies following the median of the distribution of the mean method.
If the data contain no outliers, it will be
but, as usual, a proportion of outliers in the data is assumed, i.e., the following model is assumed for the observations:
where the contamination constants and are assumed to depend on .
To compute the distribution of under models , a von Mises approximation (VOM), based on a von Mises expansion, is used because it remains accurate when the sample sizes are small.
6.2. VOM Approximation of the Distribution
In general, to approximate the tail probability of statistic under a vector of model distributions , knowing its tail distribution under the vector of model distributions (called pivotal distributions), the von Mises expansion of the tail probability of at is used ([10], Section 2, or [22], Theorem 2.1, or [17], Corollary 2),
where the sample space ,
where is the second derivative of the tail probability functional at the mixture distribution , for some ; and is the ith (multivariate) partial tail area influence function of at in relation to , , introduced in [17], Definition 1,
in those where the right-hand side exists. In the computation of , only is contaminated; the other distributions remain fixed, .
In general, is close to 0, and the von Mises approximation (VOM) is defined as
Moreover, if is a mixture distribution, , ([23], p. 77). Additionally, because of the partial influence functions properties ([22], p. 3) that are valid for the partial tail area influence functions defined in [17], for any it will be
i.e., the integral, with respect to a given model, of the partial tail area influence function that depends on this model is equal to 0. Hence,
i.e., the VOM approximation is
Moreover, because of Proposition 1 in [17],
and the VOM approximation of the tail probability can also be expressed as
which allows an approximation of the tail probability under models , knowing the value of this tail probability under near models .
In the particular case that , the VOM approximation for the tail of can be expressed as (see (2) with and now)
or, see (4),
or
If it is assumed as model for the observations
and it is denoted by , and by the pivotal distribution, , i.e.,
the generic component of this last equation is
where is the cumulative distribution function of a standard normal distribution,
and
If
and
then,
and
Hence, making the change of variable , it is
where
and
Then, the VOM approximation to the distribution of is
Example 2.
In a study [24], whether low-density lipoprotein cholesterol (LDL-C) is a cause of coronary artery disease (CAD) was analyzed considering 28 DNA markers
| | SNP (DNA marker) | exposure.beta (X) | exposure.se (X) | outcome.beta (Y) | outcome.se (Y) |
|---|---|---|---|---|---|
| 1 | snp_1 | 0.0260 | 0.004 | 0.0677 | 0.0286 |
| 2 | snp_2 | -0.0440 | 0.004 | -0.1625 | 0.0300 |
| ... | ... | ... | ... | ... | ... |
| 27 | snp_27 | 0.0090 | 0.003 | 0.0000 | 0.0255 |
| 28 | snp_28 | -0.0360 | 0.007 | 0.0198 | 0.0647 |
Usually, is assumed to be an instrumental variable to mimic biallelic SNPs in Hardy–Weinberg equilibrium. A value
was obtained.
With the method proposed in this paper, considering sample sizes of , , , and , and contamination parameters , , and , for the first DNA marker is obtained
and
For all the 28 DNA markers, we have
which are combined in the new robust estimate
7. Conclusions
In this paper, a new method for estimating the parameters in a location–scale contamination model is introduced for the case where individual observations are not available and, therefore, the usual robust methods cannot be applied, i.e., for summary data problems.
For the location problem, a new estimator was defined that is equal to the usual sample mean when no outliers exist and corrects the classical estimates when outliers exist.
This new estimator was applied to one of the most used estimators in Mendelian randomization, the inverse-variance weighted estimator (IVW), defining a new estimator, the robust inverse weighted estimator (RIVW).
Funding
This study was partially supported by grant PID2021-124933NB-I00 from the Ministerio de Ciencia e Innovación (Spain).
Data Availability Statement
Not applicable.
Acknowledgments
The author is very grateful to the referees and to the assistant editor for their kind and professional remarks.
Conflicts of Interest
The author declares no conflict of interest.
References
1. Huber, P.J.; Ronchetti, E.M. Robust Statistics, 2nd ed.; John Wiley & Sons: New York, NY, USA, 2009.
2. Lehmann, E.L. Theory of Point Estimation; John Wiley & Sons: New York, NY, USA, 1983.
3. Hampel, F.R.; Ronchetti, E.M.; Rousseeuw, P.J.; Stahel, W.A. Robust Statistics. The Approach Based on Influence Functions; John Wiley & Sons: New York, NY, USA, 1986.
4. Basford, K.E.; McLachlan, G.J. Likelihood estimation with normal mixture models. Appl. Statist. 1985, 34, 282–289.
5. Berckmoes, B.; Molenberghs, G. On the asymptotic behavior of the contaminated sample mean. Math. Methods Stat. 2018, 27, 312–323.
6. Rodríguez, D.; Valdora, M. The breakdown point of the median of means tournament. Stat. Probab. Lett. 2019, 153, 108–112.
7. García-Pérez, A. Saddlepoint approximations for the distribution of some robust estimators of the variogram. Metrika 2020, 83, 69–91.
8. García-Pérez, A. New robust cross-variogram estimators and approximations for their distributions based on saddlepoint techniques. Mathematics 2021, 9, 762.
9. Serfling, R.J. Approximation Theorems of Mathematical Statistics; John Wiley & Sons: New York, NY, USA, 1980.
10. Withers, C.S. Expansions for the distribution and quantiles of a regular functional of the empirical distribution with applications to nonparametric confidence intervals. Ann. Stat. 1983, 11, 577–587.
11. Hampel, F.R. The Influence Curve and its role in robust estimation. J. Am. Statist. Assoc. 1974, 69, 383–393.
12. Field, C.A.; Ronchetti, E. A tail area influence function and its application to testing. Seq. Anal. 1985, 4, 19–41.
13. Lugannani, R.; Rice, S. Saddle point approximation for the distribution of the sum of independent random variables. Adv. Appl. Probab. 1980, 12, 475–490.
14. Jensen, J.L. Saddlepoint Approximations; Clarendon Press: Oxford, UK, 1995.
15. Daniels, H.E. Saddlepoint approximations for estimating equations. Biometrika 1983, 70, 89–96.
16. García-Pérez, A. Another look at the Tail Area Influence Function. Metrika 2011, 73, 77–92.
17. García-Pérez, A. A linear approximation to the power function of a test. Metrika 2012, 75, 855–875.
18. García-Pérez, A. A von Mises approximation to the small sample distribution of the trimmed mean. Metrika 2016, 79, 369–388.
19. Gómez-Olivencia, A. The presacral spine of the La Ferrassie 1 Neandertal: A revised inventory. Bull. Mém. Soc. Anthropol. Paris 2013, 25, 19–38.
20. Hartwig, F.P.; Davey Smith, G.; Bowden, J. Robust inference in summary data Mendelian randomization via the zero modal pleiotropy assumption. Int. J. Epidemiol. 2017, 46, 1985–1998.
21. Slob, E.A.W.; Burgess, S. A comparison of robust Mendelian randomization methods using summary data. Genet. Epidemiol. 2020, 44, 313–329.
22. Pires, A.M.; Branco, J.A. Partial influence functions. J. Multivar. Anal. 2002, 83, 451–468.
23. Ronchetti, E. Accurate and robust inference. Econom. Stat. 2020, 14, 74–88.
24. Waterworth, D.M.; Ricketts, S.L.; Song, K.; Chen, L.; Zhao, J.H.; Ripatti, S.; Aulchenko, Y.S.; Zhang, W.; Yuan, X.; Lim, N.; et al. Genetic Variants Influencing Circulating Lipid Levels and Risk of Coronary Artery Disease. Arterioscler. Thromb. Vasc. Biol. 2011, 30, 2264–2276.
© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).