1. Introduction
Density-based minimum divergence methods are popular tools in statistical inference. In parametric estimation, this amounts to choosing the model density closest (in terms of the selected divergence) to the empirical data density. This approach often combines strong robustness properties with high asymptotic efficiency. An important class of density-based divergences is the class of
divergences (see [
1]). Under standard regularity conditions, all minimum
divergence estimators have full asymptotic efficiency at the model [
2]; many also have attractive robustness properties. The seminal Hellinger distance study of [
3] appears to be the first to demonstrate that strong robustness properties may be achieved simultaneously with full asymptotic efficiency. Later, the same was demonstrated for much of the
divergence class (see, e.g., [
4]). The usefulness of the corresponding procedures in providing robust alternatives to the likelihood ratio test has also been explored in the literature [2,4,5]. The approach has been further refined and extended in many directions by later authors. On the whole, the utility of the minimum divergence procedures based on
divergences is well established in the literature.
One major criticism of this inference procedure is that it inevitably involves the use of some form of non-parametric smoothing (such as kernel density estimation) to produce a continuous estimate of the true density. This raises several difficulties, including the problematic bandwidth selection issue and the slow convergence of the kernel density estimator to the “truth” (particularly for high-dimensional data). The theoretical derivations are also harder. Developing methods which eliminate these difficulties may be worthwhile even if it involves a marginal loss in asymptotic efficiency.
An alternative class of minimum divergence estimators which avoids non-parametric smoothing in the construction of the empirical divergence is the class of minimum Bregman divergence estimators. An important example is the family of density power divergences (henceforth DPD(α), where α is the tuning parameter); the corresponding minimum density power divergence estimators (henceforth MDPDE(α)) have been shown to combine strong robustness properties with high asymptotic efficiency (see [
6]). Divergences within the Bregman class have been called decomposable divergences by [
7] and non-kernel divergences by [
8]. These divergences have simple estimating equations, and many of their asymptotic properties can be obtained from M-estimation theory. The Kullback–Leibler divergence, which is a decomposable divergence, is the only common member between the
divergence and the Bregman divergence classes.
The application of Bregman divergences continues to expand rapidly into complex domains. Recent studies have successfully adapted these measures for multivariate time series analysis, including change-point and anomaly detection [9,10]. In the realm of geometric data analysis, variants like the Total Bregman Divergence have been applied to K-Means clustering for point cloud denoising [11]. On the theoretical front, new bounds on excess minimum risk have been established for generalized divergence measures [12], while recent work in econometrics has utilized these concepts for the robust learning of tail dependence [13].
In the context of density-based minimum divergence estimation, we have several “good” choices available. To justify the development of another family of estimators, one must demonstrate that the new estimators are competitive with, if not better than, the existing standard. Within the class of minimum divergence estimators which do not require any nonparametric smoothing, the MDPDE(α) is the current standard. In this paper, we develop a family of divergences yielding minimum divergence estimators which satisfy this requirement; at the least, this family provides a highly competitive standard. Our proposed class of divergences will be called the exponentially weighted divergence family, indexed by a tuning parameter β (henceforth referred to as EWD(β)). The corresponding minimum divergence estimator will be denoted by MEWDE(β). This divergence will also be useful in the field of hypothesis testing. Although the likelihood ratio test has several asymptotic optimality properties at the model, it is also known to have very poor robustness properties. Many density-based minimum distance procedures yield robust tests of hypothesis with high efficiency, e.g., Refs. [14,15,16]. Some of these papers address the general problem of parametric hypothesis testing based on the density power divergence. The present work considers this problem in the context of a general Bregman divergence, with special emphasis on our proposed EWD(β) class. To summarize, the main contributions of this paper are fourfold:
We introduce the exponentially weighted divergence (EWD), a novel sub-class of Bregman divergences. Unlike the popular density power divergence (DPD), the EWD is constructed to ensure the associated weight function remains bounded within the interval.
We derive the asymptotic properties (consistency and normality) of the minimum EWD estimator (MEWDE) for independent non-homogeneous data.
We extend the framework to parametric hypothesis testing and derive the asymptotic null distribution of the test statistic.
We evaluate the method through simulations and real-data applications, comparing its performance against the density power divergence (DPD) and classical robust estimators.
It is instructive to situate the proposed EWD within the broader landscape of robust statistics. While classical robust methods such as M-estimators, Least Median of Squares (LMS), and MM-estimators focus on minimizing residual functions, the proposed EWD approach belongs to the class of minimum divergence estimators. This distinction is crucial: by minimizing a probability distance, the MEWDE naturally achieves high asymptotic efficiency at the model, a property that often requires complex, multi-step procedures in standard high-breakdown regression (e.g., MM-estimation). Furthermore, within the divergence framework, it is important to distinguish the EWD from the class of
-divergences [
1], which includes the Hellinger distance and Pearson’s χ²; while
-divergences offer strong geometric properties, they often result in estimating equations that are computationally intensive or require non-parametric density estimation components. In contrast, the EWD belongs to the Bregman divergence class. This ensures that the resulting estimators share the tractable, decomposable structure of the Maximum Likelihood Estimator (MLE) and can be easily implemented for any member of the exponential family. The EWD thus combines the desirable bounded-influence properties often sought in
-divergence inference with the computational simplicity of the Bregman framework.
The rest of this paper is organized as follows.
Section 2 introduces the general framework of minimum Bregman divergence estimation for independent non-homogeneous observations and defines our proposed sub-class, the EWD family.
Section 3 establishes the theoretical properties of the estimator, including Fisher consistency and asymptotic normality, and analyzes its robustness via the influence function.
Section 4 presents extensive simulation studies comparing the finite-sample performance of the proposed method against existing standards and discusses a data-driven strategy for selecting the tuning parameter that controls the trade-off between the efficiency and robustness of the MEWDE.
Section 5 demonstrates the practical utility of the estimator through several real-data examples, ranging from univariate problems to multiple linear regression.
Section 6 extends the framework to robust hypothesis testing and derives the asymptotic null distribution of the proposed test statistic. Finally,
Section 7 offers concluding remarks, while proofs and additional regression examples are provided in the Appendices.
4. Simulation Studies for MEWDE(β)
4.1. Introduction
For i.i.d. data, when the true density
g belongs to the model, i.e.,
for some
, let
refer to the MEWDE of an unknown parameter
. For
, under the regularity conditions (1)–(7) outlined in
Section 3.2, the score function satisfies the requirements of the Lindeberg–Feller Central Limit Theorem. Although this theorem [
18] is typically invoked for non-identically distributed data, it remains applicable in our i.i.d. setting because the finite variance of the score function automatically satisfies the Lindeberg condition. Consequently, applying the standard Taylor expansion technique for M-estimators, the asymptotic distribution of
is an
s-variate normal distribution with mean vector
and dispersion matrix given by
, where
As β → 0, both J and K reduce to the Fisher information matrix. We now consider different parametric families and compare the performance of MEWDEs and MDPDEs under different contamination scenarios.
4.2. Tuning Parameter Selection
In minimum EWD estimation, small values of β provide greater model efficiency, while large values of β provide greater outlier stability and protection against small model violations. Given any real dataset, we must choose an “optimal”, data-based tuning parameter β so that the procedure strikes the right balance for the dataset in question. We follow the approach of [
19] to derive the optimal estimate of the tuning parameter. This approach constructs an empirical estimate of the Mean Squared Error as a function of the tuning parameter
and a pilot estimator
given by
where
is the trace of a matrix and
and
are the matrices defined in Equations (
10) and (
11), respectively, evaluated at
. By minimizing the objective function given in Equation (17) over β, we obtain a data-driven “optimal” estimate of the tuning parameter. Ref. [
19] proposes the minimum
estimator as the pilot estimator in the above calculation.
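As an illustration of the selection rule described in this subsection, the sketch below minimizes an estimated MSE of the form (squared distance between the estimate and the pilot) + (asymptotic variance)/n over a grid of tuning values. Because the EWD objective is not reproduced in this section, the sketch substitutes the DPD estimator of a normal mean with known unit variance as a stand-in. The function names, the grid, the choice of the α = 1 fit as pilot, and the contamination setup are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def weighted_location(x, alpha, tol=1e-10, max_iter=500):
    """Stand-in robust estimator (NOT the EWD): solves the DPD estimating
    equation for a normal mean with sigma = 1 by fixed-point iteration."""
    mu = sorted(x)[len(x) // 2]  # start from the (upper) sample median
    for _ in range(max_iter):
        w = [math.exp(-alpha * (xi - mu) ** 2 / 2.0) for xi in x]
        new_mu = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return mu


def asymptotic_variance(alpha):
    """Asymptotic variance of the DPD location estimator at the N(mu, 1) model."""
    return (1.0 + alpha ** 2 / (1.0 + 2.0 * alpha)) ** 1.5


def select_tuning(x, grid, pilot_alpha=1.0):
    """Return (selected tuning value, estimate) minimising the estimated MSE."""
    n = len(x)
    pilot = weighted_location(x, pilot_alpha)  # hypothetical pilot choice
    best = None
    for a in grid:
        est = weighted_location(x, a)
        amse = (est - pilot) ** 2 + asymptotic_variance(a) / n
        if best is None or amse < best[0]:
            best = (amse, a, est)
    return best[1], best[2]


# Illustrative data: 90% N(0, 1) with 10% contamination from N(10, 1).
random.seed(1)
sample = [random.gauss(0, 1) for _ in range(180)]
sample += [random.gauss(10, 1) for _ in range(20)]
grid = [i / 10 for i in range(11)]
alpha_hat, mu_hat = select_tuning(sample, grid)
```

In this contaminated sample the criterion rejects the non-robust end of the grid (tuning value 0, i.e., the sample mean), because the squared distance to the pilot dominates the variance term there.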
4.3. Simulation Scheme
To rigorously evaluate the finite-sample performance of the proposed MEWDE, we compared it against the minimum density power divergence estimator (MDPDE) and the classical Maximum Likelihood Estimator (MLE). We adopted an “Oracle” tuning approach for the simulation study. For each estimator and each simulation scenario (combination of sample size
n and contamination level
), we performed a grid search to identify the optimal tuning parameter (
α for DPD, β for EWD) that minimizes the empirical Mean Squared Error (MSE). This approach reports the best possible performance achievable by each estimator, isolating the theoretical capability of the divergence measure from the variability of data-driven tuning selection. Note that the practical data-driven algorithm for selecting β in real applications is detailed in
Section 4.2.
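The Oracle protocol described above can be sketched as a Monte Carlo grid search: for each grid value, the empirical MSE is computed against the known true parameter, and the minimizer is reported. The sketch again uses the DPD location estimator with unit variance as a stand-in for the MEWDE, and the contaminating distribution N(8, 1), the grid, the replicate count and all function names are assumptions chosen only to make the example runnable.

```python
import math
import random


def robust_mean(x, alpha, tol=1e-8, max_iter=200):
    """Stand-in estimator (DPD normal-mean equation, sigma = 1), not the EWD."""
    mu = sorted(x)[len(x) // 2]
    for _ in range(max_iter):
        w = [math.exp(-alpha * (xi - mu) ** 2 / 2.0) for xi in x]
        new_mu = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return mu


def simulate(n, eps, shift, rng):
    """Draw n points from (1 - eps) N(0, 1) + eps N(shift, 1)."""
    return [rng.gauss(shift, 1.0) if rng.random() < eps else rng.gauss(0.0, 1.0)
            for _ in range(n)]


def oracle_tuning(n, eps, grid, reps=100, shift=8.0, seed=0):
    """Return the grid value with the smallest empirical MSE, plus all MSEs."""
    rng = random.Random(seed)
    sq_err = {a: 0.0 for a in grid}
    for _ in range(reps):
        x = simulate(n, eps, shift, rng)
        for a in grid:
            sq_err[a] += robust_mean(x, a) ** 2  # the true mean is 0
    mse = {a: s / reps for a, s in sq_err.items()}
    best = min(mse, key=mse.get)
    return best, mse


grid = [0.0, 0.25, 0.5, 0.75, 1.0]
best_alpha, mse = oracle_tuning(n=50, eps=0.1, grid=grid)
```

Because the same simulated samples are reused across the grid, differences between grid values reflect the estimator rather than Monte Carlo noise between independent runs.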
4.4. Results
We first define the empirical finite sample relative efficiency (FSRE) of the MEWDE (or MDPDE) as the ratio of MSE(MLE) to MSE(MEWDE) (or to MSE(MDPDE)). Under this metric, a value of 1 indicates efficiency equivalent to the MLE, while values greater than 1 indicate superior performance (error reduction) in the presence of contamination. To provide a comprehensive benchmark, we also report the performance of classical robust methods, specifically Huber’s M-estimator and Tukey’s bisquare estimator.
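To make the direction of the FSRE ratio concrete, a minimal helper is sketched below. The function name and the numerical inputs are hypothetical and are not taken from the paper's tables.

```python
def fsre(mse_mle, mse_robust):
    """Finite sample relative efficiency: MSE(MLE) / MSE(robust estimator)."""
    if mse_robust <= 0:
        raise ValueError("mse_robust must be positive")
    return mse_mle / mse_robust


# Hypothetical values: the robust estimator halves the MSE of the MLE,
# so the ratio exceeds 1.
example_ratio = fsre(0.5, 0.25)
```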
We consider three separate simulation designs involving (a) estimation of the mean of a univariate normal distribution with known standard deviation, (b) estimation of the standard deviation of a univariate normal distribution with known mean, and (c) estimation of the mean parameter of an exponential distribution. For simulation (a), the true distribution is taken to be
and the contaminating distribution is
. We run simulations for
and estimate the mean under the
model. Our findings are presented in
Table 1.
Table 1 presents the finite sample relative efficiency (FSRE) results, calculated as the ratio of the MSE of the MLE to the MSE of the robust estimator. Consequently, a value greater than 1 indicates that the robust method outperforms the MLE (common under contamination), while values less than 1 indicate that it is less efficient than the MLE. The results cover the pure location model with sample sizes
and contamination proportions
.
Table 1 demonstrates that under an independent “Oracle” tuning protocol, the MEWDE offers a superior efficiency–robustness trade-off in the most common contamination regimes. In the absence of outliers (
), the estimator retains high efficiency comparable to the MLE (FSRE
), confirming that robustness does not come at the cost of performance under the null model. Crucially, under moderate contamination (
), the MEWDE consistently outperforms both the MDPDE and classical robust methods; for example, at
and
, the MEWDE achieves a ten-fold reduction in MSE compared to the MDPDE (0.027 vs. 0.261). While the MDPDE exhibits greater stability in small-sample, severe-contamination settings (
), the MEWDE recovers its superiority as the sample size increases (
), suggesting that the exponential weight function provides sharper discrimination against outliers when sufficient data is available to stabilize the density estimate. For completeness, the full tabulated results and efficiency comparisons for Scenarios (b) and (c) are provided in
Appendix C. These additional results exhibit trends consistent with Scenario (a), confirming the method’s stability across different distributional assumptions.
4.5. Sensitivity to Tuning Parameter Misspecification
A practical concern for robust M-estimation is the sensitivity of the estimator to the choice of the tuning parameter β. While our primary results rely on an Oracle selection, practitioners require guidance on a “safe” range for β that yields robust performance without requiring prior knowledge of the contamination level. To address this, we examined the efficiency ratio defined as
across a grid of tuning parameters
. In this metric, values greater than 1 indicate that the proposed estimator outperforms the non-robust MLE. We varied the sample size
and contamination proportion
(using a mean-shift contamination
) to identify the regions of stability and breakdown. The sensitivity of the estimator is illustrated in
Figure 4. In each panel, the black curve represents the MSE profile as
varies, while the vertical red dashed line marks the Oracle optimal
. The results (summarized in
Figure 4) reveal three distinct performance regimes that offer concrete guidance for parameter selection:
Efficiency Cost (Clean Data, ): In the absence of outliers, the efficiency ratio is consistently below 1, reflecting the expected cost of robustness. However, for small tuning parameters (), the ratio remains high (approx. for ), confirming that the MEWDE retains the majority of the statistical information when the model is correctly specified.
Robustness Regime (): Under moderate contamination, the estimator demonstrates a wide “basin of stability.” The MEWDE consistently dominates the MLE, achieving efficiency ratios between and across the entire range of tested . This indicates that for the most common contamination scenarios, the method is highly insensitive to parameter misspecification; any yields superior results.
Severe Contamination Limit (): A critical divergence occurs under severe contamination. For small sample sizes (), the estimator breaks down regardless of , as the minority inliers are insufficient to anchor the density. However, for moderate sample sizes (), we observe a clear transition: small tuning parameters () maintain robustness (ratio ), while larger values () lead to performance degradation.
Figure 4.
Sensitivity analysis showing the Mean Squared Error (MSE) of the MEWDE as a function of the tuning parameter for different sample sizes.
Based on these findings, we recommend as a default operating range. This interval minimizes efficiency loss on clean data while providing maximum protection against severe contamination in intermediate sample sizes (), avoiding the breakdown risks associated with larger tuning parameters.
4.6. Performance of Data-Driven Tuning
A common critique of minimum divergence methods is the reliance on a tuning parameter β that must be selected by the user. While the primary simulation results presented in Table 1 utilized an “Oracle” approach (selecting the β that minimizes the actual squared error) to establish the theoretical limit of the method, this is impossible in practical applications where the true parameter is unknown.
To assess the practical feasibility of the MEWDE, we implemented the data-driven tuning algorithm proposed by Warwick and Jones [
19]. This method selects the optimal tuning parameter
by minimizing an empirical estimate of the Asymptotic Mean Squared Error (AMSE):
where
is the MEWDE computed with tuning parameter
, and
is the sandwich variance estimate. Crucially, this criterion requires a pilot estimator
to approximate the bias. For this simulation, we follow the proposal of [19] and use the minimum
estimator as
. The simulation setting mirrors the pure location model described in
Section 4.3. We generated independent observations from a contaminated normal mixture model:
where
denotes the standard normal density. We evaluated the performance across sample sizes
and contamination proportions
. For each replicate, the data-driven algorithm selected
from the grid
without knowledge of the true parameter or contamination level. The results of this comparison are presented in
Table 2.
In the clean-data scenario (
), the data-driven approach incurs an efficiency loss of approximately 19–24% compared to the Oracle. However, as contamination is introduced, the performance of the data-driven tuning rapidly converges to that of the Oracle. For
, the percentage loss drops significantly, often falling below 1% (see
Table 2). In these cases, the presence of outliers forces both the Oracle and the data-driven selector to adopt robust tuning parameters, eliminating the gap between the theoretical optimum and the practical estimate. This confirms that the Warwick–Jones algorithm, anchored by the minimum
estimator, effectively adapts to the data structure, providing necessary robustness without requiring prior knowledge of the contamination level.
4.7. Stability in Multivariate Settings
To address concerns regarding the performance of the MEWDE as the dimensionality of the parameter space increases, we extended our simulation study to a multiple linear regression setting. We considered models with
predictors and contamination rates of
. As shown in
Table 3, the MLE suffers from severe degradation as dimensionality and contamination increase; for example, with
,
and
contamination, the MLE squared error explodes to
. In contrast, the MEWDE (
) maintains remarkable stability with an error of
, comparable to its performance on clean data. Furthermore, the results demonstrate that
provides an excellent trade-off, achieving near-optimal efficiency on clean data while offering full protection against vertical outliers.
4.8. Computational Complexity
The computational cost of the MEWDE is dominated by the evaluation of the objective function and its gradient at each step of the iterative solver. For a sample size n, the evaluation of the estimating equation requires summing the score function over all observations; this operation is O(n). The integral term required for Fisher consistency depends only on the parameters, not the sample size, and thus adds a constant overhead per iteration (assuming fixed dimensionality). Consequently, if the numerical solver (e.g., Newton–Raphson) requires k iterations to converge, the total complexity for a fixed tuning parameter is O(kn). When selecting the optimal tuning parameter via the data-driven grid search described above, the estimator is re-computed for each of the G candidate values in the grid. The total complexity for the tuning procedure is therefore O(Gkn). Since the grid size G and the number of iterations k are typically small and fixed relative to the sample size, the overall procedure remains linear in n, i.e., O(n). This ensures that the data-driven selection of β remains computationally feasible even for large datasets, comparable to standard likelihood-based methods.
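The operation count in this argument can be checked with a small accounting sketch. It does not time a real solver; it only mirrors the loop structure (G grid values, k solver iterations, one pass over n observations per iteration), and the function name is hypothetical.

```python
def count_score_evaluations(n, k, G):
    """Count per-observation score evaluations in a grid search over G tuning
    values, assuming k solver iterations per value and one O(n) pass each."""
    evaluations = 0
    for _ in range(G):        # each candidate tuning value
        for _ in range(k):    # each solver iteration
            evaluations += n  # one sum over the sample
    return evaluations


base = count_score_evaluations(n=1000, k=10, G=20)
doubled = count_score_evaluations(n=2000, k=10, G=20)
```

With k and G held fixed, doubling n doubles the count, which is the sense in which the procedure is linear in n.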
Remark 5. The estimators were computed by solving the system of estimating equations defined in Equation (3). We utilized the multiroot function from the rootSolve package in R [20], which implements a Newton–Raphson method with diagonal Jacobian approximation. The convergence tolerance was set to . This root-finding approach proved to be numerically stable and computationally efficient, with average runtimes comparable to standard M-estimation routines and no significant overhead compared to the MDPDE.
Remark 6. Under light or no contamination, the MEWDE and MDPDE exhibit nearly indistinguishable performance, as expected given their shared first-order efficiency at the model. The practical advantage of the EWD becomes apparent primarily in moderate to heavy contamination settings, where bounded weighting prevents both outlier influence and inlier over-emphasis.
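The root-finding step mentioned in Remark 5 can be sketched in a few lines. This is a generic scalar Newton iteration with a central-difference derivative, written in Python rather than R, and it is not the rootSolve implementation. The estimating equation used here is the DPD score for a normal mean with unit variance, standing in for Equation (3), which is not reproduced in this section; the tolerance, the step size h and the data are illustrative assumptions.

```python
import math
import random


def newton_solve(score, theta0, tol=1e-9, max_iter=50, h=1e-5):
    """Solve score(theta) = 0 by Newton's method with a numerical derivative."""
    theta = theta0
    for _ in range(max_iter):
        value = score(theta)
        if abs(value) < tol:
            return theta
        slope = (score(theta + h) - score(theta - h)) / (2.0 * h)
        if slope == 0.0:
            raise RuntimeError("flat estimating equation; Newton step undefined")
        theta -= value / slope
    raise RuntimeError("Newton iteration did not converge")


def make_score(x, alpha):
    """Averaged stand-in estimating function (DPD, normal mean, sigma = 1)."""
    def score(mu):
        return sum((xi - mu) * math.exp(-alpha * (xi - mu) ** 2 / 2.0)
                   for xi in x) / len(x)
    return score


random.seed(2)
data = [random.gauss(0, 1) for _ in range(90)]
data += [random.gauss(10, 1) for _ in range(10)]
score = make_score(data, alpha=0.5)
start = sorted(data)[len(data) // 2]  # median as a robust starting value
root = newton_solve(score, start)
```

Starting from a robust initial value matters here: the stand-in estimating function redescends, so a start far from the bulk of the data could send the iteration to a different root.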
7. Closing Remarks
In this paper, we have presented an estimator based on a sub-class of density-based Bregman divergences, which is seen to outperform the existing standard (i.e., the DPD-based estimator). We have shown several asymptotic and distributional properties of the proposed estimator, both for i.i.d. data and for independent non-homogeneous data. The special case of linear regression (both simple and multiple) has been explored in the context of real data. We have also discussed “judicious” choice(s) of the tuning parameter which, when chosen properly, yield highly robust and efficient estimators which can often dominate the MDPDE. We have also considered a hypothesis testing strategy for parametric models which may serve as a robust alternative to the classical likelihood ratio and other likelihood-based tests. As we have noted, the weight function generated by the EWD converges to 1 as its argument, the value of the density function, increases. We feel that this is a more balanced way of weighting the observations than the weighting provided by the DPD, where the weights increase indefinitely as the value of the density increases.
7.1. Practical Recommendations for Practitioners
Based on the theoretical properties and empirical results presented in this work, we offer the following guidelines for practitioners choosing between the proposed MEWDE and existing alternatives like the MDPDE:
Hypothesis Testing and Size Control: The MEWDE is strictly preferable in hypothesis testing scenarios where controlling the Type I error rate is critical (e.g., regulatory clinical trials). As demonstrated in the power analysis (
Figure 8), the MEWDE maintains the correct nominal size (≈0.05) even under heavy contamination, whereas the MDPDE can suffer from size inflation.
Interpretation of Weights: In settings involving high-leverage points, the MEWDE offers a safer interpretation of “downweighting.” The exponentially decaying weight function is strictly bounded in , ensuring that no observation—regardless of how well it fits the model—can exert undue influence. This contrasts with power-divergence weights, which can theoretically grow unbounded depending on the model parameterization.
Implementation in Complex Models: For practitioners concerned with the integration burden of the asymptotic null distribution, we recommend the parametric bootstrap approach detailed in
Section 6.2. This method is computationally stable, integrates easily with existing statistical software (requiring only a standard optimization routine), and avoids the complexity of estimating the eigenvalues of the weighted chi-square distribution.
7.2. Diagnostics for Model Failure
Since the theoretical guarantees of the MEWDE rely on the compatibility between the assumed model family and the bulk of the data, practitioners should monitor the resulting robust weights () as a post hoc diagnostic. In a successful fit, weights should cluster near 1 for inliers and near 0 for outliers. A specific failure mode occurs in “heavy-tailed” settings (e.g., Cauchy data modeled as normal) where the tails of the data are far heavier than those of the model. In this case, the estimator may attempt to flatten the density to accommodate the spread, causing the estimated density values to become negligible for all points. This results in a weight collapse, where the contributing weights all approach 0. Practitioners observing consistently low weights across the sample should interpret this as a signal that the chosen parametric family is insufficiently heavy-tailed for the data, necessitating either a different model family (e.g., t-distribution) or a high-breakdown estimator such as LTS.
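The weight-based diagnostic described above could be automated along the following lines. The sketch takes the fitted weights as given (the EWD weight formula is not reproduced here), and the thresholds `low` and `collapse_fraction` are hypothetical defaults that would need calibrating for a real application.

```python
def weight_summary(weights, low=0.1, collapse_fraction=0.9):
    """Summarise fitted robust weights and flag a possible weight collapse.

    A collapse is flagged when at least `collapse_fraction` of the weights
    fall below `low`; both thresholds are illustrative assumptions.
    """
    n = len(weights)
    if n == 0:
        raise ValueError("weights must be non-empty")
    share_low = sum(1 for w in weights if w < low) / n
    return {"share_low": share_low, "collapsed": share_low >= collapse_fraction}


# A healthy fit: most weights near 1, a few outliers near 0.
healthy = weight_summary([0.95] * 18 + [0.02] * 2)
# A suspicious fit: nearly every weight is close to 0.
suspicious = weight_summary([0.01] * 19 + [0.5])
```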
7.3. Limitations
While the MEWDE demonstrates strong robustness against outliers and contamination within the support of the distribution, it is important to acknowledge certain theoretical limitations. First, like most M-estimators derived from the exponential family, the validity of the asymptotic theory (specifically Assumptions 5–7) relies on the integrability of the weighted score function. In settings with extremely heavy-tailed data (e.g., Cauchy-distributed errors) or severe leverage points where the underlying moments do not exist, the required integrals may diverge. In such pathological cases, high-breakdown methods such as Least Trimmed Squares (LTS) or MM-estimators may be more appropriate, albeit at the cost of efficiency. Second, the current formulation of the EWD focuses on the standard regime; as noted earlier, extending this framework to high-dimensional settings () requires the incorporation of sparsity-inducing penalties, which is a subject of our ongoing research.
7.4. Future Work
It may also be mentioned that the proposal based on the EWD has the potential to be useful in all the situations where the DPD has been successfully applied, such as generalized linear models, survival analysis and Bayesian inference, to name a few. We hope to pursue all of these in our future research. A systematic comparison of alternative bounded weight functions within the Bregman framework is another interesting direction for future research, but lies beyond the scope of the present paper. In the case of hypothesis testing, we have only investigated the analogues of the likelihood ratio test. Wald-type tests based on the EWD should also be studied; these are likely to have simpler asymptotic null distributions than that in Equation (
21). The DPD-based Wald-type test has been extensively used in the literature, and comparisons with EWD-based tests will be interesting. We also hope to refine the tuning parameter selection strategy using the recently developed method of [
26].