Article

Performance Evaluation of the Robust Stein Estimator in the Presence of Multicollinearity and Outliers

by Lwando Dlembula *, Chioneso Show Marange and Lwando Orbet Kondlo
Department of Computational Sciences, University of Fort Hare, Alice 5700, South Africa
* Author to whom correspondence should be addressed.
Stats 2026, 9(1), 21; https://doi.org/10.3390/stats9010021
Submission received: 28 October 2025 / Revised: 26 November 2025 / Accepted: 3 December 2025 / Published: 22 February 2026
(This article belongs to the Special Issue Robust Statistics in Action II)

Abstract

Multicollinearity and outliers are common challenges in multiple linear regression, often adversely affecting the properties of least squares estimators. Several robust estimators have been developed to handle multicollinearity and outliers individually or simultaneously. More recently, the robust Stein estimator (RSE) was introduced, integrating shrinkage and robustness to mitigate the impact of both problems. Despite its theoretical advantages, the finite-sample performance of this approach under multicollinearity and outliers remains underexplored. First, earlier research on the RSE focused mainly on outliers in the y-direction, overlooking the fact that leverage points can substantially affect regression results; this study therefore considers both y-direction outliers and leverage points, providing a more thorough assessment of the RSE's robustness. Second, we compare and evaluate the performance of the RSE against a wide range of robust and classical estimators, extending the limited benchmarking available in the current literature. Several Monte Carlo (MC) simulations were conducted under both normal and heavy-tailed error distributions, with sample sizes, multicollinearity levels, and outlier proportions varied. Performance was evaluated using bootstrap estimates of root mean squared error (RMSE) and bias. The MC simulation results indicated that the RSE outperformed the other estimators in several scenarios where both multicollinearity and outliers are present. Finally, real data studies confirm the MC simulation results.

1. Introduction

Linear regression analysis is an important statistical tool for modeling the relationship between a response variable (y) and predictors (X), and it is widely applied across fields of study. A linear model has the following form:
y = X β + ε ,
where y is an n × 1 vector of observed values of the dependent variable, X is an n × p matrix whose i-th row x_i = [x_{i1}, x_{i2}, …, x_{ip}] contains the regressor values x_{ij}, with i = 1, 2, …, n and j = 1, 2, …, p; β is a p × 1 vector of unknown parameters, and ε is an n × 1 vector of errors. Under the classical assumptions that the errors have zero mean, constant variance, and are uncorrelated, the ordinary least squares (OLS) estimate β̂ is the best linear unbiased estimator (BLUE) of β according to the Gauss–Markov theorem. The OLS estimate is BLUE even without assuming normally distributed residuals and minimizes the sum of squared residuals. If the error term follows a normal distribution, ε ∼ N(0, σ²I), this normality assumption makes hypothesis tests and confidence intervals possible. If there is more than one predictor variable, the analysis is referred to as multiple linear regression analysis [1]. A multiple regression model can be used to describe this relationship:
y = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_p x_p + ε.
Applying the ordinary least squares estimator (OLSE) in simple or multiple linear regression requires certain assumptions: normality of the error terms, equal variance of the error terms, and the absence of outliers, leverage points, and multicollinearity. The OLSE is therefore the best method in regression analysis when these assumptions are met; if they are not satisfied, the results can easily be distorted [2].
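As a point of reference for the estimators reviewed below, the OLS fit minimizing the residual sum of squares can be sketched in a few lines of Python with numpy (the paper's own computations are in R; this illustrative sketch is ours):

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares with an intercept: the minimizer of the
    residual sum of squares, beta_hat = (X'X)^{-1} X'y."""
    Xd = np.column_stack([np.ones(len(y)), X])         # prepend intercept column
    beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # stable least-squares solve
    return beta_hat

# On noise-free data the true coefficients are recovered exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1 + 2 * X[:, 0] - 3 * X[:, 1]
beta_hat = ols_fit(X, y)
```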
Various problems can occur in the estimation of the model parameters, for instance when the assumption of normality is not fulfilled or when the data contain outliers [3]. According to [4,5], the normality of error distributions rests on the central limit theorem, which is a limit theorem based on approximations. Hampel and Huber [4] critically challenged the prevailing assumption that the OLSE remains approximately optimal under conditions of approximate normality, since typical error distributions of real-world datasets usually deviate slightly from normality and, in most cases, are heavy-tailed. Additionally, outliers in the dependent variable lead to large residual values, which further cause the normality assumption on the error terms to fail. According to [6], observations that deviate markedly from the regression line are outliers; that is, outliers are data points whose characteristics differ from those of the remaining data. Several statistical methods are used to detect outliers, such as the Cook's distance index plot and the potential-residual plot (e.g., [7]). The existence of outliers can affect the data analysis process, leading to incorrect decision making; however, it can be overcome by several methods, including robust regression [8]. Several robust regression estimators have been proposed to address the problem of outliers in different regression models, including the Huber maximum likelihood estimator (HME), the modified maximum likelihood estimator (MME), the least trimmed squares estimator (LTSE), the least median of squares estimator (LMSE), and the S-estimator (SE) [9].
In addition, multicollinearity is another pressing challenge in linear regression models, occurring when predictor variables are highly correlated. This problem renders maximum likelihood estimates unstable and inefficient, and many biased estimators have therefore been proposed to address it in different regression models [10,11]. However, multicollinearity and outliers can also exist simultaneously in a model, and various estimators have been combined for robust estimation to handle both [12,13,14]. Although existing robust estimators such as HME, MME, SE, LTSE, and LMSE handle outliers effectively, they suffer a significant efficiency loss under multicollinearity due to the increased variance of the coefficient estimates. In contrast, shrinkage estimators, such as the James–Stein estimator (JSE), handle multicollinearity but fail entirely in the presence of outliers, since they rely on OLS-based estimation, which is extremely susceptible to contamination. Existing estimators thus struggle to remain efficient under multicollinearity while staying resilient to outliers [9].
To address this gap, Lukman et al. [14] recently introduced the robust Stein estimator (RSE), which combines shrinkage to reduce multicollinearity with M-estimation to limit the effect of outliers. Despite its promising theoretical foundation, the RSE has not received much attention in the literature, and its practical performance has not been extensively examined under various contamination settings. In particular, few studies investigate how well it holds up in the presence of leverage points, heavy-tailed error distributions, and varying degrees of multicollinearity. In this study, we examine the robust version of the Stein estimator, a combination of the HME and the JSE originally proposed by [14], which can handle both multicollinearity and outliers simultaneously. The performance of this estimator is assessed and compared with that of other robust regression methods through different simulation scenarios and real data applications.
The rest of the paper is organized as follows. Section 2 presents the materials and methods. Section 3 reports the results of the Monte Carlo (MC) simulation study and the empirical applications, showing how the RSE performs on real-world data. Section 4 and Section 5 conclude with a discussion and conclusions that summarize the main findings and identify potential areas for further research.

2. Materials and Methods

In this section, attention is paid to the robust methods considered in this study and their properties. The selected estimators were chosen for their theoretical significance, ease of implementation, and computational viability, owing to their availability in R.

2.1. Review on Robust Estimators

When the data are contaminated by one or a few outliers, detecting such observations is already challenging. In practice, data sets often contain multiple outliers or clusters of influential observations and may also exhibit multicollinearity among the variables. The co-occurrence of outliers and multicollinearity further complicates regression analysis, commonly producing biased and unstable parameter estimates. Alma [15] points out that robust estimation is an important method for analyzing data contaminated by outliers. It is an approach to regression analysis that aims to overcome some of the limitations of classical parametric and nonparametric approaches: well-performing robust estimates are designed to be less sensitive to outliers and multicollinearity. Here, we present a review of existing robust estimators used in linear regression.

2.1.1. The Huber Maximum Likelihood Estimator (HME)

The most common general robust method is M-estimation, introduced by Huber et al. [5], which minimizes an objective function of the residuals. In this study, we use a tuning constant k = 1.345, which offers around 95% efficiency while protecting against outliers [16]. The rlm() function in the MASS R statistical package was used to implement the HME. For y-direction outliers, this estimator serves as a baseline that does not take multicollinearity into account. The M-estimator minimizes:
Σ_{i=1}^{n} ρ(r_i) = Σ_{i=1}^{n} ρ(e_i(β)/σ),
where r_i = e_i(β)/σ are the scaled residuals. Differentiating with respect to β, assuming σ is fixed, and setting the partial derivatives to zero results in the following normal equations:
Σ_{i=1}^{n} ψ(e_i(β)/σ) x_i = 0.
Let X* = [1_n X] denote the n × (p + 1) design matrix, where 1_n is the intercept column and X contains the p predictors. The i-th observation can be written as x_i = (1, x_{i1}, x_{i2}, …, x_{ip}), where i = 1, 2, …, n indexes observations and j = 1, 2, …, p indexes predictors. We consider the spectral decomposition T′(X′X)T = Λ = diag(λ_j), j = 1, 2, …, p*, with p* = p + 1. Following [14], the model in Equation (1) is transformed as follows:
y_i = Σ_{j=1}^{p*} α_j h_{ij} + ε_i,   i = 1, 2, …, n,
where λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_{p*}, and T is an orthogonal (p* × p*) matrix whose columns are the eigenvectors corresponding to these eigenvalues. The transformation is defined by H = XT, α = T′β, and T′(X′X)T = H′H = Λ. Therefore, the OLS estimator of α is written as follows:
α̂_LS = Λ^{−1} H′y.
The M-estimator of α is
α̂_M = argmin_α Σ_{i=1}^{n} ρ(ε_i/σ) = argmin_α Σ_{i=1}^{n} ρ((y_i − Σ_{j=1}^{p*} α_j h_{ij})/σ),
where ρ is a robust function and σ is a scale parameter. The estimator α ^ M is obtained as the solution to the M-estimating Equations (3) and (4).
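The ψ-based normal equations above are typically solved by iteratively reweighted least squares (IRLS). A minimal Python sketch follows (the study itself used rlm() from the MASS R package; holding the scale fixed at the MAD of the initial OLS residuals is a simplification on our part):

```python
import numpy as np

def huber_m_estimate(X, y, k=1.345, n_iter=50):
    """Huber M-estimation via iteratively reweighted least squares (IRLS).
    Weights are w_i = 1 for |r_i| <= k and k/|r_i| otherwise, which solves
    the psi-based normal equations. The scale sigma is fixed at the
    normalized MAD of the initial OLS residuals (a simplification)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]        # OLS starting value
    resid = y - Xd @ beta
    sigma = np.median(np.abs(resid - np.median(resid))) / 0.6745
    for _ in range(n_iter):
        r = np.maximum(np.abs((y - Xd @ beta) / sigma), 1e-12)  # |scaled residuals|
        w = np.where(r <= k, 1.0, k / r)                # Huber weights
        WX = Xd * w[:, None]
        beta = np.linalg.solve(Xd.T @ WX, WX.T @ y)     # weighted normal equations
    return beta
```

With 10% of responses shifted far upward, the slope estimates remain close to the truth while plain OLS would be pulled away.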

2.1.2. James Stein Estimator (JSE)

As a remedy to the problem of multicollinearity, the James-Stein estimator (JSE), originally proposed by [17,18], applies shrinkage to the ordinary least squares estimator to improve estimation efficiency under correlated regressors. In the standard regression context, the JSE is given by the following:
β̂_JSE = [1 − (p − 2)σ̂² / (β̂′_OLS X′X β̂_OLS)] β̂_OLS,   for p > 2,
where β ^ OLS is the ordinary least squares estimator and σ ^ 2 is the estimated error variance. In the transformed α -space defined in Equation (5), following [14], the JSE can be expressed as follows:
α̂_JSE = c α̂_LS,
where 0 < c < 1 and α̂_LS is an unbiased estimate of α. If c = 1, then α̂_JSE = α̂_LS. The shrinkage factor c is given by the following:
c = α̂′_LS α̂_LS / (α̂′_LS α̂_LS + σ² tr((X′X)^{−1})) = Σ_{j=1}^{p*} λ_j α̂_j² / (σ² + λ_j α̂_j²).
The equivalence between the two forms in Equation (10) follows from the spectral decomposition, noting that tr((X′X)^{−1}) = Σ_{j=1}^{p*} λ_j^{−1} and that the canonical transformation diagonalizes the covariance structure into independent components along the eigenvector directions.
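A minimal numerical sketch of Stein-type shrinkage of the OLS estimator follows; truncating the factor at zero (the positive-part variant) is our choice for numerical safety, not a detail taken from the paper:

```python
import numpy as np

def js_shrinkage(X, y):
    """Stein-type shrinkage of the (no-intercept) OLS estimator:
    b_JS = (1 - (p-2) s^2 / (b' X'X b)) b for p > 2, truncated at zero
    (positive-part variant, our choice for numerical safety)."""
    n, p = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    s2 = resid @ resid / (n - p)                  # error-variance estimate
    c = 1.0 - (p - 2) * s2 / (b @ (X.T @ X) @ b)  # shrinkage factor
    return max(c, 0.0) * b
```

By construction the shrunken vector is never longer than the OLS vector, which is the mechanism that trades a little bias for reduced variance under correlated regressors.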

2.1.3. The Robust Stein Estimator (RSE)

The robust Stein estimator was first introduced by [14] as an alternative to the Stein and ridge estimators to improve estimation accuracy; both of those methods, however, are non-robust to outliers. In that study, the Stein estimator was shown to be sensitive to outliers in the y-direction, and the RSE [14] was proposed, defined as:
α̂_{M-JSE} = c* α̂_M,
where α̂_M is the M-estimate of α. The shrinkage factor c* in the RSE (M–Stein) is defined using the following components:
c* = Σ_{j=1}^{p*} λ_j α̂²_{M,j} / (Ψ_jj + λ_j α̂²_{M,j}),
where
  • λ_j are the ordered eigenvalues of the design information matrix (e.g., X′X in linear models),
  • α̂_{M,j} are the components of the robust estimator α̂_M,
  • Ψ_jj are the diagonal entries of the variance–covariance matrix Var(α̂_M).
The robust covariance matrix Ψ is estimated using the sandwich estimator [5]:
Ψ̂ = Â^{−2} B̂,  where  Â = (1/(n − p*)) Σ_{i=1}^{n} ψ′(e_i/σ̂),  B̂ = (σ̂²/(n − p*)) Σ_{i=1}^{n} ψ(e_i/σ̂)²,
and σ ^ is a robust scale estimate. The approximate covariance of α ^ M - JSE is the following:
Cov(α̂_{M-JSE}) ≈ c* Cov(α̂_M) (c*)′ = c* Ψ (X′X)^{−1} (c*)′.
The approximate bias is the following:
Bias(α̂_{M-JSE}) ≈ E[c* α̂_M] − α = (c* − 1) α.
The matrix mean squared error (MMSE) is approximately the following:
MMSE(α̂_{M-JSE}) ≈ c* Ψ (X′X)^{−1} (c*)′ + Bias(α̂_{M-JSE}) Bias(α̂_{M-JSE})′.
The scalar mean squared error (SMSE) is approximately given by:
SMSE(α̂_{M-JSE}) ≈ (c*)² Σ_{j=1}^{p*} Ψ_jj/λ_j + (c* − 1)² Σ_{j=1}^{p*} α_j²,
which can also be expressed as:
SMSE(α̂_{M-JSE}) ≈ Σ_{j=1}^{p*} Ψ_jj λ_j α_j⁴ / (Ψ_jj + λ_j α_j²)² + Σ_{j=1}^{p*} Ψ_jj² α_j² / (Ψ_jj + λ_j α_j²)².
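The RSE construction (canonical transform, M-estimate of α, then shrinkage) can be sketched as follows. This is a simplified illustration, not the paper's implementation: we apply the shrinkage factor componentwise rather than as a single scalar, and we approximate Ψ_jj by a crude residual-variance estimate instead of the sandwich estimator:

```python
import numpy as np

def robust_stein(X, y, k=1.345, n_iter=50):
    """Sketch of the robust Stein idea: (1) canonical transform via the
    spectral decomposition of X'X, (2) Huber IRLS for alpha, (3) shrink.
    Simplifications: the factor lam_j a_j^2 / (Psi_jj + lam_j a_j^2) is
    applied componentwise, and Psi_jj is approximated by s^2/lam_j."""
    n, p = X.shape
    lam, T = np.linalg.eigh(X.T @ X)            # eigenvalues/eigenvectors of X'X
    H = X @ T                                   # canonical regressors
    a = np.linalg.lstsq(H, y, rcond=None)[0]    # OLS start for alpha
    r0 = y - H @ a
    sigma = np.median(np.abs(r0 - np.median(r0))) / 0.6745
    for _ in range(n_iter):                     # Huber IRLS with fixed scale
        r = np.maximum(np.abs((y - H @ a) / sigma), 1e-12)
        w = np.where(r <= k, 1.0, k / r)
        WH = H * w[:, None]
        a = np.linalg.solve(H.T @ WH, WH.T @ y)
    resid = y - H @ a
    s2 = resid @ resid / (n - p)                # crude variance estimate
    psi = s2 / lam                              # Var(alpha_j) ~ s^2 / lam_j
    c = lam * a**2 / (psi + lam * a**2)         # componentwise shrinkage in (0, 1)
    return T @ (c * a)                          # back-transform beta = T alpha
```

Under multicollinearity with y-direction outliers, the M-step resists the contamination while the shrinkage damps the poorly determined canonical components.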

2.1.4. The Least Median of Squares Estimator (LMSE)

In order to reach the highest possible 50% breakdown point (BP), the LMSE minimizes the median of the squared residuals rather than their sum [19]. The LMSE was implemented using the lmsreg() function from the MASS R statistical package. Siegel [19] defined the least median of squares (LMS) estimator as:
β̂_LMS = argmin_β med_i {e_i²},
which replaces the sum in the least squares estimator (LSE) by the median. The procedure is essentially based on the idea presented by [20]. The advantage of the LMSE is that it is very robust to both y-direction outliers and leverage points, and it has been shown to attain the highest possible BP of 0.5. Unfortunately, it has very low efficiency and can be unstable. Moreover, due to its slow convergence rate, the LMSE does not have a well-defined influence function. Because of these properties, the LMSE is usually used as the initial estimate of the residuals for other, more efficient methods such as MM-estimators [20].
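The LMS criterion can be approximated by the classical random elemental-subset search, sketched below in Python (production implementations in R use more refined algorithms; the subset count n_sub is an illustrative choice of ours):

```python
import numpy as np

def lms_estimate(X, y, n_sub=500, rng=None):
    """Least median of squares via random elemental subsets: fit an exact
    hyperplane through p+1 points and keep the candidate minimizing
    med_i e_i^2 over all observations."""
    rng = rng if rng is not None else np.random.default_rng(0)
    Xd = np.column_stack([np.ones(len(y)), X])
    n, q = Xd.shape
    best_beta, best_crit = None, np.inf
    for _ in range(n_sub):
        idx = rng.choice(n, size=q, replace=False)
        try:
            beta = np.linalg.solve(Xd[idx], y[idx])   # exact fit through q points
        except np.linalg.LinAlgError:
            continue                                  # singular subset, skip
        crit = np.median((y - Xd @ beta) ** 2)
        if crit < best_crit:
            best_crit, best_beta = crit, beta
    return best_beta
```

Because the criterion only needs half the residuals to be small, a fit through the clean majority beats any fit through a 30% cluster of bad leverage points.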

2.1.5. The Least Trimmed Squares Estimator (LTSE)

Another regression estimator with a BP of nearly 50% is the least trimmed squares (LTS) estimator proposed by [20]. Traditional OLS methods are highly sensitive to outliers, meaning that a few extreme data points can dramatically affect the estimated model. The LTSE is designed to be more robust to such outliers by focusing on the subset of the data with the smallest residuals. The estimator chooses the regression coefficients β to minimize the sum of the smallest h squared residuals and is defined as follows:
β̂_LTS = argmin_β Σ_{i=1}^{h} e²_{(i)}(β),
where e²_{(i)}(β) denotes the i-th ordered squared residual, e²_{(1)} ≤ ⋯ ≤ e²_{(n)}, and h is the trimming constant, which must satisfy n/2 < h ≤ n. The constant h determines the BP of the LTSE; typically, h = ⌊n/2⌋ + ⌊(p + 1)/2⌋ attains the maximum, where ⌊·⌋ denotes the floor function (rounding down to the nearest integer). When h = n, the LTSE is exactly equivalent to the LSE, whose BP is zero. Although its convergence rate [21] makes it asymptotically normal, unlike the LMSE, it still suffers very low efficiency of only about 7%. In this study, we used the ltsReg() function available in the robustbase R statistical package and set h to attain the 50% BP.
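A sketch of LTS estimation with random starts and concentration (C-)steps, using the maximal-breakdown choice of h described above, is given below (the FAST-LTS algorithm behind ltsReg() is considerably more elaborate; the start and step counts here are illustrative choices of ours):

```python
import numpy as np

def lts_estimate(X, y, n_starts=50, n_csteps=20, rng=None):
    """Least trimmed squares via random starts plus concentration steps:
    repeatedly refit OLS on the h observations with the smallest squared
    residuals. h = floor(n/2) + floor((p+1)/2) gives maximal breakdown."""
    rng = rng if rng is not None else np.random.default_rng(0)
    Xd = np.column_stack([np.ones(len(y)), X])
    n, q = Xd.shape                                    # q = p + 1
    h = n // 2 + q // 2
    best_beta, best_crit = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(n, size=q, replace=False)
        try:
            beta = np.linalg.solve(Xd[idx], y[idx])    # elemental start
        except np.linalg.LinAlgError:
            continue
        for _ in range(n_csteps):                      # concentration steps
            keep = np.argsort((y - Xd @ beta) ** 2)[:h]
            beta = np.linalg.lstsq(Xd[keep], y[keep], rcond=None)[0]
        crit = np.sort((y - Xd @ beta) ** 2)[:h].sum() # trimmed criterion
        if crit < best_crit:
            best_crit, best_beta = crit, beta
    return best_beta
```

Each C-step can only decrease the trimmed criterion, so every start converges to a local optimum and the best over many starts is kept.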

2.1.6. The S-Estimator (SE)

To find a simple high-breakdown regression estimator that shares the flexibility and nice asymptotic properties of the M-estimator, Rousseeuw et al. [21] introduced the S-estimate. The SE was developed as a robust alternative to traditional estimators like OLSE, particularly when the data contain outliers or high-leverage points. They called it the S-estimate because it is derived from the M-scale estimate equation:
(1/n) Σ_{i=1}^{n} ρ(e_i(β)/σ̂) = δ.
In M-estimates, when σ is unknown, we use Equation (21) to obtain the scale parameter and regression estimates simultaneously. Let δ = E_ϕ[ρ(x)], where ϕ denotes the standard normal density, and let
d(e) = #{i : 1 ≤ i ≤ n, e_i = 0} / n.
When d(e) < 1 − δ/a (where a is the upper bound of ρ and a ∈ (0, ∞)), the scale equation has a unique positive solution. If d(e) = 1 − δ/a, it may have infinitely many solutions, including σ = 0, and if d(e) > 1 − δ/a, there is no solution. To avoid such indeterminacies, we define σ(e) = 0 whenever d(e) ≥ 1 − δ/a. To define the S-estimator, let ρ satisfy the following condition:
(A1)
ρ is symmetric, continuously differentiable, and ρ(0) = 0. There exists c > 0 such that ρ is strictly increasing on [0, c] and constant on [c, ∞).
For each vector β, using Equation (21), we can calculate the dispersion of the residuals, σ̂(e_1(β), …, e_n(β)), where ρ satisfies (A1). Then, the S-estimator β̂ is defined by:
β̂ = argmin_β σ̂(e_1(β), …, e_n(β)).
The tuning constant k is set to 1.547 to guarantee a 50 % breakdown point. We used the function lmrob.S(), which is available in the robustbase R statistical package.

2.1.7. The Modified Maximum Likelihood Estimator (MME)

MM-estimation is a special type of M-estimation developed by [20]. It is particularly useful for non-normal data or data containing outliers, combining a high breakdown point (50%) with high efficiency through the three-stage procedure of [20]. The function lmrob(), which is part of the robustbase R statistical package, was used in this study, with the tuning constant k set to 4.685 to guarantee 95% efficiency. Like the HME, the MME does not directly address multicollinearity, although it improves on the HME's breakdown point while retaining efficiency.
Stage 1
Compute an initial consistent robust estimate β ^ 0 with a high breakdown point (BP), possibly 50%, but not necessarily efficient.
Stage 2
Compute the M-scale σ ^ of the residuals e i ( β ^ 0 ) using Equation (21), with a function ρ 0 satisfying (A1) and choosing a constant δ such that δ / a = 0.5 , where a = sup ρ 0 ( e ) . Thus, the asymptotic BP of σ ^ is 0.5 [16].
Stage 3
Let ρ 1 be another ρ -function satisfying (A1) such that
sup_e ρ_1(e) = sup_e ρ_0(e) = a,
and
ρ_1(e) ≤ ρ_0(e).
Let ψ_1 = ρ_1′ and define the objective function as follows:
L(β) = Σ_{i=1}^{n} ρ_1(e_i(β)/σ̂), with the convention ρ_1(0/0) = 0.
Then, the MM-estimate β ^ 1 is defined as any solution to the following equation:
Σ_{i=1}^{n} ψ_1(e_i(β)/σ̂) x_i = 0,
that also satisfies
L(β̂_1) ≤ L(β̂_0).
Yohai [20] showed that any value of β that satisfies Equations (26) and (27), for example, a local minimum, will have the same efficiency as the global minimum, and its BP is not less than that of β ^ 0 . Thus, although the absolute minimum of L ( β ) exists, it is not necessary to find it. In the first stage, the robust initial estimate β ^ 0 should satisfy regression, scale, and affine equivariance and have a high BP. LMS, LTS, and S-estimates are possible candidates. For Stage 2, one way to choose ρ 0 and ρ 1 is as follows:
ρ_0(e) = ρ(e/k_0),   ρ_1(e) = ρ(e/k_1),
where ρ is a function satisfying (A1). In order to satisfy Equation (23), we must have 0 < k_0 < k_1. The value of k_0 should be chosen so that δ/a = 0.5 is maintained, and k_1 is set to 4.685 to guarantee an efficiency of 95%.

2.2. Criteria for Assessing Estimator Performance

Following Fitrianto et al. [22], two typical precision measures for an estimator θ̂ are the RMSE and the bias. These can be estimated using bootstrap resampling. Specifically, given B = 5000 paired bootstrap samples, the bootstrap estimates of the bias and RMSE are as follows:

2.2.1. Bias Under the Bootstrap

Bias^ = (1/B) Σ_{b=1}^{B} (θ̂*_b − θ̂) = θ̄* − θ̂,
where θ̂*_b are bootstrap copies of θ̂ and θ̄* is their average.

2.2.2. RMSE Under the Bootstrap

RMSE^ = √[(1/B) Σ_{b=1}^{B} (θ̂*_b − θ̂)²],
where θ̂*_b are bootstrap copies of θ̂ and B is the number of bootstrap replications.
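Equations (28) and (29) translate directly into a paired-bootstrap routine. The Python sketch below uses plain OLS as the stand-in estimator θ̂ (the study applies the same scheme to each robust estimator; a reduced B is used here purely for speed):

```python
import numpy as np

def ols_coefs(X, y):
    """Plain OLS coefficients with intercept (stand-in for any estimator)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(Xd, y, rcond=None)[0]

def bootstrap_bias_rmse(X, y, estimator, B=1000, rng=None):
    """Paired-bootstrap bias and RMSE per Equations (28)-(29): resample
    (x_i, y_i) pairs with replacement, re-estimate, and compare the
    bootstrap copies against the full-sample estimate theta_hat."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(y)
    theta = estimator(X, y)                     # full-sample estimate
    reps = np.empty((B, theta.size))
    for b in range(B):
        idx = rng.integers(0, n, size=n)        # paired resampling
        reps[b] = estimator(X[idx], y[idx])
    bias = reps.mean(axis=0) - theta            # mean(theta*_b) - theta_hat
    rmse = np.sqrt(((reps - theta) ** 2).mean(axis=0))
    return bias, rmse
```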

3. Results

3.1. Monte Carlo Simulation Study

The following data-generation process was adapted from [23,24]. We generate n samples from the model below using RStudio:
y_i = β_0 + β_1 X_{i1} + β_2 X_{i2} + β_3 X_{i3} + ε_i,   i = 1, 2, …, n,
where the error terms are generated as ε_i ∼ N(0, 1), and the explanatory variables are generated using the following equation:
X_{ij} = (1 − ρ²)^{1/2} Z_{ij} + ρ Z_{i,p+1},   i = 1, 2, …, n,   j = 1, 2, …, p,
where Z i j are independent standard normal random numbers that are held fixed for a given sample of size n, and ρ is the degree of multicollinearity between predictors. In this study, we consider three values of ρ = 0.10, 0.50, and 0.98. The study by [14] mainly examined the RSE under high correlation; we extended the analysis to low and moderate levels to better understand its performance in varying degrees of multicollinearity.
The true values of the regression parameters are chosen as β_0 = β_1 = ⋯ = β_p = 1, which is a common restriction in simulation studies (e.g., [12,14,25]). Additionally, the performance of the RSE is also evaluated under heavy-tailed error distributions, such as t-distributions and the Cauchy distribution.
Five different sample sizes are considered, n ∈ {25, 50, 100, 150, 200}, with the number of predictors fixed at p = 3, and different proportions of outliers are evaluated, π ∈ {0%, 10%, 25%, 50%}. In each simulation setting, we perform N = 5000 MC replications, chosen as a compromise between achieving a low MC error and keeping the computation time reasonable.
To introduce outliers into the data, we consider two types: y-direction outliers and leverage points. For y-direction outliers, we first generate clean data according to Equations (30) and (31), then randomly select n × π observations and shift their response values by a large constant, that is, y_i^{outlier} = y_i + 10σ_ε, where σ_ε is the standard deviation of the error term. For leverage points, we randomly select n × π observations and shift their predictor values by X_{ij}^{outlier} = X_{ij} + 10 SD(X_j), where SD(X_j) is the standard deviation of the j-th predictor.
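One replicate of this data-generation and contamination scheme can be sketched as follows (a Python illustration; function and variable names are ours):

```python
import numpy as np

def simulate_replicate(n=100, p=3, rho=0.98, pi=0.10, outlier_type="y", rng=None):
    """One replicate of the scheme in Equations (30)-(31):
    X_ij = sqrt(1 - rho^2) Z_ij + rho Z_{i,p+1} induces pairwise predictor
    correlation rho^2; a fraction pi of rows is then contaminated either in
    the y-direction or as leverage points (10-standard-deviation shifts)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    Z = rng.normal(size=(n, p + 1))
    X = np.sqrt(1 - rho**2) * Z[:, :p] + rho * Z[:, [p]]
    eps = rng.normal(size=n)
    y = 1.0 + X @ np.ones(p) + eps              # beta_0 = ... = beta_p = 1
    m = int(round(n * pi))
    idx = rng.choice(n, size=m, replace=False)
    if outlier_type == "y":
        y[idx] += 10 * eps.std()                # y-direction outliers
    else:
        X[idx] += 10 * X.std(axis=0)            # leverage points
    return X, y
```

Note that the shared component Z_{i,p+1} makes the pairwise correlation between predictors ρ², so ρ = 0.98 yields empirical correlations near 0.96.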
A summary of the simulation scenarios is provided in Table 1. In addition, we evaluate the performance of the proposed estimators in the following ten cases:
  • Case I: ε_i ∼ N(0, 1) — standard normal errors with multicollinearity of 0.10, 0.50, and 0.98 and 0% outliers.
  • Case II: ε_i ∼ t_7 — Student's t-distribution with seven degrees of freedom, multicollinearity of 0.10, 0.50, and 0.98, and 0% outliers.
  • Case III: ε_i ∼ t_2 — Student's t-distribution with two degrees of freedom, multicollinearity of 0.10, 0.50, and 0.98, and 0% outliers.
  • Case IV: ε_i ∼ C(0, 1) — Cauchy-distributed errors with multicollinearity of 0.10, 0.50, and 0.98 and 0% outliers.
  • Case V: ε_i ∼ N(0, 1) — standard normal errors with 10%, 25%, and 50% outliers and no multicollinearity.
  • Case VI: ε_i ∼ t_2 — Student's t-distribution with two degrees of freedom, 10%, 25%, and 50% outliers, and no multicollinearity.
  • Case VII: ε_i ∼ N(0, 1) — 10%, 25%, and 50% outliers in the y-direction with multicollinearity of 0.10, 0.50, and 0.98.
  • Case VIII: ε_i ∼ N(0, 1) — 10%, 25%, and 50% leverage points with multicollinearity of 0.10, 0.50, and 0.98.
  • Case IX: ε_i ∼ t_2 — Student's t-distribution with two degrees of freedom, multicollinearity of 0.10, 0.50, and 0.98, and 10%, 25%, and 50% outliers in the y-direction.
  • Case X: ε_i ∼ t_2 — Student's t-distribution with two degrees of freedom, multicollinearity of 0.10, 0.50, and 0.98, and 10%, 25%, and 50% leverage points.
Table 1. Summary of simulation parameters considered.
Sample Size (n)   Multicollinearity (ρ)   Outlier (π)
25                0.10                    0.00
50                0.50                    0.10
100               0.98                    0.25
150                                       0.50
200

3.2. Simulation Study Findings

Case I (Table S1 and Figure 1): JSE and RSE produce the smallest RMSE values across all sample sizes, consistent with their theoretical optimality properties. MME, HME, OLSE, and LTSE achieve RMSE values comparable to those of JSE and RSE only when ρ = 0.50 and 0.98. In contrast, LMSE and SE display relatively higher RMSEs, reflecting their lower efficiency in the presence of multicollinearity. JSE and RSE achieve essentially zero bias for low-to-moderate multicollinearity (ρ = 0.10 and 0.50) at sample sizes n ≥ 100, while LMSE shows the highest bias in nearly all scenarios. All estimators exhibit monotonic behavior, with RMSE values consistently decreasing as the sample size increases.
Case II (Table S2 and Figure 2): The RSE consistently achieves the lowest RMSE under all the conditions in this case. The JSE consistently achieves low bias across different sample sizes and multicollinearity levels. The LMSE has the overall worst performance in all simulation scenarios.
Case III (Table S3 and Figure 3): In this case, the performance of the OLSE, JSE, and LMSE is substantially worse than that of the robust estimators. The RSE, MME, and HME exhibit very similar RMSE values, but the RSE outperforms all estimators in this case, as shown in Figure 3. HME and MME consistently achieve low bias across almost all sample sizes and multicollinearity levels, while JSE has the highest bias at ρ = 0.50 and 0.98.
Case IV (Table S4 and Figure 4): Under Cauchy-distributed errors, OLSE and JSE demonstrate erratic RMSE behavior, with values increasing due to the heavy-tailed nature of the distribution. In contrast, the robust estimators maintain lower RMSEs and a monotonic decrease with increasing sample size, highlighting their stability. Furthermore, as the correlation coefficient (ρ) increases, the performance of OLSE and JSE worsens further. The RSE is the best estimator under the Cauchy distribution, as shown in Figure 4. HME and MME maintain the lowest bias across scenarios, even at ρ = 0.98 and n = 200, whereas OLSE and JSE have the highest bias, with values exceeding one.
Case V (Tables S5 and S6, and Figure 5 and Figure 6): The MME performs best for y-direction outliers, and the LTSE performs best for x-direction (leverage) contamination. The RSE yields a higher RMSE here, showing limited robustness to both y-direction outliers and leverage points in this setting. LTSE and SE maintain small bias and low RMSE in the presence of leverage points, whereas RSE, JSE, and OLSE break down due to their high sensitivity, as shown in Figure 5 and Figure 6.
Case VI (Tables S7 and S8 and Figure 7 and Figure 8): Table S7 and Figure 7 show that MME, SE, HME and LTSE consistently achieve the lowest RMSE and almost zero bias, making them the best performers, while RSE, OLSE and JSE perform the worst due to high RMSE values. Table S8 and Figure 8 indicate that LTSE, MME, and SE are the most robust and accurate, while RSE and JSE again show the poorest performance.
Case VII (Tables S9–S11, and Figure 9, Figure 10 and Figure 11): When the data contain multicollinearity and outliers in the response direction, the performance of OLSE, JSE, and LMSE worsens significantly, particularly as the percentage of outliers and the degree of multicollinearity increase, as shown in Figure 9, Figure 10 and Figure 11. LMSE, OLSE, and JSE fail completely in Figure 11, where 50% of the observations are contaminated. In contrast, the RSE, SE, HME, and MME demonstrate superior performance relative to OLSE, JSE, and LMSE under the same conditions. LTSE and SE exhibit even better robustness, maintaining substantially lower RMSE values than OLSE, JSE, and LMSE, particularly under low contamination (10%, Figure 9). With 25% outliers (Figure 10), we observe patterns similar to Figure 9. Overall, the RSE performs very well when outliers and multicollinearity coincide, followed by MME, HME, and LTSE.
Case VIII (Tables S12–S14, and Figure 12, Figure 13 and Figure 14): RSE outperforms all estimators in all scenarios when the data contain multicollinearity and leverage points. The performance of LMSE, SE, LTSE, and OLSE worsens significantly as the percentage of outliers and multicollinearity increases, as shown by Figure 12, Figure 13 and Figure 14. The bias is the smallest for SE and MME when π = 25 % and 50%.
Case IX (Tables S15–S17 and Figure 15, Figure 16 and Figure 17): The RSE consistently performs better than average across all simulation scenarios. When 10% outliers and multicollinearity are present, HME, MME, LTSE, and SE perform similarly, as shown in Figure 15. In contrast, JSE and OLSE always have the poorest performance in these scenarios. LMSE also shows less-than-ideal accuracy, particularly beyond 10% contamination, where it generates the highest bias values. Overall, the HME and MME are the most accurate estimators, with consistently lower bias than the other methods.
Case X (Tables S18–S20, and Figure 18, Figure 19 and Figure 20): RMSE values under different multicollinearity levels and leverage-point proportions are shown in Figure 18, Figure 19 and Figure 20. The RMSE increases with both higher multicollinearity and larger proportions of leverage points. RSE consistently provides the lowest RMSE, reflecting good stability, while OLSE and JSE perform the worst. In terms of bias, HME and MME show the lowest bias in all scenarios, and JSE shows the highest. These findings again emphasize the superior performance of the RSE, reflecting its robustness and precision, and illustrate the relative bias performance of the other estimators under challenging data conditions.
In summary, JSE and OLSE work well only when there are no outliers, since they are very sensitive to them. The estimators' bias and RMSE values increased with the degree of multicollinearity (ρ) and the outlier percentage (π), and decreased with increasing sample size (n) when the other factors were fixed. When only multicollinearity was present in the model (as in Figure 1 and Figure 2, i.e., π = 0%), OLSE, JSE, HME, MME, and RSE were better than LMSE, LTSE, and SE; in Figure 4, however, OLSE and JSE performed the worst even without outliers. When both problems were present (as in Figure 9 through Figure 20, i.e., π > 0%), RSE, HME, and MME were better than LMSE, LTSE, SE, JSE, and OLSE for all values of π, ρ, and n. Overall, the RSE achieved the best performance among all the estimators considered when both outliers and multicollinearity were present.

3.3. Empirical Applications

In the previous section, we conducted an MC simulation study to compare the performance of the estimators. However, simulations are typically performed under idealized conditions. This section therefore considers three datasets as illustrative examples of handling outliers and multicollinearity in linear regression: the milk dataset, the real estate valuation dataset, and the Hawkins–Bradu–Kass dataset. Because real datasets frequently contain variable and uncontrollable levels of outliers, we adopted a systematic trimming procedure following [26] to adjust the outlier contamination level in each dataset. This approach produces multiple versions of each dataset with specific outlier percentages that match our simulation design, allowing a direct comparison of the theoretical, simulated, and real-data results. A summary of the datasets considered for this study is provided in Table A1.
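To make the trimming idea concrete, the sketch below shows one hypothetical way to produce dataset versions with a target contamination level: observations flagged as outliers are removed until the remaining outlier share matches the target. The function name and the removal order are illustrative assumptions, not the exact procedure of [26].

```python
def trim_to_target(n_total, outlier_idx, target_frac):
    """Drop flagged outliers until remaining contamination matches target_frac.

    Hypothetical sketch of the systematic trimming idea: the removal order
    (here, simply the order given) and the function name are illustrative.
    """
    remaining = list(outlier_idx)
    dropped = []
    # Remove one flagged outlier at a time until the contamination target is met.
    while remaining and len(remaining) / (n_total - len(dropped)) > target_frac:
        dropped.append(remaining.pop())
    return dropped

# e.g., 75 observations with 14 flagged outliers, trimmed to 0% contamination
dropped = trim_to_target(75, list(range(14)), 0.0)
n_clean = 75 - len(dropped)    # 61 observations remain, matching Table A1
```

Trimming to an intermediate target removes only as many flagged outliers as needed, so the same source data can yield several contamination levels for comparison.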
Study 1: Milk Dataset
The milk dataset provided by [27] describes the composition of milk using eight variables: density, fat content, protein content, casein content, cheese dry substance measured in the factory, cheese dry substance measured in the laboratory, milk dry substance, and cheese produced. According to [28], this dataset contains 17 outliers, corresponding to roughly 20% of the observations; these are observations 1–3, 12–17, 27, 41, 44, 47, 70, 74, 75, and 77. The milk dataset has severe multicollinearity (CN = 164.0314) and 20% y-direction outliers, making it closely comparable to simulation Case VII (ρ = 0.98, π = 0.10 to 0.25).

3.4. Analysis for the Milk Dataset

3.4.1. Multicollinearity Detection for Milk Dataset

The degree of multicollinearity in regression analysis is commonly assessed with the variance inflation factor (VIF); a VIF ≥ 10 is generally taken to indicate high multicollinearity [29]. Another metric for identifying multicollinearity in regression models is the condition number (CN); a CN between 30 and 100 indicates extremely high multicollinearity (see [30]). Table A2 presents the VIFs for the milk dataset, a crucial diagnostic for detecting multicollinearity in regression analysis. In the milk dataset, variables X4 and X5 exhibit high VIFs. Figure 21 shows strong positive correlations among variables X2, X3, X4 and X5 (ρ > 0.60), with X4 and X5 exhibiting a severe correlation (ρ = 0.98).
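For readers who want to reproduce these diagnostics, the sketch below computes VIFs and the condition number with NumPy. It is a generic implementation of the standard definitions (not the paper's code), and the example data are synthetic:

```python
import numpy as np

def vif_and_cn(X):
    """Variance inflation factors and condition number of a predictor matrix.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on
    the remaining columns (with intercept); the condition number is the
    ratio of the largest to smallest singular value of the standardized X.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        yj = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, yj, rcond=None)
        r2 = 1.0 - np.var(yj - Z @ beta) / np.var(yj)
        vifs[j] = 1.0 / (1.0 - r2)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize columns
    s = np.linalg.svd(Xs, compute_uv=False)
    return vifs, s.max() / s.min()

# Two nearly collinear predictors inflate their VIFs; an independent one does not.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=200), rng.normal(size=200)])
vifs, cn = vif_and_cn(X)
```

In this toy example the first two VIFs are large while the third stays near 1, mirroring how X4 and X5 stand out in Table A2.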

3.4.2. Outlier Detection Using Cook’s Distance for Milk Dataset

The Cook's distance plot for the milk dataset, with a threshold value of 0.047, is shown in Figure 22. Two observations (70 and 74) greatly exceed this threshold, with observation 70 reaching a value above 2.0; these points are highly influential and warrant further investigation. Their presence suggests possible anomalies or leverage points that could distort the model's predictions.
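As an illustration of how such a plot is produced, the sketch below computes Cook's distance from the leverages of an OLS fit using the standard textbook formula (a generic implementation, not the paper's code):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of an OLS fit.

    D_i = r_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2, where h_ii are the
    leverages (diagonal of the hat matrix), r_i the raw residuals, p the
    number of fitted coefficients, and s^2 the residual variance estimate.
    """
    Z = np.column_stack([np.ones(len(y)), X])    # design matrix with intercept
    n, p = Z.shape
    H = Z @ np.linalg.solve(Z.T @ Z, Z.T)        # hat matrix
    h = np.diag(H)
    resid = y - H @ y                            # OLS residuals
    s2 = resid @ resid / (n - p)
    return resid**2 / (p * s2) * h / (1 - h)**2
```

A common rule of thumb flags observations with D_i > 4/n; the dataset-specific threshold of 0.047 used above plays the same role.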

3.4.3. Testing for Normality for Milk Dataset

Both formal and graphical techniques were used to assess the normality of the OLS residuals. Formally, we used the Shapiro–Wilk test to evaluate consistency with the normality assumption. Extensive empirical evidence shows that the Shapiro–Wilk test is among the most powerful goodness-of-fit procedures for detecting departures from normality, including for regression residuals [31,32,33,34,35]. Figure 23 shows diagnostic plots for assessing residual normality in the milk data. Visual inspection reveals that the residuals are approximately normal, and the Shapiro–Wilk test yielded a p-value of 0.1964 (p > 0.05), indicating no significant departure from normality.

3.4.4. Model Fit and Evaluation for Milk Dataset

To compare the performance of the methods, the regression model was fitted to the milk dataset with each method, taking the sparsity of the models into account. Bias and RMSE were then used to evaluate performance. Table 2 reports the bias and RMSE of each estimation method for the milk dataset. The results show that the RSE outperforms all other estimators when both outliers and multicollinearity are present. With multicollinearity and 0% outliers, the JSE, OLSE, and the robust estimators have almost identical RMSE values, mirroring the pattern observed in our simulation study. When multicollinearity and outliers coexist, the next-best performers are JSE, MME, and HME. Figure 24 was used to assess bias and RMSE: the OLSE bias (red line) increases with the outlier percentage, whereas the RSE RMSE values (black line) decrease as the outlier percentage increases.
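The bootstrap evaluation of bias and RMSE can be sketched as follows. This is a generic pairs-bootstrap illustration on synthetic data; the paper's exact resampling scheme may differ:

```python
import numpy as np

def ols(X, y):
    """Plain OLS coefficients (intercept first)."""
    Z = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(Z, y, rcond=None)[0]

def bootstrap_bias_rmse(X, y, estimator, n_boot=500, seed=0):
    """Pairs-bootstrap estimates of bias and RMSE for a coefficient estimator.

    The full-sample fit is the reference value; rows are resampled with
    replacement and the estimator refitted to measure bias and spread.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    beta_ref = estimator(X, y)
    boots = np.empty((n_boot, beta_ref.size))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)         # resample observations
        boots[b] = estimator(X[idx], y[idx])
    diff = boots - beta_ref
    bias = np.abs(diff.mean(axis=0)).mean()      # mean absolute bias
    rmse = np.sqrt((diff**2).mean())             # overall root mean squared error
    return bias, rmse

# Deterministic toy data: a near-linear relationship with a small wiggle.
x = np.linspace(0.0, 1.0, 60)
y = 1.0 + 2.0 * x + 0.05 * np.sin(37.0 * x)
bias, rmse = bootstrap_bias_rmse(x.reshape(-1, 1), y, ols, n_boot=200)
```

Any of the estimators compared in Table 2 can be plugged in for `ols`, which is how a single bias/RMSE pair is obtained per method and contamination level.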
In Table 3 we fit the models to the milk dataset. The estimates of the coefficients show that only X6 is significant across all methods.
Study 2: Real Estate Valuation Dataset
The real estate valuation data were collected in 2018 from Sindian District, New Taipei City, Taiwan, and published by [36]. The response variable is the house price per unit area, measured in 10,000 New Taiwan Dollars per Ping, where one Ping corresponds to 3.3 m². The explanatory variables are the transaction date (X1), house age in years (X2), distance to the nearest metro station in meters (X3), number of convenience stores within walking distance (X4), geographic latitude in degrees (X5), and geographic longitude in degrees (X6). The dataset has moderate multicollinearity (CN = 14.7723) and 21.5% outliers, positioning it between simulation Cases I and VII with moderate correlation (ρ = 0.50) and π = 0.25. Furthermore, its non-normal error distribution (p < 0.0001) allows estimator performance to be assessed under conditions comparable to simulation Cases III and VI (heavy-tailed errors with multicollinearity).

3.5. Analysis for the Real Estate Valuation Dataset

3.5.1. Multicollinearity Detection for Real Estate Valuation Dataset

The real estate valuation dataset exhibits low VIFs for all predictors; however, the CN of 14.7723 indicates that moderate multicollinearity exists among the predictors, as shown in Table A2. Figure 25 shows strong negative correlations among variables X3, X4, X5 and X6, particularly between X3 and X6 (ρ = −0.81).

3.5.2. Outlier Detection Using Cook’s Distance for the Real Estate Valuation Dataset

Figure 26 displays the Cook's distance plot with a 0.01 threshold for the real estate valuation dataset. Observation 271 shows a markedly higher Cook's D value than the other points, indicating a substantial impact on the regression results. Observations 149, 221, and 313 show additional moderate influence.

3.5.3. Testing for Normality for Real Estate Valuation Dataset

Figure 27 shows clear deviations from normality, with skewed residual distributions. The Shapiro–Wilk test confirms that the residuals are not normally distributed, as the p-value is below 0.05 (p < 0.0001).

3.5.4. Model Fit and Evaluation for Real Estate Valuation Dataset

Table 4 presents the estimated bias and RMSE of the estimators at increasing levels of outlier contamination under multicollinearity. The RSE demonstrates the most consistent RMSE performance across all contamination levels, excelling particularly at higher outlier percentages. Although JSE shows the lowest RMSE on clean data, its performance deteriorates markedly as outliers increase. HME and OLSE maintain relatively low bias at all levels, but often at the cost of higher RMSE. Figure 28 was used to assess bias and RMSE. HME, OLSE, and RSE maintain consistently low bias across all outlier levels, while RSE and JSE are the clear top performers in terms of RMSE. Although the RSE outperforms all other estimators in RMSE, it exhibits higher bias than some of them; that is, it achieves the best overall predictive accuracy by effectively balancing bias and variance, even though it is not the best in terms of bias alone. As the percentage of outliers increases, the RSE consistently outperforms the other estimators, demonstrating strong robustness to both multicollinearity and outliers.
Table 5 shows the coefficient estimates with 95% confidence intervals, which provide formal statistical validation beyond RMSE comparisons. The study found that all predictor variables are statistically significant for RSE, JSE, HME and OLSE.
Study 3: Hawkins–Bradu–Kass Dataset
We used the artificial Hawkins–Bradu–Kass dataset constructed by [37], consisting of one dependent and three independent variables with 75 observations, of which 14 are outliers. The dataset exhibits a high degree of multicollinearity (CN = 102.5294, ρ > 0.95 among all predictors) and 18.67% outliers, making it a suitable empirical validation for simulation Case VII at extreme multicollinearity levels (ρ = 0.98, π = 0.10 to 0.25). Its non-normal residual distribution corresponds to the error distributions used in simulation Cases III and IX.

3.6. Analysis for the Hawkins Bradu Kass Dataset

3.6.1. Multicollinearity Detection for Hawkins Bradu Kass Dataset

It should be noted that all predictors exhibit exceptionally high VIF values in the Hawkins Bradu Kass dataset, as shown in Table A2, and the observed CN of 102.5294 indicates that there is strong multicollinearity among the regressors. Figure 29 shows that all regressors are highly correlated ( ρ > 0.95 ).

3.6.2. Outlier Detection Using Cook’s Distance for Hawkins Bradu Kass Dataset

The Cook's distance plot for the Hawkins–Bradu–Kass dataset, with a threshold value of 0.053, is shown in Figure 30. Observations 12 and 14 have Cook's D values far above the threshold, marking them as highly influential outliers. If not adequately addressed, these points could bias the parameter estimates and degrade the overall performance of the model.

3.6.3. Testing for Normality for Hawkins Bradu Kass Dataset

Figure 31 shows clear deviations from normality, with skewed residual distributions. The Shapiro–Wilk test confirms that the residuals are not normally distributed, as the p-value is below 0.05 (p < 0.0001).

3.6.4. Model Fit and Evaluation for Hawkins Bradu Kass Dataset

As shown in Table 6 and Figure 32, in the absence of outliers, the JSE achieved the lowest RMSE, while the HME had the lowest bias. Under 18.67% outlier contamination with multicollinearity, SE produced the lowest bias, and RSE achieved the lowest RMSE. MME also performed well under contamination, demonstrating both low bias and RMSE. In contrast, OLSE experienced a significant increase in both bias and RMSE when exposed to outliers. In general, robust estimators such as RSE and MME display greater stability and resilience under data contamination. In Table 7 we fit the models to the Hawkins–Bradu–Kass dataset. The estimates of the coefficients show that only X3 is significant under OLSE, HME, and RSE.

4. Discussion

The findings are consistent with the current literature on robust regression estimators. The superior performance of the RSE under simultaneous multicollinearity and outlier contamination is attributed to its dual-component structure: the HME component reduces the influence of outliers, while the JSE component stabilizes the estimates through shrinkage, thereby alleviating variance inflation [14]. The MME performed best when outliers appeared only in the y-direction and multicollinearity was absent, a pattern consistent with the findings of [38,39], who found that MME was the most effective estimator under pure vertical contamination. The failure of OLSE and JSE under Cauchy-distributed errors corroborates the conclusions of [24,40], which indicate that heavy-tailed distributions allow extreme observations to dominate the sum of squared residuals. The poor performance of high-breakdown estimators (LMSE, LTSE, SE) under multicollinearity without outliers demonstrates a basic trade-off in robust estimation [21]. Overall, the findings support the view that no single estimator is always best: MME retains its advantage in settings dominated by y-direction outliers without multicollinearity [38,39,41], while the RSE performs best when multicollinearity and outliers occur together [14]. These results highlight the importance of diagnostic analysis before choosing an estimation technique.
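To make the dual-component idea concrete, the sketch below combines a Huber M-fit (via IRLS) with a Stein-type shrinkage factor. The particular shrinkage constant c used here is an illustrative assumption, not the exact expression of [14]; see that paper for the precise RSE formula.

```python
import numpy as np

def huber_m(X, y, k=1.345, n_iter=50):
    """Huber M-estimator via iteratively reweighted least squares (IRLS)."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - Z @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12   # robust scale (MAD)
        u = np.abs(r) / s
        w = np.minimum(1.0, k / np.maximum(u, 1e-12))              # Huber weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)[0]
    return beta

def robust_stein(X, y):
    """Illustrative robust-Stein sketch: shrink a Huber M-fit by a factor c.

    The constant c = b'b / (b'b + p * s^2 * tr((Z'Z)^-1)) below is an
    assumption for illustration only; the exact shrinkage constant of the
    RSE is given in Lukman et al. (2023) [14].
    """
    Z = np.column_stack([np.ones(len(y)), X])
    b = huber_m(X, y)                             # robust component
    r = y - Z @ b
    s2 = np.median(r**2)                          # robust residual variance proxy
    c = (b @ b) / (b @ b + Z.shape[1] * s2 * np.trace(np.linalg.inv(Z.T @ Z)))
    return c * b                                  # shrinkage component
```

On clean data c stays close to 1 (little shrinkage); under multicollinearity, tr((Z'Z)^{-1}) grows and the estimate is pulled toward zero, which is the variance-reduction mechanism described above.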

5. Conclusions

In this study, we evaluated the performance of the RSE, which combines the shrinkage factors of the ME and JSE, under multicollinearity and under outliers arising from both leverage points and the y direction. We assessed the efficiency of the RSE in an extensive MC simulation study using bias and RMSE criteria. The simulation study covered several error distributions (normal, Student's t, and Cauchy), sample sizes, multicollinearity levels, and outliers from leverage points and the y direction considered separately.
When multicollinearity and outliers from leverage points or the y direction are present, the RSE outperforms the classical OLSE as well as JSE, HME, MME, LMSE, LTSE, and SE. Moreover, in the majority of the simulation scenarios evaluated, the RSE performs better than the other existing robust estimators. Based on the simulation study and the real-data applications, we find that the RSE is the estimator of choice when multicollinearity coexists with outliers from leverage points or the y direction. Future research may extend this work by comparing the RSE in simulation studies using other distributions, such as the Weibull, Gamma, and Poisson distributions, across different sample sizes, levels of outlier contamination, and degrees of multicollinearity.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/stats9010021/s1: Table S1: Estimated Bias and RMSE Values with Multicollinearity and no Outliers for the Normal Distribution ε_i ∼ N(0, 1); Table S2: Estimated Bias and RMSE Values with Multicollinearity and no Outliers for Student's t-Distribution with 7 Degrees of Freedom ε_i ∼ t_7; Table S3: Estimated Bias and RMSE Values with Multicollinearity and no Outliers for Student's t-Distribution with 2 Degrees of Freedom ε_i ∼ t_2; Table S4: Estimated Bias and RMSE Values with Multicollinearity and no Outliers for the Cauchy Distribution ε_i ∼ C(0, 1); Table S5: Estimated Bias and RMSE Values for Normal Errors with Outliers in the y Direction and no Multicollinearity; Table S6: Estimated Bias and RMSE Values under Normal Errors with Leverage Point Outliers and no Multicollinearity; Table S7: Estimated Bias and RMSE Values under the ε_i ∼ t_2 Distribution with Outliers in the y Direction and no Multicollinearity; Table S8: Estimated Bias and RMSE Values for the ε_i ∼ t_2 Distribution with Leverage Point Outliers and no Multicollinearity; Table S9: Estimated Bias and RMSE Values for Normal Errors with 10% Outliers in the y Direction and Multicollinearity; Table S10: Estimated Bias and RMSE Values for Normal Errors with 25% Outliers in the y Direction and Multicollinearity; Table S11: Estimated Bias and RMSE Values for Normal Errors with 50% Outliers in the y Direction and Multicollinearity; Table S12: Estimated Bias and RMSE Values for Normal Errors with 10% Leverage Point Outliers and Multicollinearity; Table S13: Estimated Bias and RMSE Values for Normal Errors with 25% Leverage Point Outliers and Multicollinearity; Table S14: Estimated Bias and RMSE Values for Normal Errors with 50% Leverage Point Outliers and Multicollinearity; Table S15: Estimated Bias and RMSE Values for the ε_i ∼ t_2 Distribution with 10% Outliers in the y Direction and Multicollinearity; Table S16: Estimated Bias and RMSE Values for the ε_i ∼ t_2 Distribution with Multicollinearity and 25% Outliers in the y Direction; Table S17: Estimated Bias and RMSE Values for the ε_i ∼ t_2 Distribution with 50% Outliers in the y Direction and Multicollinearity; Table S18: Estimated Bias and RMSE Values for the ε_i ∼ t_2 Distribution with 10% Leverage Point Outliers and Multicollinearity; Table S19: Estimated Bias and RMSE Values for the ε_i ∼ t_2 Distribution with 25% Leverage Point Outliers and Multicollinearity; Table S20: Estimated Bias and RMSE Values for the ε_i ∼ t_2 Distribution with 50% Leverage Point Outliers and Multicollinearity.

Author Contributions

Conceptualization, L.D. and C.S.M.; methodology, L.D. and C.S.M.; software, L.D.; validation, L.D. and C.S.M.; formal analysis, L.D.; investigation, L.D. and C.S.M.; resources, L.D.; writing, original draft preparation, L.D.; writing, review and editing, L.D., C.S.M. and L.O.K.; visualization, L.D., C.S.M. and L.O.K.; supervision, C.S.M.; project administration, C.S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a fee waiver bursary from the hosting University’s Research and Innovation Office, which also contributed partial funding for the article processing charges (APC).

Data Availability Statement

Data derived from public domain resources. The three datasets used and presented in this study are available in different online sources which are in the public domain [27,36,37].

Acknowledgments

Firstly, the authors extend sincere gratitude to the Almighty God. The authors would also like to thank the reviewers and the editor for their insightful comments and constructive feedback, which significantly contributed to improving the quality of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Summary of the datasets.

Study No. | Dataset | n | p | No. of Outliers | Outliers (%)
1 | Milk dataset (original) | 86 | 8 | 20 | 20%
1 | Milk trimmed dataset | 62 | 8 | 0 | 0%
1 | Milk trimmed dataset | 81 | 8 | 8 | 10%
2 | Real Estate Valuation (original) | 414 | 7 | 88 | 21.25%
2 | Real Estate Valuation trimmed dataset | 212 | 7 | 0 | 0%
2 | Real Estate Valuation trimmed dataset | 268 | 7 | 41 | 15.3%
3 | Hawkins–Bradu–Kass dataset (original) | 75 | 4 | 14 | 18.67%
3 | Hawkins–Bradu–Kass trimmed dataset | 61 | 4 | 0 | 0%
Table A2. Variance inflation factors (VIF) of the milk, real estate valuation, and Hawkins–Bradu–Kass datasets.

Dataset | Variable | VIF | Variable | VIF
Milk dataset | X1 | 2.2007 | X5 | 24.7561
 | X2 | 8.2865 | X6 | 3.2919
 | X3 | 7.2187 | X7 | 2.1834
 | X4 | 24.2253 | |
Real estate valuation dataset | X1 | 1.0147 | X5 | 1.6023
 | X2 | 1.0143 | X6 | 2.9263
 | X3 | 4.3230 | |
 | X4 | 1.6170 | |
Hawkins–Bradu–Kass dataset | X1 | 13.4320 | |
 | X2 | 23.8535 | |
 | X3 | 33.4325 | |

References

  1. Sani, M. Robust Diagnostic and Parameter Estimation for Multiple Linear and Panel Data Regression Models. Ph.D. Thesis, Universiti Putra Malaysia, Selangor, Malaysia, 2018. [Google Scholar]
  2. Algamal, Z.; Lukman, A.; Abonazel, M.R.; Awwad, F. Performance of the Ridge and Liu Estimators in the zero-inflated Bell Regression Model. J. Math. 2022, 2022, 9503460. [Google Scholar] [CrossRef]
  3. Fernandes, G.; Rodrigues, J.J.; Carvalho, L.F.; Al-Muhtadi, J.F.; Proença, M.L. A comprehensive survey on network anomaly detection. Telecommun. Syst. 2019, 70, 447–489. [Google Scholar] [CrossRef]
  4. Hampel, F.R. Robust statistics: A brief introduction and overview. In Research Report/Seminar für Statistik, Eidgenössische Technische Hochschule (ETH); Seminar für Statistik, Eidgenössische Technische Hochschule: Zürich, Switzerland, 2001; Volume 94. [Google Scholar]
  5. Huber, P.J. Robust regression: Asymptotics, conjectures and Monte Carlo. Ann. Stat. 1973, 1, 799–821. [Google Scholar] [CrossRef]
  6. Draper, N.R. Applied regression analysis bibliography update 1992–1993. Commun. Stat. Theory Methods 1994, 23, 2701–2731. [Google Scholar] [CrossRef]
  7. Hadi, A.S. A modification of a method for the detection of outliers in multivariate samples. J. R. Stat. Soc. Ser. Stat. Methodol. 1994, 56, 393–396. [Google Scholar] [CrossRef]
  8. Neter, J.; Wasserman, W.; Kutner, M.H. Applied Linear Regression Models; Richard D. Irwin: Homewood, IL, USA, 1983. [Google Scholar]
  9. Fitrianto, A.; Xin, S.H. Comparisons between robust regression approaches in the presence of outliers and high leverage points. BAREKENG J. Ilmu Mat. Dan Terap. 2022, 16, 243–252. [Google Scholar] [CrossRef]
  10. Arum, K.; Ugwuowo, F. Combining principal component and robust ridge estimators in linear regression model with multicollinearity and outlier. Concurr. Comput. Pract. Exp. 2022, 34, e6803. [Google Scholar] [CrossRef]
  11. Jegede, S.L.; Lukman, A.F.; Ayinde, K.; Odeniyi, K.A. Jackknife Kibria-Lukman M-estimator: Simulation and application. J. Niger. Soc. Phys. Sci. 2022, 4, 251–264. [Google Scholar] [CrossRef]
  12. Dawoud, I.; Kibria, B.G. A new biased estimator to combat the multicollinearity of the Gaussian linear regression model. Stats 2020, 3, 526–541. [Google Scholar] [CrossRef]
  13. Dawoud, I.; Abonazel, M.R. Robust Dawoud–Kibria estimator for handling multicollinearity and outliers in the linear regression model. J. Stat. Comput. Simul. 2021, 91, 3678–3692. [Google Scholar] [CrossRef]
  14. Lukman, A.F.; Farghali, R.A.; Kibria, B.G.; Oluyemi, O.A. Robust-stein estimator for overcoming outliers and multicollinearity. Sci. Rep. 2023, 13, 9066. [Google Scholar] [CrossRef]
  15. Alma, Ö.G. Comparison of robust regression methods in linear regression. Int. J. Contemp. Math. Sci. 2011, 6, 409–421. [Google Scholar]
  16. Huber, P.J. The place of the L1-norm in robust estimation. Comput. Stat. Data Anal. 1987, 5, 255–262. [Google Scholar] [CrossRef]
  17. Stein, C. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics; University of California Press: Oakland, CA, USA, 1956; Volume 3, pp. 197–207. [Google Scholar]
  18. James, W.; Stein, C. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Oakland, CA, USA, 1961; Volume 1, pp. 361–379. [Google Scholar]
  19. Siegel, A.F. Robust regression using repeated medians. Biometrika 1982, 69, 242–244. [Google Scholar] [CrossRef]
  20. Yohai, V.J. High breakdown-point and high efficiency robust estimates for regression. Ann. Stat. 1987, 15, 642–656. [Google Scholar] [CrossRef]
  21. Rousseeuw, P.; Yohai, V.J. Robust regression by means of S-estimators. In Robust and Nonlinear Time Series Analysis: Proceedings of a Workshop Organized by the Sonderforschungsbereich 123 "Stochastische Mathematische Modelle", Heidelberg 1983; Springer: Berlin/Heidelberg, Germany, 1984; pp. 256–272. [Google Scholar]
  22. Fitrianto, A.; Midi, H. Estimating bias and rmse of indirect effects using rescaled residual bootstrap in mediation analysis. WSEAS Trans. Math. 2010, 9, 397–406. [Google Scholar]
  23. Algamal, Z.Y. Performance of ridge estimator in inverse Gaussian regression model. Commun. Stat. Theory Methods 2019, 48, 3836–3849. [Google Scholar] [CrossRef]
  24. Affindi, A.; Ahmad, S.; Mohamad, M. A comparative study between ridge MM and ridge least trimmed squares estimators in handling multicollinearity and outliers. J. Phys. Conf. Ser. 2019, 1366, 012113. [Google Scholar] [CrossRef]
  25. Suhail, M.; Chand, S.; Aslam, M. New quantile based ridge M-estimator for linear regression models with multicollinearity and outliers. Commun. Stat. Simul. Comput. 2023, 52, 1417–1434. [Google Scholar] [CrossRef]
  26. Lee, B.K.; Lessler, J.; Stuart, E.A. Weight trimming and propensity score weighting. PLoS ONE 2011, 6, e18174. [Google Scholar] [CrossRef]
  27. Daudin, J.; Duby, C.; Trecourt, P. Stability of principal component analysis studied by the bootstrap method. Stat. J. Theor. Appl. Stat. 1988, 19, 241–258. [Google Scholar] [CrossRef]
  28. Mutalib, S.S.S.; Satari, S.Z.; Yusoff, W.N.S.W. Comparison of robust estimators for detecting outliers in multivariate datasets. J. Phys. Conf. Ser. 2021, 1988, 012095. [Google Scholar] [CrossRef]
  29. Bowerman, B.; O’Connell, R.T.; Dickey, D. Linear Statistical Models: An Applied Approach; PWS: Singapore, 1990. [Google Scholar]
  30. Yong-Wei, G.; Willan, A.; Watts, D. A method to measure and test the damage of multicollinearity to parameter estimation. Sci. Surv. Mapp. 2008, 2, 1–44. [Google Scholar]
  31. Shapiro, S.S.; Wilk, M.B. An analysis of variance test for normality (complete samples). Biometrika 1965, 52, 591–611. [Google Scholar] [CrossRef]
  32. Dyer, A.R. Comparisons of tests for normality with a cautionary note. Biometrika 1974, 61, 185–189. [Google Scholar] [CrossRef]
  33. Marmolejo-Ramos, F.; González, J. A power comparison of various tests of univariate normality on ex-Gaussian distributions. Methodology 2012, 9, 137–149. [Google Scholar] [CrossRef]
  34. Marange, C.S.; Qin, Y. An empirical likelihood ratio based comparative study on goodness of fit tests for normality of residuals in linear models. Metod. Zv. Adv. Methodol. Stat. 2019, 16, 1–16. [Google Scholar] [CrossRef]
  35. Huang, C.J.; Bolch, B.W. On the testing of regression disturbances for normality. J. Am. Stat. Assoc. 1974, 69, 330–335. [Google Scholar] [CrossRef]
  36. Yeh, I.C.; Hsu, T.K. Building real estate valuation models with comparative approach through case-based reasoning. Appl. Soft Comput. 2018, 65, 260–271. [Google Scholar] [CrossRef]
  37. Hawkins, D.M.; Bradu, D.; Kass, G.V. Location of several outliers in multiple-regression data using elemental sets. Technometrics 1984, 26, 197–208. [Google Scholar] [CrossRef]
  38. Ibrahim, A.; Dike, I.J.; Abdulrasheed, B. Comparative study of some estimators of linear regression models in the presence of outliers. FUDMA J. Sci. 2022, 6, 368–376. [Google Scholar] [CrossRef]
  39. Rahayu, D.A.; Nursholihah, U.F.; Suryaputra, G.; Surono, S. Comparasion of the m, mm and s estimator in robust regression analysis on indonesian literacy index data 2018. EKSAKTA J. Sci. Data Anal. 2023, 4, 11–22. [Google Scholar] [CrossRef]
  40. Gad, A.M.; Qura, M.E. Regression estimation in the presence of outliers: A comparative study. Int. J. Probab. Stat. 2016, 5, 65–72. [Google Scholar]
  41. Almetwally, E.M.; Almongy, H. Comparison between M-estimation, S-estimation, and MM estimation methods of robust estimation with application and simulation. Int. J. Math. Arch. 2018, 9, 1–9. [Google Scholar]
Figure 1. RMSE under normal errors with multicollinearity and without outliers.
Figure 1. RMSE under normal errors with multicollinearity and without outliers.
Stats 09 00021 g001
Figure 2. RMSE under student’s t-distribution with seven degrees of freedom with multicollinearity and without outliers.
Figure 2. RMSE under student’s t-distribution with seven degrees of freedom with multicollinearity and without outliers.
Stats 09 00021 g002
Figure 3. RMSE under Student’s t-distribution with two degrees of freedom with multicollinearity and without outliers.
Figure 3. RMSE under Student’s t-distribution with two degrees of freedom with multicollinearity and without outliers.
Stats 09 00021 g003
Figure 4. RMSE under Cauchy distribution with multicollinearity and without outliers.
Figure 4. RMSE under Cauchy distribution with multicollinearity and without outliers.
Stats 09 00021 g004
Figure 5. RMSE under normal errors with no multicollinearity and outliers in y-direction.
Figure 5. RMSE under normal errors with no multicollinearity and outliers in y-direction.
Stats 09 00021 g005
Figure 6. RMSE under normal errors with no multicollinearity and the presence of leverage points.
Figure 6. RMSE under normal errors with no multicollinearity and the presence of leverage points.
Stats 09 00021 g006
Figure 7. RMSE with ε i t 2 , no multicollinearity, and the presence of outliers in the y-direction.
Figure 7. RMSE with ε i t 2 , no multicollinearity, and the presence of outliers in the y-direction.
Stats 09 00021 g007
Figure 8. RMSE with ε i t 2 , no multicollinearity, and the presence of leverage points.
Figure 8. RMSE with ε i t 2 , no multicollinearity, and the presence of leverage points.
Stats 09 00021 g008
Figure 9. RMSE with normal errors with multicollinearity and 10% outliers in y-direction.
Figure 9. RMSE with normal errors with multicollinearity and 10% outliers in y-direction.
Stats 09 00021 g009
Figure 10. RMSE with normal errors with multicollinearity and 25% outliers in y-direction.
Figure 10. RMSE with normal errors with multicollinearity and 25% outliers in y-direction.
Stats 09 00021 g010
Figure 11. RMSE with normal errors with multicollinearity and 50% outliers in y-direction.
Figure 11. RMSE with normal errors with multicollinearity and 50% outliers in y-direction.
Stats 09 00021 g011
Figure 12. RMSE with normal errors with multicollinearity and 10% leverage points.
Figure 12. RMSE with normal errors with multicollinearity and 10% leverage points.
Stats 09 00021 g012
Figure 13. RMSE with normal mrrors with multicollinearity and 25% leverage points.
Figure 13. RMSE with normal mrrors with multicollinearity and 25% leverage points.
Stats 09 00021 g013
Figure 14. RMSE with normal errors with multicollinearity and 50% leverage points.
Figure 14. RMSE with normal errors with multicollinearity and 50% leverage points.
Stats 09 00021 g014
Figure 15. RMSE with ε i t 2 for different multicollinearity levels and 10% y-outliers.
Figure 15. RMSE with ε i t 2 for different multicollinearity levels and 10% y-outliers.
Stats 09 00021 g015
Figure 16. RMSE with ε i t 2 for different multicollinearity levels and 25% y-outliers.
Figure 16. RMSE with ε i t 2 for different multicollinearity levels and 25% y-outliers.
Stats 09 00021 g016
Figure 17. RMSE with ε i t 2 for different multicollinearity levels and 50% y-outliers.
Figure 17. RMSE with ε i t 2 for different multicollinearity levels and 50% y-outliers.
Stats 09 00021 g017
Figure 18. RMSE with ε i t 2 for different multicollinearity levels and 10% leverage points.
Figure 18. RMSE with ε i t 2 for different multicollinearity levels and 10% leverage points.
Stats 09 00021 g018
Figure 19. RMSE with ε i t 2 for different multicollinearity levels and 25% leverage points.
Figure 19. RMSE with ε i t 2 for different multicollinearity levels and 25% leverage points.
Stats 09 00021 g019
Figure 20. RMSE with ε i t 2 for different multicollinearity levels and 50% leverage points.
Figure 20. RMSE with ε i t 2 for different multicollinearity levels and 50% leverage points.
Stats 09 00021 g020
Figure 21. Correlation matrix for the milk dataset.
Figure 21. Correlation matrix for the milk dataset.
Stats 09 00021 g021
Figure 22. Plot to check for outliers using the milk dataset.
Figure 22. Plot to check for outliers using the milk dataset.
Stats 09 00021 g022
Figure 23. Plots to check for normality on OLS using the milk dataset.
Figure 23. Plots to check for normality on OLS using the milk dataset.
Stats 09 00021 g023
Figure 24. Comparison of bias and RMSE across multicollinearity and outlier levels for the milk dataset.
Figure 24. Comparison of bias and RMSE across multicollinearity and outlier levels for the milk dataset.
Stats 09 00021 g024
Figure 25. Correlation matrix for the real estate valuation dataset.
Figure 25. Correlation matrix for the real estate valuation dataset.
Stats 09 00021 g025
Figure 26. Plot to check for outliers using a real estate valuation dataset.
Figure 26. Plot to check for outliers using a real estate valuation dataset.
Stats 09 00021 g026
Figure 27. Normality diagnostic plots for the OLS fit on the real estate valuation dataset.
Figure 28. Comparison of bias and RMSE across multicollinearity and outlier levels for the real estate valuation dataset.
Figure 29. Correlation matrix for the Hawkins–Bradu–Kass dataset.
Figure 30. Outlier diagnostic plot for the Hawkins–Bradu–Kass dataset.
Figure 31. Normality diagnostic plots for the OLS fit on the Hawkins–Bradu–Kass dataset.
Figure 32. Comparison of bias and RMSE across multicollinearity and outlier levels for the Hawkins–Bradu–Kass dataset.
Table 2. Estimated bias and RMSE of the estimators under multicollinearity and varying outlier levels for the milk dataset.

Estimator   0% Outlier           10% Outlier          20% Outlier
            Bias      RMSE       Bias      RMSE       Bias      RMSE
OLSE        0.0002    0.0247     0.0007    0.0295     0.0028    0.0270
HME         0.0010    0.0280     0.0032    0.0282     0.0035    0.0250
MME         0.0026    0.0472     0.0059    0.0126     0.0058    0.0340
LMSE        0.0044    0.0890     0.0056    0.0667     0.0064    0.0606
LTSE        0.0041    0.0767     0.0006    0.0531     0.0065    0.0489
SE          0.0043    0.0753     0.0054    0.0235     0.0064    0.0214
JSE         0.0006    0.0228     0.0014    0.0126     0.0029    0.0111
RSE         0.0013    0.0261     0.0030    0.0114     0.0037    0.0108
Note: Bold values indicate the lowest Bias or RMSE within each outlier category.
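The bias and RMSE figures above are bootstrap estimates. A minimal sketch of that computation is given below, using synthetic data and plain OLS as the estimator (the milk data and the paper's exact pipeline are not reproduced here; the full-sample fit serves as the bootstrap reference, and both measures are averaged over coefficients):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a regression dataset (illustrative only).
n, p = 85, 7
X = rng.normal(size=(n, p))
y = X @ np.zeros(p) + rng.normal(scale=0.1, size=n)

def ols(X, y):
    """Ordinary least squares coefficient vector."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_full = ols(X, y)                    # full-sample estimate (reference)
B = 500                                  # bootstrap replications
boot = np.empty((B, p))
for b in range(B):
    idx = rng.integers(0, n, size=n)     # resample rows with replacement
    boot[b] = ols(X[idx], y[idx])

# Aggregate across coefficients into one bias and one RMSE figure.
bias = np.abs(boot.mean(axis=0) - beta_full).mean()
rmse = np.sqrt(((boot - beta_full) ** 2).mean())
print(f"bias={bias:.4f}, rmse={rmse:.4f}")
```

Replacing `ols` with a robust or shrinkage fit yields the corresponding row of the table.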
Table 3. Coefficient estimates with 95% confidence intervals for the milk dataset.

Method   Row        X1       X2       X3       X4       X5       X6         X7
OLSE     LCL        0.0102   0.0147   0.0235   0.0176   0.0357   0.0051     0.0013
         Estimate   0.0032   0.0016   0.0044   0.0066   0.0172   0.0079 *   0.0187
         UCL        0.0030   0.0257   0.0169   0.0286   0.0022   0.0110     0.0409
HME      LCL        0.0093   0.0152   0.0187   0.0142   0.0313   0.0051     0.0042
         Estimate   0.0023   0.0006   0.0086   0.0060   0.0145   0.0077 *   0.0123
         UCL        0.0031   0.0210   0.0193   0.0254   0.0029   0.0107     0.0320
MME      LCL        0.0105   0.0155   0.0300   0.0196   0.0333   0.0044     0.0235
         Estimate   0.0019   0.0031   0.0098   0.0029   0.0149   0.0078 *   0.0053
         UCL        0.0038   0.0260   0.0237   0.0233   0.0102   0.0112     0.0343
LMSE     LCL        0.0184   0.0373   0.0589   0.0456   0.0603   0.0015     0.0481
         Estimate   0.0014   0.0061   0.0008   0.0046   0.0122   0.0078 *   0.0056
         UCL        0.0101   0.0459   0.0571   0.0482   0.0487   0.0136     0.0577
LTSE     LCL        0.0150   0.0249   0.0472   0.0326   0.0467   0.0037     0.0319
         Estimate   0.0015   0.0063   0.0013   0.0041   0.0121   0.0079 *   0.0052
         UCL        0.0059   0.0381   0.0449   0.0322   0.0297   0.0121     0.0419
SE       LCL        0.0178   0.0178   0.0464   0.0366   0.0430   0.0036     0.0319
         Estimate   0.0008   0.0059   0.0068   0.0030   0.0157   0.0078 *   0.0037
         UCL        0.0054   0.0349   0.0474   0.0281   0.0298   0.0125     0.0352
JSE      LCL        0.0101   0.0134   0.0201   0.0140   0.0310   0.0046     0.0009
         Estimate   0.0029   0.0012   0.0047   0.0055   0.0153   0.0072 *   0.0165
         UCL        0.0026   0.0225   0.0175   0.0266   0.0023   0.0103     0.0341
RSE      LCL        0.0094   0.0139   0.0140   0.0124   0.0300   0.0048     0.0032
         Estimate   0.0022   0.0011   0.0089   0.0055   0.0137   0.0072 *   0.0116
         UCL        0.0028   0.0161   0.0187   0.0244   0.0032   0.0098     0.0291
Note: LCL (lower confidence limit), UCL (upper confidence limit); (*) indicates significance at the α = 0.05 level.
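Interval limits like the LCL/UCL rows above are commonly obtained with the percentile bootstrap: refit on resampled data, then take the 2.5th and 97.5th percentiles of each coefficient. A minimal sketch with synthetic data and OLS (not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data; the real analysis uses the milk predictors and response.
n, p = 85, 7
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) * 0.01 + rng.normal(size=n)

B = 1000
boot = np.empty((B, p))
for b in range(B):
    idx = rng.integers(0, n, size=n)          # resample with replacement
    boot[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]

estimate = np.linalg.lstsq(X, y, rcond=None)[0]
lcl = np.percentile(boot, 2.5, axis=0)        # lower confidence limit
ucl = np.percentile(boot, 97.5, axis=0)       # upper confidence limit

# A coefficient is flagged significant at alpha = 0.05 when its
# interval excludes zero, matching the (*) convention in the table.
significant = (lcl > 0) | (ucl < 0)
```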
Table 4. Estimated bias and RMSE of the estimators under multicollinearity and varying outlier levels for the real estate valuation dataset.

Estimator   0% Outlier                   15.3% Outlier                21.5% Outlier
            Bias         RMSE            Bias         RMSE            Bias        RMSE
OLSE        168.2679     54,504.9900     2732.8280    66,897.1350     7291.0330   59,865.6200
HME         7683.2360    67,391.2100     1937.2760    62,329.2885     2379.8690   47,601.0400
MME         12,588.7800  67,391.2100     2322.3150    70,523.8053     720.0933    50,175.4500
LMSE        18,806.2100  167,922.3000    9802.2880    188,380.7012    8417.1460   137,477.5000
LTSE        17,386.5200  116,365.6000    5397.6670    90,165.8498     6875.5230   71,319.8900
SE          21,667.5800  89,613.2900     12,931.5900  28,573.5115     4669.2530   20,823.2300
JSE         1446.8980    29,700.3800     3393.4700    13,195.9917     8122.0880   13,441.6400
RSE         9355.8480    40,048.4200     3861.2500    13,122.0683     3046.3670   10,501.8300
Note: Bold values indicate the lowest Bias or RMSE within each outlier category.
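The multicollinearity visible in the correlation matrices (Figures 21, 25 and 29) is conventionally quantified with variance inflation factors, VIF_j = 1/(1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. A minimal sketch, not tied to the paper's datasets:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the design matrix X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1.0 - resid.var() / X[:, j].var()         # R^2 of column j on the rest
        out[j] = 1.0 / (1.0 - r2)
    return out

# Two nearly collinear predictors plus an independent one (illustrative).
rng = np.random.default_rng(4)
z = rng.normal(size=200)
X = np.column_stack([z, z + 0.05 * rng.normal(size=200), rng.normal(size=200)])
vifs = vif(X)
print(vifs)    # first two VIFs large, third near 1
```

Values above the common rule-of-thumb threshold of 10 flag problematic collinearity.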
Table 5. Coefficient estimates with 95% confidence intervals for the real estate valuation dataset.

Method   Row        X1          X2          X3        X4          X5             X6
OLSE     LCL        874.29      −352.62     −6.72     678.32      110,419.70     −153,479.40
         Estimate   3754.65 *   −269.16 *   −5.52 *   1136.75 *   201,935.60 *   −103,319.30 *
         UCL        6728.76     −184.54     −4.43     1521.31     285,090.50     −59,497.40
HME      LCL        93.95       −367.64     −5.89     895.64      131,361.90     −117,826.80
         Estimate   2357.41 *   −298.01 *   −5.01 *   1198.02 *   208,792.30 *   −81,590.80 *
         UCL        4548.51     −230.50     −4.07     1532.01     293,689.80     −44,261.00
MME      LCL        −2705.04    −432.29     −10.94    363.69      115,550.90     −122,244.10
         Estimate   1203.37     −302.09 *   −5.22 *   1182.22 *   208,400.50 *   −62,398.90
         UCL        4668.72     −151.12     −2.80     2003.35     316,911.50     9656.81
LMSE     LCL        −5535.30    −538.38     −13.25    −44.80      −34,083.48     −199,199.20
         Estimate   486.03      −267.82 *   −5.83 *   1173.77     199,015.50     −48,595.50
         UCL        8262.26     −48.12      −1.61     2589.28     442,348.00     53,781.28
LTSE     LCL        −1966.84    −428.95     −11.33    545.01      76,641.88      −124,414.00
         Estimate   686.87      −297.89 *   −5.61 *   1194.42 *   198,832.30 *   −51,877.60 *
         UCL        4318.37     −182.37     −3.19     1891.99     330,466.30     −8131.57
SE       LCL        −3289.21    −467.78     −11.85    269.45      83,853.34      −146,443.70
         Estimate   445.64      −275.69 *   −5.86 *   1173.76 *   197,567.90 *   −47,627.10
         UCL        5866.78     −115.15     −2.58     2161.03     360,706.00     20,103.39
JSE      LCL        1046.77     −345.96     −6.69     655.91      113,213.80     −152,953.80
         Estimate   3700.06 *   −265.66 *   −5.46 *   1120.25 *   200,256.10 *   −102,075.30 *
         UCL        6705.15     −177.59     −4.40     1496.77     278,830.30     −60,578.70
RSE      LCL        308.02      −360.72     −5.94     886.87      121,104.70     −116,835.90
         Estimate   2396.59 *   −296.85 *   −5.01 *   1179.11 *   207,385.00 *   −81,953.30 *
         UCL        4438.46     −228.47     −4.15     1492.47     290,164.10     −44,765.30
Note: LCL (lower confidence limit), UCL (upper confidence limit); (*) indicates significance at the α = 0.05 level.
Table 6. Estimated bias and RMSE of the estimators under multicollinearity and varying outlier levels for the Hawkins–Bradu–Kass dataset.

Estimator   0% Outlier           18.67% Outlier
            Bias      RMSE       Bias      RMSE
OLSE        0.0004    0.0633     0.0672    0.3983
HME         0.0002    0.0683     0.0671    0.3721
MME         0.0005    0.0721     0.0147    0.1127
LMSE        0.0058    0.2818     0.0122    0.2698
LTSE        0.0031    0.1542     0.0238    0.2135
SE          0.0092    0.1704     0.0078    0.1494
JSE         0.0012    0.0593     0.0588    0.3669
RSE         0.0025    0.0620     0.0152    0.0847
Note: Bold values indicate the lowest Bias or RMSE within each outlier category.
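The robustness of the M-type estimators (such as HME) to the gross y-outliers in data like Hawkins–Bradu–Kass comes from downweighting large residuals. Below is a simplified Huber M-estimation sketch via iteratively reweighted least squares on synthetic contaminated data; `huber_irls` and its tuning constant are illustrative, not the study's exact implementation:

```python
import numpy as np

def huber_irls(X, y, k=1.345, n_iter=50):
    """Huber M-estimation by iteratively reweighted least squares.

    Observations whose scaled residual exceeds k receive weight k/|r|,
    so gross y-outliers are downweighted instead of dominating the fit
    as they do under ordinary least squares.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]           # OLS start
    for _ in range(n_iter):
        r = y - X @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745  # MAD scale
        s = max(s, 1e-12)
        u = np.abs(r / s)
        w = np.where(u <= k, 1.0, k / u)                  # Huber weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

# Clean linear data plus a block of gross y-outliers, loosely mimicking
# the contamination pattern of Hawkins-Bradu-Kass (illustrative only).
rng = np.random.default_rng(2)
n = 75
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)
y[:10] += 20.0                                            # 10 gross y-outliers

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_hub = huber_irls(X, y)
```

On such data the Huber fit lands much closer to the true coefficients than OLS, mirroring the RMSE gap between HME and OLSE in the contaminated column of Table 6.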
Table 7. Coefficient estimates with 95% confidence intervals for the Hawkins–Bradu–Kass dataset.

Method   Row        X1       X2       X3
OLSE     LCL        0.2438   0.7004   0.0006
         Estimate   0.0406   0.2679   0.3962 *
         UCL        0.3007   0.3679   0.7260
HME      LCL        0.2977   0.7001   0.0064
         Estimate   0.0751   0.1623   0.4021 *
         UCL        0.2196   0.1594   0.7244
MME      LCL        0.2534   0.0893   0.2494
         Estimate   0.0183   0.0209   0.0329
         UCL        0.2108   0.1308   0.3916
LMSE     LCL        0.4282   0.3817   0.4724
         Estimate   0.0901   0.0093   0.0764
         UCL        0.5907   0.6711   0.5469
LTSE     LCL        0.3576   0.2258   0.2670
         Estimate   0.0100   0.0141   0.0077
         UCL        0.2831   0.5694   0.5269
SE       LCL        0.1541   0.2276   0.3194
         Estimate   0.0712   0.0005   0.0802
         UCL        0.3936   0.1350   0.1223
JSE      LCL        0.2190   0.6939   0.0057
         Estimate   0.0224   0.1955   0.3012
         UCL        0.2675   0.2750   0.6923
RSE      LCL        0.2530   0.6822   0.0000
         Estimate   0.0646   0.1111   0.3135 *
         UCL        0.1614   0.1412   0.7138
Note: LCL (lower confidence limit), UCL (upper confidence limit); (*) indicates significance at the α = 0.05 level.
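The Stein-type estimators compared above (SE, JSE, and the shrinkage component of RSE) rest on pulling the OLS coefficient vector toward zero by a data-driven factor. The classical James–Stein form of that idea can be sketched as follows; this is an illustrative simplification, not the exact estimators evaluated in the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 75, 3
X = rng.normal(size=(n, p))
beta_true = np.array([0.3, -0.2, 0.4])
y = X @ beta_true + rng.normal(size=n)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_ols
sigma2 = resid @ resid / (n - p)                   # residual variance estimate

# Classical James-Stein shrinkage applied to the OLS vector: the amount
# of shrinkage grows with the noise level sigma2 and falls as the signal
# beta' X'X beta grows. Requires p >= 3; the max(0, .) gives the
# positive-part variant, which never reverses the sign of the estimate.
quad = beta_ols @ (X.T @ X) @ beta_ols
shrink = max(0.0, 1.0 - (p - 2) * sigma2 / quad)
beta_js = shrink * beta_ols
```

Because 0 ≤ shrink ≤ 1, the shrunken vector is never longer than the OLS vector; the RSE studied in the paper combines this shrinkage with a robust initial fit in place of OLS.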
Share and Cite

MDPI and ACS Style

Dlembula, L.; Marange, C.S.; Kondlo, L.O. Performance Evaluation of the Robust Stein Estimator in the Presence of Multicollinearity and Outliers. Stats 2026, 9, 21. https://doi.org/10.3390/stats9010021
