3.1. Monte Carlo Simulation Study
The data generation process that follows was adapted from [23,24]. We generate n samples from the following model using RStudio:

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, 2, \ldots, n, \tag{30}$$

where the error terms are generated as $\varepsilon_i \sim N(0, \sigma^2)$, and the explanatory variables are generated using the following equation:

$$x_{ij} = (1 - \rho^2)^{1/2} z_{ij} + \rho z_{i,p+1}, \quad i = 1, \ldots, n, \; j = 1, \ldots, p, \tag{31}$$

where $z_{ij}$ are independent standard normal random numbers that are held fixed for a given sample of size n, and $\rho$ is the degree of multicollinearity between predictors. In this study, we consider three values of $\rho$ = 0.10, 0.50, and 0.98. The study by [14] mainly examined the RSE under high correlation; we extended the analysis to low and moderate levels to better understand its performance under varying degrees of multicollinearity.
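For concreteness, the following is a minimal R sketch of this data generation scheme; the values of n, p, and rho are illustrative choices rather than the paper's full factorial design.

```r
set.seed(123)
n   <- 100   # sample size
p   <- 4     # number of predictors (illustrative)
rho <- 0.98  # degree of multicollinearity

# Equation (31): x_ij = sqrt(1 - rho^2) * z_ij + rho * z_{i,p+1}
Z <- matrix(rnorm(n * (p + 1)), nrow = n)
X <- sqrt(1 - rho^2) * Z[, 1:p] + rho * Z[, p + 1]

# Coefficients scaled so that beta' beta = 1 (the restriction used below)
beta <- rep(1 / sqrt(p), p)

# Equation (30): response with standard normal errors
eps <- rnorm(n)
y   <- as.vector(X %*% beta + eps)
```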
The true values of the regression parameters are chosen subject to $\boldsymbol{\beta}^{\top}\boldsymbol{\beta} = 1$, which is a common restriction in simulation studies (e.g., [12,14,25]). Additionally, the performance of the RSE is also evaluated for heavy-tailed error distributions, such as t-distributions and the Cauchy distribution.
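Continuing the sketch above, the heavy-tailed error settings can be drawn with base R generators, for example:

```r
eps_t7     <- rt(n, df = 7)  # Student's t with seven degrees of freedom
eps_t2     <- rt(n, df = 2)  # Student's t with two degrees of freedom
eps_cauchy <- rcauchy(n)     # standard Cauchy
```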
Five different sample sizes are considered ($n$ = 25, 50, 100, 150, and 200), with the number of predictors held fixed, and different proportions of outliers are evaluated ($\gamma$ = 0.00, 0.10, 0.25, and 0.50). In each simulation setting, we perform a fixed number of MC replications, chosen as a compromise between achieving a low MC error and keeping the computation time reasonable.
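A minimal sketch of the replication loop is shown below, continuing the data generation sketch above; the replication count and the use of OLS via lm() are illustrative stand-ins, not the paper's full set of estimators.

```r
nrep <- 2000                          # illustrative replication count
est  <- matrix(NA_real_, nrep, p)
for (r in seq_len(nrep)) {
  eps <- rnorm(n)                     # or rt(n, 2), rcauchy(n), etc.
  y   <- as.vector(X %*% beta + eps)  # X is held fixed across replications
  est[r, ] <- coef(lm(y ~ X - 1))     # OLS as a stand-in estimator
}
bias <- colMeans(est) - beta                   # per-coefficient MC bias
rmse <- sqrt(colMeans(sweep(est, 2, beta)^2))  # per-coefficient MC RMSE
```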
To introduce outliers in the data, we consider two types: y-direction outliers and leverage points. For outliers in the y direction, we first generate clean data according to Equations (30) and (31), then randomly select $\lfloor \gamma n \rfloor$ observations and replace their response values by adding a large constant multiple of $\sigma_{\varepsilon}$, the standard deviation of the error term. For leverage points, we randomly select $\lfloor \gamma n \rfloor$ observations and shift their predictor values by a large constant multiple of $\sigma_{x_j}$, the standard deviation of the j-th predictor.
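The following sketch illustrates the two contamination mechanisms, again continuing the code above; the shift constant k = 10 is an assumed illustrative magnitude, since the exact constant is not reproduced here.

```r
gamma <- 0.10              # proportion of contaminated observations
m     <- floor(gamma * n)  # number of observations to contaminate
k     <- 10                # assumed magnitude of the shift

# y-direction outliers: shift the responses of m random observations
idx_y        <- sample(n, m)
y_out        <- y
y_out[idx_y] <- y[idx_y] + k * sd(eps)

# Leverage points: shift the predictor values of m random observations
idx_x <- sample(n, m)
X_out <- X
for (j in seq_len(p)) {
  X_out[idx_x, j] <- X[idx_x, j] + k * sd(X[, j])
}
```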
A summary of the simulation scenarios is provided in
Table 1. In addition, we evaluate the performance of the proposed estimators in the following ten cases:
Case I: standard normal errors with multicollinearity levels of 0.10, 0.50, and 0.98 and 0.00% outliers.
Case II: Student’s t-distribution errors with seven degrees of freedom, multicollinearity levels of 0.10, 0.50, and 0.98, and 0.00% outliers.
Case III: Student’s t-distribution errors with two degrees of freedom, multicollinearity levels of 0.10, 0.50, and 0.98, and 0.00% outliers.
Case IV: Cauchy-distributed errors with multicollinearity levels of 0.10, 0.50, and 0.98 and 0.00% outliers.
Case V: standard normal errors with 10%, 25%, and 50% outliers and no multicollinearity.
Case VI: Student’s t-distribution errors with two degrees of freedom, 10%, 25%, and 50% outliers, and no multicollinearity.
Case VII: 10%, 25%, and 50% outliers in the y-direction with multicollinearity levels of 0.10, 0.50, and 0.98.
Case VIII: 10%, 25%, and 50% leverage points with multicollinearity levels of 0.10, 0.50, and 0.98.
Case IX: Student’s t-distribution errors with two degrees of freedom, multicollinearity levels of 0.10, 0.50, and 0.98, and outliers in the y-direction at 10%, 25%, and 50%.
Case X: Student’s t-distribution errors with two degrees of freedom, multicollinearity levels of 0.10, 0.50, and 0.98, and leverage points at 10%, 25%, and 50%.
Table 1. Summary of simulation parameters considered.
| Sample Size (n) | Multicollinearity (ρ) | Outlier Proportion (γ) |
|---|---|---|
| 25 | 0.10 | 0.00 |
| 50 | 0.50 | 0.10 |
| 100 | 0.98 | 0.25 |
| 150 | | 0.50 |
| 200 | | |
3.2. Simulation Study Findings
Case I (
Table S1 and
Figure 1): JSE and RSE produce the smallest RMSE values across all sample sizes, consistent with their theoretical optimality properties. MME, HME, OLSE, and LTSE also achieve RMSE values comparable to those of JSE and RSE, but only when ρ = 0.50 and 0.98. In contrast, LMSE and SE display relatively higher RMSEs, reflecting their lower efficiency in the presence of multicollinearity. JSE and RSE achieve exactly zero bias for low-to-moderate multicollinearity (ρ = 0.10 and 0.50) across the sample sizes considered, and LMSE shows the highest bias in nearly all scenarios. All estimators exhibit monotonic behavior, with RMSE values consistently decreasing as the sample size increases.
Case II (
Table S2 and
Figure 2): The RSE consistently achieves the lowest RMSE under all the conditions in this case. The JSE consistently achieves low bias across different sample sizes and multicollinearity levels. The LMSE has the overall worst performance in all simulation scenarios.
Case III (
Table S3 and
Figure 3): In this case, the performance of the OLSE, JSE, and LMSE is substantially worse than that of the robust estimators. The RSE, MME, and HME exhibit nearly identical RMSE values, but the RSE outperforms all estimators in this case, as shown in Figure 3. HME and MME consistently achieve low bias across almost all sample sizes and multicollinearity levels, and JSE has the highest bias at ρ = 0.50 and 0.98.
Case IV (
Table S4 and
Figure 4): Under Cauchy-distributed errors, OLSE and JSE demonstrate erratic RMSE behavior, with values increasing due to the heavy-tailed nature of the distribution. In contrast, the robust estimators maintain lower RMSEs and a monotonic decrease with increasing sample size, highlighting their stability. Furthermore, as the correlation coefficient (ρ) increases, the performance of OLSE and JSE worsens further. The RSE is the best estimator under the Cauchy distribution, as shown in Figure 4. HME and MME maintain the lowest bias in all scenarios, even at ρ = 0.98 and n = 200. OLSE and JSE have the highest bias, with values exceeding one.
Case V (
Tables S5 and S6, and
Figure 5 and
Figure 6): The MME performs best for y-direction outliers, and the LTSE performs best for leverage points (x-direction outliers). The RSE yields a higher RMSE because it lacks robustness to both y-direction outliers and leverage points. LTSE and SE maintain a small bias and low RMSE in the presence of leverage points, whereas RSE, JSE, and OLSE break down due to their high sensitivity, as shown in Figure 5 and Figure 6.
Case VI (
Tables S7 and S8 and
Figure 7 and
Figure 8):
Table S7 and
Figure 7 show that MME, SE, HME, and LTSE consistently achieve the lowest RMSE and almost zero bias, making them the best performers, while RSE, OLSE, and JSE perform the worst due to their high RMSE values.
Table S8 and
Figure 8 indicate that LTSE, MME, and SE are the most robust and accurate, while RSE and JSE again show the poorest performance.
Case VII (
Tables S9–S11, and
Figure 9,
Figure 10 and
Figure 11): When the data contain multicollinearity and outliers in the response direction, the performance of OLSE, JSE, and LMSE worsens significantly, particularly as the percentage of outliers and the degree of multicollinearity increase, as shown by
Figure 9,
Figure 10 and
Figure 11. LMSE, OLSE, and JSE completely fail in Figure 11, where 50% of the observations are contaminated. In contrast, the RSE, SE, HME, and MME demonstrate superior performance relative to OLSE, JSE, and LMSE under the same conditions. In addition, LTSE and SE exhibit even better robustness, maintaining substantially lower RMSE values than OLSE, JSE, and LMSE, particularly under low contamination (10%, shown in Figure 9). With 25% outliers (Figure 10), we observe patterns similar to those in Figure 9. The RSE performs very well in the presence of outliers and multicollinearity, followed by MME, HME, and LTSE.
Case VIII (
Tables S12–S14, and
Figure 12,
Figure 13 and
Figure 14): RSE outperforms all estimators in all scenarios when the data contain multicollinearity and leverage points. The performance of LMSE, SE, LTSE, and OLSE worsens significantly as the percentage of outliers and multicollinearity increases, as shown by
Figure 12,
Figure 13 and
Figure 14. The bias is smallest for SE and MME at the higher leverage-point proportions (25% and 50%).
Case IX (
Tables S15–S17 and
Figure 15,
Figure 16 and
Figure 17): The RSE consistently performs better than average in all simulation scenarios. When 10% outliers and multicollinearity are present, HME, MME, LTSE, and SE produce similar results, as shown in Figure 15. In contrast, JSE and OLSE always have the poorest performance in these scenarios. LMSE also shows suboptimal accuracy, particularly beyond 10% contamination, where it generates the highest bias values. Overall, HME and MME are the most accurate estimators, with consistently lower bias than the other methods.
Case X (
Tables S18–S20, and
Figure 18,
Figure 19 and
Figure 20): RMSE values are shown in
Figure 18,
Figure 19 and
Figure 20 under different multicollinearity levels and leverage-point proportions. The RMSE increases with both higher multicollinearity and larger proportions of leverage points. RSE consistently provides the lowest RMSE, reflecting good stability, while OLSE and JSE perform the worst. In terms of bias, HME and MME show the lowest bias in all scenarios, and JSE shows the highest. These findings further underscore the robustness and precision of the RSE and illustrate the relative bias performance of the other estimators under challenging data conditions.
In summary, JSE and OLSE only work well when there are no outliers, since they are very sensitive to outliers. The estimators’ bias and RMSE values increased with the degree of multicollinearity (ρ) and the outlier percentage (γ), and decreased with increasing sample size (n) when the other factors were fixed. When only the multicollinearity problem existed in the model (as in Figure 1 and Figure 2, i.e., when there were no outliers, γ = 0), OLSE, JSE, HME, MME, and RSE were better than LMSE, LTSE, and SE. However, in Figure 4, OLSE and JSE performed the worst even though there were no outliers, owing to the Cauchy-distributed errors. When both problems existed (as in
Figure 9,
Figure 10,
Figure 11,
Figure 12,
Figure 13,
Figure 14,
Figure 15,
Figure 16,
Figure 17,
Figure 18,
Figure 19 and
Figure 20, i.e., when there were outliers, γ > 0), RSE, HME, and MME were better than LMSE, LTSE, SE, JSE, and OLSE for all values of ρ, γ, and n. Finally, the RSE achieved the best performance among all the estimators considered when both outliers and multicollinearity were present.
3.3. Empirical Applications
In the previous section, we conducted an MC simulation study to compare the performance of the estimators. However, simulations are usually performed under somewhat idealized conditions. In contrast to the MC simulation, this section considers three datasets as illustrative examples of handling outliers and multicollinearity in linear regression: the milk dataset, the real estate valuation dataset, and the Hawkins–Bradu–Kass dataset. Because real datasets frequently contain variable and uncontrollable levels of outliers, we adopted a systematic trimming procedure following [26] to modify the outlier contamination level in each dataset. This procedure produces multiple versions of each dataset with particular outlier percentages that match our simulation study design, allowing a direct comparison of theoretical, simulated, and real-data results. A summary of the datasets considered in this study is provided in Table A1.
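Since the exact trimming rule of [26] is not reproduced here, the sketch below shows one plausible implementation under our reading: flagged outliers are removed until the retained contamination fraction matches a target level. The function name and arguments are hypothetical.

```r
# Trim flagged outliers so that roughly `target` of the remaining rows
# are contaminated; `outlier_idx` holds the row numbers of known outliers.
trim_to_level <- function(data, outlier_idx, target) {
  n_clean  <- nrow(data) - length(outlier_idx)
  n_keep   <- round(target * n_clean / (1 - target))  # outliers to retain
  n_drop   <- max(0, length(outlier_idx) - n_keep)
  drop_idx <- outlier_idx[seq_len(n_drop)]
  if (n_drop == 0) data else data[-drop_idx, , drop = FALSE]
}

# Example: a version of a dataset with roughly 10% outliers
# data_10 <- trim_to_level(milk, outlier_idx, target = 0.10)
```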
Study 1: Milk Dataset
The milk dataset provided by [27] describes the composition of milk using eight variables: density, fat content, protein content, casein content, cheese dry substance measured in the factory, cheese dry substance measured in the laboratory, milk dry substance, and cheese produced. According to [28], there are 17 outliers in this dataset (observations 1–3, 12–17, 27, 41, 44, 47, 70, 74, 75, and 77), giving an outlier percentage of 20%. The milk dataset thus has severe multicollinearity (CN = 164.0314) and 20% y-direction outliers, which makes it roughly equivalent to simulation Case VII (high multicollinearity with γ between 0.10 and 0.25).
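As a check on the reported diagnostic, the condition number can be computed in base R along the following lines; `milk` and the predictor columns are placeholders for the dataset of [27], and the exact CN value depends on the scaling convention used.

```r
# Condition number of the standardized predictor matrix (columns 1-7 assumed
# to be the predictors; adjust to the actual layout of the milk data)
X_milk <- scale(as.matrix(milk[, 1:7]))
CN     <- kappa(X_milk, exact = TRUE)  # ratio of extreme singular values
```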