4.1. Monte Carlo Simulations
We performed Monte Carlo simulations to assess the robustness of the semi-parametric Gini regression to outlying observations in $\mathbf{X}$ in the presence of heteroskedasticity. In this section, we assume that the shape of the heteroskedasticity is known. The steps of the Monte Carlo simulation were as follows.
Step 1: Generate three independent vectors ${\mathbf{x}}_{j}\sim \mathcal{N}(0,1)$ of size $n=100$, $j=2,3,4$, with ${\mathbf{x}}_{1}=(1,\dots ,1)$ and $\mathbf{X}=({\mathbf{x}}_{1},{\mathbf{x}}_{2},{\mathbf{x}}_{3},{\mathbf{x}}_{4})$.
Step 2: Sort the rows of $\mathbf{X}$ in ascending order of ${\mathbf{x}}_{2}$ (the second column of $\mathbf{X}$). Multiply the last row of $\mathbf{X}$ by $\theta =100,\dots ,10,000$ (with increments of 100) in order to inflate the largest value of ${\mathbf{x}}_{2}$: ${\mathbf{X}}_{n}^{o}:=\theta {\mathbf{X}}_{n}$.
Step 3: For each outlier value $\theta $, perform $B=1000$ simulations, i.e., generate $B\times 3$ independent vectors ${\mathbf{x}}_{j}\sim \mathcal{N}(0,1)$ for all $j=2,3,4$, with one outlier set to ${\mathbf{X}}_{n}^{o}:=\theta {\mathbf{X}}_{n}$.
Step 4: Generate heteroskedasticity as follows: ${\mathbf{\Omega}}^{G}=\mathrm{diag}\left(\sqrt{100i}\right)$, $i=1,\dots ,n$, and fix a vector ${\mathit{\beta}}_{ag}=(10,3,-10,58)$ to compute $\mathbf{y}=\mathbf{X}{\mathit{\beta}}_{ag}+{\tilde{\epsilon}}_{g}$ with ${\epsilon}_{g,i}\sim \mathcal{N}(0,1)$ independent of ${\mathbf{x}}_{j}$ and with ${\tilde{\epsilon}}_{g}={\left[{\mathbf{\Omega}}^{G}\right]}^{-1}{\epsilon}_{g}$.
Step 5: Regress $\mathbf{y}$ on ${\mathbf{X}}^{o}$ with the semi-parametric Gini regression and with GLS in order to estimate ${\mathit{\beta}}_{ag}$. Compute the standard deviation of ${\widehat{\mathit{\beta}}}_{ag}$ by jackknife in the first case, and the standard deviation of the GLS estimators in the second case (for each value of $\theta $). Measure the mean squared error of the coefficient estimates ${\widehat{\mathit{\beta}}}_{ag}$ over B replications (for each $\theta $) for both techniques: Gini and GLS.
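The data-generating steps above can be sketched in Python. As an assumption, the semi-parametric Gini estimator is written in its rank-instrument form ${\widehat{\mathit{\beta}}}=({\mathbf{R}}_{\mathbf{x}}^{\top}\mathbf{X})^{-1}{\mathbf{R}}_{\mathbf{x}}^{\top}\mathbf{y}$ (following Yitzhaki and Schechtman), and the jackknife loop over observations is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def gini_beta(X, y):
    # Semi-parametric Gini estimator, assumed here in its rank-instrument
    # form (R'X)^{-1} R'y, where R holds the column-wise ranks of X
    R = (np.argsort(np.argsort(X, axis=0), axis=0) + 1).astype(float)
    R[:, 0] = 1.0  # treat the constant column as its own instrument
    return np.linalg.solve(R.T @ X, R.T @ y)

def simulate(theta, n=100, beta=(10.0, 3.0, -10.0, 58.0)):
    # Steps 1-2: X = (1, x2, x3, x4), rows sorted by x2, last row inflated
    X = np.column_stack([np.ones(n)] + [rng.standard_normal(n) for _ in range(3)])
    X = X[np.argsort(X[:, 1])]
    Xo = X.copy()
    Xo[-1] *= theta
    # Step 4: Omega^G = diag(sqrt(100 i)) and eps_tilde = [Omega^G]^{-1} eps
    omega = np.sqrt(100.0 * np.arange(1, n + 1))
    y = Xo @ np.asarray(beta) + rng.standard_normal(n) / omega
    # Step 5: GLS (reweight the model by Omega) vs. Gini estimates
    b_gls = np.linalg.lstsq(omega[:, None] * Xo, omega * y, rcond=None)[0]
    return b_gls, gini_beta(Xo, y)
```

Looping `simulate` over $\theta $ and over $B$ replications reproduces the structure of Steps 3–5; the jackknife standard deviations would then be obtained by re-estimating on each leave-one-out subsample.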
The jackknife standard deviations of the estimators
${\widehat{\beta}}_{ag,k}$ for
$k=1,\dots ,4$ are reported in
Figure 1 for each value of the outlier
$\theta $. As depicted in
Figure 1, jackknife standard deviations of the Aitken-Gini estimator are lower than those of the GLS estimator, which are drastically affected by the introduction of one outlier in a sample of
$n=100$ observations. This corresponds to a contamination of the sample of only
$1\%$. Since the outlying observation corresponds to the last row of $\mathbf{X}$ (${\mathbf{X}}_{n}^{o}=\theta {\mathbf{X}}_{n}$), which contains the largest value of ${\mathbf{x}}_{2}$, the vector ${\mathbf{x}}_{2}$ is the most contaminated regressor. Therefore, as shown in Figure 1 (top right), large variations of the standard deviation of ${\widehat{\beta}}_{ag,2}$ are recorded, especially for the GLS estimator (red curve) compared with the Gini one (black curve).
In addition, it is possible to compute the mean squared errors of the GLS estimator and the Aitken-Gini one. The contamination process is the same as before.
As depicted in Figure 2, the Aitken-Gini estimator outperforms the usual GLS estimator for the constant of the model (Figure 2, top left) and for the second regressor ${\mathbf{x}}_{2}$ (Figure 2, top right). This is because the outlier is generated by multiplying ${\mathbf{X}}_{n}$ by $\theta $, which inflates the largest value of ${\mathbf{x}}_{2}$ (Step 2). Consequently, the Aitken-Gini estimator ${\widehat{\beta}}_{ag,2}$ yields a robust estimation compared with GLS. For the other coefficients, the MSEs of the generalized least squares estimator are smaller.
4.2. Tests
The aim of this subsection is to show that the usual White test for heteroskedasticity has low power whenever outlying observations arise in the sample, even when the contamination rate is around 1%. An alternative test based on the co-Gini operator is proposed, which achieves good power compared with the standard White test.
Although GLS estimators may be affected by outliers, it is worth mentioning that Aitken-Gini estimators and GLS estimators are based on two different notions of heteroskedasticity: the Aitken-Gini estimator captures a different type of variability (the co-Gini, based on ranks) than GLS (based on the variance). In the following, the focus is on White’s test, since it is commonly employed in the literature.
White’s model and its Gini counterpart are given by,

${\epsilon}_{g,i}^{2}={\alpha}_{1}+\sum_{k=2}^{K}{\alpha}_{k}{x}_{ik}+\sum_{k=2}^{K}{\gamma}_{k}{x}_{ik}^{2}+{u}_{i}, \qquad \text{(White-OLS)}$

and,

${\epsilon}_{g,i}\,{\mathbf{r}}_{{\epsilon}_{g,i}}={\alpha}_{1}+\sum_{k=2}^{K}{\alpha}_{k}{x}_{ik}+\sum_{k=2}^{K}{\gamma}_{k}\,{x}_{ik}{\mathbf{r}}_{ik}+{u}_{i}, \qquad \text{(White-Gini)}$
where
${\mathbf{r}}_{{\epsilon}_{g,i}}$ is the rank of
${\epsilon}_{g,i}$ and
${\mathbf{r}}_{ik}$ is the rank of
${x}_{ik}$ (within the vector
${\mathbf{x}}_{k}$). The intuition of the White-Gini test is to identify the variables ${\mathbf{x}}_{k}$ that depend on the rank of the individuals. This is the case, for example, when we regress incomes on age. The same intuition underlies White’s test performed with OLS. However, the squared residuals and the squared covariates may be inflated because of the outliers. In this respect, it is possible to use Eq. (White-Gini) to test for heteroskedasticity. It is noteworthy that this equation cannot be estimated by the semi-parametric Gini regression, since the rank vector of ${\mathbf{x}}_{k}$ and the rank vector of ${\mathbf{x}}_{k}\otimes {\mathbf{r}}_{k}$ are collinear (⊗ being the Hadamard product). Consequently, both equations are estimated by OLS. The advantage of dealing with Eq. (White-Gini) is that it captures the shape of the heteroskedasticity in the presence of outliers. The standard White-OLS equation aims at capturing quadratic shapes in the covariates; however, it fails to achieve this goal in the presence of outliers because the outliers are squared. In the White-Gini equation, the product ${x}_{ik}{\mathbf{r}}_{ik}$ allows the quadratic shape to be detected, while the intensity of the outliers is attenuated by the rank vector ${\mathbf{r}}_{k}$.
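The two auxiliary regressions can be sketched as follows. As an assumption consistent with the description in the text, the White-OLS regression takes the squared residuals on the covariates and their squares, while the White-Gini regression takes ${\epsilon}_{g,i}{\mathbf{r}}_{{\epsilon}_{g,i}}$ on the covariates and the products ${x}_{ik}{\mathbf{r}}_{ik}$; both are estimated by OLS:

```python
import numpy as np

def ranks(v):
    # Rank transform: 1..n by ascending order (ties ignored in this sketch)
    return np.argsort(np.argsort(v)) + 1.0

def aux_r2(y_aux, Z):
    # R^2 of the OLS auxiliary regression of y_aux on [1, Z]
    Z1 = np.column_stack([np.ones(len(y_aux)), Z])
    resid = y_aux - Z1 @ np.linalg.lstsq(Z1, y_aux, rcond=None)[0]
    tss = np.sum((y_aux - y_aux.mean()) ** 2)
    return 1.0 - np.sum(resid ** 2) / tss

def white_tests(X, eps):
    # X: covariates WITHOUT the intercept column; eps: residuals
    r_x = np.column_stack([ranks(c) for c in X.T])
    r2_ols = aux_r2(eps ** 2, np.column_stack([X, X ** 2]))            # White-OLS
    r2_gini = aux_r2(eps * ranks(eps), np.column_stack([X, X * r_x]))  # White-Gini
    return r2_ols, r2_gini
```

The significance of each ${R}^{2}$ would then be assessed with the usual Fisher test on the auxiliary regression, as in the tables below.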
In the following tables, we report the mean ${R}^{2}$ of each model over the $B$ Monte Carlo experiments, with $B=500,1000,5000$. The power of the Fisher test for the significance of the ${R}^{2}$ of each model is given in parentheses. The Monte Carlo simulations with contamination are based on the same simulation process described in Algorithm 1.
In
Table 1, one observation is contaminated for a sample size
$n=30$, that is, 3.33% of the sample. The same generating process was used as in the Monte Carlo simulations of the previous section. The outlying observation is obtained by multiplying the largest value of ${\mathbf{x}}_{2}$ by $\theta $. Without an outlier, the White-Gini model provides an ${R}^{2}$ of 0.17 (on average over B) with a very low test power of around 2%, whereas the White-OLS model yields an ${R}^{2}$ of 0.26 with a power of 14%. However, when $\theta $ is set to 100, the White-Gini model performs quite well, with an ${R}^{2}$ of 0.45 and a power of 51%, whereas the power decreases slightly in the White-OLS model. In the White-Gini model, thanks to the rank vector of $\mathbf{x}$ as a regressor, the regression curve stays distant from the outlying observation and remains close to the other points. The variability of the model explained by the regression therefore increases (and so does ${R}^{2}$). On the contrary, in the standard White-OLS model, the regression curve moves toward the outlying observation, so the variability of the residuals increases and ${R}^{2}$ decreases.
In
Table 2, the sample size is
$n=100$ so that the contamination represents 1% of the sample. Without outlier, the White-Gini model provides a very low test power around 7%, compared with 64–70% for White-OLS model. The test power increases to reach 70% in the first case against 59% in the second case.
Finally, in
Table 3,
${R}^{2}$ and the test power remain broadly equivalent in both models. As mentioned in the literature, White-OLS provides excellent power in large samples. When the outliers are diluted in the sample, for instance when the contamination concerns only 0.1% of the sample, both tests produce the same power because the sample size is large and the number of outliers is very low.
As shown in
Table 1,
Table 2 and
Table 3, when outlying observations affect the sample, the power of the White-Gini test is higher than that of the usual White test.
4.3. The Feasible Generalized Gini Regression
After testing for heteroskedasticity in the presence of outlying observations, a new procedure is proposed to estimate $\mathbf{\Omega}$. The so-called feasible generalized least squares (FGLS), introduced by Zellner (1962), was adapted to the Gini regression, yielding the feasible generalized Gini regression (FGGR). Aitken’s theorem no longer applies if
$\mathbf{\Omega}$ is unknown and must be estimated. The feasible generalized least squares estimator is not the best linear unbiased estimator, nevertheless
Kakwani (
1967) proved that it is still unbiased under general conditions, and
Schmidt (
1976) discussed the fact that most of the properties of generalized least squares estimation remain intact in large samples, when plugging in an estimator of
$\mathbf{\Omega}$. The form of the heteroskedasticity is unknown, but it can be approximated with a flexible model as,

${\mathit{\epsilon}}_{g}^{2}=\left(\mathbf{X}\mathit{\alpha}\right)\otimes \mathbf{u},$

in a “Breusch–Pagan” version (in the sense that we consider a linear form of heteroskedasticity), such that $\mathbb{E}\left(\mathbf{u}\right)=\mathbf{1}$ and $\mathbf{u}$ is independent of $\mathbf{X}$. Instead of using the least squares estimator, the semi-parametric Gini estimator in Equation (2) is employed to deal with contaminated data in $\mathbf{X}$:

$\widehat{\mathit{\alpha}}={\left({\mathbf{R}}_{\mathbf{x}}^{\top}\mathbf{X}\right)}^{-1}{\mathbf{R}}_{\mathbf{x}}^{\top}{\widehat{\mathit{\epsilon}}}_{g}^{2}, \qquad \widehat{\mathbf{\Omega}}=\mathrm{diag}\left(\mathbf{X}\widehat{\mathit{\alpha}}\right).$
From
$\widehat{\mathbf{\Omega}}$, we deduce an estimation of
$\mathbf{P}$ denoted
$\widehat{\mathbf{P}}={\widehat{\mathbf{\Omega}}}^{-\frac{1}{2}}$. Thus, we get that
$\widehat{\mathbf{X}}:={\widehat{\mathbf{\Omega}}}^{-\frac{1}{2}}\mathbf{X}$. Let
${\mathbf{R}}_{\widehat{\mathbf{x}}}$ be the rank matrix of
$\widehat{\mathbf{X}}$; hence, the FGGR estimator is given by:

${\widehat{\mathit{\beta}}}_{fggr}={\left({\mathbf{R}}_{\widehat{\mathbf{x}}}^{\top}\widehat{\mathbf{P}}\mathbf{X}\right)}^{-1}{\mathbf{R}}_{\widehat{\mathbf{x}}}^{\top}\widehat{\mathbf{P}}\mathbf{y}.$
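With a diagonal $\widehat{\mathbf{\Omega}}$ (the case produced by the simulations in this section), the transformation $\widehat{\mathbf{P}}={\widehat{\mathbf{\Omega}}}^{-1/2}$ and the rank matrix ${\mathbf{R}}_{\widehat{\mathbf{x}}}$ can be computed as in the following sketch (variable names are illustrative):

```python
import numpy as np

def whiten(X, y, omega_diag):
    # P_hat = Omega_hat^{-1/2}: for a diagonal Omega_hat this is just
    # 1/sqrt of its diagonal, applied row-wise to X and y
    p = 1.0 / np.sqrt(omega_diag)
    X_hat, y_hat = p[:, None] * X, p * y
    # Rank matrix of the transformed covariates, column by column
    R_xhat = np.argsort(np.argsort(X_hat, axis=0), axis=0) + 1
    return X_hat, y_hat, R_xhat
```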
On the contrary, the usual FGLS estimator is given by,

${\widehat{\mathit{\beta}}}_{fgls}={\left({\mathbf{X}}^{\top}{\widehat{\mathbf{\Omega}}}^{-1}\mathbf{X}\right)}^{-1}{\mathbf{X}}^{\top}{\widehat{\mathbf{\Omega}}}^{-1}\mathbf{y},$
with
${\mathbf{b}}_{g}$ estimated by generalized least squares in the first step [Equation (9)]. However, in Proposition 2, it is shown that the Aitken-Gini estimator is based on the
$\mathbf{P}$-rank idempotent hypothesis. This assumption states that the rank vector of the residuals must remain invariant after the transformation of the model with respect to matrix
$\mathbf{P}$, so that an Aitken estimator is obtained. However, whenever one outlier occurs in the sample (e.g., the $i$th row of $\mathbf{X}$ is such that ${\mathbf{X}}_{i}^{o}\to \pm \infty $), then ${\epsilon}_{g,i}\to \pm \infty $, so that the $\mathbf{P}$-rank idempotent hypothesis is not necessarily satisfied for the outlying observation. Consequently, the FGGR estimator is biased. Indeed, the semi-parametric Gini estimator in Equation (10) is based on the rank matrix
${\mathbf{R}}_{\widehat{\mathbf{x}}}$ of
$\mathbf{P}\mathbf{X}$ which contains errors since
$\mathbf{P}$ is computed on the basis of contaminated data. Replacing
${\mathbf{R}}_{\widehat{\mathbf{x}}}$ by ${\mathbf{R}}_{\mathbf{x}}$, the rank matrix of $\mathbf{X}$, avoids such contamination. It is worth mentioning that replacing the rank matrix of the covariates by the rank matrix of some data correlated with the covariates corresponds to the Gini instrumental variable estimator introduced by
Yitzhaki and Schechtman (
2004). In this respect, the FGGR estimator becomes a feasible generalized Gini regression by instrumental variable (FGGR-IV):

${\widehat{\mathit{\beta}}}_{fggr\text{-}iv}={\left({\mathbf{R}}_{\mathbf{x}}^{\top}\widehat{\mathbf{P}}\mathbf{X}\right)}^{-1}{\mathbf{R}}_{\mathbf{x}}^{\top}\widehat{\mathbf{P}}\mathbf{y}.$
Because ${\mathbf{R}}_{\mathbf{x}}$ is the rank matrix of the initial data, the ranks being insensitive to the magnitude of the outlier, the residuals of the transformed model issued from the FGGR-IV estimator are more likely to respect the $\mathbf{P}$-rank idempotent hypothesis than those of the FGGR estimator based on $\mathbf{P}\mathbf{X}$, in which both matrices are contaminated.
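A minimal sketch of the FGGR-IV estimator, assuming the rank-instrument form ${\left({\mathbf{R}}_{\mathbf{x}}^{\top}\widehat{\mathbf{P}}\mathbf{X}\right)}^{-1}{\mathbf{R}}_{\mathbf{x}}^{\top}\widehat{\mathbf{P}}\mathbf{y}$ described above and a diagonal $\widehat{\mathbf{\Omega}}$:

```python
import numpy as np

def fggr_iv(X, y, omega_diag):
    # Rank matrix of the ORIGINAL covariates: insensitive to the magnitude
    # of an outlier, unlike the ranks of the transformed data P_hat @ X
    R = (np.argsort(np.argsort(X, axis=0), axis=0) + 1).astype(float)
    R[:, 0] = 1.0                      # constant column instruments itself
    p = 1.0 / np.sqrt(omega_diag)      # diagonal of P_hat = Omega_hat^{-1/2}
    X_hat, y_hat = p[:, None] * X, p * y
    # IV form: beta = (R' P X)^{-1} R' P y
    return np.linalg.solve(R.T @ X_hat, R.T @ y_hat)
```

With $\widehat{\mathbf{\Omega}}=\mathbf{I}$ and a noiseless linear model, the estimator recovers the true coefficients exactly, which makes the sketch easy to sanity-check.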
We performed some Monte Carlo simulations to compare the mean squared errors of the FGGR-IV and FGLS estimators.
Step 1: Generate three independent vectors ${\mathbf{x}}_{j}\sim \mathcal{N}(0,1)$ of size $n=100$ for all $j=2,3,4$, with ${\mathbf{x}}_{1}=(1,\dots ,1)$.
Step 2: As in the previous simulation, sort the rows of $\mathbf{X}=({\mathbf{x}}_{1},\dots ,{\mathbf{x}}_{4})$ in ascending order of ${\mathbf{x}}_{2}$, but now multiply the last row of $\mathbf{X}$ by $\theta =1,\dots ,100$ (with increments of 1, to avoid matrix invertibility problems) to inflate the largest value of ${\mathbf{x}}_{2}$.
Step 3: For each outlier $\theta $, perform $B=1000$ simulations, i.e., generate B × 3 independent normal distributions ${\mathbf{x}}_{j}\sim \mathcal{N}(0,1)$ for all $j=2,3,4$ with one outlier valued to be ${\mathbf{X}}_{n}^{o}:=\theta {\mathbf{X}}_{n}$.
Step 4: Fix a vector ${\mathit{\beta}}_{ag}=(10,3,-10,58)$ to compute $\mathbf{y}={\beta}_{ag,1}{\mathbf{x}}_{1}+{\beta}_{ag,2}{\mathbf{x}}_{2}+{\beta}_{ag,3}{\mathbf{x}}_{3}\ast {\mathbf{x}}_{3}+{\beta}_{ag,4}{\mathbf{x}}_{4}+{\tilde{\epsilon}}_{g}$ with ${\epsilon}_{g,i}\sim \mathcal{N}(0,1)$, where $\ast $ denotes the element-wise product; that is, suppose that the heteroskedasticity comes from ${\mathbf{x}}_{3}$.
Step 5: Compute the coefficients estimated based on FGGR-IV and FGLS with their MSEs over $B=1000$ replications.
Figure 3 depicts an interesting correction of heteroskedasticity performed by the FGGR-IV estimator when outliers contaminate only 1% of the sample. The FGGR-IV estimator provides MSEs close to 0 except for the constant and for
${\widehat{\beta}}_{ag,3}$. Because the model includes the term ${\beta}_{ag,3}{\mathbf{x}}_{3}\ast {\mathbf{x}}_{3}$, the outlier is even more inflated in this case (Figure 3, bottom left).
To show that the FGGR-IV estimator is relevant with other forms of heteroskedasticity, Step 4 was replaced by Step 4’, in which an ARCH(1) is modeled (see
Figure 4):
Step 4’: Fix a vector ${\mathit{\beta}}_{ag}=(10,3,-10,58)$ to compute $\mathbf{y}={\beta}_{ag,1}{\mathbf{x}}_{1}+{\beta}_{ag,2}{\mathbf{x}}_{2}+{\beta}_{ag,3}{\mathbf{x}}_{3}+{\beta}_{ag,4}{\mathbf{x}}_{4}+{\tilde{\epsilon}}_{g}\ast \sqrt{\mathbf{1}+2{\mathit{\epsilon}}^{2}}$ with ${\epsilon}_{g,i}\sim \mathcal{N}(0,1)$ and ${\epsilon}_{i}\sim \mathcal{N}(0,1)$.
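The error term of Step 4’ can be generated as below (a sketch; since the two noises are independent standard normals, the scaled error has variance $\mathbb{E}[{\epsilon}_{g,i}^{2}(1+2{\epsilon}_{i}^{2})]=3$):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
eps_g = rng.standard_normal(n)  # epsilon_{g,i} ~ N(0,1)
e = rng.standard_normal(n)      # second, independent N(0,1) noise
# Step 4' scaling: eps_tilde = eps_g * sqrt(1 + 2 e^2), an ARCH(1)-type form
eps_tilde = eps_g * np.sqrt(1.0 + 2.0 * e ** 2)
```

Adding `eps_tilde` to $\mathbf{X}{\mathit{\beta}}_{ag}$ then reproduces the conditional-variance pattern of Step 4’.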
The results depicted in
Figure 4 are even clearer: the MSEs of all FGGR-IV estimators tend toward 0; consequently, the bias of those estimators also tends toward 0.