Article

Detection of Influential Observations in Spatial Regression Model Based on Outliers and Bad Leverage Classification

by Ali Mohammed Baba 1,2, Habshah Midi 1,3,*, Mohd Bakri Adam 1,3 and Nur Haizum Abd Rahman 1,3
1 Institute for Mathematical Research, Universiti Putra Malaysia, Serdang 43400, Selangor, Malaysia
2 Department of Mathematical Sciences, Abubakar Tafawa Balewa University, Bauchi 0248, Nigeria
3 Department of Mathematics and Statistics, Faculty of Science, Universiti Putra Malaysia, Serdang 43400, Selangor, Malaysia
* Author to whom correspondence should be addressed.
Symmetry 2021, 13(11), 2030; https://doi.org/10.3390/sym13112030
Submission received: 6 August 2021 / Revised: 7 October 2021 / Accepted: 22 October 2021 / Published: 27 October 2021

Abstract

Influential observations (IOs), which are outliers in the x direction, y direction or both, remain a problem in classical regression model fitting. Spatial regression models have a peculiar kind of outlier because such outliers are local in nature. Spatial regression models are also not free from the effect of influential observations. Researchers have adapted some classical regression techniques to spatial models and obtained satisfactory results. However, masking and/or swamping remains a stumbling block for such methods. In this article, we obtain a measure of spatial Studentized prediction residuals that incorporates spatial information on the dependent variable and the residuals. We propose a robust spatial diagnostic plot to classify observations into regular observations, vertical outliers, good and bad leverage points using a classification based on spatial Studentized prediction residuals and spatial diagnostic potentials, which we refer to as $ISR_s$-$Pos_i$ and $ESR_s$-$Pos_i$. Observations that fall into the vertical outliers and bad leverage points categories are referred to as IOs. Representations of some classical regression diagnostic measures in general spatial models are presented. The commonly used diagnostic measure in spatial diagnostics, the Cook’s distance, is compared to some robust methods, $H_i^2$ (using robust and non-robust measures), and our proposed $ISR_s$-$Pos_i$ and $ESR_s$-$Pos_i$ plots. Results of our simulation study and applications to real data showed that the Cook’s distance, the non-robust $H_{si1}^2$ and the robust $H_{si2}^2$ were not very successful in detecting IOs. The $H_{si1}^2$ suffered from the masking effect, and the robust $H_{si2}^2$ suffered from swamping in general spatial models. Interestingly, the results showed that the proposed $ESR_s$-$Pos_i$ plot, followed by the $ISR_s$-$Pos_i$ plot, was very successful in classifying observations into the correct groups, hence correctly detecting the real IOs.

1. Introduction

Belsley et al. [1] defined an influential observation (IO) as one which, either individually or together with several other observations, has a demonstrably large impact on the calculated values of various estimates. An influential observation could be an outlier in the X-space (leverage points) or outlier in the Y-space (vertical outlier). Leverage points can be classified into good (GLPs) and bad leverage points (BLPs). Unlike BLPs, GLPs follow the pattern of the majority of the data; hence, they are not considered as IOs as they have little or no influence on the calculated values of numerous estimates [2,3]. In this connection, Rashid et al. [2] stated that IOs could be vertical outliers (VO) or BLPs. Thus, it is very crucial to identify IOs as they are responsible for misleading conclusions about the fitted regression models and various other estimates. Once the IOs are identified, there is a need to study their impact on the model and subsequent analyses. There is a handful of studies on the diagnostic of IOs in linear regression; some examples are [1,3,4,5,6,7,8,9,10,11,12]. Other articles in the literature deal with regressions with correlated residuals, e.g., [13,14,15,16,17]. However, only a few articles deal with the detection of IOs in spatial regression models; some examples include [18,19,20,21,22]. Some robust estimation methods in spatial regression are [23,24,25]. Christensen et al. [18] and Haining [19] adapted one of the diagnostic measures in [3] to detect influential observations in spatial error autoregression model. They achieved this by defining correlated errors through the spatial weight matrix and coefficient of spatial autocorrelation in the error term. They also presented the spatial Studentized prediction residuals and the spatial leverage terms that contain error terms in spatial information.
The presence of high or low attribute value in the neighbourhood of a spatial location may result in the inability to detect the true spatial outlier, or the false identification of a good observation as an outlier [26]. Hadi [27] has also noted that spatial outlier detection methods inherit the problem of masking and swamping. Masking occurs when outlying observations are incorrectly declared as inliers. Swamping on the other hand, occurs when clean observations are incorrectly classified as outliers [28]. Aggarwal [29] observed that spatial outlier breaks the spatial autocorrelation and continuity of spatial locations. Spatial autocorrelation is a systematic pattern in attribute values that are recorded in several locations on a map. Attribute values in one location that are associated with values at neighbouring locations indicate the presence of autocorrelation. Positive autocorrelation indicates similar values that are clustered together. Negative autocorrelation indicates low attribute values in the neighbourhood of high attribute values and vice-versa [30].
Robust estimation methods mostly focus on estimates that are not influenced much by the effects of outliers. Anselin [23] has extended the bootstrap estimation to mixed regressive spatial autoregressive models, where pseudo error terms are generated by sampling from the vector of error terms. The spatial structure of the data is maintained through the generation of error terms. Politis et al. [31] and Heagerty and Lumley [32] also adopted the bootstrap method on blocks of contiguous locations to generate replicates of the estimates of the asymptotic standard error of statistics. Cerioli and Riani [24] argued that a robust estimator of the spatial autocorrelation parameters did not exist based on all datasets. They proposed a forward search algorithm based on blocks of contiguous spatial locations (BFS). The blocks in the BFS algorithm are drawn in such a way that they retain the spatial dependence structure of the original data. Yildirim [25] proposed a robust estimation method of the log-likelihood with influence function in the spatial error model. This is achieved iteratively using a scoring algorithm to estimate the parameters. Though these methods succeeded in obtaining robust estimates, identifying spatial outliers, which is vital in spatial statistics [26], was not achieved. Popular graphical techniques to detect spatial outliers are the scatterplot [33], the Moran’s scatterplot [30] and the pockets of nonstationarity [34]. Besides being prone to the problem of masking and swamping [26], they focus mainly on spatial outliers in the Y-space only.
Diagnostic works on models that have spatial autocorrelation in both the dependent variable and the residual terms are missing in the literature. The problem of masking and swamping is prevalent in spatial regression model diagnostics, which may be due to the presence of vertical outliers as well as leverage points, as in the case of linear regression [27]. This motivates us to represent the spatial Studentized prediction residuals and spatial leverage values in the general spatial model, and to adapt and extend some robust diagnostic measures of detection of outliers and IOs in linear regression, such as Hadi’s potential ($po_{ii}$), Cook’s distance ($CD_i$) [3], the overall potential influence ($H_i^2$) [10], and the external (ESRs) and internal (ISRs) Studentized residuals [1,9,10], to spatial regression models in order to minimize the problem of masking and swamping in spatial models.
In this article, we propose a robust spatial diagnostic plot and adapt some diagnostic measures in the linear regression model. Representations of the diagnostic measures in the spatial regression model are obtained, with a special emphasis on the general spatial regression model (GSM) that performs autoregression on both the dependent variable and error terms.
The main objective of this study is to propose a robust spatial diagnostic plot. Other objectives are: (1) to represent the leverage values of the hat matrix of the linear regression in the GSM model; (2) to extend the ISR of the linear regression to the GSM model; (3) to extend the ESR of the linear regression to the GSM model; (4) to extend the Cook’s distance and the overall potential influence of the linear regression to the GSM model; (5) to develop a method of identification of the influential observations of the GSM model by proposing a procedure of classification of the observations into regular observations, vertical outliers, good and bad leverage points, and hence IOs; (6) to evaluate the performances of the proposed methods by using simulation studies; (7) to apply the proposed methods on gasoline price data for retail sites in Sheffield, UK, COVID-19 data in Georgia, USA, and the life expectancy data from USA counties. The significance of this study is that it can contribute to the development of a method of identification of influential observations in spatial regression models.

2. Identification of Influential Observations in a Linear Regression Model

Consider a k-variable regression model:
$$ y = X\beta + \varepsilon \qquad (1) $$
where $y$ is an $n \times 1$ vector of observations of the dependent variable, $X$ is an $n \times k$ matrix of independent variables, $\beta$ is a $k \times 1$ vector of unknown regression parameters, and $\varepsilon$ is an $n \times 1$ vector of random errors that are independently and identically normally distributed, that is, $\varepsilon \sim NID(0, \sigma^2)$.
The ordinary least squares (OLS) estimates in Equation (1) are given by:
$$ \hat{\beta} = (X^T X)^{-1} X^T y $$
The vector of predicted values can be written as:
$$ \hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y = P y, $$
where $P = X (X^T X)^{-1} X^T$ is the hat/leverage matrix. The diagonal elements of the leverage matrix are called the hat values, denoted as $p_{ii}$, and given by:
$$ p_{ii} = x_i^T (X^T X)^{-1} x_i, \qquad i = 1, 2, \dots, n. $$
The hat matrix is often used as a diagnostic to identify leverage points. Leverage is the amount of influence exerted by the observed response $y_i$ on the predicted value $\hat{y}_i$. As a result, a large leverage value indicates that the observed response has a large effect on the predicted response.
Hoaglin and Welsch [3] suggested that an observation whose $p_{ii}$ exceeds $2k/n$, twice the average value $k/n$ of the $p_{ii}$, be considered a leverage point, while Velleman and Welsch suggested $3k/n$ as a cut-off point for leverage points. Huber [7] suggested that the ranges $p_{ii} \le 0.2$, $0.2 < p_{ii} \le 0.5$ and $p_{ii} > 0.5$ are safe, risky and to be avoided, respectively, for leverage values.
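The leverage cut-offs above can be illustrated with a short NumPy sketch. The data, the seed and the planted extreme row are hypothetical and serve only to show how the hat values and the $2k/n$ and $3k/n$ rules are computed.

```python
import numpy as np

def hat_values(X):
    """Diagonal of P = X (X^T X)^{-1} X^T, computed without forming P."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

rng = np.random.default_rng(0)          # arbitrary seed, illustration only
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
X[0, 1:] = [8.0, -8.0]                  # plant one extreme row in X-space

p = hat_values(X)
flag_2k = np.where(p > 2 * k / n)[0]    # rule of thumb 2k/n
flag_3k = np.where(p > 3 * k / n)[0]    # stricter rule 3k/n
print("sum of hat values:", p.sum())    # equals k up to rounding
print("flagged by 2k/n:", flag_2k)
print("flagged by 3k/n:", flag_3k)
```

The hat values sum to $k$, so $k/n$ is their average, which is why both rules are stated as multiples of $k/n$.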
Unfortunately, the hat matrix suffers from the masking effect. As a result, $p_{ii}$ often fails to detect high leverage points. Hadi [10] suggested a single-case-deleted measure called potentials or Hadi’s potentials. The diagonal element of a potential, denoted as $po_{ii}$, is given by:
$$ po_{ii} = x_i^T \left( X_{(i)}^T X_{(i)} \right)^{-1} x_i, \qquad i = 1, 2, \dots, n, $$
where $X_{(i)}$ is the matrix $X$ with the $i$th row deleted. We can rewrite $po_{ii}$ as a function of $p_{ii}$ as:
$$ po_{ii} = \frac{p_{ii}}{1 - p_{ii}}, \qquad i = 1, 2, \dots, n. $$
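The equivalence between the case-deleted form of Hadi's potential and the closed form $p_{ii}/(1 - p_{ii})$ can be checked numerically. The sketch below uses hypothetical random data and is only a sanity check, not part of the original study.

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed, illustration only
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

# Closed form: po_ii = p_ii / (1 - p_ii)
P = X @ np.linalg.inv(X.T @ X) @ X.T
p = np.diag(P)
po_closed = p / (1.0 - p)

# Case-deleted form: po_ii = x_i^T (X_(i)^T X_(i))^{-1} x_i
po_deleted = np.empty(n)
for i in range(n):
    X_i = np.delete(X, i, axis=0)       # X with the i-th row removed
    po_deleted[i] = X[i] @ np.linalg.inv(X_i.T @ X_i) @ X[i]

print("max abs difference:", np.max(np.abs(po_closed - po_deleted)))
```

The closed form avoids refitting $n$ times, which is why it is the one normally used in practice.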
The vector of the residuals, $r$, can be written as:
$$ r = y - \hat{y} = (I - P) y = Q y. $$
The Studentized residuals (internally Studentized residuals), denoted as ISRs, and the R-Student residuals (externally Studentized residuals), denoted as ESRs, are widely used measures for the identification of outliers (see [7]). The ISR, denoted as $t_i$, is defined as:
$$ t_i = \frac{r_i}{\hat{\sigma} \sqrt{1 - p_{ii}}} $$
where $\hat{\sigma}$ is the standard deviation of the residuals, and $r_i$ and $p_{ii}$ are the $i$th residual and the $i$th diagonal element of the matrix $P$, respectively (see [9] for details). Meanwhile, Chatterjee and Hadi [9] defined the ESR, denoted as $t_i^*$, as:
$$ t_i^* = \frac{r_i}{\hat{\sigma}_{(i)} \sqrt{1 - p_{ii}}} $$
where $\hat{\sigma}_{(i)}$ is the square root of the residual mean square excluding the $i$th case. The ESR follows a Student’s t-distribution with $(n - k - 1)$ degrees of freedom [9].
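As a rough illustration, both kinds of Studentized residuals can be computed from a single fit. The sketch assumes that $\hat{\sigma}^2$ is the residual mean square with $n - k$ degrees of freedom and obtains the case-deleted variance from the standard identity $\hat{\sigma}_{(i)}^2 = \hat{\sigma}^2 (n - k - t_i^2)/(n - k - 1)$; the function name is hypothetical.

```python
import numpy as np

def studentized_residuals(X, y):
    """Return (ISR, ESR) for an OLS fit of y on X (X includes the constant)."""
    n, k = X.shape
    P = X @ np.linalg.inv(X.T @ X) @ X.T
    p = np.diag(P)
    r = y - P @ y                                  # residual vector (I - P) y
    s2 = r @ r / (n - k)                           # residual mean square (assumed n - k)
    isr = r / np.sqrt(s2 * (1.0 - p))
    # Case-deleted residual mean square without refitting n times.
    s2_del = s2 * (n - k - isr**2) / (n - k - 1)
    esr = r / np.sqrt(s2_del * (1.0 - p))
    return isr, esr
```

A typical call would be `isr, esr = studentized_residuals(X, y)`, after which `abs(esr)` can be compared with a quantile of the t-distribution with $n - k - 1$ degrees of freedom.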
One of the most widely used measures of influence in linear regression is the Cook’s distance [3]. It measures the influence of an observation on the regression coefficient estimates or the predicted values. The Cook’s distance is given by
$$ \widehat{CD}_i (X^T X, k\hat{\sigma}^2) = \frac{(\hat{\beta}_{(i)} - \hat{\beta})^T (X^T X) (\hat{\beta}_{(i)} - \hat{\beta})}{k \hat{\sigma}^2}, $$
where $\hat{\beta}$ is the vector of estimates of $\beta$ using the full data, $\hat{\beta}_{(i)}$ is the vector of estimates of $\beta$ with the $i$th observation $(y_i, x_i)$ omitted, $k$ is the number of parameters and $\hat{\sigma}^2$ is the estimate of the variance. The $i$th observation is declared an influential observation (IO) if $\widehat{CD}_i > F_{0.5;\,k,\,n-k}$. Meloun [12] noted that any observation for which $CD_i > 1$ is considered an influential observation. The Cook’s distance can also be written as [8,9]:
$$ \widehat{CD}_i (X^T X, k\hat{\sigma}^2) = \frac{(\hat{y} - \hat{y}_{(i)})^T (\hat{y} - \hat{y}_{(i)})}{k \hat{\sigma}^2} $$
Computing $\widehat{CD}_i (X^T X, k\hat{\sigma}^2)$ does not require refitting the regression with each observation deleted in turn; instead, Equation (3) can further be simplified as [3,8,9]:
$$ \widehat{CD}_i (X^T X, k\hat{\sigma}^2) = \frac{1}{k}\, t_i^2\, \frac{p_{ii}}{q_{ii}} $$
where $t_i = e_i / (\hat{\sigma} \sqrt{q_{ii}})$ is the ISR, with $e_i$ the $i$th residual, and $p_{ii}/q_{ii}$ (with $q_{ii} = 1 - p_{ii}$) is referred to as the potential [7,8,9]. Interestingly, the Cook’s distance is a measure of influence based on the potential ($p_{ii}/q_{ii}$) and the Studentized residual ($t_i$).
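The agreement between the deletion form of the Cook's distance and its simplified form can be verified on a small example. The data below are simulated with an arbitrary seed and hypothetical coefficients; $\hat{\sigma}^2$ is again assumed to be the residual mean square.

```python
import numpy as np

rng = np.random.default_rng(3)          # arbitrary seed, illustration only
n, k = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

XtX = X.T @ X
beta = np.linalg.solve(XtX, X.T @ y)
p = np.einsum("ij,jk,ik->i", X, np.linalg.inv(XtX), X)   # hat values
e = y - X @ beta
s2 = e @ e / (n - k)
t = e / np.sqrt(s2 * (1.0 - p))                           # ISR

# Simplified form: (1/k) t_i^2 p_ii / q_ii
cd_closed = t**2 * p / (k * (1.0 - p))

# Deletion form: refit without observation i
cd_deleted = np.empty(n)
for i in range(n):
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    d = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi) - beta
    cd_deleted[i] = d @ XtX @ d / (k * s2)

print("max abs difference:", np.max(np.abs(cd_closed - cd_deleted)))
```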
Hadi [10] demonstrated the drawback of measures that are multiplicative functions, such as the Cook’s distance [3], the Andrews–Pregibon statistic [5], the Cook and Weisberg statistic [8], etc. (see [10] for details), and proposed a measure that is an additive function. Though both the multiplicative and additive measures are functions of residuals and leverage values, the former diminishes towards zero when either or both of the two functions are small, while the latter is large if either or both of the two functions are large. He proposed a measure of overall potential influence, denoted as $H_i^2$, and defined as follows:
$$ H_i^2 = \frac{k}{m}\, \frac{e_I^T (I_m - P_I)^{-1} e_I}{e^T e - e_I^T e_I} + \frac{1}{m}\, tr\!\left( P_I (I_m - P_I)^{-1} \right), \qquad (7) $$
with $k$ the number of parameters in the model, $I = \{ i_1, i_2, \dots, i_m \}$ the set of indices of observations of length $m$, and $P_I$ the leverage indexed by $I$.
For $m = 1$ and $I = i$, Equation (7) simplifies to:
$$ H_i^2 = \frac{k}{(1 - p_{ii})}\, \frac{e_i^2}{(e^T e - e_i^2)} + \frac{p_{ii}}{1 - p_{ii}} = \frac{k}{(1 - p_{ii})}\, \frac{d_i^2}{(1 - d_i^2)} + \frac{p_{ii}}{1 - p_{ii}}, $$
where $\sum p_{ii} = k$, $\sum d_i^2 = 1$, and $d_i^2 = e_i^2 / e^T e$ is the square of the $i$th normalized residual.
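A minimal sketch of the single-case measure is given below. It simply evaluates the $m = 1$ expression for an OLS fit; the function name is hypothetical.

```python
import numpy as np

def hadi_influence(X, y):
    """Single-case overall potential influence H_i^2 for an OLS fit."""
    n, k = X.shape
    P = X @ np.linalg.inv(X.T @ X) @ X.T
    p = np.diag(P)
    e = y - P @ y
    d2 = e**2 / (e @ e)                      # squared normalized residuals
    residual_part = k / (1.0 - p) * d2 / (1.0 - d2)
    potential_part = p / (1.0 - p)           # Hadi's potential
    return residual_part + potential_part
```

Because the two parts are added rather than multiplied, an observation with a large potential keeps a large $H_i^2$ even when its residual is close to zero.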
Hadi [10] suggested a cut-off point for Hadi’s potential ($po_{ii}$) and $H_i^2$, denoted as $l_1$, which is given as follows:
$$ l_1 = mean(po_i) + c \sqrt{Var(po_i)} $$
$$ = \frac{k}{n} + c \sqrt{\frac{n s - k^2}{n (n - 1)}}, $$
where $c = 2$ or $3$, $s = \sum p_{ii}^2$ and $po_i$ is the vector of Hadi’s potentials. Since both the mean and the standard deviation are easily affected by outliers, he suggested replacing them in this confidence-bound type of cut-off point by robust estimators, namely the median and the normalized median absolute deviation, respectively. The resulting cut-off point is denoted as $l_2$:
$$ l_2 = Med(po_i) + c\, MAD(po_i). $$
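The difference between the two cut-offs is easy to see on a toy vector. In the sketch below the normalizing constant 1.4826 for the MAD is an assumption (the usual consistency factor under normality), and the input values are made up.

```python
import numpy as np

def cutoffs(v, c=3.0):
    """Return (l1, l2): mean + c*sd and median + c*normalized MAD."""
    v = np.asarray(v, dtype=float)
    l1 = v.mean() + c * v.std(ddof=1)
    med = np.median(v)
    mad = 1.4826 * np.median(np.abs(v - med))   # assumed normalizing constant
    l2 = med + c * mad
    return l1, l2

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])   # one gross outlier
l1, l2 = cutoffs(values)
print("l1 =", l1, " l2 =", l2)
```

Here the outlying value 100 inflates the mean and standard deviation so much that it stays below $l_1$ (masking), while it clearly exceeds the robust cut-off $l_2$.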

3. Influential Observations in Spatial Regression Models

The general spatial autoregressive model (GSM) [21,35,36] includes the spatial lag term and a spatially correlated error structure. The data generating process (DGP) of the general spatial model is given by:
$$ y = \rho W_1 y + X\beta + \xi, \qquad \xi = \lambda W_2 \xi + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n), \qquad (9) $$
where $y$ is an $n \times 1$ vector of the dependent variable, $X$ is an $n \times k$ matrix of explanatory variables, $W_1$ and $W_2$ are $n \times n$ spatial weight matrices, $I_n$ is an $n \times n$ identity matrix, $\xi$ is the spatially correlated error term, and $\varepsilon$ is the random residual term. The parameter $\rho$ is the coefficient of the spatially lagged dependent variable $W_1 y$, and $\lambda$ is the coefficient of the spatially correlated errors.
The general spatial autoregressive model in Equation (9) can be rewritten as:
$$ A y = X\beta + B^{-1} \varepsilon, $$
where $A = I_n - \rho W_1$, $\xi = B^{-1} \varepsilon$, $B = I_n - \lambda W_2$, $\xi \sim N(0, \sigma^2 V)$, and $V = (B^T B)^{-1}$. Estimation of the parameters is achieved using the maximum likelihood estimation method.
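The rewritten form suggests a direct way to draw one realization of the model: generate $\varepsilon$, solve for $\xi = B^{-1}\varepsilon$, then solve for $y = A^{-1}(X\beta + \xi)$. The sketch below does this with a hypothetical ring-shaped neighbourhood structure; the function name, the weights and the parameter values are illustrative assumptions.

```python
import numpy as np

def simulate_gsm(X, beta, W1, W2, rho, lam, sigma=1.0, rng=None):
    """Draw y from A y = X beta + B^{-1} eps, with eps ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    A = np.eye(n) - rho * W1
    B = np.eye(n) - lam * W2
    eps = rng.normal(scale=sigma, size=n)
    xi = np.linalg.solve(B, eps)               # xi = B^{-1} eps
    y = np.linalg.solve(A, X @ beta + xi)      # y = A^{-1} (X beta + xi)
    return y, A, B, eps

# Hypothetical example: 10 locations on a ring, each with two neighbours.
n = 10
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5   # row-standardized
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
beta = np.array([0.0, 1.0])
y, A, B, eps = simulate_gsm(X, beta, W, W, rho=0.4, lam=0.5,
                            rng=np.random.default_rng(0))
```

Solving the two linear systems avoids forming $A^{-1}$ and $B^{-1}$ explicitly, which matters for larger grids.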
The log-likelihood function ($L$) is given by:
$$ L = -\frac{n}{2} \ln(\sigma^2) + \ln|A| + \ln|B| - \frac{1}{2\sigma^2} (A y - X\beta)^T B^T B (A y - X\beta) $$
Let $\hat{\rho}$, $\hat{\lambda}$, $\hat{\sigma}^2$, $\hat{\beta}$ be the maximum likelihood estimates (MLEs) of $\rho$, $\lambda$, $\sigma^2$, $\beta$, respectively. The MLEs are obtained iteratively using numerical methods in the maximum likelihood estimation. Anselin [35] and LeSage [36] discussed the maximum likelihood estimation procedure of the parameters.
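For readers who want to experiment with the numerical maximization, the log-likelihood can be evaluated as in the sketch below. It omits the additive constant, as the expression above does, and the function name is hypothetical; it is an evaluation routine only, not the estimation procedure of [35,36].

```python
import numpy as np

def gsm_loglik(y, X, W1, W2, rho, lam, beta, sigma2):
    """Log-likelihood of the general spatial model, up to an additive constant."""
    n = len(y)
    A = np.eye(n) - rho * W1
    B = np.eye(n) - lam * W2
    u = B @ (A @ y - X @ beta)                 # whitened residuals
    _, logdet_A = np.linalg.slogdet(A)         # ln|det A|, numerically stable
    _, logdet_B = np.linalg.slogdet(B)
    return (-0.5 * n * np.log(sigma2) + logdet_A + logdet_B
            - (u @ u) / (2.0 * sigma2))
```

Such a function can be passed, with its sign reversed, to a general-purpose optimizer over $(\rho, \lambda)$ after concentrating out $\beta$ and $\sigma^2$, though that step is not shown here.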

3.1. Leverage in Spatial Regression Model

Denote the vector of parameters in Equation (11) as $\beta_{ay}$. The estimate of $\beta_{ay}$, $\hat{\beta}_{ay}$, is given by:
$$ \hat{\beta}_{ay} = (X^T \hat{V}^{-1} X)^{-1} X^T \hat{V}^{-1} \hat{A} y. $$
The model (11) is viewed as fitting a general linear model, $A y$ on $X$, that has correlated residual terms. Set $z = A y$, where $var(A y) = \sigma^2 V$. Therefore,
$$ \hat{z} = X \hat{\beta}_{ay} = X (X^T \hat{V}^{-1} X)^{-1} X^T \hat{V}^{-1} z = P_{ay} z. $$
The hat matrix, in this case, is given by $P_{ay}$,
$$ P_{ay} = X (X^T \hat{V}^{-1} X)^{-1} X^T \hat{V}^{-1}. $$
Let $Q_{ay} = I_n - P_{ay}$. Although $P_{ay}$ and $Q_{ay}$ satisfy the idempotence property and their diagonal elements sum to $k$ and $n - k$, respectively, they are not symmetric. As a result, they are not positive semi-definite, and the diagonal elements of $P_{ay}$ can take negative values; that is, the diagonal values of $P_{ay}$ and $Q_{ay}$ need not lie between 0 and 1 (inclusive).
Martins [15] proposed a measure of leverage that is orthogonal, in models with correlated residuals, whose diagonal values lie in the interval [0, 1], which we denote by $P_{ay}^*$, such that:
$$ P_{ay}^* = \hat{V}^{-1} P_{ay} = \hat{V}^{-1} X (X^T \hat{V}^{-1} X)^{-1} X^T \hat{V}^{-1} $$
Let $Q_{ay}^* = I_n - P_{ay}^*$. $P_{ay}^*$ and $Q_{ay}^*$ are idempotent, symmetric and orthogonal with respect to $V$, i.e.,
  • $P_{ay}^* \hat{V} P_{ay}^* = P_{ay}^*$
  • $Q_{ay}^* \hat{V} Q_{ay}^* = Q_{ay}^*$
  • $P_{ay}^* \hat{V} Q_{ay}^* = 0$
Note that the diagonal elements of $P_{ay}^*$ and $Q_{ay}^*$, the leverages, do not sum to $k$ and $n - k$, respectively.
Again, consider a new set of dependent variables obtained by pre-multiplying Equation (11) by the matrix $B$ ($B$ as defined in Equation (10)) so that $z^* = B A y$. Schall and Dunne [14] defined the matrix $V^{-1}$ through a singular value decomposition such that $V^{-1} = B \Delta B^T$, where $B$ is of the same order as $V^{-1}$ and $\Delta$ is a diagonal matrix. The transformation $z^*$ is the principal component score. Puterman [13] and Haining [19] defined it as canonical variates such that $B X (X^T V^{-1} X) X^T B^T$ is positive semi-definite. By setting $z^* = B A y$, Equation (9) is rewritten in a generalized least squares (GLS) form as:
$$ z^* = X^* \beta_s + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n), \qquad (13) $$
where $X^* = B X$.
The estimate $\hat{\beta}_s$ of $\beta_s$ is now given by:
$$ \hat{\beta}_s = (X^{*T} X^*)^{-1} X^{*T} z^* $$
Thus,
$$ \hat{z}^* = X^* (X^{*T} X^*)^{-1} X^{*T} z^* $$
where $\hat{A} = I_n - \hat{\rho} W_1$ and $\hat{B} = I_n - \hat{\lambda} W_2$. Note that $\hat{y}$ is deduced from Equation (13) as follows:
$$ \hat{B} \hat{A} \hat{y} = \hat{B} X (X^T \hat{B}^T \hat{B} X)^{-1} X^T \hat{B}^T \hat{B} \hat{A} y $$
$$ \Rightarrow \; \hat{y} = \hat{A}^{-1} X (X^T \hat{V}^{-1} X)^{-1} X^T \hat{V}^{-1} \hat{A} y $$
Denote the projection matrix in the transformed spatial regression model as $P_s$, then:
$$ P_s = X^* (X^{*T} X^*)^{-1} X^{*T} $$
$$ = \hat{B} X (X^T \hat{V}^{-1} X)^{-1} X^T \hat{B}^T, \qquad \hat{V} = (\hat{B}^T \hat{B})^{-1} $$
The properties of the leverage in the transformed spatial model in Equation (13) are:
Property I: idempotent and symmetric.
Property Ia: idempotence
$$ P_s^2 = \hat{B} X (X^T \hat{V}^{-1} X)^{-1} X^T \hat{B}^T \hat{B} X (X^T \hat{V}^{-1} X)^{-1} X^T \hat{B}^T $$
$$ = \hat{B} X (X^T \hat{V}^{-1} X)^{-1} X^T \hat{B}^T \qquad (\text{since } \hat{B}^T \hat{B} = \hat{V}^{-1}) $$
$$ = P_s $$
Hence, $P_s$ is idempotent.
Property Ib: symmetry
$$ P_s^T = \left( \hat{B} X (X^T \hat{V}^{-1} X)^{-1} X^T \hat{B}^T \right)^T $$
$$ = \hat{B} X (X^T \hat{V}^{-1} X)^{-1} X^T \hat{B}^T $$
$$ = P_s $$
The matrix $P_s$ is symmetric. Therefore, $P_s$ in the transformation $z^* = \hat{B} \hat{A} y$ is both idempotent and symmetric.
Property II: the sum of the diagonal terms of the projection matrix is $k$, the number of parameters including the constant term.
$$ trace(P_s) = trace\left( \hat{B} X (X^T \hat{V}^{-1} X)^{-1} X^T \hat{B}^T \right) $$
$$ = trace\left( \hat{B}^T \hat{B} X (X^T \hat{V}^{-1} X)^{-1} X^T \right) \quad (\text{cyclic permutation of trace of matrix}) $$
$$ = trace\left( X^T \hat{V}^{-1} X (X^T \hat{V}^{-1} X)^{-1} \right) \quad (\text{cyclic permutation of trace of matrix}) $$
$$ = trace(I_k) $$
$$ = k, $$
where $I_k$ is a $k \times k$ identity matrix.
Therefore, $\sum_{i=1}^{n} p_{sii} = k$, where $p_{sii}$ is the $i$th diagonal element of the leverage matrix $P_s$.
Property III: bounds on the spatial leverage.
The bound on the leverage of the classical regression is $0 \le p_{ii} \le 1$ due to the fact that the hat matrix $P$ satisfies all the orthogonal properties, including symmetry. As such, it is positive semi-definite. However, the spatial leverage $P_{ay}$ is not symmetric and hence is not positive semi-definite, since a positive semi-definite matrix is symmetric [37,38,39]. The transformation in Equation (11) yields the projection $P_s$ that satisfies the symmetry condition.
From the idempotent property of $P_s$,
$$ P_s = P_s^2. $$
Equating diagonal terms of LHS and RHS, we have:
$$ p_{sii} = p_{sii}^2 + \sum_{j \ne i} p_{sij}\, p_{sji}, \qquad \sum_{j \ne i} p_{sij}\, p_{sji} \ge 0, \qquad (14) $$
where $p_{sij}$ are the off-diagonal terms. Equation (14) implies that $p_{sii} \ge 0$. Therefore,
$$ p_{sii} \ge p_{sii}^2 $$
$$ \Rightarrow \; p_{sii} \le 1. $$
Note that $P_s$ and $Q_s = I_n - P_s$ are orthogonal:
  • $P_s P_s = P_s$
  • $Q_s Q_s = Q_s$
  • $P_s Q_s = 0$
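Properties I to III can be checked numerically on a small example. The sketch below uses a hypothetical line-shaped neighbourhood structure, an arbitrary $\lambda$ and random regressors, and contrasts the transformed projection $P_s$ with the untransformed hat matrix $P_{ay}$.

```python
import numpy as np

rng = np.random.default_rng(4)           # arbitrary seed, illustration only
n, k, lam = 12, 2, 0.5                   # illustrative sizes and lambda

# Row-standardized weights for locations on a line (hypothetical layout).
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i + 1):
        if 0 <= j < n:
            W[i, j] = 1.0
W /= W.sum(axis=1, keepdims=True)

X = np.column_stack([np.ones(n), rng.normal(size=n)])
B = np.eye(n) - lam * W
V_inv = B.T @ B                          # V^{-1} = B^T B
M = np.linalg.inv(X.T @ V_inv @ X)

P_ay = X @ M @ X.T @ V_inv               # untransformed hat matrix
P_s = B @ X @ M @ X.T @ B.T              # transformed projection matrix
Q_s = np.eye(n) - P_s
p_s = np.diag(P_s)

print("P_s idempotent:", np.allclose(P_s @ P_s, P_s))
print("P_s symmetric: ", np.allclose(P_s, P_s.T))
print("trace(P_s):    ", np.trace(P_s))
print("P_s Q_s = 0:   ", np.allclose(P_s @ Q_s, 0.0))
print("P_ay symmetric:", np.allclose(P_ay, P_ay.T))
```

In this setting $P_{ay}$ remains idempotent with trace $k$ but is not symmetric, whereas the diagonal of $P_s$ stays within $[0, 1]$.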
The model in Equation (9) gives rise to different special spatial regressions in accordance with different restrictions. Such special spatial regression models are the spatial autoregressive-regressive model (SAR) and the spatial error model (SEM). While the former has spatial autoregression in the response variable, the latter has spatial autoregression in the model residual; model (9) (GSM) combines both features.
The spatial autoregressive-regressive model is obtained when the coefficient of the lagged spatial autoregression in the residuals of Equation (9) is zero, i.e., $\lambda = 0$. Thus, the SAR model is given by:
$$ y = \rho W_1 y + X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n). $$
The $P_s$ corresponding to the model in Equation (13) reduces to:
$$ P_s = X (X^T X)^{-1} X^T, $$
with the transformation in Equation (11) simplifying to $z^* = A y$, since $V = (B^T B)^{-1}$ and $B = I_n$ when $\lambda = 0$. Clearly, the hat matrix in the SAR model preserves the features of the hat matrix in the classical regression model.
In the spatial error model (SEM), the coefficient of the spatial autoregression on the lagged dependent variable is zero, i.e., $\rho = 0$. This yields the model:
$$ y = X\beta + \xi, \qquad \xi = \lambda W_2 \xi + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n) $$
The transformation in Equation (11) simplifies to $z^* = B y$, and the projection matrix remains:
$$ P_s = B X (X^T V^{-1} X)^{-1} X^T B^T. $$
It can be observed that the leverage measure in the spatial regression model is dominated by the autocorrelation in the residual term.
Works on spatial regression diagnostics in the literature mainly focus on the autocorrelation in the residuals, mostly using a time series analogy [13,14,15]. Some remarkable works on the spatial regression model can be found in [18,19,21].

3.2. Influential Observations in Spatial Regression Model

The leverages $P_s$ and $Q_s$ in Equation (11) satisfy all the properties of a projection matrix, including that the diagonal terms of $P_s$ and $Q_s$ sum to $k$ and $n - k$, respectively. They also incorporate the autocorrelation in the dependent variable, $W y$. Hence, they can be used as a diagnostic measure of leverage points in a spatial regression model.
By extending the results of linear regression to spatial regression with slight modification, the Cook’s distance in the spatial regression of Equation (13), denoted as $CD_{si}$, can be formulated as follows:
$$ \widehat{CD}_{si} = \frac{(\hat{\beta}_{s(i)} - \hat{\beta}_s)^T (X^{*T} X^*) (\hat{\beta}_{s(i)} - \hat{\beta}_s)}{k \hat{\sigma}^2} $$
$$ = \frac{(\hat{\beta}_{s(i)} - \hat{\beta}_s)^T \left( (B X)^T (B X) \right) (\hat{\beta}_{s(i)} - \hat{\beta}_s)}{k \hat{\sigma}^2} $$
$$ = \frac{(\hat{\beta}_{s(i)} - \hat{\beta}_s)^T (X^T B^T B X) (\hat{\beta}_{s(i)} - \hat{\beta}_s)}{k \hat{\sigma}^2} $$
$$ = \frac{(\hat{\beta}_{s(i)} - \hat{\beta}_s)^T (X^T V^{-1} X) (\hat{\beta}_{s(i)} - \hat{\beta}_s)}{k \hat{\sigma}^2}, $$
where:
$$ \hat{\beta}_{s(i)} = \left( X_{(i)}^T \hat{V}_{(i,i)}^{-1} X_{(i)} \right)^{-1} X_{(i)}^T \hat{V}_{(i,i)}^{-1} \hat{A}_{(i,i)}\, y_{(i)}. $$
$\hat{V}_{(i,i)}$ and $\hat{A}_{(i,i)}$ denote $\hat{V}$ and $\hat{A}$ with the $i$th row and the $i$th column deleted, respectively.
The spatial Cook’s distance, $CD_{si}$, is declared large if $CD_{si} > 0.70$ [19]. In its simplified form, the Cook’s distance in spatial regression is written as
$$ \widehat{CD}_{si} (X^T V^{-1} X, k\hat{\sigma}^2) = \frac{1}{k}\, t_{si}^2\, \frac{p_{si}}{q_{si}}, $$
where $t_{si}$ is the spatial Studentized prediction residual (also called the spatial internally Studentized residual), $p_{si}$ is the spatial leverage, which is the $i$th diagonal element of $P_s$, and $q_{si} = 1 - p_{si}$. Let $r_{si} = y_i - \hat{y}_i$, then:
$$ t_{si} = \frac{b_i^T a_i\, r_{si}}{\hat{\sigma} \sqrt{q_{si}}}, $$
where $b_i$ and $a_i$ are the $i$th columns of the matrices $B$ and $A$, respectively. A spatial Studentized residual is declared large if its absolute value exceeds the cut-off point of 2 [19,40].
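Similarly, the spatial externally Studentized residual (ESRs) is defined as:
$$ t_{si}^* = \frac{r_{si}}{\hat{\sigma}_{(i)} \sqrt{1 - p_{si}}} $$
$$ = t_{si} \sqrt{\frac{n - k - 1}{n - k - t_{si}^2}}, \qquad \hat{\sigma}_{(i)} = \hat{\sigma} \sqrt{\frac{n - k - t_{si}^2}{n - k - 1}}, $$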
where $\hat{\sigma}_{(i)}$ is the square root of the residual mean square excluding the $i$th case. The ESRs follow a Student’s t-distribution with $(n - k - 1)$ degrees of freedom. Thus, the spatial Studentized prediction residuals contain the neighbourhood information of both the dependent variable and the residual of each $r_{si}$, and the leverage $P_s$ contains the residual autocorrelation effect. The spatial potential, which is analogous to the potential in [10], is defined in Equation (19) as:
$$ po_{si} = \frac{p_{si}}{q_{si}} \qquad (19) $$
where $q_{si} = 1 - p_{si}$. Let $qo_{si} = 1 - po_{si}$.
We define the spatial measure of overall potential influence as
$$ H_{si}^2 = \frac{k}{qo_{si}}\, \frac{d_i^2}{(1 - d_i^2)} + \frac{po_{si}}{qo_{si}} $$
When measuring the influence of an observation in a linear regression model by using the Cook’s distance [3], the observation in question is deleted, and the model is then refitted. In a similar way, usually a group of suspected influential observations is deleted in the linear regression and admitted into the model if it is proven clean (BACON [41,42], DGRP [11]). This is because IOs in linear regression are global in nature; however, in a spatial regression model, IOs are local. Haining [20] noted that spatial outliers are local in nature; their attribute values are outliers if they are extreme relative to the set of values in their neighbourhood on the map. IOs in spatiotemporal statistics usually carry vital information in applications. Kou et al. [26] further pointed out that detecting spatial outliers can help in locating extreme meteorological events such as tornadoes and hurricanes, identify aberrant genes or tumour cells, discover highway traffic congestion points, pinpoint military targets in satellite images, determine possible locations of oil reservoirs and detect water pollution incidents. Thus, measuring the influence of multiple spatial locations requires a contiguous set of points to reveal the unusual features related to that neighbourhood.
Although methods that detect multiple outliers in spatial regression work well (see [21]), we refer to methods that group observations as clean or suspect, irrespective of their positions (with reference to spatial data), and admit them into the model as clean observations according to some conditions.
According to Hadi [10], examining each value of an influence measure alone, such as $p_{si}$, ISRs, ESRs, $CD_{si}$ and $H_{si}^2$, might not succeed in indicating the IOs or the source of influence. Imon [43] and Mohammed [44] noted that one should consider both outliers and leverage points when identifying IOs. The easiest way to capture IOs is by using diagnostic plots. Following [43,45], we adopt their rules for the classification of observations into four categories, namely regular observations, vertical outliers, GLPs and BLPs. Once observations are classified accordingly, those observations that fall in the vertical outliers and BLPs categories are referred to as IOs. However, due to the local nature of spatial IOs, we have to make some modifications to the classification scheme. In this paper, a new diagnostic plot is proposed by plotting the ISRs (or ESRs) on the Y-axis against the spatial potential, $po_{si}$, on the X-axis. We consider the ISRs and ESRs because both measures contain spatial information. On the other hand, the potentials that are obtained from the transformed model in Equation (13) are considered in order to reflect spatial dependence. Hence, the proposed diagnostic plots are denoted as the $ISR_s$-$Pos_i$ and $ESR_s$-$Pos_i$ plots, and they are based on the following classification schemes:
(a) $ISR_s$-$Pos_i$
  i. The $i$th observation is declared RO if $|ISR_s| < 2.0$ and $po_{si} < l_2$.
  ii. The $i$th observation is GLP if $|ISR_s| < 2.0$ and $po_{si} > l_2$.
  iii. The $i$th observation is BLP if $|ISR_s| > 2.0$ and $po_{si} > l_2$.
  iv. The $i$th observation is IO (vertical outlier) if $|ISR_s| \ge 2.0$ and $po_{si} \le l_2$.
(b) $ESR_s$-$Pos_i$
  i. The $i$th observation is declared RO if $|ESR_s| < t_{n-k-1}$ and $po_{si} < l_2$.
  ii. The $i$th observation is GLP if $|ESR_s| < t_{n-k-1}$ and $po_{si} > l_2$.
  iii. The $i$th observation is IO (BLP) if $|ESR_s| > t_{n-k-1}$ and $po_{si} > l_2$.
  iv. The $i$th observation is IO (vertical outlier) if $|ESR_s| \ge t_{n-k-1}$ and $po_{si} \le l_2$.
Figure 1 and Figure 2 show the classification of the observations as RO, GLP and IOs according to $ISR_s$-$Pos_i$ and $ESR_s$-$Pos_i$, respectively.
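The two classification schemes share the same logic and differ only in the residual cut-off, so they can be sketched with one small function. The function name, the labels and the handling of values exactly on a cut-off are assumptions made for illustration; the residual cut-off would be 2.0 for the ISRs or a $t_{n-k-1}$ quantile for the ESRs, and the potential cut-off would be $l_2$.

```python
def classify(resid, potential, resid_cut, pot_cut):
    """Classify one observation as 'RO', 'GLP', 'BLP' or 'VO'.

    'VO' (vertical outlier) and 'BLP' (bad leverage point) are the IOs.
    Boundary values are treated as large residual / small potential,
    which is an assumption of this sketch.
    """
    big_resid = abs(resid) >= resid_cut
    big_pot = potential > pot_cut
    if big_resid and big_pot:
        return "BLP"
    if big_resid:
        return "VO"
    if big_pot:
        return "GLP"
    return "RO"

def is_influential(label):
    """IOs are the vertical outliers and the bad leverage points."""
    return label in ("VO", "BLP")

# Hypothetical values, for illustration only.
print(classify(0.5, 0.001, 2.0, 0.0078))   # RO
print(classify(0.5, 0.020, 2.0, 0.0078))   # GLP
print(classify(3.0, 0.020, 2.0, 0.0078))   # BLP
print(classify(-3.0, 0.001, 2.0, 0.0078))  # VO
```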

4. Results and Discussions

In this section, the performance of all the proposed methods, i.e., the Cook’s distance ($\widehat{CD}_{si}$), $H_{si}^2$ ($H_{si1}^2$ (non-robust) and $H_{si2}^2$ (robust)), $ISR_s$-$Pos_i$ and $ESR_s$-$Pos_i$, is evaluated using a simulation study, artificial data and real datasets of gasoline price data in the southwest area of Sheffield, UK, COVID-19 data in the counties of the State of Georgia, USA and the life expectancy data in counties of the USA.

Simulated Data

We simulated the spatial regression model in Equation (9) for a square spatial grid with sample size $n = 400$, $\rho = 0.4$, $\lambda = 0.5$ and $W_1 = W_2$, using row-standardized Queen’s contiguity spatial weights, with $x_0 = \mathbf{1}$, $x_1 \sim N(0, 1)$, $\beta_0 = \mathbf{0}$, $\beta_1 = \mathbf{1}$ (bold face 0 and 1 refer to column vectors of zeros and ones, respectively). The contamination is taken at two percent in each of the $X$ and $y$ directions. The contamination in the $y$ direction is taken from the Cauchy distribution because of its fat tails. Contamination in the $X$ direction is taken from the following multivariate distribution,
$$ X \sim \left( \begin{bmatrix} 0 \\ 2 \end{bmatrix},\; \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) $$
However, it is important to note that during the contamination, some of the contaminations may have attributes similar to those in their neighbourhood, as noted by Dowd [46], and spatial simulation is conditioned to a real dataset.
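For readers who wish to reproduce a comparable set-up, row-standardized Queen's contiguity weights on a square grid can be built as in the following sketch. The function name and the row-major ordering of the cells are assumptions; a 20 by 20 grid gives $n = 400$ as in the simulation.

```python
import numpy as np

def queen_weights(rows, cols):
    """Row-standardized Queen's contiguity weights for a rows x cols grid.

    Cells are numbered row by row; two cells are neighbours when they
    share an edge or a corner.
    """
    n = rows * cols
    W = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < rows and 0 <= cc < cols:
                        W[i, rr * cols + cc] = 1.0
    return W / W.sum(axis=1, keepdims=True)   # each row sums to one

W = queen_weights(20, 20)                     # n = 400
print(W.shape, np.count_nonzero(W[0]), np.count_nonzero(W[21]))
```

Corner cells have three neighbours and interior cells have eight, so the non-zero weights in a row are 1/3 and 1/8, respectively.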
Figure 3 shows the graph of average attribute values in the neighbourhood of locations against their attribute values with added contamination. It can be observed that some of the added contamination, in black dots, are in the middle of clean data points while some stand out from the bulk of the data (i.e., away from their average neighbourhood values), which clearly indicates outlyingness.
Table 1 presents the values of ISRs, ESRs and $pos_i$, where values in parentheses are their corresponding cut-off points. It shows seven locations with large Studentized residuals according to ISRs and ESRs. There are 54 observations with large potentials (>0.0078). Two of these fifty-four potentials correspond to Studentized residuals greater than the thresholds of ISRs and ESRs (locations 51 and 201).
In order to confirm the outlyingness of the locations classified as spatial IOs, the threshold of each outlier neighbourhood given by
$$\mathrm{med}_i + 3\,\mathrm{MAD}_i$$
is computed for the Studentized residuals of the classified location and its immediate neighbourhood, where $\mathrm{med}_i$ is the median of the Studentized residuals and $\mathrm{MAD}_i$ is the median absolute deviation. The absolute value of the Studentized residual is compared to the neighbourhood threshold for confirmation as an outlier.
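The neighbourhood check can be illustrated with a short sketch. It assumes that the neighbourhood of location $i$ is the set of locations with non-zero weight in row $i$ of $W$, that the median and MAD are taken over the signed residuals of the location and its neighbours, and that the raw MAD is used without a consistency constant. None of these details is stated in the text, and the function names are hypothetical.

```python
import numpy as np

def neighbourhood_threshold(resid, W, i):
    """med_i + 3 * MAD_i over location i and its immediate neighbours."""
    idx = np.append(np.flatnonzero(W[i] > 0), i)
    vals = np.asarray(resid, dtype=float)[idx]
    med = np.median(vals)
    mad = np.median(np.abs(vals - med))   # raw MAD (assumption: no scaling constant)
    return med + 3.0 * mad

def confirmed_outlier(resid, W, i):
    """True if |residual_i| exceeds the threshold of its own neighbourhood."""
    return abs(resid[i]) > neighbourhood_threshold(resid, W, i)

# Three locations on a line; the middle one has a large residual.
W = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
resid = [0.5, 6.0, -0.5]
print(confirmed_outlier(resid, W, 1), confirmed_outlier(resid, W, 0))
```

Because the median and MAD are computed locally, a residual is judged against its own surroundings rather than against the whole map, which matches the local nature of spatial outliers.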
The $CD_{si}$ detected location 201, which has large ISRs, ESRs and $pos_i$. $ISR_s$–$Pos_i$ and $ESR_s$–$Pos_i$ classified locations 1, 4, 35, 51, 91, 201 and 265 as IOs. As shown in Figure 4, $ISR_s$–$Pos_i$ and $ESR_s$–$Pos_i$ classified locations 1, 4, 35, 91 and 265 as outliers in the $y$ direction, and locations 51 and 201 as outliers in both the $X$ and $y$ directions. The cut-off limits of $ESR_s$–$Pos_i$ are narrower than 2 for the 5% cut-off point of the Student's t-distribution, which is around 1.96 for large sample sizes.
$H_{si1}^2$ classified only location 1 as an IO. Location 1 has large ISRs and ESRs with small $pos_i$. It is an outlier in the $y$ direction. $H_{si2}^2$ identified 60 locations as IOs, including all the locations classified by the other methods. However, a diagnostic examination of the 53 other locations classified by $H_{si2}^2$ alone reveals that all locations that have small ISRs and ESRs with large potential values are classified as IOs. Moreover, locations with small Studentized residuals, which show no difference from their neighbourhood, are classified as IOs. This is a clear case of swamping, perhaps due to the local nature of the spatial IOs.
In 1000 runs of the simulation described above, at error variances of 0.01, 0.1, 0.2 and 0.3 (Table 2), the $CD_{si}$ consistently showed a low rate of correct classification of influential observations (below 23%), with a swamping rate of 0%. The $ISR_s$–$Pos_i$ achieved detection rates of about 97% to 99%, while $ESR_s$–$Pos_i$ classified 100% of the IOs correctly, both with swamping rates of 0%. $H_{si1}^2$ had less than 40% accurate classification with zero swamping, while $H_{si2}^2$ had up to 99.7% accurate IO classification, but with very high swamping rates (64% to 81%).
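The two rates reported in Table 2 are not defined in this section. The sketch below assumes one plausible reading: accurate classification is the percentage of true IOs that a method flags, and swamping is the percentage of clean observations that a method wrongly flags. The function name and the example sets are hypothetical.

```python
def classification_rates(true_io, flagged, n):
    """Return (accurate classification %, swamping %) for one simulation run."""
    true_io, flagged = set(true_io), set(flagged)
    n_clean = n - len(true_io)
    detected = 100.0 * len(true_io & flagged) / len(true_io)
    swamped = 100.0 * len(flagged - true_io) / n_clean
    return detected, swamped

# Hypothetical run: 10 locations, true IOs {1, 4}, a method flags {1, 7}.
print(classification_rates({1, 4}, {1, 7}, n=10))
```

Averaging these two numbers over the 1000 runs would give entries comparable to those in Table 2, under the stated definitions.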

5. Illustrative Examples

5.1. Example 1

The gasoline price data for 61 retail sites in the southwest area of Sheffield from [19] were used in Example 1. The analysis indicated the presence of spatial interaction in the error term with a Moran’s I of 0.239.
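For readers who want to reproduce a statistic of this kind, a minimal sketch of Moran's I for a vector of residuals is given here. It uses the standard form $I = (n / S_0)\,(z^\top W z)/(z^\top z)$, where $z$ holds the mean-centred residuals and $S_0$ is the sum of all weights. Whether the reported value of 0.239 was computed in exactly this form is an assumption, and the function name `morans_i` is hypothetical.

```python
import numpy as np

def morans_i(e, W):
    """Moran's I of the values e under the spatial weight matrix W."""
    e = np.asarray(e, dtype=float)
    z = e - e.mean()        # centre the values
    n = z.size
    s0 = W.sum()            # sum of all spatial weights
    return (n / s0) * (z @ W @ z) / (z @ z)

# Four sites on a line with binary contiguity weights.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(morans_i([1, 2, 3, 4], W))     # smooth trend: positive autocorrelation
print(morans_i([1, -1, 1, -1], W))   # alternating signs: negative autocorrelation
```

For a row-standardized $W$, $S_0 = n$ and the leading factor reduces to one.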
The fitted SEM model is given by Equation (21):
$$\hat{y}_M = 35.78 + 0.71 X_F + \hat{\lambda} W \xi$$
where $y_M$ and $X_F$ are the March and February sales from the southwest Sheffield gasoline sale data, respectively, $\hat{\lambda} = 0.15$ is the estimate of the coefficient of correlation in the residuals, $W$ is the standardized weight matrix and $\xi$ is the vector of correlated residuals.
Table 3 shows the results of the detected IOs in the SEM model for the gasoline data with all the sites detected by the methods. A "yes" under a method column indicates that the site has been detected by the method as an IO and a "no" means otherwise. The values in bold in columns ISRs, ESRs and $pos_i$ indicate large Studentized residuals and potentials greater than 0.0335, respectively. Figure 5 shows the classification of observations by $ISR_s$–$Pos_i$ and $ESR_s$–$Pos_i$.
The $CD_{si}$, $ISR_s$–$Pos_i$, $ESR_s$–$Pos_i$ and $H_{si1}^2$ all identified only site 25 as an IO. $H_{si2}^2$ detected 11 more sites as IOs in addition to site 25. Haining [19] made an elaborate diagnostic analysis of the data, in which he emphasized the effect of site 25 as an IO. Our methods have classified site 30 in addition to site 25 as an IO. Figure 5 shows the graph of the lagged residuals against the residuals. It is noticeable from the graph that site 30 has also been marked as an IO.
Though the $H_{si2}^2$ has detected all the suspected IOs, it is prone to swamping. The remaining high potentials are classified as GLPs by $ISR_s$–$Pos_i$ and $ESR_s$–$Pos_i$ since their Studentized values are small.
Figure 6 shows the classification by $ISR_s$–$Pos_i$ (a) and $ESR_s$–$Pos_i$ (b), with the outliers marked as red dots; both are classified as outliers in both the $X$ and $y$ directions.

5.2. Example 2

The data for Example 2 were the COVID-19 data for the 159 counties of the State of Georgia, USA, as of 30 June 2020 (http://dph.georgia.gov/covid-19-daily-status-report; accessed on 30 June 2020) and the health ranking (http://www.countyheathrankings.org; accessed on 30 June 2020). The case rate of COVID-19 per 100,000 was the dependent variable. The independent variables were the population of black race in the county ($X_1$), population of Asians ($X_2$), population of Hispanics ($X_3$), population of people aged 65 years and above ($X_4$), population of females in the county ($X_5$) and life expectancy ($X_6$).
The data were fitted with the SAR model (the model with the lowest Akaike information criterion (AIC) value, 2192). The SAR model is presented in Equation (22):
$$\hat{y} = \hat{\rho} W \hat{y} + \hat{\beta}_0 + \sum_{i=1}^{6} \hat{\beta}_i X_i$$
where $\hat{\rho} = 0.6967$, $\hat{\beta}_0 = 1087.7388$, $\hat{\beta}_1 = 9.7831$, $\hat{\beta}_2 = 6.2210$, $\hat{\beta}_3 = 54.1405$, $\hat{\beta}_4 = 28.5874$, $\hat{\beta}_5 = 4.8288$ and $\hat{\beta}_6 = 40.3323$. $X_1$, $X_3$ and $\hat{\rho}$ are significant at 5%, while $X_2$ and $X_5$ are significant at 10%. $X_4$ and $X_6$ are not significant.
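Because the fitted value appears on both sides of Equation (22), one common way to evaluate it is through the reduced form $\hat{y} = (I - \hat{\rho} W)^{-1} X \hat{\beta}$, which is valid whenever $I - \hat{\rho} W$ is invertible. The sketch below illustrates this on a toy example. The function name `sar_fitted`, the small weight matrix and the coefficient values are hypothetical, and the text does not state that the authors computed fitted values this way.

```python
import numpy as np

def sar_fitted(X, W, rho, beta):
    """Solve y = rho * W y + X beta for y (reduced form of the SAR model)."""
    n = W.shape[0]
    return np.linalg.solve(np.eye(n) - rho * W, X @ beta)

# Toy example: three locations on a line, intercept plus one covariate.
W = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
X = np.column_stack([np.ones(3), [1.0, 2.0, 3.0]])
beta = np.array([1.0, 0.5])
y_hat = sar_fitted(X, W, rho=0.4, beta=beta)
# The result satisfies the implicit equation it was derived from.
print(np.allclose(y_hat, 0.4 * W @ y_hat + X @ beta))
```

Solving the linear system directly avoids forming the explicit inverse, which matters for larger numbers of locations.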
The Cook's distance classified only county 50 as an IO. The $ISR_s$–$Pos_i$ and $ESR_s$–$Pos_i$ coincided in detecting counties 3, 26, 49, 50, 70, 120, 135, 141 and 142 as IOs. The $H_{si1}^2$ (non-robust) detected counties 26 and 50 as IOs. The $H_{si2}^2$ (robust) detected counties 3, 26, 50, 58, 67, 70, 98, 118, 120, 128, 131, 134, 135, 139, 141, 142, 153 and 155. Table 4 shows the locations detected by the various methods, with large ISRs, ESRs and high potentials in bold font.
The IOs identified by $ISR_s$–$Pos_i$ and $ESR_s$–$Pos_i$ both have large Studentized residuals and large potentials, as can be observed in Table 4. Figure 7 shows the outliers in the $X$, $y$ and both $X$ and $y$ directions. The $CD_{si}$ detected the largest Studentized residual with a high potential as an IO. The $H_{si1}^2$ identified two observations with large Studentized values and high potential values. The $H_{si2}^2$ detected all suspected IOs, but with many having both small values of Studentized residuals and potential values.
While examining the outlyingness of the classified counties, we find that county 50 is clearly an IO since it has both large Studentized residual and a large potential value. It is outside the threshold value of its neighbourhood.
Four of the counties classified by $ISR_s$–$Pos_i$ and $ESR_s$–$Pos_i$ (i.e., 26, 50, 70 and 128) are classified as vertical outliers, while counties 3, 49, 120, 135, 141 and 142 have large potential values and Studentized values greater than their threshold values and are classified as BLPs and hence IOs.
Besides the counties classified by $ISR_s$–$Pos_i$ and $ESR_s$–$Pos_i$, all the other counties detected by $H_{si2}^2$ have their Studentized difference residuals below their neighbourhood threshold. Though their potential values are mostly large, their prediction Studentized residuals are small in both ISRs and ESRs.

5.3. Example 3

In Example 3, the life expectancy of the counties of the US was modelled using population density ($X_1$), fair/poor health status ($X_2$), obesity ($X_3$), population in rural areas ($X_4$), inactivity rate ($X_5$), population of smokers ($X_6$), population of black people ($X_7$), population of Asians ($X_8$) and population of Hawaiians ($X_9$). The data were obtained from the Kaggle website (https://www.kaggle.com/johnjdavisiv/us-counties-covid19-weather-sociohealth-data; accessed on 13 December 2020).
The spatial error model (SEM) had the lowest AIC value and was fitted to the data. The model was significant at the 5% level with a significant Moran's I of 0.2160. $X_1$ and $X_4$ were not significant at the 5% level. All the other estimates were significant at the 5% level.
The fitted model is given by:
$$\hat{y} = \hat{\beta}_0 + \sum_{i=1}^{9} \hat{\beta}_i X_i + \hat{\lambda} W \xi$$
where $\hat{\lambda} = 0.4343$, $\hat{\beta}_0 = 88.4885$, $\hat{\beta}_1 = 0.0000$, $\hat{\beta}_2 = 0.0954$, $\hat{\beta}_3 = 0.0377$, $\hat{\beta}_4 = 0.0040$, $\hat{\beta}_5 = 0.0630$, $\hat{\beta}_6 = 0.3892$, $\hat{\beta}_7 = 0.0113$, $\hat{\beta}_8 = 0.1437$ and $\hat{\beta}_9 = 0.2016$. Counties with fair/poor health status had a 0.1 lower life expectancy for an increase in the population. Counties with a larger number of obese people had a decrease in life expectancy of 0.03. Similarly, counties with a large number of inactive people had a life expectancy decreased by 0.06, and counties with a larger number of smokers had a life expectancy decreased by 0.04 per increase in the population. Counties with a higher number of black people and Hawaiians had a life expectancy decreased by 0.01 and 0.2, respectively, while those with a higher number of Asians had an increased rate of 0.14 in population.
The $ISR_s$–$Pos_i$ classified 139 counties as IOs, while $ESR_s$–$Pos_i$ classified eight more counties, making a total of 147. $H_{si1}^2$ and $H_{si2}^2$ classified 24 and 324 counties as IOs, respectively. $CD_{si}$ classified no county as an IO.

6. Conclusions

In this article, we demonstrated the application of influential observation (IO) detection techniques from classical regression to the spatial regression model. Measures that incorporate the spatial information contained in the spatial autoregression of the dependent variable and of the residuals were obtained. We also evaluated the performance of the spatial counterparts of some methods employed in classical regression. Though the methods work well in classical regression models, they are mostly prone to either masking or swamping in spatial applications. This is attributable to the local nature of spatial outliers. Hence, we proposed new $ISR_s$–$Pos_i$ and $ESR_s$–$Pos_i$ plots to classify observations into four categories: regular observations, vertical outliers, good leverage points and bad leverage points, whereby IOs are those observations that fall in the vertical outlier and bad leverage point categories. Interestingly, the proposed $ESR_s$–$Pos_i$ diagnostic plot was very successful in classifying observations into the correct categories, followed by the $ISR_s$–$Pos_i$ plot, as demonstrated by the results obtained from a simulation study and real data examples. Thus, the newly established $ESR_s$–$Pos_i$ plot can be a suitable alternative for identifying IOs in the spatial regression model.

Author Contributions

Conceptualization, A.M.B. and H.M.; methodology, A.M.B.; software, A.M.B.; validation, A.M.B. and H.M.; formal analysis, A.M.B. and N.H.A.R.; investigation, A.M.B. and N.H.A.R.; resources, A.M.B., H.M. and M.B.A.; data curation, A.M.B.; writing—original draft preparation, A.M.B.; writing—review and editing, A.M.B., H.M., M.B.A. and N.H.A.R.; visualization, A.M.B. and M.B.A.; supervision, H.M.; project administration, H.M.; funding acquisition, H.M. All authors have read and agreed to the published version of the manuscript.

Funding

This article was partially supported by the Fundamental Research Grant Scheme (FRGS) under the Ministry of Higher Education, Malaysia with project number FRGS/1/2019/STG06/UPM/01/1.

Data Availability Statement

Data are available online. Data for Example 1 are available on page 332 of [19]. Data for Example 2 are available online (http://dph.georgia.gov/covid-19-daily-status-report; accessed on 30 June 2020 and http://www.countyheathrankings.org; accessed on 30 June 2020). Data for Example 3 are available online (https://www.kaggle.com/johnjdavisiv/us-counties-covid19-weather-sociohealth-data; accessed on 13 June 2020).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Belsley, D.A.; Kuh, E.; Welsch, R.E. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity; John Wiley & Sons: New York, NY, USA, 1980; Volume 571.
  2. Rashid, A.M.; Midi, H.; Slwabi, W.D.; Arasan, J. An Efficient Estimation and Classification Methods for High Dimensional Data Using Robust Iteratively Reweighted SIMPLS Algorithm Based on Nu-Support Vector Regression. IEEE Access 2021, 9, 45955–45967.
  3. Cook, R.D. Influential Observations in Linear Regression. J. Am. Stat. Assoc. 1977, 74, 169–174.
  4. Hoaglin, D.C.; Welsch, R.E. The Hat Matrix in Regression and ANOVA. Am. Stat. 1978, 32, 17.
  5. Andrews, D.F.; Pregibon, D. Finding the Outliers That Matter. J. R. Stat. Soc. Ser. B (Methodol.) 1978, 40, 85–93.
  6. Hawkins, D.M. Identification of Outliers; Springer: Berlin/Heidelberg, Germany, 1980; Volume 11.
  7. Huber, P. Robust Statistics; John Wiley and Sons: New York, NY, USA, 1981.
  8. Cook, R.D.; Weisberg, S. Monographs on statistics and applied probability. In Residuals and Influence in Regression; Chapman and Hall: New York, NY, USA, 1982; ISBN 978-0-412-24280-9.
  9. Chatterjee, S.; Hadi, A.S. Sensitivity Analysis in Linear Regression; John Wiley & Sons: New York, NY, USA, 1988; Volume 327.
  10. Hadi, A.S. A New Measure of Overall Potential Influence in Linear Regression. Comput. Stat. Data Anal. 1992, 14, 1–27.
  11. Habshah, M.; Norazan, M.R.; Rahmatullah Imon, A.H.M. The Performance of Diagnostic-Robust Generalized Potentials for the Identification of Multiple High Leverage Points in Linear Regression. J. Appl. Stat. 2009, 36, 507–520.
  12. Meloun, M.; Militký, J. Statistical Data Analysis: A Practical Guide; Woodhead Publishing Limited: Sawston, Cambridge, UK, 2011.
  13. Puterman, M.L. Leverage and Influence in Autocorrelated Regression Models. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1988, 37, 76–86.
  14. Schall, R.; Dunne, T.T. A Unified Approach to Outliers in the General Linear Model. Sankhyā Indian J. Stat. Ser. B 1988, 50, 157–167.
  15. Martin, R.J. Leverage, Influence and Residuals in Regression Models When Observations Are Correlated. Commun. Stat.-Theory Methods 1992, 21, 1183–1212.
  16. Shi, L.; Chen, G. Influence Measures for General Linear Models with Correlated Errors. Am. Stat. 2009, 63, 40–42.
  17. Cerioli, A.; Riani, M. Robust Transformations and Outlier Detection with Autocorrelated Data. In From Data and Information Analysis to Knowledge Engineering; Spiliopoulou, M., Kruse, R., Borgelt, C., Nürnberger, A., Gaul, W., Eds.; Studies in Classification, Data Analysis, and Knowledge Organization; Springer: Berlin/Heidelberg, Germany, 2006; pp. 262–269. ISBN 978-3-540-31313-7.
  18. Christensen, R.; Johnson, W.; Pearson, L.M. Prediction Diagnostics for Spatial Linear Models. Biometrika 1992, 79, 583–591.
  19. Haining, R. Diagnostics for Regression Modeling in Spatial Econometrics. J. Reg. Sci. 1994, 34, 325–341.
  20. Haining, R.P.; Haining, R. Spatial Data Analysis: Theory and Practice; Cambridge University Press: Cambridge, UK, 2003.
  21. Dai, X.; Jin, L.; Shi, A.; Shi, L. Outlier Detection and Accommodation in General Spatial Models. Stat. Methods Appl. 2016, 25, 453–475.
  22. Singh, A.K.; Lalitha, S. A Novel Spatial Outlier Detection Technique. Commun. Stat.-Theory Methods 2018, 47, 247–257.
  23. Anselin, L. Some Robust Approaches to Testing and Estimation in Spatial Econometrics. Reg. Sci. Urban Econ. 1990, 20, 141–163.
  24. Cerioli, A.; Riani, M. Robust Methods for the Analysis of Spatially Autocorrelated Data. Stat. Methods Appl. 2002, 11, 335–358.
  25. Yildirim, V.; Mert Kantar, Y. Robust Estimation Approach for Spatial Error Model. J. Stat. Comput. Simul. 2020, 90, 1618–1638.
  26. Kou, Y.; Lu, C.-T. Outlier Detection, Spatial. In Encyclopedia of GIS; Springer: Boston, MA, USA, 2008; pp. 1539–1546.
  27. Hadi, A.S.; Imon, A.H.M.R. Identification of Multiple Outliers in Spatial Data. Int. J. Stat. Sci. 2018, 16, 87–96.
  28. Hadi, A.S.; Simonoff, J.S. Procedures for the Identification of Multiple Outliers in Linear Models. J. Am. Stat. Assoc. 1993, 88, 1264–1272.
  29. Aggarwal, C.C. Spatial Outlier Detection. In Outlier Analysis; Springer: New York, NY, USA, 2013; pp. 313–341. ISBN 978-1-4614-6395-5.
  30. Anselin, L. Local Indicators of Spatial Association—LISA. Geogr. Anal. 1995, 27, 93–115.
  31. Politis, D.; Romano, J.; Wolf, M. Bootstrap Sampling Distributions. In Subsampling; Springer: New York, NY, USA, 1999; Available online: https://link.springer.com/chapter/10.1007/978-1-4612-1554-7_1 (accessed on 16 September 2021).
  32. Heagerty, P.J.; Lumley, T. Window Subsampling of Estimating Functions with Application to Regression Models. J. Am. Stat. Assoc. 2000, 95, 197–211.
  33. Anselin, L. Exploratory Spatial Data Analysis and Geographic Information Systems. New Tools Spat. Anal. 1994, 17, 45–54.
  34. Cressie, N.A.C. Statistics for Spatial Data, Rev. ed.; Wiley series in probability and mathematical statistics; Wiley: New York, NY, USA, 1993; ISBN 978-0-471-00255-0.
  35. Anselin, L. Spatial Econometrics: Methods and Models; Studies in Operational Regional Science; Springer: Dordrecht, The Netherlands, 1988; Volume 4, ISBN 978-90-481-8311-1.
  36. LeSage, J.P. The Theory and Practice of Spatial Econometrics; University of Toledo: Toledo, OH, USA, 1999; Volume 28.
  37. Olver, P.J.; Shakiban, C.; Shakiban, C. Applied Linear Algebra; Springer: Berlin/Heidelberg, Germany, 2006; Volume 1.
  38. Horn, R.A.; Johnson, C.R. Matrix Analysis, 2nd ed.; Cambridge University Press: Cambridge, NY, USA, 2012; ISBN 978-0-521-83940-2.
  39. Liesen, J.; Mehrmann, V. Linear Algebra; Springer Undergraduate Mathematics Series; Springer International Publishing: Cham, Germany, 2015; ISBN 978-3-319-24344-3.
  40. Shekhar, S.; Lu, C.-T.; Zhang, P. A Unified Approach to Detecting Spatial Outliers. GeoInformatica 2003, 7, 139–166.
  41. Billor, N.; Hadi, A.S.; Velleman, P.F. BACON: Blocked Adaptive Computationally Efficient Outlier Nominators. Comput. Stat. Data Anal. 2000, 34, 279–298.
  42. Imon, A. Identifying Multiple High Leverage Points in Linear Regression. J. Stat. Stud. 2002, 3, 207–218.
  43. Rahmatullah Imon, A.H.M. Identifying Multiple Influential Observations in Linear Regression. J. Appl. Stat. 2005, 32, 929–946.
  44. Midi, H.; Mohammed, A. The Identification of Good and Bad High Leverage Points in Multiple Linear Regression Model. Math. Methods Syst. Sci. Eng. 2015, 147–158.
  45. Bagheri, A.; Midi, H. Diagnostic Plot for the Identification of High Leverage Collinearity-Influential Observations. Sort: Stat. Oper. Res. Trans. 2015, 39, 51–70.
  46. Dowd, P. The variogram and kriging: Robust and resistant estimators. In Geostatistics for Natural Resources Characterization; Springer: Berlin/Heidelberg, Germany, 1984; pp. 91–106.
Figure 1. Classification of RO, GLP, and IO according to $ISR_s$–$Pos_i$.
Figure 2. Classification of RO, GLP, and IO according to $ESR_s$–$Pos_i$.
Figure 3. Graph of average attribute values in neighbourhood of locations against the attribute values in the locations with contamination (black points).
Figure 4. Graph of IO classification according to GLP, BLP and vertical outlier in ISRsPosi and ESRsPosi for simulated data. (a) ISRsPosi; (b) ESRsPosi.
Figure 5. Graph of the lagged residuals against the residuals, of the 61 sites of the southwest Sheffield fitted with SEM, showing the IO points in red dots.
Figure 6. Graph of IO classification according to GLP, BLP and vertical outlier in the southwest Sheffield gasoline data. (a) $ISR_s$–$Pos_i$; (b) $ESR_s$–$Pos_i$.
Figure 7. Graph of IO classification according to GLP, BLP and vertical outlier in the State of Georgia, USA COVID-19 data. (a) $ISR_s$–$Pos_i$; (b) $ESR_s$–$Pos_i$.
Table 1. ISRs, ESRs and $pos_i$ of locations with large Studentized residuals in the simulated GSM model, with their cut-off points in parentheses.

| Location | ISRs (2.00) | ESRs (1.97) | $pos_i$ (0.0078) |
|---|---|---|---|
| 1 | 15.0378 | 22.8179 | 0.0008 |
| 4 | 4.5847 | 4.7046 | 0.0033 |
| 35 | −7.1434 | −7.6397 | 0.0026 |
| 51 | −4.4695 | −4.5801 | 0.0430 |
| 91 | 4.7613 | 4.8965 | 0.0068 |
| 201 | −6.9336 | −7.3840 | 0.0280 |
| 265 | −2.2644 | −2.2762 | 0.0068 |
Table 2. Influential observations classification rate based on large prediction Studentized residuals and large potentials.

| $\sigma^2$ | Method | Accurate Classification (%) | Swamping (%) |
|---|---|---|---|
| 0.01 | $CD_{si}$ | 22.25 | 0.00 |
| 0.01 | $ISR_s$–$Pos_i$ | 98.54 | 0.00 |
| 0.01 | $ESR_s$–$Pos_i$ | 100.00 | 0.00 |
| 0.01 | $H_{si1}^2$ | 39.45 | 0.00 |
| 0.01 | $H_{si2}^2$ | 99.71 | 81.41 |
| 0.1 | $CD_{si}$ | 20.64 | 0.00 |
| 0.1 | $ISR_s$–$Pos_i$ | 98.36 | 0.00 |
| 0.1 | $ESR_s$–$Pos_i$ | 100.00 | 0.00 |
| 0.1 | $H_{si1}^2$ | 38.09 | 0.00 |
| 0.1 | $H_{si2}^2$ | 99.14 | 76.48 |
| 0.2 | $CD_{si}$ | 17.86 | 0.00 |
| 0.2 | $ISR_s$–$Pos_i$ | 97.51 | 0.00 |
| 0.2 | $ESR_s$–$Pos_i$ | 100.00 | 0.00 |
| 0.2 | $H_{si1}^2$ | 37.23 | 0.00 |
| 0.2 | $H_{si2}^2$ | 97.34 | 69.25 |
| 0.3 | $CD_{si}$ | 16.36 | 0.00 |
| 0.3 | $ISR_s$–$Pos_i$ | 96.57 | 0.00 |
| 0.3 | $ESR_s$–$Pos_i$ | 100.00 | 0.00 |
| 0.3 | $H_{si1}^2$ | 36.23 | 0.00 |
| 0.3 | $H_{si2}^2$ | 96.00 | 64.42 |
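Table 3. Sites with IOs in the analysis of the southwest Sheffield gasoline data.

| S/N | Site | ISRs (2.00) | ESRs (2.00) | $pos_i$ (0.0335) | $CD_{si}$ | $ISR_s$–$Pos_i$ | $ESR_s$–$Pos_i$ | $H_{si1}^2$ | $H_{si2}^2$ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | −1.8879 | −1.9301 | 0.3538 | No | No | No | No | Yes |
| 2 | 9 | 1.4810 | 1.4962 | 0.0223 | No | No | No | No | Yes |
| 3 | 22 | 1.0127 | 1.0129 | 0.0779 | No | No | No | No | Yes |
| 4 | 25 | 5.4292 | 7.5481 | 0.2773 | Yes | Yes | Yes | Yes | Yes |
| 5 | 26 | 1.4438 | 1.4573 | 0.1352 | No | No | No | No | Yes |
| 6 | 30 | 2.2054 | 2.2813 | 0.2489 | No | No | No | No | Yes |
| 7 | 40 | 1.5692 | 1.5890 | 0.0194 | No | No | No | No | Yes |
| 8 | 41 | 1.1974 | 1.2058 | 0.0218 | No | No | No | No | Yes |
| 9 | 42 | −1.9150 | −1.9598 | 0.0378 | No | No | No | No | Yes |
| 10 | 46 | 0.1003 | 0.0995 | 0.1319 | No | No | No | No | Yes |
| 11 | 55 | −1.2042 | −1.2089 | 0.0219 | No | No | No | No | Yes |
| 12 | 61 | −1.8011 | −1.8363 | 0.0319 | No | No | No | No | Yes |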
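Table 4. Detected IOs counties by different methods in the Georgia COVID-19 data.

| County | ISRs (2.00) | ESRs (1.98) | $pos_i$ (0.0851) | $CD_{si}$ | $ISR_s$–$Pos_i$ | $ESR_s$–$Pos_i$ | $H_{si1}^2$ | $H_{si2}^2$ |
|---|---|---|---|---|---|---|---|---|
| 3 | 2.2245 | 2.2539 | 0.0257 | No | No | No | No | Yes |
| 26 | 4.5733 | 4.9060 | 0.1956 | No | Yes | Yes | Yes | Yes |
| 49 | 2.7685 | 2.8313 | 0.0265 | No | Yes | Yes | No | Yes |
| 50 | 5.7504 | 6.4737 | 0.2298 | Yes | Yes | Yes | Yes | Yes |
| 58 | 0.7090 | 0.7079 | 0.6893 | No | No | No | Yes | Yes |
| 67 | 0.1018 | 0.1015 | 0.3524 | No | No | No | No | Yes |
| 70 | 3.1334 | 3.2285 | 0.0895 | No | Yes | Yes | No | Yes |
| 98 | −1.8549 | −1.8699 | 0.0105 | No | No | No | No | Yes |
| 118 | −1.5657 | −1.5731 | 0.0827 | No | No | No | No | Yes |
| 120 | 3.0168 | 3.1006 | 0.0544 | No | Yes | Yes | No | Yes |
| 128 | −2.0152 | −2.0359 | 0.4557 | No | Yes | Yes | No | Yes |
| 131 | −1.6718 | −1.6862 | 0.0565 | No | No | No | No | Yes |
| 134 | −1.6168 | −1.6253 | 0.0818 | No | No | No | No | Yes |
| 135 | 2.1674 | 2.1942 | 0.0338 | No | Yes | Yes | No | Yes |
| 141 | 2.6726 | 2.7283 | 0.0163 | No | Yes | Yes | No | Yes |
| 142 | 2.1805 | 2.2079 | 0.0174 | Yes | Yes | No | No | Yes |
| 153 | −1.2693 | −1.2718 | 0.2234 | No | No | No | No | Yes |
| 155 | −1.2334 | −1.2359 | 0.2472 | No | No | No | No | Yes |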
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Baba, A.M.; Midi, H.; Adam, M.B.; Abd Rahman, N.H. Detection of Influential Observations in Spatial Regression Model Based on Outliers and Bad Leverage Classification. Symmetry 2021, 13, 2030. https://doi.org/10.3390/sym13112030
