Variable Selection of Heterogeneous Spatial Autoregressive Models via Double-Penalized Likelihood

Ruiqin Tian; Miaojie Xia; Dengke Xu

doi:10.3390/sym14061200

,

and

¹

School of Mathematics, Hangzhou Normal University, Hangzhou 311121, China

²

College of Economics, Hangzhou Dianzi University, Hangzhou 310018, China

^*

Author to whom correspondence should be addressed.

Symmetry2022, 14(6), 1200;https://doi.org/10.3390/sym14061200

This article belongs to the Section Mathematics

Version Notes

Order Reprints

Abstract

Heteroscedasticity is often encountered in spatial-data analysis, so a new class of heterogeneous spatial autoregressive models is introduced in this paper, where the variance parameters are allowed to depend on some explanatory variables. Here, we are interested in the problem of parameter estimation and the variable selection for both the mean and variance models. Then, a unified procedure via double-penalized quasi-maximum likelihood is proposed, to simultaneously select important variables. Under certain regular conditions, the consistency and oracle property of the resulting estimators are established. Finally, both simulation studies and a real data analysis of the Boston housing data are carried to illustrate the developed methodology.

Keywords:

heterogeneous spatial autoregressive models; double-penalized quasi-maximum likelihood; variable selection; SCAD; tuning parameters

1. Introduction

In the field of modern economics and statistics, the spatial autoregressive (SAR) model has always been a hot topic for economists and statisticians. Theories and methods for estimation as well as other inferences based on linear SAR models and their extensions have been deeply studied. Among them, there is a lot of wonderful literature based on linear SAR models, such as Cliff and Ord [1], Anselin [2], and Anselin and Bera [3]. Other research results for linear SAR models can also be referred to: Xu and Lee [4], Liu et al. [5], Xie et al. [6], Xie et al. [7], and so on. In order to capture the nonlinear relationship between response variables and some explanatory variables, a variety of semiparametric SAR models have recently been proposed and studied in depth. For example, based on a partially linear spatial autoregressive model, Su and Jin [8] developed a profile quasi-maximum likelihood-estimation method and discussed the asymptotic properties of the obtained estimators. Du et al. [9] developed partially linear additive spatial autoregressive models and proposed an estimation method via combining the spline approximations with an instrumental-variables-estimation method. Cheng and Chen [10] discussed partially linear single-index spatial autoregressive model as well as obtained the consistency and asymptotic normality of the proposed estimators, under some mild conditions. Other research results of semiparametric SAR models can also be seen in Wei et al. [11], Hu et al. [12], and so on. Previous research on various SAR models focused mainly on the homoskedasticity assumption, that the variance of the unobservable error, conditional on the explanatory variables, is constant. It is well-known that if the innovations are heteroskedastic, most existing statistical inference methods under homoskedasticity assumption could lead to incorrect inference, see Lin and Lee [13]. Therefore, many researchers want to relax the homoskedasticity assumption for spatial autoregressive models, by allowing different variances for each unobservable error. For example, Dai et al. [14] developed a Bayesian local-influence analysis for heterogeneous spatial autoregressive models. However, the variance terms in these above-mentioned heterogeneous spatial autoregressive models are assumed fixed and do not depend on the regression variables. Furthermore, in many application fields, such as economics and quality management, it is a topic of interest to model the variance itself, which is helpful to identify the factors that affect the variability in the observations. Thus, we intend to propose a class of heterogeneous spatial autoregressive models where the variance parameters are modelled in terms of covariates.

In addition, many authors have done a lot of research on joint mean and variance models. For example, Wu and Li [15] proposed a variable-selection procedure via a penalized maximum likelihood for the joint mean and dispersion models of the inverse Gaussian distribution. Xu and Zhang [16] discussed a Bayesian estimation for semiparametric joint mean and variance models, based on B-spline approximations of nonparametric components. Zhao et al. [17] studied variable selection for beta-regression models with varying dispersion, where both the mean and the dispersion are modeled by explanatory variables. Li et al. [18] proposed an efficient unified variable selection of the joint location, scale, and skewness models, based on a penalized likelihood method. Zhang et al. [19] developed a Bayesian quantile-regression analysis for semiparametric mixed-effects double-regression models, on the basis of the asymmetric Laplace distribution for the errors.

As we all know, variable selection is the most important problem to be considered in regression analysis. In practice, a large number of variables should be used in the initial analysis, but many of them may not be important and should be excluded from the final model, in order to improve the accuracy of the prediction. When the number of predictive variables is large, traditional variable selection methods such as stepwise regression and best-subset selection are not computationally feasible. Therefore, various shrinkage methods have been proposed in recent years and have gained much attention, such as the LASSO (Tibshirani [20]), the adaptive LASSO (Zou [21]), and the SCAD (Fan and Li [22]). Based on these shrinkage methods, the variable selection for SAR models (see Liu et al. [5]; Xie et al. [6]; Xie et al. [7]; Luo and Wu [23]) and the variable selection for other models without spatial dependence (see, for example, Li and Liang [24]; Zhao and Xue [25]; Tian et al. [26]) have been studied extensively in recent years. To the best of our knowledge, most existing variable-selection methods in spatial-data analysis are limited, to only select the mean explanatory variables in the literature, and little work has been done to select the variance explanatory variables.

Therefore, in this paper we aim to make variable selection of heterogeneous spatial autoregressive models (heterogeneous SAR models), based on penalized quasi-maximum likelihood, using different penalty functions. The proposed variable-selection method simultaneously selects important explanatory variables in a mean model and a variance model. Furthermore, it can be proven that this variable selection procedure is consistent, and the obtained estimators of regression coefficients have oracle property under certain regular conditions. This indicates that the penalized estimators work as well as if the subset of true zero coefficients were already known. A simulation and a real data analysis of a Boston housing price are used to illustrate the proposed variable selection method.

The outline of the paper is organized as follows. Section 2 introduces new heterogeneous spatial autoregressive models. Then, we propose a unified variable-selection procedure for the joint models via the double-penalized quasi-maximum likelihood method. Section 3 gives the theoretical results of the resulting estimators. The computation of the penalized quasi-maximum likelihood estimator as well as the choice of the tuning parameters are presented in Section 4. The finite-sample performance of the method is investigated, based on some simulation studies, in Section 5. Section 6 gives a real data analysis of a Boston housing price to illustrate the proposed method. Some conclusions as well as a brief discussion are given in Section 7. Some assumptions and the technical proofs of all the asymptotic results are provided in Appendix A.

2. Variable Selection via Penalized Quasi-Maximum Likelihood

2.1. Heterogeneous SAR Models

The classical spatial autoregressive models have the following form:

Y = ρ W Y + X β + ε,

(1)

where

Y = {(y_{1}, y_{2}, \dots, y_{n})}^{T}

is an n-dimensional observation vector on the dependent variable;

| ρ | < 1

is an unknown spatial parameter; and W is a specified spatial weight matrix of known constants with zero diagonal elements. Let

X = {(x_{1}, x_{2}, \dots, x_{n})}^{T}

is an

n \times p

matrix whose ith row

x_{i}^{T} = (x_{i 1}, \dots, x_{i p})

is the observation of the explanatory variables and

β = {(β_{1}, \dots, β_{p})}^{T}

be a

p \times 1

vector of unknown regression parameters in the mean model;

ε

is an n-dimensional vector of independent identically distributed disturbances with zero mean and finite variance

σ^{2}

.

Furthermore, similar to Xu and Zhang [16], we consider variance heterogeneity in the models and assume an explicit variance modeling related to other explanatory variables, that is:

σ_{i}^{2} = h (z_{i}^{T} γ),

(2)

where

z_{i}^{T} = (z_{i 1}, \dots, z_{i q})

is the observation of explanatory variables associated with the variance of

y_{i}

and

γ = {(γ_{1}, \dots, γ_{q})}^{T}

is a

q \times 1

vector of regression parameters in the variance model. There might be some components of

z_{i}

which coincide with some

x_{i}

’s. In addition,

h (\cdot)

is a known function, for the identifiability of the models, so we always assume that

h (\cdot)

is a monotone function and

h (\cdot) > 0

, which takes into account the positiveness of the variance. For example, the researcher can take

h (x) = exp (x)

in general. So, this paper considers the following heterogeneous SAR models:

\{\begin{matrix} Y = ρ W Y + X β + ε, \\ Σ = diag (σ_{1}^{2}, σ_{2}^{2}, \dots, σ_{n}^{2}) \\ σ_{i}^{2} = exp (z_{i}^{T} γ), \\ i = 1, 2, \dots, n . \end{matrix}

(3)

According to the idea of the quasi-maximum likelihood estimators (Lee [27]), the log-likelihood function of the model (3) is

ℓ_{n} (ρ, β, γ | Y, X, Z) = - \frac{n}{2} l n (2 π) - \frac{1}{2} \sum_{i = 1}^{n} z_{i}^{T} γ + l n | S | - \frac{1}{2} e^{T} Σ^{- 1} e,

(4)

where

e = S Y - X β, S = I_{n} - ρ W

, and

I_{n}

is an

(n \times n)

identity matrix.

2.2. Penalized Quasi-Maximum Likelihood

In order to obtain the desired sparsity in the resulting estimators, we propose the penalized quasi-maximum likelihood

L (ρ, β, γ) = ℓ_{n} (ρ, β, γ) - n \sum_{j = 1}^{p} p_{λ_{j}^{(1)}} (| β_{j} |) - n \sum_{k = 1}^{q} p_{λ_{k}^{(2)}} (| γ_{k} |) .

(5)

For notational simplicity, we rewrite (5) in the following

L (θ) = ℓ_{n} (θ) - n \sum_{j = 2}^{s} p_{λ_{n}} (| θ_{j} |),

(6)

where

θ = {(θ_{1}, \dots, θ_{s})}^{T} = {(ρ, β_{1}, \dots, β_{p}; γ_{1}, \dots, γ_{q})}^{T}

with

s = p + q + 1

, and

p_{λ_{n}} (\cdot)

is a given-penalty function with the tuning parameters

λ_{n}

. The data-driven criteria, such as cross validation (CV), generalized cross-validation (GCV), or the BIC-type tuning-parameter selector (see Wang et al. [28]), can be used to choose the tuning parameters, which is described in Section 4. Here we use the same penalty function

p (\cdot)

for all the regression coefficients but with different tuning parameters

λ_{n}

, for the mean parameters and the variance parameters, respectively. Note that the penalty functions and tuning parameters are not necessarily the same for all the parameters. For example, we wish to keep some important variables in the final model and, therefore, do not want to penalize their coefficients. In this paper, we mainly use the smoothly clipped absolute deviation (SCAD) penalty, with a first derivative that satisfies

p_{λ}^{'} (t) = λ \{I (t \leq λ) + \frac{{(a λ - t)}_{+}}{(a - 1) λ} I (t > λ)\}

in which

a = 3.7

is taken in our work. The details about SCAD can be seen in Fan and Li [22].

The penalized quasi-maximum likelihood estimator of

θ

is denoted by

\hat{θ}

, which maximizes the function

L (θ)

in (6) with respect to

θ

. The technical details and an algorithm for calculating the penalized quasi-maximum likelihood estimator

\hat{θ}

are provided in Section 4.

3. Asymptotic Properties

We next study the asymptotic properties of the resulting penalized quasi-maximum likelihood estimators. We first introduce some notations. Let

θ_{0}

denote the true value of

θ

. Furthermore, let

θ_{0} = {(θ_{01}, \dots, θ_{0 s})}^{T} = {({(θ_{0}^{(1)})}^{T}, {(θ_{0}^{(2)})}^{T})}^{T} .

For ease of presentation and without loss of generality, it is assumed that

θ_{0}^{(1)}

is an

s_{1} \times 1

nonzero regression coefficient and that

θ_{0}^{(2)} = 0

is an

(s - s_{1}) \times 1

vector of the zero-valued regression coefficient. Let

a_{n} = max_{2 \leq j \leq s} {| p_{λ_{n}}^{'} (| θ_{0 j} |) |, θ_{0 j} \neq 0}

and

b_{n} = max_{2 \leq j \leq s} {p_{λ_{n}}^{″} (| θ_{0 j} |) : θ_{0 j} \neq 0} .

Theorem 1.

Suppose that

a_{n} = O_{p} (n^{- \frac{1}{2}}), b_{n} \to 0

, and

λ_{n} \to 0

as

n \to \infty .

Under the conditions

(C 1)

–

(C 10)

in Appendix A, with probability tending to 1 there must exist a local maximizer

\hat{θ}

of the penalized quasi-maximum likelihood function

L (θ)

in (6), such that

\hat{θ}

is a

\sqrt{n}

-consistent estimator of

θ_{0}

.

The following theorem gives the asymptotic normality of

\hat{θ}

. Let

A_{n} = diag (0, p_{λ_{n}}^{″} (| θ_{01}^{(1)} |), \dots, p_{λ_{n}}^{″} (| θ_{0 s_{1}}^{(1)} |)),

B_{n} = (0, p_{λ_{n}}^{'} (| θ_{01}^{(1)} |) sgn (θ_{01}^{(1)}), \dots, p_{λ_{n}}^{'} (| θ_{0 s_{1}}^{(1)} |) sgn (θ_{0 s_{1}}^{(1)}))^{T},

where

θ_{0 j}^{(1)}

is the jth component of

θ_{0}^{(1)}

(1 \leq j \leq s_{1})

.

Furthermore, denote

\begin{matrix} I_{n} (θ_{0}) & = & - E (\frac{1}{n} \frac{\partial^{2} ℓ_{n} (θ_{0})}{\partial θ \partial θ^{T}}) \\ = & (\begin{matrix} \frac{1}{n} {[G_{n} X β_{0})}^{T} Σ^{- 1} (G_{n} X β_{0}) + t r (G_{n}^{s} G_{n})] & * & * \\ \frac{1}{n} X^{T} Σ^{- 1} (G_{n} X β_{0}) & \frac{1}{n} X^{T} Σ^{- 1} X & 0 \\ \frac{1}{n} \sum_{i = 1}^{n} g_{n, i i} Z_{i}^{T} & 0 & \frac{1}{2 n} Z^{T} Z \end{matrix}, \end{matrix}

\begin{matrix} J_{n} (θ_{0}) = (\begin{matrix} J_{1} & * & * \\ \frac{1}{n} X^{T} Σ^{- 2} G_{n} μ_{3} & 0 & \frac{1}{2 n} X^{T} diag (μ_{3}) Σ^{- 2} Z \\ J_{2} & * & \frac{1}{4 n} [Z^{T} Σ^{- 2} diag (μ_{4}) Z - Z^{T} Z] \end{matrix}, \end{matrix}

where

J_{1} = \frac{1}{n} [μ_{4}^{T} Σ^{- 2} 1_{n} t r (G_{n}^{2}) + 2 {(G_{n} X β_{0})}^{T} G_{n} Σ^{- 2} μ_{3} - 3 t r (G_{n}^{2})]

,

J_{2} = \frac{1}{2 n} [Z^{T} Σ^{- 2} diag (μ_{3}) (G_{n} X β_{0}) + Z^{T} Σ^{- 2} G_{n} μ_{4} - 3 \sum_{i = 1}^{n} g_{n, i i} Z_{i}^{T}]

,

G_{n} = W A^{- 1}

, and

G_{n}^{s} = G_{n} + G_{n}^{T}

, thus,

I_{n} (θ_{0})

is called as the average Hessian matrix (the information matrix when

{e_{i}}

are normal). In addition,

J_{n} (θ_{0})

is a symmetric matrix,

1_{n}

is an n-dimensional column vector of ones,

μ_{j} = E (e_{i}^{j}), j = 2, 3, 4

,

G_{i n}

is the ith row of

G_{n}

, and

g_{n, i j}

is the

(i, j)

th entry of

G_{n}

. Besides,

diag (C)

represents a diagonal matrix with C, where C is an arbitrary vector.

Theorem 2.

Suppose that the penalty function

p_{λ} (t)

satisfies

\underset{n \to \infty}{lim inf} \underset{t \to 0^{+}}{lim inf} \frac{p_{λ_{n}}^{'} (t)}{λ_{n}} > 0,

and under the same mild conditions as these given in Theorem 1, if

λ_{n} \to 0

and

\sqrt{n} λ_{n} \to \infty

as

n \to \infty,

and

J (θ_{0}) = {lim}_{n \to \infty} J_{n} (θ_{0})

, then the

\sqrt{n}

-consistent estimator

\hat{θ} = {({({\hat{θ}}^{(1)})}^{T}, {({\hat{θ}}^{(2)})}^{T})}^{T}

in Theorem 1 must satisfy

(i): ${\hat{θ}}^{(2)} = 0$ with probability tending to 1.
(ii): $\sqrt{n} (I_{n}^{(1)} (θ_{0}^{(1)}) + A_{n}) {({\hat{θ}}^{(1)} - θ_{0}^{(1)}) + {(I_{n}^{(1)} (θ_{0}^{(1)}) + A_{n})}^{- 1} B_{n}} \overset{L}{⟶} N_{s_{1}} (0, I_{1} (θ_{0}^{(1)}) + J_{1} (θ_{0}^{(1)})),$

where

I_{n}^{(1)} (θ_{0}^{(1)})

,

I_{1} (θ_{0}^{(1)})

, and

J_{1} (θ_{0}^{(1)})

are the first super-left submatrix of

I_{n} (θ_{0})

,

I (θ_{0}) = {lim}_{n \to \infty} I_{n} (θ_{0})

and

J (θ_{0})

. In addition, “

\overset{L}{⟶}

” denotes convergence in distribution.

4. Computation

4.1. Algorithm

Since

L (θ)

is irregular at the origin, the commonly used gradient method is not applicable. Now, an iterative algorithm is developed based on the local quadratic approximation of the penalty function

p_{λ_{n}} (\cdot)

, as in Fan and Li [22].

Firstly, note that the first two derivatives of the log-likelihood function

ℓ_{n} (θ)

are continuous. Around a fixed point

θ_{0} = {(ρ_{0}, β_{0}^{T}, γ_{0}^{T})}^{T}

, we approximate the log-likelihood function by

ℓ_{n} (θ) \approx ℓ_{n} (θ_{0}) + {[\frac{\partial ℓ_{n} (θ_{0})}{\partial θ}]}^{T} (θ - θ_{0}) + \frac{1}{2} {(θ - θ_{0})}^{T} [\frac{\partial^{2} ℓ_{n} (θ_{0})}{\partial θ \partial θ^{T}}] (θ - θ_{0}) .

Moreover, given an initial value

t_{0}

, the penalty function

p_{λ_{n}}^{'} (t)

can be approximated by a quadratic function

p_{λ_{n}} (| t |) \approx p_{λ_{n}} (| t_{0} |) + \frac{1}{2} \frac{p_{λ_{n}}^{'} (| t_{0} |)}{| t_{0} |} (t^{2} - t_{0}^{2}), for t \approx t_{0} .

Therefore, we can approximate the penalized quasi-maximum likelihood function (6) by the following formula

L (θ) \approx ℓ_{n} (θ_{0}) + {[\frac{\partial ℓ_{n} (θ_{0})}{\partial θ}]}^{T} (θ - θ_{0}) + \frac{1}{2} {(θ - θ_{0})}^{T} [\frac{\partial^{2} ℓ_{n} (θ_{0})}{\partial θ \partial θ^{T}}] (θ - θ_{0}) - \frac{n}{2} θ^{T} A_{n λ} (θ_{0}) θ,

where

\begin{matrix} A_{n λ} (θ_{0}) = diag \{0, \frac{p_{λ_{n 1}^{(1)}}^{'} (| β_{01} |)}{| β_{01} |}, \dots, \frac{p_{λ_{n p}^{(1)}}^{'} (| β_{0 p} |)}{| β_{0 p} |}, \frac{p_{λ_{n 1}^{(2)}}^{'} (| γ_{01} |)}{| γ_{01} |}, \dots, \frac{p_{λ_{n q}^{(2)}}^{'} (| γ_{0 q} |)}{| γ_{0 q} |}\}, \end{matrix}

θ = {(θ_{1}, \dots, θ_{s})}^{T} = {(ρ, β_{1}, \dots, β_{p}; γ_{1}, \dots, γ_{q})}^{T}

and

θ_{0} = {(θ_{01}, \dots, θ_{0 s})}^{T} = {(ρ_{0}, β_{01}, \dots, β_{0 p}; γ_{01}, \dots, γ_{0 q})}^{T} .

Accordingly, the quadratic maximization problem for

L (θ)

leads to a solution iterated by

θ_{n e w} \approx θ_{0} + {\{\frac{\partial^{2} ℓ_{n} (ρ_{0}, β_{0}, γ_{0})}{\partial θ \partial θ^{T}} - n A_{n λ} (θ_{0})\}}^{- 1} \{n A_{n λ} (θ_{0}) θ_{0} - \frac{\partial ℓ_{n} (ρ_{0}, β_{0}, γ_{0})}{\partial θ}\} .

Secondly, based on the log-likelihood function (4) we can obtain the score functions

U (θ) = \frac{\partial ℓ_{n} (ρ, β, γ)}{\partial θ} = {(U_{1}^{T} (ρ), U_{2}^{T} (β), U_{3}^{T} (γ))}^{T},

where

U_{1} (ρ) = \frac{\partial ℓ_{n} (ρ, β, γ)}{\partial ρ} = Y^{T} W^{T} Σ^{- 1} (S Y - X β) - t r (G)

,

U_{2} (β) = \frac{\partial ℓ_{n} (ρ, β, γ)}{\partial β} = X Σ^{- 1} (S Y - X^{T} β)

,

U_{3} (γ) = \frac{\partial ℓ_{n} (ρ, β, γ)}{\partial γ} = - \frac{1}{2} \sum_{i = 1}^{n} z_{i} + \frac{1}{2} \sum_{i = 1}^{n} \frac{{({\bar{y}}_{i} - x_{i}^{T} β)}^{2}}{σ_{i}^{2}} z_{i},

S Y = {({\bar{y}}_{1}, {\bar{y}}_{2}, \dots, {\bar{y}}_{n})}^{T}

. Denote

H (θ) = \frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial θ \partial θ^{T}} = (\begin{matrix} \frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial ρ \partial ρ^{T}} & \frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial ρ \partial β^{T}} & \frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial ρ \partial γ^{T}} \\ \frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial β \partial ρ^{T}} & \frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial β \partial β^{T}} & \frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial β \partial γ^{T}} \\ \frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial γ \partial ρ^{T}} & \frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial γ \partial β^{T}} & \frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial γ \partial γ^{T}} \end{matrix}),

where

\frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial ρ \partial ρ^{T}} = - Y^{T} W^{T} Σ^{- 1} W Y - t r (G^{2})

,

\frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial β \partial β^{T}} = - X Σ^{- 1} X^{T}

,

\frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial γ \partial γ^{T}} = - \frac{1}{2} \sum_{i = 1}^{n} \frac{{({\bar{y}}_{i} - x_{i}^{T} β)}^{2}}{σ_{i}^{2}} z_{i} z_{i}^{T}

,

\frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial β \partial ρ^{T}} = - X^{T} Σ^{- 1} W Y

,

\frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial β \partial γ^{T}} = - \sum_{i = 1}^{n} \frac{({\bar{y}}_{i} - x_{i}^{T} β)}{σ_{i}^{2}} x_{i} z_{i}^{T}

,

\frac{\partial^{2} ℓ_{n} (ρ, β, γ)}{\partial γ \partial ρ^{T}} = - \sum_{i = 1}^{n} y_{w_{i}} e_{i} z_{i} / σ_{i}^{2}

, here,

Y_{w} = W Y = {(y_{w_{1}}, y_{w_{2}}, \dots, y_{w_{n}})}^{T}

. Finally, we give the following Algorithm 1, which can summarize the computation of penalized quasi-maximum likelihood estimators of the parameters in the heterogeneous SAR Models.

Algorithm 1

Step 1. The ordinary quasi-maximum likelihood estimators (without penalty)

β^{(0)}, γ^{(0)}, ρ^{(0)}

of

β, γ, ρ

are taken as their initial values.

Step 2.

θ^{(m)} = {ρ^{(m)}, {β^{(m)}}^{T}, {γ^{(m)}}^{T}}^{T}

are given as current values, then update them by

θ^{(m + 1)} = θ^{(m)} + {\{H (θ^{(m)}) - n A_{n λ} (θ^{(m)})\}}^{- 1} \{n A_{n λ} (θ^{(m)}) θ^{(m)} - U (θ^{(m)})\} .

Step 3. Repeat Step 2 above until

| θ^{(m + 1)} - θ^{(m)} | < ϵ

, where

ϵ

is a given small number, such as

10^{- 5}

.

4.2. Choosing the Tuning Parameters

The penalty function

p_{λ_{n}^{(l)}} (\cdot)

involves the tuning parameters

λ_{n}^{(l)}

(

l = 1, 2

) that control the amount of the penalty. Many selection criteria, such as CV, GCV, and BIC selection can be used to select the tuning parameters. Wang et al. [28] suggested using a BIC for the SCAD estimator in linear models and partially linear models and proved that its model selection consistency property, i.e., the optimal parameter chosen by the BIC, can identify the true model with the probability tending to one. Hence, their suggestion will be adopted in this paper. Nevertheless, in real application, how to simultaneously select a total of

p + q

shrinkage parameters

{λ_{n i}, i = 1, \dots, p + q}

is challenging. To bypass this difficulty, we follow the idea of Li et al. [18], and simplify the tuning parameters as follows,

(i): $λ_{n j}^{(1)} = \frac{λ_{n}}{| {\tilde{β}}_{j}^{(0)} |}, j = 1, \dots, p;$
(ii): $λ_{n k}^{(2)} = \frac{λ_{n}}{| {\tilde{γ}}_{k}^{(0)} |}, k = 1, \dots, q$

where

{\tilde{β}}_{j}^{(0)}

and

{\tilde{γ}}_{k}^{(0)}

are, respectively, the jth element and kth element of the unpenalized estimates

{\tilde{β}}^{(0)}

and

{\tilde{γ}}^{(0)}

. Consequently, the original

p + q

dimensional problem about

λ_{n i}

becomes a one-dimensional problem about

λ_{n}

.

λ_{n}

can be selected according to the following BIC-type criterion

B I C_{λ_{n}} = - \frac{2}{n} ℓ_{n} (\hat{ρ}, \hat{β}, \hat{γ}) + d f_{λ} \times \frac{log (n)}{n},

where

0 \leq d f_{λ} \leq s

is simply the number of nonzero coefficients of

{\hat{θ}}_{λ_{n}}

, and, here,

{\hat{θ}}_{λ_{n}} = {(\hat{ρ}, {\hat{β}}^{T}, {\hat{γ}}^{T})}^{T}

is the estimate of

θ

for a given

λ_{n}

.

The tuning parameter can be obtained as

{\hat{λ}}_{n} = arg min_{λ_{n}} B I C_{λ_{n}} .

From our simulation studies, we found that this method works well.

5. Simulation Study

In this section we conduct a simulation study to assess the small sample performance of the proposed procedure. In this simulation, we choose m to be 5 and R to be 30, 40, 60, thus

n = R \times m

to be 150, 200, and 300. X and Z are, respectively, generated from multivariate normal distribution with zero mean vector and covariance matrix being

Σ_{0} = (c_{i j})

where

c_{i j} = {0.5}^{| i - j |}, i = 1, \dots, 8, j = 1, \dots, 8

. Besides, let the spatial parameter

ρ = 0.5, 0, - 0.5

, which represents different spatial dependencies;

β = {(3, 1.8, 2.2, 0, 0, 0, 0, 0)}^{T}

and the structure of the variance model is

log (σ_{i}^{2}) = z_{i}^{T} γ

with

γ = {(1, 1, 1, 0, 0, 0, 0, 0)}^{T}

. In these simulation, the weight matrix is taken to be

W = I_{R} \otimes H_{m}, H_{m} = (l_{m} l_{m}^{T} - I_{m}) / (m - 1),

where

l_{m}

is an m-dimensional vector with all elements being 1,⊗ means Kronecker product.

We generated

M = 500

random samples of size n = 150, 200, and 300, respectively. For each random sample, the proposed variable-selection method, based on penalized quasi-maximum likelihood with SCAD and ALASSO penalty functions, is considered. The unknown tuning parameters

λ^{(l)}, (l = 1, 2)

for the penalty function are chosen by the BIC criterion in the simulation. The average number of the estimated zero coefficients with 500 simulation runs is reported in Table 1, Table 2 and Table 3. Note that “C” in the tables means the average number of zero regression coefficients that are correctly estimated as zero, and “IC” depicts the average number of non-zero regression coefficients that are erroneously set to zero. The performance of estimators

\hat{β}

, and

\hat{γ}

is assessed by the mean square error(MSE), defined as

MSE (\hat{β}) = E {(\hat{β} - β_{0})}^{T} (\hat{β} - β_{0}), MSE (\hat{γ}) = E {(\hat{γ} - γ_{0})}^{T} (\hat{γ} - γ_{0}) .

Table 1. Variable selections for

β

and

γ

under different sample sizes when

ρ = 0.5

.

Table 2. Variable selections for

β

and

γ

under different sample sizes when

ρ = 0

.

Table 3. Variable selections for

β

and

γ

under different sample sizes when

ρ = - 0.5

.

The simulation results are reported in Table 1, Table 2 and Table 3.

From Table 1, Table 2 and Table 3, we can make the following observations: (i) as n increases, the performances of variable-selection procedures become better and better. For example, the values in the column labeled ‘MSE’ of

β

and

γ

become smaller and smaller; the values in the column labeled ‘C’ become closer and closer to the true number of zero regression coefficients in the models. (ii) Based on different spatial parameters, the results of variable selection are similar. (iii) Two different penalty functions (SCAD and ALASSO) are used in this paper, and both methods perform almost equally well. (iv) Under the same circumstances, the ‘MSE’ of the mean parameters is smaller than the variance parameters’, which are very common in parametric estimation due to the fact that lower-order moments are easier to estimate than higher-order moments’.

6. Real Data Analysis

In this section, the proposed variable selection method is used to analyze the Boston housing price data that has been analyzed by many authors, for example, Pace and Gilley [29], Su and Yang [30], and so on. The database can be found in the spdep library of R. The dataset contains 14 variables with 506 observations, and a detailed description of these variables is listed in Table 4.

Table 4. Description of the variables in Boston housing data.

This dataset has been used by Pace and Gilley [29] on the basis of spatial econometric models, and longitude–latitude coordinates for the census tracts are added to the dataset. In this paper, we take MEDV as the response variable, and the other 13 variables in Table 4 are treated as explanatory variables. Similar with Pace and Gilley [29] as well as Su and Yang [30], we use the Euclidean distances in terms of longitude and latitude to generate weight matrix

W = (w_{i j})

, where

w_{i j} = \max (1 - \frac{d_{i j}}{d_{0}}, 0),

d_{i j}

is the Euclidean distance and

d_{0}

is the threshold distance, which is set to be 0.05 as in Su and Yang [30]. Thus, a spatial weight matrix is used with 19.1% nonzero elements. In addition, Z-variables in the variance model are taken to be the same as X-variables in the mean model. Then, the heterogeneous SAR models are considered here as follows:

\{\begin{matrix} Y_{i} = ρ \sum_{j = 1}^{n} w_{i j} Y_{j} + \sum_{k = 1}^{13} X_{i k} β_{k} + ε_{i}, \\ σ_{i}^{2} = exp (\sum_{k = 1}^{13} Z_{i k} γ_{k}), \\ i = 1, 2, \dots, 506 . \end{matrix}

(7)

The ordinary quasi-maximum likelihood estimators (QMLE) and the penalized quasi-maximum likelihood estimators using the SCAD and ALASSO penalty functions are all considered. The tuning parameter was selected by the BIC. The estimated spatial parameter and the regression coefficients by different penalty-estimation methods are presented in Table 5. From Table 5, we can clearly see the following facts. (i) As expected, the spatial parameters are estimated very close to each other, and the other estimates are slightly different among different methods. (ii) Both the SCAD and ALASSO methods can eliminate many unimportant variables in joint mean and variance models. Concretely, the ALASSO method can select the same variables with the SCAD method in the mean model, and the SCAD method can select two more variables than the ALASSO method in the variance model. (iii) The important explanatory variables, selected by the proposed methods, are basically consistent with existing research results, for example, the regression coefficient of

X_{11}

is negative in the mean model, which reveals that the housing price would decrease as the pupil–teacher ratio increases. In addition, based on the estimate of

γ

’s, we can obtain the estimates of

σ_{i}^{2}

based on different methods and list the scatter plot of

{\hat{σ}}_{i}^{2}

in Figure 1, Figure 2 and Figure 3, which shows that the heteroscedasticity modeling for this dataset is reasonable.

Table 5. Penalized quasi-maximum likelihood estimators for

β

and

γ

.

Figure 1. The scatter plot of

{\hat{σ}}_{i}^{2}

based on the ordinary quasi-maximum likelihood estimators (QMLE).

Figure 2. The scatter plot of

{\hat{σ}}_{i}^{2}

based on the penalized quasi-maximum likelihood using SCAD.

Figure 3. The scatter plot of

{\hat{σ}}_{i}^{2}

based on the penalized quasi-maximum likelihood using ALASSO.

7. Conclusions and Discussion

Within the framework of heterogeneous spatial autoregressive models, we proposed a variable-selection method on the basis of a penalized quasi-maximum likelihood approach. Like the mean, the variance parameters may be dependent on various explanatory variables of interest, so that simultaneous variable selection to the mean and variance models becomes important to avoid the modeling biases and reduce the model complexities. We have proven that the proposed penalized quasi-maximum likelihood estimators of the parameters in the mean and variance models are asymptotically consistent and normally distributed under some mild conditions. Simulation studies and a real data analysis of the Boston housing data are conducted to illustrate the proposed methodology. The results show that the proposed variable selection method is highly efficient and computationally fast.

Furthermore, there are several interesting issues that merit further research. For example, (i) it is interesting to increase the model flexibility by introducing nonparametric functions in the context of spatial autoregressive model and to study the variable selection for both the parametric component and the nonparametric component; (ii) a possible extension of the heterogeneous spatial autoregressive models is considered when the response variables are missing under a different missingness mechanism; and (iii) in the penalized estimation, we can also penalize the spatial parameter just like regression coefficients, which can help us directly judge whether the analyzed data have a spatial structure. These are all topics of interest and worthy of further study.

Author Contributions

Conceptualization, R.T. and D.X.; methodology, R.T. and D.X.; software, M.X. and D.X.; data curation, R.T. and D.X.; formal analysis, M.X. and D.X.; writing—original draft, R.T. and D.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Statistical Science Research Project (2021LY061), the Startup Foundation for Talents at Hangzhou Normal University (2019QDL039), and the Statistical Science Research Project of Zhejiang Province (in 2022).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Theorems

To prove the theorems in the paper, we require the following regular conditions:

C1.: The ${e_{i}}, i = 1, \dots, n$ are independent with $E (e_{i}) = 0$ and $v a r (e_{i}) = σ_{i}^{2}$ . The moment $E (| e_{i} |^{4 + v})$ exists for some $v > 0$ .
C2.: The elements $w_{n, i j} = O (1 / h_{n})$ and $w_{n, i i} = 0, i, j = 1, \dots, n$ are in W, where $h_{n} / n \to 0$ as $n \to \infty$ .
C3.: The matrix S is a nonsingular matrix.
C4.: The sequences of matrices ${W}$ and ${S^{- 1}}$ are uniformly bounded in both row and column sums.
C5.: The ${lim}_{n \to \infty} n^{- 1} X^{T} Σ^{- 1} X$ and ${lim}_{n \to \infty} n^{- 1} Z^{T} Z$ exist and are nonsingular. The elements of X and Z are uniformly bounded constants for all n.
C6.: $S^{- 1} (ρ)$ are uniformly bounded in row or column sums, uniformly in $ρ$ in a closed subset $Λ$ of $(- 1, 1)$ . The true $ρ_{0}$ is an interior point of $Λ$ .
C7.: The ${lim}_{n \to \infty} {(X, G X β_{0})}^{T} Σ^{- 1} (X, G X β_{0})$ exists and is a nonsingular matrix.
C8.: The ${lim}_{n \to \infty} E (n^{- 1} (\partial^{2} ℓ_{n} (θ_{0})) / \partial θ \partial θ^{T})$ exists.
C9.: The third derivatives $(\partial^{3} L (θ) / \partial θ_{j} \partial θ_{l} \partial θ_{m})$ exist for all $θ$ in an open set $Θ$ that contains the true parameter point $θ_{0}$ . Furthermore, there exist functions $M_{j l m}$ such that $| n^{- 1} \partial^{3} ℓ_{n} (θ) / \partial θ_{j} \partial θ_{l} \partial θ_{m} | \leq M_{j l m}$ for all $θ \in Θ$ , where $E (M_{j l m}) < \infty$ for $j, l, m$ .
C10.: The penalty function satisfies

$\underset{n \to \infty}{lim inf} \underset{θ_{j} \to 0^{+}}{lim inf} λ_{n j}^{- 1} p_{λ_{n j}}^{'} (| θ_{j} |) > 0, j = 2, \dots, s .$

Remarks: Conditions 1–8 are sufficient conditions for the correctness of the global identification, asymptotic normality, and the consistent estimation of the QMLE of the model (3), which are similar to the conditions provided by Lee [27] and Liu et al. [5]. Concretely, Condition 1 is applied in the use of the central-limit theorem of Kelejian and Prucha [31]. Condition 2 shows the features of the weight matrix. If

{h_{n}}

is a bounded sequence, Condition 2 holds. In Case’s model (Case [32]), Condition 2 is still satisfied although

h_{n}

might diverge to infinity. Condition 3 is used to guarantee the existence of mean and variance of independent variable. Condition 4 implies that the variance of Y is bounded when n tends to infinity, see Kelejian and Prucha [31] as well as Lee [27]. Condition 5 can exclude the multicollinearity of the regressors of X and Z. For convenient analysis, we assume that the regressors are uniformly bounded. If not, they can be replaced by stochastic regressors with certain finite-moment conditions (Lee [27]). Condition 6 is meaningful to deal with the nonlinearity of

l n | S (ρ) |

in a log-likelihood function. Conditions 7–8 are used for the asymptotic normality of the QMLE. Condition 9 plays an important role in the Taylor expansion of related functions, which is similar to the condition (C) provided by Fan and Li [22]. Condition 10 is an assumption on the penalty function.

In order to prove Theorems 1 and 2, we need for the log-likelihood function

ℓ_{n} (ρ, β, γ)

to have several properties, which are stated in the following Lemmas.

Lemma A1.

Suppose that Conditions 1–9 hold, then we can obtain

\frac{1}{\sqrt{n}} \frac{\partial ℓ_{n} (θ_{0})}{\partial θ} = O_{p} (1) .

Proof of Lemma A1.

By calculating the first-order partial derivatives of the log-likelihood function at

θ_{0}

, we obtain

\frac{1}{\sqrt{n}} \frac{\partial ℓ_{n} (θ_{0})}{\partial ρ} = \frac{1}{\sqrt{n}} {(G_{n} X β)}^{T} Σ^{- 1} e + \frac{1}{\sqrt{n}} e^{T} G_{n} Σ^{- 1} e - \frac{1}{\sqrt{n}} t r (G_{n}),

\frac{1}{\sqrt{n}} \frac{\partial ℓ_{n} (θ_{0})}{\partial β} = \frac{1}{\sqrt{n}} X^{T} Σ^{- 1} e,

\frac{1}{\sqrt{n}} \frac{\partial ℓ_{n} (θ_{0})}{\partial γ} = - \frac{1}{2 \sqrt{n}} \sum_{i = 1}^{n} \frac{e_{i}^{2}}{σ_{i}^{2}} Z_{i},

The variance of

\frac{1}{\sqrt{n}} \frac{\partial ℓ_{n} (θ_{0})}{\partial ρ}

is

\begin{matrix} v a r (\frac{1}{\sqrt{n}} \frac{\partial ℓ_{n} (θ_{0})}{\partial ρ}) \\ = & v a r (\frac{1}{\sqrt{n}} {(G_{n} X β)}^{T} Σ^{- 1} e + (\frac{1}{\sqrt{n}} e^{T} G_{n} Σ^{- 1} e - \frac{1}{\sqrt{n}} t r (G_{n}))) \\ \leq & 2 \frac{1}{n} [v a r ({(G_{n} X β)}^{T} Σ^{- 1} e) + v a r (e^{T} G_{n} Σ^{- 1} e - t r (G_{n}))] \\ = & \frac{2}{n} {(G_{n} X β)}^{T} (G_{n} X β) + \frac{2}{n} v a r (e^{T} G_{n} Σ^{- 1} e) \\ = & O (1) + O (\frac{1}{h_{n}}) = O (1) . \end{matrix}

Thus,

\frac{1}{\sqrt{n}} \frac{\partial ℓ_{n} (θ_{0})}{\partial ρ} = O_{p} (1)

. When the elements of X and Z are uniformly bounded for all n, it is obvious that

\frac{1}{\sqrt{n}} X^{T} Σ^{- 1} e = O_{p} (1)

. In addition, by some elementary calculations, we have the variance of

\frac{1}{\sqrt{n}} \frac{\partial ℓ_{n} (θ_{0})}{\partial γ}

is

\begin{matrix} v a r (\frac{1}{\sqrt{n}} \frac{\partial ℓ_{n} (θ_{0})}{\partial γ}) & = & \frac{1}{4 n} \sum_{i = 1}^{n} v a r (e_{i}^{2}) \frac{1}{σ_{i}^{4}} Z_{i}^{T} Z_{i} \\ = & \frac{1}{4 n} \sum_{i = 1}^{n} \frac{1}{σ_{i}^{4}} (μ_{4 i} - σ_{i}^{4}) = O (1), \end{matrix}

where

μ_{4 i} = E (e_{i}^{4})

, therefore,

\frac{1}{\sqrt{n}} \frac{\partial ℓ_{n} (θ_{0})}{\partial γ} = O_{p} (1) .

Thus, the proof of the Lemma A1 is completed. □

Lemma A2.

If Conditions 1–8 hold, then

\frac{1}{n} \frac{\partial^{2} ℓ_{n} (θ_{0})}{\partial θ \partial θ^{T}} = E (\frac{1}{n} \frac{\partial^{2} ℓ_{n} (θ_{0})}{\partial θ \partial θ^{T}}) + o_{p} (1) .

Proof of Lemma A2.

The proof of Lemma A2 is similar to the proof of Theorem 3.2 (Lee [27]), so we omit the details. □

Proof of Theorem 1.

Let

α_{n} = n^{- 1 / 2} + a_{n}

. Similar to the proof of Theorem 1 in Fan and Li [22], we just need to prove that for any given

ε > 0

, there exists a large constant C, such that

P \{sup_{∥ u ∥ = C} L (θ_{0} + α_{n} u) < L (θ_{0})\} \geq 1 - ε .

(A1)

Note that

p_{λ_{n}} (0) = 0

. By using the Taylor expansion of the log-likelihood function, we have

\begin{matrix} D_{n} (u) & = & L (θ_{0} + α_{n} u) - L (θ_{0}) \\ = & ℓ_{n} (θ_{0} + α_{n} u) - ℓ_{n} (θ_{0}) - {n \sum_{j = 2}^{s} p_{λ_{n j}} (| θ_{j 0} + α_{n} u_{j} |) - n \sum_{j = 2}^{s} p_{λ_{n j}} (| θ_{j 0} |)} \\ \leq & α_{n} {(\frac{\partial ℓ_{n} (θ_{0})}{\partial θ})}^{T} u - \frac{1}{2} u^{T} I_{n} (θ_{0}) u n α_{n}^{2} {1 + o_{p} (1)} - \\ \sum_{j = 2}^{s_{1}} [n α_{n} p_{λ_{n j}}^{'} (| θ_{j 0} |) s g n (θ_{j 0}) u_{j} + n α_{n}^{2} p_{λ_{n j}}^{″} (| θ_{j 0} |) u_{j}^{2} {1 + o (1)}] \\ = & A_{1} - A_{2} - A_{3}, \end{matrix}

where

s_{1}

is the number of dimensions of

θ_{0}^{(1)}

. It follows from Lemma A1 that

\begin{matrix} A_{1} = α_{n} {(\frac{\partial ℓ_{n} (θ_{0})}{\partial θ})}^{T} u & \leq & ∥ α_{n} {(\frac{\partial ℓ_{n} (θ_{0})}{\partial θ})}^{T} ∥ \cdot ∥ u ∥ \\ = & ∥ u ∥ \cdot O_{p} (n α_{n}^{2}) . \end{matrix}

Under Condition 8,

A_{2} = \frac{1}{2} u^{T} I_{n} (θ_{0}) u n α_{n}^{2} {1 + o_{p} (1)} = {∥ u ∥}^{2} \cdot O_{p} (n α_{n}^{2}) .

In addition,

\begin{matrix} A_{3} & = & \sum_{j = 2}^{s_{1}} [n α_{n} p_{λ_{n j}}^{'} (| θ_{j 0} |) s g n (θ_{j 0}) u_{j} + n α_{n}^{2} p_{λ_{n j}}^{″} (| θ_{j 0} |) u_{j}^{2} {1 + o (1)}] \\ \leq & \sum_{j = 2}^{s_{1}} [n α_{n} | p_{λ_{n j}}^{'} (| θ_{j 0} |) | \cdot | u_{j} | + n α_{n}^{2} p_{λ_{n j}}^{″} (| θ_{j 0} |) u_{j}^{2} {1 + o (1)}] \\ \leq & \sqrt{s_{1}} n α_{n} a_{n} ∥ u ∥ + n α_{n}^{2} max {p_{λ_{n j}}^{″} (| θ_{j 0} |) : θ_{j 0} \neq {0} ∥ u ∥}^{2} {1 + o (1)} \\ = & ∥ u ∥ \cdot O_{p} (n α_{n}^{2}) + {∥ u ∥}^{2} \cdot o_{p} (n α_{n}^{2}) . \end{matrix}

Furthermore, by choosing a sufficiently large C,

A_{1}

and

A_{3}

are dominated by

A_{2}

uniformly in

∥ u ∥ = C

. Hence, (A1) holds, which implies the proof of Theorem 1 is completed. □

Proof of Theorem 2.

We first prove part (i). According to Theorem 1, it is sufficient to show that, for any

θ^{(1)}

that satisfies

∥ θ^{(1)} - θ_{0}^{(1)} ∥ = O_{p} (n^{- 1 / 2})

and some given small

ϵ_{n} = C n^{- 1 / 2}

, when

n \to \infty

, with probability tending to 1, we have

\frac{\partial L (θ)}{\partial θ_{j}} < 0, for 0 < θ_{j} < ϵ_{n}, j = s_{1} + 1, \dots, s,

(A2)

and

\frac{\partial L (θ)}{\partial θ_{j}} > 0, for - ϵ_{n} < θ_{j} < 0, j = s_{1} + 1, \dots, s .

(A3)

By the Taylor expansion, it is easy to prove that

\begin{matrix} \frac{\partial L (θ)}{\partial θ_{j}} & = & \frac{\partial ℓ_{n} (θ)}{\partial θ_{j}} - n p_{λ_{n j}}^{'} (| θ_{j} |) s g n (θ_{j}) \\ = & \frac{\partial ℓ_{n} (θ_{0})}{\partial θ_{j}} + \sum_{l = 1}^{s} \frac{\partial^{2} ℓ_{n} (θ_{0})}{\partial θ_{j} \partial θ_{l}} (θ_{l} - θ_{l 0}) \\ + \sum_{l = 1}^{s} \sum_{m = 1}^{s} \frac{\partial^{3} ℓ_{n} (θ^{*})}{\partial θ_{j} \partial θ_{l} \partial θ_{m}} (θ_{l} - θ_{l 0}) (θ_{m} - θ_{m 0}) - n p_{λ_{n j}}^{'} (| θ_{j} |) s g n (θ_{j}) \end{matrix}

where

θ^{*}

lies between

θ

and

θ_{0}

. From Lemmas A1–A2 and Condition 9, we obtain

\frac{1}{n} \frac{\partial ℓ_{n} (θ_{0})}{\partial θ_{j}} = O_{p} (n^{- 1 / 2}),

\frac{1}{n} \frac{\partial^{2} ℓ_{n} (θ_{0})}{\partial θ \partial θ^{T}} = E (\frac{1}{n} \frac{\partial^{2} ℓ_{n} (θ_{0})}{\partial θ \partial θ^{T}} + o_{p} (1)),

\frac{1}{n} \frac{\partial^{3} ℓ_{n} (θ^{*})}{\partial θ_{j} \partial θ_{l} \partial θ_{m}} = O_{p} (1) .

When

∥ θ^{(1)} - θ_{0}^{(1)} ∥ = O_{p} (n^{- 1 / 2})

, we have

\begin{matrix} \frac{\partial L (θ)}{\partial θ_{j}} & = & n λ_{n} {λ_{n}^{- 1} \frac{1}{n} \frac{\partial ℓ_{n} (θ_{0})}{\partial θ_{j}} + λ_{n}^{- 1} \sum_{l = 1}^{s} \frac{1}{n} \frac{\partial^{2} ℓ_{n} (θ_{0})}{\partial θ_{j} \partial θ_{l}} (θ_{l} - θ_{l 0}) \\ + λ_{n}^{- 1} \sum_{l = 1}^{s} \sum_{m = 1}^{s} \frac{\partial^{3} ℓ_{n} (θ^{*})}{\partial θ_{j} \partial θ_{l} \partial θ_{m}} (θ_{l} - θ_{l 0}) (θ_{m} - θ_{m 0}) - λ_{n}^{- 1} p_{λ_{n}}^{'} (| θ_{j} |) s g n (θ_{j})} \\ = & n λ_{n} [λ_{n}^{- 1} O_{p} (n^{- 1 / 2}) + λ_{n}^{- 1} O_{p} (n^{- 1 / 2}) + λ_{n}^{- 1} O_{p} (n^{- 1}) - λ_{n}^{- 1} p_{λ_{n}}^{'} (| θ_{j} |) s g n (θ_{j})] \\ = & n λ_{n} [- λ_{n}^{- 1} p_{λ_{n}}^{'} (| θ_{j} |) s g n (θ_{j}) + O_{p} (n^{- 1 / 2} λ_{n}^{- 1})] . \end{matrix}

Note that

\underset{n \to \infty}{lim inf} \underset{δ \to 0^{+}}{lim inf} λ_{n}^{- 1} p_{λ_{n}}^{'} (δ) > 0

and

lim_{n \to \infty} n^{- 1 / 2} λ_{n}^{- 1} = 0

. The sign of the derivative is determined by that of

θ_{j}

with a sufficiently large n. This show that (A2) and (A3) follow. This completes the proof of part (i).

Next, we prove part (ii). By Theorem 1, there is a

\sqrt{n}

consistent local maximizer of

L_{n} (θ^{(1) T}, 0^{T})

, denoted as

{\hat{θ}}^{(1)}

, which satisfies

\frac{\partial L (θ)}{\partial θ_{j}} |_{θ = {(θ^{(1) T}, 0^{T})}^{T}} = 0, for j = 1, \dots, s .

For

j \geq 2

, note that

\begin{matrix} \frac{\partial L (θ)}{\partial θ_{j}} & = & \frac{\partial ℓ_{n} (θ)}{\partial θ_{j}} - n p_{λ_{n j}}^{'} (| θ_{j} |) sgn (θ_{j}) \\ = & \frac{\partial ℓ_{n} (θ_{0})}{\partial θ_{j}} + \sum_{l = 1}^{s} \{\frac{\partial^{2} ℓ_{n} (θ_{0})}{\partial θ_{j} \partial θ_{l}} + o_{p} (1)\} (θ_{l} - θ_{l 0}) \\ - n [p_{λ_{n j}}^{'} (| θ_{j 0} |) sgn (θ_{j 0}) + (p_{λ_{n j}}^{″} (| θ_{j 0} |) + o_{p} (1)) (θ_{j} - θ_{j 0})] \end{matrix}

Hence, we have

\begin{matrix} \frac{\partial ℓ_{n} (θ_{0})}{\partial θ_{j}} & = & \sum_{l = 1}^{s} \{\frac{\partial^{2} ℓ_{n} (θ_{0})}{\partial θ_{j} \partial θ_{l}} + o_{p} (1)\} ({\hat{θ}}_{l} - θ_{l 0}) \\ + n [p_{λ_{n}}^{'} (| θ_{j 0} |) sgn (θ_{j 0}) + (p_{λ_{n}}^{″} (| θ_{j 0} |) + o_{p} (1)) ({\hat{θ}}_{j} - θ_{j 0})] \end{matrix}

where

{\hat{θ}}_{l}

is the lth element of

{\hat{θ}}_{1}^{(1)}

.

Let

Ω = {(- \frac{1}{n} \frac{\partial^{2} ℓ_{n} (θ_{0})}{\partial θ_{j} \partial θ_{l}})}_{s_{1} \times s_{1}}

,

B = [I_{s_{1} \times s_{1}}, 0_{s_{1} \times (s - s_{1})}]

, then

\begin{matrix} B \frac{1}{\sqrt{n}} \frac{\partial ℓ_{n} (θ)}{\partial θ} & = & Ω \sqrt{n} ({\hat{θ}}^{(1)} - θ_{0}^{(1)}) + A_{n} \sqrt{n} ({\hat{θ}}^{(1)} - θ_{0}^{(1)}) + \sqrt{n} b_{n} + o_{p} (1) \\ = & [Ω - I_{n}^{(1)} (θ_{0}^{(1)})] \sqrt{n} ({\hat{θ}}^{(1)} - θ_{0}^{(1)}) + I_{n}^{(1)} (θ_{0}^{(1)}) \sqrt{n} ({\hat{θ}}^{(1)} - θ_{0}^{(1)}) \\ + A_{n} \sqrt{n} ({\hat{θ}}^{(1)} - θ_{0}^{(1)}) + \sqrt{n} b_{n} + o_{p} (1) \\ = & I_{n}^{(1)} (θ_{0}^{(1)}) \sqrt{n} ({\hat{θ}}^{(1)} - θ_{0}^{(1)}) + A_{n} \sqrt{n} ({\hat{θ}}^{(1)} - θ_{0}^{(1)}) + \sqrt{n} b_{n} + o_{p} (1) \\ = & \sqrt{n} [(I_{n}^{(1)} (θ_{0}^{(1)}) + A_{n}) ({\hat{θ}}^{(1)} - θ_{0}^{(1)}) + b_{n}] + o_{p} (1) . \end{matrix}

Furthermore, by using the central-limit theorem for the linear-quadratic forms of Kelejian and Prucha [31], it has

B \frac{1}{\sqrt{n}} \frac{\partial ℓ_{n} (θ_{0})}{\partial θ} \overset{L}{⟶} N (0, (I_{1}^{(1)} (θ_{0}^{(1)}) + J_{1} (θ_{0}^{(1)})) .

Therefore, according to Slutsky’s theorem, we obtain

\sqrt{n} [(I_{n}^{(1)} (θ_{0}^{(1)}) + A_{n}) ({\hat{θ}}^{(1)} - θ_{0}^{(1)}) + B_{n}] \overset{L}{⟶} N (0, (I_{1}^{(1)} (θ_{0}^{(1)}) + J_{1} (θ_{0}^{(1)})) .

The proof of Theorem 2 is, hence, completed. □

References

Cliff, A.; Ord, J.K. Spatial Autocorrelation; Pion: London, UK, 1973. [Google Scholar]
Anselin, L. Spatial Econometrics: Methods and Models; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1988. [Google Scholar]
Anselin, L.; Bera, A.K. Spatial Dependence in Linear Regression Models with an Introduction to Spatial Econometrics. In Handbook of Applied Economics Statistics; Ullah, A., Giles, D.E.A., Eds.; Marcel Dekker: New York, NY, USA, 1998. [Google Scholar]
Xu, X.B.; Lee, L.F. A spatial autoregressive model with a nonlinear transformation of the dependent variable. J. Econom. 2015, 186, 1–18. [Google Scholar] [CrossRef]
Liu, X.; Chen, J.B.; Cheng, S.L. A penalized quasi-maximum likelihood method for variable selection in the spatial autoregressive model. Spat. Stat. 2018, 25, 86–104. [Google Scholar] [CrossRef]
Xie, L.; Wang, X.R.; Cheng, W.H.; Tang, T. Variable selection for spatial autoregressive models. Commun.-Stat. Methods 2021, 50, 1325–1340. [Google Scholar] [CrossRef]
Xie, T.F.; Cao, R.Y.; Du, J. Variable selection for spatial autoregressive models with a diverging number of parameters. Stat. Pap. 2020, 61, 1125–1145. [Google Scholar] [CrossRef]
Su, L.J.; Jin, S.N. Profile quasi-maximum likelihood estimation of partially linear spatial autoregressive models. J. Econom. 2010, 157, 18–33. [Google Scholar] [CrossRef]
Du, J.; Sun, X.Q.; Cao, R.Y.; Zhang, Z.Z. Statistical inference for partially linear additive spatial autoregressive models. Spat. Stat. 2018, 25, 52–67. [Google Scholar] [CrossRef]
Cheng, S.L.; Chen, J.B. Estimation of partially linear single-index spatial autoregressive model. Stat. Pap. 2021, 62, 485–531. [Google Scholar] [CrossRef]
Wei, C.H.; Guo, S.; Zhai, S.F. Statistical inference of partially linear varying coefficient spatial autoregressive models. Econ. Model. 2017, 64, 553–559. [Google Scholar] [CrossRef]
Hu, Y.P.; Wu, S.Y.; Feng, S.Y.; Jin, J.L. Estimation in Partial Functional Linear Spatial Autoregressive Model. Mathematics 2020, 8, 1680. [Google Scholar] [CrossRef]
Lin, X.; Lee, L.F. GMM estimation of spatial autoregressive models with unknown heteroskedasticity. J. Econom. 2010, 157, 34–52. [Google Scholar] [CrossRef]
Dai, X.W.; Jin, L.B.; Tian, M.Z.; Shi, L. Bayesian Local Influence for Spatial Autoregressive Models with Heteroscedasticity. Stat. Pap. 2019, 60, 1423–1446. [Google Scholar] [CrossRef]
Wu, L.C.; Li, H.Q. Variable selection for joint mean and dispersion models of the inverse Gaussian distribution. Metrika 2012, 75, 795–808. [Google Scholar] [CrossRef]
Xu, D.K.; Zhang, Z.Z. A semiparametric Bayesian approach to joint mean and variance models. Stat. Probab. Lett. 2013, 83, 1624–1631. [Google Scholar] [CrossRef]
Zhao, W.H.; Zhang, R.Q.; Lv, Y.Z.; Liu, J.C. Variable selection for varying dispersion beta regression model. J. Appl. Stat. 2014, 41, 95–108. [Google Scholar] [CrossRef]
Li, H.Q.; Wu, L.C.; Ma, T. Variable selection in joint location, scale and skewness models of the skew-normal distribution. J. Syst. Sci. Complex. 2017, 30, 694–709. [Google Scholar] [CrossRef]
Zhang, D.; Wu, L.C.; Ye, K.Y.; Wang, M. Bayesian quantile semiparametric mixed-effects double regression models. Stat. Theory Relat. Fields 2021, 5, 303–315. [Google Scholar] [CrossRef]
Tibshirani, R. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [Google Scholar] [CrossRef]
Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429. [Google Scholar] [CrossRef] [Green Version]
Fan, J.Q.; Li, R.Z. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
Luo, G.; Wu, M. Variable selection for semiparametric varying-coefficient spatial autoregressive models with a diverging number of parameters. Commun.-Stat. Methods 2021, 50, 2062–2079. [Google Scholar] [CrossRef]
Li, R.; Liang, H. Variable selection in semiparametric regression modeling. Ann. Stat. 2008, 36, 261–286. [Google Scholar] [CrossRef] [PubMed]
Zhao, P.X.; Xue, L.G. Variable selection for semiparametric varying coefficient partially linear errorsin-variables models. J. Multivar. Anal. 2010, 101, 1872–1883. [Google Scholar] [CrossRef] [Green Version]
Tian, R.Q.; Xue, L.G.; Liu, C.L. Penalized quadratic inference functions for semiparametric varying coefficient partially linear models with longitudinal data. J. Multivar. Anal. 2014, 132, 94–110. [Google Scholar] [CrossRef]
Lee, L.F. Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica 2004, 72, 1899–1925. [Google Scholar] [CrossRef]
Wang, H.; Li, R.; Tsai, C. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 2007, 94, 553–568. [Google Scholar] [CrossRef] [PubMed]
Pace, R.K.; Gilley, O.W. Using the spatial configuration of the data to improve estimation. J. Real Estate Financ. Econ. 1997, 14, 333–340. [Google Scholar] [CrossRef]
Su, L.; Yang, Z. Instrumental Variable Quantile Estimation of Spatial Autoregressive Models; Working Paper; Singapore Management University: Singapore, 2009. [Google Scholar]
Kelejian, H.H.; Prucha, I.R. On the asymptotic distribution of the Moran I test statistic with applications. J. Econom. 2001, 104, 219–257. [Google Scholar] [CrossRef] [Green Version]
Case, A.C. Spatial patterns in household demand. Econometrica 1991, 59, 953–965. [Google Scholar] [CrossRef] [Green Version]

Figure 1. The scatter plot of

{\hat{σ}}_{i}^{2}

based on the ordinary quasi-maximum likelihood estimators (QMLE).

Figure 2. The scatter plot of

{\hat{σ}}_{i}^{2}

based on the penalized quasi-maximum likelihood using SCAD.

Figure 3. The scatter plot of

{\hat{σ}}_{i}^{2}

based on the penalized quasi-maximum likelihood using ALASSO.

Table 1. Variable selections for

β

and

γ

under different sample sizes when

ρ = 0.5

.

Table 1. Variable selections for

β

and

γ

under different sample sizes when

ρ = 0.5

.

		SCAD			ALASSO
$β$	n	MSE	C	IC	MSE	C	IC
	150	0.0043	4.9360	0	0.0041	4.9760	0
	200	0.0029	4.9480	0	0.0027	4.9900	0
	300	0.0016	4.9760	0	0.0015	4.9980	0
$γ$	n	MSE	C	IC	MSE	C	IC
	150	0.0903	4.8960	0.0040	0.0970	4.9000	0.0040
	200	0.0633	4.9040	0	0.0631	4.9400	0
	300	0.0342	4.9580	0	0.0352	4.9900	0

Table 2. Variable selections for

β

and

γ

under different sample sizes when

ρ = 0

.

Table 2. Variable selections for

β

and

γ

under different sample sizes when

ρ = 0

.

		SCAD			ALASSO
$β$	n	MSE	C	IC	MSE	C	IC
	150	0.0041	4.9180	0	0.0038	4.9700	0
	200	0.0027	4.9520	0	0.0026	4.9900	0
	300	0.0016	4.9760	0	0.0016	5.0000	0
$γ$	n	MSE	C	IC	MSE	C	IC
	150	0.0903	4.9040	0	0.0955	4.9080	0
	200	0.0632	4.9200	0	0.0645	4.9460	0
	300	0.0387	4.9260	0	0.0383	4.9760	0

Table 3. Variable selections for

β

and

γ

under different sample sizes when

ρ = - 0.5

.

Table 3. Variable selections for

β

and

γ

under different sample sizes when

ρ = - 0.5

.

		SCAD			ALASSO
$β$	n	MSE	C	IC	MSE	C	IC
	150	0.0046	4.9080	0	0.0044	4.9560	0
	200	0.0032	4.9240	0	0.0030	4.9800	0
	300	0.0017	4.9720	0	0.0017	5.0000	0
$γ$	n	MSE	C	IC	MSE	C	IC
	150	0.0919	4.8780	0	0.1033	4.8860	0
	200	0.0600	4.9220	0	0.0636	4.9500	0
	300	0.0363	4.9420	0	0.0372	4.9800	0

Table 4. Description of the variables in Boston housing data.

Variables	Description
CRIM $(X_{1})$	Per capita crime rate by town
ZN $(X_{2})$	Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS $(X_{3})$	Proportion of non-retail business acres per town
CHAS $(X_{4})$	Charles River dummy variable (=1 if tract bounds river; 0 otherwise)
NOX $(X_{5})$	Nitric-oxides concentration (parts per 10 million)
RM $(X_{6})$	Average number of rooms per dwelling
AGE $(X_{7})$	Proportion of owner-occupied units built prior to 1940
DIS $(X_{8})$	Weighted distances to five Boston employment centres
RAD $(X_{9})$	Index of accessibility to radial highways
TAX $(X_{10})$	Full-value property-tax rate per USD 10,000
PTRATIO $(X_{11})$	Pupil–teacher ratio by town
B $(X_{12})$	$1000 {(B k - 0.63)}^{2}$ , where Bk is the proportion of blacks by town
LSTAT $(X_{13})$	% of the population with lower status
MEDV $(Y)$	Median value of owner-occupied homes in USD 1000s

Table 5. Penalized quasi-maximum likelihood estimators for

β

and

γ

.

Table 5. Penalized quasi-maximum likelihood estimators for

β

and

γ

.

$Coefficient$	QMLE	SCAD	ALASSO
$β_{1} (X_{1})$	−0.0882	0	0
$β_{2} (X_{2})$	0.0699	0	0
$β_{3} (X_{3})$	−0.0427	0	0
$β_{4} (X_{4})$	0.0038	0	0
$β_{5} (X_{5})$	−0.0145	0	0
$β_{6} (X_{6})$	0.4641	0.5709	0.6023
$β_{7} (X_{7})$	−0.1509	−0.1788	−0.1489
$β_{8} (X_{8})$	−0.2076	−0.1287	−0.1175
$β_{9} (X_{9})$	0.2008	0.1440	0.0502
$β_{10} (X_{10})$	−0.2031	−0.2049	−0.1502
$β_{11} (X_{11})$	−0.1073	−0.0750	−0.0829
$β_{12} (X_{12})$	0.1440	0.1638	0.1033
$β_{13} (X_{13})$	−0.0597	0	0
$γ_{1} (Z_{1})$	0.1265	0	0
$γ_{2} (Z_{2})$	0.1020	0	0
$γ_{3} (Z_{3})$	−0.0830	0	0
$γ_{4} (Z_{4})$	0.0823	0	0
$γ_{5} (Z_{5})$	−0.4000	−0.3673	0
$γ_{6} (Z_{6})$	0.1113	0	0
$γ_{7} (Z_{7})$	0.3637	0.3461	0.2357
$γ_{8} (Z_{8})$	−0.4004	−0.3877	−0.2381
$γ_{9} (Z_{9})$	0.7366	0.8952	0.9143
$γ_{10} (Z_{10})$	0.2920	0.2105	0
$γ_{11} (Z_{11})$	−0.3533	−0.4085	−0.2723
$γ_{12} (Z_{12})$	−0.0141	0	0
$γ_{13} (Z_{13})$	−0.4262	−0.4586	−0.3929
$ρ$	0.2531	0.2881	0.2809

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Variable Selection of Heterogeneous Spatial Autoregressive Models via Double-Penalized Likelihood

Abstract

1. Introduction

2. Variable Selection via Penalized Quasi-Maximum Likelihood

2.1. Heterogeneous SAR Models

2.2. Penalized Quasi-Maximum Likelihood

3. Asymptotic Properties

4. Computation

4.1. Algorithm

4.2. Choosing the Tuning Parameters

5. Simulation Study

6. Real Data Analysis

7. Conclusions and Discussion

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A. Proof of Theorems

References

Article Metrics

Citations

Article Access Statistics