#### 2.1. Pareto Regression with Spatial Random Effects

In many regression problems, the normality assumption may not hold. Generalized linear models extend linear regression by connecting the response variable to the linear predictor through a proper link function. For heavy-tailed data with a minimum value, it is common to fit a Pareto model. From the expression of the Gutenberg–Richter law, it is possible to derive a relationship for the logarithm of the probability of exceeding a given magnitude, and the standard distribution used for seismic moment is the Pareto distribution. The Pareto distribution has a natural threshold. In practice, “micro” (magnitude 1–1.9) and “minor” (magnitude 2–2.9) earthquakes receive little consideration. Compared with the exponential distribution, the Pareto distribution is heavy tailed. Heavy-tailed distributions tend to produce many outliers with very high values: the heavier the tail, the larger the probability of observing one or more disproportionate values in a sample. In earthquake data, most recorded earthquakes have a magnitude around 3–5, but occasionally there are significant earthquakes with large magnitudes. Hu [5] used Pareto regression to model earthquake magnitudes, since the Pareto distribution is a heavy-tailed distribution with a threshold, and earthquake magnitude data also have a threshold because only earthquakes above a certain magnitude are considered. Based on the generalized linear model setting, we can build the Pareto regression model as

$$Y(\mathit{s})\sim \mathrm{Pareto}\left({z}_{m},\alpha (\mathit{s})\right),\phantom{\rule{1em}{0ex}}\mathrm{log}\,\alpha (\mathit{s})=\mu (\mathit{s}),$$

where $\mathit{s}\in D\subset {\mathcal{R}}^{2}$ is a spatial location, $\mu (\mathit{s})={\beta}_{0}+{\beta}_{1}{X}_{1}(\mathit{s})+\dots +{\beta}_{p}{X}_{p}(\mathit{s})$, ${X}_{i}(\mathit{s})$ is the $i$-th covariate at location $\mathit{s}$, and ${z}_{m}$ is the minimum value of the response variable. Under this model, the log shape parameter is modeled with a fixed-effects term.
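As an illustration (not from the paper's code), the log-likelihood of this Pareto regression model can be written in a few lines; the names `X`, `beta`, and `z_m` are ours for the design matrix, coefficients, and threshold:

```python
import numpy as np

def pareto_loglik(y, X, beta, z_m):
    """Log-likelihood of the Pareto regression model:
    y_i ~ Pareto(z_m, alpha_i) with log(alpha_i) = x_i' beta (log link)."""
    alpha = np.exp(X @ beta)  # shape parameter via the log link
    # Pareto density: f(y) = alpha * z_m^alpha / y^(alpha + 1), for y >= z_m
    return np.sum(np.log(alpha) + alpha * np.log(z_m) - (alpha + 1.0) * np.log(y))
```

A larger fitted shape $\alpha (\mathit{s})$ corresponds to a lighter tail above the threshold, so covariates shift the tail behavior through the log link.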

The model in Equation (1) does not include spatial random effects. Consequently, it is implicitly assumed that $\alpha (\mathit{s})$ and $\alpha (\mathit{w})$ are independent for $\mathit{s}\ne \mathit{w}$. For many spatial data, however, it is not realistic to assume that $\alpha (\mathit{s})$ and $\alpha (\mathit{w})$ are independent. We can add a latent Gaussian process to the log-linear model so that the generalized linear model becomes a generalized linear mixed model (GLMM). Specifically, we assume

$$\mathit{W}\sim \mathrm{MVN}\left(\mathbf{0},{\sigma}_{w}^{2}\mathit{H}(\varphi )\right),$$

where $\mathit{W}$ is the $n$-dimensional vector ${(w({\mathit{s}}_{1}),\dots ,w({\mathit{s}}_{n}))}^{\prime}$, $\mathit{H}(\varphi )$ is an $n\times n$ spatial correlation matrix, and $\{{\mathit{s}}_{1},\dots ,{\mathit{s}}_{n}\}\in D$ are the observed spatial locations. This is a natural strategy for modeling spatial correlation in light of Tobler's first law that “near things are more related than distant things” [18]. Spatial random effects allow one to leverage information from nearby locations, and latent Gaussian process models have become a standard method for modeling spatial random effects [19]. Under the Gaussian process structure, nearby observations have higher correlation.
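To make the covariance structure concrete, the sketch below builds a spatial correlation matrix and draws the spatial random effects. The exponential correlation family is an assumption for illustration, since the paper does not specify the form of $\mathit{H}(\varphi )$:

```python
import numpy as np

def exp_corr_matrix(locs, phi):
    """H_ij = exp(-||s_i - s_j|| / phi): correlation decays with distance."""
    d = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
    return np.exp(-d / phi)

def sample_spatial_effects(locs, phi, sigma2_w, rng):
    """Draw W ~ MVN(0, sigma2_w * H(phi)) via a Cholesky factor."""
    H = exp_corr_matrix(locs, phi)
    L = np.linalg.cholesky(H + 1e-10 * np.eye(len(locs)))  # jitter for stability
    return np.sqrt(sigma2_w) * (L @ rng.standard_normal(len(locs)))
```

Nearby locations get correlation close to one and distant locations correlation close to zero, which is exactly the "near things are more related" behavior described above.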

For the latent Gaussian process GLMM, we can build the following hierarchical model:

where “IG” is shorthand for inverse gamma, “MVN” for multivariate normal, and “N” for a univariate normal distribution. For the Pareto regression model, the normal prior is not conjugate, and a proper conjugate prior would facilitate the development of an efficient computational algorithm. Chen and Ibrahim [16] proposed a novel class of conjugate priors for the family of generalized linear models, but they did not show the connection between their conjugate prior and the Gaussian prior. Bradley et al. [20] proposed the multivariate log-gamma distribution as a conjugate prior for the Poisson spatial regression model and established a connection between the multivariate log-gamma and multivariate normal distributions. The multivariate log-gamma distribution is thus an attractive alternative prior for the Pareto regression model due to its conjugacy.

We now present the multivariate log-gamma distribution from Bradley et al. [20]. Define the $n$-dimensional random vector $\mathit{\gamma}={({\gamma}_{1},\dots ,{\gamma}_{n})}^{\prime}$, which consists of $n$ mutually independent log-gamma random variables with shape and scale parameters organized into the $n$-dimensional vectors $\mathit{\alpha}\equiv {({\alpha}_{1},\dots ,{\alpha}_{n})}^{\prime}$ and $\mathit{\kappa}\equiv {({\kappa}_{1},\dots ,{\kappa}_{n})}^{\prime}$, respectively. Then define the $n$-dimensional random vector $\mathit{q}$ as follows:

$$\mathit{q}=\mathit{\mu}+\mathit{V}\mathit{\gamma},$$

where $\mathit{V}\in {\mathcal{R}}^{n\times n}$ and $\mathit{\mu}\in {\mathcal{R}}^{n}$. Bradley et al. [20] called $\mathit{q}$ the multivariate log-gamma random vector. The random vector $\mathit{q}$ has the following probability density function:

$$f(\mathit{q}\mid \mathit{\mu},\mathit{V},\mathit{\alpha},\mathit{\kappa})=\frac{1}{\mathrm{det}(\mathit{V})}\left\{\prod_{i=1}^{n}\frac{1}{\Gamma ({\alpha}_{i}){\kappa}_{i}^{{\alpha}_{i}}}\right\}\mathrm{exp}\left\{{\mathit{\alpha}}^{\prime}{\mathit{V}}^{-1}(\mathit{q}-\mathit{\mu})-{\mathit{\kappa}}^{(-1)\prime}\mathrm{exp}\{{\mathit{V}}^{-1}(\mathit{q}-\mathit{\mu})\}\right\},$$

where “det” represents the determinant function and ${\mathit{\kappa}}^{(-1)}\equiv {({\kappa}_{1}^{-1},\dots ,{\kappa}_{n}^{-1})}^{\prime}$. We use “$\mathrm{MLG}\left(\mathit{\mu},\mathit{V},\mathit{\alpha},\mathit{\kappa}\right)$” as a shorthand for the probability density function in Equation (6).
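Because each ${\gamma}_{i}$ is the logarithm of a gamma random variable, sampling $\mathit{q}$ is straightforward. The sketch below is an illustration under the shape/scale convention used here, so that $\mathrm{exp}({\gamma}_{i})\sim \mathrm{Gamma}(\mathrm{shape}={\alpha}_{i},\mathrm{scale}={\kappa}_{i})$; the function name is ours:

```python
import numpy as np

def sample_mlg(mu, V, alpha, kappa, rng):
    """One draw of q = mu + V @ gamma, where gamma_i = log(g_i) and
    g_i ~ Gamma(shape=alpha_i, scale=kappa_i) independently."""
    g = rng.gamma(shape=alpha, scale=kappa)
    return mu + V @ np.log(g)
```

This construction only requires independent gamma draws and one matrix–vector product, which is what makes the MLG process convenient as a prior.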

According to Bradley et al. [20], the latent Gaussian process is a special case of the latent multivariate log-gamma process. Suppose $\mathit{\beta}$ has the multivariate log-gamma distribution $\mathrm{MLG}(\mathbf{0},{\alpha}^{1/2}\mathit{V},\alpha \mathbf{1},(1/\alpha )\mathbf{1})$. As $\alpha \to \infty $, $\mathit{\beta}$ converges in distribution to a multivariate normal vector with mean $\mathbf{0}$ and covariance matrix $\mathit{V}{\mathit{V}}^{\prime}$; in practice, $\alpha =10,000$ is sufficiently large for this approximation. The MLG model is therefore more saturated than the Gaussian process model, and for the Pareto regression model the MLG process is more computationally efficient than the Gaussian process. In the following hierarchical model, $\mathit{\beta}$ and $\mathit{W}$ play the role of $\mathit{q}$ in MLG distributions, with ${\mathbf{0}}_{p}$ and ${\mathbf{0}}_{n}$ as the first parameter of the MLG (corresponding to $\mathit{\mu}$) and ${\Sigma}_{\beta}^{1/2}$ and ${\Sigma}_{W}^{1/2}$ as the second parameter (corresponding to $\mathit{V}$).
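The normal-approximation claim can be checked numerically. The snippet below (an illustration, not from the paper) draws from $\mathrm{MLG}(\mathbf{0},{\alpha}^{1/2}\mathit{V},\alpha \mathbf{1},(1/\alpha )\mathbf{1})$ with a large $\alpha$ and compares the sample covariance with the limiting $\mathit{V}{\mathit{V}}^{\prime}$; the particular $\mathit{V}$ is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = 10_000.0
V = np.array([[1.0, 0.0], [0.5, 1.0]])
n_draws = 100_000

# gamma_i = log(g_i) with g_i ~ Gamma(shape=alpha, scale=1/alpha),
# so each gamma_i has mean ~ -1/(2*alpha) and variance ~ 1/alpha.
gam = np.log(rng.gamma(shape=alpha, scale=1.0 / alpha, size=(n_draws, 2)))
q = (np.sqrt(alpha) * gam) @ V.T   # rows are draws of beta = alpha^{1/2} V gamma

cov_hat = np.cov(q, rowvar=False)  # should be close to V V' for large alpha
target = V @ V.T
```

With $\alpha =10,000$ the sample mean is near $\mathbf{0}$ and the sample covariance is near $\mathit{V}{\mathit{V}}^{\prime}$, consistent with the stated convergence.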

In order to establish conjugacy, we build a spatial GLM with a latent multivariate log-gamma process as follows:

$$\begin{array}{rl}Z({\mathit{s}}_{i})\mid \mathit{\beta},\mathit{W}& \sim \mathrm{Pareto}\left({Z}_{m},\mathrm{exp}\{\mu ({\mathit{s}}_{i})\}\right),\phantom{\rule{1em}{0ex}}i=1,\dots ,n,\\ \mathit{\beta}& \sim \mathrm{MLG}\left({\mathbf{0}}_{p},{\Sigma}_{\beta}^{1/2},{\alpha}_{\beta}\mathbf{1},{\kappa}_{\beta}\mathbf{1}\right),\\ \mathit{W}& \sim \mathrm{MLG}\left({\mathbf{0}}_{n},{\Sigma}_{W}^{1/2},{\alpha}_{W}\mathbf{1},{\kappa}_{W}\mathbf{1}\right),\end{array}$$

where ${Z}_{m}$ is the baseline (threshold), $\mu ({\mathit{s}}_{i})=\mathit{X}({\mathit{s}}_{i})\mathit{\beta}+w({\mathit{s}}_{i})$, ${\Sigma}_{W}={\sigma}_{w}^{2}\mathit{H}(\varphi )$, ${\Sigma}_{\beta}={\sigma}^{2}\mathrm{diag}(p)$, ${\alpha}_{W}>0$, ${\alpha}_{\beta}>0$, ${\kappa}_{W}>0$, and ${\kappa}_{\beta}>0$.

#### 2.2. Bayesian Model Assessment Criteria

In this section, we consider two Bayesian model assessment criteria, DIC and LPML, and introduce the procedure to calculate them for the Pareto regression model with spatial random effects. Let ${\mathit{\beta}}^{(M)}$ denote the vector of regression coefficients under the full model $M$. Also let ${\mathit{\beta}}^{(m)}$ and ${\mathit{\beta}}^{(-m)}$ denote the corresponding vectors of regression parameters included in and excluded from the subset model $m$. Then ${\mathit{\beta}}^{(M)}=\mathit{\beta}={({({\mathit{\beta}}^{(m)})}^{\prime},{({\mathit{\beta}}^{(-m)})}^{\prime})}^{\prime}$ holds for all $m$, and ${\mathit{\beta}}^{(-M)}=\varnothing $.

#### 2.2.1. DIC

The deviance information criterion is defined as

$$\mathrm{DIC}=\mathrm{Dev}(\overline{\theta})+2{p}_{D},$$

where $\mathrm{Dev}(\theta )$ is the deviance function, ${p}_{D}=\overline{\mathrm{Dev}}(\theta )-\mathrm{Dev}(\overline{\theta})$ is the effective number of model parameters, $\overline{\theta}$ is the posterior mean of the parameters $\theta $, and $\overline{\mathrm{Dev}}(\theta )$ is the posterior mean of $\mathrm{Dev}(\theta )$. To carry out variable selection, we specify the deviance function as

$$\mathrm{Dev}({\mathit{\beta}}^{(m)})=-2\sum_{i=1}^{n}\mathrm{log}\,f({D}_{i}\mid {\mathit{\beta}}^{(m)}),$$

where ${D}_{i}=({Y}_{i},{\mathit{X}}_{i},{\widehat{W}}_{i})$, $f(\cdot )$ is the likelihood function in Equation (7), ${\widehat{W}}_{i}$ is the posterior mean of the spatial random effects at location ${\mathit{s}}_{i}$, and ${\mathit{\beta}}^{(m)}$ is the vector of regression coefficients under the $m$-th model. In this way, the DIC criterion is given by

$$\mathrm{DIC}(m)=\mathrm{Dev}({\overline{\mathit{\beta}}}^{(m)})+2{p}_{D}^{(m)},$$

where ${p}_{D}^{(m)}=\overline{\mathrm{Dev}({\mathit{\beta}}^{(m)})}-\mathrm{Dev}({\overline{\mathit{\beta}}}^{(m)})$, ${\overline{\mathit{\beta}}}^{(m)}=E[{\mathit{\beta}}^{(m)}\mid D]$, and $\overline{\mathrm{Dev}({\mathit{\beta}}^{(m)})}=E\left[\mathrm{Dev}({\mathit{\beta}}^{(m)})\mid D\right]$.
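Given posterior draws, the DIC computation reduces to two averages. The helper below is an illustrative sketch (names are ours): `loglik_draws[b]` holds the log-likelihood evaluated at the $b$-th posterior draw of ${\mathit{\beta}}^{(m)}$ (with ${\widehat{W}}_{i}$ plugged in), and `loglik_at_mean` is the log-likelihood evaluated at ${\overline{\mathit{\beta}}}^{(m)}$:

```python
import numpy as np

def dic(loglik_draws, loglik_at_mean):
    """DIC = Dev(beta_bar) + 2 * p_D, with Dev = -2 * log-likelihood and
    p_D = mean_b Dev(beta_b) - Dev(beta_bar)."""
    dev_bar = -2.0 * np.mean(loglik_draws)  # posterior mean of the deviance
    dev_at_mean = -2.0 * loglik_at_mean     # deviance at the posterior mean
    p_d = dev_bar - dev_at_mean             # effective number of parameters
    return dev_at_mean + 2.0 * p_d
```

Smaller DIC values indicate a better trade-off between goodness of fit and model complexity.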

#### 2.2.2. LPML

In order to calculate the LPML, we first need to calculate the CPO [14]. The LPML is then obtained as

$$\mathrm{LPML}=\sum_{i=1}^{n}\mathrm{log}\,{\mathrm{CPO}}_{i},$$

where ${\mathrm{CPO}}_{i}$ is the CPO for the $i$-th subject.

Let ${D}_{(-i)}$ denote the data with the $i$-th observation deleted. The CPO for the $i$-th subject is defined as

$${\mathrm{CPO}}_{i}=f({Y}_{i}\mid {D}_{(-i)})=\int f({Y}_{i}\mid \mathit{\beta})\,\pi (\mathit{\beta}\mid {D}_{(-i)})\,d\mathit{\beta},$$

where $\pi (\mathit{\beta}|{D}_{(-i)})$ is the posterior distribution based on the data ${D}_{(-i)}$.

From Chapter 10 of Chen et al. [21], the CPO in (13) can be rewritten as

$${\mathrm{CPO}}_{i}={\left\{\int \frac{\pi (\mathit{\beta}\mid D)}{f({Y}_{i}\mid \mathit{\beta})}\,d\mathit{\beta}\right\}}^{-1}.$$

This representation leads to a popular Monte Carlo estimate of the CPO using Gibbs samples from the posterior distribution based on $D$ instead of ${D}_{(-i)}$. Letting $\{{\mathit{\beta}}_{b},b=1,\cdots ,B\}$ denote a Gibbs sample of $\mathit{\beta}$ from $\pi (\mathit{\beta}|D)$ and using (14), a Monte Carlo estimate of ${\mathrm{CPO}}_{i}^{-1}$ is given by

$${\widehat{\mathrm{CPO}}}_{i}^{-1}=\frac{1}{B}\sum_{b=1}^{B}\frac{1}{f({Y}_{i}\mid {\mathit{\beta}}_{b})}.$$

In the context of variable selection, we select the subset model with the largest LPML value and/or the smallest DIC value. In practice, if the two criteria give different results, we take both selected models as the best candidates and carry out further diagnostics on them. DIC trades off the goodness of fit against the complexity of the model; the CPO is based on leave-one-out cross-validation, and the LPML, the sum of the log CPOs, is an estimator of the log marginal likelihood.
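The Monte Carlo CPO estimate averages $1/f({Y}_{i}\mid {\mathit{\beta}}_{b})$ over Gibbs draws, and LPML sums the log CPOs. The sketch below is illustrative (names are ours) and uses a log-sum-exp trick so that the average of reciprocal likelihoods does not overflow:

```python
import numpy as np

def lpml(loglik_matrix):
    """loglik_matrix[b, i] = log f(Y_i | beta_b) for Gibbs draws b = 1..B.
    Estimates CPO_i^{-1} by (1/B) * sum_b 1 / f(Y_i | beta_b), then returns
    LPML = sum_i log CPO_i."""
    B = loglik_matrix.shape[0]
    neg = -loglik_matrix                     # log of 1 / f(Y_i | beta_b)
    m = np.max(neg, axis=0)
    # log CPO_i^{-1} via a numerically stable log-sum-exp over the B draws
    log_inv_cpo = m + np.log(np.sum(np.exp(neg - m), axis=0)) - np.log(B)
    return -np.sum(log_inv_cpo)
```

Larger LPML values indicate better out-of-sample predictive performance, which is why the subset model with the largest LPML is preferred.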

#### 2.3. Analytic Connections between Bayesian Variable Selection Criteria and Conditional AIC for the Normal Linear Regression with Spatial Random Effects

The Akaike information criterion (AIC) has been applied to choose candidate models in the mixed-effects model by integrating out the random effects. A conditional AIC (cAIC) was proposed for the linear mixed-effects model [22] under the assumption that the variance–covariance matrix of the random effects is known. Under this assumption, we establish analytic connections of the DIC and LPML proposed in Section 2.2 with the cAIC. We consider the following linear regression model with spatial random effects:

$${y}_{i}={\mathit{X}}_{i}\mathit{\beta}+{w}_{i}+{\epsilon}_{i},\phantom{\rule{1em}{0ex}}{\epsilon}_{i}\stackrel{iid}{\sim}\mathrm{N}(0,{\sigma}^{2}),\phantom{\rule{1em}{0ex}}i=1,\dots ,n,$$

where $\mathit{\beta}$ is a $p\times 1$ vector of fixed effects and ${w}_{i}$ is the spatial random effect for individual $i$. The cAIC is defined as:

where $\mathit{X}$ has full rank $k$. Plugging in the MLE of $\mathit{\beta}$, we have

where $\mathrm{SSE}={(\mathit{y}-\widehat{\mathit{y}})}^{\prime}(\mathit{y}-\widehat{\mathit{y}})$, $\widehat{\mathit{y}}={({\widehat{y}}_{1},\dots ,{\widehat{y}}_{n})}^{\prime}$, and ${\widehat{y}}_{i}={\mathit{X}}_{i}\widehat{\mathit{\beta}}+{\widehat{w}}_{i}$.

From [14], we can obtain the DIC and LPML for the linear regression model with spatial random effects as follows

and

where ${\mathrm{SSE}}^{\ast}$ is calculated at the posterior mean, ${a}_{0}=0$ under the conjugate prior for the likelihood model, $R=-\frac{2{(1+{a}_{0})}^{2}}{1+2{a}_{0}}{R}^{\ast}$, and ${R}^{\ast}$ is the remainder of the Taylor expansion. Thus, under the conjugate prior, our proposed Bayesian variable selection criteria are similar to the cAIC for the linear regression model with spatial random effects.