Generalized linear models are routinely used in many environmental statistics problems, such as earthquake magnitude prediction. Hu et al. proposed Pareto regression with spatial random effects for earthquake magnitudes. In this paper, we propose Bayesian spatial variable selection for Pareto regression, building on Bradley et al. and Hu et al., to tackle the variable selection problem in generalized linear regression models with spatial random effects. A Bayesian hierarchical latent multivariate log-gamma model framework is applied to account for spatial random effects and capture spatial dependence. We use two Bayesian model assessment criteria for variable selection: the conditional predictive ordinate (CPO) and the deviance information criterion (DIC). Furthermore, we show that these two Bayesian criteria have analytic connections with the conditional AIC under the linear mixed model setting. We examine the empirical performance of the proposed method via a simulation study and further demonstrate its applicability in an analysis of earthquake data obtained from the United States Geological Survey (USGS).
Earthquake magnitude data have become increasingly popular over the last decade. Statistical models for earthquakes have been proposed since the 1800s. Since large earthquakes are rare, it is difficult to fit simple linear models. Many different parametric models (e.g., the gamma model and the Weibull model) have been considered for analyzing earthquake magnitudes, but some earthquakes with very small magnitudes are not reported by seismic centers. The Pareto-type distribution is a popular choice for analyzing earthquake magnitude data (e.g., [1,2,3]), as the Pareto distribution is a heavy-tailed distribution with a lower threshold. In statistical analysis, a regression model is used to connect covariates of earthquakes to the magnitude of the earthquake. A generalized linear model strategy can be used for Pareto regression. The existing seismology literature pays less attention to the spatially dependent structure of earthquake magnitudes; it mostly builds simple linear regression models or generalized linear models to explore covariate effects on earthquake magnitudes. Hu and Bradley proposed using Pareto regression with spatial random effects for earthquake magnitudes, but they did not consider the model selection problem. In order to gain a more explicit understanding of the covariates of earthquake magnitudes, variable selection approaches should be considered in a Pareto regression model with spatial random effects.
Variable selection and Bayesian statistics have received widespread attention and have become increasingly important tools in environmental and ecological research [6,7]. For hierarchical spatial models, inference for latent variables is difficult. The Bayesian approach provides a convenient way to estimate latent variables in hierarchical models. Compared with the frequentist approach, a Bayesian approach can incorporate prior information on the parameters of the model. Model assessment is also an important part of a statistical analysis. In practice, we may want to measure how well a model answers a certain question, or to compare different models to see which is best suited. There are many popular variable selection criteria, including Akaike’s information criterion (AIC), the Bayesian information criterion (BIC), the Bayes factor, the conditional predictive ordinate (CPO) [10,11], the L measure, and the deviance information criterion (DIC). Chen et al. provide the connections between these popular criteria for variable subset selection under generalized linear models. However, Bayesian variable selection can be difficult to carry out because of the challenge of assigning prior distributions to the parameters. To tackle this issue, we consider the multivariate log-gamma distribution (MLG) of Bradley et al., which is conjugate with the Pareto distribution. Hence, the Bayesian approach to variable selection is straightforward for our model. Consequently, we use the CPO and DIC criteria to carry out Bayesian variable selection for Pareto regression models, owing to the performance of the conjugate priors.
Both CPO and DIC are criterion-based methods, and they have some advantages over other criteria. Compared with regularized estimation approaches, these two criteria consider the goodness of fit of the candidate models. Furthermore, compared with the negative log probability density or the RMSE of predictions, these two criteria account for model complexity. Like the AIC or BIC, these two criteria balance the tradeoff between goodness of fit and model complexity. The CPO provides a site-specific model fit metric that can be used for exploratory analysis and can be combined across sites to generate the logarithm of the pseudo marginal likelihood (LPML) as an overall model fit measure. The CPO is based on leave-one-out cross-validation: it estimates the probability of observing the data at one particular location in the future, given that the data at all other locations have already been observed. The LPML is a leave-one-out cross-validation criterion based on the log likelihood, and it can be easily obtained from Markov chain Monte Carlo (MCMC) output. More details about the two criteria are discussed in Section 2.2. The major contribution of this paper is that we introduce two Bayesian model selection criteria for generalized linear models with spatial random effects. Furthermore, we examine the relationship between the two criteria and the conditional AIC (cAIC) in the random effects model. Beyond the variable selection problem in regression models, our criteria can also be used for model selection in the presence of spatial random effects. In general, our proposed criteria can select important covariates and the random effects model simultaneously.
The remaining sections of this article are organized as follows. Section 2 introduces our proposed statistical model and reviews two Bayesian model assessment criteria, LPML and DIC. In Section 3 and Section 4, we present the MCMC scheme and a simulation study for two scenarios, and use the two criteria to select the true model. In Section 5, we carry out a detailed analysis of the US earthquake dataset from the United States Geological Survey (USGS) and use the two criteria to select the best model(s). Finally, Section 6 contains a brief summary of this paper. For ease of exposition, all proofs are given in Appendix A.
2.1. Pareto Regression with Spatial Random Effects
In many regression problems, the normality assumption may not hold. Generalized linear models connect the response variable to a linear predictor through a proper link function. For heavy-tailed data with a minimum value, it is common to use the Pareto model. From the Gutenberg–Richter law, it is possible to derive a relationship for the logarithm of the probability of exceeding a given magnitude. The standard distribution used for seismic moment is the Pareto distribution. The Pareto distribution has a natural threshold. In practice, people pay little attention to “micro” (magnitude 1–1.9) or “minor” (magnitude 2–2.9) earthquakes. Compared with the exponential distribution, the Pareto distribution is heavy-tailed. Heavy-tailed distributions tend to produce many outliers with very high values: the heavier the tail, the larger the probability of getting one or more disproportionately large values in a sample. In earthquake data, most recorded earthquakes have a magnitude around 3–5, but occasionally there are significant earthquakes with large magnitudes. Hu used Pareto regression to model earthquake magnitudes, since the Pareto distribution is a heavy-tailed distribution with a threshold. Earthquake magnitude data also have a threshold, since only earthquakes over a certain magnitude are considered. Based on the generalized linear model setting, we can build the Pareto regression model as
where is a spatial location, , is i-th covariate on location and is the minimum value of the response variable. Under this model, the log shape parameter is modeled with a fixed effects term.
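To make the structure of this model concrete, the following minimal Python sketch evaluates a Pareto log-likelihood in which the log shape parameter is a linear function of the covariates and the scale is the known minimum value. The function name `pareto_loglik` and its argument names are hypothetical, and the density used is the textbook Pareto form $f(z) = \alpha z_m^{\alpha} / z^{\alpha+1}$ for $z \ge z_m$; it is an illustration under these assumptions rather than the exact parameterization of the model above.

```python
import math


def pareto_loglik(z, x, beta, z_min):
    """Log-likelihood of Pareto responses whose log shape is linear in covariates.

    z     : observed magnitudes, each expected to be >= z_min
    x     : covariate rows, one row per location
    beta  : regression coefficients (fixed effects only)
    z_min : known lower threshold (the Pareto scale parameter)
    """
    total = 0.0
    for z_i, x_i in zip(z, x):
        if z_i < z_min:
            return float("-inf")  # outside the Pareto support
        # Log link: log(shape) = x_i' beta, so the shape is always positive.
        log_shape = sum(b * v for b, v in zip(beta, x_i))
        shape = math.exp(log_shape)
        # log f(z) = log(shape) + shape * log(z_min) - (shape + 1) * log(z)
        total += log_shape + shape * math.log(z_min) - (shape + 1.0) * math.log(z_i)
    return total
```

With a single intercept-only observation, `pareto_loglik([6.0], [[1.0]], [0.0], 3.0)` reduces to the log density of a Pareto distribution with shape one and threshold three evaluated at six.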
The model in Equation (1) does not include spatial random effects. Consequently, it is implicitly assumed that and are independent for . But for many spatial data, it is not realistic to assume that and are independent. We can add the latent Gaussian process in the log-linear model so that the generalized linear model becomes a generalized linear mixed model (GLMM). Specifically, we assumed
where is an n-dimensional vector of , is a spatial correlation matrix, and are the observed spatial locations. The natural strategy to consider spatial correlation is to use in light of Tobler’s first law that “near things are more related than distant things” . Spatial random effects allow one to leverage information from nearby locations. Latent Gaussian process models have become a standard method for modeling spatial random effects . Based on Gaussian process structure, the nearby observations will have higher correlation.
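As an illustration of how a spatial correlation matrix encodes the idea that nearby observations are more strongly correlated, the sketch below builds a correlation matrix from pairwise distances. The exponential kernel $\exp(-d/\phi)$ and the names `exponential_correlation` and `phi` are assumptions made for this example; the correlation function used in the model may differ.

```python
import math


def exponential_correlation(locations, phi):
    """Correlation matrix with entries exp(-distance / phi), for a range phi > 0."""
    n = len(locations)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Euclidean distance between locations i and j
            d = math.dist(locations[i], locations[j])
            H[i][j] = math.exp(-d / phi)
    return H
```

The diagonal entries equal one, and the correlation decays toward zero as the distance between two locations grows relative to `phi`.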
For the latent Gaussian process GLMM, we can build the following hierarchical model:
where “IG” is shorthand for inverse gamma, “MVN” is shorthand for multivariate normal, and “N” is shorthand for a univariate normal distribution. For the Pareto regression model, the normal prior is not conjugate. A proper conjugate prior for Pareto regression facilitates the development of an efficient computational algorithm. Chen and Ibrahim proposed a novel class of conjugate priors for the family of generalized linear models, but they did not show the connection between their conjugate prior and the Gaussian prior. Bradley et al. proposed the multivariate log-gamma distribution as a conjugate prior for the Poisson spatial regression model and established a connection between the multivariate log-gamma distribution and the multivariate normal distribution. The multivariate log-gamma distribution is an attractive alternative prior for the Pareto regression model due to its conjugacy.
We now present the multivariate log-gamma distribution from Bradley et al. . We define the n-dimensional random vector , which consists of n mutually independent log-gamma random variables with shape and scale parameters organized into the n-dimensional vectors , and , respectively. Then define the n-dimensional random vector as follows:
where and . Bradley et al.  called the multivariate log-gamma random vector. The random vector has the following probability density function:
where “det” represents the determinant function. We use “” as a shorthand for the probability density function in Equation (6).
According to Bradley et al. , the latent Gaussian process is a special case of the latent multivariate log-gamma process. If has a multivariate log-gamma distribution . When , will converge in distribution to the multivariate normal distribution vector with mean and covariance matrix . is sufficiently large for this approximation. MLG model is a more saturated model than Gaussian process model. For the Pareto regression model, the MLG process is more computationally efficient than the Gaussian process. In following hierarchical model, we refer to and as following an MLG distribution with , and being the first parameter of MLG corresponding to , and and are the second parameter of MLG like .
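The construction of a multivariate log-gamma vector and its normal limit can be sketched in a few lines of Python. The sketch assumes the usual construction $q = c + Vw$, where the elements of $w$ are independent logarithms of gamma random variables with shape $\alpha_i$ and scale $\kappa_i$, and it assumes the limiting case takes the form MLG$(c, \alpha^{1/2}V, \alpha 1, \alpha^{-1} 1)$, which approaches a normal distribution with mean $c$ and covariance $VV'$ as $\alpha$ grows. The function names and the value $\alpha = 10^4$ used in the usage note are illustrative choices, not values taken from the text.

```python
import math
import random


def sample_mlg(c, V, alpha, kappa, rng):
    """Draw q = c + V w with w_i = log(g_i), g_i ~ Gamma(shape alpha_i, scale kappa_i)."""
    n = len(c)
    w = [math.log(rng.gammavariate(alpha[i], kappa[i])) for i in range(n)]
    return [c[i] + sum(V[i][j] * w[j] for j in range(n)) for i in range(n)]


def sample_near_normal_mlg(c, V, a, rng):
    """Draw from MLG(c, sqrt(a) V, a 1, (1/a) 1), close to N(c, V V') for large a."""
    n = len(c)
    scaled_V = [[math.sqrt(a) * V[i][j] for j in range(n)] for i in range(n)]
    return sample_mlg(c, scaled_V, [a] * n, [1.0 / a] * n, rng)
```

For example, drawing many vectors with `sample_near_normal_mlg([1.0, -2.0], [[1.0, 0.0], [0.5, 1.0]], 1e4, random.Random(0))` gives sample means near `c` and a sample covariance near `V V'`, which is the behavior that a QQ-plot against the normal distribution would check.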
In order to establish conjugacy, we build a spatial GLM with latent multivariate log gamma process as follows:
where defined baseline, , , , , and .
2.2. Bayesian Model Assessment Criteria
In this section, we consider two Bayesian model assessment criteria, DIC and LPML. In addition, we introduce the procedure for calculating DIC and LPML for the Pareto regression model with spatial random effects. Let denote the vector of regression coefficients under the full model M. Also let and denote the corresponding vectors of regression parameters included and excluded in the subset model m. Then, holds for all m, and .
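The deviance information criterion is defined as
$$\mathrm{DIC} = \mathrm{Dev}(\bar{\theta}) + 2p_D,$$
where $\mathrm{Dev}(\theta)$ is the deviance function, $p_D = \overline{\mathrm{Dev}} - \mathrm{Dev}(\bar{\theta})$ is the effective number of model parameters, $\bar{\theta}$ is the posterior mean of the parameters $\theta$, and $\overline{\mathrm{Dev}}$ is the posterior mean of $\mathrm{Dev}(\theta)$. To carry out variable selection, we specify the deviance function as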
where , is the likelihood function in Equation (7), is the posterior mean of the spatial random effects on location , is the vector of regression coefficient under the m-th model. In this way, the DIC criterion is given by
where , and .
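The DIC can be computed directly from MCMC output. The sketch below follows the standard definition of Spiegelhalter et al., with deviance equal to minus twice the log-likelihood, and treats all sampled quantities as one generic parameter vector; it does not reproduce the specific deviance used above, which plugs in the posterior mean of the spatial random effects. The function name `dic` and the user-supplied `loglik` callable are hypothetical.

```python
def dic(loglik, draws, data):
    """Return (DIC, p_D) from posterior draws, with Dev(theta) = -2 * loglik(theta, data).

    loglik : callable taking (theta, data) and returning the log-likelihood
    draws  : list of posterior draws, each a list of parameter values
    data   : observed data, passed through to loglik
    """
    n_draws = len(draws)
    dim = len(draws[0])
    # Posterior mean of each parameter
    theta_bar = [sum(d[k] for d in draws) / n_draws for k in range(dim)]
    dev_at_mean = -2.0 * loglik(theta_bar, data)
    # Posterior mean of the deviance
    mean_dev = sum(-2.0 * loglik(d, data) for d in draws) / n_draws
    p_d = mean_dev - dev_at_mean  # effective number of parameters
    return dev_at_mean + 2.0 * p_d, p_d
```

Among candidate models fitted to the same data, the model with the smallest returned DIC is preferred.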
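In order to calculate the LPML, we need to calculate the CPO first. The LPML can then be obtained as
$$\mathrm{LPML} = \sum_{i=1}^{n} \log(\mathrm{CPO}_i),$$
where $\mathrm{CPO}_i$ is the CPO for the i-th subject.
Let $D^{(-i)}$ denote the observed data with the i-th observation deleted. The CPO for the i-th subject is defined as
$$\mathrm{CPO}_i = \int f(y_i \mid \theta)\, \pi(\theta \mid D^{(-i)})\, d\theta,$$
where $f(y_i \mid \theta)$ is the likelihood of the i-th observation $y_i$ and $\pi(\theta \mid D^{(-i)})$ is the posterior distribution based on the data $D^{(-i)}$.
From Chapter 10 of Chen et al., the CPO in (13) can be rewritten as
$$\mathrm{CPO}_i = \left[ \int \frac{1}{f(y_i \mid \theta)}\, \pi(\theta \mid D)\, d\theta \right]^{-1}.$$
A popular Monte Carlo estimate of the CPO uses Gibbs samples from the posterior distribution given $D$ instead of $D^{(-i)}$. Letting $\{\theta_b, b = 1, \dots, B\}$ denote a Gibbs sample of $\theta$ from $\pi(\theta \mid D)$ and using (14), a Monte Carlo estimate of $\mathrm{CPO}_i^{-1}$ is given by
$$\widehat{\mathrm{CPO}}_i^{-1} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{f(y_i \mid \theta_b)}.$$
So the LPML is estimated as
$$\widehat{\mathrm{LPML}} = \sum_{i=1}^{n} \log(\widehat{\mathrm{CPO}}_i).$$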
In the context of variable selection, we select the subset model with the largest LPML value and/or the smallest DIC value. In practice, if the two criteria lead to different results, we retain both selected models as the best models and can run further diagnostics on the two candidates. DIC balances goodness of fit against model complexity. The CPO is based on leave-one-out cross-validation. The LPML, the sum of the log CPOs, is an estimator of the log marginal likelihood.
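The Monte Carlo estimate of the CPO is a harmonic mean of likelihood values over the posterior draws, so it is numerically safer to work on the log scale. The following sketch assumes that the pointwise log-likelihood values $\log f(y_i \mid \theta_b)$ have already been stored for each observation $i$ and each posterior draw $b$; the function names `log_cpo` and `lpml` and the input layout are hypothetical.

```python
import math


def log_cpo(loglik_draws_i):
    """log CPO_i from the B values log f(y_i | theta_b), one per posterior draw."""
    B = len(loglik_draws_i)
    neg = [-v for v in loglik_draws_i]
    m = max(neg)
    # Log-sum-exp of -log f gives log(sum_b 1 / f) without overflow.
    log_mean_inverse = m + math.log(sum(math.exp(v - m) for v in neg)) - math.log(B)
    return -log_mean_inverse


def lpml(loglik_matrix):
    """LPML = sum_i log CPO_i, where loglik_matrix[i][b] = log f(y_i | theta_b)."""
    return sum(log_cpo(row) for row in loglik_matrix)
```

Among candidate models fitted to the same data, the model with the largest LPML is preferred.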
2.3. Analytic Connections between Bayesian Variable Selection Criteria with Conditional AIC for the Normal Linear Regression with Spatial Random Effects
The Akaike information criterion (AIC) has been applied to choose among candidate mixed-effects models by integrating out the random effects. A conditional AIC (cAIC) was proposed for the linear mixed-effects model under the assumption that the variance-covariance matrix of the random effects is known. Under this assumption, we establish analytic connections between the cAIC and the DIC and LPML introduced in Section 2.2. We have the following linear regression model with spatial random effects:
where is a vector of fixed effects, is spatial random effects for individual i. The cAIC is defined as:
where is with full rank k. Having the MLE of , we can have
where , , .
From , we can have DIC and LPML for the linear regression model with spatial random effects as follows
where is calculated by the posterior mean, with a conjugate prior for the likelihood model, , is the remainder of the Taylor expansion. So, under the conjugate prior, our proposed Bayesian variable selection criteria are similar to the cAIC for the linear regression model with spatial random effects.
3. MCMC Scheme
The algorithm requires sampling all parameters in turn from their respective full conditional distributions. We assume that , are independent a priori. We further assume and . Thus, sampling from and is straightforward. For and , we assume that , and , that is, , , and . The sampling scheme for these three parameters is not straightforward, so we use a Metropolis–Hastings algorithm to sample them. The other difficulty is computing the log-determinant of a matrix. Because we work with the log-likelihood function, its formula involves the expression or . To obtain the logarithm of a determinant, we recommend not computing the determinant itself but computing the log-determinant directly. For a matrix with a large determinant, the log-determinant can usually be computed without difficulty, whereas computing the determinant itself might cause a numerical error. The method is given by
where is the Cholesky root of matrix , is the Cholesky root of matrix , and “diag” denotes the column vector whose elements are the diagonal elements of a matrix. The derivation details for the full conditional distributions are given in Appendix A.
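The log-determinant computation described above can be sketched as follows: for a symmetric positive definite matrix with lower-triangular Cholesky root $G$, the log-determinant equals twice the sum of the logarithms of the diagonal elements of $G$. The pure-Python Cholesky routine and the function names are written for this illustration only; in practice a linear algebra library routine would normally supply the Cholesky factor.

```python
import math


def cholesky_lower(A):
    """Lower-triangular Cholesky root G of a symmetric positive definite matrix, A = G G'."""
    n = len(A)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(G[i][k] * G[j][k] for k in range(j))
            if i == j:
                d = A[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                G[i][j] = math.sqrt(d)
            else:
                G[i][j] = (A[i][j] - s) / G[j][j]
    return G


def logdet_spd(A):
    """log det(A) = 2 * sum(log(diag(G))), computed without forming det(A)."""
    G = cholesky_lower(A)
    return 2.0 * sum(math.log(G[i][i]) for i in range(len(A)))
```

For a 200 by 200 diagonal matrix with entries of 10,000, the determinant itself overflows double precision, while `logdet_spd` returns the finite value 200 log(10,000).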
Note that , and are prespecified hyperparameters. In this article, we use and . For more flexibility, we can also assume and each following a Gamma distribution with suitable hyperparameters.
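For the parameters without closed-form full conditionals, a Metropolis–Hastings update can be sketched as below. This is a generic random-walk sampler for a single positive parameter that proposes on the log scale; the function name `mh_positive`, the step size, and the proposal distribution are assumptions made for illustration and are not the tuning used in this article.

```python
import math
import random


def mh_positive(log_target, init, n_iter, step, rng):
    """Random-walk Metropolis-Hastings for a positive scalar, proposing on the log scale.

    log_target : callable returning the unnormalized log density at x > 0
    init       : positive starting value
    step       : standard deviation of the random walk on log(x)
    """
    x = init
    draws = []
    for _ in range(n_iter):
        proposal = x * math.exp(step * rng.gauss(0.0, 1.0))
        # The proposal is symmetric in log(x) but not in x, so the acceptance
        # ratio includes the Jacobian term log(proposal) - log(x).
        log_ratio = (log_target(proposal) - log_target(x)
                     + math.log(proposal) - math.log(x))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        draws.append(x)
    return draws
```

Within a Gibbs sampler, one such update would be applied to each of the non-conjugate parameters in turn, with `log_target` set to the corresponding log full conditional density.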
4. Simulation Study
The spatial domain for the simulation studies is chosen to be D . The locations are selected uniformly over D. We present the different simulation settings and generate 100 replicate data sets for each scenario. We assume so that we have seven candidate models. We generate from a multivariate normal distribution with mean zero and covariance . We set and fix in both Simulations 1 and 2. We generate the elements of independently from the uniform distribution U(0,1). We define the baseline threshold (scale parameter) equal to three in both simulations.
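The count of seven candidate models corresponds to all non-empty subsets of three covariates. The sketch below enumerates such subsets and picks the preferred model from a table of criterion values; the covariate labels, the function names, and the exclusion of the empty model are assumptions made for this illustration.

```python
from itertools import combinations


def candidate_models(covariates):
    """All non-empty subsets of the covariates, smallest models first."""
    models = []
    for size in range(1, len(covariates) + 1):
        models.extend(combinations(covariates, size))
    return models


def select_model(scores, larger_is_better):
    """Pick the model with the largest score (LPML) or the smallest score (DIC)."""
    pick = max if larger_is_better else min
    return pick(scores, key=scores.get)
```

Calling `candidate_models(["x1", "x2", "x3"])` returns seven tuples, from the three single-covariate models up to the full model.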
4.1. Simulation for the Connection between Multivariate Log Gamma and Multivariate Normal Distribution
In this section, we examine the connection between the multivariate log-gamma distribution and the multivariate normal distribution. First, we draw the quantile-quantile (QQ) plot in Figure 1 to show the normality of generated from , when . In addition, we use the Kolmogorov–Smirnov test to examine the connection for one-dimensional data, and a multivariate two-sample test for multivariate data. We generated one data set of size 100 from the multivariate log-gamma distribution and another data set of size 100 from the multivariate normal distribution, and then calculated the p-value from the multivariate two-sample test for comparing these two data sets. We repeated this process 1000 times and found that 992 of the 1000 p-values were larger than the significance level of 0.05. That is, in 992 of 1000 replications, we did not reject the null hypothesis that the two samples were drawn from the same distribution.
4.2. Simulation for Estimation Performance
In this simulation study, our goal was to examine the estimation performance of the hierarchical model. We set . We estimated the parameters in this simulation and report the bias (), the standard error (SE) (, where ), and the mean square error (MSE) () in Table 1, where is the true value of .
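The three summaries reported for each parameter can be computed from the replicate estimates as sketched below. The formulas are the usual ones and are assumed here: bias is the mean estimate minus the true value, SE is the standard deviation of the estimates around their own mean (with divisor equal to the number of replicates), and MSE is the mean squared distance from the true value. The function name `replicate_summary` is hypothetical.

```python
import math


def replicate_summary(estimates, true_value):
    """Return (bias, SE, MSE) of one parameter's estimates across replicate data sets."""
    R = len(estimates)
    mean_est = sum(estimates) / R
    bias = mean_est - true_value
    se = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / R)
    mse = sum((e - true_value) ** 2 for e in estimates) / R
    return bias, se, mse
```

With this choice of divisor, the MSE equals the squared bias plus the squared SE, which gives a quick consistency check on a results table.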
We aim for parameter estimates that are close to the true values, and the variance measures how scattered the estimates are. From Table 1, using the MLG prior for , we obtained reasonable estimation results, achieving low bias and low variance simultaneously. In addition, we calculated the coverage probability for each parameter, which was 94% for each parameter.
4.3. Simulation for Model Selection
In this simulation study, our goal was to study the accuracy of our model selection criteria. We have two different simulations in this section. Simulation 1: we set true and calculated the difference between the true model and the other candidate models for both criteria. In Figure 2, a difference above zero means that the true model had a smaller DIC than the candidate model, and a difference below zero means that the true model had a higher LPML than the candidate model. The true model had the smallest DIC and the largest LPML in 99 of 100 simulated data sets. Simulation 2: we set true and the results are shown in Figure 3. In each simulation, we have seven candidate models, one of which is the true model, denoted as model 5. In Figure 2 and Figure 3, the y-axis is the difference between “candidate model i” and the true model. In Simulation 2, the true model had the smallest DIC in 81 of 100 simulated data sets and the largest LPML in 80 of 100 simulated data sets. For each replicate data set, we fit our model with 5000 Markov chain Monte Carlo iterations and treated the first 2000 iterations as burn-in. From Figure 2 and Figure 3, we find that DIC and LPML yielded relatively consistent model selection results in both simulation studies.
4.4. Simulation for Model Comparison
In this simulation study, our goal is to evaluate the accuracy of our model selection criteria for different spatial random effects models. In this section, we generate the spatial random effects from the MLG, , where . Other settings are the same as in the previous simulations. We generated 100 data sets under these settings. Then, we compared the model fit based on the following two priors:
For each replicate data set, we fit our model with 5000 Markov chain Monte Carlo iterations and treated the first 2000 iterations as burn-in. Then, we calculated the difference in DIC and the difference in LPML between these two priors. In Figure 4, values below zero in the left plot imply that prior 1 has a smaller DIC than prior 2, and values above zero in the right plot indicate that prior 1 has a higher LPML than prior 2. The results in Figure 4 show that the MLG prior gives better results than the Gaussian prior.
5. A Real Data Example
5.1. Data Description
We analyzed seven days of US earthquake data collected in 2018, which include n = 228 earthquakes with magnitudes over (https://earthquake.usgs.gov/). We present the earthquake data in Figure 5. Most of the earthquakes lie in seismic belts. In Figure 6 and Figure 7, we present the histogram and the scatter plot of this data set. In this analysis, we have three covariates (depth, gap, rms). The depth is the depth at which the earthquake begins to rupture. The gap is the largest azimuthal gap between azimuthally adjacent stations (in degrees). The rms is the root-mean-square (RMS) travel time residual, in seconds, using all weights.
We consider the model in Equation (7) and specify and . This choice of hyperparameters leads to an MLG that approximates a multivariate normal distribution, and hence gives an approximately normal prior on . Inverse gamma priors are chosen for the variance parameters and , which is a common choice for variance parameters in Bayesian analysis. The full conditionals in Appendix A are used to run a Gibbs sampler. We have seven candidate models in total, and (depth, gap, rms). The number of iterations of the Gibbs sampler is 15,000, and the number of burn-in iterations is 10,000. The trace plots of the posterior samples are provided in Appendix B to show the convergence of the MCMC chain. We also compare with a model in which approximates a normal distribution. The “DIC” and “LPML” denote the DIC and LPML, respectively, for a model in which approximates a normal distribution. Furthermore, we calculated the log probability density (LPD) for the candidate models. Based on the results in Table 2, the three criteria selected the same model with and MLG spatial random effects. Our proposed criteria gave results consistent with the LPD.
From Table 2, we know that the model with has the smallest DIC and largest LPML. We also report the posterior estimates under the best model in Table 3 according to both DIC and LPML.
From these posterior estimates, the selected model contains only depth as an important covariate, and its 95% credible interval does not contain zero. We see that as the depth increases, the expected value of earthquake magnitude increases. The other two covariates, gap and rms, have no significant effects on earthquake magnitudes. In other words, in these seven-day earthquake data, deep earthquakes tend to have larger magnitudes than shallow earthquakes. From the posterior estimates of and , we find that there is spatial correlation in earthquake magnitudes between different locations. In addition, using the MLG for the spatial random effects improves the goodness of fit of the regression model for these data. This result is consistent with the earthquake literature.
In this paper, we propose Bayesian variable selection criteria for a Bayesian spatial model for analyzing earthquake magnitudes. Our main methodological contributions are the use of the multivariate log-gamma model for both the regression coefficients and the spatial random effects, and variable selection for regression covariates in the presence of spatial random effects. Both DIC and LPML have good power to select the true model. However, Bayesian model assessment criteria such as DIC and LPML do not perform well in the high-dimensional case, because the number of candidate models becomes very large as the number of covariates increases. Developing a high-dimensional variable selection procedure is an important direction for future work. Another direction is to fit other models for earthquake magnitudes, such as the gamma model or the Weibull model. In addition, we need to propose Bayesian model assessment criteria for selecting the true data model for earthquake magnitudes. For natural hazard problems, we need to incorporate the temporal dependence structure of earthquakes. Recently, the ETAS model (combining the Gutenberg–Richter law and the Omori law) has been widely studied. Modelling earthquake dynamics is an important approach for preventing economic loss caused by earthquakes. Incorporating self-exciting effects in our generalized linear model with spatial random effects is another important direction for future work. Furthermore, we only consider earthquake information as covariates in our model. Combining more geographical information, such as fault line information or crustal movement, should increase the predictive accuracy in the future.
Chen’s research was partially supported by NIH grants #GM70335 and #P01CA142538. Hu’s research was supported by the Dean’s Office of the College of Liberal Arts and Sciences at the University of Connecticut.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Full Conditional Distributions for Pareto Data with Latent Multivariate Log-Gamma Process Models
From the hierarchical model in Equation (7), the full conditional distribution for satisfies:
Rearranging terms we have
which implies that is equal to , which is a shorthand for the conditional MLG distribution used in .
Similarly, the full conditional distribution for satisfies:
Rearranging terms we have
which implies that is equal to . Thus we obtain the following full-conditional distributions to be used within a Gibbs sampler:
where “cMLG” is the conditional multivariate log-gamma distribution from . A motivating feature of this conjugate structure is that it is relatively straightforward to simulate from a cMLG. For , and , we consider using a Metropolis–Hastings algorithm or a slice sampling procedure .
The parameters of the conditional multivariate log-gamma distribution are organized in Table A1.
Parameters of the full conditional distribution.
Appendix B. Trace Plot in Real Data Analysis
(upper left) Trace plot for ; (upper right) Trace plot for ; (lower left) Trace plot for ; (lower right) Trace plot for .
Mega, M.S.; Allegrini, P.; Grigolini, P.; Latora, V.; Palatella, L.; Rapisarda, A.; Vinciguerra, S. Power-law time distribution of large earthquakes. Phys. Rev. Lett. 2003, 90, 188501.
Kijko, A. Estimation of the maximum earthquake magnitude, m max. Pure Appl. Geophys. 2004, 161, 1655–1681.
Vere-Jones, D.; Robinson, R.; Yang, W. Remarks on the accelerated moment release model: Problems of model formulation, simulation and estimation. Geophys. J. Int. 2001, 144, 517–531.
Gelfand, A.E.; Dey, D.K. Bayesian model choice: Asymptotics and exact calculations. J. R. Stat. Soc. Ser. B (Methodol.) 1994, 56, 501–514.
Geisser, S. Predictive Inference; Routledge: Abingdon, UK, 1993.
Ibrahim, J.G.; Laud, P.W. A predictive approach to the analysis of designed experiments. J. Am. Stat. Assoc. 1994, 89, 309–319.
Spiegelhalter, D.J.; Best, N.G.; Carlin, B.P.; Van Der Linde, A. Bayesian measures of model complexity and fit. J. R. Stat. Soc. Ser. B Stat. Methodol. 2002, 64, 583–639.
Chen, M.H.; Huang, L.; Ibrahim, J.G.; Kim, S. Bayesian variable selection and computation for generalized linear models with conjugate priors. Bayesian Anal. 2008, 3, 585.
Bradley, J.R.; Holan, S.H.; Wikle, C.K. Bayesian Hierarchical Models with Conjugate Full-Conditional Distributions for Dependent Data from the Natural Exponential Family. arXiv, 2017; arXiv:1701.07506.
Chen, M.H.; Ibrahim, J.G. Conjugate priors for generalized linear models. Stat. Sin. 2003, 13, 461–476.
Geisser, S.; Eddy, W.F. A predictive approach to model selection. J. Am. Stat. Assoc. 1979, 74, 153–160.
Tobler, W.R. A computer movie simulating urban growth in the Detroit region. Econ. Geogr. 1970, 46, 234–240.
Gelfand, A.E.; Schliep, E.M. Spatial statistics and Gaussian processes: A beautiful marriage. Spat. Stat. 2016, 18, 86–104.
Bradley, J.R.; Holan, S.H.; Wikle, C.K. Computationally Efficient Distribution Theory for Bayesian Inference of High-Dimensional Dependent Count-Valued Data. arXiv, 2015; arXiv:1512.07273.
Chen, M.H.; Shao, Q.M.; Ibrahim, J.G. Monte Carlo Methods in Bayesian Computation; Springer Science and Business Media: Berlin/Heidelberg, Germany, 2012.
Liang, H.; Wu, H.; Zou, G. A note on conditional AIC for linear mixed-effects models. Biometrika 2008, 95, 773–778.
Baringhaus, L.; Franz, C. On a new multivariate two-sample test. J. Multivar. Anal. 2004, 88, 190–206.
Ogata, Y. Statistical models for earthquake occurrences and residual analysis for point processes. J. Am. Stat. Assoc. 1988, 83, 9–27.