# Applied Geospatial Bayesian Modeling in the Big Data Era: Challenges and Solutions


## Abstract


## 1. Introduction

**Y** at location ${\mathbf{s}}_{0}$ when it has not been observed? Kriging produces smooth estimates at unobserved locations, filling in predictions at new locations from the information available at the observed points in the dataset. Ordinary kriging estimates the value at an unobserved location as a weighted average of the observed points, while universal kriging (our focus here) uses covariates as predictors in addition to locational information. Kriging is therefore a useful technique for understanding phenomena that are correlated in space (e.g., disease propagation, natural resource detection, and political ideology).
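To make the weighted-average idea concrete, the following is a minimal sketch of an ordinary kriging prediction at a single new location. It is illustrative only: the exponential covariance, the parameter names `sigma2`, `range_` (standing in for $1/\varphi$), and `tau2`, and the function name `ordinary_kriging` are assumptions introduced here, not code from the paper.

```python
import numpy as np

def ordinary_kriging(coords, y, s0, sigma2=1.0, range_=1.0, tau2=0.0):
    """Predict Y at s0 as a weighted average of observed y (weights sum to one).

    Assumes an exponential covariance sigma2 * exp(-d / range_) with nugget tau2.
    """
    n = len(y)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    d0 = np.linalg.norm(coords - s0, axis=1)
    K = sigma2 * np.exp(-d / range_) + tau2 * np.eye(n)   # covariance among observations
    k0 = sigma2 * np.exp(-d0 / range_)                    # covariance with the new location
    # Augmented system: the extra row and column enforce sum(w) == 1.
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = K
    A[:n, n] = 1.0
    A[n, :n] = 1.0
    rhs = np.append(k0, 1.0)
    w = np.linalg.solve(A, rhs)[:n]
    return w @ y, w

# Four observations on the corners of a unit square, predicting at the center.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
pred, w = ordinary_kriging(coords, y, np.array([0.5, 0.5]), range_=2.0)
```

Because the center is equidistant from all four corners, each observation receives the same weight and the prediction is the simple average. Universal kriging would additionally include covariates in the mean of the process.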

## 2. Materials and Methods

#### 2.1. Why Is Bayesian Kriging Difficult with Big Data?

#### 2.2. Procedure: Kriging with Bootstrapping

#### 2.2.1. Monte Carlo Integration of Spatial Quantities

1. Draw a set of initial values from the prior distributions for $\frac{\tau^{2}}{\sigma^{2}}$ and $\frac{1}{\varphi}$.
2. Set the FGLS iteration counter to $m=1$ to start the iterative part of the process.
3. At the $m^{\mathrm{th}}$ step, using the posterior probabilities computed in Equation (2), draw a single set of sample posterior values for $\frac{\tau^{2}}{\sigma^{2}}$ and $\frac{1}{\varphi}$ and label these $\left(\frac{\tau^{2}}{\sigma^{2}}\right)^{(m)}$ and $\left(\frac{1}{\varphi}\right)^{(m)}$.
4. Use the sampled values of $\left(\frac{\tau^{2}}{\sigma^{2}}\right)^{(m)}$ and $\left(\frac{1}{\varphi}\right)^{(m)}$ to define the conditional posterior distribution of the partial sill, $\sigma^{2}$:
   $$\sigma^{2} \mid \mathbf{Y}, \mathit{X}, \left(\frac{\tau^{2}}{\sigma^{2}}\right)^{(m)}, \left(\frac{1}{\varphi}\right)^{(m)} \sim \chi_{ScI}^{2}(n, \widehat{\sigma}^{2})$$
   Take a draw from this scaled inverse $\chi^{2}$ distribution to determine $\sigma^{2(m)}$.
5. Use the sampled values of $\left(\frac{\tau^{2}}{\sigma^{2}}\right)^{(m)}$, $\left(\frac{1}{\varphi}\right)^{(m)}$, and $\sigma^{2(m)}$ to estimate the regression coefficients using (3) and (4). This yields the vector of coefficients $\tilde{\beta}$ and the covariance matrix $\sigma^{2} V_{\tilde{\beta}}$. With these two terms, we define the conditional posterior distribution of the vector of regression coefficients, $\beta$:
   $$\beta \mid \mathbf{Y}, \mathit{X}, \sigma^{2(m)}, \left(\frac{\tau^{2}}{\sigma^{2}}\right)^{(m)}, \left(\frac{1}{\varphi}\right)^{(m)} \sim \mathcal{MVN}(\tilde{\beta}, \sigma^{2} V_{\tilde{\beta}})$$
   Take a draw from this multivariate normal distribution to determine $\beta^{(m)}$.
6. Update the FGLS iteration counter as $m=m+1$.
7. Repeat steps 3–6 until $m=M$.
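The iterative part of this procedure (steps 3 through 7) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: Equations (2) through (4) are not reproduced in this section, so the posterior probabilities from Equation (2) are taken as a given input (`grid`, `probs`), the correlation function is assumed to be exponential, the coefficient estimates are assumed to take the usual generalized least squares form, and $\widehat{\sigma}^{2}$ is assumed to be the GLS residual sum of squares divided by $n$. The function name `mc_fgls` and all argument names are hypothetical.

```python
import numpy as np

def mc_fgls(y, X, coords, grid, probs, M, rng):
    """Return an (M, 3 + p) array of draws ordered as (tau2, sigma2, phi, beta).

    grid  : (G, 2) array of candidate (tau2/sigma2, 1/phi) pairs
    probs : length-G posterior probabilities for those pairs (assumed given)
    """
    n, p = X.shape
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    draws = np.empty((M, 3 + p))
    for m in range(M):
        # Step 3: draw one (tau2/sigma2, 1/phi) pair from the discrete posterior.
        ratio, range_ = grid[rng.choice(len(grid), p=probs)]
        # Assumed exponential correlation plus relative nugget on the diagonal.
        V = np.exp(-d / range_) + ratio * np.eye(n)
        L = np.linalg.cholesky(V)
        Xw, yw = np.linalg.solve(L, X), np.linalg.solve(L, y)  # whitened data
        V_beta = np.linalg.inv(Xw.T @ Xw)
        V_beta = (V_beta + V_beta.T) / 2        # guard against round-off asymmetry
        beta_tilde = V_beta @ Xw.T @ yw         # assumed GLS form of the estimate
        resid = yw - Xw @ beta_tilde
        sigma2_hat = resid @ resid / n          # assumed definition of sigma-hat^2
        # Step 4: scaled inverse chi-square(n, sigma2_hat) draw.
        sigma2 = n * sigma2_hat / rng.chisquare(n)
        # Step 5: multivariate normal draw for the coefficients.
        beta = rng.multivariate_normal(beta_tilde, sigma2 * V_beta)
        draws[m] = np.concatenate(([ratio * sigma2, sigma2, 1.0 / range_], beta))
    return draws

# Simulated data purely for illustration.
rng = np.random.default_rng(0)
n = 60
coords = rng.uniform(0, 10, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
grid = np.array([[0.1, 1.0], [0.1, 3.0], [0.5, 1.0], [0.5, 3.0]])
probs = np.full(len(grid), 0.25)
draws = mc_fgls(y, X, coords, grid, probs, M=200, rng=rng)
```

Each row of `draws` is one posterior sample; column means of this matrix correspond to the per-resample summaries used in the bootstrap procedure.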

1. Independently draw B bootstrap samples of the parameter estimate vector of length $3+k$, $\theta =\{\tau^{2},\sigma^{2},\varphi ,\beta \}$, where k is the number of explanatory variables on the right-hand side of the core model. The data of size N are drawn across iterations without replacement, meaning that these B samples will not contain overlapping data cases among them.
   - 1a. Independently draw n data cases from the full N-sized sample, without replacement since zero distances between cases are not defined in the model, producing $\mathit{X}_{1}^{*},\mathit{X}_{2}^{*},\dots ,\mathit{X}_{n}^{*}$.
   - 1b. Take these n data cases and perform the kriging process n.sims times to get the parameter draws: $(\tau_{b},\sigma_{b},\varphi_{b},\beta_{b})$, for $b=1,\dots ,\mathrm{n}.\mathrm{sims}$. Put another way, we repeat the hybrid MC-FGLS procedure described above n.sims times in order to construct a posterior sample of the parameters over a resample.
   - 1c. From the $\mathrm{n}.\mathrm{sims}\times (3+k)$ matrix generated by the MC-FGLS process, take the means down columns to produce $\theta_{b}^{*}=\{\widehat{\tau},\widehat{\sigma},\widehat{\varphi},\widehat{\beta}\}$, producing one of the $b=1,\dots ,B$ results required in step 2 below.
2. Record the sample statistics of interest, $\theta_{b}^{*}$, for each bootstrap sample, and the mean of these statistics:
   $$\overline{\theta}^{*}=\frac{1}{B}\sum_{b=1}^{B}\theta_{b}^{*}$$
3. Estimate the bootstrap variance of the statistic, whose square root is the bootstrap standard error, by:
   $$\mathrm{Var}\left(\overline{\theta}^{*}\right)=\frac{1}{B-1}\sum_{b=1}^{B}\left(\theta_{b}^{*}-\overline{\theta}^{*}\right)^{2}$$
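The resampling and pooling logic of these steps can be sketched as below. This is an assumption-laden illustration: the function `brss`, its arguments, and the stand-in model `toy_fit` are hypothetical names introduced here, and `toy_fit` is an ordinary least squares placeholder that merely returns a matrix of simulated draws in place of the MC-FGLS kriging fit.

```python
import numpy as np

def brss(data, n, B, fit_fn, rng):
    """Pool B disjoint subsample fits into a mean and a bootstrap variance.

    fit_fn(subsample, rng) must return an (n_sims, n_params) array of draws.
    """
    N = len(data)
    if B * n > N:
        raise ValueError("B * n must not exceed N for non-overlapping subsamples")
    # Step 1a: disjoint subsamples, drawn without replacement across iterations.
    idx = rng.permutation(N)[: B * n].reshape(B, n)
    # Steps 1b-1c: fit each subsample, then take column means of its draws.
    theta_b = np.array([fit_fn(data[rows], rng).mean(axis=0) for rows in idx])
    theta_bar = theta_b.mean(axis=0)          # step 2: mean over the B results
    var = theta_b.var(axis=0, ddof=1)         # step 3: 1/(B-1) sum of squared deviations
    return theta_bar, var, idx

def toy_fit(sub, rng, n_sims=100):
    """Placeholder for the kriging fit: OLS with normal draws around the estimate."""
    X = np.column_stack([np.ones(len(sub)), sub[:, 0]])
    y = sub[:, 1]
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (len(y) - 2)
    return rng.multivariate_normal(beta_hat, s2 * np.linalg.inv(X.T @ X), size=n_sims)

# Simulated data purely for illustration: y = 1 + 2x + noise.
rng = np.random.default_rng(1)
x = rng.normal(size=5000)
data = np.column_stack([x, 1.0 + 2.0 * x + rng.normal(size=5000)])
theta_bar, var, idx = brss(data, n=500, B=10, fit_fn=toy_fit, rng=rng)
se = np.sqrt(var)  # bootstrap standard error
```

Swapping `toy_fit` for a function that runs the kriging sampler on each subsample would keep the pooling code unchanged, since only the shape of the returned draw matrix matters.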

#### 2.3. Properties of Bootstrap Random Spatial Sampling

#### 2.3.1. Terminology and Consistency

#### 2.3.2. A Modification of the Parametric Bootstrap for Big Data

## 3. Results

#### 3.1. Validity Demonstration Example: Fracking in West Virginia

#### 3.1.1. Fracking Model: Full Data versus Bootstrap

#### 3.1.2. A Performance Experiment

#### 3.2. Application with Big Data: Campaign Contributions in California

## 4. Discussion & Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A. Monte Carlo Experiment

**Figure A1.** Results from Monte Carlo Experiment of Performance over an N = 2000 Dataset versus Bootstrap Methods. (**a**) % Mean Absolute Deviation, (**b**) Coverage Probability.


**Figure 3.** Rival Parametric Semivariograms for West Virginia Fracking for BRSS Without Replacement (**Left**) and BRSS (with replacement) Jittered (**Right**).

Parameter | Full Data Set: Median | Full Data Set: 5th Perc. | Full Data Set: 95th Perc. | BRSS (Sample Size = 500): Median | BRSS (Sample Size = 500): 5th Perc. | BRSS (Sample Size = 500): 95th Perc. | BRSS with Jitter (Sample Size = 500): Median | BRSS with Jitter (Sample Size = 500): 5th Perc. | BRSS with Jitter (Sample Size = 500): 95th Perc. |
---|---|---|---|---|---|---|---|---|---|
Intercept | 7.0542 | 5.1359 | 8.7443 | 6.9515 | 4.1634 | 9.9202 | 6.5613 | 3.5828 | 9.0190 |
Log Elevation | 0.2690 | 0.0325 | 0.5235 | 0.2926 | −0.1193 | 0.7113 | 0.3752 | 0.0327 | 0.7989 |
Pressure | 0.0002 | 0.0002 | 0.0003 | 0.0003 | 0.0002 | 0.0004 | 0.0003 | 0.0002 | 0.0004 |
${\sigma}^{2}$ | 3.0243 | 2.4428 | 3.7879 | 2.9852 | 2.3574 | 3.7473 | 3.1772 | 2.5154 | 3.6810 |
$1/\varphi $ | 11,229.3100 | 8448.2760 | 14,482.7600 | 21,246.9800 | 15,515.2600 | 25,635.1300 | 11,068.5300 | 6949.5260 | 20,982.9300 |
${\tau}^{2}/{\sigma}^{2}$ | 0.1245 | 0.0966 | 0.1586 | 0.1675 | 0.0852 | 0.2875 | 0.0570 | 0.0500 | 0.1248 |

Parameter | Ordinary Kriging: Median | Ordinary Kriging: 5th Perc. | Ordinary Kriging: 95th Perc. | Universal Kriging: Median | Universal Kriging: 5th Perc. | Universal Kriging: 95th Perc. |
---|---|---|---|---|---|---|
Intercept | 5.9810 | 5.6986 | 6.3034 | 0.9167 | −4.3045 | 5.5629 |
Vote Frequency | | | | 0.0117 | −0.0277 | 0.0623 |
Logged Age | | | | 0.4534 | −0.0026 | 0.7872 |
Female | | | | −0.2561 | −0.4235 | −0.0542 |
Income (12 Cat.) | | | | 0.1030 | 0.0789 | 0.1262 |
Asian | | | | 0.1695 | −0.1785 | 0.4015 |
Black | | | | −0.2342 | −0.6583 | 0.1895 |
Latino | | | | −0.3553 | −0.6468 | −0.0557 |
College | | | | 0.1763 | −0.0534 | 0.3689 |
Eastings | | | | −0.0012 | −0.0034 | 0.0007 |
Northings | | | | −0.0005 | −0.0015 | 0.0003 |
${\sigma}^{2}$ | 1.3461 | 1.2053 | 1.4669 | 1.3008 | 1.1229 | 1.4787 |
$1/\varphi $ | 646.2641 | 334.7049 | 752.5614 | 714.3906 | 569.0185 | 797.5597 |
${\tau}^{2}/{\sigma}^{2}$ | 2.2432 | 2.1426 | 2.3966 | 2.2555 | 2.1125 | 2.4299 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Byers, J.S.; Gill, J.
Applied Geospatial Bayesian Modeling in the Big Data Era: Challenges and Solutions. *Mathematics* **2022**, *10*, 4116.
https://doi.org/10.3390/math10214116
