# Statistical Analysis in the Presence of Spatial Autocorrelation: Selected Sampling Strategy Effects


## Abstract


## 1. Introduction

## 2. The Case Brus Promotes: A Summary and Traditional Numerical Counterexamples

- Remarkably, in other publications we can read that the classical formula [for finite population sampling?] for the variance of the estimated population mean with SRS [simple RS] underestimates the true variance for populations showing spatial structure (see for instance Griffith [17] and Plant [18]). The reasoning is that due to the spatial structure there is less information in the sample data about the population mean.
- Another persistent misconception is that when estimating the variance of the estimated mean of a spatial population or the correlation of two variables of a population we must account for autocorrelation of the sample data. This misconception occurs, for instance, in [17] and in [18]. The reasoning is that, due to the [SA] in the sample data, there is less information in the data about the parameter of interest, and so the effective sample size is smaller than the actual sample size.

#### 2.1. Samples Artificially Enlarged by Repeating Each of Their Selected Observations

#### 2.2. Ignoring the Correlation in Correlated/Paired/Matched Samples

_{k}, not 2n

_{k}), and that the difference of two means standard error is nearly three times greater than its true magnitude. A real-world example of this possibility occurred during the early years of the United States Environmental Protection Agency’s Environmental Monitoring and Assessment Program (US EPA EMAP). Underwater soil samples collected from the Chesapeake Bay and other near-coastal sea beds along the US Atlantic Ocean and Gulf of Mexico shorelines were coded and then distributed to private chemistry laboratories for assaying, with the goal of determining which companies should receive subsequent government contracts through this project. For quality control purposes, both the same and different laboratories received sets of soil sample bags, some of which contained near-duplicate content because of their side-by-side borehole locations (i.e., some would say SA, whereas others would say measurement error, at work). One entrepreneur deciphered the adopted code, and merely reported the first replicate bag results again rather than assaying the duplicate samples: he treated a sample smaller than n as being of size n. Because he reported identical results for replicate bags, project scientists detected his deception, preventing him from receiving a subsequent project contract. Recognizing and accounting for correlation in observations matters!
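The nearly threefold discrepancy can be checked with a small simulation. The following is a minimal sketch, not the authors' experiment: it assumes standard normal pairs with ρ = 0.9 and n_{k} = 30 (the settings reported with Table 2), and the function name `se_of_mean_difference` is hypothetical.

```python
import numpy as np

def se_of_mean_difference(n_k, rho, reps=10_000, seed=0):
    """Empirical standard error of ybar_1 - ybar_2 for unit-variance normal
    pairs whose two members have correlation rho (rho = 0 mimics two
    independent samples)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    # pairs has shape (reps, n_k, 2): reps replications of n_k paired draws
    pairs = rng.multivariate_normal(np.zeros(2), cov, size=(reps, n_k))
    diffs = pairs[..., 0].mean(axis=1) - pairs[..., 1].mean(axis=1)
    return diffs.std(ddof=1)

n_k, rho = 30, 0.9
se_indep = se_of_mean_difference(n_k, 0.0)
se_corr = se_of_mean_difference(n_k, rho)

# Compare with the closed forms sqrt(2/n_k) and sqrt(2(1 - rho)/n_k)
print(f"independent: {se_indep:.4f} vs {np.sqrt(2 / n_k):.4f}")
print(f"correlated:  {se_corr:.4f} vs {np.sqrt(2 * (1 - rho) / n_k):.4f}")
print(f"ratio: {se_indep / se_corr:.2f}")
```

Under these assumptions, the ratio of the two standard errors should be close to √[1/(1 − ρ)] = √10 ≈ 3.16, which is the magnitude of the error incurred by treating one correlated sample as two independent ones.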

#### 2.3. Samples Grouping into Two Equal-Sized Disparate-Valued Subsets of Observation

$(\mu_{x,1}, \mu_{y,1}) = (5, 5)$ and $(\mu_{x,2}, \mu_{y,2}) = (10, 10)$, and $\sigma_{x} = \sigma_{y} = 0.5$; this chosen variance ensures that the two subset clusters are distinct and concentrated. The experiment had 10,000 replications, and supported comparing output for a SRS sample of size n with a stratified sample of size four. The only factor manipulated in this experiment was the design sample size.
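A rough sketch of this kind of experiment follows. It is an illustration under stated assumptions rather than a reproduction of the reported design: x and y are taken to be independent within each cluster, SRS is mimicked by letting each draw fall in either cluster with probability ½, only n = 4 is examined, and the helper names `draw` and `slope` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
MEANS = np.array([[5.0, 5.0], [10.0, 10.0]])  # cluster centers from the text
SIGMA = 0.5                                   # common within-cluster sigma

def draw(cluster_ids):
    """One (x, y) point per requested cluster; x and y are independent
    within a cluster (an assumption made for this sketch)."""
    noise = SIGMA * rng.standard_normal((len(cluster_ids), 2))
    return MEANS[cluster_ids] + noise

def slope(points):
    """OLS slope of y regressed on x."""
    return np.polyfit(points[:, 0], points[:, 1], deg=1)[0]

reps = 2000
# SRS of size 4: cluster membership is left to chance
srs = np.array([slope(draw(rng.integers(0, 2, size=4))) for _ in range(reps)])
# Stratified RS: exactly two draws from each cluster
strat = np.array([slope(draw(np.array([0, 0, 1, 1]))) for _ in range(reps)])

print(f"SRS:        mean slope {srs.mean():.3f}, variance {srs.var(ddof=1):.4f}")
print(f"stratified: mean slope {strat.mean():.3f}, variance {strat.var(ddof=1):.4f}")
```

With SRS, one sample in eight is expected to fall entirely within a single cluster, where the fitted slope reflects only noise; the stratified design rules this outcome out, which is why its slope estimates should vary far less.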

_{k}, where k denotes the number of strata, here from one to two (i.e., two draws from each cluster).

## 3. SA Effects in Spatial Sampling Designs

#### 3.1. What Some Experts Say

#### 3.2. About Variance

- If there is any spatial dependence then [the pooled within strata variance] will be less than $s^{2}$, and so the variance and standard error of … stratified [RS] will be less than that of a simple random sample for the same effort, the same size of sample. … If we were happy with the precision achieved by [SRS] then we could get the same precision by stratification with a smaller sample [size]. Stratified [RS] is more efficient by the [multiplicative] factor

_{random}/n

_{stratified}],

#### 3.3. Classical CLT Stipulated Calculations and $\overline{y}$ Behavior in the Presence of SA

## 4. Alternative Sampling Designs and Model-Based Inference

#### 4.1. Some Reasons Why Designs Utilize Other Than Simple/Unrestricted RS

#### 4.2. Some Reasons Why SA Encourages Model-Based Inference

_{i} is a random, rather than a fixed (as in design-based inference), quantity (see Appendix A). Now the random component comes from stochastic processes affiliated with RV Y in a superpopulation, in lieu of SRS. This model-based conceptualization presents the following two different inference problems pertaining to: (1) the realization mean, $\overline{\mathrm{y}}$, which varies from realization to realization in a way analogous to how this calculation varies across all possible SRS samples of size n; and, (2) the superpopulation mean counterpart to $\overline{\mathrm{y}}$, namely μ. The simplest descriptive model specification (i.e., no covariates) posits $\mathbf{Y} = \mu\mathbf{1} + \boldsymbol{\epsilon}$, where $\mathbf{Y}$, $\mathbf{1}$, and $\boldsymbol{\epsilon}$ are n-by-1 vectors, with all of the elements of $\mathbf{1}$ being one, and the elements of $\boldsymbol{\epsilon}$ being drawings from a designated probability model (e.g., normal) with a zero mean (because scalar μ already is in the postulated specification) and finite nonzero variance; this is the case treated by Griffith [17]. It spawns a $\overline{\mathrm{y}}$ estimate with an attendant superpopulation variance estimator that simplifies to the one for SRS applied to an infinite population (i.e., no correction factor; hence the approximation variety for finite populations having n ≤ 0.05N) [52] (p. 43).

$(\mathbf{Y} - \mu\mathbf{1})^{T}\mathbf{I}(\mathbf{Y} - \mu\mathbf{1})/(\mathbf{1}^{T}\mathbf{I}\mathbf{1})$ to $(\mathbf{Y} - \mu\mathbf{1})^{T}\mathbf{V}^{-1}(\mathbf{Y} - \mu\mathbf{1})/(\mathbf{1}^{T}\mathbf{V}^{-1}\mathbf{1})$, where $\mathbf{I}$ denotes the identity matrix and $\mathbf{V}$ denotes the SA structure operator instilling geographic covariation. This conversion signals the importance of whether or not matrix $\mathbf{V} \ne \mathbf{I}$ matters when implementing SRS. Both expert opinion and content in Table 3 and Table 4 imply that it does.
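The two quadratic-form expressions can be evaluated side by side. The sketch below is only illustrative: the matrix **V** is a hypothetical choice (correlation decaying as ρ^{|i − j|} along a one-dimensional transect, with ρ = 0.7), not a structure taken from the paper, and the function name `scaled_quadratic_form` is likewise an assumption.

```python
import numpy as np

def scaled_quadratic_form(y, mu, V):
    """Evaluate (Y - mu 1)^T V^{-1} (Y - mu 1) / (1^T V^{-1} 1)."""
    ones = np.ones_like(y)
    resid = y - mu * ones
    # Solve linear systems rather than forming V^{-1} explicitly
    return (resid @ np.linalg.solve(V, resid)) / (ones @ np.linalg.solve(V, ones))

n, rho, mu = 50, 0.7, 25.0
idx = np.arange(n)
# Hypothetical SA structure: correlation decays with separation on a transect
V = rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(2)
# One realization with covariance V around the constant mean mu
y = mu + np.linalg.cholesky(V) @ rng.standard_normal(n)

with_identity = scaled_quadratic_form(y, mu, np.eye(n))
with_sa = scaled_quadratic_form(y, mu, V)
print(f"V = I:  {with_identity:.4f}")
print(f"V != I: {with_sa:.4f}")
```

When **V** = **I**, the expression collapses to the mean squared deviation Σ(y_{i} − μ)^{2}/n, so any difference between the two printed values is attributable solely to the assumed covariation structure.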

#### 4.3. A Monte Carlo Simulation Experiment Investigating Design- vs. Model-Based Inference

^{2} in the presence of SA (see §2). His contention compels an exploration of the meaning of underestimation in terms of the way he uses the word, which, in part, becomes conditional upon the extent or scale of the geographic landscape in question. Table 4 already illustrates that variance estimation is insensitive to a change in sampling design from SRS to stratified RS when an attribute variable exhibits a purely random mixture geographic distribution. Moreover, if strata are ineffective (e.g., they capture the presence of zero SA) in a stratified design, the estimated variance is expected to be the same as that for SRS. However, if these strata are effective (i.e., they capture prevailing non-zero SA, grouping within each stratum units that had roughly similar y-values), then the stratified variance estimator would capture the improved precision of the estimated mean achieved by this stratified design. Although design-based inference does not assume that all populations are random with no structure, the design-based variance estimators properly take into account features of strata and/or clusters. This is not the outcome when positive SA prevails because similar values cluster on its map, recurrently materializing as a conspicuous two-dimensional pattern, catapulting the notion of variance inflation (the only SRS variance estimation declaration made in Griffith [17]) to the forefront of SA distortion concerns (see Table 4).
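The contrast between effective and ineffective strata can be mimicked on a toy landscape. What follows is a sketch under loudly stated assumptions, not the Table 4 experiment: the grid is 40-by-40 rather than 200-by-200, positive SA is imitated by a smooth sinusoidal trend plus a little noise, the strata are 25 square blocks with one draw each, and all names (`sa_map`, `random_map`, `stratified_mean`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
SIDE, BLOCK = 40, 8  # 40-by-40 grid split into 5-by-5 = 25 square strata

# Hypothetical positively autocorrelated surface: smooth trend plus noise
r, c = np.indices((SIDE, SIDE))
surface = (np.sin(np.pi * r / SIDE) + np.cos(np.pi * c / SIDE)
           + 0.1 * rng.standard_normal((SIDE, SIDE)))
sa_map = 25 + 5 * (surface - surface.mean()) / surface.std()
# Same values with locations shuffled: a purely random mixture
random_map = rng.permutation(sa_map.ravel()).reshape(SIDE, SIDE)

def srs_mean(grid, n):
    """Mean of a simple random sample of n cells, without replacement."""
    return rng.choice(grid.ravel(), size=n, replace=False).mean()

def stratified_mean(grid):
    """Mean of one randomly located cell from each square stratum."""
    k = SIDE // BLOCK
    rows = BLOCK * np.repeat(np.arange(k), k) + rng.integers(0, BLOCK, size=k * k)
    cols = BLOCK * np.tile(np.arange(k), k) + rng.integers(0, BLOCK, size=k * k)
    return grid[rows, cols].mean()

def se(draw, reps=2000):
    """Standard deviation of the replicated sample means."""
    return np.std([draw() for _ in range(reps)], ddof=1)

results = {
    name: (se(lambda: stratified_mean(grid)), se(lambda: srs_mean(grid, 25)))
    for name, grid in [("positive SA", sa_map), ("random mixture", random_map)]
}
for name, (s_strat, s_srs) in results.items():
    print(f"{name}: stratified {s_strat:.3f}, SRS {s_srs:.3f}")
```

Under these assumptions the stratified standard error should fall well below its SRS counterpart on the autocorrelated map, whereas the two should nearly coincide on the shuffled map, in keeping with the qualitative pattern the paragraph attributes to Table 4.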

## 5. The Role of iid in Sampling Designs

$\mathrm{M}_{\mathrm{Y}}(t) = e^{\mu t + \sigma^{2}t^{2}/2}$. Its accompanying arithmetic mean MGF is $\mathrm{M}_{\overline{\mathrm{Y}}}(t) = \left[e^{\mu t/n + \sigma^{2}(t/n)^{2}/2}\right]^{n} = e^{\mu t + (\sigma^{2}/n)t^{2}/2}$, with the population iid property authorizing the product of the n individual observation MGFs as a power function. The resulting general sample MGF is that for a normal RV with a mean of μ and a variance of $\sigma^{2}/n$, exact CLT outcomes. In other words, an n = 1 iid normal RV population sampling distribution achieves instant convergence, as the law of large numbers also confirms for samples of size n = 1. The question applied researchers find themselves asking, then, is how to detect the underlying RV nature of their data. Answering this question diverts them to an inspection of a histogram portrayal of their sample of attribute values, and hence often a goodness-of-fit test for the theoretical RV they hypothesize. Determining an appropriate sample size for a meaningful evaluation of normality is a difficult task. All goodness-of-fit diagnostics suffer from the following dual weaknesses: small sample sizes allow identification of only the most aberrant departures, whereas very large sample sizes invariably magnify trivial departures from normality, often delivering either statistically nonsignificant but substantively important, or statistically significant but substantively unimportant, inferential conclusions. Accordingly, normality diagnostic statistics require a minimum sample size before they become informative; tests of normality have notoriously low power to detect non-normality in small samples. Graphs appearing in Razali and Yap [54] corroborate this contention; only normal approximations, the recipients of such assessments, exist in the real world, as conventional statistical wisdom purports that n needs to be between 30 and 100 for a sound statistical inference about a RV name. Razali and Yap [54] also furnish some contemporary wisdom based upon evidence from Shapiro–Wilk normality diagnostic statistical power simulation experiments: the minimum sample size for deciding upon an underlying RV appears to be about 250 (e.g., the chi-squared, gamma, and uniform RVs), and maybe more (e.g., the t-distribution RV) in order to maximize statistical power.
Table 6 tabulates specimen results for a small array of RVs that span the complete spectrum of PDF forms, all with μ = 1 and σ = 1 by construction: bell-shaped, right-skewed, uniform, symmetric sinusoidal (i.e., U-shaped beta), and mixed (a blend of these previous four RVs). The first four of these table entries involve iid RV populations, with their analytics being expressions similar to the preceding normal RV power function. In other words, the sampling distribution MGF is $\left[\mathrm{M}_{\mathrm{Y}}(t/n)\right]^{n}$ rather than $\prod_{i=1}^{n}\mathrm{M}_{\mathrm{Y}_{i}}(t/n)$, the more traditional mathematical expression that may be rewritten as a convex combination of separate orthodox PDFs (e.g., see Table 6), representing sampled independent but non-id observations.
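How quickly a non-normal iid population approaches the CLT limit can be inspected numerically. The sketch below assumes the exponential RV with μ = σ = 1; for this case the mean of n iid draws is known to have skewness 2/√n and kurtosis 3 + 6/n, which the simulation should approximate. The function name `mean_skew_kurt` and the replication count are illustrative choices.

```python
import numpy as np

def mean_skew_kurt(n, reps=100_000, seed=4):
    """Simulated skewness and kurtosis of the sampling distribution of the
    mean of n iid exponential draws with mu = sigma = 1."""
    rng = np.random.default_rng(seed)
    means = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    z = (means - means.mean()) / means.std()
    return (z ** 3).mean(), (z ** 4).mean()

skew_5, kurt_5 = mean_skew_kurt(5)
skew_30, kurt_30 = mean_skew_kurt(30)

for n, skew, kurt in [(5, skew_5, kurt_5), (30, skew_30, kurt_30)]:
    print(f"n = {n:2d}: skewness {skew:.3f} (theory {2 / np.sqrt(n):.3f}), "
          f"kurtosis {kurt:.3f} (theory {3 + 6 / n:.3f})")
```

Both moments drift toward their normal values of 0 and 3 as n grows, but they do not arrive instantly as they do for the normal RV.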

_{N} + 25p_{E} + p_{U} − 5)/(10n^{2}), where p_{N}, p_{E}, and p_{U}, respectively, denote the mixing weights attached to the normal, exponential, and uniform RV components [i.e., p_{S} = (1 − p_{N} − p_{E} − p_{U}) for the sinusoidal RV]; (2) almost always encompass unidentified component RVs; and, (3) usually entail unknown mixing weights. These first two points jointly mean that the largest component RV minimum sample size defines n for ensuring that the CLT begins to regulate a sampling distribution; hence, n often tends to be near the end, rather than the beginning, of the routine n = 30–100 range. Meanwhile, the set of mixing weights, p_{j}, follows a multinomial distribution, impacting the minimum sample size when one goal of a sampling design is to draw a sample composition closely proportional to the prevailing set of mixing weights (i.e., representativeness, avoiding the 14 medleys other than at least one observation from each RV, unattainable with stratified RS because of ignorance about mixture properties). Accordingly, the smallest p_{j} harmonizes with one of the escorting RV convergence rates to delineate a suitable minimum sample size; e.g., invoking the 6σ principle, if the minimum p_{j} is 0.05, then, for example, 0.05 − √[(0.05)(0.95)/n] = 0.04 ⇒ n = 2850 ⇒ N ≥ 57,000 (magnitudes more akin to those for the law of large numbers). In other words, suppressing the id part of the iid assumption tends to increase the necessary minimum sample size, exemplifying the notorious inclination for mixture RVs to present statisticians with serious technical challenges.

_{j}, or the cumulative component n_{j}s (i.e., these subset sizes sum to n) across RVs, goes to infinity. Unfortunately, when mixture distribution inflection points occur at component RV means, CLT-based confidence intervals can become distorted, which is one of the prominent examples of technical challenges arising when dealing with this category of RV [55]. Furthermore, without iid, the critical RV assumptions revert to a finite mean and a finite variance; without id, distinctiveness of component RVs may become critical, and SRS sampling variance can deteriorate. Nonetheless, among its inventory of well-known properties, the classical CLT still fails to preside over a mixture containing a Cauchy RV.
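The Cauchy caveat is easy to see numerically. The following sketch assumes a hypothetical two-component mixture (standard Cauchy with weight 0.25, standard normal otherwise) and tracks the interquartile range (IQR) of the sample mean, because the variance of a Cauchy RV does not exist; the names `mixture` and `iqr_of_means` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def mixture(size, p_cauchy=0.25):
    """Draws from a normal RV contaminated by a standard Cauchy component
    with mixing weight p_cauchy (a hypothetical choice)."""
    is_cauchy = rng.random(size) < p_cauchy
    return np.where(is_cauchy, rng.standard_cauchy(size), rng.standard_normal(size))

def iqr_of_means(sampler, n, reps=4000):
    """IQR of the sampling distribution of the mean of n draws."""
    means = sampler((reps, n)).mean(axis=1)
    q1, q3 = np.percentile(means, [25, 75])
    return q3 - q1

iqr = {
    (name, n): iqr_of_means(sampler, n)
    for name, sampler in [("normal", rng.standard_normal), ("mixture", mixture)]
    for n in (10, 1000)
}
for (name, n), value in iqr.items():
    print(f"{name:8s} n = {n:4d}: IQR of the mean = {value:.3f}")
```

For the pure normal RV the IQR of the mean shrinks in proportion to 1/√n, whereas for the contaminated mixture it should level off near the IQR of the Cauchy component scaled by its mixing weight, no matter how large n becomes.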

## 6. Discussion, Implications, and Concluding Comments

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A. Dissecting Sources Of Variation

_{i}) be zero, although each can deal with it otherwise. Both are capable of accommodating the RS error (i.e., ξ_{i}) resulting from the use of a subset n of a (super)population of size N. Both consider some landscape-wide mean (i.e., $\mu_{\mathrm{C}}$), attributing the remaining part of a variable value quantity to an observation-unique term (i.e., $\mu_{\mathrm{NC},i} + \epsilon_{i}$). For the design-based approach, $\sum_{i=1}^{n}\left(\mu_{\mathrm{NC},i} + \epsilon_{i}\right) = 0$, even in its estimated version for a given sample, because $\mu_{\mathrm{C}}$ or its estimate absorbs any non-zero part of this sum. In contrast, the model-based approach has an equational specification describing this composite term, with the conditional estimates of the individual terms $\mu_{\mathrm{NC},i}$ and $\epsilon_{i}$ separately summing to zero (i.e., $\sum_{i=1}^{n}\mu_{\mathrm{NC},i} = 0$ and $\sum_{i=1}^{n}\epsilon_{i} = 0$) for the same reason. The design-based approach can invoke stratified RS to capture $\mu_{\mathrm{NC},i}$ effects, which Table 3 tabulations reflect. The variance inflation of interest here, which links to an effective geographic sample size when SA prevails, is attributable to assuming a constant mean, and hence failing to account for any variance contribution by the systematic source $\mu_{\mathrm{NC},i}$.

## References

1. Brus, D. Statistical approaches for spatial sample survey: Persistent misconceptions and new developments. Eur. J. Soil Sci. **2021**, 72, 686–703.
2. Griffith, D. A family of correlated observations: From independent to strongly interrelated ones. Stats **2020**, 3, 166–184.
3. Lebart, L. Analyse statistique de la contiguïté. Publ. Inst. Stat. Univ. Paris **1969**, 3, 81–112.
4. Besag, J.; York, J.; Mollié, A. Bayesian image restoration, with two applications in spatial statistics. Ann. Inst. Stat. Math. **1991**, 43, 1–20.
5. Wall, M. A close look at the spatial structure implied by the CAR and SAR models. J. Stat. Plan. Infer. **2004**, 121, 311–324.
6. Wakefield, J. Sensitivity analyses for ecological regression. Biometrics **2003**, 59, 9–17.
7. Hawkins, B.; Diniz-Filho, J.; Bini, L.; De Marco, P.; Blackburn, T. Red herrings revisited: Spatial autocorrelation and parameter estimation in geographical ecology. Ecography **2007**, 30, 375–384.
8. Hodges, J.; Reich, B. Adding spatially-correlated errors can mess up the fixed effect you love. Am. Stat. **2010**, 64, 325–334.
9. Griffith, D.; Lagona, F. On the quality of likelihood-based estimators in spatial autoregressive models when the data dependence structure is misspecified. J. Stat. Plan. Infer. **1998**, 69, 153–174.
10. LeSage, J.; Pace, R. The biggest myth in spatial econometrics. Econometrics **2014**, 2, 217–239.
11. Partridge, M.; Boarnet, M.; Brakman, S.; Ottaviano, G. Introduction: Whither spatial econometrics? J. Reg. Sci. **2012**, 52, 167–171.
12. Lark, R.; Cullis, B. Model-based analysis using REML for inference from systematically sampled data on soil. Eur. J. Soil Sci. **2004**, 55, 799–813.
13. Hansen, M.; Madow, W.; Tepping, B. An evaluation of model-dependent and probability-sampling inferences in sample surveys. J. Am. Stat. Assoc. **1983**, 78, 776–793.
14. Brus, D.; de Gruijter, J. Random sampling or geostatistical modelling? Choosing between design-based and model-based sampling strategies for soil (with discussion). Geoderma **1997**, 80, 1–44.
15. Papageorgiou, I. Sampling from correlated populations: Optimal strategies and comparison study. Sankhya B **2016**, 78, 119–151.
16. Gilks, W.; Richardson, S.; Spiegelhalter, D. Markov Chain Monte Carlo in Practice; Chapman and Hall: London, UK, 1996.
17. Griffith, D. Effective geographic sample size in the presence of spatial autocorrelation. Ann. Assoc. Am. Geogr. **2005**, 95, 740–760.
18. Plant, R.E. Spatial Data Analysis in Ecology and Agriculture Using R; CRC Press: Boca Raton, FL, USA, 2012.
19. Wang, J.; Haining, R.; Cao, Z. Sample surveying to estimate the mean of a heterogeneous surface: Reducing the error variance through zoning. J. Geogr. Info. Sci. **2010**, 24, 523–543.
20. Webster, R.; Oliver, M. Geostatistics for Environmental Scientists, 2nd ed.; Wiley: Chichester, UK, 2007.
21. Skinner, C.; Holt, D.; Smith, T. (Eds.) Analysis of Complex Surveys; Wiley: New York, NY, USA, 1989.
22. Särndal, C.-E.; Swensson, B.; Wretman, J. Model Assisted Survey Sampling; Springer: New York, NY, USA, 1992.
23. Fisher, R. The arrangement of field experiments. J. Ministr. Agric. **1926**, 33, 503–513.
24. Tedin, O. The influence of systematic plot arrangement upon the estimate of error in field experiments. J. Agric. Sci. **1931**, 21, 191–208.
25. Yates, F. Sir Ronald Fisher and the design of experiments. Biometrics **1964**, 20, 307–321.
26. Cochran, W. Relative accuracy of systematic and random samples for a certain class of populations. Ann. Math. Stat. **1946**, 17, 164–177.
27. Lahiri, S.; Lahiri, S. Resampling Methods for Dependent Data; Springer Science & Business Media: New York, NY, USA, 2003.
28. Cressie, N. Statistics for Spatial Data; Wiley: New York, NY, USA, 1991.
29. Schabenberger, O.; Gotway, C. Statistical Methods for Spatial Data Analysis; Chapman & Hall: Boca Raton, FL, USA, 2005.
30. Clifford, P.; Richardson, S.; Hemon, D. Assessing the significance of the correlation between two spatial processes. Biometrics **1989**, 45, 123–134.
31. Acosta, J.; Vallejos, R. Effective sample size for spatial regression models. Electron. J. Stat. **2018**, 12, 3147–3180.
32. Vallejos, R.; Acosta, J. The effective sample size for multivariate spatial processes with an application to soil contamination. Nat. Resour. Mod. **2021**, 34, 12–22.
33. Dutilleul, P.; Pelletier, B.; Alpargu, G. Modified F tests for assessing the multiple correlation between one spatial process and several others. J. Stat. Plan. Infer. **2008**, 138, 1402–1415.
34. Dale, M.; Fortin, M. Spatial autocorrelation and statistical tests: Some solutions. J. Agric. Biol. Environ. S. **2009**, 14, 188–206.
35. Renner, I.; Warton, D.; Hui, F. What is the effective sample size of a spatial point process? Aust. N. Z. J. Stat. **2021**, 63, 144–158.
36. de Gruijter, J.; ter Braak, C. Model-free estimation from spatial samples: A reappraisal of classical sampling theory. Math. Geol. **1990**, 22, 407–415.
37. Acosta, J.; Vallejos, R.; Griffith, D. On the effective geographic sample size. J. Stat. Comput. Sim. **2018**, 88, 1958–1975.
38. Acosta, J.; Alegría, A.; Osorio, F.; Vallejos, R. Assessing the effective sample size for large spatial datasets: A block likelihood approach. Comput. Stat. Data Anal. **2021**, 162, 107–282.
39. Rubin, D. An evaluation of model-dependent and probability-sampling inferences in sample surveys: Comment. J. Am. Stat. Assoc. **1983**, 78, 803–805.
40. Overton, S.; Stehman, S. Properties of designs for sampling continuous spatial resources from a triangular grid. Commun. Stat. **1993**, 22, 251–264.
41. Griffith, D. Eigenfunction properties and approximations of selected incidence matrices employed in spatial analyses. Linear Algebra Appl. **2000**, 321, 95–112.
42. Menard, S. Applied Logistic Regression Analysis, 2nd ed.; SAGE: Los Angeles, CA, USA, 2001.
43. Vittinghoff, E.; Glidden, D.; Shiboski, S.; McCulloch, C. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models, 2nd ed.; Springer: New York, NY, USA, 2012.
44. Johnston, R.; Jones, K.; Manley, D. Confounding and collinearity in regression analysis: A cautionary tale and an alternative procedure, illustrated by studies of British voting behavior. Qual. Quant. **2018**, 52, 1957–1976.
45. Milliken, G.; Johnson, D. Analysis of Messy Data, Vol. I; Chapman & Hall/CRC Press: Boca Raton, FL, USA, 1989.
46. Griffith, D. Estimating spatial autoregressive model parameters with commercial statistical packages. Geogr. Anal. **1988**, 20, 176–186.
47. Wadoux, A.; Marchant, B.; Lark, R. Efficient sampling for geostatistical surveys. Eur. J. Soil Sci. **2019**, 70, 975–989.
48. Besag, J. On the statistical analysis of dirty pictures. J. R. Stat. Soc. Ser. B (Methodol.) **1986**, 48, 259–302.
49. Griffith, D.; Liau, Y.-T. Imputed spatial data: Cautions arising from response and covariate imputation measurement error. Spat. Stat. **2021**, 42, 100419.
50. Ryan, T. Sample Size Determination and Power; Wiley: New York, NY, USA, 2013.
51. Lakens, D. The practical alternative to the p value is the correctly used p value. Perspect. Psychol. Sci. **2021**, 16, 639–648.
52. Kangas, A. Design-based sampling and inference. In Forestry Inventory: Methodology and Applications; Kangas, A., Maltamo, M., Eds.; Springer: Dordrecht, The Netherlands, 2006; pp. 39–51.
53. Hoeffding, W. The large-sample power of tests based on permutations of observations. Ann. Math. Stat. **1952**, 23, 169–192.
54. Razali, N.; Yap, B. Power comparisons of Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors and Anderson–Darling tests. J. Stat. Mod. Anal. **2011**, 2, 21–33.
55. Zheng, J.; Frey, H. Quantification of variability and uncertainty using mixture distributions: Evaluation of sample size, mixing weights, and separation between components. Risk. Anal. **2004**, 24, 533–571.
56. Böhning, D.; Seidel, W.; Alfó, M.; Garel, B.; Patilea, V.; Walther, G. Editorial: Advances in mixture models. Comput. Stat. Data An. **2007**, 51, 5205–5210.
57. Zhang, J.; Huang, Y. Finite mixture models and their applications: A review. Austin Biomet. Biostat. **2015**, 2, 1013.
58. Chen, J. On finite mixture models. Stat. Theory Rel. Fields **2017**, 1, 15–27.
59. McLachlan, G.; Lee, S.; Rathnayake, S. Finite mixture models. Annu. Rev. Stat. Appl. **2019**, 6, 355–378.
60. Mukhopadhyay, N.; Son, M. On the covariance between the sample mean and variance. Commun. Stat. **2011**, 22, 1142–1148.
61. Heeringa, S.; West, B.; Berglund, P. Applied Survey Data Analysis, 2nd ed.; Chapman and Hall/CRC: London, UK, 2017.
62. Stehman, S.; Overton, W. Comparison of variance estimators of the Horvitz-Thompson estimator for randomized variable probability systematic sampling. J. Am. Stat. Assoc. **1994**, 89, 30–43.

**Figure 1.** The specimen mixture RV composed of a normal, exponential, uniform, and sinusoidal RV with equal mixing weights (i.e., ¼). Left (**a**): its PDF. Right (**b**): overlaid individual RV PDFs.

n | Central Limit Theorem: $\mu_{\overline{y}}$ | Central Limit Theorem: $\sigma_{\overline{y}}$ | n/2 SRS Selections Doubled: $\hat{\mu}_{\overline{y}}^{*}$ | n/2 SRS Selections Doubled: $\hat{\sigma}_{\overline{y}}^{*}$ | n SRS Selections: $\hat{\mu}_{\overline{y}}$ | n SRS Selections: $\hat{\sigma}_{\overline{y}}$ | $\left(\hat{\sigma}_{\overline{y}}^{*}/\hat{\sigma}_{\overline{y}}\right)^{2}$ |
---|---|---|---|---|---|---|---|
30 | 0 | 0.18257 | 0.00359 | 0.25759 | 0.00337 | 0.18197 | 2.00382 |
100 | 0 | 0.10000 | 0.00021 | 0.14135 | 0.00082 | 0.10085 | 1.96444 |
500 | 0 | 0.04472 | 0.00026 | 0.06272 | 0.00026 | 0.04433 | 2.00178 |
1000 | 0 | 0.03162 | −0.00013 | 0.04428 | 0.00015 | 0.03148 | 1.97854 |
5000 | 0 | 0.01414 | −0.00016 | 0.01995 | −0.00012 | 0.01404 | 2.01907 |
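The final column of the preceding table, a variance ratio hovering around 2, can be reproduced in spirit with a few lines of simulation. This is a minimal sketch that assumes a standard normal population (consistent with the tabulated CLT values of 0 and 1/√n) and examines only n = 30; the function name `se_of_mean` is hypothetical.

```python
import numpy as np

def se_of_mean(n, doubled, reps=10_000, seed=6):
    """Empirical standard error of the mean for a genuine sample of size n,
    or for n/2 selections each repeated once to fake a sample of size n."""
    rng = np.random.default_rng(seed)
    if doubled:
        half = rng.standard_normal((reps, n // 2))
        samples = np.concatenate([half, half], axis=1)  # artificial enlargement
    else:
        samples = rng.standard_normal((reps, n))
    return samples.mean(axis=1).std(ddof=1)

n = 30
se_doubled = se_of_mean(n, doubled=True)
se_genuine = se_of_mean(n, doubled=False)
ratio = (se_doubled / se_genuine) ** 2

print(f"doubled: {se_doubled:.5f}, genuine: {se_genuine:.5f}, variance ratio: {ratio:.3f}")
```

Repeating each observation adds no information, so the mean of the enlarged sample retains the variance of a sample of size n/2, which is twice that of a genuine sample of size n.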

**Table 2.** Simulated CLT results for the difference of two means sampling distributions, with ρ = 0.9 for the correlated samples; 10,000 replications.

Sampling Design | n_{k} | Difference of Means: μ_{1} − μ_{2} = 0 | Standard Error ^{†} for σ_{1} = σ_{2} | P(K–S) ^{‡} Normality Diagnostic |
---|---|---|---|---|
two independent samples | 30 | 0.0002 | 0.2595 | > 0.15 |
one correlated sample | 30 | –0.0002 | 0.0818 | 0.1362 |
two independent samples | 100 | –0.0009 | 0.1414 | > 0.15 |
one correlated sample | 100 | –0.0006 | 0.0443 | 0.1269 |

^{†} the CLT values here are $\sqrt{2/n_{k}}$ (i.e., respectively, 0.2582 and 0.1414) for independent samples, and $\sqrt{2(1-\rho)/n_{k}}$ (i.e., respectively, 0.0816 and 0.0447) for correlated samples.

^{‡} denotes the probability of the Kolmogorov–Smirnov (K–S) normality goodness-of-fit diagnostic statistic.

**Table 3.** Varying n SRS versus fixed small n stratified RS; population: β_{0} = 0, β_{1} = 1, R^{2} = 1, and CV = 1; 10,000 replications.

n | SRS: $\hat{\beta}_{0}$ | SRS: $\hat{\beta}_{1}$ | SRS: R^{2} | SRS: CV | n_{1} = n_{2} = 2 stratified RS: $\hat{\beta}_{0}$ | n_{1} = n_{2} = 2 stratified RS: $\hat{\beta}_{1}$ | n_{1} = n_{2} = 2 stratified RS: R^{2} | n_{1} = n_{2} = 2 stratified RS: CV | variance ratio: $\hat{\beta}_{0}$ | variance ratio: $\hat{\beta}_{1}$ | variance ratio: R^{2} | variance ratio: CV |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 4.740 | 0.683 | 0.778 | 350.016 | 0.025 | 0.997 | 0.990 | 4.020 | 201.8 | 112.9 | 1191.1 | 1.9 × 10^{10} |
30 | 1.165 | 0.921 | 0.906 | 1.605 | 0.011 | 0.999 | 0.990 | 4.021 | 22.0 | 12.3 | 511.1 | 1600.2 |
100 | 0.619 | 0.957 | 0.934 | 1.091 | 0.017 | 0.998 | 0.990 | 4.020 | 7.4 | 4.1 | 224.7 | 49.0 |
500 | 0.439 | 0.969 | 0.945 | 1.011 | 0.026 | 0.997 | 0.990 | 4.021 | 2.5 | 1.4 | 107.0 | 0.3 |
1000 | 0.428 | 0.969 | 0.945 | 1.005 | 0.037 | 0.997 | 0.990 | 4.021 | 2.1 | 1.1 | 107.3 | <0.1 |
5000 | 0.424 | 0.969 | 0.944 | 1.001 | 0.023 | 0.998 | 0.990 | 4.020 | 1.7 | 0.9 | 95.7 | <0.1 |

**Table 4.** Sampling design comparisons in the presence of SA: a specimen 200-by-200 regular square tessellation geographic landscape; Y~N(μ = 25, σ = 5); 10,000 replications.

n | CLT: $\sigma_{\overline{y}}^{*}$ | Near-Maximum Positive SA, Stratified RS: $\overline{\overline{y}}$ | Near-Maximum Positive SA, Stratified RS: $s_{\overline{y}}$ | Near-Maximum Positive SA, SRS: $s_{\overline{y}}$ | Near-Zero Positive SA, Stratified RS: $\overline{\overline{y}}^{*}$ | Near-Zero Positive SA, Stratified RS: $s_{\overline{y}}^{*}$ | Near-Zero Positive SA, SRS: $s_{\overline{y}}^{*}$ | $\left(\hat{\sigma}_{\overline{y}}^{*}/\hat{\sigma}_{\overline{y}}\right)^{2}$ |
---|---|---|---|---|---|---|---|---|
25 | 1.000 | 24.99683 | 0.42343 | 1.01057 | 24.98502 | 0.99711 | 0.99236 | 5.54527 |
64 | 0.625 | 25.00073 | 0.19666 | 0.62203 | 24.99690 | 0.62707 | 0.62656 | 10.16717 |
100 | 0.500 | 25.00156 | 0.13642 | 0.50118 | 24.99648 | 0.50005 | 0.50507 | 13.43602 |
400 | 0.250 | 25.00080 | 0.05437 | 0.24656 | 25.00249 | 0.24988 | 0.24887 | 21.12245 |
625 | 0.200 | 25.00006 | 0.04171 | 0.19917 | 24.99697 | 0.20093 | 0.19827 | 23.20648 |
1600 | 0.125 | 24.99974 | 0.02448 | 0.12267 | 25.00148 | 0.12364 | 0.12104 | 25.50910 |

Near-maximum positive SA: MC ≈ 1, GR ≈ 0.03; near-zero positive SA: MC ≈ 0, GR ≈ 1.

**Table 5.** SRS variance estimates for landscape-centered quadrant-of-the-plane subregions of a larger complete geographic landscape; 10,000 replications.

n | Data Type | Standard Deviation: Landscape-Wide | Standard Deviation: Q_{1} | Standard Deviation: Q_{2} | Standard Deviation: Q_{3} | Standard Deviation: Q_{4} | Levene Test: Mean Probability | Levene Test: % < 0.10 |
---|---|---|---|---|---|---|---|---|
population | --- | 5 | 5 | 5 | 5 | 5 | 1 | --- |
realization (superpopulation; n = 40,000) | iid | 5 | 5.01 | 4.98 | 5.03 | 4.97 | 0.64 | --- |
realization (superpopulation; n = 40,000) | SA | 5 | 4.09 | 4.09 | 4.06 | 4.11 | 0.69 | --- |
30 | iid | 4.95 | 4.99 | 4.95 | 4.99 | 4.94 | 0.49 | 10.5 |
30 | SA | 4.96 | 4.06 | 4.05 | 4.02 | 4.08 | 0.49 | 10.2 |

Q_{j} denotes the j^{th} customary quadrant of the plane, in a counterclockwise direction; for iid data, MC = 0.00 and GR = 1.00; and for SA data, MC = 0.50 and GR = 0.50.

Feature | Normal | Exponential | Uniform | Sinusoidal | Mixture ^{†} |
---|---|---|---|---|---|
μ | 1 | 1 | 1 | 1 | 1 |
σ | 1/n | 1/n | 1/n | 1/n | 1/n |
CLT skewness | 0 | 0 | 0 | 0 | 0 |
CLT kurtosis | 3 | 3 | 3 | 3 | 3 |
CLT skewness convergence rate | instant | −1/n^{3/2} | instant | instant | −1/(8n)^{3/2} |
CLT kurtosis convergence rate | instant | −6/n^{2} | 6/(5n^{2}) | 3/(2n^{2}) | −33/(40n^{2}) |

^{†} simulation experiment verification of findings using sample sampling distributions approximated with 300 samples of various sizes, and 1000 replications.


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Griffith, D.A.; Plant, R.E. Statistical Analysis in the Presence of Spatial Autocorrelation: Selected Sampling Strategy Effects. *Stats* **2022**, *5*, 1334-1353.
https://doi.org/10.3390/stats5040081
