# Improving Estimates of Natural Resources Using Model-Based Estimators: Impacts of Sample Design, Estimation Technique, and Strengths of Association

## Abstract

## 1. Introduction

[…] $y_{i}, \ldots, y_{n}$ are regarded as realizations of random variables $Y_{i}, \ldots, Y_{n}$ and hence the population is a realization of a random process”. The primary difference is that “in the model-based approach [inference] stems from the model, not from the sampling design”. While some rely on the availability of models as justification for deviating from probabilistic designs when drawing inferences, it is critical to remember that with natural systems we seldom understand, or can collect, all the information needed to build models that completely describe the complexities of those systems. This can lead to model misspecification, localized deviations in relationships, and more generally to models that do not describe a specific population well. Therefore, calibrating models with sample designs that include randomness in the selection process is critical to developing models that can be used to estimate population and subpopulation parameters. Moreover, probabilistic designs that ensure that samples are widely distributed across feature space can reduce the variability of population estimates.

## 2. Theoretical Background

## 3. Materials and Methods

#### 3.1. Simulated Raster Surfaces

[…]² of a mixed forested/urban landscape (Figure 2) with a nominal pixel resolution of 1 m². The four spectral bands of the NAIP image ($x_{i}$; $i = 1, 2, 3, 4$) were transformed to produce three distinct continuous surfaces (SC1, Figure 2). The first SC1 surface was obtained by averaging the band values for each pixel:

[…]

[…]$_{j}$ is an indicator variable taking the value 1 if the value of band 4 for pixel $j$ falls between 170 and 210 and is 0 otherwise.
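The two operations described above, averaging the four band values per pixel and flagging pixels whose band 4 value falls between 170 and 210, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and pixel values are hypothetical, the bounds are assumed to be inclusive, and how the indicator enters the other SC1 surfaces is not shown because that detail is not available here.

```python
from statistics import mean


def band_average(pixel):
    """Average the four band values of one pixel (x_1..x_4)."""
    return mean(pixel)


def band4_indicator(pixel, low=170, high=210):
    """Return 1 if band 4 lies between low and high, else 0.

    Treating both bounds as inclusive is an assumption of this sketch.
    """
    return 1 if low <= pixel[3] <= high else 0


# Hypothetical pixels, each a tuple of (band 1, band 2, band 3, band 4).
pixels = [(120, 130, 110, 180), (60, 80, 70, 220), (200, 190, 210, 170)]

mean_surface = [band_average(p) for p in pixels]
indicator = [band4_indicator(p) for p in pixels]
```

For a full raster the same logic would be applied cell by cell, typically with an array library rather than Python lists.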

[…]$_{kj} = \mu_{kj} + \varepsilon_{kj}$ (Figure 2), where the $\varepsilon_{kj}$ are independent, identically distributed errors. Normally distributed errors were generated using a standard deviation proportional to the standard deviation of pixel values within a given SC1 and a mean of zero; the constants of proportionality were set to 20%, 40%, 60%, and 80% to provide an equally spaced range of noise added to each relationship and to evaluate the impact of noise on model estimation. The realized errors produced homoscedastic surfaces with error values centered at zero before being added to SC1 surfaces. In total, 12 unique continuous response surfaces were created, with various shapes and amounts of error (% Noise).

#### 3.2. Sample Designs

#### 3.3. Model Calibration and Estimation

#### 3.4. Evaluation

[…] ($p_{i}$) around each selected pixel of a particular sample in multidimensional predictor space or in geographic space. Inclusion probabilities ($n/N$) were then summed within each polytope, and the variance of these sums was then calculated:

[…]

[…]$_{i}$ is the sum of inclusion probabilities in $p_{i}$.
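The spread calculation described above can be sketched as follows, assuming that each population unit belongs to the polytope of its nearest sampled unit and that every unit has inclusion probability n/N. The function name `spread`, the brute-force nearest-neighbor search, the tie-breaking rule (first sampled unit wins), and the use of the population form of the variance are assumptions of this sketch.

```python
import math
from statistics import pvariance


def spread(population, sample_idx):
    """Variance of inclusion-probability sums over nearest-sample polytopes.

    population: list of coordinate tuples (feature or geographic space).
    sample_idx: indices of the sampled units within population.
    """
    n, N = len(sample_idx), len(population)
    inclusion_prob = n / N
    sums = [0.0] * n
    for unit in population:
        # Assign the unit to the polytope of its closest sampled unit.
        nearest = min(
            range(n),
            key=lambda k: math.dist(unit, population[sample_idx[k]]),
        )
        sums[nearest] += inclusion_prob
    # The sums average exactly 1, so this equals mean((sum - 1)^2).
    return pvariance(sums)


# Hypothetical one-dimensional population of six units.
population = [(0,), (1,), (2,), (3,), (4,), (5,)]
well_spread = spread(population, [1, 4])   # balanced polytopes
clustered = spread(population, [0, 1])     # sample bunched at one end
```

In this toy case the well-spread sample gives a value of zero and the clustered sample gives a larger value, which matches the interpretation that smaller values indicate samples spread more evenly across the space.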

## 4. Results

#### 4.1. Allocation of Sample Units

#### 4.2. Model Calibration and Functional Relationships

#### 4.3. Estimates Using NAIP Bands as Predictors

#### 4.4. Impact of Spreading Sample Units in Feature Space

## 5. Discussion

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Thompson, M.P.; Wei, Y.; Calkin, D.E.; O’Connor, C.D.; Dunn, C.J.; Anderson, N.M.; Hogland, J.S. Risk Management and Analytics in Wildfire Response. Curr. For. Rep. **2019**, 5, 226–239.
2. Hogland, J.; Dunn, C.J.; Johnston, J.D. 21st Century Planning Techniques for Creating Fire-Resilient Forests in the American West. Forests **2021**, 12, 1084.
3. Hogland, J.; Affleck, D.L.R.; Anderson, N.; Seielstad, C.; Dobrowski, S.; Graham, J.; Smith, R. Estimating Forest Characteristics for Longleaf Pine Restoration Using Normalized Remotely Sensed Imagery in Florida USA. Forests **2020**, 11, 426.
4. McRoberts, R. Probability- and model-based approaches to inference for proportion forest using satellite imagery as ancillary data. Remote Sens. Environ. **2010**, 114, 1017–1025.
5. McRoberts, R. Satellite image-based maps: Scientific inference or pretty pictures? Remote Sens. Environ. **2011**, 115, 715–724.
6. Grafström, A.; Shao, X.; Nylander, M.; Petersson, H. A new sampling strategy for forest inventories applied to the temporary clusters of the Swedish national forest inventory. Can. J. For. Res. **2017**, 47, 1161–1167.
7. Gregoire, T.; Valentine, H. Sampling Strategies for Natural Resources and the Environment; Chapman & Hall: Boca Raton, FL, USA; London, UK; New York, NY, USA, 2008; 474p.
8. Gregoire, T. Design-based and model-based inference in survey sampling: Appreciating the difference. Can. J. For. Res. **1998**, 28, 1429–1447.
9. Stehman, S.; Czaplewski, R. Design and analysis for thematic map accuracy assessment: Fundamental principles. Remote Sens. Environ. **1998**, 64, 331–344.
10. Crowson, M.; Hagensieker, R.; Waske, B. Mapping Land cover change in northern Brazil with limited training data. Int. J. Appl. Earth Obs. GeoInf. **2019**, 78, 202–214.
11. Foody, G.; Mathur, A. The use of small training sets containing mixed pixels for accurate hard image classification: Training on mixed spectral responses for classification by a SVM. Remote Sens. Environ. **2006**, 103, 179–189.
12. Xie, Y.; Lark, T.; Brown, J.F.; Gibbs, H. Mapping irrigated cropland extent across the conterminous United States at 30 m resolution using a semi-automatic training approach on Google Earth Engine. ISPRS J. Photogramm. Remote Sens. **2019**, 155, 136–149.
13. Houborg, R.; McCabe, M. A hybrid training approach for leaf area index estimation via Cubist and random forests. Machine-learning **2018**, 135, 173–188.
14. Kavzoglu, T. Increasing the accuracy of neural network classifications using refined training data. Environ. Model. Softw. **2009**, 24, 850–858.
15. Lesparre, J.; Gorte, B. Using mixed pixels for the training of a maximum likelihood classification. Int. Soc. Photogramm. Remote Sens. **2006**, 36, 6.
16. Stehman, S. Basic probability sampling designs for thematic map accuracy assessment. Remote Sens. **1999**, 20, 2423–2441.
17. Comber, A.; Fisher, P.; Brunsdon, C.; Khmag, A. Spatial analysis of remote sensing image classification accuracy. Remote Sens. Environ. **2012**, 127, 237–246.
18. Stehman, S. Model-assisted estimation as a unifying framework for estimating the area of land cover and land-cover change from remote sensing. Remote Sens. Environ. **2009**, 113, 2455–2462.
19. Stehman, S. Sampling designs for accuracy assessment of land cover. Int. J. Remote. Sens. **2009**, 3, 5243–5272.
20. Tille, Y.; Wilhelm, M. Probability Sampling Designs: Principles for Choice of Design and Balancing. Stat. Sci. **2017**, 32, 176–189.
21. Stevens, D.L.; Olsen, A.R. Spatially-balanced sampling of natural resources. J. Am. Stat. Assoc. **2004**, 99, 262–277.
22. Grafström, A.; Lundström, N.L.P. Why well spread probability samples are balanced. Open J. Stat. **2013**, 3, 36–41.
23. Edwards, T.; Cutler, D.; Zimmermann, N.E.; Geiser, L.; Moisen, G.G. Effects of sample survey design on the accuracy of classification tree models in species distribution models. Ecol. Modeling **2006**, 199, 132–141.
24. Brus, D.J. Sampling for digital soil mapping: A tutorial supported by R scripts. Geoderma **2019**, 338, 464–480.
25. McRoberts, R.E.; Walter, B. Statistical inference for remote sensing-based estimates of net deforestation. Remote Sens. Environ. **2012**, 124, 394–401.
26. Sarndal, C.E. Design-based and Model-based Inference in Survey Sampling. Scand. J. Statist. **1978**, 5, 27–52.
27. Shi, Y.; Cameron, J.; Heckathorn, D.D. Model-Based and Design Based Inference: Reducing Bias Due to Differential Recruitment in Respondent-Driven Sampling. Sociol. Methods Res. **2016**, 48, 13–33.
28. Sterba, S. Alternative Model-Based and Design-Based Frameworks for Inference From Samples to Populations: From Polarization to Integration. Multivar. Behav. Res. **2009**, 44, 711–740.
29. Rao, J.N.K.; Molina, I. Small Area Estimation, 2nd ed.; Wiley and Sons: Hoboken, NJ, USA, 2015; p. 441.
30. Hogland, J.; Anderson, N.; St. Peters, J.; Drake, J.; Medley, P. Mapping Forest Characteristics at Fine Resolution across Large Landscapes of the Southeastern United States Using NAIP Imagery and FIA Field Plot Data. ISPRS Int. J. Geo-Inf. **2018**, 7, 140.
31. Hogland, J.; Anderson, N.; Affleck, D.L.R.; St. Peter, J. Using Forest Inventory Data with Landsat 8 imagery to Map Longleaf Pine Forest Characteristics in Georgia, USA. Remote Sens. **2019**, 11, 1803.
32. Neter, J.; Kutner, M.; Nachtsheim, C.; Wasserman, W. Applied Linear Statistical Models, 4th ed.; McGraw Hill: Boston, MA, USA, 1996; p. 1408.
33. Bishop, C.M. Pattern Recognition and Machine Learning; Springer Science and Business Media, LLC.: Singapore, 2006; p. 738.
34. Wood, S.N.; Augustin, N.H. GAMs with integrated model selection using penalized regression splines and applications to environmental modeling. Ecol. Modeling **2002**, 157, 157–177.
35. Breiman, L. Random forests. Mach. Learn. **2001**, 45, 5–32.
36. National Agriculture Imagery Program [NAIP]. National Agriculture Imagery Program (NAIP) Information Sheet. 2012. Available online: http://www.fsa.usda.gov/Internet/FSA_File/naip_info_sheet_2013.pdf (accessed on 14 May 2014).
37. Moran, P. Notes on Continuous Stochastic Phenomena. Biometrika **1950**, 37, 17–23.
38. Kincaid, T.M.; Olsen, A.R. Spsurvey: Spatial Survey Design and Analysis. R Package Version 4.1.0 2019. Available online: https://cran.r-project.org/web/packages/spsurvey/index.html (accessed on 27 September 2021).
39. TIGER 2017. TIGER/Line Shapefiles (Machine Readable Data Files)/Prepared by the U.S. Census Bureau. Available online: https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html (accessed on 27 April 2019).
40. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2014; Available online: http://www.R-project.org/ (accessed on 28 April 2018).
41. Wood, S.N. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J. R. Stat. Soc. (B) **2011**, 73, 3–36.
42. Karatzoglou, A.; Smola, A.; Hornik, K.; Zeileis, A. Kernlab—An S4 Package for Kernel Methods. R. J. Stat. Softw. **2004**, 11, 1–20.
43. Venables, W.N.; Ripley, B.D. Modern Applied Statistics with S, 4th ed.; Springer: New York, NY, USA, 2002; p. 495.
44. Liaw, A.; Wiener, M. Classification and Regression by random forest. R News **2002**, 2, 18–22.
45. Johnson, R.; Wichern, D. Applied Multivariate Statistical Analysis, 5th ed.; Prentice Hall: Upper Saddle River, NJ, USA, 2002; p. 767.
46. Kermorvant, C.; D’Amico, F.; Bru, N.; Caill-Milly, N.; Robertson, B. Spatially balanced sampling designs for environmental surveys. Environ. Monit. Assess. **2019**, 191, 524–531.
47. Opsomer, J.D.; Breidt, F.J.; Moisen, G.G.; Kauermann, G. Model-Assisted Estimation of Forest Resources With Generalized Additive Models. J. Am. Stat. Assoc. **2007**, 102, 400–416.
48. Breidt, F.J.; Opsomer, J.E. Model-Assisted Survey Estimation with Modern Prediction Techniques. Stat. Sci. **2017**, 32, 190–205.
49. Opitz, D.; Maclin, R. Popular ensemble methods: An empirical study. JAIR **1999**, 11, 169–198.
50. Breiman, L. Stacked regressions. Mach. Learn. **1999**, 24, 49–64.
51. Hogland, J.; Anderson, N. Function Modeling Improves the Efficiency of Spatial Modeling Using Big Data from Remote Sensing. Big Data Cogn. Comput. **2017**, 1, 3.

**Figure 1.** Diagram of simulations and workflow to determine the impact of sample designs on model- and design-based estimates. The gray box in sample design denotes a probabilistic sample design for a subset of geographic space. NAIP = National Agriculture Imagery Program imagery; SRS = simple random sample; SYS = systematic random sample; GRTS = generalized random tessellation stratified; RSNR = simple random sample near roads; LN = linear model; GAM = generalized additive models; SVM = support vector machines; NN = neural networks; RF = random forests; EX = Horvitz–Thompson expansion estimator.

**Figure 2.** Depiction of the NAIP, Linear, Quadratic, and Nonlinear transformations (SC1). Beneath each SC1 depiction, cell mean, standard deviation, minimum, maximum, and global Moran’s I [37] (rook neighborhood) statistics are presented by the amount of random error (% Noise) introduced within each response surface.

**Figure 3.** Distribution of sample mean digital number (DN) values of bands 1 and 4 (**left** panel; Feature Space) and sample mean northing and easting values (**right** panel; Geographic Space) for 100 samples obtained from generalized random tessellation stratified (GRTS), simple random […]. A sample size of 50 was used for each of the 100 samples.

**Figure 4.** Illustration of spatial density of sample unit locations for each sample design across the study area. The NAIP image is used for reference; the density raster surfaces depict the frequency of selected cell locations for samples using the SRS, SYS, GRTS, and RSNR designs. To enhance the display of density surfaces, each surface was created by drawing 2000 random samples (sample size 50) for each design and counting the number of sample units falling within each density raster cell (50 m grain size). White cells within the density raster surfaces identify cells where no sample units were selected. SRS = simple random sample; SYS = systematic random sample; GRTS = generalized random tessellation stratified; RSNR = random sample near roads.

**Figure 5.** Depiction of functional relationships of fully-specified estimators and SC1 surface values (blue line) for a slice of predictor variable space where NAIP cell values were held constant at the mean for bands 1–3 while band 4 cell values (x-axis) were allowed to vary. The amount of random noise introduced into the relationships between predictor and SC1 surfaces was 40% of the SC1 surface standard deviation. The sample design used to calibrate models was simple random sampling. Dashed colored lines correspond to averaged expansion and model-based estimates. The grey shaded regions around each estimation technique correspond to the 99% empirically based confidence interval given the 100 iterations performed. Sample size for expansion estimates and model calibration was 50. EX = expansion; LN = linear; GAM = generalized additive model; SVM = support vector machine; NN = neural networks; RF = random forests.

**Figure 6.** Relative RMSE (RRMSE) for expansion and fully-specified model-based estimates derived from 100 iterations of nonlinear trend response surfaces. Response surfaces incorporated random errors at 20%, 40%, 60%, and 80% of the total nonlinear SC1 image standard deviation. Column and row titles identify sample designs and proportion of SC1 standard deviation used in response surfaces. EX = expansion; LN = linear; GAM = generalized additive model; SVM = support vector machine; NN = neural networks; RF = random forests; SRS = simple random sample; SYS = systematic random sample; GRTS = generalized random tessellation stratified; and RSNR = random sample near roads.

**Figure 7.** Distribution of estimation errors for response surface population totals. Estimation errors are expressed as a percentage of the population total for trend response surfaces with normally distributed errors based on 20%, 40%, 60%, and 80% of the total nonlinear SC1 image standard deviation. Column and row titles identify sample designs and proportion of SC1 standard deviation used in response surfaces. EX = expansion; LN = linear; GAM = generalized additive model; SVM = support vector machine; NN = neural networks; RF = random forests; SRS = simple random sample; SYS = systematic random sample; GRTS = generalized random tessellation stratified; and RSNR = random sample near roads.

**Figure 8.** Smoothed trend of nonlinear response surface predicted values (y-axis) versus observed (x-axis) values for design- and model-based estimators calibrated using generalized random tessellation stratified (GRTS) designs and image bands 1–4 (Fully-Specified) or image bands 1–3 (Partially-Specified) as predictor variables. Noise introduced into the nonlinear response surface was held constant at 40% of the nonlinear SC1 surface standard deviation. Sample size was held constant at 50 for each of the 100 iterations used in the study. The gray dashed lines within the Fully-Specified and Partially-Specified graphs serve as one-to-one reference lines, while the gray dotted lines above the graphs depict the relative frequency of each observed value. EX = expansion; LN = linear; GAM = generalized additive model; SVM = support vector machine; NN = neural networks; RF = random forests.

**Figure 9.** Root mean squared error (RMSE) versus spread values (B) for the top three estimation techniques, by sample design, given the nonlinear response surface with 40% introduced noise. Densities of RMSE and B values are presented to the right of and above each graph. GAM = generalized additive model; RF = random forests; SVM = support vector machine; SRS = simple random sample; SYS = systematic random sample; GRTS = generalized random tessellation stratified; and RSNR = random sample near roads.

**Table 1.** Design with smallest average root mean squared error (RMSE) across iterations by response surface distribution (SC1), fully-specified modeling technique, and amount of introduced noise (% Noise).

SC1 | % Noise | GAM | RF | SVM | NN | LN | EX |
---|---|---|---|---|---|---|---|
Linear | 20 | SRS | GRTS *** | GRTS *** | GRTS * | SRS | GRTS *** |
Linear | 40 | GRTS | GRTS | GRTS *** | SRS | GRTS | GRTS ** |
Linear | 60 | GRTS | GRTS * | GRTS *** | GRTS * | RSNR | GRTS *** |
Linear | 80 | SYS | SYS | GRTS ** | GRTS | SYS | GRTS * |
Squared | 20 | GRTS * | GRTS *** | GRTS *** | GRTS | GRTS *** | GRTS *** |
Squared | 40 | SYS | GRTS *** | GRTS ** | GRTS | GRTS * | GRTS *** |
Squared | 60 | GRTS | GRTS * | GRTS * | SYS | GRTS | GRTS *** |
Squared | 80 | SRS | GRTS | GRTS * | GRTS | GRTS | GRTS * |
Nonlinear | 20 | GRTS * | GRTS * | SYS | RSNR * | GRTS | GRTS * |
Nonlinear | 40 | GRTS ** | GRTS | GRTS | GRTS | GRTS | GRTS |
Nonlinear | 60 | GRTS | GRTS | RSNR | GRTS | GRTS | GRTS |
Nonlinear | 80 | GRTS | GRTS * | RSNR | GRTS | GRTS | GRTS |

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Hogland, J.; Affleck, D.L.R.
Improving Estimates of Natural Resources Using Model-Based Estimators: Impacts of Sample Design, Estimation Technique, and Strengths of Association. *Remote Sens.* **2021**, *13*, 3893.
https://doi.org/10.3390/rs13193893
