Improving Estimates of Natural Resources Using Model-Based Estimators: Impacts of Sample Design, Estimation Technique, and Strengths of Association

Natural resource managers need accurate depictions of existing resources to make informed decisions. The classical approach to describing resources for a given area in a quantitative manner uses probabilistic sampling and design-based inference to estimate population parameters. While probabilistic designs are accepted as being necessary for design-based inference, many recent studies have adopted non-probabilistic designs that do not include elements of random selection or balance and have relied on models to justify inferences. While common, model-based inference alone assumes that a given model accurately depicts the relationship between response and predictors across all populations. Within complex systems, this assumption can be difficult to justify. Alternatively, models can be trained to a given population by adopting design-based principles such as balance and spread. Through simulation, we compare estimates of population totals and pixel-level values using linear and nonlinear model-based estimators for multiple sample designs that balance and spread sample units. The findings indicate that model-based estimators derived from samples spread and balanced across predictor variable space reduce the variability of population and unit-level estimators. Moreover, if samples achieve approximate balance over feature space, then model-based estimates of population totals approached simple expansion-based estimates of totals. Finally, in all comparisons made, improvements in estimation were achieved using model-based estimation over design-based estimation alone. Our simulations suggest that samples drawn from a probabilistic design, that are spread and balanced across predictor variable space, improve estimation accuracy.


Introduction
Managers of natural resources need accurate information describing the resources they manage to make informed decisions [1][2][3]. Owing to high acquisition costs, managers typically employ sampling and estimation techniques to describe various aspects of a given population. Additionally, to increase estimation accuracy, ancillary data, such as remotely sensed data, can be used to improve population estimates through specification and application of models that characterize how the resources of interest vary as a function of the ancillary data (e.g., [4][5][6][7]). Whether or not such models are employed or if simple expansion-based techniques are used to estimate population parameters, sample design plays an important role in estimation accuracy [7,8].
Within the context of estimating forest characteristics from remotely sensed data, less emphasis is generally placed on rigorous sample design to train models than when validating models [9]. In large part, this discrepancy stems from the conditions required for design versus model-based inference [4,5]. As Gregoire [8] describes, "in the designbased framework, the population is regarded as fixed whereas the sample is regarded as a realization of a stochastic process". Conversely the model-based framework assumes, "that the values y i , . . . , y n are regarded as realizations of random variables Y i , . . . ,Y n and hence the population is a realization of a random process". The primary difference being, "in the model-based approach [inference] stems from the model, not from the sampling design". While some rely on the availability of models as justification for deviating from probabilistic designs when drawing inferences, it is critical to remember that with natural systems, we seldom understand or can collect all the information needed to build models that completely describe the complexities of those systems. This can lead to model misspecification, localized deviations in relationships, and more generally to models that do not describe a specific population well. Therefore, sample designs used to calibrate models that include randomness in the selection process are critical to developing models that can be used to estimate population and subpopulation parameters. Moreover, probabilistic designs that ensure that samples are widely distributed across feature space can reduce the variability of population estimates.
However, many mapping projects employ non-probabilistic sample designs when calibrating predictive models [9] or use designs that may not fully capture the spread of predictor variables and that may be unbalanced with regard to the population. Furthermore, some authors have argued that sample units should only contain pure homogenous examples of a given class [10][11][12][13][14][15] and have developed sampling protocols to ensure that outcome. Compared to probabilistic sampling designs, samples of homogeneous units can be expected to estimate lower error rates when calibrating and validating models. However, such metrics can only be attributed to the class of units targeted for sampling and typically cannot be generalized to the rest of the population [16,17].
Stehman [16,18,19] describes multiple probabilistic sampling designs and outlines their associated strengths and weaknesses as they relate to map accuracy. Tille and Wilhelm [20], Grafstrom et al. [6], and Steven and Olsen [21] further describe the importance of probabilistic designs while highlighting concepts of balance and spread as they relate to the precision of estimators. Balance and spread in this case refer to characteristics of a sample with regards to auxiliary variables. Specifically, a balanced sample is one for which the estimated means or totals of the auxiliary variables equal the actual population means or totals of those variables [20]. A sample that is well spread in auxiliary space means that the sample units are widely dispersed across the auxiliary values of that population [21]. For population estimates, balance becomes appealing where there are potentially strong relationships between response and auxiliary variables; for balance then implies that a sample will provide an accurate estimate of the mean or total of an auxiliary variable, and hence of the true mean or total of the response variable. Likewise, a sample that is well spread across auxiliary variables has the advantage of being balanced [22] while also capturing the variability in auxiliary variables. While often described for expansion-based estimators, these arguments are also important in the context of sampling to develop models to support estimation [9,20,[23][24][25]. Nevertheless, many researchers disregard these warnings and build models derived from unbalanced, ill spread, and/or non-probabilistic samples to estimate characteristics of a given population.
Reasons for deviating from a probabilistic study design typically stem from logistical constraints and cost of implementation. However, using predicted values derived from models developed from non-probabilistic designs to describe complex natural systems such as a forested landscape can be highly questionable [26][27][28]. Questions pertaining to how sample units should be spread across multiple dimensions of feature and geographical space and subsequent impacts on estimator accuracy and inference are at the very forefront of understanding how resource assessments and maps can be used to inform decision making. In this study, we investigate these questions through a series of simulations that compare and contrast estimates of population totals and pixel values obtained under varying sampling designs. We also evaluate the relative impact of sampling design on estimators derived without use of ancillary data and from various linear, non-linear, and non-parametric models. Specifically, in this study, we: (1) evaluate the relative accuracy of alternative model-based estimators of unit values and population totals across populations where the form and/or strength of associations vary, (2) evaluate the impact of probabilistic designs that spread and balance samples over geographic or feature space on the accuracy of those estimators, (3) demonstrate the potential impact of using a probabilistic sampling design taken from a subset of geographic space on the performance of those estimators and (4) assess the impact of misspecified models on the accuracy of estimation. We anticipate that within our simulations, samples drawn from probabilistic designs that are spread and balanced across predictor variable space will produce better estimates of the population and subpopulations.

Theoretical Background
In this study we assess expansion and model-based estimation procedures applied to populations constructed according to distinct functional relationships between response and predictor variables but tempered by normally distributed errors of varying magnitudes. While fundamentally different, expansion and model-based estimation techniques can be used to estimate population parameters [29]. For expansion-based methods, parameter estimators are derived by weighting observed response values according to the relative frequency with which the corresponding sample units are selected under the design. In contrast, model-based estimators of population parameters are derived from a model of how the response variables change across the population. In this approach, the emphasis is on estimating model parameters, which can then be indirectly used to estimate either parameters of a given realization of the model, or the model-expectations thereof [29]. Contrary to the expansion-based methodology, the model-based approach is not necessarily tied to a specific population but instead assumes that a given population is a realization of some random process, typically involving known or mapped auxiliary variables. The emphasis of the model-based approach is often on the relationships between a variable of interest (response) and these other variable(s) (predictors). Given those relationships, the forms of deviations around them, and the predictor variable values within a specific population, one can then estimate model and population parameters.
While models do not necessarily need to be tied to a given population, it is the case that within natural systems the underlying shapes and forms of the relationships between response and predictor variables vary and may be unknown and noisy. Therefore, model development is often based on a specific population (e.g., [30,31]), making models uniquely calibrated to a given time and place. Though uniquely calibrated, different modeling techniques make different assumptions with regard to the relationships between response and predictor variables. For many natural systems, these assumptions may be difficult to meet and can lead to a model that is misspecified. To assess the impact of potential model misspecification on estimating population and subpopulation parameters, we use simulated landscapes with known functional relationships and introduced error ( Figure 1). The functional shape of relationships evaluated in our simulations include linear, quadratic, and nonlinear distributions with normally distributed errors and various amounts of introduced noise (Section 3.1). Additionally, estimation techniques are evaluated under various sample designs to assess the impact of spread and balance on parameter estimation (Section 3.2). Expansion and model-based estimators were chosen based on their frequency of application, underlying assumptions, and flexibility in capturing various relationships among response and predictor variables. Modeling techniques evaluated include linear regression (LN), neural networks (NNs), support vector machines (SVMs), random forests (RF), and generalized additive models (GAMs). In terms of complexity, expansion estimators are the simplest, followed by modelbased estimators LN, NNs, SVMs, RF, and GAMs. Expansion estimators are commonly limited to a singular, population-wide estimate of mean or total for a population of interest [7]. Conversely, linear regression models can estimate the mean response value of pixels for given values of predictor variables but are obviously limited in their ability to describe nonlinear relationships and can produce estimates outside the range of values observed in a sample [32]. NNs, a machine learning technique, can represent nonlinear associations by introducing activation functions and weighting schemes of linear coefficients [33]. SVMs use kernel functions to identify subsets of the data within multidimensional feature space that describe the relationship between response and predictor variables [33]. RFs allow for discontinuities in the response function across feature space by implementing classification and regression trees (CART) and incorporating bagging and randomness into model training. RFs ultimately rely on ensembles of CART models to capture relationships between response and predictors but can yield only estimates within the range of observed response values [34]. GAMs are based on a flexible modeling technique that assumes additivity in predictor variables effects, and that response values change smoothly over feature space. GAMs allow for nonlinear, nonparametric relationships between response and predictor variables [35] and can explicitly account for non-Gaussian error distributions.
Of particular interest with regard to expansion and model-based estimators is that an expansion estimator can be thought of as a model-based estimator with only an intercept parameter if an equal probability design is used and if the presumed model In terms of complexity, expansion estimators are the simplest, followed by modelbased estimators LN, NNs, SVMs, RF, and GAMs. Expansion estimators are commonly limited to a singular, population-wide estimate of mean or total for a population of interest [7]. Conversely, linear regression models can estimate the mean response value of pixels for given values of predictor variables but are obviously limited in their ability to describe nonlinear relationships and can produce estimates outside the range of values observed in a sample [32]. NNs, a machine learning technique, can represent nonlinear associations by introducing activation functions and weighting schemes of linear coefficients [33]. SVMs use kernel functions to identify subsets of the data within multidimensional feature space that describe the relationship between response and predictor variables [33]. RFs allow for discontinuities in the response function across feature space by implementing classification and regression trees (CART) and incorporating bagging and randomness into model training. RFs ultimately rely on ensembles of CART models to capture relationships between response and predictors but can yield only estimates within the range of observed response values [34]. GAMs are based on a flexible modeling technique that assumes additivity in predictor variables effects, and that response values change smoothly over feature space. GAMs allow for nonlinear, nonparametric relationships between response and predictor variables [35] and can explicitly account for non-Gaussian error distributions.
Of particular interest with regard to expansion and model-based estimators is that an expansion estimator can be thought of as a model-based estimator with only an intercept parameter if an equal probability design is used and if the presumed model incorporates an independently identically distributed error structure. That is, if the same sample is used by an expansion estimator to calibrate an intercept only model, the population estimates from the expansion and model-based estimators will be the same. This breaks down if unequal probability designs area used, because information on sampling intensities under a design are not used in the model-based approach. Furthermore, if relationships between the response and potential predictors cannot be identified, then the only viable model is one describing variation around the mean, which suggests use of simple estimators like the sample mean. Owing to this, within our simulation, we predict that as the amount of model error (noise) introduced into the relationship between response and predictors is increased, model and expansion-based estimates will converge. Conversely, as the amount of noise introduced between response and predictor decreases, the estimated values from the modelbased estimators will be less variable than the expansion-based estimates. Similarly, within our simulations we anticipate that when underlying model relationships are misspecified or calibrated in the absence of important correlated data, model-based estimates will converge with expansion-based estimates. Moreover, we predict that samples drawn from probabilistic designs that enhance spread and balanced across predictor variable space will produce better estimates of the population and subpopulations.

Simulated Raster Surfaces
A spatial subset extracted from a National Agricultural Imaging Program (NAIP) [36] image located in the panhandle of Florida, USA, serves as our base dataset for all derived raster surfaces used within our simulations. This subset was acquired in the summer of 2018 and covers approximately 6 km 2 of a mixed forested/urban landscape ( Figure 2) with a nominal pixel resolution of 1 m 2 . The four spectral bands of the NAIP image (x i ; i = 1, 2, 3, 4) were transformed to produce three distinct continuous surfaces (SC1, Figure 2). The first SC1 surface was obtained by averaging the band values for each pixel: For each SC1 surface (Linear, Quadratic, and Nonlinear), normally distributed errors were added to create response surfaces ykj = μkj + εkj (Figure 2), where the εkj are independent, identically distributed errors. Normally distributed errors were generated The second SC1 surface was obtained as the square of the first, and thus is a quadratic function of the individual band values and their cross-products. Finally, a discontinuous, nonlinear SC1 surface was created by adding an additional logical query on the NAIP band 4 cell values: where λ j is an indicator variable taking the value 1 if the value of band 4 for pixel j falls between 170 and 210 and is 0 otherwise. For each SC1 surface (Linear, Quadratic, and Nonlinear), normally distributed errors were added to create response surfaces y kj = µ kj + ε kj (Figure 2), where the ε kj are independent, identically distributed errors. Normally distributed errors were generated using a standard deviation proportional to the standard deviation of pixel values within a given SC1 and a mean of zero; the constants of proportionality were set to 20%, 40%, 60%, and 80% to provide an equally spaced range of noise added to each relationship and to evaluate the impact of noise on model estimation. The realized errors produced homoscedastic surfaces with error values centered at zero before being added to SC1 surfaces. In total, 12 unique continuous response surfaces were created, with various shapes and amounts of error (% Noise).

Sample Designs
To evaluate the impact of sample designs on estimation, we compared population and raster cell estimates for each of our response surfaces using four different sampling designs. Sample designs included simple random sampling (SRS), systematic random sampling (SYS), modified Generalized Random Tessellation Stratified sampling (GRTS; [21,38]), and randomly selecting sample units next to roads (RSNR). In our simulation, SRS selects locations across the spatial domain of the image, independently and uniformly at random. SYS spreads sample units evenly across the spatial domain of the image by distributing locations over a square grid of fixed orientation. The dimensions of each square grid cell was calculated as the square root of the total image area divided by a sample size of 50. SYS was achieved by selecting a random start within the bottom east square grid cell and systematically selecting locations based on the grid cell dimensions. Owing to the systematic nature of the design and the dimension of each grid cell, some SYS samples initially had fewer or more than 50 sample units. For SYS samples with more than 50 locations selected, locations were randomly removed until the sample size reached 50. For SYS samples with less than 50 locations selected, random locations across the spatial domain of the image were added to the sample until the sample size reached 50. GRTS spreads and balances samples in auxiliary space based on the NAIP spectral values (bands 1-4). To achieve this, we selected an initial SRS of 100,000 locations within the NAIP image, extracted raw spectral values from each band at those locations, and performed a principal components analysis (PCA) using the bands' correlation matrix. Using the loadings of the first two components of the PCA, we then transformed the 100,000 sampled NAIP cell values to bivariate component scores. These were then input into the GRTS algorithm [38] to select a sample of 50 observations spread and balanced within a two-dimensional PCA score space; response surface values were available only at these 50 locations. RSNR uses Tiger Road files [39] and a 50 m buffer around roads to subset the population and then randomly select sample unit locations within those subsets. Sample sizes of 50 for model calibration and parameter estimation were chosen to reflect common cost constraints associated with field data collection campaigns and were held constant across estimation techniques to control for sample intensity.

Model Calibration and Estimation
LN, GAM, SVM, NN, and RF models were trained separately for each sample drawn from the 12 response surfaces and 4 sample designs. For LN models, ordinary least squares regression was used to relate response and predictor variables [40]. For GAMs, the Gaussian family with an identity link and additive thin plate penalized regression splines were used to relate response and predictor variables [41]. SVM models used Epsilon Regression and a Radial Basis kernel (Gaussian) [42]. NNs fit a single layer hidden linear model (least squares) with a size parameter equal to half the sample size and a decay parameter equal to 1/10 the size parameter [43]. RF models averaged the results of 20 regression trees drawing on 66% of the data; owing to the small number of predictors involved (three or four), each tree randomly selected one predictor variable at each training node [44]. All model calibrations assumed independent, homoscedastic errors.
Model predictor variables initially included all four NAIP band cell values. That is, all information used to generate the response surfaces were made available in the modeling process as predictor variables. Below, we refer to these as fully-specified models. A second class of models (partially-specified models) utilized only the first three NAIP bands as predictors. This represents the common situation in which the available predictor variables do not fully describe the response function.
Once trained, model and NAIP cell values were used to create prediction surfaces of estimated raster cell values. Estimates of the population total (τ y ) were calculated by aggregation:τ whereŷ i is an estimated pixel value obtained from a calibrated model of one of the above types and N is the total number of pixels in the population.
Additionally, from the response values alone, each sample provided an expansion estimate of the population total obtained simply from the per-pixel sample mean expanded by the population size:τ where n is the sample size (50).

Evaluation
For each sample design (d) and response surface (k), 100 samples (iterations) were drawn and used to calibrate models and estimate surface characteristics. The spread of each sample (B) was measured using the approach described in [21,22]. Briefly, Mahalanobis distance [45] was used to identify Voronoi polytopes (p i ) around each selected pixel of a particular sample in multidimensional predictor space or in geographic space. Inclusion probabilities (n/N) were then summed within each polytope and the variance of these sums were then calculated: where v i is the sum of inclusion probabilities in p i. For each response surface, sample design, and estimation technique, estimates were compared against the corresponding response surface total or against individual cell values. For a global perspective, estimates of the response surface totals (τ y ) were recorded and compared against actual totals (τ y = N ∑ i=1 y i ); differences were expressed relative to the Remote Sens. 2021, 13, 3893 8 of 18 actual totals, on a percentage basis. For a local perspective, estimated cell values were recorded and the RMSE was calculated across cells as follows: RMSE was also calculated for the expansion estimator reflecting the fact that the expansion estimator (scaled by N) is the only available estimate of cell values as follows: where y is the simple sample mean. While similar to the variance of the expansion estimator, this RMSE expression is evaluated over the entire population and is a measure of within response-surface variability. Using either expression, relative RMSE was then calculated as where sd y is the actual population standard deviation.

Allocation of Sample Units
Of all the sample designs considered, only GRTS utilized information from feature space. Although it used only the first two principal components of the four bands, these accounted for approximately 97% of the total variation in the NAIP image. Within feature space, GRTS sample means were approximately balanced, clustering near the population mean values for each band (Figure 3). By contrast, samples drawn by SRS, SYS, and RSNR designs were unbalanced, with individual sample means varying substantially from band averages (Figure 3). In geographic space, GRTS and SRS had similarly dispersed distributions of sample means while the distribution of sample means for the SYS design was tighter ( Figure 3). However, the distribution of sample means for RSNR was skewed to upper right quadrant in Figure 3 owning to road densities ( Figure 4). Imbalance of SYS samples in geographic space was higher than anticipated, possibly as a result of the sample adjustments made to ensure a constant sample size of 50 units.    While the RSNR design only selected sample units within 50 m of roads insid study area and appeared to underrepresent impervious surfaces such as industrial a and buildings (Figure 4), mean values of NAIP bands and northing and easting cl While the RSNR design only selected sample units within 50 m of roads inside the study area and appeared to underrepresent impervious surfaces such as industrial areas and buildings (Figure 4), mean values of NAIP bands and northing and easting closely matched population parameters in feature space ( Figure 3). Moreover, even though no sample units were selected further than 50 m from a road in the RSNR design, sample unit locations appeared to be generally spread across the study area in both feature and geographic space (Figure 3). While the areas within 50 m of a road represent a subset of the geographic domain of the study area, the spatial extent of that subset still spanned a broad distribution of feature space (Figure 3).

Model Calibration and Functional Relationships
While trends in population estimates were generally consistent across response surfaces, the most complex results occurred when estimators were applied to the nonlinear SC1 response surfaces. For these surfaces, the RF estimator would presumably have an advantage over expansion, LN, SVM, NN, and GAM estimators given that the latter techniques presume smooth response functions over feature space and are limited in their ability to capture the discontinuities in these response surfaces. Figure 5 depicts the averaged fitted estimates over the 100 SRS iterations for a slice of feature space (40% introduced normally distributed errors). Relatively speaking, fully-specified GAM, RF, and SVM estimators were able to reproduce the rise and drop of the response surface cell values. While NN estimators partially capture the rise in values, they were not able to capture the subsequent fall in those values in nonlinear relationships ( Figure 5). As expected, the linear estimators only allowed for constant rates of change. They could therefore describe patterns in the linear SC1 surfaces but failed to describe characteristics of the other two SC1 surfaces.
in their ability to capture the discontinuities in these response surfaces. Figure 5 depicts the averaged fitted estimates over the 100 SRS iterations for a slice of feature space (40% introduced normally distributed errors). Relatively speaking, fully-specified GAM, RF, and SVM estimators were able to reproduce the rise and drop of the response surface cell values. While NN estimators partially capture the rise in values, they were not able to capture the subsequent fall in those values in nonlinear relationships ( Figure 5). As expected, the linear estimators only allowed for constant rates of change. They could therefore describe patterns in the linear SC1 surfaces but failed to describe characteristics of the other two SC1 surfaces. These aspects of model performance were consistent across sample designs and levels of introduced noise. As anticipated, when more error was introduced into the response surfaces, variability in estimates increased. When comparing estimation techniques across these slices of SC1 response surface values, fully-specified GAMs, RFs, and SVMs consistently outperformed other estimation approaches ( Figure 5).

Estimates Using NAIP Bands as Predictors
As expected, sample design had an impact on the accuracy of estimated pixel values and population totals (Figures 6 and 7). While we anticipated that the RSNR design would produce biased estimates, bias was minor in estimated values ( Figure 7). Moreover, across These aspects of model performance were consistent across sample designs and levels of introduced noise. As anticipated, when more error was introduced into the response surfaces, variability in estimates increased. When comparing estimation techniques across these slices of SC1 response surface values, fully-specified GAMs, RFs, and SVMs consistently outperformed other estimation approaches ( Figure 5).

Estimates Using NAIP Bands as Predictors
As expected, sample design had an impact on the accuracy of estimated pixel values and population totals (Figures 6 and 7). While we anticipated that the RSNR design would produce biased estimates, bias was minor in estimated values ( Figure 7). Moreover, across all response surfaces, similar patterns in RMSE and variability in the estimated population totals were observed. Generally, as the proportion of noise increased in the response surfaces, RMSE and the variability in population estimates increased (Figures 6 and 7). Because results and trends were similar for linear, squared, and nonlinear response surfaces, we primarily present results for the nonlinear (most complex) response surfaces within this section.
For the expansion estimators, RMSE varied little across iterations within nonlinear response surfaces but consistently exceeded the RMSEs of the fully-specified model-based estimators ( Figure 6). While there did appear to be minor improvements in RMSE for the GRTS design, it is difficult to clearly rank sample designs with regards to RMSE in these figures. Generally, though, GRTS and SYS designs had lower RMSE values than the other designs (Table 1). all response surfaces, similar patterns in RMSE and variability in the estimated population totals were observed. Generally, as the proportion of noise increased in the response surfaces, RMSE and the variability in population estimates increased (Figures 6 and 7). Because results and trends were similar for linear, squared, and nonlinear response surfaces, we primarily present results for the nonlinear (most complex) response surfaces within this section. For the expansion estimators, RMSE varied little across iterations within nonlinear response surfaces but consistently exceeded the RMSEs of the fully-specified model-based estimators ( Figure 6). While there did appear to be minor improvements in RMSE for the GRTS design, it is difficult to clearly rank sample designs with regards to RMSE in these figures. Generally, though, GRTS and SYS designs had lower RMSE values than the other designs (Table 1).  From the perspective of population totals, the GRTS designs had the largest positive impact on expansion-based estimators (Figure 7, reduction in variability of % bias). While the impact of sample designs was less apparent for model-based estimators, generally the GRTS design had the least variation in population estimates (Figure 7), while producing the smallest RMSEs (Table 1). Across all iteration, the mean difference between observed and estimated total, measured as a percentage of the response surface total, was close to zero (Figure 7), suggesting that estimator bias, if present, was minor. On the basis of RMSE, the best performing estimator for all sample designs, transformations, and introduced errors was the GAM estimator followed by RF, SVM, NN, LN, and expansion estimators. Comparing results across sample designs for the GAM estimators, GRTS and SYS typically had the smallest RMSE and highest accuracy in population estimates. While we anticipated that expansion and model-based estimates would be slightly biased for the RSNR sample design, our results did not fully support that claim. This is likely due to the relatively large proportion of area within 50 m of a road in the NAIP image subset, the identical trend and error distributions applied across the study area, and the similar distribution of spectral values found within the road  From the perspective of population totals, the GRTS designs had the largest positive impact on expansion-based estimators (Figure 7, reduction in variability of % bias). While the impact of sample designs was less apparent for model-based estimators, generally the GRTS design had the least variation in population estimates (Figure 7), while producing the smallest RMSEs (Table 1). Across all iteration, the mean difference between observed and estimated total, measured as a percentage of the response surface total, was close to zero (Figure 7), suggesting that estimator bias, if present, was minor.
On the basis of RMSE, the best performing estimator for all sample designs, transformations, and introduced errors was the GAM estimator followed by RF, SVM, NN, LN, and expansion estimators. Comparing results across sample designs for the GAM estimators, GRTS and SYS typically had the smallest RMSE and highest accuracy in population estimates. While we anticipated that expansion and model-based estimates would be slightly biased for the RSNR sample design, our results did not fully support that claim. This is likely due to the relatively large proportion of area within 50 m of a road in the NAIP image subset, the identical trend and error distributions applied across the study area, and the similar distribution of spectral values found within the road buffers and the image.
As anticipated, when evaluating designs, percent error, and model estimation techniques for various ranges of response variable values using nonlinear response surface cell values, we saw similar trends as described in Section 4.2 and as displayed in Figure 5. Figure 8 illustrates these relationships by presenting predicted versus observed smoothed trends for the GRTS sample designs, expansion, and model-based estimators, and 40% introduced noise into the relationship between response and predictors. Of particular note in Figure 8 is that all methods produced attenuated estimates and that the GAM, RF, and SVM procedures more closely match the one-to-one line depicted in Figure 8 within the most common areas in feature space (frequency graphs above graphics in Figure 8). These general trends also extended to partially-specified estimators in which only NAIP bands 1-3 were used as predictor variables. and 40% introduced noise into the relationship between response and predictors. Of particular note in Figure 8 is that all methods produced attenuated estimates and that the GAM, RF, and SVM procedures more closely match the one-to-one line depicted in Figure  8 within the most common areas in feature space (frequency graphs above graphics in Figure 8). These general trends also extended to partially-specified estimators in which only NAIP bands 1-3 were used as predictor variables. Moreover, all models, regardless of misspecification (model distribution or lack of predictors) tended to produce better estimates than expansion-based estimation alone. Similar to results in Section 4.2, the GRTS sample design coupled with the GAM-based estimator produced the best results and more closely followed the one-to-one line when weighted by the frequency of sampled values (Figure 8). Finally, in all comparisons, as percent error increased departures from the one-to-one line also increased (not shown).

Impact of Spreading Sample Units in Feature Space
While estimation approach had a larger effect on our results, sample design also impacted parameter estimates. Across all response surfaces, iterations, and modeling techniques the GRTS sample design provided estimates with the smallest average RMSEs (or was not statistically different than the smallest averaged RMSEs; Table 1). Additionally, the variability in estimated population totals tended to be smaller for the GRTS sample design (Figure 7). Moreover, for the top three estimation techniques (GAM, RF, and SVM) and the most complex response surfaces (nonlinear), spreading samples in feature space generally produced the smallest RMSEs ( Figure 9, Table 1). Interestingly, estimates derived from samples spread in geographic space (SYS) often performed equally to samples spread in feature space (GRTS) even though response surfaces did not Moreover, all models, regardless of misspecification (model distribution or lack of predictors) tended to produce better estimates than expansion-based estimation alone. Similar to results in Section 4.2, the GRTS sample design coupled with the GAM-based estimator produced the best results and more closely followed the one-to-one line when weighted by the frequency of sampled values ( Figure 8). Finally, in all comparisons, as percent error increased departures from the one-to-one line also increased (not shown).

Impact of Spreading Sample Units in Feature Space
While estimation approach had a larger effect on our results, sample design also impacted parameter estimates. Across all response surfaces, iterations, and modeling techniques the GRTS sample design provided estimates with the smallest average RMSEs (or was not statistically different than the smallest averaged RMSEs; Table 1). Additionally, the variability in estimated population totals tended to be smaller for the GRTS sample design (Figure 7). Moreover, for the top three estimation techniques (GAM, RF, and SVM) and the most complex response surfaces (nonlinear), spreading samples in feature space generally produced the smallest RMSEs ( Figure 9, Table 1). Interestingly, estimates derived from samples spread in geographic space (SYS) often performed equally to samples spread in feature space (GRTS) even though response surfaces did not depend directly on geography (Equations (1)-(3)). In large part, this can be attributed to spatial correlation in NAIP bands (Figure 2, Moran's I statistic). Nevertheless, across all estimation techniques, estimators derived from samples spread in feature space tended to have the smallest RMSE ( Figure 9). Moreover, the sample design that yielded the smallest average values for B (well spread samples) was GRTS. depend directly on geography (Equations (1)-(3)). In large part, this can be attributed to spatial correlation in NAIP bands (Figure 2, Moran's I statistic). Nevertheless, across all estimation techniques, estimators derived from samples spread in feature space tended to have the smallest RMSE ( Figure 9). Moreover, the sample design that yielded the smallest average values for B (well spread samples) was GRTS.

Discussion
While sample design affected the performance of all estimators, the estimation approach had a larger impact on cell and population estimates. As expected, in all instances evaluated, regardless of the underlying relationship between response and predictor variables, model-based estimators produced better pixel-level estimates than expansion estimators ( Figure 6, Table 1). Additionally, when sample units were spread and balanced in feature space, model-based estimators of population totals had relatively low bias and remained less variable than expansion estimators (Figure 7). This finding suggests that models derived from samples balanced and spread in feature space produce low-bias estimates with less variability than RSNR, SRS, and SYS designs.
Likewise, expansion estimators derived from samples spread and balanced across feature space (GRTS), produced estimates similar to model-based estimates of population totals with less variability than SYS, SRS, and RSNR expansion estimators; supporting the idea that spreading and balancing sample observations in feature space can substantially improve population estimates presented by others [6,20,46]. Within our simulations, the variability in estimated totals was on average always less for model-based estimators and expansion estimators derived from samples spread and balanced in feature space. In all instances the increased precision (reduction in RMSE) of using a model-based estimator over expansion estimators alone outweighed the impacts of the bias introduced by modelbased estimators, even when models were misspecified.
Within our simulations, GAM and RF were the two best modeling techniques when all NAIP bands were used to calibrate models. Once calibrated, these estimators were able to better identify and track the underlying transformations of NAIP cell values better than other evaluated estimators. Additionally, when models were misspecified (e.g., a linear model applied to a nonlinear response), raster cell values approached expansion-based estimates. Furthermore, when only the first three NAIP bands were used to calibrate models, model-based cell estimates also approached expansion-based estimates. These findings suggest that using probability sampling, ancillary data, and models calibrated from probability samples improves pixel-level value estimates, even if the model is misspecified and especially if sample units are spread across the values of the ancillary

Discussion
While sample design affected the performance of all estimators, the estimation approach had a larger impact on cell and population estimates. As expected, in all instances evaluated, regardless of the underlying relationship between response and predictor variables, model-based estimators produced better pixel-level estimates than expansion estimators ( Figure 6, Table 1). Additionally, when sample units were spread and balanced in feature space, model-based estimators of population totals had relatively low bias and remained less variable than expansion estimators (Figure 7). This finding suggests that models derived from samples balanced and spread in feature space produce low-bias estimates with less variability than RSNR, SRS, and SYS designs.
Likewise, expansion estimators derived from samples spread and balanced across feature space (GRTS), produced estimates similar to model-based estimates of population totals with less variability than SYS, SRS, and RSNR expansion estimators; supporting the idea that spreading and balancing sample observations in feature space can substantially improve population estimates presented by others [6,20,46]. Within our simulations, the variability in estimated totals was on average always less for model-based estimators and expansion estimators derived from samples spread and balanced in feature space. In all instances the increased precision (reduction in RMSE) of using a model-based estimator over expansion estimators alone outweighed the impacts of the bias introduced by modelbased estimators, even when models were misspecified.
Within our simulations, GAM and RF were the two best modeling techniques when all NAIP bands were used to calibrate models. Once calibrated, these estimators were able to better identify and track the underlying transformations of NAIP cell values better than other evaluated estimators. Additionally, when models were misspecified (e.g., a linear model applied to a nonlinear response), raster cell values approached expansion-based estimates. Furthermore, when only the first three NAIP bands were used to calibrate models, model-based cell estimates also approached expansion-based estimates. These findings suggest that using probability sampling, ancillary data, and models calibrated from probability samples improves pixel-level value estimates, even if the model is misspecified and especially if sample units are spread across the values of the ancillary data (feature space). This finding is especially relevant today with the availability of remotely sensed imagery and the correlations between land cover (e.g., forest ecosystems) attributes and spectral and textural image metrics (e.g., [30,31]). While it is tempting to view our results as confirmation of the robustness of GAM and RF modeling techniques, it is important to recognize that within our simulation, randomness played a critical role in all sample designs [20] and that error and SC1 transformations were consistent across the spatial domain. Foody et al. [11] points out that other relationships and error structures may favor different modeling techniques-for example, non-Gaussian error distributions could advantage the GAM approach. However, on average, when samples were spread and balanced in feature space, pixel-level and population total estimates were as good as or better than samples that were not spread or balanced in feature space. This result is most likely because spreading and balancing sample units across feature space will more consistently allow for the detection of changes in complex relationship between response and predictors than samples that are not spread and balanced.
While we did not directly evaluate sample size in our comparisons, we assume that the strength of the relationship between response and predictor variables, represented in our study as added random noise, would have impacts on estimation similar to sample size. Specifically, we would expect that an increase in sample size would have a similar impact on estimation accuracy as a decrease in noise introduced into the response surfaces. This would suggest that as sample size increases, estimates should become more precise and measures of spread, such as the B statistic [21,22] will become smaller. Future investigations should look at quantifying tradeoffs in sample size, spread, and the strength of the relationship between response and predictor variables on estimation for modeling techniques such as LN, GAM, RF, SVM, and NN.
These simulated findings have practical relevance for natural resource managers who are interested in inventory, monitoring, and managing natural resources. Specifically, improvements in estimating characteristics of a natural resource for a given population (e.g., basal area in a forest) can be gained by incorporating models into the estimation process [4,30,31]. Moreover, indirect estimates of population subdomains (e.g., stands) can be made from model-based estimates [26]. In the case where observational units (e.g., pixels) cover a smaller spatial domain than the subdomain of interest (e.g., stands), those pixel estimates can be aggregated to the spatial extent of the stand. Conversely, when pixels have an extent larger than the stand of interest, model estimates can be attributed to the entire stand. In instances where pixel estimates partially cover the extent of the stand, estimates can be weighted and attributed to the stand based on the amount of overlapping area between pixels and the stand. However, models have estimation error, which should be incorporated into the standard errors of the estimates of the domain and subdomain characteristics [26,47,48].
In our simulation, we used iteration and sampling to quantify estimation error. Our top performing estimator (GAM) on average had the least amount of bias and the smallest RMSE across iterations. However, there were instances (iterations) within our simulations when the GAM model had, relatively speaking, large RMSE. These instances within our iterations identify the case in which a sample drawn from the population were not well balanced or spread. While the GRTS sample design minimized the occurrence of samples that were not well balanced or spread, it did not eliminate those types of samples. This means that while rare, some samples may have a disproportionately large number of extreme occurrences within feature space which could adversely affect the calibration of a given model and model estimates. In this situation, bagging or boosting could be used to iteratively draw random subsets from a given sample, calibrate multiple models, average model estimates, and empirically estimate standard error to reduce the impact of extreme observations [49,50]. Applying methods such as this should have a similar impact to the variation in RMSE values as what is displayed in Figure 9 for the RF modeling techniques, which uses bagging and model averaging [34].
For forest managers, this suggests that estimates of forest characteristics such as species composition, basal area, and tree counts can be improved for a given area by relating field measurements to remotely sensed imagery such as NAIP (e.g., [4,30]). Moreover, with the relative abundance of free remotely sensed data and newer software designed to facilitate these types of analyses (e.g., [51]), managers can estimate characteristics of the forests they manage with a greater level of detail and accuracy than was previously possible.

Conclusions
In this study we compared multiple estimators for a variety of response and predictor variable relationships, Gaussian error distributions, amounts of noise, and sample designs. Our findings indicated that balance and spread are important aspects of a sample, estimates of pixel-level and population totals can be improved by incorporating models, and spreading samples within feature space can improve estimates. Likewise, for those same samples, when the relationship between response and predictor variables is misspecified, missing key information, or is extremely noisy, expansion and model-based estimates of the population converge, suggesting that with regards to estimation, nothing is lost by using ancillary data. Moreover, for mapping endeavors attempting to spatially depict various characteristics of a landscape, samples used to calibrate predictive models should aim to spread and balance observational units across feature space.

Institutional Review Board Statement:
This study does not involve human subjects, human material, human tissues, or human data.

Informed Consent Statement:
This study does not involve human subjects, human material, human tissues, or human data.