1. Introduction
Remotely sensed data play an ever-increasing role in characterizing and quantifying landscapes. These types of data have been used to study our surroundings [
1], stratify the terrestrial environment [
2,
3], and build a wide range of data products depicting terrestrial characteristics, such as topography [
4], land use and cover [
5], vegetative indices [
6], vegetation communities [
7,
8], fire severity [
9], land cover change [
10], and temperature [
11]. Due to the success and relatively low cost of using remotely sensed data to depict landscape patterns and changes in those patterns, fields like landscape ecology [
12] and concepts like spatial connectivity and the relationships between patterns and processes are now at the forefront of many land management and planning endeavors [
13,
14,
15,
16].
Ideas such as spatial contiguity, patch size, and patch juxtapositioning, and their relationships to processes and concepts such as forest management, land use planning, and sustainable forestry have in part fueled the desire to precisely and accurately define existing patterns at fine spatial detail, across broad extents [
17,
18,
19]. Coupled with the availability of fine-grained remotely sensed data (≦5 m) and advancements in computer-based hardware and software [
20], a fine-scaled depiction of the landscape can now be produced across broad extents relatively quickly, at a low cost [
21,
22,
23]. At the same time, the fine-grain nature of these types of data provide unique opportunities to relate measured characteristics of the landscapes for small spatial extents (response variables) to remotely sensed data (predictor variables).
Many have capitalized on this point to develop mathematical, statistical, and spatial models that can be used to create surfaces depicting landscape variables of interest using geo-rectified field and remotely sensed data [
18,
22,
23,
24]. Generally, this process can be described as: (1) registering both field and remotely sensed data to a known coordinate system, (2) using the spatial coordinates of the field and remotely sensed data to link measured values in the field to remotely sensed data, (3) building a model for the linked variable as a function of variables derived from the remotely sensed data, and (4) applying the model to remotely sensed surfaces to create a continuous surface of estimated characteristics. While straightforward, the linking process is subject to error (co-registration error) owing to the imperfectly identified spatial coordinates of the response and predictors, and this can have a negative impact on the accuracy of the model estimates (i.e., increased bias and imprecision). With regression models, predictor variables (
Xi) are assumed to be measured without error. Response variables (
Yi) can be measured with error, and this is accounted for within the modeling process, often by specifying an additive random discrepancy, typically denoted as
[
25]. Take for example a simple linear model equation:
where
and
correspond to the intercept and slope, and
corresponds to model error which includes any potential error associated with measuring the response variable. When co-registration errors occur, this amounts to the introduction of error into the ability to measure
Xi (e.g., spectral values) coincident with
Yi (e.g., basal area per hectare). Measurement error in
Xi is not typically accounted for in regression models and will cause attenuation bias [
25,
26], which manifests in estimates trending towards the global mean of the response variable.
To circumvent the impacts of co-registration errors, analysts have employed a wide variety of solutions, ranging from rectifying images in a relative manner [
27] to ignoring these errors and assuming them to be of little importance in predictions [
28]. Regardless of the precision of measuring the true or relative surface location, spatial error will always be part of the rectification process and will have an impact on the underlying predictive model.
Within remote sensing literature, the impact of co-registration error has been recognized, especially for Light Detection And Ranging (LiDAR) data [
29,
30,
31,
32], but typically is not directly quantified. Often studies cite co-registration as an additional source of error that should be minimized, but fall short in describing the effects of those errors or providing suggestions to minimize the influence of those errors on predicted values. In this study, we address this knowledge gap by developing techniques to quantify this source of error and mitigate co-registration errors in applied work. Through simulation using Landsat 8 and National Agriculture Imaging Program (NAIP) imagery and images created with specific spatial correlation, based on Landsat 8 and NAIP images, we investigate co-registration errors and their impacts on the modeling process, and test the hypothesis that co-registration errors can be mitigated through spatial aggregation. Additionally, given estimates of global spatial continuity and co-registration errors, we provide recommendations on the size and layout of field observations with respect to the grain size of remotely-sensed data that will help to minimize the impact of co-registration errors.
2. Materials and Methods
2.1. Theoretical Background
The impact of co-registration errors on any predictive models should be related to four primary factors: (1) the horizontal misalignment between response and predictor variables, (2) the spatial extent of the sample unit, (3) the spatial correlation of predictor and response variables, and (4) the strength and form of the relationship between response and predictor variables. Prior to performing a study, researchers typically do not know the spatial correlation of response variables, nor the strength or form of the relationship between response and predictor variables. To address this lack of information in our study, remove issues of measurement error, and focus our study solely on the impacts of co-registration error, we constrain our predictor surfaces to have a one-to-one relationship with our response variables. In this scenario, the Y and X surfaces, in the absence of co-registration errors, should exhibit a perfect linear relationship (i.e., an intercept of 0, a slope of 1, and a coefficient of determination of 1). Also, where the relationship between X and Y is linear, aggregated values (i.e., averages over multiple adjacent pixels) will also exhibit the same one-to-one relationship as non-aggregated values. Given this design, deviations from a one-to-one relationship can be solely attributed to co-registration errors.
Additionally, assuming that co-registration errors manifest as random noise within the regression models, we anticipate that the proportion of variation in
Y explained by
X (
R2) should follow the squared geometric relationship between the sample unit size (
As) and the area of overlap (
Ao) between the
X and
Y units, when the
X values are distributed independently at random over space (
Figure 1,
Appendix A). This can be expressed as follows:
In concept, each sample unit’s
Y values are related to a combination of the corresponding
X values now attached to an area only partially overlapping with the sample unit (cross-hatched area in
Figure 1), as well as to
X values attached to distinct spatial areas that have been falsely aligned with the sample unit. The latter occurs only because co-registration errors incorrectly identify a spatial match. If the
X values are distributed independently at random over space, then on average the proportion of information on
Y that can be explained by
X should correspond to the average amount of area shared between response and predictor sample units, given the registration errors. Given this assumption, deviation from this condition in our simulations can be attributed to the spatial correlation within a landscape, and provide a rationale for using measures like global Moran’s index (GMI) [
33] as predictors, to estimate the proportion of modeling error contributed by co-registration errors.
2.2. Overview
All analyses within this study were performed using R [
34]. Images created with specified amounts of spatial correlation (virtual images) were built using the raster [
35] and gstat [
36,
37] packages. Our simulations use one Landsat 8 [
38] and five NAIP [
39] images as baseline datasets taken from varying landscapes (
Figure 2,
Table 1), to produce nine virtual Landsat images and ten virtual NAIP images, respectively. To determine the amount of spatial correlation associated with the Landsat and NAIP baseline images, a uniform random selection of 20 locations were used to extract raster cell values within a 200 by 200 cell window, and to calculate empirical omnidirectional covariogram statistics [
40].
Cell values within each 200 by 200 cell window were summarized to estimate a mean digital number (DN) value, as well as sill, nugget, and range values for empirical omnidirectional covariograms. Mean DN, sill, and nugget statistics from each band were then averaged across image sources and used as inputs for creating Landsat- and NAIP-based virtual surfaces. To mimic different degrees of spatial correlation, range values were allowed to vary from 0.5 cells (completely random image) to the maximum range found among bands within each image source. Together, mean DN, sill, nugget, and ranges with a spherical spatial model were used to create virtual NAIP and Landsat surfaces (36,37). A complete listing of the code used to estimate spectral and spatial statistics and create virtual Landsat and NAIP images can be found in
Appendix B.
After creating each single band virtual images, two simulated sampling experiments were conducted, using actual and virtual Landsat 8 and NAIP images to evaluate the impacts of co-registration errors, spatial aggregation, sampling intensity, and spatial correlation on model prediction. The first set of simulations (stage I) were used to quantify the impacts of spatial aggregation of individual cells into multi-cell sampling units with regards to model prediction given co-registration error and defined spatial correlations (
Figure 3). To account for potential logistical constraints of sampling large areas in the field, a second simulation was performed (stage II) that explored the impacts of alternative subsampling configurations corresponding to varying levels of measurement intensity and sample unit extent.
Due to computational limitations associated with calculating range values for the extent of Landsat and NAIP imagery, we explored using GMI as a surrogate for range. GMI, while different than range, quantifies spatial correlation as an index value bounded between −1 (negative correlation) and 1 (positive correlation), with a value of zero corresponding to no spatial correlation (completely random image). For each band within each image of our simulations, GMI was calculated as follows:
With x equal to values within a raster surface indexed by I and j rows and columns, wij representing a weighted spatial matrix (rook’s case), N being the number of cells, and W being the sum of all weights. The remainder of this section describes in detail the design and implementation of each simulation stage within our study and model fitting used to estimate the impact of co-registration errors.
2.3. Stage I Simulations
Co-registration errors were mimicked based on published NAIP (6 m) [
39], Landsat 8 (37 m) [
41], and global positioning system (GPS; 7 m) [
42] horizontal errors. For each image, 200 sample locations (L
1) were selected spatially at random and used to extract DN values for sequentially increasing spatial extents, with side lengths ranging from 1–100 cells (response). Mean cell DN values were calculated and recorded for the extent of the sample unit size at each L
1 location. L
1 locations were then randomly shifted based on co-registration errors between GPS locations and imagery to produce L
2 locations. Random shifts were implemented to the nearest cell by adding random distances and azimuths to easting and northing coordinates, based on a normal distribution with mean 0 and standard deviation expressed as the root mean squared error (RMSE) for each source of spatial error. Because Landsat 8 absolute geodetic accuracy is reported at the 90% confidence level, while NAIP imagery is reported at the 95% confidence level, we adjusted Landsat 8 horizontal error to the 95% confidence level. For Landsat 8, this transition amounts to an absolute error of 48 m (1.6 cells). The source code used to perform spatial shifts (function shiftXY), image value extractions (function extractRC), and mean calculations (function getMeanBlockValue) can be found in
Appendix B (jhLib.r).
L2 locations follow the same DN extraction and summarization process as L1 locations (predictor). Using response and predictor variables for each image, band, and sample unit size, we performed a simple linear regression using ordinary least squares and recorded RMSE (measured in units of mean DN value), as well as intercept, slope, and coefficient of variation (R2) fit statistics. To minimize the effects of sampling variation, this procedure was performed 10 times, and regression results were averaged across all iterations. Additionally, for each image and band GMI was calculated. Regression fit statistics and coefficients were then compared across sample unit sizes and spatial correlation to identify and quantify the impact of co-registration errors and determine an array of suitable field sampling extents, to evaluate measurement intensity for Stage II of the simulations.
2.4. Stage II Simulations
Preferably, when relating remotely sensed data to field samples, the entire area within a sample unit would be measured on the ground. However, due to practical limitations related to collecting field data for sample units with large spatial extents, this is often not economically feasible. This situation can lead to instances when the only practical way to estimate a mean for a spatial extent is to use subsampling. To quantify the impact of six common subsampling (subplots) layouts and various subsampling intensities (area measured) within a given sample unit size (plot), we investigated multiple plot/subplot layouts. A depiction of plot extents, subplot layouts, and subsampling intensities are illustrated in
Figure 4.
Identical to Stage I simulations, Stage II simulations mimic registration errors for 200 observations and extract cell values surrounding each plot location for each sample unit size. For the response variable, the mean values for each sample unit size are calculated based on the spatial extent of one of six subplot layouts, and subsampling intensities ranging from 0.05 to 0.95 of the plot extent, by increments of 0.05. Subplot layouts include one subplot located in the center of the plot (One), four subplots located systematically in the corners of the plot (Sys 4), four randomly located subplots within the plot (Rnd4), four subplots oriented in a similar fashion as the Forest Inventory Analysis (FIA) program plot protocol (FIA 4) [
43], five subplots systematically placed within the plot extent (Sys 5), and nine plots systematically placed within the subplot (Sys 9). For predictor variables, mean values were calculated using all cell values within the extent of the plot (P
all), and for only the areas within the subplots (P
sub). Regression fit statistics and coefficients were then compared with results from 100% of the sample unit size measured in stage I.
2.5. Modelling the Impacts of Co-Registration Error
After preforming each simulation and recording error and fit statistics for each image, we developed a suite of models to relate those statistics to predictors measuring spatial correlation in the images (GMI) and the magnitude of spatial co-registration errors (expected proportion of area overlapped between field plots and corresponding image locations). While the overlap between two rectangles can be calculated if both the distance and direction of co-registration errors are known, the direction of co-registration errors is seldom calculated or reported. Therefore, within our iterations we estimated the expected proportion of overlap (PO) for each sample unit size, given the offsets used to simulate co-registration errors.
For virtual images with no spatial correlation, we hypothesized a one-to-one relationship between PO
2 and the proportion of variation explained (
R2). However, as spatial correlation increases within images, we anticipate that the ratio between
R2 and PO
2 will be greater than one and will interact with spatial correlation metrics. Additionally, we recognize that PO
2 would be difficult to calculate in practice given commonly reported horizontal rectifications. Therefore, when modeling the impact of co-registration errors on
R2 in the presence of spatial correlation, we used only sample unit size and GMI as predictors and beta regression with a logit link [
44]. Similar in concept to logistic regression, beta regression was developed to work with observations between zero and one, and is typically used to characterize natural rates or proportions on a continuous scale. Using a logit link, our proposed model takes the following form:
where
f() and
g() are known transformations of the sample unit size and image GMI, respectively, and the
βk are parameters estimated from the data. Transformations of predictor variables were determined based on graphical analyses. While we anticipated needing sample unit size, GMI, and their interaction to estimate
R2, we also evaluated nested models using only sample unit size and GMI. All beta regression models were compared using Akaike’s information criterion (AIC) [
45,
46].
4. Discussion
Through our simulations, we have documented that spatial co-registration errors produce attenuation bias in linear models (
Figure 6 and
Figure 7). For studies that relate field data located using GPS to geo- or ortho-rectified remotely sensed data, this bias will manifest in regression coefficients biased toward 0 and regression estimates trending towards the sampled mean value of the variable of interest (
Appendix C,
Table A1). While every attempt to minimize the amount of co-registration error should be taken, technical and financial limitations often make it impractical to completely remove this source of error. Due to these limitations, we explored the impacts of spatial aggregation of observational units on model performance when predictor variables have spatial co-registration errors.
Our findings demonstrate that increasing the spatial extent of sample units can help to reduce the impacts of imperfect co-registration. This result further verifies that larger field plots can mitigate the effects of co-registration error found by others [
29,
30,
47,
48]. However, when choosing the extent of a field sample unit, one must take into consideration practical issues associated with the costs of implementation and measurement, as well as the fact that large field sampling units can have a smoothing effect on spatial variability [
29]. Moreover, subsampling within the extent of a field plot, regardless of the subplot layout, introduces addition variability into the predictive models, and should be used sparingly when spatially relating field measurements to remotely sensed information.
Given the sample unit sizes, co-registration errors, and spatial correlation we investigated, we recommend selecting a field plot extent large enough to substantially reduce bias in linear regression, while also keeping the extent of the field plot as small as possible to retain spatial detail. In the case of NAIP imagery, this recommendation would correspond to a field plot with an area between 400 m2 and 1600 m2. For Landsat 8 imagery, this recommendation corresponds to a field plot with an area between 8100 m2 and 72,900 m2. Fortunately, most NAIP and Landsat 8 images have a large degree of spatial correlation, suggesting that the lower end of these recommendations may suffice in mitigating the impacts of co-registration errors. For other sources of remotely sensed information that have different co-registration errors, simulations similar to those presented in this study should be completed to help determine suitable field plot extents and sampling intensities.
If subplots are used to estimate mean values within the extent of a sample unit, it is important that the subplot layout covers as much of the area within the extent of the sample unit as possible. For NAIP imagery, we recommend measuring 75% or more of the sample unit area to minimize the negative effects of subsampling. When it is too costly to measure 75% of the area within a sample unit, a tradeoff between cost and precision must made. In this situation, collecting more sample units with less than 75% of the subsample area measured can help to offset the losses in precision associated with subsampling (
Figure 9). Additionally, when subsampling is used, layouts should be chosen such that there is no overlap among subplots, such as layout Sys 4 from our study (e.g.,
Figure 11).
The actual extent of a sampling unit should depend on the amount of co-registration error, the spatial correlation within the imagery, and the amount of model error one is willing to accept. For readily available Landsat and NAIP imagery, their reported horizontal accuracies, and their estimated spatial correlations, we can estimate the co-registration error-induced reduction in variation explained by linear regression for various sample unit sizes (Equation (4) and
Table 4). From these estimates, one can select a sample unit extent that both reduces estimation bias and quantifies error in predictor variables due to co-registration. For example, if a project was to use NAIP imagery with a sample unit size of 20 cells (field plot extent of 400 m
2) and an estimated GMI of 0.92, then one would expect the logit of
R2 to be approximately 1.744, and the loss in predictive ability associated with co-registration errors to be 1 −
R2 = 0.149.
While Equation (4) and the coefficients from
Table 4 can be used to help guide the size of a field plot needed to mitigate the negative impacts of co-registration (
Appendix C,
Table A2), they should be interpreted as a best-case scenario. Specifically, our simulations were developed under the premise that there was a perfect one-to-one relationship between response and predictor variables. In many applications this will not be the case, and co-registration errors will be coupled with model error. To decouple co-registration errors from model errors, model coefficients can be dis-attenuated [
26,
49]. Within that context, simulations similar to the ones performed in our study, which use a random sample of the predictor variables and regress those values against shifted locations, can be used to estimate a ratio adjustment factor for model coefficients, as described by Forest and Thompson [
49].
Appendix B provides examples of
R coding that can be used to simulate co-registration errors and determine ratio adjustment factors.
Fortunately, most remotely-sensed images have relatively high levels of spatial correlation, which in turn dampens the impacts of co-registration errors. In our study, we evaluated the effects of co-registration on model error for levels of spatial correlation that spanned independent random landscapes, to those commonly found in terrestrial environments. For all actual landscapes used in our study, the minimum spatial correlation found had a GMI value of 0.93. Interestingly, virtual images with ranges comparable to actual image ranges had corresponding GMI values that were substantially less than those found in the actual images. This is likely due to the dramatic transitions found between land use and cover types that can occur within actual landscapes (e.g., a forest bounded by grass lands). This further suggests that natural landscapes have more localized spatial correlation than our virtual landscapes, and that edges between land use and cover types constitute a substantial amount of the overall area within an image. Because these edge areas can make up a substantial component of the landscape, it is important that they are included in future investigations that use simulated landscapes, and more importantly, model training. Mapping endeavors that omit these transition areas from training sets do so at the cost of extrapolating model results to potentially large portions of an image.