Sensitivity of codispersion to noise and error in ecological and environmental data

Codispersion analysis is a new statistical method developed to assess spatial covariation between two spatial processes that may not be isotropic or stationary. Its application to anisotropic ecological datasets have provided new insights into mechanisms underlying observed patterns of species distributions and the relationship between individual species and underlying environmental gradients. However, the performance of the codispersion coefficient when there is noise or measurement error ("contamination") in the data has been addressed only theoretically. Here, we use Monte Carlo simulations and real datasets to investigate the sensitivity of codispersion to four types of contamination commonly seen in many real-world environmental and ecological studies. Three of these involved examining codispersion of a spatial dataset with a contaminated version of itself. The fourth examined differences in codisperson between plants and soil conditions, where the estimates of soil characteristics were based on complete or thinned datasets. In all cases, we found that estimates of codispersion were robust when contamination, such as data thinning, was relatively low (<15\%), but were sensitive to larger percentages of contamination. We also present a useful method for imputing missing spatial data and discuss several aspects of the codispersion coefficient when applied to noisy data to gain more insight about the performance of codispersion in practice.


Introduction
Spatial associations are a fundamental aspect of most ecological and environmental data. Although accounting for spatial covariation has become routine in ecological data analysis (Fortin & Dale, 2005), ecologists and environmental scientists have been slower to appreciate and account for anisotropic patterns and processes (but see, e.g., Ellison et al., 2014). Codispersion (Vallejos et al., 2015) measures lag-dependent spatial covariation in two or more spatial processes, which may be anisotropic. Codispersion recently has been used to analyze ecological data and provide new insights into potential ecological processes that underlie observed patterns in co-occurrence between pairs of species (Buckley et al., 2016a) and in relationships between attributes of individual species and the underlying environment (Buckley et al., 2016b).
Applications of codispersion analysis that have been published to date have assumed either that there are no errors in the datasets or that any errors that are present would have no effect on the analysis. These assumptions are clearly unrealistic. The goal of this paper is to better understand how sensitive the codispersion coefficient is to different types of noise and measurement error ("contamination") in the analyzed data. We approach this goal by using Monte Carlo simulation studies to examine several classes of noise that we would expect to occur in datasets or images analyzed using codispersion.
We consider the effect on codispersion of simple observation error, in which noise is added to a fixed number of random points (or pixels) in a dataset (or image) either as white noise (spatially independent and identically distributed) or as a spatially-dependent process. We also consider images with missing values distributed either randomly throughout the sample space or in clusters (e.g., clouds obscuring a large section of a remotely-sensed image) and present different algorithms for interpolation prior to calculation of codispersion. Using bivariate data for forest tree-environment relationships, we evaluate the robustness of codispersion under different levels of error introduced into the environmental data (kriged surfaces derived from complete or "thinned" datasets mimicking those with missing values). In most cases, the effect of these sources of noise and error are tested in one of two ways: either by calculating the codispersion between the original dataset (image) and a contaminated version of itself (codispersion values are correlated to quantify the effect of the noise) or, in the case of the species-environment relationships, by calculating the codispersion between the two original datasets and then comparing the output to the codispersion calculated between the tree data and contaminated environmental data.
In Section 2, we describe the different types of contamination and observation error that we added to both real and simulated datasets. We also describe the method we used for imputing missing data. Results are presented in Section 3, and discussed in Section 4. Technical details on simulation and imputation algorithms are given in the Appendix.

Methods
In spatial modeling and time series, several types of contamination can be specified (Anselin, 1995;Fox, 1972). Here, we consider types of contamination frequently observed in spatial data. Many of our examples are from data collected on forest trees. These data will be familiar to many ecologists and environmental scientists, and to date, codispersion analysis used in ecological settings has been applied primarily to forested ecosystems.
1. Salt-and-pepper noise on an image: Salt-and-pepper noise has been used widely in image processing and computational statistics to represent real distortions (Huang and Zhu, 2010) and to generate different scenarios via Monte Carlo simulation (Mc-Quarrie and Tsai, 2003).
Assume that X is an image and suppose that X follows a zero mean normal distribution with variance σ 2 . Then we consider additive noise following a zero mean normal distribution with variance τ 2 such that τ 2 σ 2 . The contamination is located randomly in space such that a small percentage of observations are corrupted with a probability δ. Specifically, Preliminary experiments were introduced by Vallejos et al. (2015) but here we present a more extensive study. The contamination scheme was generated by using Monte Carlo simulations according to (1). We considered σ 2 = 1, τ 2 = 1, 5, 10, and the percentage of contamination δ = 0.05, 0.1, 0.25. We conjecture that the codispersion coefficient is robust for δ ≤ 0.05.
In Figure 1(a) we illustrate an aerial photograph of a forest stand in Massachusetts, USA. Figures 1 (c),(e), and (g) are contaminated versions of the original one when δ = {0.05, 0.10, 0.25}. The corresponding perspective plots shown in Figure 1(b),(d),(f), and (h) depict the effect of contamination on the gray intensities. The greater the contamination, the greater the dispersion as is observed in the z-axis of the three dimensional scatter plots displayed in Figure 1.
We compared codispersion calculated for the original image to that calculated for the contaminated images. In addition to the reference image shown in Figure 1(a) we considered other aerial images. Figure 2 displays four images with different textures related to the same forest stand in Massachusetts. The codispersion maps of these images are presented in the supplementary material for this paper.
The correlation between the spatial variables X(·) and Y (·) is controlled by the parameter ρ 12 , which allows one to generate bivariate Gaussian spatial processes with different levels of dependence. Without loss of generality, it can be assumed that the mean of the bivariate process is zero, but the theory works well for any bivariate process with mean (µ 1 , µ 2 ) . Any type of contamination can be applied over the generated dependence data. In this case, we applied salt-and-pepper noise.
We generated dependent random fields from the bivariate Matérn class of covariance functions described in Equation (2) by Monte Carlo simulation using the R package RandomFields (Schlather et al., 2017). We then added the salt-and-pepper noise, varying the additional parameter ρ 12 , which represents the known correlation between processes X(·) and Y (·). Figure 3 shows one realization of size 512×512 from a bivariate Gaussian process with correlation equal to 0.8 (images (a) and (b)). Figure 3 (c),(d), and (e) show versions of (b) contaminated with salt-and-pepper noise with the percentage of contamination equal to 5%, 15%, and 25%, respectively. Because the Gaussian process is stationary, images (a) and (b) look very smooth and regular, and any correlation between them (if it exists) is difficult to observe in the printed images.

Missing Observations at Random Locations:
We used the salt-and-pepper scheme to intentionally delete n observations at random. We first defined the percentage of contamination (δ), and then deleted that many observations from the dataset. In practice we replaced observations with NA at the randomly-selected locations. The main feature of these missing observations is that they are spatially independent of one another, but for the posterior data analysis they will remain fixed.
The imputation Algorithm described in the Appendix was not applied here because codispersion calculations are not affected when the percentage of contamination δ is small.
In Figure 4 we illustrate the missing random scheme with nine contaminated versions of the original image shown in Figure 1(a). The columns show the effect of increasing the percentage of contamination (5%, 15%, and 25% respectively), and the rows depict the effect of increasing the block size of contaminated pixels, which are 15×15, 30×30, and 60×60 respectively. The contaminated pixels have been colored in white. NAs were ignored in the computation of the codispersion coefficients because for large gaps of missing observations the computation of the codispersion coefficient will be affected for those directions h such that ||h|| is less than the maximum diameter of the missing block.

Gaps Resulting from Clusters of Missing Observations:
Missing values may be clustered, for example, either because of local difficulties in sampling or because large sections of an image are obscured by clouds or shadows. We simulated clustered missing observations for the image shown in Figure 1(a), given three different pixel sizes for the contaminated block: 200 × 200, 400 × 400, and 800 × 800 ( Figure 5). We used simple clustered geometries (squares) for ease of computation. The difference between the previous type of contamination and this one is that in the former, the contamination consisted of several blocks of small size. Here we introduced just one gap containing a large number of pixels which, in Figure 5, is located for illustrative purposes in the center of the image. However, in our simulations and analysis, we randomized both the size of the missing block and its location.
To compute codispersion coefficients for datasets with such large blocks of missing data, we needed to fill the missing gaps (impute missing data) prior to computing the codispersion coefficient. We address this issue in two ways.
First, the image with a missing gap is represented by a first-order spatial autoregressive process. The fitting of the parameters of the models is done via least squares estimation following the guidelines given by Allende et al. (2001). This estimation method was studied by Ojeda et al. (2010) and found to yield an approximated image Z of the original one X (see Algorithm 1 in the Appendix).
Second, to predict the values of the process in the locations belonging to the missing block, a reconstruction is made by applying Algorithm 1 to the four closest blocks to the missing gap as is illustrated in Figure 14. This prediction scheme is summarized in Algorithm 2 (Appendix). In simple words, the first step represents the image intensity by an autoregressive process that assumes that the intensity of any pixel is a weighted average of the intensity of the surrounding pixels. This is a model-based alternative to the average or median commonly computed using the intensities of a moving window across the image. The second step predicts the missing values using similar autoregressive models to represent the surrounding blocks. The predicted value of a pixel belonging to the missing block is a weighted average where the weights are proportional to the distance from the missing pixel to the surrounding blocks 5. Sampling Error: Spatial data often are sampled from a kriged surface, which itself is generated from a set of field observations. The information in the kriged surface is a function of the number of observations and the smoothing parameter of the covariance function (Minasny & McBratney, 2005). For a pair of spatial point processes X(·) and Y (·), where the number of observations in X(·) and Y (·) differed by several orders of magnitude, we generated a kriged surface from Y (·) using either all or two random ("thinned") subsets of forest plot data (tree species locations and soil chemistry variables), containing respectively 90% and 80% of the original soil chemistry data (Minasny & McBratney, 2005). We then sampled the kriged surfaces to obtain predicted valuesŶ (·) for each of the sampled points (tree locations) in X(·).
To assess sampling error, we used data from plants and soils collected in the 50-ha forest dynamics plot on Barro Colorado Island, Panamá (Condit, 1998;Hubbell et al., 1998Hubbell et al., , 2005. Of the 299 plant species mapped, identified, and measured every five years in this plot, we used six: Alseis blackiana, Oenocarpus mapora, Hirtella triandra, Protium tenuifolium, Poulsenia armata, and Guarea guidonia ( Figure 6). The abundances of unique single-stemmed individuals of each of these six species ranged from 993 (Poulsenia armata) to 7928 (Alseis blackiana), and included species that had a range of positive, negative, and weak associations with measured soil variables (John et al., 2007). Spatial locations and diameters of individual trees of each species (excluding dead individuals and individuals with more than one stem) were taken from the seventh (2010) semi-decadal census of the plot.
Soil samples were collected on a 50-m lattice in 2005 with additional samples taken at finer spatial grains at alternate sampling stations (John et al., 2007). Soil samples were analyzed for concentrations of 11 elements; we used only data for concentrations of calcium (Ca), phosphorus (P), and aluminium (Al), which had the highest loadings on the first three principal axes of a multivariate analysis (NMDS) on the complete soil dataset (John et al., 2007). We used ordinary kriging in the geoR package (Ribeiro & Diggle, 2001), version 1.7-5.2, to fit a surface to the data for each soil element and predict its concentration at the location of each tree (Figure 7). Variogram models (exponential, exponential, and wave for Ca, P, and Al, respectively) needed as input for the kriging function were fit to detrended (2 nd -order polynomial) data that had been Box-Cox transformed (λ = 0.5, 1.0, and 1.0 for Ca, P, and Al, respectively); kriging was done on back-transformed data to which the trend had been added. Nuggets were estimated empirically for Ca and P, but the nugget for Al was fixed (following visual inspection of the empirical variogram) equal to 4,000.

Availability of data and code
BCI vegetation and soils data are available from http://ctfs.si.edu/webatlas/datasets/ bci/. Analyses were done using the R software system, version 3.3.1 (R Development Core Team, 2016). The images and all the code used in this paper are available from (blinded/website)

Results
We used codispersion maps (Buckley et al., 2016a;Vallejos et al., 2015) to explore the possible patterns and features caused by the introduction of noise and to evaluate the performance of the codispersion coefficient when the process was contaminated with one of the distortions described above. Recall that the generation of the noise is through statistical models that do not necessarily include a particular direction in space. The effects of specific directional contamination on codispersion was investigated by Vallejos et al. (2015). The only effect observed when the forest image was contaminated with salt-and-pepper noise (Figure 1) was a trend of decreasing codispersion between the original and contaminated images with an increase in the percentage of contamination (Figure 8). In the case of the dependent processes generated by a Gaussian process with covariance matrix as in (2), we plotted the codispersion maps between the original and contaminated images displayed in Figure 3. The salt-and-pepper contamination caused a complete loss of correlation between the two images, which were originally correlated by 0.8 (Figure 9). A decrease in codispersion between the original and contaminated images was also observed when noise was introduced through missing observations at random locations or as the missing block size increased (Figure 10). Figure 5 illustrates how we introduced large gaps of missing values in the center of the reference image shown in Figure 1(a). Before computing the codispersion map, we imputed the missing data (Algorithm 2 in the Appendix). Although the performance of such algorithms strongly depends on the size of the block of missing observations, the construction of it is based on the spatial information contained in the nearest neighbors (Algorithm 1 in the Appendix). The spatial autoregressive lags in the AR-2D process are fixed when the order of the process is chosen. In this case, three neighbors were considered in a strongly causal set to guarantee an infinite moving average representation of the process. The images filled by the imputation algorithm are shown in Figure 11(d)-(f). The filled areas are smooth in terms of texture and have a smaller variance. Visually, the imputation of the larger missing block looks different from the rest of the image. For small missing blocks it is difficult to see the imputed values. From Figure 11(g)-(h), we observed that Algorithm 2 was able to recover valuable information and that the codispersion between the original and imputed images in all cases was close to one.
Finally, the codispersion between tree species' diameters (for the six species shown in Figure 6) and the three soil elements (Figure 7) sampled in the Barro Colorado Island plot at three levels of data 'thinning' (none, 10% and 20%) showed that codispersion was robust to this form of contamination. Only the results for the most abundant ( Figure 12) and the least abundant ( Figure 13) species are shown.

Discussion
The methods and examples developed in this paper improve our understanding of the behavior of the codispersion coefficient when data have been contaminated. The codispersion coefficient appears to be robust for small percentages of contamination (< 15%), but always leads to an underestimation of the codispersion between the datasets. As the percentage of contamination increases, the codispersion decreases in all directions on the plane. We note that the types of noise considered in this paper did not affect the codispersion in any particular direction(s). Although the performance of codispersion for directional noise was explored by Vallejos et al. (2015Vallejos et al. ( , 2016, directtional noise has not yet been observed in real datasets. When applied to environmental data, codispersion has been shown to useful for describing scales of covariation in two or more variables across complex spatial gradients (e.g., Buckley et al. 2016a, b). Our ability to detect such spatial pattern depends on the grain of spatial variation in the data and how this compares to the lag sizes used in the codispersion analysis. For example, the complete loss of correlation between the two images in Figure  3 under only a small degree of contamination highlights the importance of considering the spatial grain of the datasets relative to that of the noise-inducing processes. The coarsergrained spatial pattern in the forest images is retained, even under contamination, whereas the spatial dependence in the images in Figure 3 is at a smaller grain than the extent of the image, which is relatively heavily disturbed by the salt-and-pepper noise.
The imputation algorithm described in the Appendix seems to be a promising technique to handle blocks of missing observations. Several aspects of it are worth exploring with future research. These include the success of the algorithm in recovering missing observations as a function of the block size; how to select the number of neighbors to be considered in the AR-2D process; and the similarity between the texture of the imputed observations and the texture of the reference image. For simplicity and without loss of generality, the missing blocks we illustrated were square regions located in the center of the image, but certainly Algorithm 2 could be extended to other types of regions located anywhere in the image.
More general aspects of codispersion analysis are in need of further exploration and testing. First, it will be of interest to study the results of codispersion analysis of rasterized images. This is because rasterization of images is widespread and common rasterization methods rarely, if ever, preserve the original spatial correlation of each process. The development of a new rasterization method that preserves better the spatial correlation within processes could follow Goovaerts (2010). Second, the computation of codispersion maps is computationally expensive. Thus, the development of efficient algorithms capable of creating codispersion maps for large images is still needed.

Acknowledgements
Appendix A Image Imputation Algorithm The algorithm described below is based on the fact that it is possible to represent any image by using unilateral AR-2D processes (Ojeda et al., 2010). The generated image is called a local AR-2D approximated image by using blocks.
Let Z = {Z r,s : 0 ≤ r ≤ M − 1, 0 ≤ s ≤ N − 1} be an original image, and let X the original image corrected by the mean. That is, X r,s = Z r,s − Z, for all 0 ≤ r ≤ M − 1, 0 ≤ s ≤ N − 1, and for which Z is the mean of Z.
Following Bustos et al. (2009), assume that X follows a causal AR-2D process of the form X r,s = φ 1 X r−1,s + φ 2 X r,s−1 + φ 3 X r−1,s−1 + ε r,s , where (r, s) ∈ Z 2 , (ε r,s ) (r,s)∈Z 2 is Gaussian white noise, and φ 1 , φ 2 , and φ 3 are the autoregressive parameters. Let 4 ≤ k ≤ min(M, N ). For simplicity we consider that the images to be processed are arranged in such a way that the number of columns minus one and the number of rows minus one are multiples of k − 1; Then we define the (k − 1) Now suppose that image Z has a rectangular block of missing values. Without loss of generality, assume that the rectangular block of missing values is of size (K − 1) × (K − 1). Furthermore, in each border, X (l) ,l = 1, 2, 3, 4, is defined as a block of information of Z of size K × K, such as appears in Figure 14. Also assume that for all l, X (l) is represented by a AR-2D model of the form r−1,s−1 + ε (l) r,s , l = 1, 2, 3, 4. Compute the least square (LS) estimators of φ 1 , φ 2 and φ 3 associated with block are the LS estimators of φ 1 , φ 2 and φ 3 respectively. 4: end for 5: The approximated image Z of Z is: 3 are estimated using the block X (l) for l = 1, 2, 3, 4, respectively. Then the prediction model is where A (l) is the index set for which X (l) is known and i, j = 1, . . . , K. The prediction algorithm is the following Algorithm 2 Prediction Algorithm. Input: An image Z with a missing block, and K. Output: Image Z without missing values.