Clustering Tools for Integration of Satellite Remote Sensing Imagery and Proximal Soil Sensing Data

: Remote sensing (RS) and proximal soil sensing (PSS) technologies o ﬀ er an advanced array of methods for obtaining soil property information and determining soil variability for precision agriculture. A large amount of data collected by these sensors may provide essential information for precision or site-speciﬁc management in a production ﬁeld. Data clustering techniques are crucial for data mining, and high-density data analysis is important for ﬁeld management. A new clustering technique was introduced and compared with existing clustering tools to determine the relatively homogeneous parts of agricultural ﬁelds. A DUALEM-21S sensor, along with high-accuracy topography data, was used to characterize soil variability in three agricultural ﬁelds situated in Ontario, Canada. Sentinel-2 data assisted in quantifying bare soil and vegetation indices (VIs). The custom Neighborhood Search Analyst (NSA) data clustering tool was implemented using Python scripts. In this algorithm, part of the variance of each data layer is accounted for by subdividing the ﬁeld into smaller, relatively homogeneous, areas. The algorithm’s attributes were illustrated using ﬁeld elevation, shallow and deep apparent electrical conductivity (EC a ), and several VIs. The unique feature of this proposed protocol was the successful development of user-friendly and open source options for deﬁning the spatial continuity of each group and for use in the zone delineation process. characteristics and their sources.


Introduction
A delineated areal extent (DAE) is a finite part of a field representing a unique and homogeneous portion of data [1,2]. The determination of DAEs, or zones, using remote sensing (RS) and proximal soil sensing (PSS) data is becoming critical in the assessment of soil properties and the characterization of variability in precision agriculture [1][2][3][4][5][6][7][8]. In the delineation process, high-resolution data from these sensing technologies, together with quantitative methods, are used to infer the spatial pattern of soil heterogeneity [9][10][11][12][13]. To obtain information on the spatial pattern of soil parameters and produce thematic soil maps to understand a field's agronomic and yield-limiting factors, high-density and Remote Sens. 2019, 11, 1036 2 of 17 multivariate data analyses were drawn upon to isolate homogeneous field areas and identify potential management zones [14][15][16][17][18][19][20].
Multivariate data and hierarchical clustering techniques are crucial for identifying and understanding soil variability within a production field [13,[21][22][23][24][25]. Among the multivariate data analysis techniques, the unsupervised clustering techniques of fuzzy c-means and k-means are most commonly used for data mining [26][27][28][29][30][31][32]. Because of the fuzziness of c-means and k-means and other limitations-each cluster object can belong in more than one group and boundary pixels are created-in the isolation process [8,33,34], this study attempted to provide a multivariate and hierarchical clustering tool to represent unique thematic maps and zonal boundaries based on the homogeneity of the agricultural field.
Most clustering algorithms applied in zone delineation do not handle high-density data files with multiple variables [35][36][37][38][39] or produce an optimal number of zones. As clustering techniques commonly generate fragmented management zones, agricultural scientists and farmers face challenges when implementing variable-rate operations [8,16,[40][41][42][43][44]. In practice, for field operations, the optimal number of zones should be such that the capacity of GPS-guided field equipment is neither overtaxed (too many isolated zones) nor underexploited (too few isolated zones). A survey conducted using a Real-Time Kinematic (RTK), DUALEM proximal soil sensor, and a remote sensing satellite sensor yielded high-density elevation, apparent electrical conductivity (EC a ), and surface vegetation reflectance data, respectively. In this research, the proposed data clustering algorithm was optimized to generate spatially contiguous zones to aid in the achievement of best management practice goals. This study presents the process used to develop a new and enhanced clustering technique to better understand soil variability (e.g., topography, crop performance, and high-density soil data, such as EC a ) in an agricultural field. The performance of this technique was then compared to that of other commonly used techniques.

Experimental Sites and Data Description
Situated at the Woodrill Farms near Guelph, Ontario, Canada, three agricultural fields (namely, WH, LD, and RB), differing in acreage and soil class, were surveyed using both RS and PSS sensors (Table 1 and Figure 1). The PSS equipment was pulled behind an all-terrain vehicle and measured elevation and EC a data points for the experimental sites at intra-and inter-row spacings of 5 m and 10 m, respectively. Elevation data points were collected by an RTK Global Navigation Satellite Systems (GNSS) receiver (Trimble Inc., California, USA) ( Table 2). On the basis of the high-density elevation points, a digital elevation model (DEM) was created with a spatial precision of about 2 cm horizontally and 3 cm vertically. Slope, aspect ratio [sin(aspect/2)], and a topographic wetness index (TWI) were derived from a DEM of the study sites. Developed by Beven and Kirkby [45] and serving to investigate hydrological processes controlled by topography, the TWI was determined using the SAGA GIS v.2.4 (University of Hamburg, Germany). The DUALEM-21S system (DUALEM Inc, Milton, ON, Canada) had one transmitter coil and four receivers-two of horizontal coplanar (HCP) geometry and two of perpendicular coplanar (PRP) geometry-at a separation distance of 1 to 2 m. It was used to collect EC a at four different depths: PRP1 at 0-0.5 m, PRP2 at 0-1.0 m, HCP1 at 0-1.6 m, and HCP2 at 0-3.0 m ( Table 3). The pre-processing procedures for the collection of RTK elevations and EC a values were similar and included a raw data display, the identification of missing values, median filtering, and the removal of outliers. Culled data included: (i) start pass and end pass delays, (ii) points with overspeed limits, (iii) values outside the user-defined minimum and maximum values, and (iv) pitch or roll changes outside the acceptable limit. Data outliers were removed on the basis of the criteria above, such that about 15% of data points were removed. Various methods of geospatial data processing were undertaken on multiple data layers, including rectification, interpolation, and point data extraction. These methods enhanced the data quality for further analysis.     WH field boundary with soil apparent electrical conductivity (EC a ) data points (b), LD field boundary with soil EC a data points (c), and RB field boundary with soil EC a data points (d).
A Sentinel-2 image was used to analyze bare soil and vegetation characteristics (Table 4). Remote sensing image processing steps were followed, including radiometric correction, stitching, co-registration, and stack bands. One OrthoPhoto and two Sentinel-2 images were used for co-registration and visual interpretation with zonal thematic maps. In addition to the traditional visible (RGB) and near-infrared (NIR) spectral bands, Sentinel-2 imagery presented red edge part of the spectrum as well. Spectral indices were produced from Sentinel-2 data to identify the strong absorption spectrum of chlorophyll. These included the Difference Vegetation Index (DVI), the Normalized Difference Red Edge Index (NDRE), the Normalized Difference Vegetation Index (NDVI), and the Modified Soil Adjusted Vegetation Index (MSAVI2). Among the vegetation indices (VIs), NDVI maps were found to be more suitable and were used for the clustering process [46,47].

Interpolated Maps of Selected Sensor Variables
Ordinary Kriging interpolation maps were generated from the PSS measurements in ESRI ArcGIS software (v10.5.1). Kriged maps (with a spatial resolution of 5 m) showing RTK elevation (DEM), derived topographic variables (including slope, aspect, and TWI), and DUALEM sensor variables (HCP1, HCP2, PRP1, and PRP2) were produced. Slope and aspect showed similar clustering patterns as TWI and thus were deemed redundant. In the final clustering process only TWI was used. Indices from NDVI maps (with a spatial resolution of 10 m) were extracted for the clustering tool. Those maps represented significant variations across the expanse of each field (Figures 2-4). The interpolated maps were extracted into a data file of multiple layers. Finally, a text data file was generated to store the sensor-derived variables for input into the newly developed clustering tool and commonly used fuzzy clustering techniques.
To delineate zones, the multilayer data files were analyzed by the proposed data clustering tool. The new data clustering algorithm and its processing steps are elaborated in detail in the following section, as well as the new algorithm's clustering outputs in comparison to outputs from fuzzy clustering techniques.

Data Clustering Algorithms
Fuzzy c-means calculated by the management zone analyst (MZA) [48] were used to generate the normalized classification entropy (NCE) and fuzziness performance index (FPI) of the five zones. The k-means algorithm in the Python data library was used to generate (k = 5, k =15, and k = 25) clusters and find cluster centers using the sum of square distances of all data points and the number of cases in each cluster. Initially, five user-defined clusters were defined in the above clustering methods; however, the optimum number of zones was determined in the final step and compared between the two methods.
The proposed data clustering method, called the Neighborhood Search Analyst (NSA), resulted in the algorithms shown in Figure 5. The processing steps and formula were adopted from the NSA and are written in MATLAB scripts [6]. Preliminary tests of the algorithm in numerous production fields highlighted the algorithm's robustness when partitioning field areas using several field measurements. To construct an objective function to be optimized through the data grouping process, the mean squared error (MSE) was calculated for each individual data layer k according to: where X ij is a sensor value for the i th grid cells within the j th group; X j is the mean of j th group; N is the total number of grid cells; m is the number of groups; and n j is the number of grid cells within the j th group.
It should be noted that the difference between the total number of grid cells and the number of groups can be determined by: Since the algorithm initially assumes that all grid cells belong to the same group, labeled "1" and designated as "the rest of the field", then MSE k (m = 1) represents the variance of the k th data layer across the entire field. Given that the area of the field is substantially greater than the area of a grid cell, MSE k (m = 1) can be termed Farthest Distance Variance (FDV k ). In such a situation, the portion of data variance accounted for by distributing N grid cells among m groups can be calculated as: where MSE k (m = 1) can be called Farthest Distance Variance (FDV k ). The maximum value of R 2 k can be obtained when MSE k is as small as possible. It approaches 1when the number of groups increases. Since the result can be considered less favorable if at least one data layer k is not adequately accounted for, it is reasonable to employ the integration operator OR instead of the more common AND. This avoids the need to assign a weight factor to each individual data layer when adding corresponding MSE k estimates. In mathematical terms, this would mean that the product of all R 2 k should be maximized. Therefore, the objective function (OF) was defined as: where K is the number of PSS data layers. In this study, the smallest number of data elements that could be grouped within the grid cell square window was nine (3 × 3). Therefore, the maximum accountable variance is the variance of PSS Remote Sens. 2019, 11, 1036 9 of 17 measurements between immediate neighbors. The Shortest Distance Variances (SDV k ) can be found using: where w is the total number of 3 × 3 square windows of grid cells.
Since SDV k represents the smallest MSE k value, the maximum value of R 2 k is calculated as: This R 2 k max parameter can range between 0 and 1. It is equal to 0 when data layer k is either uniform or highly variable, so that SDV k = FDV k . In such a case, the data layer should not be able to affect changes in the OF. Alternatively, when R 2 k max is close to 1, the data layer has a strong spatial structure (SDV k << FDV k ), and OF must be sensitive to the change of MSE k corresponding to that particular data layer. In mathematical terms, this goal can be achieved by multiplying all R 2 k values raised to the R 2 k max power of: The resultant OF indicates the overall quality of grid cell groupings. It varies from 0 to 1 and approaches high values when every spatially structured layer of PSS measurement is separated among spatially continuous groups of grid cells with minimum internal group variance. Such groups represent different combinations of average PSS measurements obtained with different sensors that diverge from average field conditions. To facilitate the formation of grid cell groups that would maximize the OF, the NSA algorithm was implemented in this study using Python v3.6 (created by Guido van Rossum and managed by Python Software Foundation, Delaware, USA).
where w is the total number of 3x3 square windows of grid cells.
Since SDVk represents the smallest MSEk value, the maximum value of R 2 k is calculated as: This R 2 k max parameter can range between 0 and 1. It is equal to 0 when data layer k is either uniform or highly variable, so that SDVk = FDVk. In such a case, the data layer should not be able to affect changes in the OF. Alternatively, when R 2 k max is close to 1, the data layer has a strong spatial structure (SDVk << FDVk), and OF must be sensitive to the change of MSEk corresponding to that particular data layer. In mathematical terms, this goal can be achieved by multiplying all R 2 k values raised to the R 2 k max power of: The resultant OF indicates the overall quality of grid cell groupings. It varies from 0 to 1 and approaches high values when every spatially structured layer of PSS measurement is separated among spatially continuous groups of grid cells with minimum internal group variance. Such groups represent different combinations of average PSS measurements obtained with different sensors that diverge from average field conditions. To facilitate the formation of grid cell groups that would maximize the OF, the NSA algorithm was implemented in this study using Python v3.6 (created by Guido van Rossum and managed by Python Software Foundation, Delaware, USA).

c-Means Clustering
On the basis of the seven input variables (i.e., elevation, TWI, NDVI, HCP1, HCP2, PRP1, and PRP2) of the WH field, Euclidean distance-based NCE and FPI indices in FCM clustering were assessed for their performance in creating an optimum number of zones. Comparing the NCE index to the FPI index showed that the maximum value was reached only in zones 4 and 5 ( Figure 6). This clustering method is flaws when it comes to obtaining an optimum number of zones [8,49,50]. The FCM clusters produced pixels with isolated boundaries in various parts of the field [51,52]. Many studies have reported this representation problem regarding the clustering of data due to the fuzzy boundary [16,32,53,54]. In the present method, user-defined numbers of clusters were produced without considering the geospatial locations of the dataset (spatial continuity) or their distances.

c-Means Clustering
On the basis of the seven input variables (i.e., elevation, TWI, NDVI, HCP1, HCP2, PRP1, and PRP2) of the WH field, Euclidean distance-based NCE and FPI indices in FCM clustering were assessed for their performance in creating an optimum number of zones. Comparing the NCE index to the FPI index showed that the maximum value was reached only in zones 4 and 5 ( Figure 6). This clustering method is flaws when it comes to obtaining an optimum number of zones [8,49,50]. The FCM clusters produced pixels with isolated boundaries in various parts of the field [51,52]. Many studies have reported this representation problem regarding the clustering of data due to the fuzzy boundary [16,32,53,54]. In the present method, user-defined numbers of clusters were produced without considering the geospatial locations of the dataset (spatial continuity) or their distances.

k-Means Clustering
In the k-means clustering (k=5), the data values were taken directly from the input table of the WH field for generating cluster centers (Figure 7a). Data were standardized and normalized for the specific variable values. Among the five user-defined clusters, clusters 1, 2, 3, and 5 used the most data points. Since there was a random component, after several runs of each clustering process, the coefficient of determination (R 2 ) varied according to how the k-means algorithm was initialized. The cluster map consisted of groups of pixels with isolated boundaries in various parts of the WH field ( Figure 7b). Figure 7b shows that the k-means cluster map of the WH field generated 36 scattered zones of user-define clusters (k=25).

k-Means Clustering
In the k-means clustering (k=5), the data values were taken directly from the input table of the WH field for generating cluster centers (Figure 7a). Data were standardized and normalized for the specific variable values. Among the five user-defined clusters, clusters 1, 2, 3, and 5 used the most data points. Since there was a random component, after several runs of each clustering process, the coefficient of determination (R 2 ) varied according to how the k-means algorithm was initialized. The cluster map consisted of groups of pixels with isolated boundaries in various parts of the WH field ( Figure 7b). Figure 7b shows that the k-means cluster map of the WH field generated 36 scattered zones of user-define clusters (k=25).

c-Means Clustering
On the basis of the seven input variables (i.e., elevation, TWI, NDVI, HCP1, HCP2, PRP1, and PRP2) of the WH field, Euclidean distance-based NCE and FPI indices in FCM clustering were assessed for their performance in creating an optimum number of zones. Comparing the NCE index to the FPI index showed that the maximum value was reached only in zones 4 and 5 ( Figure 6). This clustering method is flaws when it comes to obtaining an optimum number of zones [8,49,50]. The FCM clusters produced pixels with isolated boundaries in various parts of the field [51,52]. Many studies have reported this representation problem regarding the clustering of data due to the fuzzy boundary [16,32,53,54]. In the present method, user-defined numbers of clusters were produced without considering the geospatial locations of the dataset (spatial continuity) or their distances.

k-Means Clustering
In the k-means clustering (k=5), the data values were taken directly from the input table of the WH field for generating cluster centers (Figure 7a). Data were standardized and normalized for the specific variable values. Among the five user-defined clusters, clusters 1, 2, 3, and 5 used the most data points. Since there was a random component, after several runs of each clustering process, the coefficient of determination (R 2 ) varied according to how the k-means algorithm was initialized. The cluster map consisted of groups of pixels with isolated boundaries in various parts of the WH field ( Figure 7b). Figure 7b shows that the k-means cluster map of the WH field generated 36 scattered zones of user-define clusters (k=25).

NSA Clustering
In the NSA zone delineation process, unlike other clustering algorithms, providing the number of field partitioning clusters is not obligatory. Without defining the number of clusters, NSA produced an optimum number of groups for the grid cell (grid size of 20 m), separately, for seven different input variables. More importantly, this clustering tool efficiently delimited maps by providing the optimum number of zones for field management (Figures 8a, 9a and 10a). On this basis, the WH, LD, and RB fields have 28, 20, and 27 georeferenced zones, respectively. For NSA clustering, user-defined (k = 5, k = 15, and k = 25) zones were delineated and are illustrated later in this paper.

NSA Clustering
In the NSA zone delineation process, unlike other clustering algorithms, providing the number of field partitioning clusters is not obligatory. Without defining the number of clusters, NSA produced an optimum number of groups for the grid cell (grid size of 20 m), separately, for seven different input variables. More importantly, this clustering tool efficiently delimited maps by providing the optimum number of zones for field management (Figures 8a, 9a, and 10a). On this basis, the WH, LD, and RB fields have 28, 20, and 27 georeferenced zones, respectively. For NSA clustering, user-defined (k = 5, k = 15, and k = 25) zones were delineated and are illustrated later in this paper.     In NSA, zone delineation was performed by the individual R 2 values of each variable (Figures 8b, 9b and 10b) and overall OF (Figures 8c, 9c and 10c). These graphs show the part of the variance of each data layer which was accounted for by subdividing the field into smaller areas. In each graph, a greater number of R 2 value indicated that variability within individual zones was smaller than the difference between zones. Figures 8b, 9b and 10b show that the R 2 values increased when new groups were formed or added to the existing groups. The NSA that produced R 2 max value was about 0.9, and the graph had a steeper initial slope. This indicated that the data layer had a strong spatial structure and was dominant when the field was split. Moreover, the x value (No. of cells), where most graphs leveled off, showed that the smallest level of field partitioning revealed most of the soil heterogeneity. Results in LD and RB fields indicated that R 2 for each data layer reached a maximum height (0.60) with around 500 classified grid cells, whereas R 2 reached 0.70 near the 1000-grid cell level for the WH field ( Figure 11). Roughly 60% (in LD and RB) and 70% (WH) of the field variance in both cases was accounted for by making the clusters. In NSA, zone delineation was performed by the individual R 2 values of each variable (Figures  8b, 9b, and 10b) and overall OF (Figures 8c, 9c, and 10c). These graphs show the part of the variance of each data layer which was accounted for by subdividing the field into smaller areas. In each graph, a greater number of R 2 value indicated that variability within individual zones was smaller than the difference between zones. Figures 8b, 9b, and 10b show that the R 2 values increased when new groups were formed or added to the existing groups. The NSA that produced R 2 max value was about 0.9, and the graph had a steeper initial slope. This indicated that the data layer had a strong spatial structure and was dominant when the field was split. Moreover, the x value (No. of cells), where most graphs leveled off, showed that the smallest level of field partitioning revealed most of the soil heterogeneity. Results in LD and RB fields indicated that R 2 for each data layer reached a maximum height (0.60) with around 500 classified grid cells, whereas R 2 reached 0.70 near the 1000-grid cell level for the WH field ( Figure 11). Roughly 60% (in LD and RB) and 70% (WH) of the field variance in both cases was accounted for by making the clusters.

Comparison of k-Means and NSA Clustering
At this stage, three user-defined clusters (k = 5, k = 15, and k = 25) were generated to allow a comparison of the two clustering algorithms, i.e., k-means and NSA. User-defined centers for all clusters were needed for k-means; however, these were not a requirement for the NSA algorithm. The R 2 values of the NSA algorithm were compared among the three different fields (Figure 11). The overall OF showed that all of the clusters reached maximum R 2 values close to 0.6 and up to 0.7. In the three defined k-means clusters (k = 5, k = 15, and k = 25), the R 2 of the RB field was higher: 0.78, 0.80, and 0.84 respectively ( Figure 12). Also, R 2 (k = 5) was relatively high in k-means clustering process because of the fragmentation of clusters throughout the field, while NSA clusters were always contiguous (i.e., not broken into parts). The R 2 of the k-means cluster compared to that of the NSA was higher in most of the fields and was approximately 0.80. The R 2 values were comparable when the isolated/boundary pixels in each k-means cluster were disjointed from the main cluster and created spatially contiguous zones. The k-means cluster map consisted of groups or pixels with isolated boundaries in various parts of the WH field (Figure 7b), whereas the NSA algorithm counted these as different groups and reduced the zone fragmentation (Figure 8a). In the case of the user-defined cluster (k = 25), the k-means cluster maps of WH, LD, and RB fields generated 36, 34, and 38 scattered zones respectively, whereas the NSA maps created approximately 25 spatially contiguous clusters for each of the three fields ( Figure 12).

Conclusions
The high-density and multivariate data clustering approach provided an optimal number of zones for three agricultural fields in Ontario, Canada. The preprocessing and variable selection steps common to all clustering techniques are imperative for providing a well-defined zonal boundary for developing management zones. Compared to other data clustering algorithms, NSA has a unique capability for zone separation, which allows one to produce an optimum number of zones and spatially contiguous clusters during multivariate classification. Moreover, an improved version of this software was tested and proved to be capable of handling a significant number of variables and data layers for delineating the optimum number of zones in a more robust way.
The software was found to be reliable when integrating high-density field topography, RS, and PSS data files. It had a fast processing time and could be run on any platform with open source python modules. The robust zone delineation process and georeferenced thematic maps are useful for variable rate crop management technologies and for other management purposes. Multi-sensor data fusion, advanced data filtering procedures, and the web application of the NSA could be implemented to facilitate appropriate site-specific agronomic and environmental decisions in many regions.
The zonal maps will be useful for further agronomic model calibration using targeted soil sampling. Field data, for example, crop yield and lab-measured soil properties, could be used to validate the georeferenced clusters and management zones created. Furthermore, this research enhances and provides information for better variable-rate fertilizer recommendations and can optimize pesticide and herbicide applications, thereby providing greater environmental benefits.