Unsupervised Parameterization for Optimal Segmentation of Agricultural Parcels From Satellite Images in Different Agricultural Landscapes

: Image segmentation is a cost-effective way to obtain information about the sizes and structural composition of agricultural parcels in an area. To accurately obtain such information, the parameters of the segmentation algorithm ought to be optimized using supervised or unsupervised methods. The difficulty in obtaining reference data makes unsupervised methods indispensable. In this study, we evaluated an existing unsupervised evaluation metric that minimizes a global score (GS), which is computed by summing up the intra-segment uniformity and inter-segment dissimilarity within a segmentation output. We modified this metric and proposed a new metric that uses absolute difference to compute the GS. We compared this proposed metric with the existing metric in two optimization approaches based on the Multiresolution Segmentation (MRS) algorithm to optimally delineate agricultural parcels from Sentinel-2 images in Lower Saxony, Germany. The first approach searches for optimal scale while keeping shape and compactness constant, while the second approach uses Bayesian optimization to optimize the three main parameters of the MRS algorithm. Based on a reference data of agricultural parcels, the optimal segmentation result of each optimization approach was evaluated by calculating the quality rate, over-segmentation, and under-segmentation. For both approaches, our proposed metric outperformed the existing metric in different agricultural landscapes. The proposed metric identified optimal segmentations that were less under-segmented compared to the existing metric. A comparison of the optimal segmentation results obtained in this study to existing benchmark results generated via supervised optimization showed that the unsupervised Bayesian optimization approach based on our proposed metric can potentially be used as an alternative to supervised optimization, particularly in geographic regions where reference data is unavailable or an automated evaluation system is sought.


Introduction
Agriculture is the single largest land use (LU) covering the Earth's land surface [1]. The increasing global population and the accompanying increase in food consumption are placing unparalleled demands on agricultural lands [2]. Some of the negative impacts of these demands include the loss of biodiversity [1], the degradation and destruction of natural ecosystems [3], and an increase in greenhouse gas (GHG) emission [4]. Ensuring food security while minimizing the negative impact of agriculture on the environment requires the use of sustainable agricultural practices [2,5]. Formulating agricultural and environmental policies that ensure sustainable agriculture requires the development of an agricultural monitoring system. The foundation of such a system is accurate and up-to-date agricultural LU maps [6,7]. Agricultural LU maps are essential input data for various processes such as the estimation of biomass and yield [7], monitoring of the phenology of different agricultural LU types [7], modeling of GHG variability [8], estimation of the area of agricultural lands [9], and control of area-based subsidies paid to farmers [9].
The generation and continuous update of agricultural LU maps using traditional methods such as field surveys are inefficient and expensive [8]. Remote Sensing (RS) provides a better alternative due to the frequency at which data can be acquired over large geographical areas [10,11]. The availability of high-resolution satellite images has increased the popularity of Object-Based Image Analysis (OBIA) over traditional pixel-based image analysis [12]. Unlike pixels, which carry only spectral information, image objects additionally carry contextual and spatial information [12], thereby making them more useful for subsequent processes such as classification. The advantages of OBIA over pixel analysis for generating agricultural LU maps have been reported by these authors [10,13,14].
Image segmentation, which is the process of clustering image pixels into homogeneous objects, is a critical step in OBIA [15]. Various authors [16][17][18][19][20] have proved that the quality of segmentation has a direct impact on classification accuracy. One of the most popular segmentation algorithms is the Multiresolution Segmentation (MRS) algorithm proposed by Baatz et al. [21]. MRS is a bottomup region merging algorithm that starts with one-pixel objects and then in a pairwise manner merges smaller objects into bigger ones until a user-given scale threshold is met [22]. In a recent review article by Ma et al. [23], the MRS algorithm as implemented in the eCognition software [24] accounted for 80.9% of 254 case studies the authors reviewed. This overwhelming popularity hinges on the fact that some exhaustive evaluation studies [25][26][27] have had eCognition coming up tops. In eCognition, the three main parameters that influence the quality of the MRS segmentation are scale, shape, and compactness. To obtain optimal segmentation results, it is imperative to optimize these parameters.
To optimize any segmentation algorithm, the quality of the segmentation output of that algorithm for different parameter combinations ought to be evaluated. This can be done through visual inspection, supervised segmentation evaluation, or unsupervised segmentation evaluation [28,29]. Visual inspection is subjective and inherently limits the number of segmentation evaluations that can be done due to its laborious nature [29]. The supervised evaluation methods assess a segmentation result by comparing it to a reference data and computing a global score (GS) that represents the degree of similarity between the segmentation result and the reference data [29]. The main limitation of supervised segmentation evaluation is that the acquisition of reference data is expensive and time-consuming [29]. This makes unsupervised segmentation evaluation indispensable, as it does not rely on reference data but purely on the content of an image to evaluate the segmentation result [29]. For the unsupervised evaluation methods, the GS is a statistical measure that indicates the level of intra-region uniformity and/or inter-region dissimilarity within the segmentation result [30]. In RS, two of the most used methods are the estimation of scale parameter (ESP) [31,32] tool and the objective function [33]. The ESP tool only addresses the intra-region uniformity of segments by making use of local variance graphs [34]. The objective function of Espindola et al. [33] is a combined measure that addresses intra-region uniformity through average area-weighted variance (WV) and inter-region dissimilarity through spatial autocorrelation using the global Moran's I (MI) [35]. A comparative analysis by Grybas et al. [36] showed that the objective function outperformed the ESP tool. Various variations [19,[37][38][39][40][41][42] of the objective function have been used in the literature.
To compute the GS for each input image band, Espindola et al. [33] separately normalized the WV and MI between zero and one before summing them up. Böck et al. [43] identified a weakness with this normalization step, pointing out that the selection of which segmentation is optimal was dependent on the user-defined scale parameter range. They subsequently proposed the use of fixed ranges to normalize the WV and MI. This produced stable results regardless of the input range of the scale parameter. In the remainder of the paper, we call this modification of Böck et al. [43] the Böck metric. Georganos et al. [16] identified some limitations with the normalization approach of the Böck metric, which triggered them to propose a different approach. The problem with their approach is that it adds some level of subjectivity to the evaluation process, because it requires some initial empirical tests. This makes their proposal unusable within our context of having a metric that can be used for automated segmentation evaluation without any human intervention.
In this study, we aimed at proposing a new unsupervised evaluation metric for assessing the segmentation output of any segmentation algorithm. To do so, we modified the Böck metric and proposed absolute difference (AD) as a means of computing the GS. We compared the Böck and AD metrics by separately using each of them in two unsupervised optimization approaches to optimize the parameters of the MRS algorithm to delineate agricultural parcels from 21 Sentinel-2 images of 10x10 km sizes in Lower Saxony, Germany. In the first optimization approach, as is mostly done in the literature [20,31,[37][38][39][43][44][45], we optimized scale while keeping the shape and compactness parameters constant at their default values. In the second optimization approach, we employed Bayesian optimization to optimize all three MRS parameters. The optimal segmentation results identified by each metric were evaluated with parcels from the Land Parcel Identification System (LPIS), which is a spatial database of agricultural parcels and their land-use types as declared by farmers within the European Union (EU) [46,47]. The optimal segmentation results of the Böck and AD metrics were compared to each other per each optimization approach. Further, we compared the optimal segmentation results of the unsupervised Bayesian optimization approaches based on the Böck and AD metrics to the benchmark segmentation results of Tetteh et al. [47], where they used supervised Bayesian optimization.

Study Area and Data
In this study, we used cloud-free Sentinel-2 images downloaded from the Copernicus Open Access Hub (https://scihub.copernicus.eu) covering the German federal state of Lower Saxony. The images were pre-processed in the previous study of Tetteh et al. [47] using the standard procedure of converting the top-of-atmosphere Level-1C images to the bottom-of-atmosphere Level-2A images with Sen2Cor [48] in the Sentinel Application Platform (SNAP) software. For each Level-2A image, the visible (red, green, blue) and near-infrared bands were extracted and composed into an image made up of four bands. This image is henceforth named VNIR. Each VNIR image has a spatial resolution of 10 m. To identify the optimal MRS parameters needed for segmenting agricultural parcels for every part of Lower Saxony, Tetteh et al. [47] clipped the VNIR images with 10x10 km tile grids numbering 562 and additionally masked out all non-agricultural areas such as forests, built-up areas, water bodies, and roads. Out of these 562 images, we selected 21 tiles that spread across Lower Saxony as our study sites ( Figure 1). These 21 tiles have diverse agricultural landscapes. The approach we used to select the 21 tiles can be found in the methodology section. Additional pieces of information such as the image acquisition date, percentage coverage of agricultural lands, and other descriptive statistics of the reference agricultural parcels in the LPIS per tile can be found in Appendix A (Table A1). The variation in the sizes of agricultural parcels per tile can also be found in Appendix A ( Figure A1).

Methodology
The simplified workflow we used to obtain the results is outlined in Figure 2. The core components of our workflow consist of image segmentation, modification of the existing unsupervised segmentation evaluation metric, unsupervised optimization of segmentation, and empirical evaluation of the segmentation results with reference parcels in the LPIS. These components will be fully covered in the proceeding subsections. The simplified workflow we used in this study. Böck refers to the unsupervised segmentation evaluation metric proposed by Böck et al. [43], and absolute difference (AD) is the modified version we proposed in this study.

Selection of the 21 Tiles
The goal here is to reduce the number of tiles from 562 to a number that will lead to a reduction in the computational time needed for segmentation optimization. The 21 tiles were selected in a way that they were representative of the structural composition of the other tiles that were not used for further processing. The methodology we used to identify these 21 tiles is explained in this section.
For each reference parcel in the LPIS of the 562 tiles, we extracted the minimum bounding rectangle (MBR). The width and length of each MBR were calculated. Aspect was computed by dividing the width by the length. Then, we clustered the 562 tiles based on the average aspect per tile using the k-means method. The determination of the appropriate number of clusters was done using the silhouette analysis [49]. This analysis is used to measure the internal consistency of clusters and the separability of those clusters. To perform the analysis, we clustered the average aspect of the 562 tiles using an incremental approach in which the number of clusters was initiated with two and increased by one in subsequent steps up to 21. For each cluster number, a silhouette coefficient was computed. The silhouette coefficients range from -1 to 1, with high values being more desirable, as it indicates the consistency within clusters and good separability among them. In our case, at cluster number 16, the silhouette coefficient was the highest (0.543), so we kept that. Then, we manually selected a tile from each of the 16 clusters and additionally included five more tiles to ensure a better spatial distribution over Lower Saxony, Germany.

Image Segmentation
In this study, image segmentation was done based on the implementation of the Multiresolution Segmentation (MRS) algorithm in eCognition Developer 9.5.0 [24]. Starting with one-pixel objects as seed points, in numerous subsequent steps, where the difference in heterogeneity between an object and any of its neighbors is minimal, the two objects are merged into a bigger one [22]. The heterogeneity of an object is calculated using the color and shape of that object [22,47]. The pairwise merging process is terminated when a user-given threshold is met [22]. In eCognition, three parameters (scale, shape, and compactness) influence the segmentation results of the MRS algorithm. Scale defines the minimum size of an object and is used as the threshold criterion to terminate the merging process. Shape refers to the weight placed on an object's form against its color information during the clustering process [47]. Shape and color add up to 1. In eCognition, one can only pass the shape weight, which then inversely modifies the color weight. Color is a requirement; hence, shape ranges from 0 to 0.9 [47]. Compactness defines the weight of an objects' squareness against its smoothness during the clustering process [47]. The compactness and smoothness weights also add up to 1. In eCognition, one passes the compactness weight, which inversely changes the smoothness weight. Extensive details about the MRS algorithm can be found in these pieces of literature [21,22,24]. Generating optimal segments requires the optimization of the MRS parameters [47].

Segmentation Optimization
To optimize any segmentation algorithm, one needs to be able to assess the quality of the segmentation results churned out by the algorithm for different parameter combinations. In this study, we used unsupervised segmentation evaluation metrics that measure the quality of the segmentation results purely based on the spectral values of the underlying image.

Existing Unsupervised Segmentation Evaluation Metrics
To evaluate a segmentation result, Espindola et al. [33] used average area-weighted variance (WV) and Moran's I (MI) [35]. The WV measures intra-segment homogeneity [33]. Therefore, it shows the level of under-segmentation in a segmentation result. Lower WV values indicate lower under-segmentation [38]. It is derived by first calculating the variance of pixels within each segment per image band, weighting the variance by each segment's area, and then averaging over all segments to obtain one global value per band. Equation (1) shows the formulation of WV, where represents the area of a segment, is the variance of pixels within a segment, and n is the number of segments.
MI measures the inter-segment heterogeneity [33] within the segmentation result, thereby being indicative of the level of over-segmentation. Lower MI values indicate lower over-segmentation [38]. Similar to the WV, it is also computed per image band. Its formulation is shown by Equation (2), where n is the number of segments, and are the respective mean values of an image band for segments i and j, is the mean band value of the entire image, and is a weight matrix that measures the spatial contiguity [43] between a segment and its neighbors. The elements of the matrix are either zero or one. One indicates that segments i and j have a common boundary, and zero indicates they do not.
MI ranges from -1 (perfect dispersion of segments) to 1 (perfect clustering of segments). Lower MI values indicate that the mean spectral values of neighboring segments within a segmentation layer are more different from each other, thereby indicating lower over-segmentation. Higher MI values show that the mean spectral values of the neighboring segments are more similar, which means that there is more over-segmentation present in the segmentation layer.
To compute a single global score (GS) per image band for a segmentation result, the WV and MI values are individually normalized using Equation (3) [37], where is either the WV or MI. Then, the normalized WV (nWV) and normalized MI (nMI) are summed up to obtain the GS per image band [33]. Then, the final GS for the segmentation result is computed with Equation (4), where represents the number of bands in the image, which is four in our case.
The GS ranges from zero (best quality) to one (worst quality). Given a set of segmentation results generated with different segmentation parameters, the parameter combination that results in the lowest GS is deemed as optimal. Böck et al. [43] observed that the identification of the optimal GS based on the definition of Espindola et al. [33] is highly influenced by the range of the user-defined scale parameter. Different scale parameter ranges yield different optimal segmentation results for the same image. According to Böck et al. [43], this instability is due to the normalization process in Equation (3). To deal with this problem, Böck et al. [43] proposed fixed range normalization for WV and MI as respectively captured by Equation (5) and Equation (6) before computing the final GS, where is the variance of the entire image per band. To obtain Equation (6), Böck et al. [43] respectively replaced and in Equation (3) with -1 and 1, which are the theoretical extrema of MI.
The Böck metric also ranges from zero (best quality) to one (worst quality).

Metric Proposal Based on Absolute Difference (AD)
According to Georganos et al. [16], the fixed ranged normalization proposal put forward by Böck et al. [43] makes two problematic assumptions. The first one is that where there is complete under-segmentation, i.e., where one segment is created for the entire image, the WV becomes equal to the image variance; hence, nWV becomes 1. When this happens, the equivalent value of MI and by extension nMI becomes undefined, because a spatial network of more than one segment is required to compute MI. Secondly, in the case of complete over-segmentation, i.e., where each pixel in the image is a segment, MI is -1 and nMI becomes 0, but the corresponding value of WV may be very low and not necessarily zero. In RS, it is highly implausible to obtain complete over-segmentation; hence, an MI value of -1 is hardly realized [16]. Furthermore, Georganos et al. [16] did some tests and observed that the Böck metric has the potential of selecting under-segmented objects as optimal. We tested this hypothesis using some simulated segmentation data captured by Figure 3. Figure 3a shows the reference data, while Figure 3b-d captures three different corresponding segmentation results. For each dataset in Figure 3, each row represents a segment; hence, there are four segments for each dataset. Figure 3b captures a situation where there is a lot of clustering with minimal under-segmentation, Figure 3c is a situation where there is a balance between clustering and dispersion with moderate under-segmentation, and Figure 3d represents a situation where there is a lot of dispersion with a high level of under-segmentation. The MI, nMI, nWV, and GS of the Böck metric computed for the simulated segmentation results (Figure 3b-d) are captured by Table 1. As postulated by Georganos et al. [16], the Böck metric selected the segmentation result with the highest level of under-segmentation as optimal, given that it had the lowest GS value. -0.667 0.167 0.875 1.042 The issues raised by Georganos et al. [16] point to the problem posed by Equation (6), where the theoretical extrema of MI are used to normalize the MI. As visible in Table 1, after normalizing the MI, the numerical difference between the nWV and MI increased in Figure 3b, where the MI was positive. However, for Figure 3c and Figure 3d, the numerical differences diminished substantially. Therefore, in areas with more dispersion, the Böck metric has the potential of selecting under-segmented results as optimal, as it would be more biased toward nMI [16]. To overcome these issues, we used two steps. First, we did not normalize the MI given that by definition, it lies between -1 and 1. We maintained the nWV. Therefore, the minimum and maximum values of nWV will correspond to the minimum and maximum of MI. Second, to obtain the final GS, we computed the absolute difference between the MI and nWV per band and then averaged over all bands as shown by Equation (7), where the notations have the same meaning as Equation (4). This ensures that the MI and nWV have a fair chance of influencing the GS depending on their respective magnitudes. Similar to the Böck metric, low values mean good quality, and high values mean bad quality. The outcome of this modification, named the AD metric, for the simulated segmentation results ( Figure  3b-d) is shown in Table 2. The AD metric correctly selected the least under-segmented result as optimal, followed by the moderately under-segmented. We tested another distance metric, specifically Euclidean Distance (ED), to combine the MI and nWV values at the 21 tiles, but the AD metric proved superior, so we maintained that as our proposal. The difficulty with using ED to compute the GS lies in the fact that given any two numbers, here MI and nWV, it places more emphasis on the larger number than the smaller one, thereby accentuating the influence of the larger number on the overall outcome.  Figure 3. The bold-faced text within the body of the table is the optimal result. Figure 3b 0.400 0.375 0.025 Figure 3c -0.018 0.698 0.716 Figure 3d -0.667 0.875 1.542 3.

Unsupervised Segmentation Optimization
The point of optimization within the context of this study is to identify the MRS parameter combination that yields the lowest GS per metric. The segmentation output corresponding to this combination is the optimal result. We tested two optimization approaches in this study.
For the first approach, which we termed default optimization, we optimized the scale parameter while keeping the shape and compactness parameters constant at their default, as is mostly done in the literature [20,31,[37][38][39][43][44][45]. Shape was kept at 0.1, and compactness was kept at 0.5. The scale ranged from 10 to 300 with intervals of 10. The segmentation output corresponding to the scale parameter with the lowest GS is the optimal output.
The second optimization approach is Bayesian optimization, which was used to optimize all three MRS parameters. We adopted the Bayesian optimization approach of Tetteh et al. [47] but used it within an unsupervised optimization framework. Applying Bayesian optimization requires four main definitions: 1. The domain space (minimum and maximum values) of each input parameter. The domain space of scale was defined as 20 and 200, for shape 0.0 and 0.9, and for compactness 0.0 and 1.0. These parameter ranges were also used by Tetteh et al. [47] in their approach. 2. An objective function to optimize. For our study, the objective function to optimize is f(x), where x is a parameter combination of scale, shape, and compactness. The function takes the parameter combination, performs image segmentation, computes the GS of the segmentation output, and finally returns the GS. 3. A surrogate model for the objective function. To build the surrogate model, one has to first define a prior probability distribution that captures the prior behavior of the objective function. We chose Gaussian Processes (GP) [50] as the prior probability distribution. Then, some initial parameter combinations together with their corresponding GS are used to initialize the whole optimization process. We used 125 parameter combinations as initialization samples. These 125 parameter combinations were selected in a way to ensure uniform and representative distribution over each parameter space. For scale, the values were (40,80,120,160,200), and for both shape and compactness, the values were (0.1, 0.3, 0.5, 0.7, 0.9). The grid search method was used to calculate the corresponding GS for the 125 samples. These samples were used to update the GP to obtain posterior probability distribution over the objective function. 4. An acquisition function to be used in sampling new parameter combinations to be evaluated with the objective function. For the acquisition function, we used expected improvement (EI) [51]. EI is used to iteratively select new parameter combinations with the highest probability of optimizing the objection function. We sampled 50 new parameter combinations with the EI function in 50 iterations. At each iteration, out of 10,000 parameter combinations randomly sampled from the domain space, the combination with the highest likelihood of improving upon the current optimal parameter combination is identified by the EI function using the current posterior probability distribution. Then, this identified parameter combination is evaluated with the objective function, and the corresponding GS is used to update the current posterior probability distribution. In all, 175 combinations were used within the Bayesian optimization approach to identify the optimal one. A more detailed explanation of Bayesian optimization can be found here [52][53][54][55]. The Böck and AD metrics were separately used in the two optimization approaches to optimize the segmentation of agricultural parcels. The optimal segmentation identified by each metric was further evaluated through empirical discrepancy measures. Given the sheer number of segmentations that had to be done, we used eCognition Server 9.5.0 and the eCognition command-line interface (CLI) to automate the segmentation process [47]. For the initial 125 parameter combinations that were used to initialize the Bayesian optimization method, two parallel processes were executed, as our eCognition Server license was limited to two [47]. The Python programming language was used to glue everything together. The implementation of Bayesian optimization via Scikit-optimize in Python was used [47].

Empirical Discrepancy Measures
To identify which optimization approach and metric performed better per tile, we computed four empirical discrepancy measures (Table 3) by comparing the optimal segmentation results to the reference agricultural parcels in the LPIS. The quality rate (QR) [56] measures the level of geometric match between the segmentation result and the reference parcels. It is the only measure that takes into account both the amount of agreement and disagreement between the reference parcels and their corresponding segments [57]. Therefore, it can single-handedly be used to judge the quality of segmentation. When a reference parcel is larger than its corresponding segment, over-segmentation (OR) [57] occurs, and when the segment is larger, under-segmentation (UR) [57] occurs. The root mean square (RMS) [56] combines the OR and UR into a single measure. In the formulas in Table 3, is a reference parcel and is its corresponding segment, and n is the total number of segments. The discrepancy measures are first computed per segment in a segmentation result. To obtain a single discrepancy measure for an entire segmentation result, an area-weighted average was used (Table 3).

Optimal Segmentation Based on Default Optimization
For each tile, Figure 4 shows the QR for the optimal segmentations identified by the AD and Böck metrics using the default shape value of 0.1 and 0.5 for compactness. The other empirical evaluation measures (OR, UR, and RMS) are captured by Appendix A (Table A2). At T11, the two metrics obtained the same result. Except for T3 and T18, where the Böck metric was marginally better, the AD metric was remarkably better at the other tiles. The highest difference between the two metrics was recorded at T1, where the AD metric exceeded the Böck metric by 17%. The lowest differences were recorded at T2 and T19, where the AD metric was about 1% better. The optimal segmentation results identified by our metric were the least under-segmented except for T2 and T19, where our metric was rather the least over-segmented. The RMS values of our metric were lower at all tiles except T19. The Böck metric often selected higher scale values than the AD metric, even to the extent that at T1, it chose the highest scale value as the optimal. This led to massive under-segmentation, an example of which is shown in Figure 5a at T1. Four different LU types-namely, winter wheat, winter rapeseed, spring barley, and pastures-are present in this area. Due to the high scale value selected by the Böck metric, only one segment was created containing all the aforementioned LU types, leading to massive under-segmentation. The AD metric did a better job of separating the different LU types, hence reducing under-segmentation (Figure 5b). The segments generated based on the AD metric had a better geometric match to the LPIS reference parcels. To understand the different behaviors of the Böck and AD metrics, we explored the nWV, MI, nMI, and the corresponding GS computed for each scale value at T1, where the AD metric was substantially better, and then T3, where the Böck metric was marginally better. For both metrics, the nWV increased with increasing scale as the pixels in each segment became more varied, while the MI and nMI exhibited an opposite behavior ( Figure 6 and Figure 7). Figure 6a shows that as the scale increased, the Böck metric decreased in response until it reached its minimum at scale 300. As a reminder, lower GS values of a metric correspond to more accurate segmentation results. Our metric, on the other hand, as captured by Figure 6b, exhibited a decreasing trend up to scale 190 and then started to increase in response to increasing nWV and decreasing MI. The GS was at its lowest at scale 190. At T3 (Figure 7), where the Böck metric was marginally better, the GS of both metrics had one commonality. After some initial decreasing behavior, they both started to continuously increase around the median of the scale range, which is 155. The optimal scale selected by the Böck metric was 150, and that of the AD metric was 140.

Optimal Segmentation Based on Bayesian Optimization
We employed Bayesian optimization to respectively minimize the two unsupervised metrics (Böck and AD) at the 21 tiles to optimize the MRS parameters. To identify the optimal MRS parameters, Tetteh et al. [47] used their supervised Bayesian optimization approach to directly maximize the QR. We consider the results achieved by their approach as the benchmark results. For the analysis here, we compared the results achieved by the two unsupervised Bayesian optimization approaches to each other and in parallel compared both to the benchmark results. The QR measures of the optimal segmentations obtained by the supervised and the two unsupervised approaches for the 21 tiles used in this research are captured by Figure 8. The other empirical evaluation measures can be found in Appendix A (Table A3). The unsupervised Bayesian optimization approach based on the AD metric outperformed the Böck metric at all tiles. The approach based on the AD metric was over 22% better at T1 and T15, and it was about 1% better at T6 in comparison with the unsupervised Bayesian approach based on the Böck metric. The supervised approach was expectedly better than both unsupervised approaches at all tiles. At T7 and T17, the segmentation quality of the supervised approach was over 20% higher than the unsupervised AD approach. However, at T2 and T19, the supervised approach was just about 2% better. Regarding the Böck metric, the supervised approach was over 30% better at T1, T14, and T15, and it was about 5% better at T2. The segmentation results of the unsupervised approaches were generally more under-segmented but less over-segmented compared to the supervised approach. The RMS measure was in favor of the supervised approach at all tiles. The optimal segmentation results of the three Bayesian optimization approaches symbolized by the QR calculated per segment at T1, T2, and T17 are captured by Figure 9, Figure Figure 12 shows a specific case of segments within the optimal results of the three Bayesian optimization approaches at T1 for the same area shown in Figure 5.     To understand the reason behind the differences in QR between the supervised optimization approach and the unsupervised Bayesian optimization approaches, we analyzed the linear relationship ( Figure 13) between the differences in QR and the number of land-use types present at each tile. For each metric, the Pearson correlation coefficient (r) was high, and the p-value was less than 0.05. Therefore, the relationship between the number of crop types and the differences in QR between the supervised approach and each unsupervised approach is significant. Figure 13. Correlation between the number of land use (LU) types and the difference in QR between the supervised benchmark results and the unsupervised Bayesian optimization approaches based on (a) the Böck metric and (b) the AD metric.

Discussion
The analysis of which metric was optimal for unsupervised segmentation evaluation within our experimental setup of using 21 tiles revealed that our metric (AD) was better than the Böck metric, whether one uses it within a default or Bayesian optimization approach. Visually and quantitatively, the segmentation results yielded by the AD metric were better than the Böck metric in different landscapes composed of diverse agricultural LU types.
For the default optimization approach, at tiles such as T3, where the Böck and AD metrics yielded very similar segmentation results, this is attributable to the fact that there was more clustering of objects as the scale was increased. This is captured by Figure 7b, where all the MI values were positive. Clustering normally occurs in areas where there are different LU types but with similar spectral behaviors sharing the same neighborhood or in areas highly dominated by a single LU type such as grasslands, as was the case of T3. Under those conditions, the GS values of the Böck and AD metrics exhibited a common behavior (Figure 7) and consequently selected similar scale values, leading to very similar segmentation results.
At other tiles such as T1, where there was an enormous disparity between the two metrics, the agricultural landscape is more diverse and interspersed with different LU types such as winter wheat, sugar beet, and maize. Consequently, they had more negative MI values with increasing scale (Figure 6b), which is indicative of the dispersion of objects. The Böck and AD metrics on such occasions differed in curve behavior and global minimum position ( Figure 6). Based on the trajectory of the Böck metric in Figure 6a, one can safely conclude that the Böck metric would have further decreased if the scale value had further been increased. Our metric, on the other hand, as captured by Figure 6b, exhibited a decreasing trend up to scale 190 and then started to increase in response to increasing nWV and decreasing MI. The benefit of not normalizing the MI and using absolute difference to compute the GS became manifest on such occasions, where there was a greater dispersion of agricultural parcels. The AD metric was initially more influenced by the MI, but it was later more influenced by the nWV as the scale increased and more MI values became negative ( Figure 6b). With the AD metric, the MI and nWV values have a fair chance of impacting the GS value depending on their respective magnitudes. The Böck metric, on the other hand, was continuously impacted by the nMI (Figure 6a). This can be attributed to the normalization approach applied to the MI by the Böck metric. As captured by Figure 6b, before normalization, all the originally negative MI values were numerically smaller than their corresponding nWV values. After normalizing the MI to obtain the nMI (Figure 6a), those negative MI values became numerically higher than their corresponding nWV values, thereby continuously influencing the GS of the Böck metric (Figure 6a).
The Böck metric is more impacted by the nMI than the nWV in all agricultural landscapes. This behavior of the Böck metric has the potential of selecting large-scale values as optimal, thereby leading to the identification of under-segmented objects as optimal. This observation was also made by Georganos et al. [16]. This particular behavior of the Böck metric becomes more problematic in areas with diverse LU types and a greater dispersion of objects, as previously shown in Figure 5a. The more diverse the LU types and the more spectrally similar they behave, the higher the probability of selecting under-segmented objects as optimal using any segmentation evaluation metric, especially a metric that is purely based on the image content. Therefore, a good unsupervised segmentation evaluation metric must reduce over-segmentation but more importantly under-segmentation as the AD metric proved to be able to do, at least in comparison with the Böck metric. For subsequent processes such as object classification, under-segmentation is preferable to over-segmentation [26,58,59]. In general, under-segmentation can largely be dealt with by using very high-resolution images in which visible boundaries between adjacent but spectrally similar parcels can be identified [47].
For the unsupervised Bayesian optimization approach, the approach based on the AD metric outperformed that of the Böck metric at all the tiles, especially at T1, which is composed of diverse LU types. Interestingly, at T3, the Bayesian optimization approach based on the AD metric became better than the Böck metric. This is opposite to the default optimization results at T3, where Böck was marginally better than AD. Overall, in both optimization approaches, the AD metric consistently proved to be better suited for optimizing the segmentation of agricultural parcels in different landscapes. A look at the segmentation results for T1 (Figure 9) clearly shows that the Bayesian optimization approach based on the AD metric generated more segments (Figure 9d) with a higher segmentation quality than the results of the Bayesian approach based on the Böck metric (Figure 9b). There was a greater clustering of objects at T2 ( Figure 10) and T17 ( Figure 11) based on the computed MI values; hence, the Bayesian optimization approach based on the Böck (Figure 10b and Figure 11b) and the AD (Figure 10d and Figure 11d) metrics yielded very similar segmentation results.
As expected, the supervised Bayesian optimization approach performed better than all the unsupervised Bayesian optimization approaches at all the tiles used in our experiment. This is especially true for T1 ( Figure 9c) and T17 (Figure 11c), where the landscape has diversified LU types. At T2, which is highly dominated by pome fruits, the segmentation quality was bad for all the optimization methods. Tetteh et al. [47] in using the supervised Bayesian optimization approach to delineate agricultural parcels made this observation as well for T2 and attributed it to the small size and elongation of agricultural parcels present at that tile. This also holds for the unsupervised Bayesian optimization approaches tested in this research. The high correlation between the number of LU types and the difference in QR between the supervised Bayesian optimization approach and the two unsupervised Bayesian optimization approaches as captured by Figure 13 indicated that at tiles with a smaller number of LU types, the unsupervised Bayesian approaches obtained results similar to the supervised approach. The supervised Bayesian approach was able to adapt more to diverse agricultural landscapes than the unsupervised Bayesian approaches. An example of this can be seen in Figure 12c, where the supervised approach generated segments with well-defined boundaries and a better geometric match to the LPIS parcels than the two unsupervised Bayesian approaches in Figure 12a and Figure 12b, respectively. The adaptability of supervised segmentation optimization was also asserted by Yang et al. [39] after testing a supervised optimization approach based on the information gain ratio and an unsupervised optimization approach based on MI and WV as was proposed by Espindola et al. [33]. The major defect of any supervised optimization method is the reliance on reference data, which are tedious to obtain [29]. An unsupervised method such as the Bayesian optimization approach based on our proposed AD metric provides a good alternative to supervised segmentation optimization.
Unlike the proposition of Georganos et al. [16], our proposed metric is objective and fully automated. It does not require any human intervention to identify the optimal segmentation. The approach of Georganos et al. [16] requires the user to compute a certain number of initial segmentations with unknown step intervals, something the authors mentioned has a great impact on the results. Additionally, using locally estimated scatterplot smoothing (LOESS) requires a user to specify the order of the polynomial and a span, which controls the level of smoothing. Since the optimal values of those user inputs cannot be known beforehand, the user has to experiment to identify the optimal settings for normalization, which violates the principle behind unsupervised segmentation evaluation.

Conclusions
In this study, we modified an existing unsupervised segmentation evaluation metric based on global variance and spatial autocorrelation [43]. We proposed the use of absolute difference (AD) to combine the global variance and spatial autocorrelation. We tested the AD metric and the existing metric, named Böck, in identifying the optimal parameters for delineating agricultural parcels from Sentinel-2 images using the Multiresolution Segmentation (MRS) algorithm. We first tested both metrics at 21 tiles with different agricultural landscapes to optimize the scale parameter of the MRS algorithm through default optimization. In this default approach, we kept the shape and compactness parameters constant and increased the scale at equal intervals to determine the optimal one. The AD metric proved superior to the Böck metric in identifying the segmentation result with a better geometric match to reference agricultural parcels in the Land Parcel Identification System (LPIS). On average, the segmentation quality of the AD metric was over 6% higher than the Böck metric in this default approach. Our metric often identified segmentations that were least under-segmented as optimal, unlike the Böck metric. We separately used each metric in a Bayesian optimization routine to optimize the three main parameters of the MRS algorithm at the same 21 tiles. The Bayesian optimization approach based on the AD metric performed better than that of the Böck metric at all tiles. In the Bayesian optimization approach, the quality of the segmentation result of the AD metric was on average about 9% better than the Böck metric. A comparison of the segmentation results in this study to existing benchmark results obtained via supervised Bayesian optimization showed that the unsupervised Bayesian optimization approach based on the AD metric can be a good alternative. In areas where the number of land-use (LU) types was small, supervised and unsupervised Bayesian optimization obtained similar segmentation results. Supervised segmentation optimization methods require reference data, which are generally difficult and time-consuming to generate, especially for wide geographic areas such as regions and countries. The Bayesian optimization approach based on the AD metric solely depends on the image content to fine-tune the optimization process without any human intervention; hence, it can easily be used in any operational OBIA workflow to generate segmentations in near real time.
In a nutshell, our proposed metric performed better than its predecessor in identifying optimal segmentation. Identifying optimal segmentation is important for purposes of obtaining correct agricultural statistics such as the sizes of agricultural parcels. In the absence of reference data, a Bayesian optimization approach based on the AD metric can provide a means of fulfilling the aforementioned purpose in an automated and efficient manner with no human interaction. Even though we tested this optimization approach on the MRS algorithm within the thematic area of agriculture, it is easily applicable to any segmentation algorithm and different thematic areas.
Going into the future, one possible way of improving the results of the segmentation optimization process with our proposed metric will be to incorporate local variance and spatial autocorrelation in a multi-scale approach to refine under-segmented and over-segmented objects in subsequent steps as was done by Johnson et al. [37]. Different weighting schemes for different agricultural landscapes can be applied to the normalized weighted variance and spatial autocorrelation before the computation of the global score for the AD metric. The impact of this weighting scheme on the identification of the optimal segmentation result would be analyzed accordingly. The impact of the segmentation results identified by the supervised and unsupervised Bayesian optimization approaches on object classification would be assessed. The 21 tiles we used in our experimental setup had relatively flat terrains. However, our proposed metric should work fairly well in other terrains as long as there is enough spectral dissimilarity (dispersion) between adjacent parcels in any geographical area. This hypothesis will be tested in the future.

Acknowledgments:
We are grateful to the Ministry of Food, Agriculture and Consumer Protection of Lower Saxony for providing the LPIS reference parcels. We appreciate the constructive feedbacks of Antonia Ortmann and Ann-Kathrin Holtgrave on this paper.

Conflicts of Interest:
The authors declare no conflict of interest.