On the Objectivity of the Objective Function — Problems with Unsupervised Segmentation Evaluation Based on Global Score and a Possible Remedy

Image segmentation is a crucial stage at the very beginning of many geographic object-based image analysis (GEOBIA) workflows. While segmentation quality is generally deemed of great importance, selecting adequate tuning parameters for a segmentation algorithm can be tedious and subjective. Procedures to automatically choose the parameters of a segmentation algorithm are meant to make the process objective and reproducible. One of these approaches, and perhaps the most frequently used unsupervised parameter optimization method in the context of GEOBIA, is the objective function, also known as the Global Score. Unfortunately, the method exhibits a hitherto widely neglected, yet severe source of instability, which renders quality rankings inconsistent. We demonstrate the issue in detail and propose a modification of the Global Score to mitigate the problem. This hopefully serves as a starting point to spark further development of the popular approach.


Introduction
Image segmentation is one of the first stages in geographic object-based image analysis (GEOBIA). It is performed with the objective of partitioning an image into meaningful groups of pixels, i.e., the geo-objects depicted in an image. The quality of the segments is deemed crucial, as it affects the performance of subsequent processing, especially the possibility to assign meaningful class labels to objects [1].
Image segmentation is regarded as a hard problem in computer vision, due to its ill-posed nature [2,3]. By changing a segmentation algorithm's tuning parameters or by altering the pre-processing of the input imagery, it is possible to produce a vast number of different segmentations for an image. Manually checking a large number of candidate solutions is possible and potentially leads to satisfying results [4], but is inherently time-consuming and subjective in nature.
To ease the choice of a specific segmentation for further analysis, segmentation evaluation measures have been developed that can be used to optimize a segmentation algorithm's input parameter values in an automated way [5][6][7][8][9][10][11][12]. One of the most popular unsupervised segmentation evaluation methods in remote sensing is called the Global Score (GS) [13] or simply the objective function.
The GS method was proposed by Espindola et al. [5]. GS combines measures of intra-segment homogeneity and inter-segment heterogeneity to judge segmentation quality. The former is expressed by the segments' average variance weighted by their areas, while the latter is expressed as the segments' spatial autocorrelation in terms of Moran's I [14,15]. In the original formulation of GS, the individual measures are calculated for a set of segmentations and a single image band. The intra-segment homogeneity and inter-segment heterogeneity measures are afterwards normalized separately to a common range (e.g., 0 to 1). The sum of the two normalized measures finally yields the objective function's value.
Unfortunately, the method has an inherent instability introduced by the normalization procedure, which, to the best of our knowledge, has not yet been treated in detail. While the purpose of parameter optimization is to make the choice of segmentation reproducible and less subjective, the calculation of GS in its current form exhibits an undesirable sensitivity to the user-defined range of parameters. This range of initial segmentation parameters is usually chosen ad hoc and is rarely sufficiently reasoned.
The aim of this work is to demonstrate the issue in detail. This helps to better understand the underlying causes of the problem and to raise general awareness. Based on this analysis, we provide a possible modification of the GS that mitigates the undesirable instability of the objective function.

Materials and Methods
To illustrate the effect of the different approaches, a set of candidate segmentations is obtained using the well-known Multiresolution Segmentation (MRS) algorithm [31]. MRS is a bottom-up region-merging algorithm. Although we use MRS to illustrate our research, it has to be noted that the choice of segmentation algorithm itself does not affect the findings of our study.
The segmentations used here are produced leaving two of the three main parameters of MRS, namely Shape and Compactness, at constant levels of 0.1 and 0.5, respectively. The third parameter, called Scale, is varied from 20 to 300 in increments of 10, yielding a total of twenty-nine segmentations to illustrate the findings. The minimum and maximum Scale values ensure both over- and under-segmentation.
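The parameter sweep described above can be enumerated as follows; the segmentation itself is run externally, so this sketch merely generates the tested Scale values and confirms the count of twenty-nine candidates:

```python
# Fixed MRS parameters (held constant across all runs)
SHAPE = 0.1
COMPACTNESS = 0.5

# Scale varied from 20 to 300 in increments of 10
scales = list(range(20, 301, 10))
print(len(scales))        # 29 candidate segmentations
print(scales[0], scales[-1])  # 20 300
```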
Tests were performed on a Landsat dataset of the study area, the Assis microregion (São Paulo State, Brazil), available from a previous study [32]. The data cover roughly 715,000 ha of an agriculturally dominated landscape. Segmentation was initially performed on the full layer stack of six bands. For the sake of simplicity, and without loss of generality, we restrict the analysis here to a single band of the dataset (i.e., the near-infrared band of Landsat 8 OLI). Similar findings are obtained for each layer in the stack (not shown).
Global Score (GS) is a combination of area-weighted variance (v), measuring intra-segment homogeneity, and a measure of spatial autocorrelation, i.e., Moran's I (I), globally quantifying the similarity of neighboring segments. For a single band of an image, v is calculated as:

v = (Σ_{i=1}^{n} a_i · v_i) / (Σ_{i=1}^{n} a_i)    (1)

where n is the total number of segments and v_i and a_i are the variance and area of segment i, respectively. The calculation of Moran's I for a single band of an image is given by (see Figure 1 for illustration):

I = n · Σ_i Σ_j w_ij (y_i − ȳ)(y_j − ȳ) / [ (Σ_i (y_i − ȳ)²) · (Σ_{i≠j} w_ij) ]    (2)

with y_i and y_j being the mean digital numbers of regions R_i and R_j, respectively, and ȳ the mean of variable y. Furthermore, w_ij is a measure of the spatial contiguity of the two regions R_i and R_j. Following Espindola et al. [5], w_ij is set to 1 for regions that share a common boundary and 0 for non-adjacent regions. The individual measures I and v are normalized to a common range from 0 to 1 in order to balance their relative importance. Normalization of v and I is performed either by the formula used in Espindola et al. [5]:

F(x) = (X_max − x) / (X_max − X_min)    (3)

or the one used in Johnson and Xie [13]:

F(x) = (x − X_min) / (X_max − X_min)    (4)

Both are functionally equivalent; only the direction of optimization differs, i.e., minimization or maximization. We will use the latter, because it makes the shapes of I and v more intuitive. Analogous findings would be obtained using Espindola's normalization. The value of the objective function is finally the sum of the two normalized measures:

GS = F(v) + F(I)    (5)

For multiband images, it has furthermore been proposed to average the GS calculated for each band individually [13,16].
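The quantities above can be sketched in code. The following is a minimal, hypothetical implementation, assuming segment variances, areas, mean digital numbers, and the binary adjacency matrix have already been extracted from a segmentation:

```python
import numpy as np

def weighted_variance(variances, areas):
    """Area-weighted variance v: intra-segment homogeneity."""
    variances = np.asarray(variances, dtype=float)
    areas = np.asarray(areas, dtype=float)
    return np.sum(variances * areas) / np.sum(areas)

def morans_i(means, adjacency):
    """Global Moran's I over segment mean values.

    means     : mean digital number y_i of each segment
    adjacency : binary contiguity matrix, w_ij = 1 for adjacent
                regions, 0 otherwise (diagonal must be 0)
    """
    y = np.asarray(means, dtype=float)
    w = np.asarray(adjacency, dtype=float)
    n = len(y)
    dev = y - y.mean()
    num = n * np.sum(w * np.outer(dev, dev))  # cross-products of deviations
    den = np.sum(dev ** 2) * np.sum(w)
    return num / den

def normalize(x, x_min, x_max):
    """Johnson and Xie style min-max normalization."""
    return (x - x_min) / (x_max - x_min)
```

For a four-segment toy example with means [1, 1, 5, 5] and a chain adjacency (segment 0 touching 1, 1 touching 2, 2 touching 3), `morans_i` returns 1/3, reflecting moderate positive autocorrelation.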


Illustration of the Sensitivity of GS to the User-Defined Range of Tested Segmentations
Area-weighted variance typically increases as segments grow larger. By contrast, the correlation between adjacent regions initially declines with growing size (yielding weaker spatial autocorrelation in terms of Moran's I), but is expected to increase when segments become large enough to contain a mixture of classes [33]. As v and I increase/decrease with increasing Scale parameter, the minimum and maximum values of the measures (X_min and X_max) are likely to be attained by the finest and the coarsest segmentations in the test set, respectively. This is important, as these minima and maxima are afterwards used to normalize the two components of Equation (5). The optimum of GS therefore depends on the user-defined range of parameters tested.
The net effect of variable X_min and X_max can easily be demonstrated by calculating GS for the full set of candidate segmentations (Figure 2b) and for two subsets separately (Figure 2a,c).
In the example provided in Figure 2, altering the range of segmentations used for analysis not only alters the absolute value of GS but also shifts the optimum. This would also occur if all three parameters were varied and/or if another segmentation algorithm had been chosen (not shown). Furthermore, a comparison of GS in Figure 2a,c for Scale values between 110 and 210 reveals inconsistencies in the relative ranking of the segmentations' 'quality'. For example, segmentation at Scale 130 is more favorable than segmentation at Scale 190 according to Figure 2a, while the opposite is true in Figure 2c. The deplorable effects of using variable (range-dependent) X_min and X_max are thus threefold:
• the absolute values of GS change,
• the optimum (minimum) value of GS is shifted,
• the relative ranking of acceptable candidate solutions is altered.
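This threefold effect can be reproduced with a toy example. The (v, I) pairs below are invented for illustration and are not taken from our dataset; only the min-max normalization logic mirrors the original GS:

```python
import numpy as np

# Hypothetical (v, I) values for four candidate segmentations,
# ordered by increasing Scale; purely illustrative numbers.
scales = np.array([20, 110, 210, 300])
v = np.array([0.0, 10.0, 20.0, 100.0])   # area-weighted variance
I = np.array([0.9, 0.2, 0.1, 0.8])       # Moran's I

def gs(v, I):
    """Original GS with range-dependent min-max normalization (lower = better)."""
    nv = (v - v.min()) / (v.max() - v.min())
    nI = (I - I.min()) / (I.max() - I.min())
    return nv + nI

low  = gs(v[:3], I[:3])    # only the finer candidates tested
high = gs(v[1:], I[1:])    # only the coarser candidates tested

# The two middle candidates swap ranks depending on the tested range:
print(low[1] < low[2])     # True  -> the Scale-110 candidate preferred
print(high[0] < high[1])   # False -> the Scale-210 candidate preferred
```

Because the extremes of the tested range set X_min and X_max, adding or removing an extreme candidate rescales both terms and can invert the ranking of the remaining candidates, exactly as observed in Figure 2.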
Remote Sens. 2017, 9, 769


Illustration of an Alternative Normalization Scheme
Instead of using the minima and maxima (X_min, X_max) derived from the user-defined segmentations, we propose to normalize I and v to a fixed range prior to their combination for GS. For v, the outermost limits of any segmentation of an image can be used. These are, on the one hand, the situation where each pixel resembles a segment of its own (complete over-segmentation) and, on the other, the state where the entire image is regarded as a single segment (complete under-segmentation). In the case of complete over-segmentation, v will arguably be 0, while for complete under-segmentation v equals the variance of the image, turning the normalization into:

F(v) = v / v̄    (6)

where v̄ is the variance of the image. The above strategy is less straightforward for setting fixed limits for I. Moran's I typically ranges from −1 to 1 (Figure 1), but is not strictly bound to that range [15,34]. For the case of extreme over-segmentation, I can be calculated and is expected to be positive and more or less close to 1, depending on the image used (0.96 in our case). By aggregating individual pixels during segmentation, I is expected to decrease [33]. However, in some cases, for example in severely textured images, I can be low for single-pixel segments and increase as texture is smoothed out and adjacent segments become more similar.
For the opposite extreme of complete under-segmentation, I is not defined, because only a single region remains. As an approximation, the case of two remaining segments can be considered. For any two remaining regions, I will take a value of −1 by definition. Consequently, as a conservative and easily applicable solution, we suggest using −1 to 1 as fixed limits for the normalization of I. Substituting X_min and X_max in Equation (4) yields:

GS = v / v̄ + (I + 1) / 2    (7)

Using the proposed fixed limits for normalization makes GS independent of the range of tested segmentations, therefore stabilizing the GS values and the location of its optimum (Figure 3).
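A minimal sketch of the proposed fixed normalization follows; the image variance and the (v, I) values are hypothetical. Unlike the original GS, each candidate's score depends only on its own v and I, not on which other segmentations happen to be in the test set:

```python
import numpy as np

def gs_fixed(v, I, image_variance):
    """Modified GS with fixed normalization limits.

    v is scaled by the variance of the whole image (limits 0 to v̄),
    Moran's I by its conservative bounds -1 to 1. Lower = better.
    """
    return v / image_variance + (I + 1.0) / 2.0

# Illustrative candidate values (same toy numbers as before)
v = np.array([0.0, 10.0, 20.0, 100.0])
I = np.array([0.9, 0.2, 0.1, 0.8])
img_var = 120.0   # hypothetical variance of the image band

full   = gs_fixed(v, I, img_var)
subset = gs_fixed(v[:3], I[:3], img_var)
print(np.allclose(full[:3], subset))   # True: scores are subset-independent
```

Because no per-set minima or maxima enter the formula, restricting or extending the tested Scale range leaves both the absolute scores and the ranking of the shared candidates unchanged.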


Discussion
Given the highlighted sensitivity of the Global Score to the range of tested segmentations, we believe that previous findings indicating the effectiveness of the method should be critically reviewed. For example, Gao et al. [16] found that the optimum GS more or less coincides with the maximum overall accuracy obtained for land cover classification of nine distinct segmentations in their study. Although they provide sufficient data to calculate GS on a slightly reduced set of segmentations, the small number of segmentations used in their study does not permit an in-depth analysis of the effect of normalization with respect to varying the range of tested segmentations. While their optimum seems quite pronounced and suggests some robustness, we believe that the relationship between classification accuracy and segmentation accuracy in terms of GS should be further confirmed and not be taken for granted.
In another study, Johnson and Xie [13] calculate GS for a number of automated segmentations and compare it with the GS attained for their manually delineated ground truth. They hypothesize: "In theory, the reference digitization should score very well (low GS) since expert knowledge of the study area was required to create it. If the reference digitization does not receive a good score, the evaluation method may not be effective for judging segmentation quality" [13] (p. 476). Indeed, the reference digitization scores well in their setup, i.e., for the segmentation parameter range they have used. The manual reference attains an absolute value comparable to the optimal GS calculated for the automated segmentation at Scale 70 (Figure 4a). However, it can be shown that if they had, for example, used 150 instead of 250 as the maximum Scale parameter level in their study, the score attained by their manual digitization would not have supported the effectiveness of GS (Figure 4b).
Similarly, a recent study by Varo-Martínez et al. [23] compared segmentations produced by two different algorithms using GS. The authors normalized I and v for the segmentations of each algorithm separately before comparing the absolute GS values of both methods. Using GS in such a way is highly problematic, as can easily be seen, for example, from Figure 2 or Figure 4, where identical segmentations attain vastly different absolute GS values depending on the sample used for normalization. Again, altering the range of tested parameters for one (or both) of the segmentation algorithms might have led to a different judgement of the relative performance of the two methods.


Conclusions
The development of image segmentation evaluation measures is mainly driven by the desire to make image analysis workflows reproducible and less subjective. A measure should ideally rank different segmentations consistently with respect to pre-defined quality indicators.
We demonstrated that one of the most widely used unsupervised segmentation evaluation approaches in remote sensing, the Global Score (GS), is highly susceptible to the user-defined range of tested segmentations. Indeed, the 'optimum' suggested by GS heavily depends on the (arbitrary) choice of segmentations tested (e.g., the minimum and maximum parameter values). Altering the range of tested segmentations may not only change the parameter combination deemed optimal, but also significantly change the quality ranking of the other tested segmentations. Similar problems occur when comparing different segmentation algorithms, even more so if they employ different types of parameters. Depending on the range of parameters used for each of the two algorithms, completely different findings can be obtained, making such comparisons ineffective.
The reason for the instability of the traditional GS method is related to the normalization used to balance the two individual terms of the objective function. The problem arises from differing rates of change of inter-segment and intra-segment heterogeneity across scales and has been observed in preliminary tests on multi-spectral datasets of varying land-cover type and spatial resolution (see Supplementary Materials). This confirms previous concerns about the method's vulnerability [35]. Although our proposed modification is able to alleviate the problems introduced by the original normalization procedure, we claim neither that it is the only nor the best solution. Image segmentation can be regarded as a problem of psychophysical perception, and what is considered a good solution also depends on the application, the imagery at hand, and the expectations and a priori knowledge of the analyst [36]. While the general instability of GS due to the normalization procedure can easily be demonstrated using a single image, confirming the effectiveness of the proposed approach requires rigorous testing in different scenarios, including various types of images and applications, which clearly exceeds the scope of this letter.
Our findings do not automatically render previous results obsolete, as, for example, in Johnson and Xie [13], the optimal segmentation identified using GS is only the starting point for further refinement of the segmentation. We believe the rationale behind the original normalization is still reasonable in such scenarios, in particular if the initial range of tested segmentations is carefully chosen. Assigning equal importance to the individual measures based on the range of values found in the set of tested segmentations intuitively makes sense, as long as the sample covers the solution space adequately. In addition, selecting an ensemble of segmentations at multiple scales instead of a single-scale segmentation (e.g., using a plateau objective function) can further help to stabilize the result [20,30]. Nevertheless, practitioners should be aware of the limitations of the GS method and critically assess its appropriateness with respect to the application at hand.

Figure 1. Example of Moran's I values for different configurations of black and white cells on a regular lattice. (a) High spatial autocorrelation indicated by a Moran's I value of 0.97, as black and white cells are (mostly) surrounded by equal cells. (b) A random pattern yielding a Moran's I close to zero. (c) A perfectly dispersed pattern, in which black and white cells do not share a single boundary, yields a Moran's I of −1.
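The value of exactly −1 for a perfectly dispersed pattern can be checked numerically. The following sketch computes Moran's I on a pixel lattice, assuming rook (4-neighbour) contiguity between cells:

```python
import numpy as np

def lattice_morans_i(grid):
    """Moran's I for a 2-D array with rook (4-neighbour) contiguity."""
    g = np.asarray(grid, dtype=float)
    dev = g - g.mean()
    # Sum of w_ij * dev_i * dev_j over all neighbour pairs (each pair twice,
    # matching a symmetric binary weight matrix)
    cross = 2 * (np.sum(dev[:, :-1] * dev[:, 1:]) + np.sum(dev[:-1, :] * dev[1:, :]))
    n_links = 2 * (dev[:, :-1].size + dev[:-1, :].size)  # total w_ij entries
    n = g.size
    return n * cross / (np.sum(dev ** 2) * n_links)

# A checkerboard is a perfectly dispersed pattern: every neighbour differs.
checker = np.indices((6, 6)).sum(axis=0) % 2
print(lattice_morans_i(checker))   # -1.0
```

Since every neighbouring pair carries a deviation product of opposite sign and equal magnitude, the numerator and denominator cancel to exactly −1, matching panel (c).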



Figure 2. Results for the Global Score (GS), Moran's Index (I) and Weighted Variance (v) using the normalized measures calculated for the set of test segmentations, where (a) is restricted to a subset of Scale between 20 and 210, (b) is the full set of Scale ranging from 20 to 300 and (c) is the subset of Scale ranging from 110 to 300. While the optimum of the full set shown in (b) is contained in both (a,c), each set reports a different segmentation as optimal.


Figure 3. Results for the Global Score (GS), Moran's Index (I) and Weighted Variance (v) calculated for the same set of test segmentations as in Figure 2, this time using fixed values for normalization. (a) is restricted to a subset of Scale between 20 and 210, (b) is the full set of Scale ranging from 20 to 300 and (c) is the subset of Scale ranging from 110 to 300. Regardless of the subset used, segmentation at Scale 160 is reported as optimal.

Figure 4. Illustration of the problematic usage of GS published by Johnson and Xie [13]. (a) Comparison of the GS results for their set of tested segmentations and the score obtained for their manual digitization. (b) Results calculated for a (hypothetically) reduced set of segmentations. Changing the set of segmentations causes the optimum to shift slightly and, more remarkably, yields a worse relative score for their reference digitization.


Supplementary Materials:
The following are available online at www.mdpi.com/2072-4292/9/8/769/s1. Figure S1: False color composite (bands 5, 6, 4) of the Landsat-8 scene; Figure S2: Results for the original Global Score, Moran's Index and Weighted Variance based on the NIR band of the Landsat-8 data set: (a) GS calculated for all 57 segmentations; (b) GS calculated on the subset of Scale lower than 210; (c) GS calculated on the subset of Scale larger than 110; Figure S3: Results for the proposed Global Score, Moran's Index and Weighted Variance based on the NIR band of the Landsat-8 data set: (a) GS calculated for all 57 segmentations; (b) GS calculated on the subset of Scale lower than 210; (c) GS calculated on the subset of Scale larger than 110; Figure S4: False color composite (bands 8, 4, 3) of the Sentinel-2 scene; Figure S5: Results for the original Global Score, Moran's Index and Weighted Variance based on the NIR band of the Sentinel-2 data set: (a) GS calculated for all 100 segmentations; (b) GS calculated on the subset of Scale lower than 500; (c) GS calculated on the subset of Scale larger than 100; Figure S6: Results for the proposed Global Score, Moran's Index and Weighted Variance based on the NIR band of the Sentinel-2 data set: (a) GS calculated for all 100 segmentations; (b) GS calculated on the subset of Scale lower than 500; (c) GS calculated on the subset of Scale larger than 100; Figure S7: False color composite (bands 7, 5, 3) of the WorldView-2 scene; Figure S8: Results for the original Global Score, Moran's Index and Weighted Variance based on the NIR band of the WorldView-2 data set: (a) GS calculated for all 100 segmentations; (b) GS calculated on the subset of Scale lower than 700; (c) GS calculated on the subset of Scale larger than 300; Figure S9: Results for the proposed Global Score, Moran's Index and Weighted Variance based on the NIR band of the WorldView-2 data set: (a) GS calculated for all 100 segmentations; (b) GS calculated on the subset of Scale lower than 700; (c) GS calculated on the subset of Scale larger than 300.