AssesSeg — A Command Line Tool to Quantify Image Segmentation Quality : A Test Carried Out in Southern Spain from Satellite Imagery

This letter presents the capabilities of a command line tool created to assess the quality of segmented digital images. The executable source code, called AssesSeg, was written in Python 2.7 using open source libraries. AssesSeg (University of Almeria, Almeria, Spain; Politecnico di Bari, Bari, Italy) implements a modified version of the supervised discrepancy measure named Euclidean Distance 2 (ED2) and was tested on different satellite images (Sentinel-2, Landsat 8, and WorldView-2). The segmentation was applied to plastic covered greenhouse detection in the south of Spain (Almería). AssesSeg outputs were utilized to find the best band combinations for the performed segmentations of the images and showed a clear positive correlation between segmentation accuracy and the quantity of available reference data. This demonstrates the importance of a high number of reference data in supervised segmentation accuracy assessment problems.


Introduction
Environmental modeling and spatial planning need the development of operational tools capable to simplify the extraction of information from available data.This is especially true when the considered data are remotely sensed and the extracted information has an economic relevance [1].
The results presented in this letter are aimed at the enhancement of Object-Based Image Analysis (OBIA) products.OBIA starts by aggregating similar pixels to obtain homogeneous objects by means of a crucial stage called image segmentation.Many reasons justify the use of an OBIA approach (e.g., geometric and neighbourhood information availability and the reduction of salt and pepper effect in classifications).A general review of pros and cons of OBIA can be found in Blaschke [1] and Blaschke et al. [2].
In remote sensing OBIA applications, the segmentation phase is commonly part of the initial stage of the workflow.Scientific literature proposes different methods for assessing the segmentation quality both from unsupervised and supervised approaches (e.g., Drȃguţ et al. [3], Liu et al. [4]).However, the proposed methods are not always coupled with tools able to automatically compute the described metrics.To overcome the limited availability of software assessing segmentation accuracy, this paper presents a free of charge command line tool (AssesSeg) that implements a modified version of the supervised discrepancy measure named Euclidean Distance 2 (ED2) proposed by Liu et al. [4].AssesSeg was used to detect the best band combinations for the detection of plastic-covered greenhouses from three multispectral satellite data: Sentinel-2 (S2), Landsat 8 (L8), and WorldView-2 (WV2) images.The authors' choice to test AssesSeg on plastic-covered greenhouses segmentation was due to their sound experience on this active research topic using both pixel-based approaches [5][6][7][8] and OBIA approaches [9][10][11][12][13][14].The performed test also demonstrates the importance of the number of reference objects in segmentation quality assessment.

Background
The modified ED2 works with a set of reference objects (ROs), so being considered a supervised segmentation quality metric.In this study, the ROs were manually digitized polygons.ED2 index (Equation ( 1)) computations, in their original formulation, start by defining the corresponding segment dataset characterized by

•
segments that spatially overlap the ROs; • overlapping criteria to be respected [15]: the intersection area between a RO and a candidate corresponding segment is more than half (50%) the area of either the RO or the corresponding segment of the polygon.
When the corresponding segments dataset is defined, the ED2 index evaluates the segmentation quality by computing the potential segmentation error (PSE) and the number-of-segments ratio (NSR).The potential segmentation error (PSE) measures the geometric discrepancy computed through the ratio between the total area of under-segments and ROs.A high PSE indicates macroscopic differences between the ROs and the corresponding segments datasets.In particular, the under-segmentation error occurs when the contour of a RO, r k , splits the corresponding segment, s i , into at least two parts (Equation ( 2)).The number-of-segments ratio (NSR) measures the arithmetic discrepancy between the ROs and the corresponding segments, being defined as the absolute difference between the number of ROs (m) and the number of corresponding segments (v) divided by the number of ROs (Equation ( 3)).
A high ED2 value can indicate a significant geometric discrepancy, a significant arithmetic discrepancy, or both.On the contrary, a small ED2 value can indicate a good segmentation quality.
The ED2 index was modified since the overlapping criteria acts as a filter both on the candidate corresponding segment and the ROs.In fact, when a RO does not have a corresponding segment candidate, the true number of employed ROs is lower than the original one.To take into account this biasing effect, both the PSE and NSR values were increased considering the actual number of ROs employed.In this way, n being the number of excluded ROs, the new PSE (Equation ( 4)) and NSR (Equation ( 5)) values were computed as follows: where max(|s i − r k |) is the maximum under-segmented area found for a single RO; v max represents the maximum number of corresponding segment found for one single RO; ∑|r k | here is referred to the total area of the m − n ROs.

AssesSeg Tool Description
"AssesSeg.exe"(see supplementary material) is a standalone command line tool (.exe extension) that implements the rules described in Section 2. AssesSeg deals only with the ESRI polygon shapefile (it does not depend on the segmentation software), and its source code was written in Python 2.7 given the large availability of open source optimization, data analysis, control, and numerical analysis libraries (e.g., NumPy and SciPy) [16].It requires two different typologies of input in a working folder (see supplementary material): a RO input shapefile and a set of segmentation shapefiles (at least composed by one single file).The output of the executable is a spreadsheet file (see Appendix A). Figure 1 shows the iterated process for each segmentation file within AssesSeg.
AssesSeg can also deal with different prefixed overlapping percentages between ROs and input segments.However, in this letter, the overlapping criteria, described in Section 2, was always applied.Lastly, AssesSeg is a very powerful tool if coupled with automatic or semi-automatic algorithms capable of producing many segmentation files following a certain criterion.In this work, this was accomplished by means of a semi-automatic eCognition rule set characterized by a looping process among prefixed multi-resolution segmentation (MRS) algorithm parameters.Appendix A shows an example output table (see supplementary material) and software details (including download links in Table A1).

AssesSeg Tool Description
"AssesSeg.exe"(see supplementary material) is a standalone command line tool (.exe extension) that implements the rules described in Section 2. AssesSeg deals only with the ESRI polygon shapefile (it does not depend on the segmentation software), and its source code was written in Python 2.7 given the large availability of open source optimization, data analysis, control, and numerical analysis libraries (e.g., NumPy and SciPy) [16].It requires two different typologies of input in a working folder (see supplementary material): a RO input shapefile and a set of segmentation shapefiles (at least composed by one single file).The output of the executable is a spreadsheet file (see Appendix A). Figure 1 shows the iterated process for each segmentation file within AssesSeg.
AssesSeg can also deal with different prefixed overlapping percentages between ROs and input segments.However, in this letter, the overlapping criteria, described in Section 2, was always applied.Lastly, AssesSeg is a very powerful tool if coupled with automatic or semi-automatic algorithms capable of producing many segmentation files following a certain criterion.In this work, this was accomplished by means of a semi-automatic eCognition rule set characterized by a looping process among prefixed multi-resolution segmentation (MRS) algorithm parameters.Appendix A shows an example output table (see supplementary material) and software details (including download links in Table A1).

Study Area and Satellite Data
The test area (Figure 2) is located in the "Sea of Plastic", province of Almería (Southern Spain) characterized by the greatest concentration of greenhouses in the world.
Three cloud-free images taken with different sensors were used to test AssesSeg within the context of plastic-covered greenhouses segmentations: (i) the recently launched S2 Multi Spectral Instrument (MSI); (ii) the L8 Operational Land Imager (OLI); and (iii) the WV2 satellite (bundle combination of Panchromatic (PAN) and MultiSpectral (MS) bands).

Study Area and Satellite Data
The test area (Figure 2) is located in the "Sea of Plastic", province of Almería (Southern Spain) characterized by the greatest concentration of greenhouses in the world.
Three cloud-free images taken with different sensors were used to test AssesSeg within the context of plastic-covered greenhouses segmentations: (i) the recently launched S2 Multi Spectral Instrument (MSI); (ii) the L8 Operational Land Imager (OLI); and (iii) the WV2 satellite (bundle combination of Panchromatic (PAN) and MultiSpectral (MS) bands).Satellite data were atmospherically and geometrically corrected before segmentation.Particularly, L8 and S2 data were co-registered to the WV2 PAN image given the much higher resolution and geometric accuracy of WV2 imagery.The WV2 data (5 July 2015) and the L8 OLI scene (8 January 2016) images were geometrically and atmospherically corrected using the capabilities of Geomatica v. 2014 (PCI Geomatics, Richmond Hill, ON, Canada).Further details can be found in [12] where a similar pre-processing scheme was undertaken.In particular, the OLI panchromatic band was not used in this test.
Level 1C (L1C) S2 MSI image (12 January 2016, orbit R051) product is a top of atmosphere reflectance dataset with cartographic projection (UTM/WGS84), 12-bit dynamic range, and tiles/granules consisting of 100 km 2 ortho-images.The MSI sensor collects up to 13 bands with three different geometric resolutions: 60 m, 20 m, and 10 m ground sample distance (GSD).The sen2core algorithm [17] was used to produce a Level 2A (L2A) S2 bottom of an atmosphere reflectance image.Lastly, the study area was extracted from the selected S2 granule (Figure 2) and co-registered to the WV2 PAN image by using Geomatica v. 2014.In this test, 60 m GSD bands were excluded from the segmentations.

Segmentation Results
The MRS algorithm provided by eCognition v. 8.8 was used to produce different segmentation datasets (see [18] for a complete mathematical description of the algorithm).MRS requires user driven parameters (scale, shape, compactness, and band combinations) to obtain satisfactory results.For this purpose, 400 plastic-covered greenhouse ROs ("400.shp" in supplementary material, the same ROs for all three satellite images), manually delimited on the corrected WV2 PAN scene, were extracted to find the best modified ED2 index.To deal with the scale parameters of the same order of magnitude among the three satellite data, the S2 and L8 initial geometric resolutions (10 m and 20 m GSD for S2, and 30 m GSD for L8) were increased to 1.875 m and 2 m, respectively, by simply splitting pixels in equal parts without any resampling.Then, AssesSeg was used for every modified ED2 index computation.
The test for S2 and L8 started by only varying the scale parameter values (from 10 to 120 with a step of 1) and fixing the shape and compactness to 0.5.In the case of the S2 data, only the 10 m GSD bands (Blue, Green, Red, NIR) were used, whereas for the L8 data the SWIR bands were also tested in consideration of their high sensitivity to plastic coverings [19].Table 1 depicts the best ED2 results achieved and the corresponding scale parameter values for the tested L8 and S2 band combinations.It is noteworthy that the best band combinations common for S2 and L8 datasets are Blue-Green-NIR and Blue-Green-Red-NIR.Although the S2 Red-NIR combination was characterized by an ED2 value smaller than the Blue-Green-NIR one, the same does not occur for the L8 Red-NIR band combination that features almost the highest L8 measured ED2.The chosen band combinations were subsequently applied to extend and refine the search of an optimal segmentation for S2 and L8 images by varying the scale parameter values, within an interval of the local optimum, and the Satellite data were atmospherically and geometrically corrected before segmentation.Particularly, L8 and S2 data were co-registered to the WV2 PAN image given the much higher resolution and geometric accuracy of WV2 imagery.The WV2 data (5 July 2015) and the L8 OLI scene (8 January 2016) images were geometrically and atmospherically corrected using the capabilities of Geomatica v. 2014 (PCI Geomatics, Richmond Hill, ON, Canada).Further details can be found in [12] where a similar pre-processing scheme was undertaken.In particular, the OLI panchromatic band was not used in this test.
Level 1C (L1C) S2 MSI image (12 January 2016, orbit R051) product is a top of atmosphere reflectance dataset with cartographic projection (UTM/WGS84), 12-bit dynamic range, and tiles/granules consisting of 100 km 2 ortho-images.The MSI sensor collects up to 13 bands with three different geometric resolutions: 60 m, 20 m, and 10 m ground sample distance (GSD).The sen2core algorithm [17] was used to produce a Level 2A (L2A) S2 bottom of an atmosphere reflectance image.Lastly, the study area was extracted from the selected S2 granule (Figure 2) and co-registered to the WV2 PAN image by using Geomatica v. 2014.In this test, 60 m GSD bands were excluded from the segmentations.

Segmentation Results
The MRS algorithm provided by eCognition v. 8.8 was used to produce different segmentation datasets (see [18] for a complete mathematical description of the algorithm).MRS requires user driven parameters (scale, shape, compactness, and band combinations) to obtain satisfactory results.For this purpose, 400 plastic-covered greenhouse ROs ("400.shp" in supplementary material, the same ROs for all three satellite images), manually delimited on the corrected WV2 PAN scene, were extracted to find the best modified ED2 index.To deal with the scale parameters of the same order of magnitude among the three satellite data, the S2 and L8 initial geometric resolutions (10 m and 20 m GSD for S2, and 30 m GSD for L8) were increased to 1.875 m and 2 m, respectively, by simply splitting pixels in equal parts without any resampling.Then, AssesSeg was used for every modified ED2 index computation.
The test for S2 and L8 started by only varying the scale parameter values (from 10 to 120 with a step of 1) and fixing the shape and compactness to 0.5.In the case of the S2 data, only the 10 m GSD bands (Blue, Green, Red, NIR) were used, whereas for the L8 data the SWIR bands were also tested in consideration of their high sensitivity to plastic coverings [19].Table 1 depicts the best ED2 results achieved and the corresponding scale parameter values for the tested L8 and S2 band combinations.It is noteworthy that the best band combinations common for S2 and L8 datasets are Blue-Green-NIR and Blue-Green-Red-NIR.Although the S2 Red-NIR combination was characterized by an ED2 value smaller than the Blue-Green-NIR one, the same does not occur for the L8 Red-NIR band combination that features almost the highest L8 measured ED2.The chosen band combinations were subsequently applied to extend and refine the search of an optimal segmentation for S2 and L8 images by varying the scale parameter values, within an interval of the local optimum, and the shape values from 0.1 to 0.9 (with a step of 0.1).The compactness parameter was always set to 0.5 since the literature underlined its minor contribution when compared to shape and, above all, scale parameters [3].Table 2 summarizes the best achieved results for the two band combinations and the two tested satellite images.It is worth noting that the S2 image always performed better segmentations than the L8 one considering the ED2 metric.
Table 1.Modified ED2 and its associated scale parameter (SP) (shape and compactness were fixed to 0.5) for the tested Sentinel-2 (S2) and Landsat 8 (L8) equal-weighted band combinations.In grey, the best common results achieved.The amplitude of the tested scale parameter values was reduced in the case of the WV2 image (further details can be found in a recent study performed by Aguilar et al. [9] over the same area).In fact, the chosen test scale parameter values ranged from 25 to 45, whereas the tested shape parameters varied from 0.1 to 0.9.The scale, shape, and compactness parameters followed the same rules defined for the S2 and L8 images.Table 3 reports the best results achieved, for each considered band combination, under the performed WV2 tests.In particular, Table 3 shows that the best WV2 segmentation performances were achieved with the Blue-Green-NIR1 combination.Tables 2 and 3 show that all the best ED2 values (high quality segmentations) were attained, for the three atmospherically corrected satellite data, with a very similar band combination.

Band
Table 3.The modified ED2 and its associated scale (SP) and shape parameters (compactness was fixed to 0.5) for the tested WorldView2 (WV2) equal-weighted band combinations.In grey, the best results achieved.
The segmentation features a high visual quality for WV2 data and an acceptable visual quality for S2 data, whereas L8 segmentation performed the worst.This visual comparison allows readers to fully appreciate the use of the modified ED2 index, implemented in AssesSeg, to extract the best potential segmentation from both VHR and medium resolution satellite images.

Discussion
The experimental design of this study shows one possible implementation of the freely available tool AssesSeg.Being a command line ".exe" tool, it can be used both with programming languages (e.g., Matlab, IDL, and Python) that implement the possibility to call the operating system to execute a specified command, and as a standalone tool able to compare one RO input file with many produced segmentation files.Particularly, the above results were achieved by using many segmentation files produced by an eCognition rule set and compared with a large (400 manually digitized polygons) RO input file.
Tables 2 and 3 proved the stability of the resulting best common band combinations for plastic-covered greenhouses segmentations regardless of the tested satellite image.In particular, S2 and L8 featured the same best band combination, whereas the WV2 dataset showed very good results with either the Blue-Green-NIR1 and Blue-Green-NIR2 band combinations.The same did not occur when the Blue and the Green bands are used at the same time with the two WV2 NIR bands.The results also show that the Blue-Green-Red-NIR band combination is characterized by very good ED2 indices for the three sensors, whereas, for the WV2 dataset, the contemporary use of the two NIR bands leads to higher ED2 results if compared with an analogue single NIR band combination.Moreover, Tables 1-3 show that the use of all bands has never been the best option in the selected area.

Discussion
The experimental design of this study shows one possible implementation of the freely available tool AssesSeg.Being a command line ".exe" tool, it can be used both with programming languages (e.g., Matlab, IDL, and Python) that implement the possibility to call the operating system to execute a specified command, and as a standalone tool able to compare one RO input file with many produced segmentation files.Particularly, the above results were achieved by using many segmentation files produced by an eCognition rule set and compared with a large (400 manually digitized polygons) RO input file.
Tables 2 and 3 proved the stability of the resulting best common band combinations for plastic-covered greenhouses segmentations regardless of the tested satellite image.In particular, S2 and L8 featured the same best band combination, whereas the WV2 dataset showed very good results with either the Blue-Green-NIR1 and Blue-Green-NIR2 band combinations.The same did not occur when the Blue and the Green bands are used at the same time with the two WV2 NIR bands.The results also show that the Blue-Green-Red-NIR band combination is characterized by very good ED2 indices for the three sensors, whereas, for the WV2 dataset, the contemporary use of the two NIR bands leads to higher ED2 results if compared with an analogue single NIR band combination.Moreover, Tables 1-3 show that the use of all bands has never been the best option in the selected area.
Figure 3 shows, as already reported in Aguilar et al. [11], that the ED2 metric can be characterized by a very good relationship with the visual quality of plastic-covered greenhouses segmentations.This is especially true with very high resolution data.Lastly, the readers should bear in mind that the above results were achieved using always the same RO input data and with atmospherically corrected band data.Different results can be attained using atmospherically uncorrected data.
Another important finding achieved with AssesSeg is the ability to implement a large RO input file (with hundreds of polygons) avoiding extremely time-consuming manual computations.Indeed, in previous scientific works, only a reduced number of ROs were used to assess the segmentation accuracy (e.g., in [4,9,11,20], this number was less than 100).The improvement achieved through the use of a large RO input can be demonstrated in the chosen test area.For this reason, 50 independent sets of ROs were randomly extracted (with replacement) from the initial population of 400 ROs in groups of variable sample sizes ranging from 25 to 200 ROs with a step of five (see Figure 4).Each single independent set was extracted without replacement.This experiment shows that uncertainty in the computation of the modified ED2 index decreases when the number of ROs employed increases.
Figures 4 and 5 respectively show, for the selected best band combinations of the three satellite data, the scatter plots and the corresponding 95% confidence intervals for the modified ED2 index according to the RO sample size.In this sense, both figures demonstrate that the ED2 variability is excessively high when working with a low number of ROs.In fact, based on the results obtained in this specific work, a good segmentation would require more than 100 ROs.It is relevant that only around 30 ROs per class were used in previous segmentation quality studies [4,20] and that, in this case study, 30 ROs were not enough to guarantee stable results.
Figure 3 shows, as already reported in Aguilar et al. [11], that the ED2 metric can be characterized by a very good relationship with the visual quality of plastic-covered greenhouses segmentations.This is especially true with very high resolution data.Lastly, the readers should bear in mind that the above results were achieved using always the same RO input data and with atmospherically corrected band data.Different results can be attained using atmospherically uncorrected data.
Another important finding achieved with AssesSeg is the ability to implement a large RO input file (with hundreds of polygons) avoiding extremely time-consuming manual computations.Indeed, in previous scientific works, only a reduced number of ROs were used to assess the segmentation accuracy (e.g., in [4,9,11,20], this number was less than 100).The improvement achieved through the use of a large RO input can be demonstrated in the chosen test area.For this reason, 50 independent sets of ROs were randomly extracted (with replacement) from the initial population of 400 ROs in groups of variable sample sizes ranging from 25 to 200 ROs with a step of five (see Figure 4).Each single independent set was extracted without replacement.This experiment shows that uncertainty in the computation of the modified ED2 index decreases when the number of ROs employed increases.
Figures 4 and 5 respectively show, for the selected best band combinations of the three satellite data, the scatter plots and the corresponding 95% confidence intervals for the modified ED2 index according to the RO sample size.In this sense, both figures demonstrate that the ED2 variability is excessively high when working with a low number of ROs.In fact, based on the results obtained in this specific work, a good segmentation would require more than 100 ROs.It is relevant that only around 30 ROs per class were used in previous segmentation quality studies [4,20] and that, in this case study, 30 ROs were not enough to guarantee stable results.

Conclusions
By introducing AssesSeg, a command line user-friendly tool to assess digital image segmentation quality becomes freely available both for scientific and technical purposes.
The use of the very popular ESRI shapefiles data format makes AssesSeg widely compatible with the outputs of the most common OBIA software.Thanks to AssesSeg, the optimal bands combination and MRS parameters, headed up to carry out plastic-covered greenhouses segmentation, were easily found when working on S2, L8, and WV2 satellite images.Moreover, the test results also indicate the importance of increasing the number of ROs to diminish the uncertainty in assessing the segmentation quality through the ED2 modified metric.This demonstrates the importance of a high number of reference data in supervised segmentation accuracy assessment problems.
Future development will be aimed both at the implementation of a further optimization of the AssesSeg source code, including the development of a graphical user interface, and at the improvement of the performance of segmentation quality metrics indices.

Conclusions
By introducing AssesSeg, a command line user-friendly tool to assess digital image segmentation quality becomes freely available both for scientific and technical purposes.
The use of the very popular ESRI shapefiles data format makes AssesSeg widely compatible with the outputs of the most common OBIA software.Thanks to AssesSeg, the optimal bands combination and MRS parameters, headed up to carry out plastic-covered greenhouses segmentation, were easily found when working on S2, L8, and WV2 satellite images.Moreover, the test results also indicate the importance of increasing the number of ROs to diminish the uncertainty in assessing the segmentation quality through the ED2 modified metric.This demonstrates the importance of a high number of reference data in supervised segmentation accuracy assessment problems.Future development will be aimed both at the implementation of a further optimization of the AssesSeg source code, including the development of a graphical user interface, and at the improvement of the performance of segmentation quality metrics indices.

Figure 1 .
Figure 1.Flowchart of the core algorithm implemented in AssesSeg.

Figure 1 .
Figure 1.Flowchart of the core algorithm implemented in AssesSeg.

Figure 2 .
Figure 2. Location of the study area depicted by means of the Red band of the Sentinel-2 image (T30SWF granule).Coordinate system: ETRS89 UTM Zone 30N.

Figure 2 .
Figure 2. Location of the study area depicted by means of the Red band of the Sentinel-2 image (T30SWF granule).Coordinate system: ETRS89 UTM Zone 30N.
Remote Sens. 2017, 9, 40 6 of 11 fully appreciate the use of the modified ED2 index, implemented in AssesSeg, to extract the best potential segmentation from both VHR and medium resolution satellite images.

Table 2 .
Modified ED2 and its associated scale and shape parameters (compactness was fixed to 0.5) for the tested Sentinel-2 (S2) and Landsat 8 (L8) equal-weighted band combinations.In grey, the best results achieved.

Table A1 .
AssesSeg output table example.From left to right are enumerated the following columns: name, scale parameter (SC, for MRS outputs), shape parameter (SP, for MRS outputs), compactness parameter (C, for MRS outputs), number (n) of ground truth (gt) geometries, number of segmented (seg) geometries, area gt (total area of gt geometries), under seg area (total under segmented area), number-of-segments ratio (NSR, modified or original), potential segmentation error (PSE, modified or original), and Euclidean Distance 2 (ED2, modified or original).