Assessment of the Segmentation of RGB Remote Sensing Images: A Subjective Approach

Abstract: The evaluation of remote sensing imagery segmentation results plays an important role in further image analysis and decision-making. The search for the optimal segmentation method for a particular data set and the suitability of segmentation results for use in satellite image classification are examples where proper image segmentation quality assessment can affect the quality of the final result. There is no extensive research related to the assessment of the segmentation effectiveness of the images. The designed objective quality assessment metrics that can be used to assess the quality of the obtained segmentation results usually take into account the subjective features of the human visual system (HVS). A novel approach is used in the article to estimate the effectiveness of satellite image segmentation by determining the correlation between subjective and objective segmentation quality metrics. Pearson's and Spearman's correlation coefficients were used for satellite images after applying a k-means++ clustering algorithm based on colour information. Simultaneously, the dataset of the satellite images with ground truth (GT) based on the "DeepGlobe Land Cover Classification Challenge" dataset was constructed for testing three classes of quality metrics for satellite image segmentation.


Introduction
Satellite imagery classification is one of the main tasks in remote sensing applications such as city planning, climate change research [1], earth observation, improvement of geographical maps [2], topographic surveys, and military purposes. Colour segmentation may also be used to track land cover changes over time [3]. The most prominent goal of remote sensing data analysis is object detection, based on image segmentation [2]. The principal goal of the image segmentation process is to partition the image into a set of segments that are homogeneous according to specific criteria such as colour spectrum, shape, intensity, or texture [3], and to map the individual regions to the corresponding real-world objects, such as rivers, fields, roads, and others. Image segmentation is a broad term dependent on the goals of the specific application. It is utilized for a wide variety of applications and is also a part of the object detection process.
Remote sensing images (RSI) usually represent various parts of the Earth's surface and are characterized by clear details along with diverse spatial and textural information. When compared to most natural images, satellite images stand out as being highly structured and uniform [1].

Related Works
Over many years, the development of image quality assessment (IQA) has drawn extensive and constant attention. Relatively less research was conducted from the perspective of evaluating image segmentation quality, especially the way humans understand and estimate the quality of segmented images.
Authors of Reference [7] systematized human visual properties, important for designing an objective segmentation metric: (1) human visual tolerance (e.g., for border distortions), (2) human visual saturation (difficulty for the human to evaluate similarity when distortions become large), (3) different perception of false negative (FN) and false positive (FP) pixels, and (4) the strongest distortions determine overall segmentation quality.
For natural images, the authors of Reference [8] suggested a contour-based score, combining the Jaccard index (JI) and the BF (Boundary F1) score for increased correlation with human rankings, as JI does not take into account the quality of segmentation boundaries. It was observed that, while accurate contours are definitely less important than correct classification, the proposed measure can further improve correlation with human rankings for high-quality segmentation. The authors used the PASCAL VOC 2007 and 2011 datasets. However, very imbalanced data in the PASCAL Visual Object Classes (PASCAL VOC) dataset can create biased results for some metrics like Accuracy (ACC). Ground truths contain instances of individual objects labelled separately in addition to the auxiliary label for object contours. Similar to Reference [8], the authors of Reference [9] proposed a subjective quality metric for single object segmentation combining region and boundary-based terms that relate to human visual tolerance and saturation properties. A subjective test showed improved results, achieving a Spearman Rank Order Correlation Coefficient (SROCC) value of 0.88 when compared to the JI, F1 score (F1), Fuzzy Contour (FC), BF score, and Mixed Measure (MM) metrics. For testing the new metric, a dataset was created by selecting images from other object segmentation databases such as the Microsoft Research Asia salient object database, PASCAL VOC 2012, and the Microsoft Research Cambridge GrabCut database.
The effectiveness of Peak signal-to-noise ratio (PSNR) as a quality measure for image segmentation algorithms was explored in Reference [10]. In this experiment, GT data was created by modifying images from the Berkeley BSR300 database (intended for evaluating edge detection algorithms) to resemble threshold segmentation. PSNR was used to measure the similarity between the GT image and segmentation result obtained by applying salt and pepper noise on the created GT image. It was noted that quality of the edge detection can be more effectively evaluated by PSNR. The tests performed by the authors, however, have not considered multi-region segmentation, which is a more practical approach for real-world objects.
Evaluation of segmentation quality for satellite images is regarded as a difficult task since there is no universal standard for evaluating segmentation results of satellite images. The most common evaluation methods are based on external or full-reference quality measures, and the need for a sufficient amount of reference data poses a problem. The authors of Reference [11] suggested a Synthetic Image Testing Framework (SITEF) for the evaluation of multispectral satellite image segmentation by using synthetic images. This method provides different evaluation perspectives such as parcel size, shape, and land cover type. The framework was tested using images obtained by the SPOT HRG satellite consisting of six land cover classes. The Hammoude metric, Rand coefficient, Corrected Rand coefficient, and JI were used for segmentation evaluation.
The authors of Reference [12] proposed the attention dilation-LinkNet (AD-LinkNet) neural network, which displays a significant improvement in the segmentation accuracy of satellite images. Three satellite segmentation datasets were used: DeepGlobe's road extraction and land cover datasets [1] as well as Inner Mongolia's land classification dataset. For different data sets, different model optimizations are needed. JI was used for evaluating both road and land segmentation results.
There are apparent differences between satellite images [1] and natural images, as seen in the PASCAL VOC 2012 dataset. In satellite imagery, every object has a semantic meaning, while natural images are more chaotic than satellite images and often include large areas of background, which is less important when compared to foreground objects.
Colour image segmentation can provide more information for detecting objects than using grayscale images. Satellite images consist of homogeneous colour regions that define the image data, so pixels can be grouped on distinct colour features. Many papers have been published in the past, focusing on using colour features for segmenting satellite images [3,13]. A study on the automatic detection of road segments from RSI for road detection applications [14] suggested an approach that can also be applied to detect other objects in RSI. The authors of Reference [15] proposed a method for land cover classification based on colour moments (mean, standard deviation, and skewness) used in extracting colour features. The segmentation method in Reference [16] used a wavelet transform for extracting colour and texture features from satellite images in the YCbCr colour space.

Image Segmentation
A wide variety of clustering solutions have been proposed for solving various problems depending on application-specific goals and/or characteristics of a particular dataset. A cluster is a group of pixels that are similar to each other and dissimilar to pixels in other clusters, according to specific image features that can be used for further image analysis. If the segmentation uses colour as a feature, then pixels are assigned to the clusters based on their colour similarity. Additionally, segmentation can be performed by using a combination of different features [16]. Using segmentation by colour, we can also avoid feature calculation for every pixel, thus improving overall performance.
For this research, the k-means [17] clustering algorithm was selected. Researchers tend to employ clustering methods with well-known advantages and disadvantages, avoiding undiscovered limitations that could impact results [18]. K-means popularity and its simple implementation make it easier for replicating the same experiment independently. However, other clustering algorithms can be used, which might provide a different perspective and/or better results in certain situations (e.g., finding more complex cluster shapes).
The k-means algorithm (Table 1) groups the N data points or pixels $\{x_1, x_2, \dots, x_i, \dots, x_N\}$ into K clusters by minimizing its objective function SSE (sum of squared error) within a cluster for all clusters. Minimizing SSE also maximizes SSB (the sum of squares between clusters). For this reason, SSE, SSB, and metrics combining these criteria return higher scores for clusters constructed by k-means.

Table 1. The k-means algorithm.
Step | Description
1 | An optimal number of clusters K is selected by the Silhouette method [19].
2 | Instead of random initialization, cluster centroids are initialized using the k-means++ procedure (Table 2).
3 | The Euclidean distance $D = d(x_i, c_i)$ is calculated between the cluster centroids and each pixel of an image.
4 | Based on the calculated D, all pixels are assigned to the nearest centroid.
5 | Recalculate all cluster centroid positions $c_i$ by computing the mean of currently assigned pixels. The cycle is repeated (from the third step) until the position of the cluster centroids no longer changes (i.e., no pixels are reassigned).

Table 2. The k-means++ cluster centroid initialization method [20].
Step | Description
1 | The first cluster centroid $c_1$ is selected uniformly at random from the existing set of pixels X.
2 | Calculate distances D(x) from each pixel to the nearest centroid (in the first iteration, this is $c_1$).
3 | Each subsequent cluster centroid $c_i$ is selected from the remaining set of pixels $x \in X$ with probability $\frac{D(x)^2}{\sum_{x \in X} D(x)^2}$.
4 | Go back to step 2 and repeat K-1 times until K cluster centroids have been added.
5 | Proceed with step 3 from Table 1.
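The steps of Tables 1 and 2 can be sketched in a few lines of pure Python. This is only an illustrative sketch, not the MATLAB implementation used in the experiment: the function names (`kmeanspp_init`, `kmeans`), the fixed `seed`, and the handling of empty clusters are assumptions made for the example, and the number of clusters is assumed not to exceed the number of distinct pixels.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_init(pixels, k, rng):
    """Pick k initial centroids following the k-means++ rule (Table 2)."""
    centroids = [rng.choice(pixels)]  # step 1: uniform random choice
    while len(centroids) < k:
        # step 2: squared distance from each pixel to its nearest centroid
        d2 = [min(sq_dist(p, c) for c in centroids) for p in pixels]
        # step 3: sample with probability D(x)^2 / sum of D(x)^2
        centroids.append(rng.choices(pixels, weights=d2, k=1)[0])
    return centroids


def kmeans(pixels, k, seed=0, max_iter=100):
    """Table 1, steps 3-5, started from k-means++ seeds (hypothetical helper)."""
    rng = random.Random(seed)  # constant seed for repeatable results
    centroids = kmeanspp_init(pixels, k, rng)
    labels = None
    for _ in range(max_iter):
        # steps 3-4: assign every pixel to its nearest centroid
        new_labels = [min(range(k), key=lambda i: sq_dist(p, centroids[i]))
                      for p in pixels]
        if new_labels == labels:  # no pixel was reassigned
            break
        labels = new_labels
        # step 5: move each centroid to the mean of its assigned pixels
        for i in range(k):
            members = [p for p, lab in zip(pixels, labels) if lab == i]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids
```

Calling `kmeans(pixels, 2)` on a list of colour tuples returns one cluster index per pixel together with the final centroids.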
Before the segmentation stage, no preprocessing is necessary, although, according to Reference [8], the quality metric scores might have less meaning when the quality of segmentation is too low, suggesting that middle-quality segmentation is the best choice for the subjective quality evaluation. As such, we selected the optimal number of clusters by the Silhouette method [19] using a k-means++ algorithm to find their initial centroids ( Table 2).
Results produced by k-means are dependent on the initialization procedure of cluster centroids. It is important to configure parameters for more deterministic behavior, so that the calculations of the metrics are not severely impacted. The prepared images from our satellite dataset (described in Section 4) have lower resolution (324 × 220 px) and fewer clusters (2 to 4). We observed that segmentation results returned by the MATLAB implementation of k-means seem to be stable even with the default value of 100 iterations and only three replications. Alternatively, it is possible to (1) lock cluster centroids by selecting them manually or (2) use a Pseudo-Random Number Generator (PRNG) with a constant seed [18]. In our case, k-means++ initialization (Table 2) was used, which can also improve segmentation results.
Satellite images are segmented in the CIE 1976 L*a*b* colour space, as suggested in Reference [14], which also more accurately represents the human perception of colour [21] and has better performance than the RGB space in many colour image applications. The Euclidean distance is used to measure colour similarities between pixels in the a*b* plane [3].
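The colour distance described above can be illustrated with a small conversion routine. The sketch below assumes 8-bit sRGB input and the D65 reference white, neither of which is stated in the text; the function names are hypothetical and the constants are the standard sRGB to CIE XYZ ones.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIE 1976 L*a*b* (D65 white assumed)."""
    def linearize(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)
    # linear sRGB -> CIE XYZ
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t):
        delta = 6.0 / 29.0
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3 * delta ** 2) + 4.0 / 29.0

    # normalise by the D65 white point (Xn, Yn, Zn)
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)


def ab_distance(p, q):
    """Euclidean distance between two sRGB pixels in the a*b* plane (L* ignored)."""
    _, a1, b1 = srgb_to_lab(*p)
    _, a2, b2 = srgb_to_lab(*q)
    return ((a1 - a2) ** 2 + (b1 - b2) ** 2) ** 0.5
```

Because L* is ignored, two grey pixels of different brightness have an a*b* distance close to zero, which is one reason clustering in this plane is less sensitive to illumination differences.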

Satellite Images Dataset
There are many datasets designed for image segmentation problems. The selection of an appropriate dataset is a crucial decision that impacts subsequent choices, such as the selection of the optimal clustering method. It is important to understand the limitations and possible problems of the particular dataset before using it for any research project.
The majority of quality metrics for the assessment of segmented image quality require a GT image. For the evaluation of image quality after such distortions as image compression, the original image is treated as the GT, and the GT preparation process is unnecessary. Obtaining GT is always a barrier for automated image segmentation. Human skills are often required for manual labelling, and it is a time-consuming process. The manual labelling itself is subjective, and different GT versions of the same image may be produced. Widely used datasets like the Berkeley segmentation datasets (BSDS) [22] often contain natural images very diverse in their content, but they do not necessarily serve as a target for the specific application or provide GT that matches specific segmentation goals. Figure 1 depicts images from our constructed dataset based on the "DeepGlobe Land Cover Classification Challenge" dataset [1], which is intended to solve the mentioned problem by providing satellite imagery with GT data for improving state-of-the-art satellite image processing methods.

Figure 1. Segmentation results for satellite images from our dataset part 1 [23] (different clusters for each image are marked with selected colours): (a) satellite image "71619_sat," (b) corrected GT image "71619_gt" with four clusters, (c) k-means++ segmentation results "71619_seg" of "71619_sat," (d) satellite image "161109_sat," (e) corrected GT image "161109_gt" with three clusters, (f) k-means++ segmentation results "161109_seg" of "161109_sat," (g) satellite image "676758_sat," (h) corrected GT image "676758_gt" with two clusters, (i) k-means++ segmentation results "676758_seg" of "676758_sat."
Because of the cost of annotating multi-class segmentation, the GT images provided in the DigitalGlobe Land Cover dataset have segments that are less accurate and miss prominent land cover portions (Figure 3). Therefore, semi-automatic adjustments to the GT images were performed inside MATLAB using the Image Labeler, which allows us to label image data manually or use automation, and to export the result to the MATLAB workspace as a ground truth object variable containing label definitions.
The constructed dataset, including the mean opinion score (MOS) for each image used in our survey, can be accessed at Reference [23].

Objective Quality Assessment Methods for Image Segmentation
Depending on whether segmentation results are evaluated by a human or an algorithm, quality assessment is divided into two main branches: objective and subjective [25,26]. The main intention of objective quality models is to approximate properties of HVS in order to avoid slow and impractical subjective testing procedures. For this reason, the design process of new objective quality metrics often includes correlation tests with an obtained MOS [5,6].
The objective metrics may require initial information to evaluate the image segmentation quality. Such information is known as a reference image or GT. It could be prepared manually so that a comparison could be made with segmentation results achieved by a particular algorithm. This group of metrics is called external metrics (or supervised evaluation measures). The external information may not always be available, and metrics that do not depend on external information are used. This group of metrics is called internal metrics (or unsupervised evaluation measures).

External Metrics
The evaluation of segmented image quality using external metrics is equivalent to comparing two segmentation versions (GT image and segmentation result), where each pixel has a unique class label (or index) assigned to it. The GT image is often created (labelled) by an expert in a particular field (e.g., medical image segmentation) or sometimes can be generated [11] from an input image in the form of synthetic information. In our case, the segmentation result is obtained from the clustering algorithm, which returns an array containing cluster indices of each pixel corresponding to which cluster that pixel was assigned. Then the quality of the segmentation result can be evaluated by an external metric taking GT image and segmentation result as an input to determine the level to which two segmentations match. The evaluation process of the external metric is based on the analysis of cluster indices assigned to all pixel pairs between the GT image and segmentation result.
External quality metrics are calculated from the confusion matrix, which summarizes the results of the segmentation problem: the number of correct and incorrect assignments is summed for each class. The confusion matrix is defined as a square matrix (K × K), where K is the number of classes (or clusters), and for each class it yields four parameters (Table 3). These parameters are then used to derive combined external metrics that are presented in Table 4.
It is worth noting that external metrics can also serve as a means of comparing segmentation results of two different algorithms or segmentation results of a single algorithm, but with different parameters [24].

Table 3. Notation for the external comparison metrics.
Notation | Name | Description
TP | true positive pixels | pixels that belong to cluster $C_i$ in the GT image and are correctly assigned to cluster $C_i$ in the segmented image (i.e., common pixels between GT image and segmented image)
TN | true negative pixels | pixels that are assigned to different clusters both in the segmented and GT images
FP | false positive pixels | pixels that are incorrectly assigned to cluster $C_i$ in the segmented image compared to the GT image
FN | false negative pixels | pixels that are assigned to cluster $C_i$ in the GT image, but assigned to a different cluster in the segmented image
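The four quantities of Table 3 can be counted directly from two label arrays. The following minimal sketch assumes that the GT and segmented images are given as flat sequences of integer class labels and that cluster indices have already been matched to GT classes; the function name is illustrative.

```python
def confusion_counts(gt, seg, cls):
    """Count TP, TN, FP, FN for class `cls`; all other classes are negatives."""
    tp = tn = fp = fn = 0
    for g, s in zip(gt, seg):
        if g == cls and s == cls:
            tp += 1  # pixel belongs to cls in both images
        elif g != cls and s != cls:
            tn += 1  # pixel belongs to other classes in both images
        elif g != cls and s == cls:
            fp += 1  # wrongly assigned to cls in the segmentation
        else:
            fn += 1  # belongs to cls in the GT but was missed
    return tp, tn, fp, fn
```

For example, with `gt = [0, 0, 1, 1, 2, 2]` and `seg = [0, 1, 1, 1, 2, 0]`, class 1 has two true positives, one false positive, and no false negatives.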

MCC | Matthews correlation coefficient
JI, IoU | Jaccard index, Intersection over Union | $\frac{TP}{TP+FP+FN}$

The $F_\beta$ measure is defined as a weighted harmonic mean of P and R:

$$F_\beta = (1+\beta^2)\,\frac{P \cdot R}{\beta^2 P + R} \quad (12)$$

In Equation (12), the parameter β is any real positive number ($0 \le \beta \le +\infty$) and determines the weighting between P and R. If β > 1, higher weight is applied to R. If β < 1, higher weight is applied to P. Depending on the value of β, expression (12) can lead to several possible scenarios/metrics:
• if β = 2, $F_\beta$ is equal to $F_2$, where R has double weight compared to P,
• if β = 1, $F_\beta$ is equal to the unweighted harmonic mean of P and R, and is equal to $F_1$. In this case, P and R are equally important.

KI is defined as the arithmetic mean of P and R, while the FMI is defined as the geometric mean of P and R. Since the geometric mean always lies between the arithmetic mean and the harmonic mean for any positive numbers (in this case $0 \le P, R \le 1$), it is also true that $KI \ge FMI \ge F_1$.
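The relationships between these metrics can be checked numerically. The sketch below uses the standard textbook definitions of precision (P), recall (R), $F_\beta$, JI, KI, FMI, and MCC; Table 4 of the paper is not fully reproduced here, so these formulas are an assumption about its contents, and the function name is hypothetical. Degenerate cases with zero denominators are not handled.

```python
import math


def external_metrics(tp, tn, fp, fn, beta=1.0):
    """Binary external metrics from confusion counts (standard definitions)."""
    p = tp / (tp + fp)  # precision
    r = tp / (tp + fn)  # recall
    f_beta = (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "P": p,
        "R": r,
        "F_beta": f_beta,
        "JI": tp / (tp + fp + fn),
        "KI": (p + r) / 2,          # arithmetic mean of P and R
        "FMI": math.sqrt(p * r),    # geometric mean of P and R
        "MCC": (tp * tn - fp * fn) / denom,
    }
```

With `tp=6, tn=10, fp=2, fn=4` this gives P = 0.75 and R = 0.6, and the ordering KI ≥ FMI ≥ F1 can be verified directly from the returned values.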
All external metrics listed in Table 4 have a range of [0; 1], except MCC, ranging [−1; 1], where 1 represents a perfect segmentation result. However, various research studies suggest [8,9] that region-based metrics like F 1 or JI alone cannot fully reflect the human perception of image segmentation quality.
The definitions of metrics in Table 4 can only be applied to the binary (two-class) segmentation case. For multiclass segmentation, the overall scores for external measures were obtained by finding the score for each cluster C i (function confusionmatStats [28]), and then calculating the unweighted mean among all clusters K.
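The multiclass averaging described above can be sketched as follows, using JI as the example metric. This is not the `confusionmatStats` function referenced in the text but a hypothetical pure-Python equivalent of the same idea; labels are assumed to be integers in the range 0 to K-1 with cluster indices already matched to GT classes, and a class absent from both images is assumed to score 0.

```python
def confusion_matrix(gt, seg, k):
    """K x K matrix: rows are GT classes, columns are segmentation classes."""
    m = [[0] * k for _ in range(k)]
    for g, s in zip(gt, seg):
        m[g][s] += 1
    return m


def macro_jaccard(gt, seg, k):
    """Unweighted mean of the per-class Jaccard index over all K classes."""
    m = confusion_matrix(gt, seg, k)
    scores = []
    for i in range(k):
        tp = m[i][i]
        fn = sum(m[i]) - tp                  # rest of GT row i
        fp = sum(row[i] for row in m) - tp   # rest of segmentation column i
        total = tp + fp + fn
        scores.append(tp / total if total else 0.0)
    return sum(scores) / k
```

Because the mean is unweighted, a small class influences the overall score as much as a large one.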

Internal Metrics
Internal metrics require no external information, i.e., reference image, to evaluate the segmentation quality. The segmented result is evaluated based on a particular set of characteristics (criteria), derived from the initial dataset. This feature is important, as generating or creating reference images is time-consuming or sometimes impossible.
The internal quality metrics are usually employed for (1) determining the optimal number of clusters, (2) determining the quality of clustering results without depending on external information, and (3) determining whether the data have any structure [24].
Internal methods evaluate clustering by examining the separation and the compactness of the clusters.

• Cluster cohesion (or compactness) measures how closely related objects are in a cluster or how close the data points are to the cluster centroid. Better clustering results have pixel values close to their respective cluster centroids.
• Cluster separation measures how a cluster differs or is separated from the other clusters. Better clustering results have centroids of different clusters far from each other.
The primary measures of cohesion (14) and separation (15) are calculated from the image under investigation, while measures (16), (17), (19), and (21) combine both cohesion and separation. Basic notation for internal metrics is provided in Table 5. Clustering quality is considered good when the clusters are well separated and compact.

Table 5. Basic notations and definitions for the internal metrics.
Notation | Description
K | the number of clusters
N | the number of objects (pixels) in image X (i.e., pixel count or resolution)
$d(x_i, x_j)$ | the Euclidean distance between two objects (pixels)
$\lvert C_i \rvert$ | the total number of data points (pixels) in a cluster i
$X_i = \{x_{i1}, x_{i2}, \dots, x_{ij}, \dots, x_{i\lvert C_i \rvert}\}$ | the set of pixels in $C_i$
$\{c_1, c_2, \dots, c_i, \dots, c_K\}$ | the set of cluster means (centroids)

The Sum of Squared Errors Within Cluster (SSW) is alternatively known as the Sum of Squared Errors (SSE) (14) [29]. Lower values indicate higher cluster cohesion. SSE decreases with the increase of the number of clusters. SSE is defined as:

$$SSE = \sum_{i=1}^{K} \sum_{j=1}^{\lvert C_i \rvert} d(x_{ij}, c_i)^2 \quad (14)$$

The Sum of Squares Between Clusters (SSB) (15) [29] is a measure of separation. Higher SSB values indicate more separated clusters. SSB is defined as the sum of squared distances from $\bar{c}$ (known as the overall centroid, i.e., the centroid of all cluster centroids) to the other cluster centroids $c_i$, each time multiplied by the number of pixels in the cluster $C_i$:

$$SSB = \sum_{i=1}^{K} \lvert C_i \rvert \, d(c_i, \bar{c})^2 \quad (15)$$
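As a worked illustration of cohesion and separation, the sketch below computes SSE and SSB for a labelled set of pixels. It follows the verbal definitions above, in particular taking the overall centroid as the centroid of the cluster centroids; the function names are hypothetical and every cluster is assumed to be non-empty.

```python
def centroid(points):
    """Component-wise mean of a list of equally sized tuples."""
    n = len(points)
    return tuple(sum(v) / n for v in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sse_ssb(pixels, labels, k):
    """Return (SSE, SSB) for pixels with cluster labels 0..k-1."""
    clusters = [[p for p, lab in zip(pixels, labels) if lab == i]
                for i in range(k)]
    cents = [centroid(c) for c in clusters]
    overall = centroid(cents)  # centroid of all cluster centroids, as in the text
    sse = sum(sq_dist(p, cents[i]) for i, c in enumerate(clusters) for p in c)
    ssb = sum(len(c) * sq_dist(cents[i], overall)
              for i, c in enumerate(clusters))
    return sse, ssb
```

For the one-dimensional pixels `[(0,), (2,), (10,), (12,)]` with labels `[0, 0, 1, 1]`, the centroids are 1 and 11, so SSE = 4 and SSB = 2·25 + 2·25 = 100.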
Using SSE and SSB, some other combined internal metrics can be calculated. The Calinski-Harabasz index (CHI) (16) [30] is alternatively known as the variance ratio criterion (VRC).
The larger the $SSB_K / SSE_K$ ratio is, the better the clustering quality is. The Hartigan index (HI) (17) is defined as the logarithmic relationship between SSB and SSE [31].
The Xu coefficient (18) combines $SSE_K$, the number of clusters K, the total number of pixels, and the dimensionality of the input data [32].
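The combined indices can be computed from SSE and SSB once these are known. The sketch below uses the forms in which CHI and HI are commonly given in the clustering literature, namely the ratio of SSB/(K-1) to SSE/(N-K) and the natural logarithm of SSB/SSE; whether Equations (16) and (17) use exactly these forms (for example, the same logarithm base) is an assumption, and the function names are illustrative.

```python
import math


def calinski_harabasz(sse, ssb, n, k):
    """Variance ratio criterion: between-cluster over within-cluster variance."""
    return (ssb / (k - 1)) / (sse / (n - k))


def hartigan(sse, ssb):
    """Logarithm of the SSB to SSE ratio (natural logarithm assumed)."""
    return math.log(ssb / sse)
```

Continuing the four-pixel example with SSE = 4, SSB = 100, N = 4, and K = 2, CHI evaluates to 50 and HI to ln 25.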
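The Silhouette coefficient (SH) $-1 \le S_{x_{ij}} \le 1$ can be calculated for an individual pixel $x_{ij}$:

$$S_{x_{ij}} = \frac{b_{x_{ij}} - a_{x_{ij}}}{\max(a_{x_{ij}}, b_{x_{ij}})}$$

where:
• $a_{x_{ij}}$ is the average distance from the pixel $x_{ij}$ to other pixels in $C_i$ (cohesion),
• $b_{x_{ij}}$ is the minimum (worst case) of all the average distances, each of them computed between the same pixel $x_{ij}$ and all the pixels inside another cluster (separation).

The SH value for an individual pixel $x_{ij}$ represents how similar $x_{ij}$ is to pixels inside its own cluster, compared to pixels in other clusters. The Silhouette coefficient for a single cluster $C_i$ is the mean of the SH values of its pixels:

$$S_{C_i} = \frac{1}{\lvert C_i \rvert} \sum_{j=1}^{\lvert C_i \rvert} S_{x_{ij}}$$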
Finally, the overall SH for an image can be calculated similarly to Equation (13). However, alternative ways of calculating the overall score are possible (e.g., by averaging SH values for all pixels).
Higher SH values indicate a better clustering result. More detailed interpretations of SH values are described in Reference [19].
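A direct, unoptimized computation of SH may help to make the definition concrete. The sketch below follows the standard Silhouette definition and averages first within each cluster and then across clusters; the function names and the convention of assigning 0 to singleton clusters are assumptions, and at least two clusters are required. Its quadratic cost makes it suitable only for small images or pixel samples.

```python
def dist(p, q):
    """Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def silhouette_per_pixel(pixels, labels):
    """SH value of every pixel."""
    clusters = {}
    for p, lab in zip(pixels, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(pixels, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # cohesion: mean distance to the other pixels of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # separation: smallest mean distance to the pixels of another cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        scores.append((b - a) / max(a, b))
    return scores


def silhouette_overall(pixels, labels):
    """Mean over clusters of the per-cluster mean SH."""
    per_cluster = {}
    for s, lab in zip(silhouette_per_pixel(pixels, labels), labels):
        per_cluster.setdefault(lab, []).append(s)
    means = [sum(v) / len(v) for v in per_cluster.values()]
    return sum(means) / len(means)
```

Averaging the per-pixel values directly instead of per cluster is the alternative mentioned above; the two results differ when cluster sizes are unbalanced.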
The Davies-Bouldin index (DBI) [33,34] calculation is based on the ratio of within-cluster distances to between-cluster distances:

$$DBI = \frac{1}{K} \sum_{i=1}^{K} R_i, \qquad R_{ij} = \frac{S_i + S_j}{d_{ij}}$$

where:
• $R_i = \max_{j \ne i} R_{ij}$ is the maximum of R between the cluster $C_i$ and each other cluster $C_j$,
• $d_{ij} = d(c_i, c_j)$ is the distance between the centroid $c_i$ of cluster $C_i$ and the centroid $c_j$ of cluster $C_j$,
• $S_i$ is the average distance between every pixel (within the cluster $C_i$) and its centroid $c_i$,
• $S_j$ is the average distance between every pixel (within the cluster $C_j$) and its centroid $c_j$.
Here, cohesion is defined in the form of sum S i + S j , while d ij defines separation. Lower DBI values indicate a better clustering result. If clusters are close to each other (small d ij ) and are dispersed (big S i + S j ), then DBI value will be high, indicating less optimal clustering.
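The DBI computation can be sketched in the same style. The code below implements the standard definition, with $R_{ij} = (S_i + S_j)/d_{ij}$ and the index taken as the mean of the per-cluster maxima; the function names are illustrative, and non-empty clusters with distinct centroids are assumed.

```python
def dist(p, q):
    """Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def davies_bouldin(pixels, labels, k):
    """Davies-Bouldin index for pixels with cluster labels 0..k-1."""
    clusters = [[p for p, lab in zip(pixels, labels) if lab == i]
                for i in range(k)]
    cents = [tuple(sum(v) / len(c) for v in zip(*c)) for c in clusters]
    # S_i: mean distance of the pixels of cluster i to its centroid
    s = [sum(dist(p, cents[i]) for p in c) / len(c)
         for i, c in enumerate(clusters)]
    # R_i: worst (largest) ratio of cluster i against any other cluster
    r = [max((s[i] + s[j]) / dist(cents[i], cents[j])
             for j in range(k) if j != i)
         for i in range(k)]
    return sum(r) / k
```

For the pixels `[(0,), (2,), (10,), (12,)]` with labels `[0, 0, 1, 1]`, both clusters have S = 1 and the centroids are 10 apart, so the index equals 0.2.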

IQA Metrics
IQA metrics can be divided into three groups, depending on the need for reference information: (1) Full-Reference (FR) metrics, which require a reference/ground truth image (the quality of the segmented image is measured in comparison to the ground truth image); (2) No-Reference (NR) metrics, which do not require a reference/ground truth image for measuring quality; and (3) Reduced-Reference (RR) metrics, which measure quality by comparing the distorted/segmented image with a reference/ground truth image composed of specific extracted features (such as edge information). The reference image is named ground truth from the perspective of image segmentation. In this experiment, we concentrated on commonly used FR metrics that are listed in Table 6.
IQA metrics can be applied to image segmentation evaluation [10]. As input, the segmented images have different characteristics compared to lossy-compressed natural images and can be treated more like synthetic (i.e., artificially generated) images. Segmented images have crisp contours and uniform regions, and are generally less complex. The majority of the proposed IQA metrics are designed for correlation with the HVS, which is very sensitive to contrast [50] or structural information changes. To achieve correlation with human perceived quality, most IQA metrics employ multiple strategies. However, some of the features (Table 6), like contrast changes, contrast masking, or luminance masking, may be less important or not important at all in evaluating the satellite image segmentation result. Various research studies emphasize the importance of precise contours for improved perceived segmentation quality, after combining JI and BF quality metrics [8] in the novel FR-IQA quality index, intended for colour images [5].
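As an example of an FR metric, PSNR compares the segmented image with the GT image pixel by pixel. The sketch below uses the standard definition, 10·log10(MAX²/MSE); representing the images as flat sequences of 8-bit intensities and the function name are assumptions made for the illustration.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A single fully wrong pixel out of four, for instance `psnr([0, 0, 0, 0], [255, 0, 0, 0])`, gives 10·log10(4), which is about 6.02 dB.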

Subjective Quality Assessment of Segmented Satellite Images
Subjective evaluation by a human is the most reliable method for determining image quality in various applications (such as image editing, image retargeting [52], and others) as well as in the image segmentation [7]. However, segmentation quality requirements are also application-dependent [8,9].
Subjective quality evaluation tests require careful preparation as well as human and time resources. In contrast to the long-established subjective evaluation of distorted image and video quality, image segmentation lacks dedicated quality evaluation methodologies. For this reason, and due to similarities with IQA, many strategies for subjective quality assessment of segmentation are adopted from the existing standards [7].
For the assessment of image quality, there are many test methods and rating scales provided by International Telecommunication Union (ITU) standards, describing acceptable modifications and recommendations. The method describes how a stimulus (in this case, a sequence of segmented images) is presented, and the rating scale describes the way that subjects express their opinion of the stimulus. For this subjective quality assessment test, the simultaneous double stimulus for continuous evaluation (SDSCE) method described in ITU-R BT.500-14 [53] was combined with an absolute category rating (ACR) scale described in T-REC-P.913 [54]. ACR consists of a five-level rating scale (5-Excellent, 4-Good, 3-Fair, 2-Poor, and 1-Bad). T-REC-P.913 also does not recommend increasing the number of levels, since the accuracy of the results does not improve [55], and the evaluation process becomes more complicated for a human.
Before performing the MOS experiment, it is recommended to ensure that there are enough segmentation results of bad, average, and good quality between all segmented images. Otherwise, the data points will be concentrated in a single corner of the scatter plot and will not fully cover the ACR scale.
For subjects to fully understand their task, the first part of the test is the training phase, which includes examples covering the full range of segmentation quality results combined with verbal instructions given by the administrator explaining the voting procedure, following the practices described in Section 11.5 of Reference [54].
During the second phase, subjects were presented with the electronic form depicted in Figure 4 and were requested to evaluate the differences between the GT segmentation and segmentation result obtained by k-means++. Subjects were aware of which image is the GT and which image is the segmentation result obtained by the selected algorithm. There was no time limit for evaluating a single pair of images. However, each experiment session lasted no more than 20 min.
Similar to the approach in Reference [9], original satellite images were also placed on the left, but only for context purposes; they should not impact the judgement of segmentation quality. The segmented images were not scaled, to avoid introducing possible distortions.
In order to avoid fatigue and/or boredom, the test was divided into two parts, each consisting of 45 segmentations (90 segmentations in total). The first part was evaluated by 95 subjects and the second part by 92 subjects, which is more than enough for stable average ratings. After collecting experimental results, ratings were converted to numerical values (1, 2, 3, 4, and 5) and the total MOS for a single segmented image was calculated as the arithmetic mean of the individual assessments that the subjects assigned to the segmentation result [53]:

$$MOS = \frac{1}{N} \sum_{n=1}^{N} o_n \quad (22)$$

Here, $o_n$ is the observed rating for subject n, and N is the total number of subjects (participants) in the experiment. We observed that most segmented images received MOS values from 2.5 to 3.5. This distribution is likely due to the human tendency to avoid giving extreme scores when evaluating images [56].
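The MOS calculation for one segmented image amounts to a simple average of the ACR ratings. The sketch below assumes the ratings are already converted to integers from 1 to 5; the function name and the range check are illustrative additions.

```python
def mean_opinion_score(ratings):
    """Arithmetic mean of the ACR ratings (1-5) given to one segmented image."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ACR ratings must lie between 1 and 5")
    return sum(ratings) / len(ratings)
```

For example, the ratings `[3, 4, 2, 3, 3]` from five subjects give a MOS of 3.0.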

Results
The workflow presented in Figure 5 was performed to collect all necessary data required for calculating the correlation between the subjective and objective scores.
As seen in the general framework, the whole workflow can be divided into three main branches. The left section presents the steps that deal with constructing our dataset using the original dataset of satellite images and their GT. The middle section describes the steps related to the selection of the segmentation method and the segmentation of the satellite images based on the colour feature. The right section presents the steps used for obtaining subjective scores. The goal is to obtain the objective and subjective scores for the calculation of the Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank Order Correlation Coefficient (SROCC), which is the final step and is further described in this section. The calculation of the objective scores is presented before the final step.
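Correlation between the subjective (sub) and objective (obj) scores was determined by PLCC (23) and SROCC (24) [57]:

$$PLCC = \frac{\sum_{i=1}^{n} (sub_i - \overline{sub})(obj_i - \overline{obj})}{\sqrt{\sum_{i=1}^{n} (sub_i - \overline{sub})^2} \, \sqrt{\sum_{i=1}^{n} (obj_i - \overline{obj})^2}} \quad (23)$$

Here, n is the total number of images in the dataset.
SROCC is equal to PLCC applied to the ranks, in this case, of subjective and objective scores. In order to compute SROCC, data have to be converted into ranks. For the image i, $d_i$ denotes the difference between the ranks. Assuming no tied ranks exist, a simplified SROCC formula can be applied:

$$SROCC = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \quad (24)$$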
PLCC measures the strength of the linear relationship, while SROCC measures the strength of the monotonic relationship, whether linear or not. For example, for images distorted by various compression artifacts, most IQA metrics display nonlinear relationships with HVS [58]. Both correlation coefficients lie in the range [-1, 1]. Negative values indicate a negative correlation, while positive values indicate a positive correlation. If the values of both correlation coefficients approach zero, a relationship most likely does not exist.
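The two coefficients described above can be computed as in the following minimal sketch, written in plain Python. The function names and the sample scores are illustrative assumptions, not the code or data used in the study. Tied values receive the mean of their rank positions, which is a common convention; the simplified formula is valid only when no ties exist.

```python
import math


def plcc(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srocc(x, y):
    """Spearman coefficient: PLCC applied to the ranks (handles ties)."""
    return plcc(average_ranks(x), average_ranks(y))


def srocc_simplified(x, y):
    """Simplified formula 1 - 6*sum(d_i^2) / (n*(n^2 - 1)); assumes no ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical MOS values and objective metric scores for five images.
mos = [1.8, 2.6, 3.1, 3.4, 4.2]
metric = [0.40, 0.55, 0.50, 0.70, 0.90]
print(plcc(mos, metric), srocc(mos, metric), srocc_simplified(mos, metric))
```

For a monotonic but nonlinear relationship, such as x = [1, 2, 3, 4] and y = [1, 8, 27, 64], SROCC equals 1 while PLCC stays below 1, which illustrates why both coefficients are reported.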
The results of the experiment are presented in two groups of tables: Tables 7-9 give the overall correlation between subjective MOS scores and objective metric scores, and Tables 10-12 give the correlation between different quality groups of MOS scores and objective metric scores. Tables 7-9 show the overall correlation using all images from the dataset. In Tables 8 and 9, metrics are sorted from highest to lowest according to their PLCC and SROCC scores. Table 7 shows that SH has the strongest positive correlation, while DBI has the strongest negative correlation, for both PLCC and SROCC. Table 8 shows that JI has the highest PLCC, while ACC has the highest SROCC, closely followed by JI. Table 9 shows that PSNR has the highest PLCC, while UQI has the highest SROCC, closely followed by PSNR.
Information from Tables 7-9 is also represented as scatterplots in Figure 6a-d, showing a comparison between PLCC and SROCC values. Vertical and horizontal axes correspond to SROCC and PLCC, respectively. Here, metrics close to the dotted red line (y = x) have similar PLCC and SROCC values. Due to the very similar values for the external metrics, the scatterplot scale and position in Figure 6b were adjusted. The closer a metric is to the (0, 0) point, the weaker its correlation with MOS. The best results are for metrics closer to the upper-right corner (positive correlation with MOS) or the lower-left corner (negative correlation with MOS). From visual inspection, it can also be observed that the SROCC and PLCC values themselves display a strong positive linear correlation, meaning both correlation coefficients are equally important.
Tables 10-12 show the correlation between each metric and different quality groups based on MOS: low-quality (1.0-2.5), middle-quality (2.5-3.5), and high-quality (3.5-5].
Additionally, the information in Tables 10-12 is also presented as bar charts in Figures 7-9. For each quality group in Tables 11 and 12, the top three results are highlighted in light green. Dividing the results into three different quality groups allows for a more specialized comparison. The global correlation scores computed for the full dataset (Tables 7-9) cannot distinguish metrics that have a moderate correlation for all images (regardless of segmentation quality) from metrics that have a high correlation for images with above-average segmentation quality and a low correlation for images with bad segmentation quality.
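The per-group analysis can be sketched as follows. This is a hedged illustration only: the scores are hypothetical, and the handling of the boundary values 2.5 and 3.5 (assigned here to the lower group) is an assumption, since only the high-quality interval is stated as half-open.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns NaN when either sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / denom if denom > 0 else float("nan")


def quality_group(mos):
    """Assumed boundaries: low [1.0, 2.5], middle (2.5, 3.5], high (3.5, 5]."""
    if mos <= 2.5:
        return "low"
    if mos <= 3.5:
        return "middle"
    return "high"


def grouped_correlation(mos_scores, metric_scores, min_size=3):
    """Correlate MOS with one metric separately inside each quality group."""
    groups = {}
    for m, s in zip(mos_scores, metric_scores):
        mos_part, metric_part = groups.setdefault(quality_group(m), ([], []))
        mos_part.append(m)
        metric_part.append(s)
    # Groups with fewer than min_size images are skipped as too small.
    return {
        name: pearson(mos_part, metric_part)
        for name, (mos_part, metric_part) in groups.items()
        if len(mos_part) >= min_size
    }


# Hypothetical MOS values and scores of a single objective metric.
mos = [1.2, 1.8, 2.4, 2.8, 3.0, 3.3, 3.8, 4.2, 4.7]
metric = [0.20, 0.40, 0.60, 0.70, 0.50, 0.60, 0.90, 0.80, 0.85]
per_group = grouped_correlation(mos, metric)
```

In this toy data the metric follows MOS perfectly in the low-quality group but not in the other two, which is the kind of difference that a single global coefficient hides.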

Discussion
The internal measures do not correlate well with perceived segmentation quality (see the overall results in Table 7 and Figure 6a,d) compared with the external measures (Table 8 and Figure 6b) and the FR-IQA measures (Table 9 and Figure 6c). Although the internal measures are used to evaluate the results of clustering algorithms and help to select an optimal number of clusters, they were never intended to correlate with perceived image segmentation quality and relate more to how k-means is optimized. Thus, poor correlation is to be expected. Nevertheless, internal metrics combining cohesion and separation (such as DBI, CHI, and SH) seem to achieve slightly higher correlations, but, as shown in Table 10 and Figure 7, only for segmentation results with high MOS scores. Note that DBI shows a negative correlation because this measure returns lower scores for better clustering solutions. Overall, SH achieved a moderate correlation for images in the MOS range (3.5-5].
The overall correlation scores for the external validation metrics are in a tight cluster (Table 8 and Figure 6b). As presented in Table 11 and Figure 8, lower MOS values correspond to stronger correlations. This could be explained by the respondents agreeing more when judging the lower overall quality of a segmented image [7], while the opinions on better segmentation quality differ more. For the images in the low-quality group, P, $F_{1/2}$, and JI show a strong positive linear correlation (PLCC > 0.7). The $F_{1/2}$ and P metrics share a very similar correlation, as $F_{1/2}$ assigns a higher weight to P. Since there have been no extended studies in this area, the results in their entirety cannot be compared to previous studies. However, we observed that the SROCC values obtained by the authors of [7] for the JI and $F_1$ metrics (0.848 and 0.848), used for single object segmentation, are close to our overall SROCC values (0.811 and 0.802).
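The relation between the precision-weighted F-measure and P can be illustrated with a short sketch. It assumes the standard F-beta definition with beta = 1/2 and the standard Jaccard index computed from pixel counts; the counts are hypothetical, and the exact definitions used for the reported metrics are those given in the earlier sections of the paper.

```python
def precision(tp, fp):
    """Fraction of pixels assigned to a class that truly belong to it."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of the true class pixels that were found."""
    return tp / (tp + fn)


def f_beta(p, r, beta):
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


def jaccard(tp, fp, fn):
    """Jaccard index: intersection over union of prediction and ground truth."""
    return tp / (tp + fp + fn)


# Hypothetical pixel counts for one class of one segmented image.
tp, fp, fn = 80, 20, 80
p, r = precision(tp, fp), recall(tp, fn)  # P = 0.8, R = 0.5
f_one = f_beta(p, r, 1.0)
f_half = f_beta(p, r, 0.5)
ji = jaccard(tp, fp, fn)
```

With these counts, the beta = 1/2 score lies closer to P than the balanced score does, which is consistent with the two metrics showing similar correlations with MOS.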
The best overall correlation for the IQA metric group was achieved by PSNR, UQI, SSIM, and SR-SIM (correlation with MOS > 0.8) (Table 9 and Figure 6c). All of them are also known to be very fast, according to the average calculation time for natural images from the TID2013 database [59]. These metrics could be a reasonable choice for evaluating segmentation quality in terms of calculation speed and correlation with human perception.
It is widely accepted that PSNR does not always agree with HVS when assessing the quality of natural images distorted by various compression methods. In the overall results in Table 9 and Figure 6c, PSNR does not hold a significant advantage in correlation with MOS over most of the other IQA metrics. When compared across the quality groups (Table 12 and Figure 9), IW-SSIM behaves similarly to PSNR. IW-SSIM is a relatively stable metric according to PLCC, whereas PSNR is stable only for the low-to-middle MOS range (1.0-3.5). The UQI metric is the most stable according to SROCC, while PSNR performs better only for low-to-middle segmentation quality and may not be an optimal choice, depending on the segmentation situation. SR-SIM is a better choice for high-quality segmentations in the MOS range (3.5-5]. The reason PSNR is able to compete with the advanced HVS-based metrics could be the nature of the segmented images, which differ from natural ones. The most distinct features of segmented images are clear contours and object shapes, uniform regions and colour, and the absence of noise. HVS is sensitive to changes in the structural information of images. Therefore, metrics like UQI and SSIM are effective. HVS also relies largely on edge information for image interpretation [60]. In Reference [10], the authors state that PSNR can be a good method for evaluating edge detection algorithms (for example, on the BSR300 database).
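For reference, PSNR is a purely pixel-wise measure, which the following sketch makes explicit. It assumes 8-bit intensity values (peak value 255) and images given as nested lists of equal size; the tiny example images are hypothetical.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are nested lists (rows of pixel intensities); 8-bit range assumed.
    """
    ref_flat = [v for row in reference for v in row]
    test_flat = [v for row in test for v in row]
    if len(ref_flat) != len(test_flat) or not ref_flat:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(ref_flat, test_flat)) / len(ref_flat)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


# Hypothetical 2x2 grayscale ground truth and segmentation result.
ground_truth = [[0, 255], [128, 128]]
segmented = [[0, 255], [128, 118]]
value = psnr(ground_truth, segmented)  # MSE = 25, so 10 * log10(65025 / 25)
```

Since a segmented image consists of uniform regions, every mislabelled pixel contributes the full colour difference between two segments to the mean squared error, so PSNR effectively reflects how much of the image area is assigned to the wrong region.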
The IQA metrics, in general, have a varying correlation to subjective assessment depending on the database, distortion type, image content, and image segmentation quality categories.
The authors of Reference [60] state that PSNR can sometimes have a strong correlation: 0.8756 (SROCC) and 0.8723 (PLCC) using natural images from the LIVE database. These results are very similar to our overall results: 0.8647 (PLCC) and 0.8608 (SROCC). However, PSNR does not outperform the HVS-based metrics, which tend to be more stable across different databases of natural images and achieve even higher correlation with MOS.
Possible future directions for this research include additional testing of the relation of RR and NR-IQA metrics to subjective evaluation using larger-scale segmentation datasets, including multispectral satellite images. We suppose that it is possible to select the most prominent metrics for a combined measure designed for improved correlation with the subjective quality scores of segmented images. Furthermore, it would be interesting to evaluate the influence of algorithm selection, for example, using DBSCAN or another algorithm instead of k-means++, on the correlation scores.
Depending on the goal or method, the obtained segmentation results may have clusters assigned to different colour values. For example, a segmented image may consist of shades of a single colour or of very contrasting (or opposite) colour segments. Larger colour differences may affect some FR-IQA metric results and/or the subjective evaluation itself. However, humans are able to distinguish more levels of colour shades than of gray shades [5]. Finally, determining the correlation among different groups of metrics might provide additional insight and a more diverse comparison.

Conclusions
This research aims to evaluate the correlation between the subjective and objective image quality metrics from the perspective of satellite image segmentation. Three broad classes of quality metrics were considered: internal and external cluster validation indices as well as FR-IQA measures.
From the review of the state of the art, we can conclude that there is no extensive research related to the assessment of the effectiveness of satellite image segmentation. For the segmentation quality test, we constructed our own dataset of satellite images with GT, based on the "DeepGlobe Land Cover Classification Challenge" dataset.
From the experimental studies, several essential observations related to the assessment of the effectiveness of satellite image segmentation are made.

• When the segmentation results are diverse in perceived quality, most external measures and FR-IQA metrics display very similar correlation with MOS.
• As perceived segmentation quality decreases, the correlation with MOS increases for the external quality measures.
• The best metric for evaluating high-quality segmentations (MOS range (3.5-5]) was SR-SIM, which achieved SROCC and PLCC scores above 0.8 and also has low computational complexity.
• Since PSNR and SR-SIM complement each other and together cover the full MOS range, they could be combined into a single measure.
The experimental studies show that dividing segmentation results into three different quality groups based on MOS allows a more specialized comparison of the objective quality metrics according to perceived image quality.
Our study might provide insights for other research in which selecting the most suitable objective metric from the HVS perspective is crucial. In addition, our results and observations can be applied to improve current state-of-the-art segmentation methods.