A Fast and Effective Method for Unsupervised Segmentation Evaluation of Remote Sensing Images

Abstract: The segmentation of remote sensing images with high spatial resolution is important and fundamental in geographic object-based image analysis (GEOBIA), so evaluating segmentation results without prior knowledge is essential for comparing segmentation algorithms and for selecting and optimizing segmentation parameters. In this study, we propose a fast and effective unsupervised evaluation (UE) method that uses the area-weighted variance (WV) as the intra-segment homogeneity measure and the difference to neighbor pixels (DTNP) as the inter-segment heterogeneity measure. These two measures are then combined into a fast-global score (FGS) to evaluate the segmentation. The effectiveness of DTNP and FGS was demonstrated qualitatively by visual interpretation and quantitatively by supervised evaluation (SE). For this experiment, the "Multi-resolution Segmentation" algorithm in eCognition was adopted for segmentation, and four typical study areas of GF-2 images were used as test data. The effectiveness analysis of DTNP shows that, compared with two existing inter-segment heterogeneity measures, it remains stable and sensitive to both over-segmentation and under-segmentation. The effectiveness and computational cost analysis of FGS, compared with two existing UE methods, revealed that FGS can effectively evaluate segmentation results with the lowest computational cost.


Introduction
With the rapid development of remote sensing technology, high spatial resolution remote sensing images have become easier to obtain and are widely used in a variety of applications [1][2][3][4]. Compared with low- and medium-resolution remote sensing images, high-resolution images contain more detailed spatial information, but their spectral resolution is lower [5]. If a pixel-based analysis method, which uses only the spectral information of the image, is applied to a high-resolution image, the rich spatial information is ignored and more noise is produced [6,7]. Therefore, geographic object-based image analysis (GEOBIA) has emerged and can achieve better accuracy on high-resolution images [8][9][10].
The purpose of GEOBIA is to effectively utilize the spatial and texture information of high-resolution images [11,12]. The first step of GEOBIA is to segment the image into a series [...]

Remote Sens. 2020, 12, 3005

[...] need to construct the region adjacency graph (RAG) to obtain the adjacency between objects. However, RAG construction is complex and time-consuming, especially when the image is large and the segmentation scale value is low. As the image size increases, the number of segmented objects increases and the RAG construction time grows geometrically, which makes the existing methods difficult to apply in practice. So far, too little attention has been paid to computational cost. Therefore, a fast and effective UE method is needed.
In this paper, we present a fast and effective UE method for remote sensing images using the WV and difference to neighbor pixels (DTNP). Different from most existing heterogeneity evaluation measures, the difference with neighboring pixels was calculated in DTNP which avoids the time-consuming RAG construction and computation with neighboring objects. The WV and DTNP were used to compute the intra-segment homogeneity and inter-segment heterogeneity, respectively. The WV and DTNP were then combined into a fast-global score (FGS) to evaluate segmentation. In order to demonstrate the effectiveness and computational cost of DTNP and FGS, the other two existing methods were compared in the qualitative and quantitative analysis.
The main contributions of this paper are as follows: (1) Most of the existing inter-segment heterogeneity measures rely on a RAG to compute the difference with neighboring objects, whereas the proposed DTNP computes the difference with neighboring pixels and is sensitive to both under-segmentation and over-segmentation. (2) The proposed FGS performs comparably to JM and WM at a much lower computational cost, so it can be used for optimal segmentation parameter selection and segmentation parameter optimization. (3) Evaluating image segmentation by combining intra-segment homogeneity and inter-segment heterogeneity is more effective than evaluating it by inter-segment heterogeneity alone, as indicated by a comparison of Sections 3.1 and 3.2.
The rest of this paper is arranged as follows. In Section 2, we elaborate on the specific details of the proposed method as well as the experimental areas and data. Section 3 gives the experimental results. Sections 4 and 5 present the discussion and conclusions, respectively.

Overview
The schematic of segmentation evaluation through FGS, which combines WV and DTNP, is shown in Figure 1. The FGS was computed from the intra-segment homogeneity and the inter-segment heterogeneity under a series of different segmentation scales. First, we represented intra-segment homogeneity by computing the WV. Second, we represented inter-segment heterogeneity by computing the DTNP. Third, WV and DTNP were combined into FGS, and the curve of FGS against scale was generated for segmentation evaluation. Note that WV and DTNP were computed in eCognition Developer 9.0 [40] and FGS was computed in Python 3.6, on a computer running 64-bit Windows 10 with an Intel Core i5-8265U CPU at 1.8 GHz and 8 GB RAM. The code is open source (https://github.com/mfzhao1998/FGS).

Study Area and Data
A Gaofen-2 (GF-2) scene of Tongzhou District, Beijing, China, acquired on 9 September 2018, was used to evaluate the performance of the proposed method. The GF-2 images contain four multispectral bands (blue, green, red, and near-infrared) with a resolution of 3.2 m and a panchromatic band with a resolution of 0.8 m. The NNDiffuse Pan Sharpening algorithm [41] in ENVI [42], which performs best when the combination of all multispectral bands covers the spectral range of the panchromatic raster, was used to fuse the panchromatic and multispectral data into multispectral data with a resolution of 0.8 m. Four subsets of the GF-2 scene (a residential area, an industrial area, a farmland area, and a mixed area) were selected for testing the proposed method (Figure 2).

Image Segmentation
Image segmentation divides an image into connected regions with high internal homogeneity that correspond to the objects or spatial structure features of interest. The "Multi-resolution Segmentation" algorithm in eCognition Developer 9.0 was used to perform image segmentation. Details of the algorithm are given in Benz et al. [43]. The algorithm starts from single pixels and uses a "bottom-up" region merging method to form polygon objects of different sizes (scales). Adjacent image regions are merged and grow step by step, so several small objects can be merged into one large object. Whether two adjacent objects are merged depends on their heterogeneity: a merge is allowed only if the resulting heterogeneity is less than a given threshold. Because this threshold determines the final object size, it can be regarded as a scale parameter. The other segmentation parameters are the color factor and the shape factor, and the weights between them. The color factor describes the spectral characteristics of the image, i.e., the weight of each band. The shape factor is composed of two parameters, smoothness and compactness, which prevent image objects from becoming too fragmented, so that the segmented objects retain the shape characteristics of the actual objects.
In this paper, the ratio of color factor to shape factor was set to 9:1. All spectral bands were used and given the same weight. Smoothness and compactness were given the same weight. A series of segmentations was produced with the Multi-resolution Segmentation algorithm using scale parameters from 10 to 70 at intervals of 2, so as to adapt to different scenes such as cities and suburbs.
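The parameter sweep described above can be written down compactly. The following Python sketch only enumerates the settings; the dictionary keys are illustrative names chosen here, not eCognition identifiers, and the numeric weights are a direct reading of the 9:1 ratio and the equal smoothness/compactness weighting.

```python
# Scale parameters for the segmentation sweep: 10 to 70 inclusive, step 2.
scales = list(range(10, 71, 2))

# Fixed parameters shared by every run (key names are hypothetical).
fixed_params = {
    "color_weight": 0.9,        # color : shape = 9 : 1
    "shape_weight": 0.1,
    "smoothness_weight": 0.5,   # smoothness and compactness weighted equally
    "compactness_weight": 0.5,
}

# One parameter set per scale value.
runs = [dict(fixed_params, scale=s) for s in scales]
print(len(runs), runs[0]["scale"], runs[-1]["scale"])
```

The sweep contains 31 runs per study area, which is the number of segmentation results evaluated for each image.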

Unsupervised Evaluation Using Fast-Global Score
Most evaluation methods mainly consider intra-segment homogeneity and inter-segment heterogeneity, and exclude internal continuity and boundary complexity, which are difficult to apply to remote sensing images [25,30]. Appropriate segmentation parameters are those that maximize both intra-segment homogeneity and inter-segment heterogeneity [29]. In other words, the difference within an object is the lowest, and the difference between objects is the highest. First, the variance, weighted by each segment's area, was used as the global intra-segment homogeneity measure. It is defined as follows:

$$v_i = \frac{1}{m}\sum_{b=1}^{m} v_{ib} \tag{1}$$

where $m$ is the number of bands of the image, $v_{ib}$ is the variance of object $i$ in band $b$, and $v_i$ is the mean variance of object $i$ averaged over all bands.

$$WV = \frac{\sum_{i=1}^{n} a_i v_i}{\sum_{i=1}^{n} a_i} \tag{2}$$

where $n$ is the total number of objects and $a_i$ is the area of object $i$. WV gives the same weight to each band and weights $v_i$ by the object area, which avoids the instability caused by small segments. Second, the DTNP of each object was computed by Equation (3), which is defined on the extended bounding box of object $i$ with distance $d$ (Figure 3), the set of pixels of object $i$, and the mean of each band $b$. A global DTNP is formed by area weighting of each segment:

$$DTNP = \frac{\sum_{i=1}^{n} a_i \, DTNP_i}{\sum_{i=1}^{n} a_i} \tag{4}$$

where $DTNP_i$ denotes the DTNP of object $i$. This approach takes local spectral differences into full consideration and makes the result more reasonable through area weighting. Compared with other methods [24,30], DTNP is more convenient to compute because it does not require constructing a RAG or computing heterogeneity with multiple neighboring objects.

Finally, WV and DTNP are combined into FGS to evaluate the quality of segmentation, considering both the intra-segment homogeneity and the inter-segment heterogeneity. In order to consider the two equally, WV and DTNP were normalized to a 0-1 scale using Equation (5):

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} \tag{5}$$

where $X$ is the WV or DTNP value obtained with a given set of segmentation parameters, and $X_{min}$ and $X_{max}$ are the minimum and maximum values of WV or DTNP. Note that low WV and high DTNP values represent better intra-segment homogeneity and inter-segment heterogeneity, respectively.

To assign an overall FGS to each segmentation result, WV and DTNP were combined using Equation (6), in which the weight $w$ determines the relative weights of the intra-segment homogeneity and the inter-segment heterogeneity. In this paper, the same weight was given to WV and DTNP. Note that a higher FGS value indicates better segmentation quality.
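The scoring steps can be illustrated with a short NumPy sketch. WV follows the description above (per-band variance averaged over bands, then area-weighted over objects), and the normalization is plain min-max scaling. The expression of Equation (6) is not reproduced in this text, so `fast_global_score` is an assumption: it rewards low normalized WV and high normalized DTNP so that a higher score is better, which matches the stated interpretation but is not necessarily the authors' formula. All function names are illustrative and are not taken from the authors' repository.

```python
import numpy as np


def weighted_variance(image, labels):
    """Area-weighted variance (WV) of one segmentation.

    image  : float array of shape (rows, cols, bands)
    labels : int array of shape (rows, cols), one id per object
    """
    weighted_sum = 0.0
    total_area = 0
    for obj_id in np.unique(labels):
        pixels = image[labels == obj_id]       # shape (a_i, bands)
        area = pixels.shape[0]                 # a_i
        v_i = pixels.var(axis=0).mean()        # population variance per band, averaged
        weighted_sum += area * v_i
        total_area += area
    return weighted_sum / total_area


def min_max(values):
    """Scale a series of values (one per scale parameter) to [0, 1]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span


def fast_global_score(wv, dtnp, w=0.5):
    """ASSUMED combination (not the paper's Equation (6)):
    low WV and high DTNP both raise the score."""
    return w * (1.0 - min_max(wv)) + (1.0 - w) * min_max(dtnp)


# Toy check of WV: two flat regions, segmented correctly vs. merged into one.
image = np.zeros((4, 4, 1))
image[:, 2:, 0] = 10.0
good = np.zeros((4, 4), dtype=int)
good[:, 2:] = 1
merged = np.zeros((4, 4), dtype=int)
wv_good = weighted_variance(image, good)
wv_merged = weighted_variance(image, merged)

# Hypothetical WV and DTNP series over three scales (illustration only).
scales = [10, 12, 14]
fgs = fast_global_score([1.0, 2.0, 4.0], [0.0, 3.0, 4.0])
best_scale = scales[int(np.argmax(fgs))]
```

In the toy image, the correct segmentation has zero within-object variance, while the merged segmentation does not, which is the behavior WV is meant to capture. With equal weights, the scale with the highest score is taken as the optimum.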

Accuracy Assessment Measures for the Proposed Method
Supervised quantitative evaluation is often used to assess the accuracy of UE methods, and many supervised measures have been proposed [25]. The quality rate (QR), over-segmentation (OS), under-segmentation (US), and the index D, which combines US and OS, were used to validate the effectiveness of the proposed method [26]. QR measures the discrepancy between a reference object and its corresponding segmented object and ranges from 0 to 1; zero means that the two objects match exactly and the segmentation result is the best. OS and US evaluate the degree of over-segmentation and under-segmentation by calculating the ratio of the over-segmented area to the reference object and the ratio of the under-segmented area to the corresponding object, respectively. Both OS and US range from 0 to 1, and zero means neither over-segmentation nor under-segmentation. D combines OS and US and therefore considers both over-segmentation and under-segmentation. A lower D value reflects higher segmentation quality.
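For one reference object and its corresponding segment, these measures can be sketched on boolean masks. The paper defers to [26] for the exact formulas, so the expressions below are an assumption: they are the area-based definitions commonly used in the segmentation evaluation literature and agree with the verbal description above (all four are 0 for a perfect match). The function name is illustrative.

```python
import numpy as np


def accuracy_measures(reference, segment):
    """Return (QR, OS, US, D) for one reference/segment pair.

    reference, segment : boolean masks of identical shape.
    Assumed definitions:
      QR = 1 - |r & s| / |r | s|
      OS = 1 - |r & s| / |r|
      US = 1 - |r & s| / |s|
      D  = sqrt((OS^2 + US^2) / 2)
    """
    reference = np.asarray(reference, dtype=bool)
    segment = np.asarray(segment, dtype=bool)
    inter = np.logical_and(reference, segment).sum()
    union = np.logical_or(reference, segment).sum()
    qr = 1.0 - inter / union
    over = 1.0 - inter / reference.sum()
    under = 1.0 - inter / segment.sum()
    d = float(np.sqrt((over ** 2 + under ** 2) / 2.0))
    return float(qr), float(over), float(under), d


# A 2 x 2 reference object covered only half by its segment:
# the segment lies inside the reference, so this is pure over-segmentation.
reference = np.zeros((4, 4), dtype=bool)
reference[0:2, 0:2] = True
segment = np.zeros((4, 4), dtype=bool)
segment[0:2, 0:1] = True
qr, over, under, d = accuracy_measures(reference, segment)
```

In this example OS is 0.5 and US is 0, so D is driven entirely by the over-segmentation term.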
In this experiment, manual visual interpretation in ArcGIS 10.6 was used to generate reference objects for the four typical study areas, and the above four accuracy assessment measures were computed to verify the effectiveness of DTNP and FGS. We generated 40 reference objects each for T1 and T2 and 30 reference objects each for T3 and T4 (a total of 140 reference objects), shown as green polygons in Figure 2.

Comparison with Other UE Methods and Inter-Segment Heterogeneity Measures
To demonstrate the effectiveness of the proposed method, the two methods proposed by Johnson et al. [30] and Wang et al. [24] were compared to FGS. The three methods use the same intra-segment homogeneity measure, but the inter-segment heterogeneity measure is different. Johnson's method used MI to measure the correlation between all segments, and its effectiveness has been proven in many studies [22,24,29,30,39]. The BSH was used by Wang's method to measure inter-segment heterogeneity, focusing on enhancing the objectivity of heterogeneity with local spatial statistics compared to MI. Different from MI and BSH, DTNP computes the difference to neighbor pixels.
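To make the contrast with RAG-based measures concrete, the sketch below computes a per-object difference to neighbor pixels without any adjacency graph. The exact expression of the per-object DTNP is not reproduced in this text, so the code is only one plausible reading: for each object it takes the mean absolute difference, over bands, between the object's mean and the mean of the non-object pixels inside its bounding box extended by `d` pixels, and then area-weights the per-object values. The function names and the default `d=1` are assumptions.

```python
import numpy as np


def dtnp_of_object(image, labels, obj_id, d=1):
    """ASSUMED per-object DTNP: object mean vs. mean of the other pixels
    inside the bounding box extended by d pixels (clipped to the image)."""
    rows, cols = np.nonzero(labels == obj_id)
    r0 = max(rows.min() - d, 0)
    r1 = min(rows.max() + d, labels.shape[0] - 1)
    c0 = max(cols.min() - d, 0)
    c1 = min(cols.max() + d, labels.shape[1] - 1)
    window = image[r0:r1 + 1, c0:c1 + 1]               # (h, w, bands)
    inside = labels[r0:r1 + 1, c0:c1 + 1] == obj_id    # (h, w)
    if inside.all():
        return 0.0                                     # no neighbor pixels in the box
    obj_mean = window[inside].mean(axis=0)             # per-band mean of the object
    nbr_mean = window[~inside].mean(axis=0)            # per-band mean of neighbor pixels
    return float(np.abs(obj_mean - nbr_mean).mean())


def global_dtnp(image, labels, d=1):
    """Area-weighted mean of the per-object values."""
    ids, areas = np.unique(labels, return_counts=True)
    values = np.array([dtnp_of_object(image, labels, i, d) for i in ids])
    return float((areas * values).sum() / areas.sum())


# Two flat regions: correct segmentation vs. everything merged into one object.
image = np.zeros((4, 4, 1))
image[:, 2:, 0] = 10.0
good = np.zeros((4, 4), dtype=int)
good[:, 2:] = 1
merged = np.zeros((4, 4), dtype=int)
dtnp_good = global_dtnp(image, good)
dtnp_merged = global_dtnp(image, merged)
```

Each object only needs its own bounding box and the label image, so the cost per object is independent of how many neighbors it has, which is the property the paper relies on to avoid RAG construction.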
To evaluate the effectiveness of DTNP on its own, the UE method proposed by Yang et al. [31] was applied using only the inter-segment heterogeneity measure. As the scale parameter increases, MI and BSH decrease while DTNP increases. When representative objects are matched, the inter-segment heterogeneity remains almost unchanged after the optimal scale parameter is reached, so its trend of change suddenly weakens or stops. The index $\dot{H}$, which measures the change in inter-segment heterogeneity between scales, is defined in Equation (7) in terms of $H(l)$, the inter-segment heterogeneity value at scale $l$, and $\Delta l$, the interval of the scale parameter. Based on the above analysis, when the segmentation result at scale parameter $l - \Delta l$ is close to the appropriate segmentation result, the index $I(l)$ at scale parameter $l$ is defined by Equation (8). Because MI and BSH decrease with scale, the lowest value of $I$ indicates the optimal segmentation parameter when they are used to measure inter-segment heterogeneity. Because DTNP increases with scale, the largest value of $I$ indicates the optimal segmentation parameter when DTNP is used.
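The idea of detecting the scale at which the trend of a heterogeneity curve suddenly weakens can be sketched as follows. The expressions of Equations (7) and (8) are not reproduced in this text, so both functions are assumptions: the rate of change is taken as a backward finite difference, and the indicator as the drop in that rate between consecutive scales. This choice reproduces the stated sign convention (largest value for an increasing measure such as DTNP, lowest value for a decreasing measure such as MI or BSH), but it is not necessarily the authors' definition.

```python
import numpy as np


def rate_of_change(h, delta_l):
    """ASSUMED form of the rate index: (H(l) - H(l - delta_l)) / delta_l.
    Entry k of the result belongs to scale index k + 1."""
    h = np.asarray(h, dtype=float)
    return np.diff(h) / delta_l


def indicator(h, delta_l):
    """ASSUMED form of the indicator: rate at l - delta_l minus rate at l.
    Entry k of the result belongs to scale index k + 2."""
    h_dot = rate_of_change(h, delta_l)
    return h_dot[:-1] - h_dot[1:]


# Hypothetical, DTNP-like curve: grows steadily, then flattens (illustration only).
scales = [10, 12, 14, 16, 18]
h = [1.0, 2.0, 3.0, 3.1, 3.2]
i_values = indicator(h, delta_l=2)
i_scales = scales[2:]                         # scales at which the indicator is defined
flagged = i_scales[int(np.argmax(i_values))]  # use argmin for MI- or BSH-like curves
```

Here the growth weakens between scales 14 and 16, so the indicator peaks at 16, i.e., the preceding scale is the last one before the curve flattens.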
In this paper, the three inter-segment heterogeneity measures were evaluated using the optimal segmentation results obtained from the index I in Section 3.1. To evaluate the effectiveness of FGS, we analyzed it qualitatively through visual assessment and quantitatively through the accuracy assessment measures in Section 3.2, in comparison with Johnson's method [30] and Wang's method [24].

Effectiveness Analysis of DTNP
The Multi-resolution Segmentation algorithm in eCognition Developer 9.0 was applied to the four study areas to generate 31 segmentation results per area, with scale parameters ranging from 10 to 70 at intervals of 2. We used the method described in Section 2.6 to compute $\dot{H}$ and I for DTNP, MI, and BSH in order to verify the inter-segment heterogeneity measures, as presented in Figure 4. Lower DTNP values indicate lower inter-segment heterogeneity, while lower MI and BSH values indicate higher inter-segment heterogeneity. As the segmentation scale parameter increased, the DTNP value increased and the MI and BSH values decreased gradually in all four study areas. The $\dot{H}$ values show that the trends of DTNP and MI were more stable than that of BSH. Because the three indices have different trends with scale, the maximum I value of DTNP and the minimum I values of MI and BSH indicate the optimal scale parameter. The optimal scales obtained with DTNP, MI, and BSH were 30, 60, and 20 for the T1 image; 66, 66, and 50 for the T2 image; 52, 56, and 20 for the T3 image; and 58, 62, and 32 for the T4 image, respectively.
The local details of the optimal segmentation results of the four study areas obtained by DTNP, MI, and BSH are shown in Figure 5. In the first subset of T1, the playground was segmented well in the DTNP result, but it was under-segmented in the MI result and over-segmented in the BSH result. In the second subset of T1, buildings were effectively separated from shadows and some roads in the DTNP and BSH results, whereas some buildings were mixed with shadows and roads in the MI result. In the subset of T2, the DTNP and MI results were more under-segmented than the BSH result, and the buildings were not effectively separated from the surrounding buildings. In the first subset of T3, the results of DTNP and MI were equivalent. Both separated buildings from other buildings and from trees, but one object was under-segmented and failed to separate two buildings. In the BSH result, every geographical object was seriously over-segmented. In the second subset of T3 and the first subset of T4, the farmland was segmented well in the DTNP and MI results, but it was over-segmented in the BSH result. In the second subset of T4, the farmland was segmented well in the DTNP result, but it was under-segmented in the MI result and over-segmented in the BSH result.

Figure 5. Subsets of the optimal segmentation for the four test images obtained by DTNP and the two other existing inter-segment heterogeneity measures.
To quantitatively evaluate the effectiveness of DTNP, Table 1 shows the accuracy assessment results of the optimal segmentation for the four study areas using the UE methods based only on inter-segment heterogeneity. For the T1 image, the QR and D values of DTNP were both lower than those of MI and BSH, indicating that DTNP performed better on T1. Furthermore, the OS value of BSH was much higher than those of DTNP and MI, indicating that the BSH result was more over-segmented. The US value of MI was much higher than those of DTNP and BSH, indicating that the MI result was more under-segmented. For the T2 image, DTNP performed the same as MI, and BSH performed better, because the QR and D values of BSH were both lower than those of DTNP and MI. The OS and US values show that the BSH result was more over-segmented and the DTNP and MI results were more under-segmented. For the T3 image, both the QR and D values at the optimal scale obtained by DTNP were lower than those of MI and BSH, indicating that the segmentation at the scale obtained by DTNP was better. The OS values of DTNP and MI were similar and much lower than that of BSH, indicating that the BSH result was more over-segmented. The US values show that the MI result was the most under-segmented and the BSH result had less under-segmentation. For the T4 image, the QR and D values of BSH were both lower than those of DTNP and MI, indicating that BSH performed better on T4. The OS and US values show that the BSH result was the most over-segmented but performed best in terms of under-segmentation, whereas the MI result was the most under-segmented but performed best in terms of over-segmentation.

Effectiveness Analysis of FGS
To further demonstrate the effectiveness of FGS, it was compared with Johnson's method (JM) [30] and Wang's method (WM) [24]. The three methods share the same intra-segment homogeneity measure and differ only in the inter-segment heterogeneity measure. Note that all three methods use the same normalization method; thus, lower values in JM indicate higher segmentation quality, whereas higher values in FGS and WM indicate higher segmentation quality. The optimal results obtained by the three methods are shown in Figure 6. The optimal scales obtained with FGS, JM, and WM were 30, 30, and 32 for the T1 image; 24, 18, and 38 for the T2 image; 34, 24, and 32 for the T3 image; and 30, 38, and 44 for the T4 image, respectively.

The local details of the optimal segmentation results of the four study areas obtained by FGS, JM, and WM are shown in Figure 7. In the first subset of T1, the playground was segmented better in the FGS and JM results, whereas it was under-segmented in the WM result. In the second subset of T1, the results of all three methods were under-segmented because the buildings were confused with some vegetation, and the WM result was the most under-segmented. In the first subset of T2, FGS and JM performed similarly and their results were more over-segmented, whereas the WM result showed local under-segmentation. In the second subset of T2, the FGS and JM results were over-segmented; in the WM result, buildings and vegetation were not effectively separated, and under-segmentation occurred. In the first subset of T3, the buildings were over-segmented by all three methods, but FGS and WM performed better than JM. In the second subset of T3, the farmland was over-segmented by all three methods, but the geographic objects in the JM result were more fragmented. In the two subsets of T4, JM and WM segmented the farmland better than FGS, although the results of all three methods were over-segmented.

To quantitatively evaluate the effectiveness of FGS, Table 2 shows the accuracy assessment results of the optimal segmentation for the four study areas using the three methods. For the T1 image, the QR and D values of WM were both higher than those of FGS and JM, indicating that FGS and JM performed better on T1. The OS and US values show that the WM result was the most under-segmented and that the three methods were equivalent in terms of over-segmentation. For the T2 image, the QR and D values of WM were both the lowest, indicating that WM performed best, and those of JM were both the highest, indicating that JM performed worst. The OS value of JM was much higher than those of FGS and WM, indicating that the JM result was more over-segmented. The US value of WM was much higher than those of FGS and JM, indicating that the WM result was more under-segmented. For the T3 image, FGS performed the same as WM, and JM performed worst, because the QR and D values of JM were both higher than those of FGS and WM. The OS and US values show that the JM result was more over-segmented and the FGS and WM results were more under-segmented. For the T4 image, both the QR and D values at the optimal scale obtained by WM were lower than those of FGS and JM, indicating that the segmentation at the scale obtained by WM was better. The OS values show that the FGS result was the most over-segmented and the WM result had less over-segmentation. The US values of JM and WM were similar and much higher than that of FGS, indicating that the JM and WM results were more under-segmented.

The Performance of FGS on Other Datasets
To evaluate the performance of FGS more thoroughly, the Massachusetts Buildings Dataset [44] was used, which is available at http://www.cs.toronto.edu/~vmnih/data/. The images in this dataset contain three multispectral bands (blue, green, and red) with a resolution of 1 m. In this paper, four images of 1500 × 1500 pixels were selected for the experiment (Figure 8). The optimal results obtained by the three methods are shown in Figure 9. The optimal scales obtained with the three methods were 28, 24, and 34 for the T5 image; 32, 32, and 28 for the T6 image; 32, 44, and 28 for the T7 image; and 30, 28, and 26 for the T8 image, respectively.

The accuracy assessment results of the optimal segmentation for the four test images using the three methods are presented in Table 3. For the T5 image, the performance of FGS lay between that of WM and JM, as indicated by the QR and D values. The OS value of JM was the highest, indicating that the JM result was more over-segmented. The US value of WM was the highest, indicating that the WM result was more under-segmented. For the T6 image, FGS performed the same as JM, and both performed worse than WM, as indicated by the lower QR and D values of WM. The OS and US values show that WM performed worse in terms of over-segmentation and better in terms of under-segmentation. For the T7 image, the QR and D values of FGS were lower than those of WM and higher than those of JM, indicating that FGS performed better than WM and worse than JM. The OS and US values show that the JM result was the most under-segmented and the WM result was the most over-segmented. For the T8 image, both the QR and D values at the optimal scale obtained by FGS were higher than those of JM and WM, indicating that the segmentation at the scale obtained by FGS was the worst. The OS and US values show that the WM result was the most over-segmented and the FGS result was the most under-segmented. In addition, the QR and D values did not differ much on the Massachusetts Buildings Dataset, indicating that the performance of the three methods was very similar.

Computational Cost
This paper uses T1 as an example to analyze the computational cost of FGS, JM, and WM, and its relationship with the total number of objects. Because FGS, JM, and WM share the same intra-segment homogeneity measure and differ only in the inter-segment heterogeneity measure, the computing efficiency of DTNP, MI, and BSH also reflects the efficiency of the three UE methods. Note that all three inter-segment heterogeneity measures were computed in the eCognition Developer 9.0 environment. The computational cost of DTNP, MI, and BSH and the total number of objects at different scales of T1 are shown in Figure 10. As the scale increases, the number of objects obtained by segmentation decreases geometrically. The computational cost of DTNP, MI, and BSH is positively correlated with the number of objects; the correlation coefficients are 0.9906, 0.9999, and 0.9999, respectively, and the p values are all less than 0.05. At the same scale, DTNP has the lowest computational cost, far lower than that of MI and BSH. Therefore, FGS has the lowest computational cost compared with JM and WM.
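As an illustration of how such a correlation coefficient between computational cost and the number of objects can be obtained, the following minimal sketch computes the Pearson correlation coefficient with NumPy. The object counts and costs below are synthetic placeholder values generated from simple formulas; they are not measurements from this paper, and the function name `pearson_r` is hypothetical.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.size < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Synthetic placeholder data (NOT from the paper): the number of objects
# shrinks geometrically as the scale parameter grows.
scales = np.arange(20, 101, 10)
n_objects = 50000 * 0.75 ** np.arange(scales.size)

# A cost that grows linearly with the number of objects gives r close to 1.
cost_linear = 0.002 * n_objects + 1.5
# A cost that grows sub-linearly still correlates positively, but with r < 1.
cost_sublinear = np.sqrt(n_objects)

r_linear = pearson_r(n_objects, cost_linear)
r_sublinear = pearson_r(n_objects, cost_sublinear)
```

A coefficient very close to 1 therefore suggests that the cost scales almost linearly with the number of objects, while a somewhat lower coefficient suggests a departure from strict linearity.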

Discussion
With the development of aviation and aerospace remote sensing technology, high-resolution remote sensing images containing more detailed spatial information are increasingly used. To make better use of these spatial details, image segmentation is an indispensable processing step. The quality of image segmentation directly affects the accuracy of subsequent analysis, so it is important to evaluate image segmentation. Although the SE method can provide more accurate results, it requires manual generation of reference objects, which makes it difficult to apply widely in practice. The UE method can objectively evaluate segmentation results without prior knowledge. Like SE, UE can be used for the comparison and selection of segmentation methods and for the setting of segmentation parameters; in addition, it can be used for segmentation parameter optimization. Therefore, the UE method has gained more attention.
This study proposed the FGS, which combines intra-segment homogeneity and inter-segment heterogeneity to evaluate image segmentation. Compared with the existing UE methods, FGS uses the same intra-segment homogeneity measure, but introduces DTNP as the inter-segment heterogeneity measure. As the scale increases, the inter-segment heterogeneity also increases. The DTNP curve shown in Figure 8 clearly reveals the process of increasing heterogeneity. FGS is always low when the image is over- or under-segmented. The FGS curve shown in Figure 6 clearly reveals the change from over-segmentation to optimal segmentation and then to under-segmentation.
DTNP was proposed to measure inter-segment heterogeneity quickly and effectively. Through qualitative analysis of the three inter-segment heterogeneity measures (Figure 5), we found that BSH had higher separability and MI had lower separability in the segmentation results, but the BSH result was also the most over-segmented. Through quantitative analysis of the three inter-segment heterogeneity measures (Table 1), we found that the QR and D values of DTNP were the lowest in T1 and T3, were the same as those of MI and lower than those of BSH in T3, and were the highest in T4, indicating that DTNP has comparable performance to MI and BSH as an inter-segment heterogeneity measure. Furthermore, the OS value of BSH was the highest and the US value of BSH was the lowest in all four study areas, indicating that BSH performs best with respect to under-segmentation but worst with respect to over-segmentation. The OS value of MI was the lowest and the US value of MI was the highest in all four study areas, indicating that MI performs best with respect to over-segmentation but worst with respect to under-segmentation. From the OS and US values, we can see that DTNP performs better than BSH on over-segmentation and better than MI on under-segmentation. Therefore, compared with MI and BSH, each of which is more sensitive to either over-segmentation or under-segmentation, DTNP remains stable and sensitive to both over-segmentation and under-segmentation.
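The two ingredients discussed above can be sketched in a few lines of NumPy for a single band. The area-weighted variance follows the usual definition (segment variances weighted by segment area). The neighbor-pixel difference is only one plausible reading of DTNP, assumed here for illustration: for each segment, the absolute difference between the segment mean and the mean of the pixels directly bordering it (4-neighborhood), averaged over segments with area weights. The exact DTNP formula is the one defined earlier in the paper and may differ; all function names below are hypothetical.

```python
import numpy as np


def area_weighted_variance(band: np.ndarray, labels: np.ndarray) -> float:
    """Intra-segment homogeneity: segment variances weighted by segment area."""
    ids = np.unique(labels)
    areas = np.array([(labels == i).sum() for i in ids], dtype=float)
    variances = np.array([band[labels == i].var() for i in ids])
    return float((areas * variances).sum() / areas.sum())


def neighbor_ring(mask: np.ndarray) -> np.ndarray:
    """Pixels outside `mask` that touch it in the 4-neighborhood."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    grown = (
        padded[:-2, 1:-1] | padded[2:, 1:-1] | padded[1:-1, :-2] | padded[1:-1, 2:]
    )
    return grown & ~mask


def neighbor_pixel_difference(band: np.ndarray, labels: np.ndarray) -> float:
    """Assumed DTNP-like measure: area-weighted |segment mean - ring mean|."""
    weighted_sum = 0.0
    total_area = 0.0
    for i in np.unique(labels):
        mask = labels == i
        ring = neighbor_ring(mask)
        if not ring.any():  # a segment covering the whole image has no neighbors
            continue
        area = float(mask.sum())
        weighted_sum += area * abs(band[mask].mean() - band[ring].mean())
        total_area += area
    return weighted_sum / total_area if total_area else 0.0


# Toy image: left half has value 10, right half has value 20.
band = np.full((4, 4), 10.0)
band[:, 2:] = 20.0

good = np.ones((4, 4), dtype=int)
good[:, 2:] = 2                      # one segment per true object
over = good.copy()
over[2:, :2] = 3                     # left object split into two segments
under = np.ones((4, 4), dtype=int)   # everything merged into one segment

wv_good, het_good = area_weighted_variance(band, good), neighbor_pixel_difference(band, good)
wv_over, het_over = area_weighted_variance(band, over), neighbor_pixel_difference(band, over)
wv_under, het_under = area_weighted_variance(band, under), neighbor_pixel_difference(band, under)
```

In this toy case, over-segmentation leaves the variance unchanged but lowers the neighbor-pixel difference, whereas under-segmentation raises the variance. This is the qualitative behavior that a combined score relies on, and the measure needs only the pixels around each segment rather than a graph of neighboring objects.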
To further demonstrate the effectiveness of the proposed FGS, we compared it with JM and WM. The qualitative results (Figure 7) show that the segmentation results determined by the three methods can distinguish different geographic objects, but cannot accurately distinguish all geographic objects, because a single segmentation scale is not applicable to all geographic objects. Therefore, it is necessary to further study the optimization of segmentation parameters. The QR and D values in the quantitative results (Table 2) indicate that FGS does not always perform better than JM and WM. However, for the T1 and T3 images, the QR and D values of FGS were the lowest, indicating that FGS performs best there. The experiment on the Massachusetts Buildings Dataset demonstrates that the performance of FGS is similar to that of the compared methods, and that it can effectively evaluate segmentation quality.
From the QR and D values in Tables 1 and 2, we can see that the segmentation results obtained when the optimal scale was determined only by evaluating inter-segment heterogeneity were worse than those obtained when the optimal scale was determined by combining the evaluation of intra-segment homogeneity and inter-segment heterogeneity. This indicates the importance of considering both intra-segment homogeneity and inter-segment heterogeneity in segmentation quality evaluation.
From the computational cost results, we can see that the computational cost of DTNP was much lower than that of MI and BSH, and that the computational cost of BSH was higher than that of MI. DTNP requires neither the construction of a region adjacency graph (RAG) nor computation with neighboring objects, which results in its lower computational cost. The computation of BSH needs to use the RAG twice, but the RAG cannot be stored in the eCognition environment. Therefore, the computational cost of BSH is significantly increased by constructing the RAG twice. If MI and BSH were computed by custom code, the difference in computational cost would be greatly reduced. Note that the correlation coefficient between the computational cost and the number of objects is the lowest for DTNP, because when the number of objects is small the computational cost is so low that it is difficult to make full use of computing resources. Since computing the heterogeneity measure costs more than the other steps in UE, we can infer from the computational cost of DTNP, MI, and BSH that the computational cost of FGS is much lower than that of JM and WM.

Conclusions
The FGS, based on WV and DTNP, was proposed to evaluate remote sensing image segmentation. The innovation of this study is the use of the pixels in the neighborhood of each object to compute inter-segment heterogeneity. Compared with the existing methods, the performance is similar and the computational cost is greatly reduced. The experimental results show that DTNP can clearly reveal the process of increasing heterogeneity and that FGS can clearly reveal the change from over-segmentation to optimal segmentation and then to under-segmentation. Furthermore, compared with MI and BSH, DTNP remains stable and sensitive to both over-segmentation and under-segmentation. FGS has similar performance to JM and WM but the lowest computational cost. This advantage is more prominent when the UE method is applied to segmentation parameter optimization. The results in Sections 3.1 and 3.2 indicate that both homogeneity and heterogeneity play an important role in segmentation evaluation. This method can be effectively used in GEOBIA. In future research, more effective and less time-consuming unsupervised evaluation methods should be studied and applied to segmentation parameter optimization.

Conflicts of Interest:
The authors declare no conflict of interest.