1. Introduction
Remote-sensing images are generally ground images taken by sensing equipment mounted on air or space vehicles [1,2]. The spatial resolution of these images has increased steadily as remote-sensing sensors have advanced, and they greatly benefit a wide range of fields, including urban spatial planning, land surveying and environmental monitoring [3,4]. However, high-resolution remote-sensing images (HRSI) usually contain complex ground targets and objects at different scales [5,6,7]. This scale dependency presents a further challenge to HRSI classification. According to the classification strategy, existing methods can be divided into pixel-based and object-based approaches [8].
Pixel-based classification methods usually determine the category of each pixel from the grayscale information of image elements [9]. Depending on the level of automation, they can be categorized into supervised and unsupervised classification. Unsupervised methods include the iterative self-organizing data analysis algorithm [10], K-means [11,12] and fuzzy clustering [13]. Zhang et al. [14] proposed a K-means-based framework to learn effective feature representations for remote-sensing image classification. Common supervised methods include the maximum likelihood classifier [15], the multilayer perceptron [16], SVM [17] and the random forest classifier [18]. Dong et al. [19] combined conditional random fields, SVM and random forests into a multi-model fusion method for high-resolution remote-sensing applications. Although these pixel-based approaches are well suited to remote-sensing images with high spectral resolution and many wavebands, they still struggle to capture long-range contextual relationships between pixels.
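To make the supervised pixel-based strategy concrete, the following is a minimal numpy sketch of a Gaussian maximum likelihood classifier, one of the supervised methods cited above. The per-class band means and the two synthetic "water"/"vegetation" classes are illustrative, not taken from the paper:

```python
import numpy as np

def fit_gaussians(X, y):
    """Fit a mean and covariance per class from labeled training pixels."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularize
        params[c] = (mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1])
    return params

def classify_ml(X, params):
    """Assign each pixel to the class with the highest Gaussian log-likelihood."""
    classes = sorted(params)
    scores = []
    for c in classes:
        mu, cov_inv, logdet = params[c]
        d = X - mu
        maha = np.einsum("ij,jk,ik->i", d, cov_inv, d)  # Mahalanobis distance
        scores.append(-0.5 * (maha + logdet))
    return np.array(classes)[np.argmax(scores, axis=0)]

# Synthetic 3-band pixel samples for two hypothetical classes.
rng = np.random.default_rng(0)
water = rng.normal([0.1, 0.2, 0.6], 0.05, (200, 3))
veg = rng.normal([0.2, 0.6, 0.2], 0.05, (200, 3))
X = np.vstack([water, veg])
y = np.array([0] * 200 + [1] * 200)
pred = classify_ml(X, fit_gaussians(X, y))
acc = (pred == y).mean()
```

Note that such a classifier scores each pixel independently, which is exactly why, as stated above, long-range contextual relationships between pixels are not captured.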
Further, with the development of deep learning, semantic segmentation has been successfully applied to HRSI classification, for example through the fully convolutional network [20], U-Net [21] and DeepLab [22]. These data-driven approaches learn high-level pixel features and establish contextual relationships between pixels, compensating for the limitations of pixel classification based on hand-crafted features. In theory, such networks can fit the features of all pixels to be classified; in practice, however, they rely heavily on the richness of the training data. Because of its high spatial resolution, HRSI carries much richer detailed information (texture, geometric and spatial features) [23], so pixel-based methods are prone to the "salt-and-pepper" phenomenon and generate misclassification. To mitigate this, much research has sought to reduce the salt-and-pepper effect by increasing network depth, redesigning the network structure or adding Markov random fields for post-processing [24].
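A very simple way to see the post-processing idea is a majority (mode) filter over the predicted label map: isolated mislabeled pixels inside a uniform region are voted away by their neighborhood. This is only a sketch in the spirit of such smoothing, not the MRF method of [24]:

```python
import numpy as np

def majority_filter(labels, radius=1):
    """Replace each pixel's label by the most frequent label in its
    (2*radius+1)^2 neighborhood; edge pixels use the available window."""
    h, w = labels.shape
    out = labels.copy()
    for i in range(h):
        for j in range(w):
            win = labels[max(0, i - radius):i + radius + 1,
                         max(0, j - radius):j + radius + 1]
            vals, counts = np.unique(win, return_counts=True)
            out[i, j] = vals[np.argmax(counts)]
    return out

# A label map with isolated "salt-and-pepper" errors inside a uniform region.
labels = np.zeros((7, 7), dtype=int)
labels[2, 3] = 1  # isolated misclassified pixel
labels[5, 1] = 1
smoothed = majority_filter(labels)
```

In this toy example both isolated errors are removed, at the cost of also smoothing away any genuinely small objects, which is why more structured models such as MRFs are preferred in practice.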
Compared with pixel-oriented approaches, object-oriented methods are now widely applied to HRSI classification [25]. They adopt a segment-then-classify strategy, which alleviates the salt-and-pepper effect of pixel-based methods. Kim et al. [26] analyzed the roles of texture, scale and objects in the classification of high-resolution aerial images, and showed that an object-based multi-scale classification algorithm achieved the highest accuracy. Ma et al. [27] analyzed the factors affecting the accuracy of object-oriented classification and found that the optimal segmentation scale depends on the image spatial resolution and the study area, and that random forests perform better in object-based classification. Zheng et al. [28] proposed an object-oriented Markov random field model that builds a weighted-region neighborhood graph from region size and edge information as feature information, and then achieves semantic segmentation by probabilistic inference over the random field. Nevertheless, although these object-based approaches improve classification accuracy, they still hardly meet the accuracy requirements of HRSI classification.
Currently, many studies integrate object-based segmentation with deep neural networks for HRSI classification. These methods avoid complex hand-designed features and improve classification accuracy. Hong et al. [29] proposed a depth-feature-based remote-sensing image classification method using multi-scale object analysis and a convolutional neural network (CNN). Zhou et al. [30] proposed a fine-grained functional-area classification method based on segmented objects and CNNs, combined with frequency statistics to identify the functional classes of basic units. Although these object-based methods achieve higher accuracy with deep networks, the segmentation scale is difficult to determine because of the fixed network output size, which easily causes over-segmentation or under-segmentation.
Superpixel segmentation, which groups neighboring pixels into uniformly distributed irregular blocks, has also been successfully applied to HRSI classification. Lv et al. [31] proposed a deep-learning method based on a CNN and energy-driven sampling for HRSI classification. Li et al. [32] adopted a deep neural network for the standardized segmentation of objects in HRSI classification. Such superpixel-based methods can effectively delineate and map the features of high-spatial-resolution images. However, they perform feature extraction only at multiple fixed scales, which increases both information redundancy and the non-separability between features.
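For readers unfamiliar with superpixels, the following stripped-down sketch conveys the core idea behind SLIC-style methods: grid-initialized k-means in a joint (color, position) space, where a weight m trades spatial compactness against color homogeneity. It is a simplified illustration with arbitrary parameters, not the algorithm of [31] or [32]:

```python
import numpy as np

def simple_superpixels(img, grid=4, m=10.0, iters=5):
    """Grid-initialized k-means in (color, xy) space -- a stripped-down
    SLIC-style segmentation. m weights spatial vs. color distance."""
    h, w, bands = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    s = max(h, w) / grid  # approximate superpixel spacing
    feats = np.concatenate(
        [img.reshape(-1, bands),
         (m / s) * np.stack([ys.ravel(), xs.ravel()], axis=1)], axis=1)
    # Place the initial cluster centers on a regular grid.
    cy = np.linspace(0, h - 1, grid).astype(int)
    cx = np.linspace(0, w - 1, grid).astype(int)
    centers = feats.reshape(h, w, -1)[np.ix_(cy, cx)].reshape(-1, feats.shape[1])
    for _ in range(iters):
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(len(centers)):
            if (labels == k).any():
                centers[k] = feats[labels == k].mean(0)
    return labels.reshape(h, w)

rng = np.random.default_rng(1)
img = rng.random((16, 16, 3))  # tiny synthetic 3-band image
seg = simple_superpixels(img)
```

Production code would normally use an optimized implementation (e.g., the SLIC routine in an image-processing library) rather than this brute-force assignment, which is quadratic in image size.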
Figure 1 shows the current deep neural network approach using superpixels: around each superpixel block, patches at several scales are sampled and used simultaneously as the input to feature extraction for classifying that block. This approach yields only a category for each block, without capturing the long-range dependencies between superpixel blocks.
Although some progress has been made in deep learning for HRSI classification based on superpixel segmentation, two main problems remain. (1) Semantic scale. HRSI usually contains features at different scales, and using a fixed-scale image range as input increases the network's burden of representing heterogeneous superpixel objects at different scales. (2) Long-distance dependence. A superpixel object only contains homogeneous pixels within a small range; because of the phenomena of identical objects with different spectra and different objects with identical spectra, and the long-distance dependencies between surrounding objects, the class of an object cannot be determined from a single superpixel alone.
To tackle these problems, this paper proposes a long-distance-dependent deep neural network for HRSI classification. The main contributions are as follows. (1) For the semantic scale, we propose a multi-channel, all-inclusive shared deep neural network that accounts for the multiple scales of different superpixel objects. A larger range of superpixel objects is used as input, and each object serves as a feature-extraction unit, which enhances the contribution of each segmented object to classification. (2) For long-distance dependence, we design a deep neural network with long-range dependencies. A mesh of contextual correspondences between input objects is established, and contextual dependencies between distant surrounding objects are incrementally strengthened while the class of each superpixel object is determined.
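The paper's actual network is defined later; as orientation only, the untrained numpy sketch below illustrates the two ideas in the contributions, namely a shared projection applied to every object channel and a recurrent pass that accumulates context across objects. All shapes, weights and names here are hypothetical placeholders, not the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: N segmented objects, each summarized by an F-dim vector.
N, F, H = 9, 8, 16
objects = rng.normal(size=(N, F))

# (1) One projection shared by every object channel, so each segmented
#     object contributes features through identical weights.
W_shared = rng.normal(size=(F, H)) * 0.1
feats = np.tanh(objects @ W_shared)           # (N, H) per-object features

# (2) A minimal recurrent pass over the object sequence: the hidden state
#     carries context forward, so later predictions are conditioned on
#     increasingly distant objects (the long-range dependency idea).
W_in = rng.normal(size=(H, H)) * 0.1
W_rec = rng.normal(size=(H, H)) * 0.1
h = np.zeros(H)
states = []
for t in range(N):
    h = np.tanh(feats[t] @ W_in + h @ W_rec)  # context accumulates here
    states.append(h)
states = np.stack(states)                     # (N, H): one state per object

# (3) A per-object classifier head over the context-aware states.
n_classes = 5
W_out = rng.normal(size=(H, n_classes)) * 0.1
pred = (states @ W_out).argmax(1)             # one class per superpixel object
```

Unlike the single-object pipeline of Figure 1, every input object receives a class label here, and each label depends on the objects processed before it.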
4. Experiment and Analysis
In this section, we first perform a qualitative and quantitative comparison with competitive methods (OBIA-SVM, Superpixel-DCNN and DeepLab v3 [35]) on two test images to validate the performance of the proposed method. OBIA-SVM is implemented in eCognition Developer 9.0; for GF and QB, the optimal scale, shape and compactness obtained by human visual judgment are set to 60, 0.6, 0.8 and 60, 0.85, 0.9, respectively. Superpixel-DCNN and the proposed method both use SLIC for superpixel segmentation; Superpixel-DCNN uses the network structure in Table 2 for comparison. DeepLab v3 follows the parameter settings of [35] for training and testing on the dataset.
The classification accuracy of each category, the overall accuracy (OA) and the Kappa coefficient are used as evaluation metrics. Taking binary classification as an example, with TP, FP, FN and TN the entries of the confusion matrix, the detailed calculation is given below:

OA = (TP + TN) / T,

where T is the total number of pixels in the accuracy assessment, and

Kappa = (OA − p_e) / (1 − p_e), with p_e = [(TP + FP)(TP + FN) + (FN + TN)(FP + TN)] / T²,

where p_e is the chance accuracy.
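These metrics follow directly from the confusion counts; a minimal sketch with an arbitrary example confusion matrix:

```python
def binary_oa_kappa(tp, fp, fn, tn):
    """Overall accuracy and Kappa from a binary confusion matrix.

    T is the total pixel count; p_e is the chance accuracy
    [(TP+FP)(TP+FN) + (FN+TN)(FP+TN)] / T^2.
    """
    T = tp + fp + fn + tn
    oa = (tp + tn) / T
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / T ** 2
    kappa = (oa - p_e) / (1 - p_e)
    return oa, kappa

# Illustrative counts, not taken from the experiments.
oa, kappa = binary_oa_kappa(tp=40, fp=10, fn=5, tn=45)
```

Kappa discounts agreement expected by chance, which is why it is consistently lower than OA in the tables that follow.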
4.1. Classification Results
Classification results. The quantitative comparison between the proposed method and the competitive methods on the two images is provided in Table 3. For GF, the proposed method obtains the highest classification accuracy, with an OA of 0.79 and a Kappa of 0.76; the (OA, Kappa) values for OBIA-SVM, Superpixel-DCNN and DeepLab v3 are (0.57, 0.51), (0.65, 0.62) and (0.70, 0.68), respectively. The per-category accuracies further demonstrate the effectiveness of the proposal. The accuracy of buildings, roads, vegetation, water and bare soil increased by 0.15, 0.41, 1.05 and 0.34 compared to OBIA-SVM, with vegetation and water increasing most significantly. Compared to Superpixel-DCNN, the accuracy of roads increased dramatically by 0.15, and the accuracy of buildings, vegetation, water and bare soil increased by 0.17, 0.25 and 0.23, respectively. Compared to DeepLab v3, the proposal improved considerably for buildings (0.22) and slightly for roads, vegetation, water and bare soil (0.08, 0.08, 0.13 and 0.13).
As shown in Table 4, the proposal also obtained the highest classification accuracy on QB, with OA and Kappa higher than on GF: the OA of the proposed method is 0.92 and the Kappa is 0.89. Compared to OBIA-SVM, Superpixel-DCNN and DeepLab v3, the (OA, Kappa) of the proposed method increased by (0.48, 0.62), (0.26, 0.21) and (0.19, 0.20), respectively. The overall ranking of classification accuracy is OBIA-SVM < Superpixel-DCNN < DeepLab v3 < the proposed method. The per-category performance on QB also proved the method's effectiveness. The accuracy of each category was much higher than for OBIA-SVM, at 0.91, 0.90, 0.89, 0.93 and 0.94, respectively. The proposal improved more considerably than Superpixel-DCNN for roads and water, by 0.20 and 0.13, respectively, whereas the accuracy for buildings and vegetation increased less. For buildings, roads, woodland, vegetation and water, the accuracy increased by 0.21, 0.20, 0.30 and 0.08 relative to DeepLab v3, respectively.
Discussion and analysis. The proposed method achieved higher classification performance than OBIA-SVM, Superpixel-DCNN and DeepLab v3 on both images. OBIA-SVM follows a segment-then-classify strategy for remote-sensing images, but it is limited by the selection of segmentation parameters and by the feature representation of the examined objects, resulting in classification outcomes noticeably inferior to those of the other approaches, as shown in Figure 8 and Figure 9. Although OBIA-SVM could obtain fine boundaries for specific categories such as buildings and roads, its SVM feature representation caused confusion between categories; for example, roads on QB were misclassified as buildings, and water bodies were misclassified as vegetation. Superpixel-DCNN classified each segmented object with a deep neural network after applying superpixel segmentation to obtain precise boundaries, and was therefore more accurate than OBIA-SVM. However, since Superpixel-DCNN performed feature extraction on only one segmented object at a time and lacked contextual information between objects, it achieved lower classification accuracy on the water in GF and the roads in QB. This was owing to the large scale of these objects, the difficulty of characterizing the linear shape of roads from a single segmentation object, and the lack of information transfer between neighboring objects in Superpixel-DCNN. DeepLab v3 is an end-to-end semantic segmentation network based on deep neural networks, yet it had lower classification accuracy than the proposed method on small samples, because it classifies each individual pixel and requires a large amount of training data. DeepLab v3 also suffered from the phenomena of identical objects with different spectra and different objects with identical spectra in high-resolution remote-sensing images, which further lowered its accuracy.
The superpixel-based proposal broadened the semantic range of the data input and could identify long-distance semantic relationships between numerous segmented objects, which enabled our proposed method to correctly classify large-scale features such as water in GF and roads in QB. Additionally, objects were input into CNN and LSTM networks, which not only gathered high-level features of objects, but also increased the effectiveness of classification.
Finally, classification speed is also important. Since OBIA-SVM, Superpixel-DCNN and the proposed method all use a segment-then-classify strategy, the time spent searching for optimal segmentation parameters is difficult to quantify. For a fair comparison, we therefore tested the processing speed (sec/sample) of the classification models on the test images. All evaluations were run on an Intel(R) i7-7700 CPU and a DUAL-RTX 2070-O8G-EVO GPU.
Table 5 compares the processing times of the different methods. Although Superpixel-DCNN, DeepLab v3 and the proposed method spent more time on the CPU, their compatibility with the GPU made them the fastest in processing speed.
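A sec/sample measurement of the kind reported in Table 5 can be taken with a wall-clock timer around the per-sample inference loop. The harness below is an illustrative stand-in (the `classify` callable is a placeholder, not the authors' model):

```python
import time

def seconds_per_sample(classify, samples, repeats=3):
    """Best-of-`repeats` average wall-clock time per sample for a
    classification callable."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for s in samples:
            classify(s)
        best = min(best, time.perf_counter() - start)
    return best / len(samples)

# Placeholder workload standing in for per-superpixel classification.
rate = seconds_per_sample(lambda s: s * 2, list(range(1000)))
```

Taking the best of several repeats reduces the influence of transient system load on the measurement.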
4.2. The Effect of Semantic Range on Classification Accuracy
The semantic range was quantified as the number of input segmentation objects. Superpixel segmentation objects were introduced in Section 3.2; they are selected by circles of various sizes. To investigate the impact of different semantic ranges on classification accuracy, the number of input objects (N) was set to 9, 12, 16, 20, 25 and 36. The overall classification maps for GF and QB under the various semantic ranges are displayed in Figure 10. At smaller semantic ranges, the classification maps for roads and water bodies in GF exhibited discontinuity, i.e., misclassification. Plane-shaped buildings and vegetation displayed very good color continuity, with no essential change in classification accuracy. On GF, bare soil, which has an overall linear-and-plane shape, was mixed with buildings, but was increasingly classified correctly as the semantic range widened. For QB, the classification maps exhibited a similar phenomenon. QB contains a large number of homogeneous regions, such as vegetation and water, whose textures are less complex; these maintained excellent classification accuracy even at a constrained semantic range. Since roads and buildings on QB are of a linear type, the classification map showed increasing class-color constancy as the semantic range expanded. Woodland on QB is plane- and point-like, and its classification differed dramatically from the maps at smaller semantic ranges.
Figure 11 depicts the accuracy of each category on the two test images. Overall, as the semantic range widened, the classification accuracy of each category tended to increase steadily, with different categories showing different patterns; the results are consistent with the conclusions above. As the semantic range widened, the classification accuracy for roads and water in GF increased from 0.38 to a maximum of 0.79, with the most significant growth between N = 9 and N = 16. For buildings and vegetation, accuracy increased only slightly. For bare soil, the accuracy trend lay between that of the linear features (roads, water) and the plane features (buildings, vegetation). Roads in both QB and GF followed the same pattern: accuracy increased with the semantic range and peaked between N = 20 and N = 25. Although most buildings in QB are of a linear type, their classification accuracy was significantly affected by the semantic range. The accuracy curves for vegetation and water were nearly constant, so these objects could be classified accurately within a limited semantic range; the accuracy for vegetation on QB reached its highest level in the semantic range of 16–25.
Result analysis. Objects with linear shapes require images with a larger semantic range. This may be because there are few objects of the same linear category within a given semantic range, leading to missing long-distance dependencies between the obtained objects and thus to misclassification. We designed the semantic range by circularly selecting segmentation objects in the plane as input, which largely explains why the classification of plane objects was less affected by the semantic range: strong dependencies between plane segmentation objects allow the network to construct a heterogeneous representation between them. In addition, the homogeneity of plane objects has more influence than the semantic range, as is the case for the highly homogeneous vegetation and water in QB. Conversely, since different buildings in GF are affected by mixed image elements, they are heterogeneous and require a wider semantic range for classification.
4.3. Ablation Studies for Network Configuration
An ablation study of the long-range dependency network was conducted to evaluate the effect of combining various network operations and to assist in designing the classification network. The classification network consists of a multi-channel convolutional network and an LSTM, whose primary function is to identify the long-distance dependencies among the segmented input objects. Consequently, the multi-channel convolutional network serves as the base network (MCCB). To evaluate classification performance, the MCCB and LSTM networks were combined either with or without configured longitudinal connections (YL, NL). Additionally, we compared the impact of different LSTM layer counts (l = 3, l = 4, l = 5) on the dependencies between the input objects. The detailed network combinations and classification results are shown in Table 6. For each combination, the training variables (epochs, learning rate, batch size) were held fixed. The effectiveness of each configured network was evaluated using the per-category classification accuracy on the test images.
The experimental results show that MCCB performed significantly worse than MCCB + NL, MCCB + YL (l = 3), MCCB + YL (l = 4) and MCCB + YL (l = 5) across all categories. Although the input objects in MCCB share the same parameters, MCCB lacks the contextual relationship between objects, which leads to lower classification accuracy. The classification performance of MCCB + NL was only slightly higher than that of MCCB overall, because MCCB + NL merely extends the base network with a laterally connected LSTM and does not extract information dependencies between the input objects; its overall performance was therefore lower than that of MCCB + YL.
MCCB + YL (l = 3), MCCB + YL (l = 4) and MCCB + YL (l = 5) establish long-range dependencies for the input objects and achieved higher classification accuracy overall. Comparing the results of MCCB + YL (l = 4) and MCCB + YL (l = 5), increasing the number of network layers did not significantly increase classification accuracy, indicating that the number of LSTM layers has relatively little impact. Classification accuracy was dominated mainly by the longitudinally connected LSTM.