1. Introduction
Semantic image segmentation, which aims to classify each pixel into one of the given categories, is an important task for understanding [
1,
2,
3] and inferring objects [
4,
5,
6] and their observed relations in a scene. As a bridge towards high-level tasks, semantic segmentation is adopted in various applications in computer vision and remote sensing areas, such as autonomous vehicle driving [
2,
7,
8], human pose estimation [
9,
10,
11], remote sensing image interpretation [
12,
13,
14,
15,
16], and 3D reconstruction [
17,
18,
19]. Over the last five years, remarkable success in the semantic scene labeling area has been gained through the usage of convolutional neural networks (CNNs) [
20,
21,
22,
23,
24,
25,
26] in dense prediction. This success is naturally attributed to fully convolutional neural networks (FCNs), which can express complex input–output relationships and integrate efficiently into an end-to-end learning framework.
Generally, recent semantic segmentation methods have often been formulated to convert the architecture of existing CNNs to FCNs [
22,
23,
27,
28,
29]. Coarse pixel-wise labeling is obtained by multi-scale and dilation strategies, whereas the fine segmentation is conducted by optionally integrating contextual information into the output map. Although active research has been conducted on these aspects, semantic image segmentation remains a challenging issue because of the complexity of balancing contextual information and pixel-level accuracy [
24,
26,
29,
30,
31]. Contextual relationships model the interactions between predicted labels and provide structured cues for dense prediction. In addition, various approaches in formulating compatible relations within contextual information have been proposed for performance improvement. A dominant paradigm for modeling contextual relationships advocates the use of the conditional random field (CRF), which computes unary and pairwise potentials for further refinement, on top of CNNs [
25,
26,
32]. By combining CRF and FCNs, the interactions between the predicted labels and the contextual information are well balanced. A few of these approaches utilize the pairwise or higher-order CRF [
33,
34] as a post-process on FCN output to preserve sharp boundaries, while others formulate pixel-wise labeling problems with the CRF in conjunction with FCNs [
26,
35] in a unified framework and train in an end-to-end manner.
These leading approaches perform dense prediction in a discrete domain and hence end up learning approximate mean-field inference or graph model optimization in a fixed number of iterations. However, these methods require additional aids and do not guarantee that the inference process converges to a global or even a local optimum [
26,
35]. Therefore, expressive power might be lost if the uncertainty of the predicted labels increases with each iteration.
In this paper, we propose a novel approach to address the issues mentioned above. In contrast to the approaches optimized in the discrete domain, we formulate pixel-wise labeling as a special case of the manifold ranking (MR) problem in a continuous domain on top of CNNs. Motivated by [
36,
37,
38,
39], we observe that the MR model, as a type of graphical model, has a unique globally optimal solution and is guaranteed to converge. Moreover, the global optimum can be obtained efficiently by solving a linear equation. Unlike the Gaussian graphical models [
26,
35] that are performed in unary and pairwise streams in the sub-networks, we use the embedded manifold ranking optimization method only on a single stream by constructing the Laplacian matrix generated from possible pairs of vertices.
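For reference, the linear solve mentioned above can be written out under the standard MR formulation of Zhou et al. This is a sketch in generic notation, which is an assumption here and may differ from the symbols used in the method section: $W$ is a symmetric affinity matrix with entries $w_{ij}$, $d_i = \sum_j w_{ij}$, $y$ is the query vector, and $\mu > 0$ is the fitting weight. Constant factors are chosen for a consistent derivation and do not affect the ranking.

```latex
% Standard manifold ranking in generic notation (assumed, not the paper's own symbols)
\begin{align}
Q(f) &= \frac{1}{2}\sum_{i,j=1}^{n} w_{ij}
        \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right)^{2}
        + \mu \sum_{i=1}^{n} (f_i - y_i)^{2}
      = f^{\top}(I - S)\,f + \mu\,\lVert f - y \rVert^{2},
      \qquad S = D^{-1/2} W D^{-1/2}, \\
\nabla Q(f) &= 2\,(I - S)\,f + 2\mu\,(f - y) = 0
      \;\Longrightarrow\; (1+\mu)\,f - S f = \mu\,y, \\
f^{*} &= \beta\,(I - \alpha S)^{-1} y,
      \qquad \alpha = \frac{1}{1+\mu}, \quad \beta = \frac{\mu}{1+\mu}.
\end{align}
```

Because the eigenvalues of $S$ lie in $[-1, 1]$ and $0 < \alpha < 1$, the matrix $I - \alpha S$ is positive definite, so the minimizer is unique and is obtained with a single linear solve.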
Numerous strategies without CRF optimization have been established to improve the semantic segmentation accuracy in the FCN or deconvolution manner, and each of them has its own strengths [
25,
26,
27,
29,
35,
40]. To exploit these advantages, we propose a framework called the dual multi-scale manifold ranking (
DMSMR) network to estimate the predicted labels in an end-to-end fashion. At each scale, the dilated and non-dilated convolution layers are jointly optimized by MR. With the dual multi-scale contextual information, the combined results achieve competitive accuracy without any additional aids. An overview of our proposed approach is illustrated in
Figure 1.
We conduct experiments on high spatial resolution remote sensing and close-range images to validate the effectiveness of the proposed approach. Both high spatial resolution remote sensing and close-range images are rich in details, such as texture and color information. The close-range images can be viewed as a special kind of high-resolution images and can guide us to find better CNN architectures to deal with high-resolution remote sensing images. In summary, the main contributions of our work are as follows:
(1) Multi-label MR graphical model for semantic segmentation. Unlike existing approaches that utilize the CRF as a post-processing step or as approximate inference in the discrete domain, we propose to model the MR method for semantic segmentation in a continuous domain. Our model is optimized end-to-end, can be solved linearly, and guarantees a globally optimal solution.
(2) Embedded feedforward single stream optimization method. In contrast to Gaussian graphical models, we propose an embedded single stream technique that requires only the Laplacian matrix obtained from pairs of vertices, which makes the gathering of the low-level cues as the contextual information more efficient.
(3) Dual multi-scale manifold ranking network. We adopt the multi-scale strategy to construct the dual dilated and non-dilated networks and jointly optimize them with MR in a unified framework for semantic image segmentation. Our model is the first to back-propagate through manifold ranking and integrate it into a deep learning architecture in the area of remote sensing.
2. Related Work
In the past decade, convolutional networks have been driving advances in object recognition. Consequently, many semantic segmentation methods conduct dense prediction based on CNNs in both the computer vision and remote sensing areas.
In [
21,
41,
42], each semantic object is refined from region proposals by CNN features. In contrast to these instance-aware methods, Mostajabi et al. [
20] and Dai et al. [
43] sought to preserve the shape information for dense labeling from superpixel-wise proposal segments. Unlike these approaches, Farabet et al. [
44] trained on the entire image with a multi-scale strategy and labeled each pixel with the category of the object to which it belongs. A remarkable breakthrough was recently made by Shelhamer et al. [
22]. In their approach, the contemporary classification networks are converted into fully convolutional networks (FCNs) and the fully connected layers in standard CNNs are viewed as convolutional layers with large receptive fields. Yu et al. [
23] added a dilated module to FCNs to further broaden the receptive field of the convolution layers. Instead of adopting the “convolution by pooling” schema used in classification, they applied a dilated rectangular prism to the convolution layer to preserve the receptive field. Similar strategies were proposed by Chen et al. [
24,
45] in the DeepLab framework. With the “hole” algorithm, a fast dense prediction is allowed on modern GPUs. More recently, Bearman et al. [
46] exploited a point-wise annotation for semantic segmentation, which creatively makes a better trade-off between training annotation cost and accuracy. In the area of remote sensing, Camps-Valls and Romero et al. [
47,
48] proposed the use of greedy layer-wise unsupervised pre-training that learns sparse features for remote sensing image classification. Tschannen et al. [
49] introduced structured CNNs that employ Haar wavelet-based trees to identify the semantic category of every pixel in a remote sensing image. Piramanayagam et al. [
50] further exploited a multi-path CNN that supports both true orthophoto and digital surface model (DSM) inputs for land cover classification. Marcu et al. [
51] presented a dual-path network, comprising a VGG-Net path and an AlexNet path, to learn local and global representations of aerial images. Yuan et al. [
52] also adopted a dual clustering approach to select optimal bands for hyperspectral remote sensing images. A few of these approaches are derived from the basic FCN model and utilize different strategies, such as multi-scale pyramid pooling, dilated convolution, dual-path representations and symmetric structures, to improve the inner stability of CNNs. Nevertheless, these networks still need to be properly initialized from a pre-trained model or rely on additional aids, and they may lack contextual information.
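To make the effect of dilation on the receptive field concrete, the following minimal sketch computes the span covered by a dilated kernel and applies a 1-D dilated filter explicitly. The function names and the toy signal are illustrative assumptions and are not taken from any of the cited implementations.

```python
import numpy as np

def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def dilated_conv1d(x, w, dilation=1):
    """'Valid' 1-D dilated cross-correlation, written out explicitly."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    span = effective_kernel_size(len(w), dilation)
    out_len = len(x) - span + 1
    # Each output sample reads len(w) inputs spaced `dilation` apart.
    return np.array([
        sum(w[k] * x[i + k * dilation] for k in range(len(w)))
        for i in range(out_len)
    ])

x = np.arange(10.0)                 # toy ramp signal
w = np.array([1.0, 0.0, -1.0])      # 3-tap difference filter
for d in (1, 2, 4):
    print(d, effective_kernel_size(3, d), dilated_conv1d(x, w, d))
```

The number of weights stays at three in every case, while the span grows from 3 to 5 to 9 samples, which is why dilation broadens the receptive field without pooling or extra parameters.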
As special extensions to basic FCNs, the symmetric encoder/decoder structures are further exploited by numerous recent approaches. The symmetric structures are able to delineate finer details of the upsampled output. In [
27,
53], Kendall and Badrinarayanan et al. presented a novel semantic pixel-wise segmentation architecture called SegNet. The architecture comprises an encoder that corresponds to the 13 convolutional layers in the VGG-16 [
54] model and a decoder that maps the final features up to the full original image resolution. A similar schema was proposed by Hong and Hyeonwoo et al. [
28,
55]. The deconvolution network is composed of convolution and unpooling layers, thereby mitigating the limitations of the existing methods based on FCNs and handling the object in multi-scale space. Such symmetric structures were also applied to remote sensing image processing. Audebert et al. [
56] exploited the symmetric encoder-decoder structure to detect, segment and classify different varieties of wheeled vehicles from aerial images. Huang et al. [
57] further presented two symmetric encoder-decoder structures to fine-tune the networks from RGB and NRG bands. Audebert et al. [
58] combined SegNet with an SVM to generate the geometrically corrected orthophoto. These symmetric structures reduce possible loss in the unpooling procedure of CNNs. However, these approaches may suffer from GPU memory bottlenecks and limited embedding of contextual information when trained on remote sensing images.
To overcome the above issues, various recent approaches use discrete CRF models on top of CNNs. The CRF is an effective optimization method that can further boost the performance of semantic segmentation. By exploiting more contextual information, the relationship between rough segments and their surrounding pixels can be inferred. In [
32], dense CRF [
33,
40] was proposed for the first time to improve accuracy by utilizing CRF as a post-process with more contextual information for fine predictions on top of CNNs. To make better use of contextual cues, Lin et al. [
29] exploited an efficient “patch-patch” and “patch-background” schema to improve the performance by the CRF optimization framework. Unlike [
24], Zheng et al. [
25] introduced mean-field approximate inference for the CRF, which combines the advantages of CNNs and CRFs and is easily incorporated into CNNs. Furthermore, Vemulapalli et al. [
35] and Chandra et al. [
26] proposed the use of a simple Gaussian conditional random field (G-CRF) for the task of structured prediction. In [
59], CNN features and hand-crafted features were combined to parse remote sensing images. Alam et al. [
60] further introduced a framework that incorporated mean-field CRF inference and performed superpixel-level labeling on remote sensing images. Sherrah [
60] exploited the effectiveness of CRF post-processing approaches on top of CNNs and analyzed the major differences between close-range and remote sensing images in terms of contextual information. However, these methods either serve as a post-process or end up with a mean-field approximation and do not guarantee a global optimum.
Hence, we combine CNNs with the MR method, which guarantees a global optimum in a unified framework without additional aids. The multi-scale and dilated convolution strategies are also incorporated on top of CNNs to better delineate visual objects in remote sensing images. The MR method presented in [
36,
37,
39] is an effective graph-based ranking method that aims to find the underlying cluster or manifold structure in the given dataset. Given a query, MR ranks the neighboring data points by their relevance to the query. Unlike the CRF, the optimal ranking is obtained by solving a linear system built from the Laplacian matrix [
61] of the neighboring contextual information, which guarantees a globally optimal solution in the continuous domain. Quan et al. [
62] exploited such characteristics and utilized an MR-based co-segmentation strategy to find the common objects contained in a set of relevant images. Wang et al. [
63] presented an effective approach for salient band selection for hyperspectral image classification via MR. They place the band vectors in a more accurate manifold space and treat the salient band selection problem from a ranking perspective. Moreover, the MR method has been applied to many other low-level vision tasks, such as saliency detection [
38,
64], image retrieval [
65,
66] and visual tracking [
67]. Considering that the semantic segmentation task also has a manifold structure, in which each pixel is first assigned several probabilities (ranking) that belong to the given categories (underlying clusters) and then the maximum probability is obtained from them, we apply the MR method embedded in CNNs to exploit the efficient global optimal solution to semantic segmentation. Combined with dilated, multi-scale strategies, the MR method, which can further establish the foundation of the dense prediction task in an end-to-end manner, is introduced into this field.
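The "rank each category, then take the maximum" view described above can be sketched with the standard closed-form MR solution applied to every class at once. This is a minimal illustration, not the layer proposed in this paper: the function name `mr_label`, the toy chain graph, the initial scores, and the value of `alpha` are all assumptions.

```python
import numpy as np

def mr_label(W, Y, alpha=0.9):
    """Rank every class with closed-form manifold ranking, then take the argmax.

    W: (n, n) symmetric, non-negative affinity matrix between pixels.
    Y: (n, C) initial per-class scores, e.g. softmax outputs of a CNN.
    Every node is assumed to have at least one neighbour (non-zero degree).
    """
    W = np.asarray(W, dtype=float)
    Y = np.asarray(Y, dtype=float)
    d = W.sum(axis=1)                       # node degrees
    s = 1.0 / np.sqrt(d)
    S = W * s[:, None] * s[None, :]         # D^{-1/2} W D^{-1/2}
    # One linear solve yields the rankings of all classes simultaneously.
    F = np.linalg.solve(np.eye(len(d)) - alpha * S, Y)
    return F, F.argmax(axis=1)

# Toy example: 4 pixels on a chain with hypothetical two-class scores.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y = np.array([[0.90, 0.10],
              [0.45, 0.55],
              [0.20, 0.80],
              [0.10, 0.90]])
F, labels = mr_label(W, Y)
print(F)
print(labels)
```

Each column of `F` is the ranking of all pixels with respect to one category, and the per-pixel label is the category with the highest rank. No iterative inference is involved.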
5. Experiments
We have devised two groups of experiments on high resolution datasets, including close-range images (PASCAL VOC dataset and CamVid dataset) and remote sensing images (ISPRS Vaihingen dataset and EvLab-SS dataset), to validate the effectiveness of our model and find the approach that can be potentially applied to remote sensing image processing. For fair evaluation, the first group, which includes the PASCAL VOC dataset [
72] and ISPRS Vaihingen dataset [
73], is designed for comparison with a few recent state-of-the-art methods whose results are publicly available online. In this group, we evaluate our model by submitting the results to the evaluation server, because the ground truth of the testing images is not available to researchers. The second group, which includes the CamVid dataset [
68,
69] and the EvLab-SS dataset (See
Section 5.2.2), is used to evaluate the capacity of the proposed
DMSMR approach by comparing the methods that employ only one of the three strategies, namely, multi-scale convolution (
MS), broader receptive field (
Dilated) and MR optimization (
MR-Opti) approaches. The detailed structures of the network with different strategies are explained in the
Appendix (See
Figure A1 and
Table A1).
In our
DMSMR model, the first five blocks are developed from the standard VGG-16 [
54] structures, which comprise dilated and non-dilated convolutional layers. The dilation kernel sizes are 6, 4, 2, 2, and 1 pixels. For each scale, the pooling layer is followed by the non-dilated layers, which comprise three convolutional layers. The parameters of our implementation are shown in detail in
Table 1. The dilated and non-dilated layers are optimized with single stream manifold ranking algorithm and fused by Equation (
17). The structure is illustrated in
Figure 1. In the table and figure, the “ReLU” activation function [
74] is implicitly employed in each convolutional layer. In our model, all layers are randomly initialized without using the pre-trained VGG-16 model. The hyper-parameters, such as learning rate, momentum and weight decay, are confirmed via cross validation. The entire network is trained in an end-to-end manner using the SGD algorithm.
and
in Equation (
11) are both set to
as in [
32] in our experiments.
The proposed architectures are implemented using Caffe [
75] on a Windows 7 x64 platform with an Intel i7-4790 CPU @ 3.6 GHz and a single GeForce GTX 1070 GPU (8 GB RAM). Our model requires only 5523 MB of GPU memory. The source code is implemented in C++ and the model is publicly available at
http://earthvisionlab.whu.edu.cn/zm/SemanticSegmentation/index.html.
5.1. Experiment on Close-Range Dataset
As a special kind of high resolution image, close-range imagery is rich in details. Many of the recent breakthroughs [
12,
13,
14,
49,
50,
76] in the remote sensing area used pre-trained models on this kind of high resolution images. We adopt the PASCAL VOC dataset [
72] and the CamVid dataset [
68,
69] for training and testing to evaluate the proposed approach on close-range images. The PASCAL VOC dataset is a gold-standard benchmark for semantic segmentation evaluation. Meanwhile, the CamVid dataset comprises a small number of training images and is a reasonable choice for evaluating the intrinsic capacity of networks that employ different strategies.
5.1.1. Evaluation on PASCAL VOC
The PASCAL VOC 2012 segmentation dataset comprises 20 object classes and one background class with 1464, 1449 and 1456 images for training, validation and testing, respectively. In our experiment, we use the extra annotations provided by [
77], thus obtaining a total of 10,582 augmented training images [
77,
78]. For our model, we resize the images to
pixels as in DeepLab model [
24] and evaluate the model by remotely submitting the predictions to the test server (Our result on PASCAL VOC dataset is available at
http://host.robots.ox.ac.uk:8080/leaderboard). The evaluation metric is the standard Intersection-over-Union (IoU) averaged across the 21 classes. In our experiment, we train the model with the initial learning rate, momentum and weight decay 1e-9,
and
, respectively. The momentum and weight decay terms are utilized as suggested in FCNs framework [
22]. In addition, the learning rate is confirmed via cross validation. The initial parameters for smoothness coefficients
and
are set to 3 and 5, respectively. The drop-out layers are removed in our proposed approach. Our network converges after 60,000 iterations with a mini-batch size of 8.
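As an illustration of the evaluation metric, the following sketch computes the class-averaged IoU and the overall pixel accuracy from a confusion matrix. It is not the benchmark server's code; the function names are hypothetical, and the use of 255 as an ignored "void" label is an assumption that follows the usual PASCAL VOC annotation convention.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes, ignore_label=255):
    """Rows index the ground-truth class, columns the predicted class."""
    gt = np.asarray(gt).ravel()
    pred = np.asarray(pred).ravel()
    keep = gt != ignore_label               # drop pixels without a valid label
    idx = gt[keep].astype(np.int64) * num_classes + pred[keep].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou(cm):
    """IoU_c = TP_c / (TP_c + FP_c + FN_c), averaged over classes that occur."""
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    valid = union > 0
    return (tp[valid] / union[valid]).mean()

def overall_accuracy(cm):
    """Fraction of valid pixels that are labeled correctly."""
    return np.diag(cm).sum() / cm.sum()

# Toy example with two classes and one ignored pixel.
gt = [0, 0, 1, 1, 255]
pred = [0, 1, 1, 1, 0]
cm = confusion_matrix(gt, pred, num_classes=2)
print(cm, mean_iou(cm), overall_accuracy(cm))
```

In practice the confusion matrix would be accumulated over all test images before the per-class ratios are taken, so that large and small images contribute in proportion to their pixel counts.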
Numerous methods have been applied to the PASCAL VOC 2012 dataset and achieve high accuracy. However, their complexity has been increasing owing to the gradual addition of aids, which unfortunately obscures the true performance of the deep architecture, as stated by Kendall et al. [
27]. Our work on this benchmark does not aim to obtain the top score using additional aids, such as CRF post-processing [
24], region proposal [
28], multi-stage inference [
25], and pre-trained model from other dataset (e.g., Microsoft COCO [
79]). Instead, we seek to improve the performance by applying three main strategies, which include multi-scale convolution, a broader receptive field, and a single stream MR optimization method, to jointly upgrade the intrinsic structure of the network. The multi-scale strategy has the advantage of deep architecture because the potential scale is implicitly expressed by a pooling layer in the CNN. The broader receptive field is captured by a dilated operation [
28], thus preventing the loss of resolution. By contrast, the feedforward single stream MR optimization method allows obtaining the optimal solution without the complicated inference procedure and can be trained in an end-to-end manner. Though we embed the feedforward MR optimization algorithm into the network, the optimal solution can be solved linearly rather than in a multi-stage inference schema.
Table 2 presents the results of the comparison to recent methods, and a few of the corresponding intuitive results are depicted in
Figure 3. In the table, we compare our method with several models that can potentially be applied to the remote sensing area. We choose the listed models rather than all top-scoring approaches for the following reasons. First, the model should utilize as few additional aids as possible. Additional aids can hide the true performance of a network and are not easily transferred to remote sensing applications. Several models in the table, such as FCN-8s [
22], DeconvNet [
28] and SegNet [
27], have been applied to process remote sensing images. Second, the selected model must have been tested on the PASCAL VOC 2012 server and must not duplicate previously listed methods. Algorithms, such as DeepLab [
24], CRF-RNN [
25], DilatedConv [
28], and G-CRF [
35], are milestones on the PASCAL VOC 2012 benchmark and satisfy these requirements. Third, training the model should not be too time-consuming, especially when dealing with remote sensing images, which are usually larger than close-range indoor/outdoor images. Recent state-of-the-art approaches, such as RefineNet [
80], employ ResNet-101 structures that may suffer from high GPU memory consumption and need support from the MS-COCO dataset. In the area of remote sensing, however, such large numbers of labeled samples are not available for training.
In
Table 2, the proposed
DMSMR performs significantly better (by approximately eight points on average) than similar methods without additional aids (methods without qualifying comments in
Table 2). This is because our method combines the dilated and multi-scale strategies and has characteristics that complement a few basic networks, such as SegNet [
27], dilated convolutional network [
28] and DeepLab-Msc [
24]. Compared to recent methods, such as CRF-RNN [
25] and G-CRF [
35], our method achieves a similar score by optimizing with a single stream MR algorithm in an end-to-end manner. However, our approach does not require multi-stage inference or training two streams (i.e., unary term and pairwise stream, with unary initialized by other networks). Furthermore, some approaches, such as DeepLab [
24], obtain worse results when they do not use all of the additional aids together with a pre-trained model. By contrast, our model yields superior results without these pre-trained weights.
5.1.2. Evaluation on CamVid
The CamVid dataset [
68,
69], which is captured from high-quality high-definition (HD) video sequences, is designed for road scene understanding. However, relatively few images are available for training. The dataset comprises 367 training images, 101 validation images and 233 testing images. The challenge data contains 11 semantic object classes, and the images are downsampled to
pixels.
The overall training parameter settings for this dataset are as follows. The learning rate, momentum and weight decay are set to 1e-3,
and
, respectively. The momentum and weight decay terms are utilized as suggested in FCNs framework [
22]. In addition, the learning rate is confirmed via cross validation. The proposed network is trained at the default resolution of
with a mini-batch size of 2. The initial values for
and
are set to 3 and 5, respectively, through cross validation. Our network converges after 40,000 iterations.
We employ the pixel mean intersection over union (mIoU) measurement with respect to the band width around the object boundaries as in [
24] on the CamVid benchmark to analyze the expressive power of the proposed
DMSMR network. The experimental results are illustrated in
Figure 4. The comparisons between the
DMSMR approach and the networks employing different strategies are reported in
Table 3. We also analyze the accuracy change with respect to boundary in
Figure 5. As shown in
Figure 5a, we consider a narrow band, that is, trimap [
81] boundary, on CamVid dataset. A trimap divides an image into three regions of foreground, background and unknown.
Figure 5b shows boundary accuracy as the trimap width is varied. In this experiment, we set the same parameters as those in the
DMSMR model but with different strategies as previously stated. The three strategies, namely, multi-scale convolution (
MS), broader receptive field (
Dilated) and manifold ranking optimization (
MR-Opti) approaches, are utilized for comparison. Obviously, different strategies yield different performance for each of the classes. The
MS and
Dilated approaches help boost the performance in the situation where color and texture are uniformly distributed. In addition, the
MR-Opti achieves a score that is approximately 2.5% better than those of the
MS and
Dilated methods because more contextual information is considered. The results demonstrate that the combination of
MS,
Dilated and
MR-Opti approaches is likely a better choice for the semantic segmentation task on close-range images.
Figure 5 shows that improving the recognition of pixels around the boundary helps delineate the object because the smoothness potentials of the correctly detected pixels increase. Additionally, as can be seen from
Table 3, the
DMSMR method outperforms the approaches that employ only one strategy, indicating that the
DMSMR approach can improve the semantic segmentation result further by combining these strategies in close-range situations.
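The boundary-band evaluation can be sketched as follows: a narrow band is grown around the ground-truth label boundaries, and the score is computed only on pixels inside it. This is a simplified illustration under stated assumptions: the helper names are hypothetical, the band uses a square (Chebyshev) neighbourhood, and pixel accuracy is reported for brevity where the experiments above use mIoU.

```python
import numpy as np

def boundary_band(gt, width):
    """Boolean mask of pixels within roughly `width` pixels of a label boundary."""
    gt = np.asarray(gt)
    edge = np.zeros(gt.shape, dtype=bool)
    # A pixel lies on a boundary if its label differs from a 4-neighbour.
    diff_v = gt[1:, :] != gt[:-1, :]
    diff_h = gt[:, 1:] != gt[:, :-1]
    edge[1:, :] |= diff_v
    edge[:-1, :] |= diff_v
    edge[:, 1:] |= diff_h
    edge[:, :-1] |= diff_h
    band = edge.copy()
    for _ in range(width - 1):              # grow the band by one pixel per step
        p = np.pad(band, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(band)
        for dy in range(3):                 # 3x3 binary dilation
            for dx in range(3):
                grown |= p[dy:dy + band.shape[0], dx:dx + band.shape[1]]
        band = grown
    return band

def band_accuracy(gt, pred, width):
    """Pixel accuracy restricted to the boundary band."""
    band = boundary_band(gt, width)
    return (np.asarray(pred)[band] == np.asarray(gt)[band]).mean()

# Toy example: two vertical regions, one mislabeled pixel next to the boundary.
gt = np.zeros((6, 6), dtype=int)
gt[:, 3:] = 1
pred = gt.copy()
pred[0, 2] = 1
for width in (1, 2, 3):
    print(width, boundary_band(gt, width).sum(), band_accuracy(gt, pred, width))
```

Sweeping `width` and plotting the resulting score gives a curve of the same kind as the accuracy-versus-trimap-width plots discussed above.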
5.2. Experiment on High Resolution Remote Sensing Dataset
Compared with close-range imagery, high resolution remote sensing images have a few special features that distinguish them from the indoor/outdoor close-range images commonly encountered in computer vision. High resolution remote sensing images are large and contain a potentially unlimited scene context (e.g., a road could pass through the entire image). In addition, the object scale in high resolution images varies dramatically when the training dataset is captured from different satellites (e.g., GF-1 with a spatial resolution of 2.1 m and QuickBird with a spatial resolution of 0.6 m), whereas it does not in close-range images. In the following experiments, we adopt two benchmarks: the ISPRS 2D Vaihingen dataset and the EvLab-SS dataset. The ISPRS 2D Vaihingen benchmark is a well-known high resolution aerial imagery semantic labeling database, whose spatial resolution is 9 cm with uniform color and texture distributions. The EvLab-SS benchmark, which is designed for evaluating the semantic segmentation results on remote sensing imagery, contains images captured from different platforms (both aerial and satellite) with different spatial resolutions (ranging from 0.1 m to 2 m). In addition, the images vary in color, gradient, and texture.
5.2.1. Evaluation on Vaihingen Dataset
The Vaihingen dataset comprises 6 classes with 33 image tiles, out of which 16 are fully annotated (tile numbers 1, 3, 5, 7, 11, 13, 15, 17, 21, 23, 26, 28, 30, 32, 34 and 37). The dataset is cropped from an aerial orthophoto mosaic (GSD 9 cm) with three spectral bands (i.e., red, green and near-infrared bands) that are rich in detail. The categories to be classified for each pixel are impervious surfaces, buildings, low vegetation, trees, and cars. In our experiment, we randomly sample 2932 patches of pixels from annotated images by sliding window. All patches are reserved for training. For the objective evaluation of the proposed approach, we submit the predicted results to the organizers who keep the ground truth.
The training procedure is performed with the SGD algorithm. The mini-batch size is set to 8, and each batch contains the cropped images that are randomly selected from training patches. These patches are resized to
pixels. We employ the “poly” learning policy, and the base learning rate is 1e-7 with the power of
. The momentum and weight decay are set to 0.9 and 0.0005, respectively, as recommended by Krizhevsky et al. [
82]. Smoothness coefficients
and
are set to 3 and 5, respectively. Our network converges after 50,000 iterations on this benchmark.
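The "poly" learning policy can be illustrated with the Caffe-style formula, in which the rate decays from the base value to zero over the training run. In this sketch the exponent `power = 0.9` is a hypothetical placeholder, and taking the reported 50,000 iterations as the maximum iteration count is also an assumption.

```python
def poly_lr(base_lr, iteration, max_iter, power):
    """Caffe-style "poly" policy: base_lr * (1 - iteration / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power

base_lr = 1e-7        # base learning rate quoted in the text
max_iter = 50_000     # assumption: iteration count used as the schedule length
power = 0.9           # hypothetical exponent, for illustration only

for it in (0, 12_500, 25_000, 37_500, 49_999):
    print(it, poly_lr(base_lr, it, max_iter, power))
```

With an exponent below 1 the rate stays close to the base value for most of the run and drops sharply near the end, whereas an exponent of 1 gives a linear decay.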
Figure 6 presents the visual comparison of these approaches. It can be seen from the error map that the CRF post-processing methods (
ADL [
59] and
HUST [
83]) indeed help improve the performance. Nevertheless, the upper left corner of the error map in the first row shows that, even when CRF post-processing is employed, many incorrectly classified pixels remain if the initial predictions are poor. In
Table 4, we compare our approach with the methods using additional aides, such as the VGG-16 pre-trained model [
29,
76,
84], digital surface model (DSM) [
49,
85,
86], and the CRF post-processing [
59,
83]. We also compare our approach with traditional feature based methods [
87]. Recent advances in the area of computer vision have shown that very deep networks can improve the semantic segmentation accuracy [
27,
54]. Therefore, our
DMSMR approach reasonably outperforms the “
SVL” method by approximately
in overall pixel-wise accuracy and
on global F1 score. Although additional aids help improve accuracy, they are not the
core of the segmentation engine [
53]. Our networks do not need these aids but achieve competitive scores compared with these approaches. For the fine-tuned networks from the pre-trained VGG-16 model (
ONE [
84],
DLR [
76],
UOA [
29],
RIT [
50]), their performances are not always as steady as that of the proposed
DMSMR approach. Our overall accuracy varies by approximately 0.1% (see
Ano (
Ano is available at
http://ftp.ipi.uni-hannover.de/ISPRS_WGIII_website/ISPRSIII_4_Test_results/2D_labeling_vaih/2D_labeling_Vaih_details_Ano/index.html) and
Ano2 in the ISPRS leader board.
Ano and
Ano2 are initialized with the same hyper-parameters, but the weight and bias terms are randomly initialized.) when tested on this benchmark. The instability of the fine-tuned networks is mainly caused by the uncertainty of the weights when transferring the VGG-16 classification network to the semantic segmentation task. Dense prediction problems, such as semantic segmentation, are structurally different from image classification [
23]. Thus, their performance is not as stable as expected. Our approach utilizes the dual dilated and non-dilated convolutional layers to prevent such instability.
5.2.2. Evaluation on EvLab-SS Dataset
The EvLab-SS benchmark (EvLab-SS dataset can be downloaded from our website
http://earthvisionlab.whu.edu.cn/zm/SemanticSegmentation/index.html.) is designed for the evaluation of the semantic segmentation algorithms on real engineered scenes, which aims to find a good deep learning architecture for the high resolution pixel-wise classification task in remote sensing area. The dataset is originally obtained from the Chinese Geographic Condition Survey and Mapping Project, and each image is fully annotated by the Geographic Conditions Survey (NO.GDPJ 01—2013) [
89] standards. The average resolution of the dataset is approximately
pixels. The EvLab-SS dataset contains 11 major classes, namely,
background, farmland, garden, woodland, grassland, building, road, structures, digging pile, desert and waters, and currently includes 60 frames of images captured by different platforms and sensors. The dataset comprises 35 satellite images, 19 frames of which are captured by the World-View-2 satellite [
90] (re-sample GSD 0.2 m), 5 frames are captured by the GeoEye satellite [
91] (re-sample GSD 0.5 m), 5 frames are captured by the QuickBird satellite [
92] (re-sample GSD 2 m), 6 frames are captured by the GF-2 satellite [
93] (re-sample GSD 1 m). The dataset also has 25 aerial images, 10 of which have a spatial resolution of 0.25 m and 15 of which have a spatial resolution of 0.1 m. In our experiment, we divide the dataset into 37 frames for training, 8 frames for validation, and 15 frames for testing. We produce the training dataset by applying the sliding window with a stride of 128 pixels to the training images, thereby resulting in 48,622 patches with a resolution of
pixels. Similar methods are utilized on validation images, thus generating 13,539 patches for validation. The
Garden class, which is reserved for validating the expressive power of CNNs in real scenes, is absent in our validation images.
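The sliding-window patch generation described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the `patch` parameter (the patch size is not reproduced here), and the choice to drop windows that would overrun the image border are all assumptions.

```python
def sliding_window_origins(height, width, patch, stride=128):
    """Return the top-left (row, col) corner of every patch that fits inside the image.

    Assumption: windows that would cross the image border are dropped.
    """
    if patch > height or patch > width:
        return []
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]


def crop(image, origin, patch):
    """Cut one square patch out of an image stored as a list of rows."""
    r, c = origin
    return [row[c:c + patch] for row in image[r:r + patch]]


# Hypothetical 512 x 640 frame cut into 256-pixel patches with a stride of 128.
origins = sliding_window_origins(512, 640, patch=256, stride=128)
print(len(origins))  # 3 row positions x 4 column positions = 12 patches
```

Because the stride is smaller than the patch size in this example, neighbouring patches overlap, which is how a small number of frames can yield a large number of training patches.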
In the training procedure, each iteration comprises a feed-forward pass and a backward pass, after which the model weights are adjusted by the SGD algorithm. Each training patch in a batch is resized to
pixels. The mini-batch size is set to 12, and the corresponding training patches are randomly selected. We employ the “poly” learning rate policy, starting with a learning rate of 1e-7 and a power of
. Smoothness coefficients
and
are set to 3 and 5 in our experiments, respectively. The momentum and weight decay are set to 0.9 and 0.0005, respectively, as recommended by Krizhevsky et al. [
82]. Our network converges after 70,000 iterations on this dataset. In the following experiments, we set the same learning parameters for the methods employing only one strategy (
MS,
Dilated or
MR-Opti) as the
DMSMR approach.
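For readers unfamiliar with the “poly” policy, the schedule and the momentum update can be sketched as below. This is a hedged illustration: the formula is the commonly used form base_lr × (1 − iter/max_iter)^power, the update rule is the usual Caffe-style SGD with L2 weight decay, and taking 70,000 as `max_iter` is an assumption. The `power` used in the example is a placeholder, not the value used in our experiments.

```python
def poly_lr(base_lr, iteration, max_iter, power):
    """'poly' policy: base_lr * (1 - iteration / max_iter) ** power."""
    if not 0 <= iteration <= max_iter:
        raise ValueError("iteration must lie in [0, max_iter]")
    return base_lr * (1.0 - iteration / max_iter) ** power


def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One SGD step with momentum and L2 weight decay for a scalar weight.

    Assumption: Caffe-style update, v <- momentum * v - lr * (grad + decay * w).
    """
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity


# Placeholder power of 1.0 (linear decay), chosen only to keep the example simple.
for it in (0, 35000, 70000):
    print(it, poly_lr(1e-7, it, max_iter=70000, power=1.0))
```

With a power below 1 the rate stays closer to its initial value for longer and drops more sharply near the end, whereas a power above 1 decays faster early on.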
Figure 7 visualizes the results of the different methods on the validation patches.
Figure 8 illustrates the comparative results of employing different strategies with respect to the varying trimap band width. Quantitative results are shown in
Table 5. In our experiments, we adopt the overall pixel-wise accuracy and mean intersection over union (mIoU) measurements to evaluate the effectiveness of different approaches.
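Both measurements can be derived from a confusion matrix, as the following minimal sketch shows. The function names are illustrative, and skipping classes that occur in neither the ground truth nor the prediction when averaging is an assumption of this sketch; it may differ from how absent classes are handled in our tables.

```python
def confusion_matrix(truth, pred, num_classes):
    """cm[t][p] counts pixels of true class t that were predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        cm[t][p] += 1
    return cm


def overall_accuracy(cm):
    """Fraction of all pixels whose predicted label matches the ground truth."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total if total else 0.0


def per_class_iou(cm):
    """IoU_k = TP / (TP + FP + FN); None when class k never occurs."""
    n = len(cm)
    ious = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[i][k] for i in range(n)) - tp
        union = tp + fp + fn
        ious.append(tp / union if union else None)
    return ious


def mean_iou(cm):
    """Average IoU over the classes that occur (assumption: empty classes skipped)."""
    valid = [v for v in per_class_iou(cm) if v is not None]
    return sum(valid) / len(valid) if valid else 0.0


# Toy example with flattened label maps and three classes.
cm = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
print(overall_accuracy(cm), mean_iou(cm))
```

Overall accuracy is dominated by frequent classes, whereas mIoU weights every class equally, which is why the two scores can move in different directions for the same method.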
Compared to the 2D Vaihingen dataset provided by the ISPRS organization, the EvLab-SS dataset is inconsistently distributed in terms of shape, color, and texture. The resolutions of the images captured by different sensors vary dramatically, and the buildings, roads, and other classes are not observed at the same scale. Therefore, the EvLab-SS dataset poses a greater challenge to researchers. It can be seen from
Figure 7 that the
DMSMR method can better delineate the boundary of an object. The results demonstrate the superiority of the combination of multi-scale (
MS), broader receptive field (
Dilated), and manifold ranking optimization (
MR-Opti) strategies, which can more accurately classify each pixel with varying spatial resolutions.
Figure 8 shows that although the mIoU score of the proposed DMSMR approach is relatively low with a small trimap width, it becomes increasingly stable and competitive as the width increases. By contrast, the mIoU scores of the MS, Dilated, and MR-Opti approaches are unstable, even decreasing at a few small trimap widths. The main reason for this phenomenon is that the spatial resolution differs among the training patches, which may be ignored when only one strategy is employed. In
Table 5, the special class (
Garden) is detected as 0.0% in all approaches, indicating that these methods can preserve the intrinsic nature of CNNs well. For the real engineered remote sensing data, the
Dilated approach does not appear to boost performance and decreases the overall accuracy and mean IoU by approximately 2.96% and 2.32%, respectively. This can be attributed to the numerous inhomogeneous objects in the training patches. For example, roads and buildings may not be completely covered in a single patch, which renders training with dilation operations in some layers meaningless. Although the
MR-Opti approach improves the overall accuracy by approximately 4%, it may disregard a few classes, such as
Desert and Waters, due to insufficient contextual information with varying illumination and color. The
MS approach retains more contextual information in each scale space but still suffers from the optimization problem in each scale, resulting in a 0.8% decrease in overall accuracy. Notably, the proposed
DMSMR approach combines the strengths of these strategies and overcomes their drawbacks, achieving approximately 5% and 1% improvements in overall accuracy and mIoU score, respectively, under the condition of limited training images and varying spatial resolutions.
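The trimap-band evaluation referred to above restricts the measurement to a narrow band around object boundaries. The sketch below is one possible way to build such a band and score predictions inside it. The 4-neighbour boundary definition, the square (Chebyshev) band, and the function names are assumptions of this sketch, and it reports pixel accuracy for brevity; an mIoU can be computed over the same band pixels.

```python
def boundary_mask(labels):
    """Mark pixels that have a 4-neighbour carrying a different label."""
    h, w = len(labels), len(labels[0])
    mask = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and labels[rr][cc] != labels[r][c]:
                    mask[r][c] = True
                    break
    return mask


def trimap_band(labels, width):
    """Mark pixels within `width` pixels (Chebyshev distance) of a boundary pixel."""
    h, w = len(labels), len(labels[0])
    edge = boundary_mask(labels)
    band = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if edge[r][c]:
                for rr in range(max(0, r - width), min(h, r + width + 1)):
                    for cc in range(max(0, c - width), min(w, c + width + 1)):
                        band[rr][cc] = True
    return band


def band_accuracy(truth, pred, width):
    """Pixel accuracy restricted to the band; None if the band is empty."""
    band = trimap_band(truth, width)
    hits = total = 0
    for r, row in enumerate(band):
        for c, inside in enumerate(row):
            if inside:
                total += 1
                hits += truth[r][c] == pred[r][c]
    return hits / total if total else None


truth = [[0, 0, 0, 1, 1, 1]]
pred = [[0, 0, 1, 1, 1, 1]]
for width in (0, 1):
    print(width, band_accuracy(truth, pred, width))
```

A narrow band contains mostly hard boundary pixels, so scores at small widths are typically lower and more sensitive to localization errors than scores at larger widths.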
6. Conclusions
In this paper, we present a DMSMR network for semantic image segmentation in a continuous domain. By extending the binary manifold ranking (MR) algorithm to a multi-label case, the assignment of a discrete label to each pixel can be linearly solved and a unique global optimum can be guaranteed. In addition, with the single stream MR method embedded into CNNs in a feedforward schema, the required parameters can be trained in an end-to-end manner. Furthermore, we propose to utilize dilated and non-dilated networks, which form dual layers to jointly optimize the results from the single stream manifold ranking network rather than on two separate streams, that is, unary and pairwise streams. By combining the multi-scale (MS), broader receptive field (Dilated), and manifold ranking optimization (MR-Opti) strategies, the proposed DMSMR network enables training without additional aids, such as multi-stage inference, region proposals, VGG-16 initialization, digital surface model (DSM), and CRF post-processing. Two groups of experiments on close-range and remote sensing high-resolution datasets are designed to evaluate the performance. When discriminatively trained and evaluated by submitting the results to the servers of the PASCAL VOC and ISPRS Vaihingen benchmarks, the proposed DMSMR network achieves competitive results without additional aids compared to recent methods. Our experiments on publicly available datasets, including the CamVid and EvLab-SS datasets, demonstrate the superior capacity of the proposed DMSMR approach over the methods that employ only one strategy. For real-world applications in remote sensing, the combined strategy steadily boosts the performance even with limited training images and varying spatial resolutions.
Nevertheless, the proposed approach may be further improved in the following ways. First, more prior information, such as orientation and texture, is expected to be integrated into the smoothness term in the multi-label manifold ranking objective function to delineate the visual objects with varying illumination and spatial resolution. Second, the generative adversarial nets [
94,
95,
96] (GAN) can be introduced to boost the performance by combining the adversarial term in the loss function with the limited number of training images. Third, model parallelism should be investigated when incorporating more prior knowledge into our model. For example, buildings and roads are salient objects in remote sensing images that can guide the semantic contextual information, and such prior information might be trained in parallel in a distributed system. Finally, superpixel segmentation can be applied as a pre-processing step to reduce the number of optimization elements in the proposed multi-label MR graphical model.