Semantic Segmentation on Remotely Sensed Images Using an Enhanced Global Convolutional Network with Channel Attention and Domain Specific Transfer Learning
Chulalongkorn University Big Data Analytics and IoT Center (CUBIC), Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Phayathai Rd, Pathumwan, Bangkok 10330, Thailand
Data Science and Computational Intelligence (DSCI) Laboratory, Department of Computer Science, Faculty of Science, King Mongkut’s Institute of Technology Ladkrabang, Chalongkrung Rd, Ladkrabang, Bangkok 10520, Thailand
Geo-Informatics and Space Technology Development Agency (Public Organization), 120, The Government Complex, Chaeng Wattana Rd, Lak Si, Bangkok 10210, Thailand
Authors to whom correspondence should be addressed.
Remote Sens. 2019, 11(1), 83; https://doi.org/10.3390/rs11010083
Received: 5 December 2018 / Revised: 25 December 2018 / Accepted: 1 January 2019 / Published: 4 January 2019
(This article belongs to the Special Issue Convolutional Neural Networks Applications in Remote Sensing)
In the remote sensing domain, it is crucial to perform semantic segmentation on raster images, i.e., to label objects such as rivers, buildings, and forests. A deep convolutional encoder–decoder (DCED) network is the state-of-the-art semantic segmentation method for remotely sensed images. However, its accuracy is still limited, since the network is not designed for remotely sensed images and the training data in this domain is scarce. In this paper, we propose a novel CNN for semantic segmentation, designed particularly for remote sensing corpora, with three main contributions. First, we apply a recent CNN called a global convolutional network (GCN), since it can capture different resolutions by extracting multi-scale features from different stages of the network. We further enhance the network by improving its backbone with a larger number of layers, which is suitable for medium resolution remotely sensed images. Second, “channel attention” is introduced into our network in order to select the most discriminative filters (features). Third, “domain-specific transfer learning” is introduced to alleviate the data scarcity issue by utilizing other remotely sensed corpora with different resolutions as pre-training data. Experiments were conducted on two datasets: (i) medium resolution data collected from the Landsat-8 satellite and (ii) very high resolution data from the ISPRS Vaihingen Challenge Dataset. The results show that our networks outperformed DCED in terms of F1 by 17.48% and 2.49% on the medium and very high resolution corpora, respectively.
Semantic segmentation of earth-surface objects such as agricultural fields, forests, roads, and urban and water areas from remotely sensed images has been used in many applications in various domains, e.g., urban planning, map updates, route optimization, and navigation [1,2,3,4,5], allowing us to better understand the images of these domains and to create important real-world applications.
A deep convolutional neural network (CNN) is a well-known method for automatic feature learning. It can automatically learn features at different levels and abstractions from raw images by hierarchically stacking multiple convolution and pooling layers [4,5,6,7,8,9,10,11,12,13,14]. To accomplish a task as challenging as semantic segmentation, features at different levels are required. Specifically, abstract high-level features are more suitable for the recognition of confusing manmade objects, while the labeling of finely structured objects could benefit from detailed low-level features. Therefore, the number of layers affects the performance of deep learning models.
In the past few years, many modern CNNs have been proposed, including the global convolutional network (GCN), in which a large kernel and an effective receptive field play an important role in performing classification and localization tasks simultaneously. The GCN was proposed to address the classification and localization issues in semantic segmentation and includes a residual-based boundary refinement for further refining object boundaries. However, this type of architecture ignores the global context, such as the weights of the features in each stage. Furthermore, most methods of this type simply sum the features of adjacent stages without considering their diverse representations, which leads to inconsistent results and reduced accuracy. In addition, a primary challenge of this remote sensing task is the lack of training data. These limitations motivate this work.
In this paper, we present a novel global convolutional network for segmenting multiple object classes from aerial and satellite images. To this end, we focus on three aspects: (i) varying the backbone using ResNet50, ResNet101, and ResNet152, (ii) applying a “channel attention block” [16,17] to assign weights to the feature maps in each stage of the backbone architecture, and (iii) employing “domain-specific transfer learning” [18,19,20] to relieve the scarcity of training data. Experiments were conducted using satellite imagery (from the Landsat-8 satellite), which was provided by a government organization in Thailand, and using well-known aerial imagery from the ISPRS Vaihingen Challenge corpus, which is publicly available. The results showed that our method outperforms the baseline, a deep convolutional encoder–decoder (DCED) network, in terms of the F1 score and the mean of class-wise intersection over union (mean IoU).
The remainder of this paper is arranged as follows. The related work is discussed in Section 2. Section 3 describes our proposed methodology. Experimental datasets and evaluations are described in Section 4. Experimental results and discussions are presented in Section 5. Finally, we conclude our work and discuss future work in Section 6.
2. Related Work
Deep learning has been successfully applied to remotely sensed data analysis, notably land cover mapping of urban areas [1,2,3], and has increasingly become a promising tool for accelerating the image recognition process with high accuracy [4,5,6,7,8,9,10,11,12,13,14,22,23,24,25,26,27,28,29,30]. It is a fast-growing field in which new architectures appear frequently. This section is divided into three subsections, which discuss deep learning concepts for semantic segmentation, multi-object segmentation techniques using modern deep learning architectures, and modern techniques of deep learning.
2.1. Deep Learning Concepts for Semantic Segmentation
Semantic segmentation algorithms are often formulated to solve structured pixel-wise labeling problems based on a deep CNN. Noh et al. proposed a semantic segmentation technique utilizing a deconvolutional neural network (DCNN) built on top of layers adopted from VGG16 [4,8]. The DCNN structure is composed of upsampling layers and deconvolution layers, which identify pixel-wise class labels and predict segmentation masks. Their method yields high performance on the PASCAL VOC 2012 corpus, with 72.5% accuracy in the best-case scenario (the highest accuracy, as of the time of writing, among methods trained without additional or external data).
Long et al. adapted contemporary classification networks, including AlexNet, VGG, and GoogLeNet, into fully convolutional networks (FCNs). In this method, some of the pooling layers were skipped: Layer 3 (FCN-8s), Layer 4 (FCN-16s), and Layer 5 (FCN-32s). The skip architecture reduces the potential over-fitting problem and has shown improvements in performance, ranging from 20% to 62.2% in experiments on PASCAL VOC 2012 data.
Ronneberger et al. proposed U-Net, a DCNN for biomedical image segmentation. The architecture consists of a contracting path that captures context and a symmetric expanding path that enables precise localization. The network was reported to be capable of learning from a limited number of training images and performed better than the prior best method (a sliding-window DCNN) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks.
Badrinarayanan et al. [31,32,33] proposed a deep convolutional encoder–decoder network (DCED), called “SegNet,” that consists of two main networks, an encoder and a decoder, and some outer layers. The two outer layers of the decoder network are responsible for feature extraction, the results of which are transmitted to the layer adjacent to the last layer of the decoder network. This layer is responsible for pixel-wise classification (determining which pixel belongs to which class). There is no fully connected layer between the feature extraction layers. In the upsampling layer of the decoder, the pooling indices from the encoder are passed to the decoder, whose convolution kernels are trained in each epoch (training round). In the last layer (classification), softmax is used as the classifier for pixel-wise classification. The DCED is one of the deep learning models that exceeds the state of the art on many remote sensing corpora.
In this work, the DCED method was selected as our baseline since it is the most popular architecture used in various networks for semantic segmentation.
2.2. Modern Deep Learning Architectures For Semantic Segmentation
Recently, many approaches based on the DCED have achieved high performance on different benchmarks [16,31,32,33]. However, most of them still have limited accuracy. Therefore, many modern deep learning architectures have been proposed. One example is instance-aware semantic segmentation, which differs slightly from semantic segmentation: instead of labeling all pixels, it focuses on the target objects and labels only the pixels of those objects. FCIS is built on fully convolutional networks (FCNs). Mask R-CNN was built around the FCN and incorporates a proposed joint formulation. Peng et al. presented the concept that large kernels matter for semantic segmentation and proposed a global convolutional network (GCN), shown in Figure 1, to address both the classification and localization issues. Large separable kernels were used to expand the receptive field, and a boundary refinement block was added to further improve localization performance near the boundaries. On the Cityscapes challenge, the GCN outperformed all previously published methods (all modern deep learning baselines) and became the new state of the art. Therefore, the GCN was selected as the main model of our work.
2.3. Modern Techniques of Deep Learning
Modern deep learning techniques are important for the accuracy of a CNN. Popular ideas used to boost accuracy in semantic segmentation tasks include global context, the attention module, and semantic boundary detection.
Global context is a modern technique that has demonstrated the effectiveness of global average pooling in the semantic segmentation task. For example, PSPNet and Deeplab v3 extend it to spatial pyramid pooling and atrous spatial pyramid pooling, respectively, resulting in strong performance on different benchmarks. However, to take full advantage of the pyramid pooling module, these two methods make the base feature network downsample by a factor of eight with atrous convolution, which is time-consuming and memory-intensive.
Attention module: Attention helps a network focus on the most relevant information. Recently, the attention module has increasingly become a powerful tool for deep neural networks [16,17]. The methods in [16,17] pay attention to information at different scales. In this work, we utilize a channel attention block to select features, similar to the approach used for learning a discriminative feature network.
Refinement residual block: The feature maps of each stage in the feature network all go through the refinement residual block. In our work, we use the boundary refinement (BR) block as the realization of the “refinement residual block” concept. The first component of the block is a convolution layer, which we use to unify the number of channels to 21 while combining the information across all channels. It is followed by a basic residual block, which refines the feature map. Furthermore, inspired by the architecture of ResNet, this block can strengthen the recognition ability of each stage.
3. The Proposed Method
In this section, the details of our proposed network are explained (shown in Figure 2). The network is based on the GCN with three aspects of improvements: (i) the modification of backbone architecture (shown in P1 in Figure 2), (ii) applying the channel attention block (shown in P2 in Figure 2), and (iii) using the concept of domain-specific transfer learning (shown in P3 in Figure 2).
3.1. Data Preprocessing
In this paper, two benchmark corpora are used: (i) the ISPRS Vaihingen Challenge corpus and (ii) the Landsat-8 dataset, which comprise very high and medium resolution images, respectively. More details of the datasets are given in Section 4.1 and Section 4.2. Before discussing the model, it is worth explaining our data preprocessing procedure, since preprocessing is required when working with neural network and deep learning models. First, mean subtraction is applied.
In addition, data augmentation is often required for more complex object recognition tasks. Therefore, random horizontal flips are generated to increase the amount of training data. For the ISPRS corpus, all images are standardized and cropped into fixed-size patches at a resolution of 9 cm/pixel. For the Landsat-8 corpus, each image is also flipped horizontally and scaled to a fixed size at a resolution of 30 m/pixel from the original images (16,800 × 15,800 pixels).
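The preprocessing steps described above (mean subtraction, random horizontal flip, and random cropping) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `augment`, the flip probability of 0.5, and the default crop size of 512 are assumptions made for the example. The label mask receives the same flip and crop as the image so that pixels and labels stay aligned.

```python
import numpy as np

def augment(image, label, channel_mean, crop_size=512, rng=None):
    """Mean-subtract, randomly flip, and randomly crop an image/label pair.

    image: H x W x C array, label: H x W array of class indices.
    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Mean subtraction: centre every channel around zero.
    out = image.astype(np.float32) - np.asarray(channel_mean, dtype=np.float32)

    # Random horizontal flip (assumed probability 0.5), applied to both arrays.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
        label = label[:, ::-1]

    # Random crop; the same window is used for the image and the label.
    height, width = label.shape
    top = int(rng.integers(0, height - crop_size + 1))
    left = int(rng.integers(0, width - crop_size + 1))
    window = (slice(top, top + crop_size), slice(left, left + crop_size))
    return out[window], label[window]
```

Passing a seeded `np.random.default_rng(seed)` as `rng` makes the augmentation reproducible between runs.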
3.2. A Global Convolutional Network (GCN) with Variations of Backbones
The GCN, shown in Figure 1, is a modern architecture that overcomes the drawbacks of a traditional semantic segmentation network, such as the deep convolutional encoder–decoder (DCED) network. A traditional network usually cascades convolutional layers in order to generate sophisticated features; these can be considered local features that are specialized for a specific task. However, specialized features alone are not sufficient; general features are also important. The GCN addresses this issue by introducing a multi-level architecture, where each level aims to capture features at a different resolution, so that both local and global features are considered in the model.
As shown in Figure 1, there are two main blocks in the GCN: a localization block and a classification block. From the localization view in the left block, the structure is a stack of classical fully convolutional layers called “levels.” Each level aims to construct features with different resolutions. From the classification view, there are two modules: the GCN and the boundary refinement (BR). For the GCN module, the kernel size of the convolutional structure should be as large as possible, which is motivated by the densely connected structure of the classification models. If the kernel size increases to the spatial size of the feature map (named the global convolution), the network will share the same benefits with the pure classification models. The BR module is added to further improve localization performance near the boundaries.
Although the GCN architecture has shown promising prediction performance, it can be further improved by varying the backbone, using ResNet with different numbers of layers (ResNet50, ResNet101, and ResNet152), as shown in Figure 3. Additionally, the GCN is designed to work with a large kernel size. In this paper, we set the kernel size to 9, following previous work.
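To illustrate why large separable kernels keep a large-kernel module affordable, the sketch below compares the number of weights in a dense k × k convolution with that of a two-branch separable module (k × 1 followed by 1 × k, and 1 × k followed by k × 1, with the two branches summed), which is the structure described in the GCN paper by Peng et al. The function names are hypothetical, biases are ignored, and the channel counts (2048 input channels and 21 output channels) are assumptions chosen for the example.

```python
def dense_kxk_params(c_in, c_out, k):
    """Weights in a single dense k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def gcn_module_params(c_in, c_out, k):
    """Weights in a two-branch separable large-kernel module (biases ignored).

    Each branch applies one 1-D convolution from c_in to c_out channels,
    followed by the orthogonal 1-D convolution from c_out to c_out channels.
    """
    per_branch = k * c_in * c_out + k * c_out * c_out
    return 2 * per_branch

if __name__ == "__main__":
    k = 9  # kernel size used in this paper
    print(dense_kxk_params(2048, 21, k))
    print(gcn_module_params(2048, 21, k))
```

Under these assumptions, the dense 9 × 9 convolution needs 3,483,648 weights, whereas the separable module needs 782,082, while both cover the same 9 × 9 receptive field.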
3.3. The Channel Attention Block
Attention mechanisms [16,17] in neural networks are very loosely based on the visual attention mechanism found in humans and equip a neural network with the ability to focus on a subset of its inputs (or features), i.e., to select specific inputs. Human visual attention is well studied, and while there are different models, all of them essentially come down to being able to focus on a certain region of an image at very high resolution, perceiving the surrounding image at medium resolution, and then adjusting the focal point over time.
To apply this attentional layer to our network, we use the channel attention block, shown as Block A in Figure 2, with its detailed architecture shown in Figure 4. It is designed to change the weights of the remote sensing features at each stage (level), so that larger weights are adaptively assigned to the more important features.
In the proposed architecture, a convolution operator gives the probability of each class at each pixel. In Equation (1), the final score is summed over all channels of the feature maps:
$$y_k = F(x; w) = \sum_{i=1,j=1}^{D} w_{i,j}\, x_{i,j} \qquad (1)$$
where $x$ is the output feature of the network, $w$ represents the convolution kernel, and $k \in \{1, 2, \ldots, K\}$. The number of channels is represented by $K$, and $D$ is the set of pixel positions.
$$\delta(y_k) = \frac{\exp(y_k)}{\sum_{j=1}^{K} \exp(y_j)} \qquad (2)$$
where $\delta$ is the prediction probability and $y$ is the output of the network. As shown in Equations (1) and (2), the final predicted label is the category with the highest probability. Suppose that the prediction result of a certain patch is $y_0$, while its true label is $y_1$. We can then introduce a parameter $\alpha$ to change the highest probability value from $y_0$ to $y_1$, as Equation (3) shows:
$$\bar{y} = \alpha y = [\alpha_1, \ldots, \alpha_K] \cdot [y_1, \ldots, y_K]^{T} = [\alpha_1 w_1, \ldots, \alpha_K w_K] \times [x_1, \ldots, x_K]^{T} \qquad (3)$$
where $\bar{y}$ is the new prediction of the network, and $\alpha = \mathrm{Sigmoid}(x; w)$.
Based on the above formulation of the channel attention block, we can explore its practical significance. Equation (1) implicitly assumes that the weights of different channels are equal. However, the features in different stages have different degrees of discrimination, which results in different consistency of prediction. Consequently, in Equation (3), the value $\alpha$ is applied to the feature maps $x$, which represents feature selection with the channel attention block.
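The re-weighting idea behind the channel attention block can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the architecture of Figure 4: it assumes a single feature map, global average pooling as the squeeze step, one learned linear map (equivalent to a 1 × 1 convolution on the pooled vector), and a sigmoid that produces one weight per channel. The names `channel_attention`, `weight`, and `bias` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(features, weight, bias):
    """Re-weight the channels of an H x W x K feature map.

    weight (K x K) and bias (K) stand in for learned parameters.
    """
    # Global average pooling: one descriptor per channel, shape (K,).
    pooled = features.mean(axis=(0, 1))

    # A 1 x 1 convolution on a 1 x 1 x K tensor is a matrix-vector product.
    alpha = sigmoid(pooled @ weight + bias)  # shape (K,), each value in (0, 1)

    # Broadcast the per-channel weights over all pixel positions.
    return features * alpha, alpha
```

Channels whose attention weight is close to 1 pass through almost unchanged, while channels whose weight is close to 0 are suppressed, which corresponds to the feature selection role described above.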
3.4. Domain-Specific Transfer Learning
The overall idea of transfer learning is to use knowledge learned from tasks for which a large amount of labeled data is available in settings where only a small amount of labeled data is available. Creating labeled data is expensive, so making the best use of an existing dataset is key. Certain low-level features, such as edges, shapes, corners, and intensity, can be shared across tasks, and new high-level features specific to the target problem can then be learned. Additionally, knowledge from an existing task acts as an additional input when learning a new target task.
Although the deep learning approach often achieves promising prediction performance, it requires a large amount of training data. Since it is difficult to obtain annotated satellite images, the performance in prior works has been limited.
Fortunately, there is a recent concept called domain-specific transfer learning [18,19,20] that allows one to reuse the weights obtained from the inputs of other domains. It is currently very popular in the field of deep learning because it enables one to train deep neural networks with comparatively little data. This is very useful, since most real-world problems do not have millions of labeled data points with which to train such complex models.
To address this data inadequacy, we propose an effective transfer deep neural network that performs knowledge transfer between a very high resolution (VHR) corpus and a medium resolution (MR) corpus, as shown in Figure 5.
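One common way to realize this kind of knowledge transfer is to initialize the target network with every pre-trained weight whose name and shape match, and to leave the remaining layers (typically the final classifier, since the two corpora have different class sets) freshly initialized. The sketch below illustrates that idea with plain dictionaries of NumPy arrays. It is an assumption about how the transfer could be carried out, not the authors' procedure, and the function and layer names are hypothetical.

```python
import numpy as np

def transfer_weights(source, target):
    """Initialize target weights from a pre-trained source where possible.

    source, target: dicts mapping layer names to NumPy arrays.
    Returns the merged weights and the list of reused layer names.
    """
    merged, reused = {}, []
    for name, value in target.items():
        pretrained = source.get(name)
        if pretrained is not None and pretrained.shape == value.shape:
            merged[name] = pretrained.copy()  # reuse the pre-trained weights
            reused.append(name)
        else:
            merged[name] = value              # keep the fresh initialization
    return merged, reused

if __name__ == "__main__":
    # Hypothetical layers: a shared backbone and classifiers with 6 vs. 5 classes.
    vhr = {"backbone/conv1": np.ones((3, 3, 3, 64)), "classifier": np.ones((1, 1, 21, 6))}
    mr = {"backbone/conv1": np.zeros((3, 3, 3, 64)), "classifier": np.zeros((1, 1, 21, 5))}
    _, reused = transfer_weights(vhr, mr)
    print(reused)
```

In this example, only the backbone layer is reused, because the classifier shapes differ between the two corpora. After initialization, the whole network is fine-tuned on the target corpus.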
4. Experimental Datasets and Evaluation
In our experiments, two types of datasets were used: (i) medium resolution imagery (satellite images; the Landsat-8 dataset) provided by a government organization in Thailand, GISTDA (Geo-Informatics and Space Technology Development Agency (Public Organization)), and (ii) very high resolution imagery (aerial images; the ISPRS Vaihingen dataset). All experiments were evaluated using standard metrics, including precision, recall, the F1 score, and the mean IoU.
4.1. Landsat-8 Dataset
Landsat-8 is an American earth observation satellite that collects and archives medium resolution (30-m spatial resolution) multispectral image data, affording seasonal coverage of the global landmasses for a period of no less than 5 years. Landsat-8 images consist of nine spectral bands with a spatial resolution of 30 m for Bands 1–7 and 9. The ultra blue Band 1 is useful for coastal and aerosol studies. Band 9 is useful for cirrus cloud detection. The resolution for Band 8 (panchromatic) is 15 m. Thermal Bands 10 and 11 are useful in providing more accurate surface temperatures and are collected at 100 m. The approximate scene size is 170 km north–south by 183 km east–west (106 mi by 114 mi). Since Landsat-8 data includes additional bands, the band combinations used to create RGB composites differ from those of Landsat 7 and Landsat 5. For instance, Bands 4, 3, and 2 are used to create a color infrared (CIR) image using Landsat 7 or Landsat 5, whereas Bands 5, 4, and 3 are used to create a CIR composite using Landsat 8 data.
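As a small illustration of the band combination mentioned above, the following sketch builds a Landsat 8 color infrared (CIR) composite by stacking Bands 5, 4, and 3 into the red, green, and blue display channels. The function name `cir_composite`, the dictionary input, and the per-channel min-max scaling to [0, 1] are assumptions made for the example. Reading the band rasters from disk is outside its scope.

```python
import numpy as np

def cir_composite(bands):
    """Build a CIR composite from Landsat 8 bands.

    bands: dict mapping a band number to a 2-D array of equal shape.
    Returns an H x W x 3 float array scaled to [0, 1] per channel.
    """
    # Band 5 (near infrared) -> red, Band 4 (red) -> green, Band 3 (green) -> blue.
    stack = np.stack([bands[5], bands[4], bands[3]], axis=-1).astype(np.float32)

    # Per-channel min-max scaling for display; the guard avoids division by zero.
    low = stack.min(axis=(0, 1), keepdims=True)
    high = stack.max(axis=(0, 1), keepdims=True)
    return (stack - low) / np.maximum(high - low, 1e-6)
```

For Landsat 7 or Landsat 5 data, the same function would be used with Bands 4, 3, and 2 instead.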
For this dataset, the satellite images cover Nan, a province in Thailand. The dataset was obtained from the Landsat-8 satellite and consists of 1012 satellite images, some samples of which are shown in Figure 6.
This corpus comprises a large, diverse set of medium resolution images (16,800 × 15,800 pixels), of which 1012 have high-quality pixel-level labels for five classes: agriculture, forest, miscellaneous, urban, and water. The 1012 images were split into 800 training and 112 validation images with publicly available annotations, as well as 100 test images with annotations withheld; comparison with other methods was performed via a dedicated evaluation server. For quantitative evaluation, the mean of class-wise intersection over union (mean IoU) and the F1 score are used.
4.2. ISPRS Vaihingen Dataset
One of the major challenges in remote sensing is the automated extraction of urban objects from data acquired by airborne sensors. The Semantic Labeling Contest provides two state-of-the-art airborne image corpora. The Vaihingen corpus shows a relatively small village with many detached buildings and small multi-story buildings, and the Potsdam corpus shows a typical historic city with large building blocks, narrow streets, and dense settlement structure. In our experiments, the Vaihingen corpus was selected and used.
The ISPRS 2D Semantic Labeling Challenge in Vaihingen (Figure 7 and Figure 8) was used as our benchmark dataset. It consists of three spectral bands (i.e., red, green, and near-infrared bands), the corresponding DSM (digital surface model), and the NDSM (normalized digital surface model) data. Overall, there are 33 images of about 2500 × 2000 pixels at a ground sampling distance (GSD) of about 9 cm in the image data. Among them, the ground truth of only 16 images is available, and that of the remaining 17 images is withheld by the challenge organizer for the online test. For offline validation, we randomly split the 16 images with ground truth available into a training set of 10 images and a validation set of 6 images. The DSM and NDSM data were not used in any experiment on this dataset. Following other methods, four tiles (Image Numbers 5, 7, 23, and 30) were removed from the training set and used as the validation set. Experimental results are reported on the validation set unless otherwise specified.
The multi-class classification task can be considered as a set of per-class segmentation problems, where the pixels of the class of interest are positives and all remaining pixels are negatives. Let TP denote the number of true positives, TN denote the number of true negatives, FP denote the number of false positives, and FN denote the number of false negatives.
The evaluation metrics, including precision, recall, F1, and the mean IoU, are shown in Equations (4)–(8). Precision is the percentage of correctly classified pixels of the class of interest among all pixels predicted as that class by the classifier. Recall is the percentage of correctly classified pixels of the class of interest among all actual pixels of that class. F1 is a combination of precision and recall.
To evaluate the performance of different deep models, we discuss the above two major metrics, the F1 score and the mean of class-wise intersection over union (mean IoU), on each category, together with their mean values to assess the average performance.
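For reference, the metrics named above can be computed from the TP, FP, and FN counts of each class using their standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = 2 · precision · recall / (precision + recall), and IoU = TP / (TP + FP + FN). The sketch below is a minimal NumPy version. The function names are hypothetical, and returning 0 when a denominator is zero is a convention assumed here.

```python
import numpy as np

def confusion_counts(pred, truth, cls):
    """TP, FP, and FN pixel counts for one class."""
    tp = int(np.sum((pred == cls) & (truth == cls)))
    fp = int(np.sum((pred == cls) & (truth != cls)))
    fn = int(np.sum((pred != cls) & (truth == cls)))
    return tp, fp, fn

def class_metrics(tp, fp, fn):
    """Precision, recall, F1, and IoU for one class (0.0 if undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou

def mean_iou(pred, truth, num_classes):
    """Mean of the class-wise IoU over all classes."""
    ious = [class_metrics(*confusion_counts(pred, truth, c))[3]
            for c in range(num_classes)]
    return float(np.mean(ious))
```

Here `pred` and `truth` are integer label maps of the same shape, for example a predicted segmentation mask and its ground truth.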
5. Experimental Results and Discussion
The implementation is based on a deep learning framework called “TensorFlow-Slim”, which is an extension of TensorFlow. All experiments were conducted on servers with an Intel® Xeon® Processor E5-2660 v3 (25M Cache, 2.60 GHz), 32 GB of memory (RAM), an Nvidia GeForce GTX 1070 (8 GB), an Nvidia GeForce GTX 1080 (8 GB), and an Nvidia GeForce GTX 1080 Ti (11 GB). Instead of using the whole image (1500 × 1500 pixels) to train the network, we randomly cropped all images to 512 × 512 pixels as the inputs for each epoch.
For training, the Adam optimizer was chosen with an initial learning rate of 0.004 and a weight decay of 0.00001. Batch normalization is used before each convolutional layer in our implementation to ease training and to make it possible to concatenate feature maps from different layers. To avoid overfitting, common data augmentations are used, as detailed in Section 3.1. For measurement, we use the mean pixel intersection-over-union (mean IoU) and the F1 score as the metrics.
Inspired by [16,27,37], we use the “poly” learning rate policy, in which the initial learning rate is multiplied by the factor in Equation (9), with a power of 0.9 and an initial learning rate of 0.004:
$$\left(1 - \frac{iter}{max\_iter}\right)^{power} \qquad (9)$$
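The “poly” policy can be written as a one-line function. The sketch below assumes the usual form of the schedule, in which the learning rate at a given step equals the initial rate multiplied by (1 − step / max_steps) raised to the chosen power. The function name and the use of a step counter (rather than an epoch counter) are assumptions of the example.

```python
def poly_learning_rate(initial_lr, step, max_steps, power=0.9):
    """Learning rate under the 'poly' policy at a given training step."""
    return initial_lr * (1.0 - step / max_steps) ** power

if __name__ == "__main__":
    # Print the schedule at a few points, using the settings quoted in the text.
    for step in (0, 2500, 5000, 7500, 10000):
        print(step, poly_learning_rate(0.004, step, max_steps=10000))
```

The rate starts at the initial value, decays smoothly, and reaches zero at the final step. The total of 10,000 steps is only an illustrative value.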
All models are trained for 50 epochs with a mini-batch size of 4, and each batch contains cropped images that are randomly selected from the training patches. These patches are resized to a fixed input size. The batch normalization (BN) statistics are updated on the whole mini-batch.
This section illustrates the details of our experiments. The proposed deep learning network is based on the GCN with three improvements: (i) varying the backbones using ResNet, (ii) channel attention and global average pooling, and (iii) domain-specific transfer learning. From all proposed strategies, there are six acronyms of strategies as shown in Table 1.
For the experimental setup, there were three experiments on two remotely sensed datasets: the Landsat-8 dataset and the ISPRS Vaihingen Challenge dataset (details in Section 4.1 and Section 4.2). The experiments aimed to illustrate that each proposed strategy can improve the performance. First, the GCN152 method was compared to the GCN50 method and the GCN101 method for the varying backbones using ResNet with different numbers of layers on the GCN network strategy. Second, the GCN152-A method was compared to the GCN152 method for the channel attention strategy. Third, the full proposed technique GCN152-TL-A method was compared to existing methods for the concept of domain-specific transfer learning.
5.1. Results of the Landsat-8 Corpus with Discussion
An experiment was conducted on the Landsat-8 corpus, and the results are shown in Table 2 and Table 3, which compare the baseline with variations of the proposed techniques. They show that our network with all strategies, GCN152-TL-A, outperforms the other methods. Further details are discussed below to show that each of the proposed techniques improves accuracy. In this experiment, the only baseline is the state-of-the-art deep convolutional encoder–decoder (DCED) network [31,32,33].
5.1.1. The Effect of an Enhanced GCN on the Landsat-8 Corpus
Our first strategy aims to increase the F1 and mean IoU scores of the network by varying the backbone among ResNet50, ResNet101, and ResNet152, in comparison with the traditional DCED method. In Table 2 and Table 3, the F1 of GCN152 (0.7563) exceeds that of GCN50 (0.6847), GCN101 (0.7290), and the baseline method, DCED (0.6495); this yields a higher F1 at 2.74%, 3.52%, and 4.43%, respectively. The mean IoU of GCN152 (0.6364) exceeds that of GCN50 (0.5734), GCN101 (0.6154), and the baseline method, DCED (0.5384); this yields a higher mean IoU at 2.10%, 3.50%, and 4.20%, respectively. The main reason is a higher precision, despite a slightly lower recall. This implies that the enhanced GCN is significantly more effective than the DCED method (baseline) on this medium resolution corpus and that a ResNet with a larger number of layers is more robust than one with fewer layers.
When comparing the results of the original GCN method and the enhanced GCN methods on the Landsat-8 corpus (Table 2), it is clear that a GCN with a deeper backbone improves network performance in terms of F1 and mean IoU.
5.1.2. The Effect of Using Channel Attention on the Landsat-8 Corpus
Our second mechanism focuses on applying the channel attention block (details in Section 3.3) to change the weights of the features at each stage in order to enhance consistency. In Table 2 and Table 3, the F1 of GCN152-A (0.7897) is greater than that of GCN152 (0.7563), a gain of 3.34%. The mean IoU of GCN152-A (0.6726) is superior to that of GCN152 (0.6364), a gain of 3.62%. The results (Figure 9e and Figure 10e) show that the channel attention block enables the network to obtain discriminative features stage-wise and thus makes the prediction intra-class consistent. This follows from the fact that all feature maps of each layer are re-weighted.
5.1.3. The Effect of Using Domain-Specific Transfer Learning on Landsat-8 Corpus
Our last strategy uses domain-specific transfer learning (details in Section 3.4) by reusing the pre-trained weights of the GCN152-A model trained on the ISPRS Vaihingen corpus. In Table 2 and Table 3, the GCN152-TL-A method achieves the best scores; it clearly outperforms not only the baseline but also all previous variants. Its F1 is higher than that of the DCED (baseline) by 17.80%, and its mean IoU is higher than that of the DCED by 17.94%. Additionally, the results illustrate that domain-specific transfer learning can enhance both precision (0.8293) and recall (0.8476).
Figure 9 and Figure 10 show 12 sample results from the proposed method. With all strategies applied, the images in the last column (Figure 9f and Figure 10f) are similar to the ground truths (Figure 9b and Figure 10b). Furthermore, the F1 and mean IoU scores improve with each strategy added to the network, as shown in Figure 9c–f and Figure 10c–f.
To achieve the highest accuracy, the network must be properly configured and trained for enough epochs that all of its parameters converge. Figure 11a illustrates that the proposed network was properly configured and trained until convergence and that its training runs more smoothly than that of the baseline in Figure 12a. Furthermore, Figure 11b and Figure 12b show that a higher number of epochs tends to yield a better score. Thus, the number of epochs chosen based on the validation data is 49 (the best model for this dataset).
Twelve sample testing results (shown as Figure 9 and Figure 10) are based on the proposed method with respect to Nan (one of the northern provinces (changwat) of Thailand and where agriculture is the main industry). The results of the last column look closest to the ground truth in the second column.
As can be seen in Figure 9 and Figure 10, our best model outperforms the other models by a considerable margin on each category, especially for the agriculture, miscellaneous (Misc), and water classes. Furthermore, the loss curves shown in Figure 11a indicate that our model performs better on all given categories.
5.2. Results of the ISPRS Vaihingen Challenge Corpus with Discussion
An experiment was conducted on the ISPRS Vaihingen Challenge corpus, and the results are shown in Table 4 and Table 5, which compare the baseline with variations of the proposed techniques. They show that our network with all strategies (GCN152-TL-A) outperforms the other methods. Further details are discussed below to show that each of the proposed techniques improves accuracy. In this experiment, the only baseline is the DCED network.
5.2.1. Effect of the Enhanced GCN on the ISPRS Vaihingen Corpus
Our first strategy aims to increase the F1 and mean IoU scores of the network by varying the backbone among ResNet50, ResNet101, and ResNet152, in comparison with the traditional DCED method. In Table 4 and Table 5, the F1 of GCN152 (0.7864) exceeds that of GCN50 (0.776), GCN101 (0.768), and the baseline method, DCED (0.7693); this yields a higher F1 at 0.02%, 0.68%, and 1.01%, respectively. The mean IoU of GCN152 (0.8977) exceeds that of GCN50 (0.8776), GCN101 (0.8972), and the baseline method, DCED (0.8651); this yields a higher mean IoU at 0.02%, 0.68%, and 1.01%, respectively. This implies that the enhanced GCN is also more accurate than the DCED approach on a very high resolution dataset. A ResNet with a larger number of layers is again more robust than one with fewer layers, as was observed on the Landsat-8 corpus (Section 5.1.1).
When comparing the results between the original GCN method and the enhanced GCN methods on the Landsat-8 corpus (Table 4), it is clear that the GCN with a larger backbone layer can improve network performance in terms of and
5.2.2. Effect of Using Channel Attention on ISPRS Vaihingen Corpus
Our second mechanism uses the channel attention block to reweight the features at each stage and thereby enhance consistency. From Table 4 and Table 5, the F1 of GCN152-A (0.7902) is greater than that of GCN152 (0.7864), a gain of 0.38%. On the other reported score, GCN152-A (0.9057) is better than GCN152 (0.8977), a gain of 0.80%. The results (Figure 13e and Figure 14e) show that channel attention also helps the network obtain discriminative features stage-wise, which makes intra-class predictions more consistent on very high resolution images.
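To make the idea of reweighting features per channel concrete, the following is a minimal NumPy sketch of a squeeze-and-excitation style gate in the spirit of Hu et al. (listed in the references). It is not the block used in our network: the function name `channel_attention`, the reduction ratio `R`, the random weights, and the tensor sizes are all assumptions chosen for illustration.

```python
import numpy as np

def channel_attention(features, w1, b1, w2, b2):
    """Reweight the channels of one (H, W, C) feature map.

    Illustrative sketch only: global average pooling, a two-layer
    bottleneck, and a sigmoid gate, as in squeeze-and-excitation.
    """
    squeezed = features.mean(axis=(0, 1))                 # (C,) global context
    hidden = np.maximum(0.0, squeezed @ w1 + b1)          # (C // R,) ReLU bottleneck
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))   # (C,) gates in (0, 1)
    return features * weights, weights                    # broadcast over H and W

# Hypothetical sizes and randomly initialised (untrained) parameters.
rng = np.random.default_rng(0)
H, W, C, R = 8, 8, 16, 4
feats = rng.normal(size=(H, W, C))
w1 = rng.normal(scale=0.1, size=(C, C // R))
b1 = np.zeros(C // R)
w2 = rng.normal(scale=0.1, size=(C // R, C))
b2 = np.zeros(C)

out, weights = channel_attention(feats, w1, b1, w2, b2)
print(out.shape, weights.round(3))
```

Each channel is scaled by a single learned weight, so the spatial layout of the feature map is unchanged while less discriminative channels can be suppressed.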
5.2.3. The Effect of Using Domain-Specific Transfer Learning on the ISPRS Vaihingen Corpus
Our last strategy performs domain-specific transfer learning (details in Section 3.3) by reusing the pre-trained weights of the GCN152-A model trained on the Landsat-8 corpus. From Table 4 and Table 5, the GCN152-TL-A method is the winner; it clearly outperforms not only the baseline but also all previous variants. Its F1 is higher than that of the DCED (baseline) and the GCN by 2.49% and 1.82%, respectively. On the other reported score, it is higher than the DCED and the GCN by 4.76% and 3.51%, respectively. Additionally, the result illustrates that domain-specific transfer learning can enhance both precision (0.7888) and recall (0.8001).
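The weight reuse described above can be sketched as follows. This is a simplified illustration under stated assumptions, not our training code: the helper `transfer_weights`, the layer names, and the tensor shapes are hypothetical, and a model is represented as a plain dictionary of NumPy arrays. The sketch copies every pre-trained tensor whose name and shape match the target model and keeps the target's own initialisation elsewhere, which is what happens to the final classification layer when the two corpora have different numbers of classes (five for Landsat-8, six for Vaihingen).

```python
import numpy as np

def transfer_weights(pretrained, target):
    """Reuse pre-trained tensors whose name and shape match the target model."""
    merged, reused, kept = {}, [], []
    for name, value in target.items():
        source = pretrained.get(name)
        if source is not None and source.shape == value.shape:
            merged[name] = source.copy()   # reuse the pre-trained tensor
            reused.append(name)
        else:
            merged[name] = value           # keep the fresh initialisation
            kept.append(name)
    return merged, reused, kept

rng = np.random.default_rng(0)
# Hypothetical layer names and shapes; a real model has many more tensors.
landsat_model = {
    "backbone/conv1": rng.normal(size=(3, 3, 3, 8)),
    "classifier/logits": rng.normal(size=(1, 1, 8, 5)),  # 5 Landsat-8 classes
}
vaihingen_model = {
    "backbone/conv1": rng.normal(size=(3, 3, 3, 8)),
    "classifier/logits": rng.normal(size=(1, 1, 8, 6)),  # 6 Vaihingen classes
}

merged, reused, kept = transfer_weights(landsat_model, vaihingen_model)
print("reused:", reused)
print("kept:", kept)
```

After this step the merged weights would be fine-tuned on the target corpus as usual.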
Figure 13 and Figure 14 show 12 sample results from the proposed method. With all strategies applied, the images in the last column (Figure 13f and Figure 14f) are similar to the ground truths (Figure 13b and Figure 14b). Furthermore, the results and scores improve with each strategy added to the network, as shown in Figure 13c–f and Figure 14c–f.
To further evaluate the effectiveness of the proposed GCN152-TL-A, comparisons with the baseline method on one challenging benchmark and one private benchmark are presented in Table 2 and Table 3 for the Landsat-8 dataset of Nan (Thailand) and in Table 4 and Table 5 for the Vaihingen dataset. The experiments on the Landsat-8 and ISPRS datasets demonstrate that the proposed method achieves clear gains over the baseline approach.
Figure 13 and Figure 14 show twelve sample test results from the proposed method on the ISPRS Vaihingen corpus. The results in the last column are again similar to the ground truth in the second column, as on the Landsat-8 corpus. Considering each class (Table 3 and Table 5), our proposed methods achieve the best score in most classes (three out of five).
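For readers who want to reproduce per-class scores of this kind, the sketch below computes pixel-wise precision, recall, and F1 for each class from flattened label maps. It assumes the standard definitions of these measures; the function name `per_class_scores` and the toy labels are illustrative and are not taken from our evaluation code or data.

```python
def per_class_scores(y_true, y_pred, num_classes):
    """Pixel-wise precision, recall, and F1 for each class index."""
    scores = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores.append({"precision": precision, "recall": recall, "f1": f1})
    return scores

# Toy flattened label maps with three hypothetical classes.
truth = [0, 0, 1, 1, 2, 2, 2, 1]
pred = [0, 1, 1, 1, 2, 2, 0, 1]
scores = per_class_scores(truth, pred, num_classes=3)
for index, row in enumerate(scores):
    print(index, row)
```

An overall score can then be obtained by averaging over classes; whether that average matches a score computed from aggregate precision and recall depends on the averaging scheme.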
As can be seen in Figure 13 and Figure 14, our best model outperforms the other advanced models by a considerable margin on each category, especially the impervious surface (IS), tree, and car categories. To show the effectiveness of the proposed methods, we compared them against a number of state-of-the-art semantic segmentation methods, as listed in Table 4 and Table 5 for the ISPRS corpus and in Table 2 and Table 3 for the Landsat-8 corpus. The DCED [31,32,33] and GCN are the versions with ResNet-50 as their backbone. In particular, we re-implemented the DCED with TensorFlow-Slim, since the released code was built on Caffe. Our proposed methods significantly outperform the other methods on both scores.
In terms of computational cost, our framework requires slightly more training time than the baseline approach, DCED, by about 6.25% (6–7 h), and than GCN, by about 4.5% (4–5 h). In our experiment, DCED’s training procedure took approximately 16 h per dataset, finishing after 50 epochs at 1152 s per epoch. Our framework is a modification of the GCN-based deep learning architecture. The channel attention block increases the training time by 20 min compared with the GCN152 method. No additional time is required when reusing pre-trained weights.
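As a quick check of the baseline figure quoted above, the total training time follows directly from the number of epochs and the time per epoch. The helper name `training_hours` is only illustrative.

```python
def training_hours(epochs, seconds_per_epoch):
    """Total training time in hours for a fixed per-epoch cost."""
    return epochs * seconds_per_epoch / 3600.0

# DCED baseline: 50 epochs at 1152 s per epoch.
dced_hours = training_hours(50, 1152)
print(dced_hours)  # 16.0
```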
6. Conclusions and Future Work
In this study, we propose a novel CNN framework to perform semantic labeling on remotely sensed images. Our proposed method achieves excellent performance through three contributions. First, a global convolutional network (GCN) is employed and enhanced with a deeper backbone to better capture complex features. Second, channel attention is proposed to assign a proper weight to each extracted feature at different stages of the network. Finally, domain-specific transfer learning is introduced to alleviate the data scarcity issue by training the initial weights on other remotely sensed corpora, whose resolutions can be different. The experiments were conducted on two datasets: Landsat-8 (medium resolution) and the ISPRS Vaihingen Challenge (very high resolution). The results show that our model, which combines all proposed strategies, outperforms the baseline models in terms of F1 and mean IoU. The final results show that our enhanced GCN outperforms the baseline (DCED) in F1 by 17.48% on the Landsat-8 corpus and by 2.48% on the ISPRS corpus.
In the future, more choices of semantic labeling, modern optimization techniques, and/or other novel activation functions will be investigated and compared to obtain the best GCN-based framework for semantic segmentation in remotely sensed images. Moreover, incorporating other data sources (e.g., a digital surface model) might be needed to increase the accuracy of deep learning for both the CNN and the modern deep learning layer with very low confidence simultaneously. These aforementioned issues will be investigated in future research.
Author Contributions
T.P. performed all the experiments and wrote the paper; P.V. and T.P. performed the results analysis and edited the manuscript. K.J., S.L., T.P. and P.S. reviewed the results. T.P. revised the manuscript.
Funding
This research received no external funding.
Acknowledgments
T. Panboonyuen gratefully acknowledges scholarships from the 100th Anniversary Chulalongkorn University Fund and the 90th Anniversary Chulalongkorn University Fund (Ratchadaphiseksomphot Endowment Fund). We gratefully acknowledge the Geo-Informatics and Space Technology Development Agency (GISTDA), Thailand, for providing the satellite imagery used in this study, and T. Panboonyuen thanks the GISTDA staff (Thanwarat Anan, Suwalak Nakya, Bussakon Satta) for supplying the LANDSAT-8 imagery and supporting ground data.
Conflicts of Interest
The authors declare no conflict of interest.
The following abbreviations are used in this manuscript:
|CNN||Convolutional Neural Network|
|DCED||Deep Convolutional Encoder–Decoder|
|GCN||Global Convolutional Network|
|VHR||Very High Resolution|
- Liu, Y.; Fan, B.; Wang, L.; Bai, J.; Xiang, S.; Pan, C. Semantic labeling in very high resolution images via a self-cascaded convolutional neural network. ISPRS J. Photogramm. Remote Sens. 2018, 145, 78–95.
- Wang, H.; Wang, Y.; Zhang, Q.; Xiang, S.; Pan, C. Gated convolutional neural network for semantic segmentation in high-resolution images. Remote Sens. 2017, 9, 446.
- Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36.
- Panboonyuen, T.; Vateekul, P.; Jitkajornwanich, K.; Lawawirojwong, S. An Enhanced Deep Convolutional Encoder-Decoder Network for Road Segmentation on Aerial Imagery. In Recent Advances in Information and Communication Technology Series; Springer: Cham, Switzerland, 2017; Volume 566.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv, 2016; arXiv:1606.00915.
- Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv, 2017; arXiv:1706.05587.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Panboonyuen, T.; Jitkajornwanich, K.; Lawawirojwong, S.; Srestasathiern, P.; Vateekul, P. Road Segmentation of Remotely-Sensed Images Using Deep Convolutional Neural Networks with Landscape Metrics and Conditional Random Fields. Remote Sens. 2017, 9, 680.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv, 2015; arXiv:1502.03167.
- Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv, 2014; arXiv:1412.6980.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
- Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1520–1528.
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015; pp. 234–241.
- Peng, C.; Zhang, X.; Yu, G.; Luo, G.; Sun, J. Large kernel matters—Improve semantic segmentation by global convolutional network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1743–1751.
- Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Learning a Discriminative Feature Network for Semantic Segmentation. arXiv, 2018; arXiv:1804.09337.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. arXiv, 2017; arXiv:1709.01507.
- Xie, M.; Jean, N.; Burke, M.; Lobell, D.; Ermon, S. Transfer learning from deep features for remote sensing and poverty mapping. arXiv, 2015; arXiv:1510.00098.
- Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How transferable are features in deep neural networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014); Curran Associates, Inc.: Red Hook, NY, USA, 2014; pp. 3320–3328.
- Liu, J.; Wang, Y.; Qiao, Y. Sparse Deep Transfer Learning for Convolutional Neural Network. In Proceedings of the AAAI, San Francisco, CA, USA, 4–9 February 2017; pp. 2245–2251.
- International Society for Photogrammetry and Remote Sensing. 2D Semantic Labeling Challenge. Available online: http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html (accessed on 9 September 2018).
- Valada, A.; Vertens, J.; Dhall, A.; Burgard, W. Adapnet: Adaptive semantic segmentation in adverse environmental conditions. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4644–4651.
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder–decoder with atrous separable convolution for semantic image segmentation. arXiv, 2018; arXiv:1802.02611.
- Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. arXiv, 2018; arXiv:1808.00897.
- Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
- Bilinski, P.; Prisacariu, V. Dense Decoder Shortcut Connections for Single-Pass Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6596–6605.
- Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. DenseASPP for Semantic Segmentation in Street Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 3684–3692.
- Li, Y.; Qi, H.; Dai, J.; Ji, X.; Wei, Y. Fully convolutional instance-aware semantic segmentation. arXiv, 2016; arXiv:1611.07709.
- Zhao, H.; Qi, X.; Shen, X.; Shi, J.; Jia, J. Icnet for real-time semantic segmentation on high-resolution images. arXiv, 2017; arXiv:1704.08545.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
- Badrinarayanan, V.; Handa, A.; Cipolla, R. Segnet: A deep convolutional encoder–decoder architecture for robust semantic pixel-wise labelling. arXiv, 2015; arXiv:1505.07293.
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder–decoder architecture for image segmentation. arXiv, 2015; arXiv:1511.00561.
- Kendall, A.; Badrinarayanan, V.; Cipolla, R. Bayesian segnet: Model uncertainty in deep convolutional encoder–decoder architectures for scene understanding. arXiv, 2015; arXiv:1511.02680.
- Dai, J.; He, K.; Sun, J. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3150–3158.
- Barsi, J.A.; Lee, K.; Kvaran, G.; Markham, B.L.; Pedelty, J.A. The spectral response of the Landsat-8 operational land imager. Remote Sens. 2014, 6, 10232–10251.
- Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Savannah, GA, USA, 2–4 November 2016; Volume 16, pp. 265–283.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2012; pp. 1097–1105.
- Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; ACM: New York, NY, USA, 2014; pp. 675–678.
Figure 1. An overview of the original global convolutional network (GCN) and boundary refinement (BR).
Figure 2. An overview of our proposed network.
Figure 3. An overview of the whole backbone pipeline: (left) the main backbone, varied among ResNet50, ResNet101, and ResNet152; (right) the major drivers of our main classification network (composed of a global convolutional network (GCN) and a boundary refinement (BR) block).
Figure 4. Components of the channel attention block. The red lines represent the downsample operators, respectively. The red line cannot change the size of feature maps. It is only a path for information passing.
Figure 5. The domain-specific transfer learning strategy reuses pre-trained weights of models between two datasets—very high (ISPRS) and medium (Landsat-8; LS-8) resolution images.
Figure 6. Sample satellite images from Nan, a province in Thailand (left), and the corresponding ground truth (right). The labels of the medium resolution dataset include five categories: agriculture (yellow), forest (green), miscellaneous (brown), urban (red), and water (blue).
Figure 7. Overview of the ISPRS 2D Vaihingen Labeling corpus. There are 33 tiles. Numbers in the figure refer to the individual tile flag.
Figure 8. The sample input tile from Figure 7 (left) and corresponding ground truth (right). The label of the Vaihingen Challenge includes six categories: impervious surface (imp surf, white), building (blue), low vegetation (low veg, cyan), tree (green), car (yellow), and clutter/background (red).
Figure 9. Six testing sample input and output satellite images on Landsat-8 in the Nan province in Thailand, where rows refer to different images. (a) Original input image. (b) Target map (ground truth). (c) Output of Encoder–Decoder (Baseline). (d) Output of GCN152. (e) Output of GCN152-A. (f) Output of GCN152-TL-A. The labels of the medium resolution dataset include five categories: Agriculture (yellow), Forest (green), Miscellaneous (Misc, brown), Urban (red), and Water (blue).
Figure 10. Six testing sample input and output satellite images on Landsat-8 in the Nan province in Thailand, where rows refer to different images. (a) Original input image. (b) Target map (ground truth). (c) Output of Encoder–Decoder (Baseline). (d) Output of GCN152. (e) Output of GCN152-A. (f) Output of GCN152-TL-A. The labels of the medium resolution dataset include five categories: Agriculture (yellow), Forest (green), Miscellaneous (Misc, brown), Urban (red), and Water (blue).
Figure 11. Iteration plot on the Landsat-8 corpus of the proposed technique, GCN152-TL-A; x refers to epochs and y refers to different measures. (a) Plot of model loss (cross entropy) on the training and validation datasets. (b) Performance plot on the validation dataset.
Figure 13. Six testing sample input and output aerial images on the ISPRS Vaihingen Challenge corpus, where rows refer to different images. (a) Original input image. (b) Target map (ground truth). (c) Output of Encoder–Decoder (Baseline). (d) Output of GCN152. (e) Output of GCN152-A. (f) Output of GCN152-TL-A. The labels of the Vaihingen Challenge include six categories: impervious surface (imp surf, white), building (blue), low vegetation (low veg, cyan), tree (green), car (yellow), and clutter/background (red).
Figure 14. Six testing sample input and output aerial images on the ISPRS Vaihingen Challenge corpus, where rows refer to different images. (a) Original input image. (b) Target map (ground truth). (c) Output of Encoder–Decoder (Baseline). (d) Output of GCN152. (e) Output of GCN152-A. (f) Output of GCN152-TL-A. The labels of the Vaihingen Challenge include six categories: impervious surface (imp surf, white), building (blue), low vegetation (low veg, cyan), tree (green), car (yellow), and clutter/background (red).
Table 1. Abbreviations of our proposed deep learning methods.
|A||Channel Attention Block|
|GCN||Global Convolutional Network|
|GCN50||Global Convolutional Network with ResNet50|
|GCN101||Global Convolutional Network with ResNet101|
|GCN152||Global Convolutional Network with ResNet152|
|TL||Domain-Specific Transfer Learning|
Table 2. Results of the testing data of the Landsat-8 corpus between baseline and five variations of our proposed techniques in terms of , , , and .
Table 3. Results of the testing data of Landsat-8 corpus between each class with our proposed techniques in terms of .
Table 4. Results of the testing data of the ISPRS 2D semantic labeling challenge corpus between the baseline and five variations of our proposed techniques in terms of , , , and .
Table 5. Results of the testing data of ISPRS Vaihingen Challenge corpus between each class with our proposed techniques in terms of .
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).