A Dual-Path and Lightweight Convolutional Neural Network for High-Resolution Aerial Image Segmentation

Semantic segmentation on high-resolution aerial images plays a significant role in many remote sensing applications. Although the Deep Convolutional Neural Network (DCNN) has shown great performance in this task, it still faces two challenges: intra-class heterogeneity and inter-class homogeneity. To overcome these two problems, a novel dual-path DCNN, which contains a spatial path and an edge path, is proposed for high-resolution aerial image segmentation. The spatial path, which combines the multi-level and global context features to encode the local and global information, is used to address the intra-class heterogeneity challenge. For the inter-class homogeneity problem, a Holistically-nested Edge Detection (HED)-like edge path is employed to detect the semantic boundaries that guide feature learning. Furthermore, we improve the computational efficiency of the network by employing MobileNetV2 as the backbone. We enhance the performance of MobileNetV2 with two modifications: (1) replacing the standard convolution in the last four Bottleneck Residual Blocks (BRBs) with atrous convolution; and (2) removing the convolution stride of 2 in the first layer of BRBs 4 and 6. Experimental results on the ISPRS Vaihingen and Potsdam 2D labeling datasets show that the proposed DCNN achieves real-time inference speed on a single GPU card with better performance than the state-of-the-art baselines.


Introduction
With the rapid development of remote sensing technologies, more and more high-resolution aerial images are available for obtaining information in various domains, such as urban planning, environmental monitoring, landscape classification, disaster relief, and navigation. As a result, accurate and real-time semantic segmentation of high-resolution aerial images is of great significance and is receiving increasing attention. Some traditional image segmentation methods, such as the watershed algorithm [1], graph cuts [2], and random forests [3], have been used to classify high-resolution aerial images. They usually require manually set thresholds and interaction controls and are sensitive to noise; thus, they cannot provide accurate semantic segmentation results.
In the past few years, deep learning methods, especially the Fully Convolutional Network (FCN) [4], have significantly promoted the development of semantic segmentation. Some deep learning based semantic segmentation methods [4][5][6][7][8][9][10] developed for natural images have been applied to high-resolution aerial images and achieved good performance. However, the features extracted by these methods are not good at discriminating: (1) two objects that share the same semantic label but have different appearances, named intra-class heterogeneity, as shown in Figure 1a, where the houses (or cars) have different shapes, sizes, and colors but belong to the same semantic label; and (2) two adjacent objects that have different semantic labels but similar appearances, named inter-class homogeneity, as shown in Figure 1b, where the low vegetation and trees are similar in color but their semantic labels are distinct. To tackle these two challenges, we need to consider each category of pixels as a whole, instead of assigning a semantic label to each pixel independently. To address the intra-class heterogeneity issue, we need to combine the multi-level and global context features to encode the local and global information, so that the network learns discriminative and effective features to correctly categorize variant objects belonging to the same semantic label. Semantic boundaries capture the feature variations between adjacent objects with similar appearance but different semantic labels. We can integrate them into the training process to help the network learn discriminative features that enlarge the inter-class differences. Based on the above two points, we propose a novel Deep Convolutional Neural Network (DCNN) that contains a spatial path and an edge path to tackle the problems of intra-class heterogeneity and inter-class homogeneity in high-resolution aerial images simultaneously.
In remote sensing applications, one of the major challenges is automatically extracting urban objects from data acquired in real time. To the best of our knowledge, most of the proposed semantic segmentation networks for high-resolution aerial images focus on improving accuracy, with little attention paid to computational efficiency. These networks often have a huge number of parameters and a long inference time. In this work, we also take computational efficiency into consideration for semantic segmentation of high-resolution aerial images. The feature extractor of the proposed DCNN is inspired by MobileNetV2 [11], which provides an efficient classification network. We modify it to improve the prediction accuracy by introducing atrous convolution and discarding the strided convolution in the deeper convolutional layers.
The remainder of this paper is arranged as follows. Section 2 gives an overview of related approaches for high-resolution aerial image segmentation. Section 3 describes the proposed method in detail. Section 4 presents the experimental results of our proposed method and comparisons with other methods. The obtained results are discussed in Section 5. Finally, conclusions are drawn in Section 6.

Related Work
In computer vision, while convolutional networks have been used for a long time, their success was limited by the amount of available training images and high-performance computing resources [12]. Since AlexNet [13] won the ImageNet Large Scale Visual Recognition Competition (ILSVRC) in 2012, the DCNN has become the mainstream research method in the field of computer vision and achieved great success in various applications, such as image classification, object detection, semantic segmentation, object tracking, and face recognition.
The goal of semantic segmentation is to assign a semantic label to each pixel in an image [14]. FCN [4] is considered a milestone in deep learning techniques for semantic segmentation, since it demonstrates how DCNNs can be trained end-to-end to solve this problem, efficiently learning how to produce dense pixel-level predictions for input images of arbitrary sizes. SegNet [5] introduces an encoder-decoder architecture for semantic segmentation. The encoder extracts features via convolution, max pooling, and activation layers, while storing the index of each max pooling window. The decoder mirrors the encoder and upsamples its input using the indices stored during the encoding stage. U-Net [6] is a symmetric, U-shaped DCNN that uses skip connections between the downsampling path and the upsampling path. It combines different levels of context information to predict a good segmentation map. DeepLab [7,8] introduces atrous convolution in DCNNs to effectively enlarge the receptive fields without increasing the number of network parameters. Atrous Spatial Pyramid Pooling (ASPP) employs multiple parallel atrous convolutional layers with different dilation rates to exploit multi-scale features, thus capturing objects as well as image context at multiple scales. In DenseNet [9], each layer receives feature maps from all preceding layers and passes on its output feature maps to all subsequent layers. Therefore, the loss can be propagated to earlier layers directly, and the vanishing-gradient problem is alleviated. GCN [10] employs large convolutional kernels and effective receptive fields to address the classification and localization issues for semantic segmentation. PSPNet [15] exploits the global context information of an image by different-region-based context aggregation through the pyramid pooling module. It concatenates the feature extraction layers and the upsampled pyramid pooling layers, combining local and global context information.
These state-of-the-art models employ the following techniques, which are widely used in semantic segmentation algorithms: (1) skip connections between lower convolutional layers and higher convolutional layers to fuse features of different levels for better pixel-level labeling; (2) atrous convolution to enlarge receptive fields without increasing the number of parameters; and (3) a global pooling convolutional layer to guide the location of objects. These techniques are integrated into our method to tackle the intra-class heterogeneity problem. Holistically-nested Edge Detection (HED) [16] is an edge detection DCNN that adopts the FCN architecture with multiple side-outputs for deeply supervised learning. In this paper, we use a HED-structured sub-network to extract semantic boundaries for deep supervision of our network in the learning process, which helps to deal with the problem of inter-class homogeneity.
In remote sensing research, DCNNs have recently been employed for high-resolution aerial image segmentation [17]. Kampffmeyer et al. [18] focused on small object segmentation through measuring the uncertainty of DCNNs. This method achieved high overall accuracy as well as good performance for small object segmentation. Guo et al. [19] exploited FCN with atrous convolution to perform semantic segmentation for high-resolution remote sensing images. They used graph-based segmentation and a selective search method to augment the training data, and conditional random fields (CRF) to refine the segmentation results. Chen et al. [20] proposed a DCNN based on DeepLabv3 [8], which adopted a modified ASPP, a fully connected fusion path, and a pre-trained encoder for high-resolution remote sensing image segmentation. Liu et al. [21] introduced an effective method to detect manhole cover objects in remote sensing images. They designed two sub-networks: a multi-scale output network for manhole cover object-like edge generation, and a multi-level convolution matching network for object detection based on fused feature maps. Schuegraf and Bittner [22] proposed two parallel U-Net-like [6] DCNNs, which merged depth and spectral information. The outputs of the two DCNNs were combined for binary building mask generation. Panboonyuen et al. [23] presented a DCNN based on GCN [10], which adopted more convolutional layers, a channel attention module, and domain-specific transfer learning. Liu et al. [24] proposed an FCN-based DCNN, in which a spatial residual inception (SRI) module was employed to capture and aggregate multi-scale contexts for semantic segmentation by fusing multi-level features. Pan et al. [25] presented a DCNN for building extraction from high-resolution aerial images, which is composed of a U-Net, channel attention mechanisms, and an adversarial network. Benjdira et al. [26] designed an unsupervised algorithm using Generative Adversarial Networks (GANs), which demonstrated improved performance when passing from the ISPRS Potsdam 2D labeling dataset to the ISPRS Vaihingen 2D labeling dataset. Pan et al. [27] presented a novel Dense Pyramid Network (DPN) based on DenseNet to extract and take full advantage of features. They used group convolutions to extract feature maps of each channel of multi-sensor data and channel shuffle to enhance the representation ability of the network. To deal with the class imbalance problem, they adopted the median frequency balanced focal loss. Yao et al. [28] proposed the dense-coordconv network (DCCN) to reduce the loss of spatial features and strengthen object boundaries. This method adopted DenseNet as the backbone and put coordinate information into the feature maps. Liu et al. [29] designed ScasNet to improve the accuracy of manmade objects and intricate fine-structured objects by sequential global-to-local context aggregation in a self-cascaded manner. Wu et al. [30] presented four stacked fully convolutional networks (SFCNs) and a feature alignment framework for multi-label land-cover segmentation. However, training the entire network was time-consuming due to the huge network architecture. Marmanis et al. [31] trained a DCNN to extract scale-dependent class boundaries, and then used it with color and DSM information as input to the FCN to obtain the semantic labels. Although it achieved high accuracy, it was computationally complex and time-consuming.

Proposed Method
In this section, we introduce the details of our proposed DCNN for high-resolution aerial image segmentation. Instead of regarding the high-resolution aerial image segmentation task as a single and independent problem, we formulate it as a multi-task learning framework that explores complementary information and predicts the semantic labels and boundaries simultaneously. A semantic label prediction path is designed to tackle the intra-class heterogeneity problem. It combines the multi-level and global context features to encode the local and global information, learning discriminative and effective features for objects with different appearances but the same semantic label. A boundary prediction path is designed to guide the process of feature learning to differentiate adjacent objects with similar appearance but different semantic labels. The proposed DCNN jointly trains and refines the semantic and boundary information in a unified network. The predictions of semantic labels and boundaries are both pixel-wise classification tasks, which need to extract feature maps first. Our DCNN is thus constructed on a common feature extraction network, which first learns common representations using shared convolutional layers and then appends two parallel paths for multi-level spatial feature fusion and semantic boundary detection, respectively. At the same time, we need to consider the computational efficiency of the DCNN for real-time applications. Therefore, we adopt a high-performance lightweight network architecture, MobileNetV2, as our basic feature extraction network.

Network Architecture
The overall architecture of our proposed DCNN is shown in Figure 2. It is an encoder-decoder network structure. The backbone of the encoder is based on MobileNetV2 with two improvements: (1) replacing the standard convolution in the last four Bottleneck Residual Blocks (BRBs) with atrous convolution; and (2) removing the convolution stride of 2 in the first layer of BRBs 4 and 6. The decoder contains a spatial path and an edge path. The spatial path combines the multi-level features and the global context information stage-by-stage to refine the semantic information. The edge path is a HED-like [16] network, which employs deep supervision at each side-output layer. The parameters of MobileNetV2 are shared and updated for the spatial path and the edge path jointly, while the parameters of the two individual paths are updated independently for inferring the probability of semantic labels and boundaries, respectively. Specifically, the feature maps output by BRBs 3-7 in MobileNetV2 are fed into two different paths (green and blue arrows shown in the figure) in order to acquire the segmentation masks of semantic objects and boundaries at the same time.
Detailed descriptions of the encoder and decoder are given in the following subsections.

Encoder
The encoder extracts features from the input image. The basic structure of the encoder is similar to the original MobileNetV2, except that we remove its fully-connected layers and make two modifications to improve the network performance. Firstly, to effectively enlarge the receptive fields, we replace the standard convolution in BRBs 4-7 with atrous convolution using dilation rates of 2, 4, 8, and 16, respectively. Secondly, to acquire more detailed context information, we change the stride of the first layer of each block sequence in BRBs 4 and 6 from 2 to 1.

MobileNetV2 with Multi-Level Contextual Features
MobileNetV2 is an efficient and lightweight DCNN architecture that has demonstrated state-of-the-art performance on multiple tasks and benchmarks in real-time applications. The basic building block is the BRB, which is a bottleneck depthwise separable convolution with residuals. The structure of a typical BRB is shown in Figure 3. It is composed of three sublayers: a 1 × 1 "Expansion" layer with ReLU6, a 3 × 3 depthwise layer with ReLU6, and a 1 × 1 "Projection" layer without any non-linearity. The BRB architecture applies a non-linear function (ReLU6 [32]) that converts the input to the output by expanding and projecting channels. The 1 × 1 "Expansion" layer is a 1 × 1 convolution that expands the number of channels fed to the 3 × 3 depthwise convolution. The "Expansion" layer always has more output channels than input channels. The ratio between the number of output channels and the number of input channels is given by the expansion factor, whose default value is 6. The 3 × 3 depthwise layer performs lightweight depthwise convolution [33] by applying a single convolution operation per input channel. The 1 × 1 "Projection" layer makes the number of output channels the same as the number of input channels. A residual connection [34] links the input and the output of the block when the convolution stride equals 1, which improves the ability of gradients to propagate across multiplier layers. Each layer is followed by batch normalization and the ReLU6 activation function, except the 1 × 1 "Projection" layer, whose output has no activation function applied to it, because appending a non-linearity after it destroys useful feature information.
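To make the expansion, depthwise, and projection sublayers concrete, the following minimal sketch counts the convolution weights of one BRB. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, biases and batch normalization parameters are ignored, and the channel-equality check for the residual connection follows common MobileNetV2 implementations rather than anything stated above.

```python
def brb_params(c_in: int, c_out: int, expansion: int = 6, kernel: int = 3) -> dict:
    """Count convolution weights in one bottleneck residual block.

    Biases and batch normalization parameters are ignored (assumption).
    """
    hidden = c_in * expansion                 # channels after the "Expansion" layer
    expand = c_in * hidden                    # 1x1 "Expansion" convolution
    depthwise = kernel * kernel * hidden      # one kxk filter per hidden channel
    project = hidden * c_out                  # 1x1 "Projection" convolution
    return {
        "expand": expand,
        "depthwise": depthwise,
        "project": project,
        "total": expand + depthwise + project,
    }


def has_residual(stride: int, c_in: int, c_out: int) -> bool:
    """A skip connection needs stride 1; matching channel counts are assumed too."""
    return stride == 1 and c_in == c_out


if __name__ == "__main__":
    counts = brb_params(c_in=64, c_out=64)    # default expansion factor of 6
    print(counts)
    print("residual:", has_residual(stride=1, c_in=64, c_out=64))
```

For a 64-channel block with the default expansion factor, the depthwise layer holds only a small share of the weights; most of them sit in the two 1 × 1 convolutions.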
The MobileNetV2 used for feature extraction in our proposed DCNN contains a full convolution layer with 32 channels, followed by the seven BRBs described in Table 1. Each BRB contains n basic blocks, as shown in Figure 3. We apply atrous convolution in BRBs 4-7 with dilation rates of 2, 4, 8, and 16 to enlarge the receptive fields and capture context information at different levels. To get more detailed context information, we change the stride of the first layer in BRBs 4 and 6 from 2, as in the original MobileNetV2, to 1.
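The effect of the stride change on feature-map resolution can be checked with a few lines of arithmetic. This is a sketch only: the per-stage strides below are taken from the original MobileNetV2 configuration and are an assumption about Table 1, and the names `ORIGINAL_STRIDES`, `output_stride`, and `remove_strides` are hypothetical.

```python
from math import prod

# Stride of the first layer of each stage in the original MobileNetV2
# (assumed to match Table 1; "conv" is the initial convolution layer).
ORIGINAL_STRIDES = {
    "conv": 2, "BRB1": 1, "BRB2": 2, "BRB3": 2,
    "BRB4": 2, "BRB5": 1, "BRB6": 2, "BRB7": 1,
}


def output_stride(strides: dict) -> int:
    """Total downsampling factor between the input image and the last feature map."""
    return prod(strides.values())


def remove_strides(strides: dict, blocks=("BRB4", "BRB6")) -> dict:
    """Return a copy in which the listed blocks use stride 1 instead of 2."""
    modified = dict(strides)
    for name in blocks:
        modified[name] = 1
    return modified


if __name__ == "__main__":
    modified = remove_strides(ORIGINAL_STRIDES)
    for label, cfg in (("original", ORIGINAL_STRIDES), ("modified", modified)):
        s = output_stride(cfg)
        print(f"{label}: output stride {s}, 256x256 patch -> {256 // s}x{256 // s}")
```

Under these assumptions the modified encoder downsamples by a factor of 8 instead of 32, which matches the 8× bilinear upsampling used in the decoder.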

Atrous Convolution
In the application of DCNNs for semantic segmentation, max pooling and strided convolution are employed to reduce the memory occupancy and enlarge the receptive fields. As a result, the resolution of the output feature maps is reduced significantly. Although "deconvolution" [4] layers or upsampling operations can be used, the loss of spatial information, especially boundary information, is too large. Atrous convolution, also called dilated convolution, has been shown to enlarge the receptive fields without reducing the image resolution [7]. In the case of 1D atrous convolution, given an input signal x(i) and a filter w(k) of length K, the output y(i) is defined as:

$$y(i) = \sum_{k=1}^{K} x(i + r \cdot k)\, w(k) \tag{1}$$

where r is the dilation rate that indicates the stride at which we sample the input signal. When r = 1, it is standard convolution. In atrous convolution, the convolution kernel is expanded by the dilation rate, and r − 1 zeros are inserted along the spatial dimension between adjacent weights to create a sparse filter. The size of the receptive field of a single atrous convolutional layer is therefore the size of this expanded kernel:

$$K_{r} = K + (K - 1)(r - 1) \tag{2}$$

Figure 4 gives a simple example of 2D atrous convolution. Figure 4a shows a standard 3 × 3 convolution, the special case of dilation rate = 1, covering a 3 × 3 receptive field. Figure 4b demonstrates a 3 × 3 atrous convolution with dilation rate = 2. While the convolution kernel size is still 3 × 3, the receptive field is increased to 5 × 5. Figure 4c illustrates a 3 × 3 atrous convolution with dilation rate = 3. Its receptive field is 7 × 7, but the actual number of parameters is still 3 × 3.
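The 1D definition above can be sketched in a few lines of plain Python. The sketch assumes the usual form of atrous convolution, in which y(i) sums x(i + r·k)·w(k) over the filter taps, uses zero-based indices, and keeps only positions where the filter fits entirely inside the signal; the function names are hypothetical. It also checks the sparse-filter view: inserting r − 1 zeros between adjacent weights and running a standard convolution gives the same output.

```python
def atrous_conv1d(x, w, rate=1):
    """1D atrous convolution: y[i] = sum_k x[i + rate * k] * w[k].

    Indices are zero-based and only "valid" positions are returned,
    i.e. positions where every sampled input element exists.
    """
    span = rate * (len(w) - 1)            # distance between first and last tap
    return [
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span)
    ]


def dilate_filter(w, rate):
    """Insert rate - 1 zeros between adjacent weights to build the sparse filter."""
    sparse = []
    for k, weight in enumerate(w):
        if k > 0:
            sparse.extend([0] * (rate - 1))
        sparse.append(weight)
    return sparse


if __name__ == "__main__":
    signal = list(range(10))
    weights = [1, 2, 3]
    direct = atrous_conv1d(signal, weights, rate=2)
    via_sparse = atrous_conv1d(signal, dilate_filter(weights, 2), rate=1)
    print(direct)
    print(direct == via_sparse)
```

The number of learned weights stays at three in both cases; only the spacing of the sampled inputs changes.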

Decoder
The function of the decoder is to predict the semantic label of each pixel at the same resolution as the input image. It constructs the pixel-wise semantic labels from the feature maps extracted by the encoder. The feature maps output by the encoder are low in resolution and have many channels, with each channel representing a particular feature. As shown in Figure 2, the spatial path is used to merge these feature maps through a series of convolutional layers and Channel Attention Modules (CAMs). The edge path is employed to detect semantic boundaries by deep supervision. Finally, 8× bilinear upsampling is applied to the feature maps output by the spatial path to recover the resolution.

Spatial Path
The intra-class heterogeneity problem is mainly caused by the lack of context information. Therefore, we need multi-level receptive fields and context information to refine the spatial information. The outputs of BRBs 3-7 in MobileNetV2 have different receptive fields. In the lower blocks, the network outputs features with fine spatial information but poor semantic information, because the receptive fields are small and there is no guidance from the spatial context. In the upper blocks, the features have good semantic information because of the large receptive fields, but the feature maps are coarse. To sum up, the lower blocks provide finer spatial predictions, while the upper blocks generate more accurate semantic predictions. Therefore, we introduce the spatial path to take advantage of these blocks for better predictions. In our spatial path, we sum up the features of adjacent blocks stage-by-stage, as shown in Figure 2. Then, these layers are concatenated together and fed to a convolution layer to further fuse features of different receptive fields. However, receptive fields of different scales provide features with different discriminative power, resulting in inconsistent semantic segmentation results. Therefore, to generate an identical semantic label for a given class, we need to use more discriminative features. Here, we adopt the high-level semantic information generated by global average pooling from BRB 7. With this global context information, we introduce the strongest consistency constraint into the network as guidance.
Furthermore, to refine the features of each BRB, we propose a specific CAM inspired by SENet [35]. As shown in Figure 5, the CAM is designed to assign a weight factor to each feature channel, which guides the feature learning adaptively and assigns higher weights to important channels. It applies global average pooling to each feature channel to encode an attention vector, which is used to re-weight the original features.
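The pooling and re-weighting steps of such a channel attention module can be sketched as follows. This is a simplified illustration, not the module of Figure 5: the learned layers that turn the pooled vector into attention weights are replaced by a hypothetical per-channel scalar gate followed by a sigmoid, and feature maps are plain nested lists indexed as `[channel][row][column]`.

```python
import math


def global_average_pool(features):
    """Average each channel over its spatial positions -> one value per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]


def channel_attention(features, gate_weights=None):
    """Re-weight each channel by an attention value in (0, 1).

    gate_weights is a hypothetical stand-in for the learned layers of a real
    attention module; by default every channel uses a gate of 1.0.
    """
    pooled = global_average_pool(features)
    if gate_weights is None:
        gate_weights = [1.0] * len(pooled)
    attention = [1.0 / (1.0 + math.exp(-g * p)) for g, p in zip(gate_weights, pooled)]
    reweighted = [
        [[a * value for value in row] for row in channel]
        for a, channel in zip(attention, features)
    ]
    return attention, reweighted


if __name__ == "__main__":
    feats = [
        [[0.0, 0.0], [0.0, 0.0]],   # channel with no response
        [[2.0, 2.0], [2.0, 2.0]],   # channel with a strong response
    ]
    attention, out = channel_attention(feats)
    print(attention)
    print(out)
```

The attention vector has one entry per channel, so the re-weighting scales whole channels and leaves their spatial layout unchanged.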

Edge Path
In the task of high-resolution aerial image segmentation, it is hard to discriminate two classes with similar appearance when they are spatially adjacent. To improve the discriminative ability of the network on this problem, we introduce the edge path (as shown in Figure 2) to guide the feature learning for the semantic segmentation task. To extract the semantic boundaries accurately, we adopt the network architecture proposed in HED [16]. In our proposed edge path, we attach side-outputs to the last five blocks of MobileNetV2 for semantic boundary detection. We apply deep supervision at each side-output block to learn multi-level representations for semantic boundary predictions. In detail, 1 × 1 convolutional layers with one channel are appended to each of the last five blocks of MobileNetV2 to generate semantic boundary score maps. Then, these score maps are concatenated together and fed to a 1 × 1 convolutional layer with one channel to output the final score map. This semantic boundary detection network can distinguish the semantic boundaries between two adjacent objects that belong to different classes, making the distinction between inter-class features as large as possible.
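The fusion step can be illustrated with a small sketch. Concatenating K single-channel score maps and applying a 1 × 1 convolution with one output channel amounts to a per-pixel weighted sum of the K maps plus a bias. The code below assumes the side-output maps already share one resolution and applies a sigmoid to obtain boundary probabilities; both points, and the function names, are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse_side_outputs(side_maps, weights, bias=0.0):
    """Fuse K single-channel score maps with a 1x1 convolution (one output channel).

    side_maps: list of K maps, each a list of rows, all with the same size.
    weights:   K scalars, the weights of the 1x1 convolution.
    Returns a map of boundary probabilities (sigmoid output is an assumption).
    """
    height, width = len(side_maps[0]), len(side_maps[0][0])
    return [
        [
            sigmoid(bias + sum(w * m[i][j] for w, m in zip(weights, side_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]


if __name__ == "__main__":
    # Five side outputs, as in the edge path; values are made up for the demo.
    maps = [[[0.0, 1.0]] for _ in range(5)]
    fused = fuse_side_outputs(maps, weights=[0.2] * 5)
    print(fused)
```

With equal weights the fused map is simply the sigmoid of the mean side-output score; training would adjust the five weights so that more reliable blocks contribute more.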

Loss Function
In this paper, we use the Softmax loss to supervise the training of the spatial path, and adopt the binary cross-entropy loss to supervise the training of the edge path. The training of the network is formulated as a per-pixel classification problem with respect to the ground-truth segmentation masks, which include the semantic objects and their boundaries. Therefore, the loss function of our DCNN can be written as:

$$L = L_{spatial} + \alpha L_{edge} + \beta \sum_{k=1}^{K} L_{side}^{k} \tag{3}$$

where $L_{spatial}$ is the Softmax loss of the spatial path, and $L_{edge}$ and $L_{side}^{k}$ denote the binary cross-entropy losses of the fused edge and of the k-th side-output edge in the edge path, respectively. The number of edge side-outputs, K, is 5. α and β are the balance weights.
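A minimal sketch of how the three loss terms might be combined is given below. It assumes that α weights the fused-edge loss and β weights the sum of the K side-output losses; this assignment is an assumption made for illustration, and the per-pixel loss helpers are simplified, hypothetical stand-ins for the losses a deep learning framework would provide.

```python
import math


def softmax_cross_entropy(logits, label):
    """Softmax loss for one pixel: -log(softmax(logits)[label])."""
    m = max(logits)                                   # subtract max for stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def binary_cross_entropy(prob, target, eps=1e-7):
    """Binary cross-entropy for one pixel; prob is a boundary probability."""
    p = min(max(prob, eps), 1.0 - eps)                # avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def total_loss(l_spatial, l_edge, l_sides, alpha, beta):
    """Weighted sum of the spatial, fused-edge, and side-output losses.

    Assumption: alpha scales the fused-edge loss and beta scales the sum of
    the side-output losses.
    """
    return l_spatial + alpha * l_edge + beta * sum(l_sides)


if __name__ == "__main__":
    l_spatial = softmax_cross_entropy([2.0, 0.5, -1.0, 0.0, 0.1, -0.3], label=0)
    l_edge = binary_cross_entropy(0.8, target=1)
    l_sides = [binary_cross_entropy(p, target=1) for p in (0.6, 0.7, 0.7, 0.8, 0.9)]
    print(total_loss(l_spatial, l_edge, l_sides, alpha=1.0, beta=1.0))
```

In practice each term would be averaged over all pixels of a batch before being combined.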

Network Training
In this section, we introduce our training details of the proposed network.

Transfer Learning
In the remote sensing domain, due to the high cost and complicated acquisition process, there is insufficient training data with accurate annotations for the semantic segmentation task. Compared with the limited data in remote sensing, much more training data of natural images is available. Studies (e.g., [36]) have shown that transfer learning in DCNNs can alleviate the problem of insufficient training data. The parameters learned in the lower layers of a DCNN can be shared across tasks, while those in the higher layers are specific to different tasks. Therefore, transferring the parameters learned from other domains can help reduce overfitting and make the network converge quickly.
We utilized a two-step training procedure to train our DCNN. Firstly, the original MobileNetV2, which was developed for the image classification task, was trained on the ImageNet dataset [37]. Then, the encoder was initialized with the parameters pre-trained in the first step, while the remaining layers were randomly initialized with Kaiming initialization [38], and the whole DCNN was fine-tuned on the ISPRS 2D semantic labeling dataset [39] in an end-to-end manner.

Implementation Details
We trained the proposed DCNN using stochastic gradient descent (SGD) [13] with a batch size of 16, a base learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.0005. The learning rate of the pre-trained weights was set to half of the base learning rate. We trained the DCNN for 50 epochs and divided the learning rate by 10 after 25, 35, and 45 epochs. As for α and β in Equation (3), we finally used the values of 50 and 0.0025, respectively, after a series of experiments.
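The step schedule described above can be written as a small function. This is a sketch under two assumptions: epochs are counted from zero, so the first drop takes effect at epoch index 25, and the halved rate of the pre-trained weights follows the same decay steps. The function name and its parameters are hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(25, 35, 45),
                  gamma=0.1, pretrained=False):
    """Learning rate for a zero-based epoch index under a step schedule.

    The rate is multiplied by gamma once for every milestone already reached.
    Pre-trained (encoder) weights use half of the resulting rate (assumption:
    the halving is applied on top of the same decay steps).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    lr = base_lr * gamma ** drops
    return 0.5 * lr if pretrained else lr


if __name__ == "__main__":
    for epoch in (0, 24, 25, 35, 45, 49):
        print(epoch, learning_rate(epoch), learning_rate(epoch, pretrained=True))
```

With 50 training epochs this yields four plateaus, the last of which covers only the final five epochs.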
As the images of the ISPRS 2D semantic labeling dataset are of very high resolution, we could not feed them directly into our DCNN. We randomly cropped the images into 256 × 256 patches as the inputs of each epoch. To avoid overfitting, data augmentation was employed. We used mean subtraction and random cropping on the input image patches to augment the dataset in the training process. For the semantic boundary labels, we extracted the boundaries from the ground truth of the semantic segmentation with the MATLAB imgradient function.
Our DCNN was implemented in the PyTorch [40] framework. All experiments were executed on a Linux PC running 64-bit Ubuntu 18.04, with an Intel Core i7-5930K CPU, 64 GB of memory, and an NVIDIA GeForce GTX TITAN X GPU with 12 GB of memory.
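The idea of deriving boundary labels from a segmentation ground truth can be sketched without MATLAB. The function below is a simple stand-in, not an equivalent of imgradient: it marks a pixel as boundary whenever one of its four neighbours carries a different class label, which produces boundaries roughly two pixels wide. The function name is hypothetical.

```python
def boundary_mask(labels):
    """Binary boundary map from a 2D label map (list of rows of class indices).

    A pixel is marked 1 if any 4-connected neighbour has a different label.
    This is a simple substitute for a gradient-based boundary extraction.
    """
    height, width = len(labels), len(labels[0])
    mask = [[0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            for di, dj in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                ni, nj = i + di, j + dj
                inside = 0 <= ni < height and 0 <= nj < width
                if inside and labels[ni][nj] != labels[i][j]:
                    mask[i][j] = 1
                    break
    return mask


if __name__ == "__main__":
    ground_truth = [
        [0, 0, 1, 1],
        [0, 0, 1, 1],
    ]
    for row in boundary_mask(ground_truth):
        print(row)
```

Pixels on both sides of a class change are marked, so the mask needs no knowledge of which class is foreground.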

Dataset
We evaluated the proposed network on the ISPRS 2D semantic labeling benchmark dataset [39]. It comprises very high-resolution aerial images over two cities in Germany, Vaihingen and Potsdam. The semantic labels of the dataset contain six classes: impervious surfaces (e.g., roads), buildings, low vegetation, trees, cars, and clutter.

ISPRS Potsdam
The Potsdam dataset comprises 38 images with a size of 6000 × 6000 pixels at a spatial resolution of 5 cm. Twenty-four tiles form the training set, and the other 14 tiles are reserved as the test set. Each tile has the following bands: infrared (IR), red (R), green (G), and blue (B) color channels; DSM; and nDSM. Within the training set, we selected 18 tiles (2_10, 2_11, 3_10, 3_11, 4_10, 4_11, 5_10, 5_11, 6_7, 6_8, 6_9, 6_10, 6_11, 7_7, 7_8, 7_9, 7_10, and 7_11) for training and 6 tiles (2_12, 3_12, 4_12, 5_12, 6_12, and 7_12) for validation. Note that only the three-band IRRG images extracted from the raw four-band IRRGB data were used; the DSM and nDSM data of this dataset were not used.

Evaluating Metrics
To measure the performance of different DCNNs, we used the following two metrics: overall accuracy and F1. Let TP denote the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. Overall accuracy is a metric that takes into account all correctly classified pixels without distinction. It can be written as [41]:

$$OA = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

F1 is the harmonic mean of precision and recall. It is defined as [41]:

$$F1 = 2 \times \frac{precision \times recall}{precision + recall} \tag{5}$$

where $precision = \frac{TP}{TP + FP}$ and $recall = \frac{TP}{TP + FN}$.
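Both metrics follow directly from the four counts. The sketch below uses the standard definitions of overall accuracy and of F1 as the harmonic mean of precision and recall; the function names and the example counts are made up for illustration, and returning 0 when a denominator is zero is a convention chosen here.

```python
def overall_accuracy(tp, tn, fp, fn):
    """Share of correctly classified pixels among all pixels."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up counts for one class, for illustration only.
    tp, tn, fp, fn = 8, 80, 2, 10
    print("OA:", overall_accuracy(tp, tn, fp, fn))
    print("F1:", f1_score(tp, fp, fn))
```

The example shows why both metrics are reported: a class that covers few pixels can have a low F1 score while the overall accuracy stays high.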

Ablation Study
In this subsection, we decompose our method to study how each component affects the segmentation performance. We used the unmodified MobileNetV2 (denoted as MNetV2) and the spatial path (described in Section 3.3.1) as our base semantic segmentation network. Then, we evaluated whether the modified MobileNetV2 (denoted as MNetV2*, described in Section 3.2.1) and the edge path (described in Section 3.3.2) benefit the final segmentation performance. As shown in Table 2, MNetV2* achieved higher accuracy than MNetV2, which demonstrates that MNetV2* preserves more useful information and provides more contextual detail. This is because removing the stride of the first layer in BRBs 4 and 6 and employing atrous convolution in BRBs 4-7 provide more detailed information and enlarge the receptive fields. These modifications improved the overall accuracy from 86.09% to 88.72%. In particular, the F1 score of the car category improved by a large margin, from 59.43% to 82.98%. The edge path is employed to address the inter-class homogeneity problem. Under the guidance of the deep supervisory signal from the edge path, the network could discriminate the semantic boundaries between two adjacent objects. This further improved the overall accuracy from 88.72% to 89.61%.
Figure 6 presents visual comparisons of the segmentation results on the ISPRS Vaihingen test tiles. The first row is an image patch with building roofs of different shapes. We can observe that MNetV2+SP confused similar manmade objects, such as building roofs and roads, and localized buildings inaccurately. MNetV2*+SP could predict the shapes of the building roofs more accurately and distinguish building roofs from roads with similar colors. With the help of the edge path, MNetV2*+SP+EP could label the contours of building roofs more clearly. The second row of Figure 6 is an image patch with highly inconsistent cars. MNetV2+SP labeled all the cars together, while MNetV2*+SP and MNetV2*+SP+EP could discriminate almost all of the cars clearly. For the four cars in the top-right corner of the image patch, MNetV2*+SP+EP could detect their contours with the help of the edge path and label them one by one, while MNetV2*+SP could not separate them completely. Low vegetation and trees are prone to be confused by DCNNs due to their similar colors, as shown in the third and fourth rows of Figure 6. MNetV2+SP and MNetV2*+SP mislabeled some low vegetation areas as trees, while MNetV2*+SP+EP could provide relatively proper segmentation results for these plants under the guidance of the semantic boundaries detected by the edge path. The clutter category is hard for DCNNs to label properly because it contains a variety of different objects, as shown in the fifth row of Figure 6. We can observe that MNetV2+SP could not give a good prediction. The results of MNetV2*+SP are relatively better, but still not accurate enough. MNetV2*+SP+EP produced more accurate and robust segmentation results. Figure 6g shows the semantic boundaries generated by the edge path of our proposed DCNN. We can observe that our DCNN could predict accurate and clear object contours while suppressing most of the scattered and minor edge responses inside the objects.


Comparison with Other Methods
To verify the performance, we evaluated the proposed DCNN on the test tiles of the ISPRS 2D semantic labeling dataset and compared it with the widely used lightweight models listed below:
(1) ICNet: Zhao et al. [42] introduced an Image Cascade Network (ICNet) that incorporates multi-resolution branches under proper label guidance to reduce computation, and then fuses these branches to generate the final results.
(2) ESPNet: Mehta et al. [43] proposed ESPNet for semantic segmentation of high-resolution images. It is based on the Efficient Spatial Pyramid (ESP) module, which is computationally efficient.
(3) BiSeNet: BiSeNet [44] uses two parallel paths: a spatial path with a small stride to preserve the spatial information and generate high-resolution feature maps, and a context path with fast downsampling to obtain large receptive fields. To improve accuracy without loss of speed, a Feature Fusion Module (FFM) is employed to fuse the two paths and refine the final prediction. We used ResNet18 as the backbone of BiSeNet in the experiment.
(4) LW_RefineNet: LW_RefineNet [45] is a lightweight version of RefineNet [46]. It reduces the number of parameters and floating point operations of the original RefineNet by replacing the 3 × 3 convolutional layers with 1 × 1 convolutional layers and removing the Residual Convolutional Unit (RCU). We used LW_RefineNet-50 as the comparison model.
Table 3 shows the number of parameters of the compared models. Our model has 2.3M parameters, which is more than ESPNet and fewer than the others.

Comparison on the ISPRS Vaihingen Dataset
The quantitative results of the compared models are given in Table 4. As shown in the table, our proposed method outperformed the others on the F1 score of each category, the average F1 score, and the overall accuracy, especially for the low vegetation and car categories. Moreover, as shown by the ROC and PR curves in Figure 7, our method provided better performance on all categories.

Figure 8 compares the qualitative performance on the ISPRS Vaihingen test tiles. In the first row, which shows fine-structured buildings, ICNet and ESPNet provided inaccurate and incomplete labeling, while BiSeNet and LW_RefineNet were relatively better. Our proposed DCNN generated more coherent segmentation results. The second row of Figure 8 is an image patch with cars of different shapes and colors, which is a representative intra-class heterogeneity problem. ICNet, ESPNet, BiSeNet, and LW_RefineNet were less effective at labeling these confusing cars separately. In contrast, our proposed method could generate good segmentation results with precise semantic boundaries. The third and fourth rows show low vegetation and trees that are similar in color, which represents the challenge of inter-class homogeneity. ICNet, ESPNet, BiSeNet, and LW_RefineNet confused these two classes and mislabeled some low vegetation areas as trees. Our network presented more accurate and robust labeling due to the employment of the edge path. The clutter category contains confusing manmade objects and is hard to label, as shown in the fifth row of Figure 8. Deep models often mislabel clutter as cars because of similar shapes and colors, and confuse it with buildings because of similar colors. We can observe that ESPNet mislabeled more than half of the clutter as cars and buildings. The results of ICNet, BiSeNet, and LW_RefineNet were relatively good, but they still mislabeled about half of the clutter. Our DCNN presented better labeling than all the above methods. Therefore, our proposed method gave better visual quality on the ISPRS Vaihingen test tiles.
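For reference, the per-category F1 score, average F1 score, and overall accuracy used in this comparison can all be derived from a pixel-wise confusion matrix. The following is a minimal NumPy sketch; the function names are hypothetical, and which categories enter the average (for example, whether clutter is excluded) is left to the caller as an assumption.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels: rows are reference classes, columns are predictions."""
    idx = num_classes * y_true.ravel() + y_pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def f1_and_overall_accuracy(cm):
    """Return (per-class F1, mean F1, overall accuracy) from a confusion matrix."""
    cm = cm.astype(np.float64)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)  # TP + FP per class
    reference = cm.sum(axis=1)  # TP + FN per class
    denom = predicted + reference
    # F1 = 2TP / (2TP + FP + FN); classes absent from both maps get 0.
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    overall_accuracy = tp.sum() / cm.sum()
    return f1, f1.mean(), overall_accuracy


if __name__ == "__main__":
    # Toy 2 x 3 label maps with three classes (illustrative values only).
    y_true = np.array([[0, 0, 1], [1, 2, 2]])
    y_pred = np.array([[0, 1, 1], [1, 2, 0]])
    f1, mean_f1, oa = f1_and_overall_accuracy(confusion_matrix(y_true, y_pred, 3))
    print(f1, mean_f1, oa)
```

Accumulating one confusion matrix over all test tiles before computing the scores, rather than averaging per-tile scores, keeps large and small tiles weighted by their pixel counts.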

Comparison on the ISPRS Potsdam Dataset
The numerical results of the compared deep models are listed in Table 5. As shown in the table, our model achieved the best performance in terms of per-category F1 score, average F1 score, and overall accuracy. Furthermore, the ROC and PR curves shown in Figure 9 also verify the advantage of our proposed DCNN. Figure 10 exhibits the visual comparisons between the deep models on the ISPRS Potsdam test tiles. We can observe that all four comparison models were less effective at discriminating manmade objects, such as buildings and roads, while our model could generate more precise segmentation results. For the confusing categories of low vegetation and trees, our proposed method also performed better than the others. Although there are a few flaws in the segmentation results of our model, it provides more coherent labeling and more accurate semantic boundaries.
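The ROC and PR curves referred to above are typically computed per category in a one-vs-rest fashion from the predicted class scores. The sketch below shows one way to obtain the curve points with NumPy. It is an illustration under stated assumptions, not the evaluation code used in the paper: the function name is hypothetical, tied scores are not merged into a single threshold, and both positive and negative pixels are assumed to be present.

```python
import numpy as np


def roc_pr_points(scores, is_positive):
    """Return (fpr, tpr, precision) after thresholding at each sorted score.

    scores: 1-D array of predicted scores for one category.
    is_positive: 1-D boolean array, True where the reference label is that category.
    """
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = is_positive[order].astype(np.float64)
    tp = np.cumsum(hits)                # true positives when keeping the top-k pixels
    fp = np.cumsum(1.0 - hits)          # false positives when keeping the top-k pixels
    n_pos = hits.sum()
    n_neg = hits.size - n_pos
    tpr = tp / n_pos                    # recall, the y-axis of ROC and x-axis of PR
    fpr = fp / n_neg                    # the x-axis of ROC
    precision = tp / (tp + fp)          # the y-axis of PR
    return fpr, tpr, precision


if __name__ == "__main__":
    scores = np.array([0.9, 0.8, 0.3, 0.1])           # illustrative values only
    is_positive = np.array([True, False, True, False])
    fpr, tpr, precision = roc_pr_points(scores, is_positive)
    print(fpr, tpr, precision)
```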

Running Time
For the comparison of running time, we used the same image patch size of 1024 × 1024 for all models. Table 6 shows the number of frames per second (FPS) processed by each of the compared models on a single NVIDIA Titan X GPU. Our network runs at a speed comparable to that of ESPNet while achieving better accuracy. The inference speed indicates that our network can be used for high-resolution aerial image segmentation in real time.
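A throughput figure of this kind is usually obtained by timing repeated forward passes on a fixed-size patch. The sketch below shows the general pattern in plain Python. The function `measure_fps` and its parameters are hypothetical, and `infer` is assumed to block until the result is ready; on a GPU, this normally requires an explicit device synchronization inside `infer`, otherwise the timing underestimates the true cost.

```python
import time


def measure_fps(infer, patch, warmup=3, runs=20):
    """Average frames per second of `infer(patch)` over `runs` timed calls."""
    for _ in range(warmup):  # untimed calls to exclude one-off setup cost
        infer(patch)
    start = time.perf_counter()
    for _ in range(runs):
        infer(patch)
    elapsed = time.perf_counter() - start
    return runs / elapsed


if __name__ == "__main__":
    # Stand-in workload instead of a real network (illustrative only).
    dummy_patch = list(range(100_000))
    print(f"{measure_fps(sum, dummy_patch):.1f} FPS")
```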

Performance Discussion on the Benchmarks
Table 7 shows the quantitative results on the ISPRS Vaihingen and Potsdam test tiles. As shown, the F1 scores of all the categories are above 82.00%. For the impervious surface and building categories, the F1 scores are even greater than 92.00%, which is comparable to the accuracy of manual annotation by humans. This demonstrates the effectiveness of the stage-by-stage multi-scale context aggregation strategy adopted in the spatial path. For the confusing categories of low vegetation and trees, the results are also competitive, which is mainly due to the employment of the edge path in our network. For the fine-structured cars, our model provides robust segmentation results, especially on the ISPRS Potsdam dataset, where it achieves an F1 score of 94.12%. This performance is mainly derived from the employment of the modified MobileNetV2 and the spatial path. Overall, our method achieved an average F1 score of 88.35% and an overall accuracy of 89.61% on the ISPRS Vaihingen dataset, and an average F1 score of 91.27% and an overall accuracy of 89.93% on the ISPRS Potsdam dataset. This verifies the effectiveness of our proposed method in improving the segmentation accuracy of high-resolution aerial images. The quantitative results on the ISPRS Potsdam dataset are slightly better than those on the ISPRS Vaihingen dataset. The possible reasons are that the spatial resolution of the ISPRS Potsdam dataset is higher than that of the ISPRS Vaihingen dataset, and that the ISPRS Potsdam dataset provides more training samples. The visual performance of our proposed method on the two datasets is shown in Figure 11. We can observe that our network could obtain coherent and robust labeling results. Moreover, our method could provide labeling with smooth boundaries and accurate localization, especially for the easily confused low vegetation and trees. For fine-structured buildings and cars, it can label most of them precisely with coherent contours.

Although our method achieved competitive results on the two public benchmarks, it still has limitations in dealing with high-resolution aerial images with complex backgrounds. In the ISPRS Vaihingen dataset, our network confused some parts of the buildings with impervious surface, as shown in the second column of Figure 11a. The white parts of these buildings are very difficult to identify, even for humans. While our network could distinguish low vegetation and trees better by employing the edge path, it still confused them in some challenging situations. As shown in the second column of Figure 11a, it mislabeled low vegetation in shadow as trees. Incorporating elevation information (such as a DSM) may further improve the discriminative ability of our method on these two categories. As shown in the fourth and fifth columns of Figure 11a, our network mislabeled tiny houses as cars, which are very similar in shape and color. As shown in the sixth column of Figure 11a, our method mislabeled some parts of the buildings as low vegetation, because the color of the rooftops is similar to the color of low vegetation. In the ISPRS Potsdam dataset, our method could not perform well in labeling clutter, which contains various categories of objects. The above limitations show that, with only the IRRG channels of the image, the DCNN can only learn features based on the color and context information of the objects. When objects are similar in color and context, the DCNN cannot distinguish them in some difficult situations. In the remote sensing domain, other helpful information (such as a DSM) is available in some cases, and we can use it to improve the performance of our DCNN in the future.

Influence of Semantic Boundary on Segmentation Results
Figure 12 shows some examples of the predicted semantic boundaries generated by the edge path. We can observe that the edge path can provide good semantic boundaries between different semantic objects, which offer important guiding information for discriminating them. The examples shown in Figure 12 demonstrate that, when the semantic boundary maps in Figure 12e are as accurate as the corresponding groundtruths in Figure 12d, the final semantic segmentation results are of high quality. However, when the semantic boundary maps fail to generate strong edge responses on some semantic boundaries, or fail to suppress disturbing responses inside the objects (such as in the areas marked by the red circles in Figure 12), the network can hardly provide precise segmentation results and may even produce incorrect ones. These results demonstrate that the edge path provides significant information for accurate semantic segmentation in our network architecture.
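The boundary groundtruths used to supervise and assess the edge path are derived from the segmentation groundtruths (the caption of Figure 6 mentions the MATLAB imgradient function). As an illustration only, the sketch below uses a simpler substitute technique: a pixel is marked as a boundary if any of its 4-connected neighbours has a different class label. The function name is hypothetical and the output is not claimed to match the imgradient result.

```python
import numpy as np


def semantic_boundaries(labels):
    """Mark pixels that have a 4-connected neighbour with a different class label."""
    edges = np.zeros(labels.shape, dtype=bool)
    diff_h = labels[:, 1:] != labels[:, :-1]  # label change between horizontal neighbours
    diff_v = labels[1:, :] != labels[:-1, :]  # label change between vertical neighbours
    edges[:, 1:] |= diff_h   # pixel to the right of a change
    edges[:, :-1] |= diff_h  # pixel to the left of a change
    edges[1:, :] |= diff_v   # pixel below a change
    edges[:-1, :] |= diff_v  # pixel above a change
    return edges


if __name__ == "__main__":
    # Toy label map with two classes split down the middle (illustrative only).
    labels = np.array([[0, 0, 1, 1],
                       [0, 0, 1, 1]])
    print(semantic_boundaries(labels).astype(int))
```

Marking both sides of each label change gives a two-pixel-wide boundary, which is somewhat more tolerant of small localization errors than a one-pixel contour.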

Conclusions
In this work, a novel dual-path and lightweight DCNN is proposed for semantic segmentation of high-resolution aerial images. We design the spatial path and the edge path to address the challenges of intra-class heterogeneity and inter-class homogeneity in high-resolution aerial image segmentation. The spatial path makes full use of multi-level features and reduces the loss of spatial information. The edge path is a HED-like network used to predict the semantic boundaries for deep supervision. Moreover, we enhance the computational efficiency of the proposed DCNN by employing the backbone of MobileNetV2. We modify the base MobileNetV2 in the following two aspects: (1) replacing the standard convolution in the last four BRBs with atrous convolution; and (2) removing the convolution stride of the first layer in BRBs 4 and 6. Experimental results on the ISPRS 2D semantic labeling dataset illustrate the advantages of our proposed DCNN. The proposed network was compared with other lightweight DCNNs, namely ICNet, ESPNet, BiSeNet, and LW_RefineNet, and achieved the best segmentation results, both quantitatively and qualitatively, while yielding real-time inference speed.
Author Contributions: G.Z. wrote the manuscript, designed the network, and conducted the experiments. T.L., Y.C. and P.J. contributed to the conceptual design of the experiments and reviewed and revised the paper.

Figure 1. Examples of intra-class heterogeneity and inter-class homogeneity in high-resolution aerial images: (a) houses (or cars) have different shapes and colors, but they belong to the same semantic label; and (b) low vegetation and trees are similar in appearance, but they belong to two different semantic labels.

Figure 2. The architecture of the proposed network. Given an input image, we use the modified MobileNetV2 to extract the shared feature maps. Then, two paths are appended to capture semantic context and boundary context while simultaneously generating semantic segmentation maps and edge score maps.

Figure 3. The basic structure of BRB. There are two types of blocks: (a) residual block with stride of 1; and (b) block with stride of 2 for downsampling.

Figure 4. Example of atrous convolution of 3 × 3 kernel size with different dilation rates: (a) atrous convolution with dilation rate = 1, also known as standard convolution, which covers a 3 × 3 window; (b) atrous convolution with dilation rate = 2, which covers a 5 × 5 window; and (c) atrous convolution with dilation rate = 3, which covers a 7 × 7 window. While the window covered by a single layer grows with the dilation rate, the number of parameters associated with each filter is identical.

Figure 5. The structure of the Channel Attention Module (CAM).

Figure 6. Examples of semantic segmentation results on the ISPRS Vaihingen test tiles: (a) raw images; (b) the groundtruths; (c-e) the segmentation results of MNetV2+SP, MNetV2*+SP, and MNetV2*+SP+EP, respectively; (f) the semantic boundaries extracted from the groundtruths by the MATLAB imgradient function; and (g) the predicted semantic boundaries of our proposed network.


Figure 7. ROC and PR curves of all the compared models on the ISPRS Vaihingen test tiles: (a) the ROC curve; and (b) the PR curve. Classes from left to right: impervious surface (Imp. Surf.), buildings, low vegetation (Low Veg.), trees, and cars.

Figure 9. ROC and PR curves of all the compared models on the ISPRS Potsdam test tiles: (a) the ROC curve; and (b) the PR curve. Classes from left to right: impervious surface (Imp. Surf.), buildings, low vegetation (Low Veg.), trees, and cars.


Figure 10. Visual comparisons between the deep models on the ISPRS Potsdam test tiles.

Figure 11. Example results of semantic segmentation on the ISPRS Vaihingen and Potsdam test tiles: (a) results on the ISPRS Vaihingen dataset; and (b) results on the ISPRS Potsdam dataset. The first column shows the image tiles; the second to sixth columns show the image patches at the top-left, top-right, center, bottom-left, and bottom-right, respectively. In (a,b), the first to third rows show the raw images, the corresponding groundtruths, and the segmentation results, respectively.

Figure 12. Visualization of the semantic boundary maps generated by the edge path: (a) raw images; (b) the groundtruths of semantic segmentation; (c) the segmentation results generated by our network; (d) the groundtruths of semantic boundary; and (e) the predicted semantic boundaries generated by the edge path.

Table 1. The architecture of MobileNetV2 used for feature extraction in our proposed DCNN. Each line describes a sequence of 1 or more identical layers. t: expansion factor; c: the number of output channels; n: the number of repeated layers; s: stride of the first layer of each sequence, all others use stride 1; as: stride of the atrous convolution.

Table 3. The number of parameters of the compared models.


Table 4. Quantitative comparison with the state-of-the-art models on the ISPRS Vaihingen test tiles (the values in bold are the best).

Table 5. Quantitative comparison with the state-of-the-art models on the ISPRS Potsdam test tiles (the values in bold are the best).


Table 6. Inference speed comparison of our proposed method against other state-of-the-art methods.

Table 7. Quantitative results on the ISPRS Vaihingen and Potsdam test tiles.