Remote Sensing Road Extraction by Road Segmentation Network

Abstract: Road extraction from remote sensing images has attracted much attention in geospatial applications. However, existing methods do not accurately identify the connectivity of the road. The identification of road pixels may be disturbed by abundant ground objects such as buildings, trees, and shadows. The objective of this paper is to enhance the context and strip features of the road by designing a UNet-like architecture. The overall method first enhances the context characteristics in a segmentation step and then maintains the strip characteristics in a refinement step. The segmentation step exploits an attention mechanism to enhance the context information between adjacent layers. To obtain the strip features of the road, the refinement step introduces strip pooling in a refinement network to restore the long-distance dependency information of the road. Extensive comparative experiments demonstrate that the proposed method outperforms other methods, achieving an overall accuracy of 98.25% on the DeepGlobe dataset and 97.68% on the Massachusetts dataset.


Introduction
Remote sensing road extraction aims to identify the road pixels in an image and complete the binary segmentation of the road. Nowadays, road extraction is needed in many applications [1,2] such as traffic navigation and urban planning. However, it is difficult to extract the road from remote sensing images due to the large amount of noise information [3]. The noise mainly comes from occlusion or shadows cast by other ground objects, and from visually similar categories. Moreover, roads are not always regular, and their width and curvature vary greatly across scenes. These factors reduce the accuracy of road extraction methods and prevent the restoration of complete road connectivity.
Recently, numerous methods have been proposed to obtain a road map. According to whether the Convolutional Neural Network (CNN) architecture is used or not, road extraction methods can be divided into two categories: hand-crafted feature methods and CNN methods.
The hand-crafted feature methods divide the pixels into road and non-road by extracting shallow features and applying empirical hypotheses. Early road extraction methods combined hand-crafted features (geometry, texture, spectrum, etc.) and multiple algorithms (edge detection, tracking, area clustering, etc.) with empirical assumptions to extract the road in the image. M. Barzohar et al. [4] used the width, length, curvature, and pixel intensity of the road as empirical hypothesis information, and established a geometric probability model with maximum posterior probability estimation for road extraction. J. Hu et al. [5] first detected the local uniform area around each pixel to generate a road tree structure, and then used a Bayesian decision model to obtain the final road. M. Song et al. [6] used support vector machines as a classifier and considered a weighted combination of spectral information and shape information to identify road pixels. However, the hand-crafted feature methods are rough: they depend on the quality of feature selection and rely heavily on prior knowledge.
With the rapid development of deep learning [7], CNN methods have become the mainstream methods due to their representation power [8]. Li et al. [9] used a CNN to classify pixels, and then used a linear integral convolution algorithm to enhance the road connection structure. G. Mattyus et al. [10] used a variant of ResNet [11] for road segmentation and then designed a road inference algorithm to correct the segmentation results. Xu et al. [12] utilized local and global attention mechanisms to enhance road information in DenseNet. However, CNN methods do not fully exploit the context features extracted by the CNN [13], and they do not sufficiently consider the strip shape and long-distance dependence of roads.
The road pixels in remote sensing images may be disturbed by abundant ground objects such as buildings, trees, and shadows, which may lead to poor road connectivity. It is important to enhance the context information related to the road and suppress the interference information of non-road regions. In this paper, a road segmentation method is proposed for road extraction from high-resolution remote sensing images. The road information is extracted by enhancing the context characteristics in the segmentation step and maintaining the strip characteristics in a refinement step. The contributions are as follows: (1) To enhance road context features, an end-to-end road segmentation network is designed, since road connectivity is easily disturbed by noise. (2) To strengthen the context information belonging to the road, an inter-layer self-attention mechanism is introduced to generate weight maps. (3) To preserve the strip information of the road, a refinement network with strip pooling is introduced to refine the results of the segmentation network.

Materials
Two large public datasets were used in the experiments: the DeepGlobe dataset and the Massachusetts dataset. A brief introduction is as follows.
The DeepGlobe dataset is a satellite dataset that uses images from Thailand, Indonesia, and India. The land area of the dataset is about 2220 km² and contains a variety of image scenes (urban, rural, wilderness, tropical rainforest, seaside, etc.). The DeepGlobe dataset, which is used for the road extraction challenge, consists of 6226 images with a size of 1024 × 1024 and a spatial resolution of 50 cm/pixel. In the experiment, 4358 images are used for training, and the remaining 1868 images are used for testing.
The Massachusetts dataset is an aerial dataset that uses images from Massachusetts, US. The dataset has a resolution of 1.2 m/pixel, covers a land area of 2.25 km², and contains urban, suburban, and rural environments. The size of each image is 1500 × 1500. The Massachusetts dataset consists of 1108 training images and 49 testing images.
The experiments were performed with the PyTorch library on a GPU. To limit the computational cost, the input images were cropped to 512 × 512 pixels for both the DeepGlobe dataset and the Massachusetts dataset. To alleviate over-fitting, data augmentations were used, such as image flipping, image shifting and scaling, and color jittering.
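As an illustration of the cropping and flipping described above, the following is a minimal sketch in plain Python rather than the PyTorch pipeline used in the experiments. It assumes that the image and its road mask are stored as 2D nested lists and must receive the same random crop and the same flip; the function name `random_crop_and_flip` and the flip probability of 0.5 are hypothetical choices, not taken from the paper.

```python
import random


def random_crop_and_flip(image, mask, crop=512, rng=None):
    """Cut the same random window from image and mask, then flip both together.

    image, mask: 2D nested lists with identical height and width (assumption).
    crop: side length of the square window (512 in the experiments).
    rng: optional random.Random instance for reproducibility.
    """
    if rng is None:
        rng = random.Random()
    height, width = len(image), len(image[0])
    if height < crop or width < crop:
        raise ValueError("input is smaller than the crop size")

    # Pick one window position and apply it to both arrays.
    top = rng.randint(0, height - crop)
    left = rng.randint(0, width - crop)
    img = [row[left:left + crop] for row in image[top:top + crop]]
    msk = [row[left:left + crop] for row in mask[top:top + crop]]

    # Horizontal flip with probability 0.5 (assumed value), applied jointly.
    if rng.random() < 0.5:
        img = [row[::-1] for row in img]
        msk = [row[::-1] for row in msk]
    return img, msk
```

Applying the identical transform to the image and the mask keeps every pixel aligned with its label, which is the property any such augmentation has to preserve for segmentation.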
The backbone network of the proposed method was ResNet34, due to its powerful encoding capability. The Adam optimizer was used to train the proposed network. The number of training epochs was set to 50. The batch size was 24 for the Massachusetts dataset and 16 for the DeepGlobe dataset. The learning rate was initially set to 2 × 10⁻⁴ and reduced by a factor of 0.1. During the testing phase, each input image was predicted a total of eight times under different flips and angles, and the final result was the average of these outputs.
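The eight-fold prediction at test time can be sketched as follows. The paper does not list the eight views, so this sketch assumes the common choice of four 90-degree rotations, each with and without a horizontal flip. The names `predict_tta`, `rotate`, and `hflip` are hypothetical, `model` is any callable that maps a 2D map to a 2D map of the same shape, and single-channel nested lists stand in for image tensors.

```python
def rot90(grid):
    """Rotate a 2D nested list by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def rotate(grid, k):
    """Apply k counter-clockwise quarter turns; negative k undoes them."""
    for _ in range(k % 4):
        grid = rot90(grid)
    return grid


def hflip(grid):
    """Mirror a 2D nested list left to right."""
    return [row[::-1] for row in grid]


def predict_tta(model, image):
    """Average the model output over eight views (assumed: 4 rotations x 2 flips)."""
    height, width = len(image), len(image[0])
    total = [[0.0] * width for _ in range(height)]
    for flip in (False, True):
        for k in range(4):
            # Forward transform: optional flip, then k quarter turns.
            view = rotate(hflip(image) if flip else image, k)
            pred = model(view)
            # Inverse transform in reverse order: undo rotation, then undo flip.
            pred = rotate(pred, -k)
            if flip:
                pred = hflip(pred)
            for r in range(height):
                for c in range(width):
                    total[r][c] += pred[r][c] / 8.0
    return total
```

Each prediction is mapped back to the original orientation before averaging, so the eight outputs are pixel-aligned.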
To verify the performance of the proposed method, four widely used metrics were employed: overall accuracy (OA), precision (P), recall (R), and F1-score. After the calculation of the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), the measures of OA, P, R, and F1-score were calculated as follows:
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}, \quad P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times P \times R}{P + R}.$$
Both precision and recall should remain at a high level, which indicates improved performance, although the two values conflict in some cases and are then negatively correlated. The F1-score considers precision and recall simultaneously. It is the harmonic mean of precision and recall, with a maximum value of 1 and a minimum value of 0.
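The four metrics can be computed directly from the confusion counts using their standard definitions. The sketch below assumes binary masks stored as nested lists in which 1 marks a road pixel; the function name `road_metrics` and the convention of returning 0.0 when a denominator is zero are choices made for this example.

```python
def road_metrics(pred, truth):
    """Compute OA, P, R, and F1 from two binary masks (1 = road, 0 = non-road)."""
    tp = tn = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # road predicted, road in ground truth
            elif p:
                fp += 1      # road predicted, background in ground truth
            elif t:
                fn += 1      # background predicted, road in ground truth
            else:
                tn += 1      # background in both

    total = tp + tn + fp + fn
    oa = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"OA": oa, "P": precision, "R": recall, "F1": f1}
```

Because road pixels are a small minority in most scenes, OA can stay high even when many road pixels are missed, which is why P, R, and F1 are reported alongside it.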

The Proposed Method
The proposed method is introduced in this section. First, the network architecture is illustrated. Then, a self-attention mechanism is exploited to enhance the inter-layer context features. Finally, a refinement network is designed to refine the road map.

Network Architecture
An overview of the network architecture is shown in Figure 1. The network adopts a UNet-like architecture to extract the road map from a high-resolution remote sensing image. The UNet architecture, which is widely used for remote sensing images [14], concatenates multiple features from high layers to low layers. The proposed method extracts features with convolution layers and then produces a dense pixel-wise output with deconvolution layers. However, the multiple features may contain redundant noise information from non-road pixels. To enhance the context information in adjacent layers, an attention mechanism exploits the hierarchical features to generate the attention maps $w_1$, $w_2$, $w_3$. These weight maps are applied in the skip connections and adjacent layers to enhance the context information. Then, the low-level features are weighted and fused into the corresponding deconvolution layers to emphasize the context features of the road. After multiple deconvolution layers and a sigmoid layer, the features are restored to the size of the input image. To enhance the strip information of the road, a network with strip pooling is designed to refine the segmentation results.
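The idea behind strip pooling can be illustrated on a single feature map. The sketch below is a simplified stand-in, not the module used in the proposed network: it averages each row and each column and broadcasts the sum of the two averages back to every position, and it omits the convolutions and gating that a full strip pooling module would include. The function name `strip_pool` is hypothetical.

```python
def strip_pool(feature):
    """Simplified strip pooling on one H x W feature map (nested lists).

    Each output cell receives the mean of its row plus the mean of its
    column, so a position is influenced by distant cells on the same strip.
    """
    height, width = len(feature), len(feature[0])
    # Horizontal strips: one average per row.
    row_means = [sum(row) / width for row in feature]
    # Vertical strips: one average per column.
    col_means = [sum(feature[r][c] for r in range(height)) / height
                 for c in range(width)]
    # Broadcast both strip responses back to the full H x W grid.
    return [[row_means[r] + col_means[c] for c in range(width)]
            for r in range(height)]


# A horizontal road with a one-pixel gap at column 2.
road = [[0.0, 0.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 1.0],
        [0.0, 0.0, 0.0, 0.0]]
pooled = strip_pool(road)
# The gap pixel pooled[1][2] receives a response from the rest of its row,
# while the background pixel above it, pooled[0][2], does not.
```

This shows why long, narrow pooling windows suit roads: evidence from the visible parts of a road propagates along the strip to occluded positions.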

Segmentation Step
The spatial context information in different convolution layers is complementary: deep layers carry global context, and shallow layers carry local context. Skip connections from shallow to deep layers can introduce location information into the semantic information. However, most of the information in an image is unrelated to the road and acts as interference. Therefore, a self-attention mechanism is proposed to emphasize the target information in adjacent layers.
The attention mechanism (AM) is shown in Figure 2. The feature of the low layer is denoted as $F_i$, while the feature of the high layer is up-sampled with bilinear interpolation to the same size as $F_i$ and denoted as $F_j$. The concatenation of $F_i$ and $F_j$ is sent to a series of 1 × 1 and 3 × 3 convolution layers. After the sigmoid layer, the AM obtains the weight map $w$, which serves two purposes. First, it selects the useful information and refines the concatenated feature with residual learning; the refined feature then serves as $F_i$ for the next adjacent layer in the AM. Second, it is introduced into the skip connections to enhance the detailed information of the road. Thus, the AM has two outputs: the weight map and the refined feature. Mathematically, the AM computes the refined feature $\hat{F}$ from the concatenated feature $\mathrm{cat}(F_i, F_j)$, the weight map $w$, and the convolution block $R(\cdot)$, where the weight map is
$$w = \frac{1}{1 + e^{-H(\mathrm{cat}(F_i, F_j))}}.$$
Here, $\mathrm{cat}(\cdot)$ denotes the concatenation of the adjacent layer features $F_i$ and $F_j$, $R(\cdot)$ is three consecutive 3 × 3 convolution operations, and $H(\cdot)$ denotes alternating 3 × 3 and 1 × 1 convolutions.
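The gating role of the weight map can be sketched without the convolution layers. In the example below, `logits` is a placeholder for the output of H(cat(F_i, F_j)), which in the network comes from convolutions that are not reproduced here. The sketch only shows the sigmoid that turns this output into a weight map and the element-wise weighting of a skip-connection feature; the function names are hypothetical, and the residual refinement with R(·) is deliberately left out.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def weight_map(logits):
    """Turn an H x W map of convolution outputs into weights in [0, 1]."""
    return [[sigmoid(v) for v in row] for row in logits]


def apply_weight(feature, w):
    """Weight a C x H x W feature by an H x W map, channel by channel."""
    return [[[value * w[r][c] for c, value in enumerate(row)]
             for r, row in enumerate(channel)]
            for channel in feature]


# Placeholder logits: strongly positive where the road is likely.
logits = [[4.0, -4.0],
          [4.0, -4.0]]
w = weight_map(logits)
skip_feature = [[[1.0, 1.0], [1.0, 1.0]]]   # one channel, 2 x 2
gated = apply_weight(skip_feature, w)
# Positions with high weights keep most of their response, while
# positions with low weights are suppressed before fusion in the decoder.
```

The same weight map is shared across all channels in this sketch, which is one plausible reading of a single-channel attention map and is an assumption here.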

Refinement Step
The outputs of the segmentation network are coarse, and the interference information leads to errors in the road segments. The tensor voting algorithm and the conditional random field (CRF) are widely used as post-processing to refine the results of a segmentation network. However, the former is unstable due to its parameter dependence, while the latter is overly complex. Instead, a designed refinement network (DRN) is used to refine the results of the segmentation network. The difference is that the long-distance dependence of the road is also considered here. The refinement network produces more stable road maps and is embedded in the overall network for end-to-end training.
The designed refinement network (DRN) is shown in Figure 3. The input of the DRN is the coarse map, which has a width of $W$, a height of $H$, and one channel. The output of the DRN is the refined map with the same size. There are three corresponding layers in the DRN. In the encoder, the number of channels changes from 1 to 16, 32, and 64 in turn, and the features are sub-sampled to 1/2, 1/4, and 1/8 of the input size. Because roads exhibit long-distance dependence, maintaining connectivity in the extracted results is challenging. To capture the long-distance dependence between different locations in the features, strip pooling [15], which averages all values in a row or a column of the feature, is embedded at the high-layer features with 64 channels. This allows road segments scattered across the image to be connected on the feature map. To capture local context information, two spatial pooling layers are used to collect short-distance dependencies. The output features of the two pooling layers are concatenated, and then 1 × 1 convolution operations are used to change the number of channels. Finally, the decoder recovers the feature resolution and obtains a refined map of the road.
Road extraction is a pixel-level recognition task; however, there is a great imbalance between road pixels and non-road pixels in the image. Thus, the constraint on road pixels in the loss function is of great importance. As a binary segmentation task, road extraction most commonly uses the binary cross entropy (bce) loss function. Inspired by [16], the bce is given a dynamic weight, which is set by the frequency of the road pixels. Suppose the batch size is $B$ and the size of each image is $H \times W$; then the weight $\alpha$ is calculated as follows:
$$\alpha = \frac{\sum_{n=1}^{B} \mathrm{sum}(y_n)}{B \times H \times W},$$
where $y_n$ is the ground truth of the $n$-th image in the batch, and $\mathrm{sum}(\cdot)$ counts the number of road pixels.
Thus, the bce loss is modified to be
$$L_{bce} = -\sum_{n} \left[(1 - \alpha)\, y_n \log a_n + \alpha\, (1 - y_n) \log(1 - a_n)\right],$$
where $a_n$ is the sigmoid value at the end of the model, $f_w(x_n)$ is the output of the model for $x_n$, and $y_n$ is the ground truth for $x_n$; the ground truths are binary maps. To ensure that the loss function can still constrain the road after multiple iterations, the dice loss $L_{dice}$ is used to measure the similarity between the prediction and the ground truth.
The final loss function used in this article is a combination of $L_{bce}$ and $L_{dice}$, which alleviates the performance problems caused by sample imbalance in road extraction.
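The combined loss can be sketched in plain Python under several stated assumptions. The weight α is taken as the fraction of road pixels in the batch, the weighted bce is averaged over all pixels, the dice term uses the common soft dice form with a smoothing constant, and the two terms are added with equal weight. None of these details are confirmed by the text above, and the function names are hypothetical.

```python
import math


def _flatten(batch):
    """Flatten a B x H x W nested list into one list of values."""
    return [v for image in batch for row in image for v in row]


def road_frequency(targets):
    """alpha: fraction of road pixels (value 1) in the ground-truth batch."""
    values = _flatten(targets)
    return sum(values) / len(values)


def weighted_bce(probs, targets, alpha, eps=1e-7):
    """Weighted bce: road terms scaled by (1 - alpha), background by alpha."""
    loss = 0.0
    pairs = list(zip(_flatten(probs), _flatten(targets)))
    for a, y in pairs:
        a = min(max(a, eps), 1.0 - eps)   # clamp to keep log() finite
        loss -= ((1.0 - alpha) * y * math.log(a)
                 + alpha * (1.0 - y) * math.log(1.0 - a))
    return loss / len(pairs)              # mean reduction (assumption)


def dice_loss(probs, targets, smooth=1.0):
    """Soft dice loss in its common form (assumed, not taken from the paper)."""
    a, y = _flatten(probs), _flatten(targets)
    intersection = sum(ai * yi for ai, yi in zip(a, y))
    return 1.0 - (2.0 * intersection + smooth) / (sum(a) + sum(y) + smooth)


def total_loss(probs, targets):
    """Equal-weight sum of the two terms (assumption)."""
    alpha = road_frequency(targets)
    return weighted_bce(probs, targets, alpha) + dice_loss(probs, targets)
```

With roads covering a small share of the pixels, α is small, so a missed road pixel is penalized more heavily than a false alarm on the background, which is the intended effect of the dynamic weight.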

Experiments
In the experiments, both the DeepGlobe dataset [17] and the Massachusetts dataset [18] are used, and five methods are considered for comparison: FCN [19], UNet [20], CasNet [21], ResUNet [22], and D-LinkNet [23]. Among them, FCN applies sequential down-sampling operations in its early layers. CasNet uses VGG-Net as its encoder network, while ResUNet, D-LinkNet, and the proposed method use ResNet34 as the encoder network. All the compared methods use cross-entropy as the training loss and employ the same data processing for fairness.
Table 1 shows the quantitative results on the DeepGlobe dataset. Numerically, the proposed method achieved the best performance on three metrics: OA, P, and F1. D-LinkNet uses dilated convolution to increase the receptive field of the features, and it obtained the highest recall. ResUNet also performed well on P, only 0.23% lower than the proposed method. The results of FCN and UNet were worse than those of the other methods, mainly because road context information is lost after multiple down-sampling operations on different features.
Table 2 reports the comparative results on the Massachusetts dataset. Compared with D-LinkNet, the proposed method improved OA, P, and F1 by 0.14%, 1.24%, and 0.41%, respectively. The original encoder structure of UNet leads to poor performance on all four metrics. Regarding improvements to the encoder structure, ResUNet combines residual learning in its encoder, CasNet uses VGG-Net, and D-LinkNet uses ResNet as its encoder. They had advantages of 2.16%, 1.15%, and 3.28% over UNet on P, respectively.
To observe the performance of the different designs in the proposed method, Table 3 shows the results of the ablation experiment on the Massachusetts dataset. The Base method only uses ResNet as an encoder and multiple deconvolution layers as a decoder. AM is the attention mechanism that enhances the inter-layer context features, and DRN is the network that refines the road maps. The last row represents the proposed method, in which the reciprocal of the road frequency was used as the weight in the loss function. Compared to the Base method, the AM contributed a 3.26% improvement on P, and adding the DRN improved P by a further 0.6% over the Base combined with AM. The full proposed method yielded a further improvement of 0.21%, while the R score dropped by 0.01%.
Additional examples of road extraction are shown in Figure 4. Subfigures (c,e) show that road connectivity could not be restored by FCN [19], UNet [20], and ResUNet [22]. The results of CasNet [21] and D-LinkNet [23] missed segments in subfigure (f) compared with the proposed method. In subfigures (a,b), the edge pixels of the trunk road were extracted completely by the proposed method. These examples demonstrate the effectiveness of the proposed method for road extraction.

Conclusions and Future Work
In this paper, an end-to-end road segmentation network is proposed to extract the road from high-resolution remote sensing images. Although existing CNN architectures have made progress, they have not sufficiently explored the context features of adjacent CNN layers. In this paper, an inter-layer self-attention mechanism was designed to obtain road context information. Roads in images typically have the shape of a thin strip. The proposed method therefore introduces strip pooling in the refinement network to better restore the topology and long-distance dependence of roads. The whole network achieved better performance by enhancing the context features in adjacent layers and the strip features of roads in the refinement process. The quantitative results demonstrate the effectiveness of the proposed method and its superiority for road extraction from high-resolution remote sensing images. The improvements obtained by the proposed method on the two datasets can be mainly attributed to two designs. First, the context features enhanced by the attention mechanism in adjacent layers support the correct identification of road pixels. Second, considering the strip characteristics of the road in the refinement network helps the extracted roads retain a better topology.
Although the proposed method is successfully applied to remote sensing road extraction, there are still some possible extensions for future work. This paper only exploits remote sensing images to extract the road, which is difficult in complex scenarios [22] (dense urban areas, different viewpoints). Future work will tackle these complex scenarios by integrating multiple data sources, such as multi-temporal images [24] and multi-modal data [25].