Cascaded Residual Attention Enhanced Road Extraction from Remote Sensing Images

Abstract: Efficient and accurate road extraction from remote sensing imagery is important for applications related to navigation and Geographic Information System updating. Existing data-driven methods based on semantic segmentation recognize roads from images pixel by pixel, which generally uses only local spatial information and causes discontinuous extraction and jagged boundary recognition. To address these problems, we propose a cascaded attention-enhanced architecture to extract boundary-refined roads from remote sensing images. Our proposed architecture uses spatial attention residual blocks on multi-scale features to capture long-distance relations and introduces channel attention layers to optimize the fusion of multi-scale features. Furthermore, a lightweight encoder-decoder network is connected to adaptively optimize the boundaries of the extracted roads. Our experiments showed that the proposed method outperformed existing methods and achieved state-of-the-art results on the Massachusetts dataset. In addition, our method achieved competitive results on more recent benchmark datasets, e.g., the DeepGlobe and the Huawei Cloud road extraction challenge.


Introduction
With the rapid development of earth observation technology, large-scale and high-resolution remote sensing imagery has become the most important data source for object extraction. Automatic road recognition and extraction from remote sensing images is an essential step towards many applications, including high-definition maps and the updating of GIS (Geographic Information System) datasets. Despite extensive research on automatic road extraction, the accurate and efficient extraction of roads from remote sensing images for GIS applications is still a great challenge. This is partially due to the complex relationship between roads and backgrounds and partially due to the variation in the width of roads and in the spatial resolution of images [1,2].
Deep learning-based methods can automatically learn and extract representative and distinctive features from a large number of training samples. They have been widely applied in remote sensing because they achieve higher performance than traditional road extraction methods [3-5]. The most popular strategy for road extraction is encoder-decoder-based networks [6-10]. The encoder module extracts multi-scale features from the input images, then the decoder module interprets and upsamples the features end to end for road extraction. Although recent studies have made great leaps in this regard, there are still some problems that must be solved [3,4,11].
(1) Discontinuity on narrow roads. First, roads are usually linear and continuous. Because there are fewer pixels in the cross-section direction of the road in the image, especially for roads that are narrow, the detailed spatial context is easily lost with repeated downsampling during feature extraction. Existing methods attempt to recover spatial information by fusing the high-resolution features extracted from the shallow convolutional layers using skip connections [9,12,13]. However, the semantic information of features extracted from shallow convolutional layers is insufficient, and it introduces noise that makes the extracted roads discontinuous and jagged, as shown in the yellow dashed ellipse in Figure 1.
(2) Coarse classification at the boundary. Second, existing methods use a specific threshold to classify the upsampled feature maps of roads and backgrounds directly [11,12]. However, threshold-based segmentation usually leads to inaccurate extraction, which is indicated by a zigzag at the boundary of a road. The undesired zigzag effects can be resolved only when the features capture information in a global context rather than locally at the road boundaries, as shown by the red dashed ellipse in Figure 1.
Figure 1. Roads extracted using existing methods are discontinuous and jagged at the boundary, especially when the roads are narrow, as shown in the yellow and red circles, respectively. Columns (a-c) represent sample images, corresponding labels, and results extracted by D-LinkNet50, respectively.
To address the above problems, we propose a cascaded residual attention-enhanced coarse-to-fine network named CRAE-Net that combines spatial and channel attention modules to optimize and fuse multi-scale features. It preserves the detailed spatial context in high-resolution features and improves on threshold-based segmentation using a lightweight refinement network to ensure accurate extraction of narrow roads and smooth boundaries. We designed a parallel multi-path network based on the pretrained ResNet50 to enhance multi-scale features and to combine detailed spatial context with rich semantics. Then, channel and spatial attention were cascaded to optimize and merge the multi-scale features. Additionally, a lightweight network was connected to further refine the boundary of the roads. The resulting network sufficiently preserves detailed spatial information and optimizes boundary segmentation to achieve accurate road extraction.
In summary, our contributions are as follows: (1) We designed a cascaded attention-based structure to aggregate spatial details and semantic information for continuous road extraction; (2) we introduced a lightweight coarse-to-fine segment module for smooth road boundary recognition; and (3) we published our benchmark results on the DeepGlobe and the Huawei competition datasets for comparison with related research. The source code is available at: https://github.com/liaochengcsu/Cascade_Residual_Attention_Enhanced_for_Refinement_Road_Extraction, accessed on 26 December 2021.
The rest of this paper is organized as follows. Related works on road extraction are summarized in Section 2. Section 3 introduces the details of the proposed method for refined road extraction. Section 4 describes the experiments and analyzes the results. Section 5 presents the conclusion of this paper.

Related Works
Machine learning technology has developed rapidly in recent years, especially with the proposal of the Fully Convolutional Network (FCN) [13], which is a milestone in the field of image processing research and has achieved good results for efficient image segmentation. There have been many deep-learning-based studies related to remote sensing segmentation recently [2,14-26]. For the most relevant problems in road extraction research, we briefly review related works, including refined road boundary extraction and continuous road regional recognition.
Refined road boundary extraction. U-Net [27] is an encoder-decoder architecture based on the FCN that was designed for medical image segmentation. It introduces skip connections to recover the spatial details lost by the downsampling operations at the encoder stage. Many U-Net-derived works [9,28-31] have achieved excellent performance, especially for road boundary extraction, because they focus on the spatial information in the shallow layers that represents the details of the road boundary. In addition, SegNet [6] further refined spatial details and reduced the complexity of computation compared with the FCN.
ResNet [32] introduced the residual connection to solve the problem of exploding and vanishing gradients in deep convolutional networks. This residual block makes highly complex Convolutional Neural Networks (CNNs) converge faster and more stably. Many studies achieved significant performance by introducing the pre-trained ResNet backbone. ResU-Net [33] improved the accuracy of segmentation significantly by introducing the residual blocks to U-Net. DenseNet [34] designed densely connected blocks using short paths between shallow layers and deep layers for feature reuse, and performed with lower computational complexity than ResNet. The performance of road extraction was improved using dense blocks in [29,35]. The Coord-Dense-Global (CDG) model [36] introduced an attention mechanism to enhance the edge information and global context of roads based on DenseNet. For road boundary refinement, Conditional Random Fields (CRF) were introduced as post-processing strategies to optimize the extracted result [37,38]. Ref. [39] proposed a coarse-to-fine algorithm, utilizing gray-value distribution to pre-segment the potential roads and using structure context features for final road extraction.
Because the spatial context details are lost in the downsampling operation during encoding, it is difficult to recover the spatial context, especially for narrow road boundaries. Moreover, most existing segmentation-based methods directly classify pixels using a fixed threshold and ignore the structure of the edge, causing severely zigzagged boundaries.
Continuous road regional recognition. Prior knowledge of a road, such as its orientation or topological information, has been used for continuous road extraction [11,35,40-45]. These constraints have been proven effective for road extraction. A series of DeepLab [7,38] methods introduced atrous convolution for segmentation tasks, which enlarges the receptive field without increasing the computational complexity and improves the regional consistency. Ref. [12] introduced an Atrous Spatial Pyramid Pooling (ASPP) module to capture multi-scale global semantic information for efficient road extraction. Moreover, Ref. [8] applied structural similarity loss to improve the continuity of the extracted roads based on multi-scale features.
The attention mechanism is another popular method for improving regional continuity. Refs. [31,46-50] introduced spatial and channel attention layers, which effectively improved the segmentation performance, especially for continuity in the road area. Benefiting from Generative Adversarial Networks (GAN), Refs. [40,51-53] obtained impressive results from images through adversarial learning between generative and discriminative models. D-LinkNet [54] introduced dilated convolutional layers in the center part of a pretrained encoder-decoder structure to enlarge the receptive field with efficient computation and memory without reducing the resolution of the features, and achieved first place in the CVPR DeepGlobe Road Extraction Challenge [55].
However, for road extraction research, inaccurate boundary recognition and discontinuous segmentation results, especially for narrow roads, are still unavoidable.

Overview
In this work, we propose a cascaded attention-aggregated refinement network to extract roads from remote sensing images. It aims to solve the problems of discontinuous road extraction and jagged boundary identification in existing methods. The main structure of the proposed method is illustrated in Figure 2. Specifically, we base our study on the ResNet50 backbone pre-trained on ImageNet.
(1) For the discontinuity of extracted roads, we introduce a cascaded attention-based residual module to enhance the extracted multi-scale features, especially to maintain detailed spatial information. The high-resolution features extracted from shallow convolutional layers contain spatial details that are vital for recognizing small objects such as narrow roads, but because these shallow features are semantically insufficient, they may also introduce noise. The designed module not only combines exact spatial details and rich semantic information, but also improves the ability to capture the long-distance similarity of roads. (2) For jagged road boundary recognition, we designed a lightweight U-Net-like network to refine the boundaries of roads at the original scale. Without introducing much computational complexity, it achieves smoother road boundary extraction than existing methods that directly filter using a threshold.

Cascaded Attention Feature Enhancement
The attention mechanism was first proposed to address the bottleneck problem that arises with the use of a fixed-length encoding vector [56], where the decoder would have limited access to the information provided by the input. It uses a weighted sum of all of the encoder hidden states to flexibly focus the attention of the decoder on the most relevant parts of the input sequence, which greatly improved the performance of sequence models, especially for machine translation. In recent years, it has been successfully transferred to image processing applications, especially semantic segmentation, and has significantly improved the performance of remote sensing image processing. In image processing applications, the attention mechanism is mainly divided into spatial attention and channel attention. Spatial attention aims to capture long-distance correlation using a pixel-by-pixel spatial similarity calculation, while channel attention is mainly used to assign weights to each feature channel by calculating the correlation at the channel level.
For completeness, we briefly introduce the attention mechanisms used in our work, including the spatial attention module and channel attention module [46-48], as shown in Figure 3. In this work, there are four scales of features extracted by the ResNet50 pre-trained backbone. High-resolution features extracted from shallow layers maintain great spatial detail, whereas low-resolution features extracted from deeper convolution layers contain rich semantic features but lose spatial information. Roads usually appear as long and narrow linear structures in remote sensing images. There are few pixels in the cross-section of the road due to the limited road width. Therefore, we only utilize the three scales of features having a high resolution, which better preserves the necessary road spatial information. In order to enlarge the receptive field and extract a wide range of continuous roads, we introduced the ASPP module in the path with the highest feature resolution to obtain more global features.
Features with high resolution flow through fewer network layers, so they retain more spatial information but also introduce noise, which makes the road edges appear jagged. We added cascaded residual blocks, denoted RB, to extract rich semantic features while preserving spatial details by maintaining the spatial resolution of the features. Furthermore, the Spatial Attention (SA) layer is introduced to capture the long-distance similarity of roads and to enhance the consistency of road characteristics, especially for narrow roads. At the decoding stage, the enhanced multi-scale features are fused through skip connections; at the same time, the Channel Attention (CA) layer performs channel-level filtering on the multi-scale features during upsampling to obtain features that aggregate rich semantics and accurate spatial details. The details of these blocks are shown in Figure 4.
Because roads are narrow and continuous in remote sensing imagery, a tiny road is susceptible to interference from nearby background pixels, which leads to discontinuous extraction results. We enhance the semantic features of the road area by introducing a spatial attention mechanism to capture the long-distance correlation of roads, which improves the continuity of narrow roads significantly. In addition, the decoders of semantic segmentation networks usually fuse features through skip connections by spatial addition or channel concatenation, which ignores the different capabilities of features at different scales to represent spatial details and semantics. Based on this observation, we introduced the channel attention mechanism to adaptively fuse features at different scales and to optimize their spatial detail and semantic information, thereby enhancing the features for road representation. The comparison of different feature fusion methods is shown in Figure 5.
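As a rough illustration of the two attention types described above, the following NumPy sketch shows a parameter-free spatial attention with a residual connection and a squeeze-and-gate channel attention applied to two concatenated feature maps. It is not the CRAE-Net implementation: the function names, the omission of learned query/key/value projections, and the weight matrices w1 and w2 are assumptions made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_attention(feat):
    """Simplified spatial attention on a (C, H, W) feature map.

    Each pixel is re-expressed as a similarity-weighted sum of all pixels,
    then added back to the input (residual connection). Learned projections
    are omitted here (assumption for brevity).
    """
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)                # (C, N) with N = H * W
    affinity = softmax(flat.T @ flat, axis=-1)   # (N, N), each row sums to 1
    attended = flat @ affinity.T                 # (C, N)
    return feat + attended.reshape(c, h, w)


def channel_attention_fuse(low, high, w1, w2):
    """Simplified channel attention for fusing two (C, H, W) feature maps.

    The maps are concatenated along the channel axis, globally average-pooled,
    and passed through a small two-layer gate (hypothetical weights w1, w2)
    that yields one weight in (0, 1) per channel.
    """
    fused = np.concatenate([low, high], axis=0)      # (2C, H, W)
    squeezed = fused.mean(axis=(1, 2))               # (2C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)          # ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))      # sigmoid, shape (2C,)
    return fused * gate[:, None, None]
```

Note that the affinity matrix in this sketch has N × N entries for N = H · W pixels, so its memory cost grows quadratically with the spatial size of the feature map.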

Coarse-to-Fine Boundary Optimization
Almost all of the decoder strategies used in existing segmentation-based methods upsample the feature maps to the same scale as the input and classify each pixel as road or background according to a specific threshold. Roads are often obscured by other objects, such as trees or building shadows, especially at their edges, so the context features along the road boundary are not smooth and the boundary is difficult to recognize as smoothly as it appears in reality. The regions marked by yellow dashed circles in Figure 6 illustrate this; panels (a) to (d) show the original imagery, a visualization of the feature map of the last layer in D-LinkNet50, the threshold-based segmentation of the feature map, and the labels, respectively. Clearly, the roads extracted based on the threshold are coarse at the boundary of occluded roads. The binary cross-entropy loss function L_bce, given in Formula (1), is usually used to evaluate the distance between predicted results and true labels in binary segmentation applications. However, L_bce is a pixel-based metric that does not consider the overall consistency between the prediction results and the labels, whereas the detailed structures of roads that characterize boundaries are valuable in the field of remote sensing. Therefore, we designed a U-Net-like network optimized using the Dice [57] loss to refine the upsampled feature maps. The Dice loss function, represented as L_dice in Formula (2), measures the similarity between the prediction results and the labels; thus, the boundaries of the road can be further optimized.
To limit the complexity of the model, the added U-Net is lightweight. It is connected at the end of the segmentation branch and includes two encoding units with a low number of feature channels to refine the enhanced features efficiently, as shown in Figure 7. The cross-entropy loss function optimizes feature extraction, and the Dice loss function achieves road boundary refinement, which forms a coarse-to-fine road extraction structure. The total loss is the weighted sum of the two loss functions, as shown in Formula (3). Because L_bce is much smaller than L_dice, we set the weights α and β to 4 and 1, respectively, in our experiment to keep a balance between the segmentation and optimization branches.
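The coarse-to-fine loss described above can be sketched as follows. The sketch uses the standard textbook forms of binary cross-entropy and Dice loss, which may differ in detail from Formulas (1) and (2); it also assumes that α weights L_bce on the coarse segmentation output and β weights L_dice on the refined output, and the smoothing constant and function names are hypothetical.

```python
import numpy as np


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 labels."""
    p = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))


def dice_loss(pred, target, smooth=1.0):
    """One minus the (smoothed) Dice coefficient over the whole map."""
    intersection = np.sum(pred * target)
    dice = (2.0 * intersection + smooth) / (np.sum(pred) + np.sum(target) + smooth)
    return float(1.0 - dice)


def total_loss(coarse_pred, refined_pred, target, alpha=4.0, beta=1.0):
    """Weighted sum: alpha * L_bce (coarse branch) + beta * L_dice (refined branch).

    Which branch each term is applied to is an assumption of this sketch.
    """
    return alpha * bce_loss(coarse_pred, target) + beta * dice_loss(refined_pred, target)
```

Because the Dice term is computed over the whole prediction map rather than pixel by pixel, it responds to the overall overlap between the predicted road region and the label, which is the property the refinement branch relies on.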

Evaluation Metrics
Semantic segmentation-based road extraction from remote sensing images is a typical binary segmentation task that seeks to classify every pixel as the road or background. Thus, we evaluated the performance of the proposed method through general semantic segmentation-based evaluation metrics.
The prediction results comprise four cases, including correctly classified positive samples, incorrectly classified positive samples, correctly classified negative samples, and incorrectly classified negative samples. They are represented as True-Positive (TP), False-Positive (FP), True-Negative (TN), and False-Negative (FN), respectively. Precision, Recall, F1-Score, and Intersection over Union (IoU) are four common comprehensive evaluation metrics based on the above indicators. Their calculations are shown in Formulas (4)-(7).
Because the test set includes multiple images, we calculate the accuracy of each image separately, and we finally average the evaluation results of all the images.
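The per-image evaluation described above can be sketched as follows, using the standard definitions Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2·Precision·Recall/(Precision+Recall), and IoU = TP/(TP+FP+FN). The function names and the convention of returning 0 when a denominator is zero are assumptions of this sketch.

```python
import numpy as np


def image_metrics(pred_mask, label_mask):
    """Precision, recall, F1, and IoU for one binary road mask."""
    pred = np.asarray(pred_mask).astype(bool)
    label = np.asarray(label_mask).astype(bool)
    tp = int(np.sum(pred & label))
    fp = int(np.sum(pred & ~label))
    fn = int(np.sum(~pred & label))
    # Zero-denominator cases return 0.0 (assumed convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


def dataset_metrics(pred_masks, label_masks):
    """Average the per-image metrics over all images in the test set."""
    per_image = [image_metrics(p, l) for p, l in zip(pred_masks, label_masks)]
    return {key: float(np.mean([m[key] for m in per_image])) for key in per_image[0]}
```

Averaging per image gives every image the same weight regardless of how many road pixels it contains, which differs from pooling TP, FP, and FN over the whole test set before computing the metrics.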

Datasets and Strategies
In this section, we describe the experimental datasets, including the Massachusetts Road Dataset [58], the DeepGlobe Road Extraction Challenge dataset [55], and the Huawei Cloud competition dataset [59].
The Massachusetts dataset contains 1171 tiled images, including 1108 for training, 14 for validation, and 49 for testing. The size of all images is 1500 × 1500 pixels with a resolution of 1 m.
The DeepGlobe dataset contains 8570 images, including 6226 for training, 1243 for validation, and 1101 for testing, that were captured over Thailand, Indonesia, and India. We only utilized the training subset in our experiment because the corresponding labels of the other two subsets were not available. The size of all the images is 1024 × 1024 pixels with a resolution of 0.5 m. We randomly split these images into 4316 for training, 617 for validation, and 1293 for testing according to the ratio 7:1:2.
The road extraction dataset for the Huawei Cloud competition contains training and testing subsets. We only utilized the training set in our experiments because the labels for the testing set are private. The training set contains two image tiles from the Beijing2 satellite with a resolution of 0.8 m; their sizes are 40,391 × 33,106 and 34,612 × 29,810 pixels, respectively. We divided the images into training and testing areas according to a 7:3 ratio, as illustrated in Figure 8. Because of limited video memory, we clipped the test areas to patches of 1500 × 1500 pixels with an overlap of 18 pixels for testing. Although it is widely acknowledged that a larger overlapping region and fusing results using a voting strategy will significantly improve performance, we wanted to examine the performance directly, without extensive engineering.
For all three datasets, we trained our model on only the training set and tested the performance on the testing set; the validation set was used only to validate the method. We clipped the training set to patches of 512 × 512 pixels during the training stage, with an adapted overlap, due to the limitations of video memory size. The test and validation sets were predicted and cropped to the original size to evaluate the accuracy.
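Clipping a large image into fixed-size patches with a fixed overlap, as done for the 1500 × 1500 test patches with an 18-pixel overlap, can be sketched as follows. The function names are hypothetical, and the handling of the image border (shifting the last patch back so that it ends exactly at the edge) is an assumption, since the text does not specify it.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of tiles along one axis, covering [0, length)."""
    if length <= tile:
        return [0]
    stride = tile - overlap
    origins = list(range(0, length - tile, stride))
    # Assumed border rule: the last tile is aligned with the image edge,
    # so it may overlap its neighbor by more than `overlap` pixels.
    origins.append(length - tile)
    return origins


def tile_boxes(height, width, tile=1500, overlap=18):
    """(top, left, bottom, right) boxes of all tiles for an image of the given size."""
    return [
        (top, left, top + tile, left + tile)
        for top in tile_origins(height, tile, overlap)
        for left in tile_origins(width, tile, overlap)
    ]
```

With these defaults the stride is 1482 pixels, so neighboring patches share an 18-pixel band except at the image border.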
Our experiments were conducted on a server with a single 2080Ti GPU. The Adam optimizer was used to train the models with the recommended hyper-parameters, and the initial learning rate was 0.01. We adopted a data augmentation strategy with random flips in the vertical and horizontal directions, color jitter, and random rotation during the training stage. We utilized the Test Time Augmentation (TTA) strategy in the testing stage for all comparison methods. The final prediction was the arithmetic mean of the predictions for the original image and its horizontally and vertically flipped versions.
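The flip-based TTA described above can be sketched as follows. Here `model` is a hypothetical callable that maps an (H, W, C) image to an (H, W) road probability map; each flipped prediction is flipped back before averaging so that all three maps are aligned with the original image.

```python
import numpy as np


def predict_with_flip_tta(model, image):
    """Average predictions over the original, horizontally flipped,
    and vertically flipped versions of `image`."""
    views = [
        (image, lambda p: p),                      # original
        (image[:, ::-1], lambda p: p[:, ::-1]),    # horizontal flip and its inverse
        (image[::-1, :], lambda p: p[::-1, :]),    # vertical flip and its inverse
    ]
    aligned = [undo(model(view)) for view, undo in views]
    return np.mean(aligned, axis=0)
```

For a model whose output is exactly flip-equivariant, the averaged map equals the single prediction; TTA only changes the result where the three predictions disagree.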

Performance on Massachusetts
To verify the proposed method, we conducted comparative experiments on the Massachusetts dataset to compare the performance of our method with other classical semantic segmentation methods, including the U-Net [27], SegNet [6], Res-U-Net [28], DeepLabv3+ [7], and D-LinkNet [54] series methods. Res-U-Net introduces the residual block based on the U-Net. D-LinkNet34 is based on ResNet34 pre-trained on ImageNet, and the backbone of DeepLabv3+ and D-LinkNet50 is ResNet50 pre-trained on ImageNet.
The experimental results are shown in Table 1, with the best result for each metric in bold. Our proposed method achieved significant improvement compared with the classical semantic segmentation methods and achieved SOTA (State-of-the-Art) results on the Massachusetts test dataset. Furthermore, the proposed method gained a large performance margin over D-LinkNet50 and achieved a 4.92% improvement in IoU compared with D-LinkNet34, which attained first place in the CVPR DeepGlobe 2018 Road Extraction Challenge. For the convenience of comparing the extraction results of related methods, we illustrate an extraction sample and its local details in Figure 9 and Figure 10, respectively. Subparts (a) to (h) of Figure 9 show the results of the related methods. In Figure 10, subparts (c) to (i) display the local details of the proposed method and the results for U-Net, SegNet, Res-U-Net, DeepLabv3+, D-LinkNet34, and D-LinkNet50 corresponding to the yellow circle marked in (a). From the visualized local details, it can be inferred that our method extracted a smoother boundary than the comparison methods.
To further verify the performance of our proposed method, we compared it with the latest related road extraction methods on the Massachusetts dataset. Numerous studies have used this dataset, but the applied evaluation metrics are not uniform. Therefore, we selected the methods in [12,30,36,60,61] for a fair comparison. All of these methods were proposed within the last two years. TSKD-Road [60] is a lightweight knowledge distillation-based topological space network that produces more continuous roads while maintaining low computational complexity and few network parameters. CDG [36] introduced an attention mechanism based on DenseNet to enhance the edge information and the global context of the road. DGRN [12] designed a global dense residual network to aggregate abundant multi-scale features based on the ASPP for remedying the loss of spatial features. JointNet [61] combined dense connections with dilated convolution to enlarge the receptive field, which enabled the effective extraction of multi-scale roads. RB-UNET [30] is a reconstruction-biased U-Net for road extraction that captures rich semantic information from multiple upsampling operations.
Because the source code of the related works was not available, we referenced their reported results. As shown in Table 2, where the best result for each metric is in bold, the proposed method achieved significant improvement in both the F1 Score and IoU metrics. It outperformed the latest related studies and achieved SOTA results on the Massachusetts test dataset.

Performance on DeepGlobe
We designed an experiment on the DeepGlobe dataset, which contains high-resolution images, to evaluate the performance of the proposed method. Because the related road extraction methods do not use a uniform data division, we compared our method with only the classical semantic segmentation methods introduced above.
Although many studies have used this dataset for experiments, they do not make public the details of the data split, which makes it impossible to compare performance between related studies. Therefore, we provide the details of the data division and a new benchmark, which is convenient for comparisons in follow-up research. The experimental results are shown in Table 3, with the best result for each metric in bold. The results show that our method outperforms the comparison methods significantly on both the F1 Score and IoU metrics. In particular, our method achieved a 1.53% improvement in IoU over D-LinkNet50. For the convenience of comparing the extracted results of related methods, we illustrate an extracted sample and its local details in Figure 11 and Figure 12, respectively. In Figure 12, subparts (c) to (i) represent the local detail of the proposed method and the results for U-Net, SegNet, Res-U-Net, DeepLabv3+, D-LinkNet34, and D-LinkNet50 that correspond to the yellow circle marked in (a). According to the visualized results, the roads extracted by the proposed method are more continuous than those of the comparison methods. Moreover, the road boundaries of our method are smoother than those of the others.

Performance on Huawei Cloud
We earned second place in the preliminaries and first place in the finals of the Huawei Cloud competition [59] for road extraction from remote sensing images based on the idea of this work. To validate the advantage of the proposed method and provide a benchmark on this dataset, we designed an experiment to compare our method and the above classical segmentation methods on the training set. The results are shown in Table 4, with the best result for each metric in bold. Our method outperformed the comparison methods on the Huawei test dataset. For the convenience of comparing the extracted results for related methods, we also present some samples and local details in Figure 13 and Figure 14, respectively. The results of the compared works and those of our method are shown in subparts (a) to (h) of Figure 13. In Figure 14, subparts (c) to (i) represent the local details of the proposed method and the results for U-Net, SegNet, Res-U-Net, DeepLabv3+, D-LinkNet34, and D-LinkNet50 corresponding to the yellow circle marked in (a).
Overall, the visualized results of the above three datasets show that the introduced attention-based residual module and the lightweight road optimization network significantly improved the continuity of roads as well as the smoothness of the road boundaries.

Ablation Study
Our method introduced attention-based cascaded residual blocks to enhance multi-scale spatial details and semantic features. Furthermore, the lightweight U-Net is connected at the end of the network to optimize the boundary of the roads. Thus, we can extract multi-scale roads from multi-resolution remote sensing imagery accurately. To evaluate the effect of each module of the proposed method, we designed an ablation experiment on the DeepGlobe dataset. We selected D-LinkNet50 as our baseline because our network is based on it. The attention-based residual block and lightweight refinement network are represented as Att_B and Ref_B, respectively. The experimental results, shown in Table 5, indicate that the introduced modules significantly improved the performance of the network. Att_B obtained a 0.84% IoU improvement relative to the baseline, which indicates that the local detail context in shallow features plays a significant role in road extraction, especially for narrow roads. Ref_B obtained a 0.77% IoU improvement relative to the baseline, which suggests that the boundary optimization module is better than the threshold segmentation method. Overall, our method achieved a 1.53% improvement in IoU over the baseline.

Efficiency Comparison
Extensive experiments on three public datasets with different image spatial resolutions showed that the proposed method achieved significant improvements over related works. Efficiency is also important in the automatic interpretation of large-scale remote sensing images. Therefore, in this section we discuss the relationship between IoU accuracy and the number of trainable parameters and Floating-point Operations (FLOPs) of related works on the DeepGlobe dataset. The experimental results are shown in Table 6. According to the results, our method achieved the highest accuracy among the comparison methods. In particular, the proposed method has far fewer parameters and FLOPs than the other methods based on the pre-trained ResNet50, such as DeepLabv3+ and D-LinkNet50. Although D-LinkNet34, which benefits from a lighter pre-trained model than our method, has significantly fewer parameters and FLOPs, its IoU accuracy is also lower. Overall, our method achieved a better trade-off between accuracy and efficiency than related methods. To compare the results intuitively, we visualize the relationship between accuracy, trainable parameters, and FLOPs in Figure 15.

Discussion and Conclusions
Two problems persist in road extraction from remote sensing images: (1) the extracted results are discontinuous, and (2) the boundaries of the roads are zigzagged and blurred. To address them, this work proposed an attention-based cascaded network to optimize road extraction. It has two main parts: the attention-based residual block, which is introduced to maintain the multi-scale spatial details of the roads, and the lightweight optimization network, which is designed to refine the road boundaries.
We conducted extensive experiments on three public datasets to evaluate the performance of the proposed method. It outperformed the latest related works and achieved SOTA results. Furthermore, we constructed new benchmarks and provided a detailed description of the data division in the other two datasets, which is convenient for follow-up studies to use for comparison.
These research results suggest that the structure information contained in the shallow convolutional layers plays a vital role in recovering detail, especially for small objects such as buildings and roads. Existing methods learn features from the data end to end, which, to some extent, ignores prior knowledge such as the structure of the road boundary. We will continue to focus on these structural constraints to reduce the dependence on data in existing data-driven methods.