Cascaded Attention DenseUNet (CADUNet) for Road Extraction from Very-High-Resolution Images

The use of very-high-resolution images to extract urban, suburban and rural roads has important application value. However, it remains difficult to effectively extract road areas occluded by roadside tree canopy or high-rise buildings while maintaining the integrity of the extracted road area, the smoothness of the sideline and the connectivity of the road network. This paper proposes a Cascaded Attention DenseUNet (CADUNet) semantic segmentation model that embeds two attention modules, a global attention module and a core attention module, in the DenseUNet framework. First, a set of cascaded global attention modules is introduced to obtain the contextual information of the road; second, a set of cascaded core attention modules is embedded to ensure that road information is transmitted to the greatest extent among the dense blocks in the network and to further assist the global attention module in acquiring multi-scale road information, thereby improving the connectivity of the road network while restoring the integrity of road areas shaded by tree canopy and high-rise buildings. Based on binary cross entropy, an adaptive loss function is proposed for network parameter tuning. Experiments on the Massachusetts road dataset and the DeepGlobe-CVPR 2018 road dataset show that this semantic segmentation model can effectively extract road areas shaded by tree canopy and improve the connectivity of the road network.


Introduction
Road information is of vital importance in urban and rural development [1], emergency and disaster relief [2], vehicle navigation [3] and geographic information systems [4]. With the rapid development of remote sensing technology, very-high-resolution (VHR) remote sensing images have been used for extracting road information [5]. In practice, most road data updates still rely on manual interpretation, which is time-consuming, laborious and lacks quality control. Many road extraction algorithms have been developed [6][7][8]. These algorithms can be divided into traditional machine learning methods [9][10][11][12][13][14] and more recent deep learning methods [15][16][17]. Some traditional road extraction methods mainly use the spectral features of remote sensing images, occasionally supplemented by texture features. However, such methods have difficulty exploiting the geometric and contextual information in remote sensing images [18] and easily produce "salt and pepper" noise [19]. Among the traditional methods, the object-based approach clearly improves road extraction. Instead of pixels, it uses image objects as the basic unit and exploits their spectral, geometric, textural and contextual features for information extraction, thereby improving product quality [20,21]. On the one hand, this approach depends heavily on the quality of image segmentation, and finding suitable segmentation parameters is itself a difficult problem. On the other hand, there are many spectral, textural, geometric and contextual features, and it is difficult to determine which are most suitable for road information extraction. When the data source or regional conditions change, the features required for classification need to be adjusted.
Oktay et al. [51] proposed an attention module that learns weighted images from a high level to focus on useful features and suppress irrelevant regions in the intermediate feature map, thereby improving prediction performance.
In response to these problems, we propose a Cascaded Attention DenseUNet (CADUNet) by embedding two attention modules, a global attention module and a core attention module, into the DenseUNet framework. We use the core attention module to extract road areas, including the occluded parts, and the global attention module to enhance global context information about the road network. The main contributions of this article are as follows:
1. The core attention modules and the global attention modules are cascaded together in the DenseUNet to combine road information at different scales, thus improving the connectivity of the road network and the smoothness of the sidelines.
2. An adaptive loss function is introduced to address the very small ratio of road to non-road areas in the training samples.
The rest of the paper is structured as follows: In Section 2, we introduce the CADUNet method. Section 3 specifies data preparation used in the experiment. Section 4 shows the results. Section 5 explores the mechanisms for the effectiveness of the network model and Section 6 provides the conclusions.

Methods
The proposed CADUNet is a composite semantic segmentation network built by embedding global and core attention modules into the DenseUNet framework. DenseUNet integrates two classical networks, UNet and DenseNet [52]. UNet consists of two parts, an encoder and a decoder. DenseUNet consists of dense blocks and transition down layers combined with UNet: the dense blocks and transition down layers are inserted into the encoder part of UNet to replace the original convolutional and pooling layers, which improves the performance of UNet in semantic segmentation [40,45]. In CADUNet, global attention modules are further added to the decoder part of UNet (Figure 1). In addition, core attention modules are embedded between the encoder and decoder. To obtain better results, it is necessary to obtain high-level semantic information from images while retaining low-level detail. Information from the lower layers can be transferred to the higher layers along the information transmission path, so that low-level detail and high-level semantic information complement each other [44]. The following subsections provide the details.

Encoder
We use dense blocks and transition down layers in the encoder part of UNet. Each dense block is composed of four dense layers (Figure 2), and the output of each dense layer is a feature map with the same channel dimension. Within each dense block, all layers are densely connected. Dense blocks are connected through the transition down layers between them. In a single dense block, the function $F_l(\cdot)$ is used for the nonlinear transformation between layers. The dense connection is defined as Equation (1) [52]:
$$D_l = F_l\left(\left[D_0, D_1, D_2, \ldots, D_{l-1}\right]\right) \qquad (1)$$
where $l$ is the index of the dense layer within the dense block, $D_l$ is the output feature map of the $l$-th layer and $[D_0, D_1, D_2, \ldots, D_{l-1}]$ is the concatenation of the feature maps of all layers preceding the $l$-th layer.
Considering that DenseNet generates many feature maps, and with them many model parameters, we define a growth rate K to control the number of feature maps, where K is the number of feature maps output by each dense layer. We set K to 48, which is the same as the size of the feature maps inside each dense block (Figure 2).
To reduce the amount of calculation and increase the receptive field, a down transition layer is used after each dense block. Each transition layer is composed of batch normalization (BN), rectified linear unit (ReLU), bottleneck layer (1 × 1 convolution) and average pooling layer (2 × 2).
Architecture of CADUNet (The parameters include: k, the kernel size; n, the number of output channels; s, the stride size; p, the padding size).
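The dense connection and the transition down step described in this subsection can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `dense_layer` is a hypothetical stand-in for $F_l$ (a random 1 × 1 convolution plus ReLU), the block output is assumed to be the concatenation of the block input and all layer outputs as in DenseNet, and the pooling is assumed to use a stride of 2. BN and the bottleneck convolution of the transition layer are omitted.

```python
import numpy as np

def dense_layer(x, growth_rate, rng):
    """Hypothetical stand-in for F_l: random 1x1 convolution followed by ReLU.

    x has shape (H, W, C); the result has shape (H, W, growth_rate).
    """
    weights = rng.standard_normal((x.shape[-1], growth_rate)) * 0.1
    return np.maximum(x @ weights, 0.0)

def dense_block(x, growth_rate=48, num_layers=4, seed=0):
    """Apply D_l = F_l([D_0, ..., D_{l-1}]) for l = 1..num_layers."""
    rng = np.random.default_rng(seed)
    features = [x]  # D_0 is the block input
    for _ in range(num_layers):
        stacked = np.concatenate(features, axis=-1)  # [D_0, ..., D_{l-1}]
        features.append(dense_layer(stacked, growth_rate, rng))
    # Assumption: the block returns all feature maps concatenated.
    return np.concatenate(features, axis=-1)

def transition_down(x):
    """2 x 2 average pooling with stride 2 (BN, ReLU and 1x1 conv omitted)."""
    h, w, c = x.shape
    x = x[: h - h % 2, : w - w % 2]  # drop an odd trailing row/column
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

block_input = np.ones((8, 8, 16))
block_output = dense_block(block_input)
pooled = transition_down(block_output)
print(block_output.shape, pooled.shape)
```

With K = 48 and four dense layers, each block adds 4 × 48 = 192 channels to its input under this assumption, while the pooling halves the spatial size.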

Attention Mechanism
The attention mechanism helps the network focus on targets of interest [44,45]. This study uses two attention modules: a core attention module [44] and a global attention module [45]. In the core attention module, the attention signal is computed from the output of the last dense block (Figure 3). The core attention module has two inputs: one is the output of the three dense blocks, and the other is the attention signal. By connecting low-level features to high-level features, the core attention module weakens background information and enhances useful local details, thereby reducing misjudgments in the original skip connection features and improving the integrity of the extracted road network. On the one hand, the core attention module ensures the maximum transmission of road information between all layers of the network. On the other hand, it assists the global attention module in improving the integrity of the road while eliminating the tree canopy occlusion effect.
In the global attention module, global average pooling is first used to extract global context information from the high-level feature map (Figure 4), since it is a convenient way to obtain global context information in images [45]. Then, the global context information is activated through a sigmoid function. Finally, the weighted features are added to the feature map to integrate global information. By collecting global context with the global average pooling layer and enhancing the global information of the feature map, the global attention module addresses the interruption of road extraction caused by tree canopy occlusion.
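The three steps of the global attention module (global average pooling, sigmoid activation, weighted addition) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the module from the paper: it uses a single input feature map, applies the channel weights to that same map, and leaves out any convolution layers that the actual module may contain. The function name `global_attention` is hypothetical.

```python
import numpy as np

def global_attention(feature_map):
    """Simplified global attention on an (H, W, C) feature map.

    Assumption: the weights are applied to the same map that they are
    derived from; the module in the paper may combine two different maps.
    """
    context = feature_map.mean(axis=(0, 1))        # global average pooling -> (C,)
    weights = 1.0 / (1.0 + np.exp(-context))       # sigmoid activation
    weighted = feature_map * weights               # broadcast over H and W
    return feature_map + weighted                  # add weighted features back

features = np.random.default_rng(0).standard_normal((16, 16, 8))
attended = global_attention(features)
print(attended.shape)
```

Because the sigmoid output lies between 0 and 1, each channel is scaled by a factor between 1 and 2 in this sketch, so channels with a stronger global response are emphasized more.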

Decoder
We made two main adjustments to the decoder in CADUNet. The first is to use a simple up-sampling operation with a step size of 2 in the first layer, and the second is to use 4 global attention modules plus 3 improved up-sampling operations. In the improved up-sampling operations, 1 × 1 convolution, BN and ReLU are performed first, followed by 3 × 3 convolution, BN and ReLU, and finally simple up-sampling, so that the result matches the size of the output of the attention module. We add the output of the last global attention module to the corresponding layer in the encoder. After that, the output relates to the corresponding layer in the encoder. Then, following 1 × 1 convolution, BN and ReLU operations, a simple up-sampling operation restores the image to the size of the original input image. Finally, convolution, BN, ReLU and sigmoid operations are used to generate the predicted road map.
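As a small illustration of the simple up-sampling step with a factor of 2, the sketch below uses nearest-neighbor repetition in NumPy. The interpolation method is an assumption, since the text does not state which one is used, and `upsample2x` is a hypothetical helper name.

```python
import numpy as np

def upsample2x(feature_map):
    """Nearest-neighbor up-sampling of an (H, W, C) array by a factor of 2."""
    return feature_map.repeat(2, axis=0).repeat(2, axis=1)

x = np.arange(4, dtype=np.float32).reshape(2, 2, 1)
y = upsample2x(x)
print(y.shape)     # (4, 4, 1)
print(y[..., 0])   # each input value now covers a 2 x 2 patch
```

Each up-sampling step doubles the height and width, which reverses one 2 × 2 pooling step of the encoder.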

Adaptive Loss Function
In this paper, we treat road extraction as a binary semantic segmentation task. The proportion of road area is usually less than 10%, and the proportion of non-road background is usually greater than 90%. With random sampling, training efficiency is low because negative samples make up most of the training samples [24]. To this end, we adopt a new adaptive loss function to adjust for the imbalance between positive and negative samples. In this loss, $P_{road}$ and $P_{background}$ represent the percentages of road and non-road pixels in the entire area, respectively, $L_{BCE}$ is the binary cross entropy loss [53], and $L_{IoU}$ is the intersection ratio index [54], which emphasizes the deviation between the predicted road and the actual road. In the calculation of $L_{BCE}$ and $L_{IoU}$, $g_i$ ($i = 1, 2, \ldots, n$) is the ground truth of the i-th pixel, $p_i$ ($i = 1, 2, \ldots, n$) is the prediction for the i-th pixel and $n$ is the number of pixels.
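The ingredients of the loss can be sketched in plain Python. The binary cross entropy below is the standard definition; the soft IoU loss is one common differentiable form and is an assumption here, as is every function name. The way the paper weights the two terms with $P_{road}$ and $P_{background}$ is not reproduced, so the sketch only computes the class proportions and the two terms separately.

```python
import math

def class_proportions(ground_truth):
    """Return (P_road, P_background) for a flat list of 0/1 labels."""
    p_road = sum(ground_truth) / len(ground_truth)
    return p_road, 1.0 - p_road

def bce_loss(ground_truth, predictions, eps=1e-7):
    """Standard binary cross entropy averaged over n pixels."""
    total = 0.0
    for g, p in zip(ground_truth, predictions):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return -total / len(ground_truth)

def soft_iou_loss(ground_truth, predictions, eps=1e-7):
    """Assumed soft IoU loss: 1 - intersection / union on probabilities."""
    intersection = sum(g * p for g, p in zip(ground_truth, predictions))
    union = sum(ground_truth) + sum(predictions) - intersection
    return 1.0 - (intersection + eps) / (union + eps)

labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]          # 10% road pixels
probs = [0.8, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
print(class_proportions(labels))
print(bce_loss(labels, probs), soft_iou_loss(labels, probs))
```

With only 10% road pixels, the cross entropy is dominated by the background term, which is the imbalance that the class proportions are meant to correct.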

Experiment Preparation
The datasets used in this study are from the Massachusetts road dataset and DeepGLOBE-CVPR 2018 road dataset (CVPR dataset) [55,56]. They are composed of image datasets for training, validation and test, associated with corresponding reference maps. The Massachusetts road dataset contains a total of 1171 images. Each image in this dataset is 1500 × 1500 pixels, with a spatial resolution of 1.2 m and a coverage area of 2.25 square kilometers. The dataset covers a variety of typical urban, suburban and rural areas, with a total area of more than 2600 square kilometers. The CVPR road dataset contains 6226 satellite images with a size of 1024 × 1024 pixels and a spatial resolution of 50 cm. Accordingly, these datasets can be divided into rural, suburban and urban road datasets, as shown in Figure 5.
To build the training, validation and test datasets for this experiment, all image datasets were cropped and augmented. First, the images and the corresponding reference maps were expanded by random rotation (90, 180 and 270 degrees), random horizontal and vertical flips and random brightness adjustment (0.5-1.5). Then, they were randomly cropped to 256 × 256 pixels [36]. From the Massachusetts dataset, we obtained 50,545 images, of which 42,963 were used for training and 7582 for validation; the test dataset consists of 49 original 1500 × 1500 images. From the CVPR road dataset, 84,000 images were obtained, of which 71,400 were used for training and 12,600 for validation; the test dataset consists of 105 original 1024 × 1024 images.
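The augmentation and cropping steps above can be sketched with NumPy. This is an illustrative version, not the authors' pipeline: the flip probability of 0.5, the multiplicative interpretation of the brightness range 0.5-1.5 and the function name `augment_pair` are assumptions. The geometric operations are applied identically to the image and its reference map, while brightness is changed only on the image.

```python
import numpy as np

def augment_pair(image, mask, rng, crop=256):
    """Randomly rotate, flip, brighten and crop an image/reference-map pair.

    image: (H, W, 3) uint8 array; mask: (H, W) array of 0/1 labels.
    """
    k = int(rng.integers(1, 4))  # 1, 2 or 3 quarter turns: 90, 180 or 270 degrees
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    if rng.random() < 0.5:       # horizontal flip (probability is an assumption)
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:       # vertical flip (probability is an assumption)
        image, mask = image[::-1], mask[::-1]
    factor = rng.uniform(0.5, 1.5)  # brightness factor, applied to the image only
    image = np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)
    top = int(rng.integers(0, image.shape[0] - crop + 1))
    left = int(rng.integers(0, image.shape[1] - crop + 1))
    return (image[top:top + crop, left:left + crop],
            mask[top:top + crop, left:left + crop])

rng = np.random.default_rng(42)
image = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
mask = (rng.random((512, 512)) > 0.9).astype(np.uint8)
patch, patch_mask = augment_pair(image, mask, rng)
print(patch.shape, patch_mask.shape)
```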
We compare this method with UNet [30], DeepLab v3+ [32], DenseUNet [40], the improved DenseUNet with only the core attention modules (CDenseUNet) and the improved DenseUNet with only the global attention modules (GDenseUNet).
This experiment was implemented on a high-performance computing platform: the CPU consists of 2 groups of Intel Xeon 5120 with 14 cores and 128 GB of working memory, the GPU consists of 2 groups of NVIDIA P100 with 16 GB of memory, and the operating system is CentOS 7. We used the Keras deep learning framework with the TensorFlow backend. The Adam function [57] is used for parameter optimization. Each epoch processed 16 images. The learning rate was initially set to 0.0001 and was reduced by 0.02 times per period, and the number of epochs was set to 50.
In this experiment, we use overall accuracy (OA), precision, recall, F1-score and Intersection over Union (IoU) for validation. Equations (5)-(9) [36,54,58] describe these assessment metrics:
$$OA = \frac{TP + TN}{TP + FP + FN + TN} \qquad (5)$$
$$Precision = \frac{TP}{TP + FP} \qquad (6)$$
$$Recall = \frac{TP}{TP + FN} \qquad (7)$$
$$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (8)$$
$$IoU = \frac{TP}{TP + FP + FN} \qquad (9)$$
where TP, FP, FN and TN represent true positive, false positive, false negative and true negative, respectively.
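The five assessment metrics follow directly from the four confusion counts. The helper below is a minimal sketch using the standard definitions; the function name `road_metrics` and the example counts are illustrative and are not taken from the paper's tables.

```python
def road_metrics(tp, fp, fn, tn):
    """Compute OA, precision, recall, F1-score and IoU from pixel counts."""
    def safe_div(numerator, denominator):
        # Return 0.0 instead of raising when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "OA": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "F1": safe_div(2 * precision * recall, precision + recall),
        "IoU": safe_div(tp, tp + fp + fn),
    }

# Illustrative counts: 1000 pixels, 100 of them road.
print(road_metrics(tp=80, fp=20, fn=20, tn=880))
```

In this example OA is 0.96 while IoU is about 0.67, which shows why OA alone is not informative when road pixels are a small minority.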

Massachusetts Dataset
In the Massachusetts dataset, road occlusion mainly comes from the tree canopy alongside rural and suburban roads, while images with occluded urban roads are few. Figure 6 shows roads partially occluded by tree canopies in rural areas (scenes 1-3), roads partially occluded by tree canopies in suburbs (scenes 4-5) and roads partially occluded by high-rise buildings in urban areas (scene 6).
According to these results, the CADUNet proposed in this paper achieves good results on occluded roads in rural, suburban and urban areas. There is a gap between the results of DeepLab v3+ and UNet when the road is occluded by tree canopy and its shadows. The results derived from the proposed CADUNet are closer to the ground truth than those from the other methods, and the smoothness of the road edges is significantly improved. UNet performs well in scenes 3 and 6 but poorly in scenes 1, 2 and 4. DeepLab v3+ performs well in scenes 1 and 3, and DenseUNet performs well in scenes 3 and 5 but not in the remaining scenes. CDenseUNet and GDenseUNet perform poorly in scenes 1 and 2 and better in the other scenes. Finally, CADUNet achieves the best results in all six scenes by eliminating the occluding effects of tree canopies alongside the road.
Figure 7 shows the results extracted from the CVPR road dataset with these 6 methods. The first and second scenes contain roads occluded by tree canopies in rural areas, and the third and fourth scenes contain roads covered by tree canopies in the suburbs. The fifth and sixth scenes show urban roads covered by the shadows of high-rise buildings. Our CADUNet method achieves good results in extracting rural, suburban and urban roads. The first scene shows that, for a partially covered road, the results obtained by UNet are better than those from DeepLab v3+ and DenseUNet. In the second and sixth scenes, where tree canopy and high-rise building shadows block part of the road, DeepLab v3+ performs better than UNet and DenseUNet. DenseUNet performs better only in the fourth scene, while CDenseUNet performs better only in the third and sixth scenes. GDenseUNet and CADUNet obtain the best results in the first, second, third and fifth scenes occluded by the tree canopy. The global attention mechanism thus plays a key role in extracting roads occluded by tree canopy and building shadows in the CVPR dataset. The other methods perform poorly on the fourth scene because they do not use the core attention mechanism. Therefore, the CADUNet method achieves good results with the cascaded dual attention mechanism.

Massachusetts Dataset
For the Massachusetts road dataset, the 6 methods are used to extract complex road networks, including rural (scenes 1-3 in Figure 8), suburban (scenes 4-5 in Figure 8) and urban road networks (scene 6 in Figure 8), and transportation hubs (scenes 7-8 in Figure 8). The extraction results show that CADUNet performs well on sparse rural roads and on suburban and urban roads neighboring parking lots. Compared with the other models, the CADUNet method not only depends on the visual characteristics of the road but also has a certain reasoning ability, gained by modeling the road context. Figure 8 shows that the road networks obtained by UNet and DeepLab v3+ have obvious defects. Compared with UNet and DeepLab v3+, DenseUNet shows some improvement. Compared with the standard DenseUNet, CDenseUNet and GDenseUNet reduce road interruptions and enhance the connectivity of the road network. Compared with the other 5 models, CADUNet yields better road connectivity and fewer road interruptions.
Accuracy assessment shows that the OA, recall, precision, F1-score and IoU obtained by CADUNet are the highest, reaching 98.00%, 76.55%, 79.45%, 77.89% and 64.12%, respectively (Table 1). Compared with UNet, the F1-score and IoU of the CADUNet method increased by 2.49% and 4.26%, respectively. Compared with the standard DenseUNet, the F1-score and IoU of CADUNet increased by 3.25% and 4.16%, respectively. With both attention modules, the IoU of CADUNet increased by 3.04% and 2.21% compared to CDenseUNet and GDenseUNet, respectively.

CVPR Dataset
As shown in Figure 9, the results based on the CVPR road dataset include rural roads (scenes 1-3), suburban roads (scenes 4-5) and urban roads (scene 6). CADUNet extracts rural roads best, followed by suburban roads and urban roads. Comparison among the 6 models shows that the results of UNet and DeepLab v3+ have the worst road network connectivity and severe road incompleteness. CDenseUNet and GDenseUNet improve on DenseUNet but still have their own shortcomings, and the connectivity of the road is poor. By embedding the cascaded dual attention mechanism into the DenseUNet, the CADUNet method obtains the best results in terms of road network connectivity.
In the experiment with the CVPR road dataset, the CADUNet method reached the highest overall accuracy, F1-score and IoU, at 97.09%, 76.28% and 62.08%, respectively (Table 2). Compared with UNet, the recall and IoU of this method increased by 6.14% and 3.83%, respectively. Compared with DeepLab v3+, the CADUNet method increases the IoU by 5.57%. With the two attention mechanisms added, the CADUNet method increases the F1-score and IoU by 1.67% and 2.11% compared to DenseUNet.
Figure 10a,b shows the changes of the loss function with epochs on the Massachusetts and CVPR training datasets. As the number of training epochs increases, the losses of all 6 models gradually decrease. The CADUNet proposed in this paper shows a better descending rate of the loss function than UNet, DeepLab v3+, CDenseUNet and GDenseUNet. UNet and DeepLab v3+ performed the worst.
Figure 10c,d shows the changes in the loss function with the training epochs on the Massachusetts and CVPR validation datasets, respectively. The CADUNet proposed in this paper has the lowest validation loss on the two datasets, that is, the result obtained by the method is the closest to the truth. After 25 epochs, the CADUNet model tends to be stable.

Discussion
In the results of road extraction from VHR images, the occluding effect of the tree canopy and high-rise buildings aside the road often leads to the incompleteness of the road surface and even the interruption of the road network. As the basic framework of the proposed CADUNet, the DenseUNet semantic segmentation network performs well on employing the deep features of the image, avoiding gradient dispersion and making the network easy to train. Its feature reuse function can ensure that the most road information is preserved between the network layers, thereby improving the connectivity of the extracted road network. Therefore, it lays a solid foundation for road information extraction. Furthermore, the global attention module that we added to the DenseUNet model can enhance the global context information from the road feature map, thereby reducing the road interruption caused by tree canopy occlusion and building shadows to a certain extent, and the road integrity is significantly improved. We added the core attention module to the DenseUNet model to fuse more low-level features into the high-level feature map, so as to ensure that road information is transmitted to the greatest extent in dense blocks in the network, and further assist the global attention module to obtain more road information at the encoding part. This module improves the connectivity of the road network, and at the same time restores the integrity of the road surface and the smoothness of the sideline at the decoding part. Figure 11 shows the accuracy assessment results of six examples using the total six models on the Massachusetts dataset, where the green, red and blue areas represent TP, FP and FN, respectively. The first line in the figure shows an image with a loop road and its extraction results. Only the CDenseUNet and CADUNet models with the core attention mechanism have the most extent of TP and the least area of FP and FN, and the loop is relatively complete. 
This shows that the core attention mechanism makes up for the deficiency of the global attention mechanism to a certain extent. The second row shows the extraction result of the road that is sheltered by the elevated railway. It can be seen from these panels that the use of CDenseUNet, GDenseUNet and CADUNet models can extract limited roads sheltered by railways. This reflects the superiority of the core and the global attention modules. CADUNet has the most TP areas and the least FP and FN areas due to the use of two cascaded attention modules. The third row shows the extraction result of an image with the intersection of the main road and the minor road. With UNet and the DeepLab V3+ model, only the main road can be identified. Based on DenseUNet, CDenseUNet and CADUNet models, the extraction quality is better than the other three models. The CADUNet model achieves the largest TP areas and the smallest FP and FN areas, which embodies the advantage of the cascaded attention mechanism. The fourth row shows the extraction results of roads that are occluded by dense tree canopies on the roadside. The CDenseUNet, GDenseUNet and CADUNet models obtained good results. The dual attention mechanism integrated in the CADUNet model can solve the problem of roads being occluded by the tree canopy. It can be seen from the panels in row 5 that all the above six models can identify the main road but cannot identify the minor road connected to residential houses. In the Massachusetts dataset, the labeled dataset generally does not include such minor roads, so that they were ignored in the six network models when learning. Therefore, the error is due to the inconsistency of the labeling data and the overall labeling dataset. Row 6 concerns an image with the main and minor road intersection area. In the extraction results, the DeepLab V3+ and DenseUNet models present poor results, while the minor roads are not recognized. 
However, the main road and one of the minor roads can be well-identified by using CDenseUNet, GDenseUNet and CADUNet models, and the most TP areas and the least FP and FN areas can be achieved with the CADUNet model. At the same time, all six models still missed one of the minor roads labeled in the evaluation dataset. Although the minor road is labeled in the evaluation data, its features as a road are not obvious, which makes it difficult for the six models to recognize. Figure 12 shows the accuracy assessment results of six examples of road extraction using the total six models on the CVPR dataset, and the color definitions are consistent with the foregoing. The panels in the first row show an image with the intersection of the main road and its minor roads in rural areas. These 6 models merely extract the main road, but not the minor road, which is related to the labeling dataset. In the labeling dataset, only a small part of the roads of this type are labeled, and most are not labeled. As a result, these deep learning models cannot be used to recognize the minor roads of this type. The panels in the second line show the country road that is shaded by trees. For this kind of image with rural roads, DenseUNet, CDenseUNet, GDenseUNet and CADUNet models have achieved good results, obtaining more TP areas and fewer FP and FN areas, which highlights the effectiveness of DenseUNet, as the basis of these networks, and the dual attention mechanisms in road extraction. For the image with parallel roads shown in the third row, the CADUNet model performs well, achieving the most TP area and the least FP and FN area, which reflects the superiority of the cascaded attention mechanism. However, there is still a gap between this extracting result and the labeled dataset, because one of the parallel roads is omitted from the labeling data. The images in the fourth row show the crossing area of the two roads, associated with a roadside canopy occluding effect. 
For this image, the CADUNet model achieved more TP areas and the fewest FP and FN areas, giving the best recognition result and reflecting the advantages of the cascaded attention mechanism. The fifth and sixth rows show images of a curved road in an urban area, part of which is clearly occluded by the shadows of buildings, together with the corresponding extraction results. Good results were obtained only with the CDenseUNet and CADUNet models, whereas the results of the other four models are relatively poor, which indicates that the core attention mechanism plays a significant role in extracting this type of road.
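The TP, FP and FN areas compared in the figures above can be derived pixel by pixel from a predicted road mask and its label mask. The following is a minimal sketch of that comparison, assuming binary masks in which 1 denotes road and 0 denotes background. The function name `confusion_map` and the toy masks are hypothetical and are not taken from the experiments in this paper.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]  # assumed encoding: 1 = road, 0 = background


def confusion_map(pred: Mask, label: Mask) -> Tuple[List[List[str]], Dict[str, int]]:
    """Assign each pixel to TP, FP, FN or TN and count the pixels per category."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    category_map = []
    for pred_row, label_row in zip(pred, label):
        row = []
        for p, t in zip(pred_row, label_row):
            if p == 1 and t == 1:
                category = "TP"  # road correctly extracted
            elif p == 1 and t == 0:
                category = "FP"  # background extracted as road
            elif p == 0 and t == 1:
                category = "FN"  # labeled road that was missed
            else:
                category = "TN"  # background correctly rejected
            counts[category] += 1
            row.append(category)
        category_map.append(row)
    return category_map, counts


# Toy 3 x 3 example (hypothetical values, for illustration only).
pred = [[1, 1, 0],
        [0, 1, 0],
        [0, 0, 1]]
label = [[1, 0, 0],
         [0, 1, 1],
         [0, 0, 1]]

category_map, counts = confusion_map(pred, label)
print(counts)
```

Rendering each category of `category_map` in a distinct color would give an assessment panel of the kind discussed above, and comparing `counts` across models corresponds to the statements about which model has the most TP and the fewest FP and FN areas.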

Conclusions
In this study, we proposed the CADUNet model, based on the DenseUNet framework, to address the problems of incomplete road surfaces, uneven sidelines and poor road network connectivity caused by roadside tree canopy in VHR images. We added global attention modules to obtain the global information of the road and introduced core attention modules to ensure that road information is transmitted to the greatest extent among the dense blocks of the network. The model can extract more road information from multiple locations to improve road integrity and enhance the robustness of feature extraction under tree canopy and the shadows of urban high-rise buildings. Finally, an adaptive loss function was introduced to balance the ratio of road areas to non-road areas in the training samples. We used the Massachusetts road dataset and the DeepGlobe-CVPR 2018 road dataset for comparative experiments. The results showed that, among the compared models, the CADUNet model achieved the most encouraging road extraction results from VHR images. Although our network model has achieved good performance, there is still room for improvement with respect to the under-segmentation and over-segmentation of roads, which affect sideline smoothness, road interruptions and the connectivity of the road network. In addition, we expect the quality of the label dataset to be further improved in follow-up work.
Author Contributions: Jing Li and Yong Liu conceived and designed the study program; Jing Li conducted the experiment and wrote the manuscript; Yong Liu and Yindan Zhang provided revision opinions and experimental guidance; Yang Zhang participated in the discussion of project planning and paper revision. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: All data generated or analyzed during this study are included in this article.