Road Extraction Convolutional Neural Network with Embedded Attention Mechanism for Remote Sensing Imagery

Abstract: Roads are closely related to people’s lives, and road network extraction has become one of the most important remote sensing tasks. This study proposes a road extraction network with an embedded attention mechanism to automatically extract road networks from large numbers of remote sensing images. Based on the U-Net framework, a channel attention mechanism and a spatial attention mechanism are introduced to enhance the use of spectral and spatial information. Moreover, residual densely connected blocks are introduced to enhance feature reuse and information flow transfer, and a residual dilated convolution module is introduced to extract road network information at different scales. The experimental results show that the proposed method outperforms the compared algorithms in overall accuracy, produces fewer false detections, and extracts roads that are closer to the ground truth. Ablation experiments show that the proposed modules effectively improve road extraction accuracy.


Introduction
Road networks play an important role in real-world applications such as smart cities and intelligent traffic management [1,2]. As an important means of obtaining road network maps, road extraction has received extensive attention from scholars in this field, and road extraction from images is now a basic research task in remote sensing. Compared with traditional visual interpretation of remote sensing images, automatic road interpretation algorithms can obtain road network information quickly and at low cost. However, because road networks in different regions differ greatly in shape, width, length, and material, roads appear very differently in remote sensing images. Therefore, accurate extraction of road network information from remote sensing images remains a challenging problem.
So far, a series of road extraction algorithms have been developed, which can be mainly divided into two categories: traditional road extraction algorithms and deep learning-based road extraction algorithms.
Among traditional road extraction algorithms, some scholars used template matching for semi-automatic extraction, including a rectangular template matching algorithm [3], a circular template matching algorithm [4], and a T-shaped template matching algorithm [5]. This type of semi-automatic template matching places high demands on the adaptability of the template, and for complex road networks with frequent changes in curvature and width it suffers from defects such as inaccurate extraction and unsmooth roads. Some scholars focused on the positive effect of the spectral features of remote sensing images on extraction accuracy. Shi et al. [6] combined the spectral information, spatial information, and shape features of remote sensing images to obtain a binary road map. Coulibaly et al. [7] combined the spectral angle algorithm with Lowe's scale-invariant feature transform descriptors to achieve high-quality extraction of road networks. Vector data were also introduced into automatic road network extraction algorithms to assist road extraction and improve extraction accuracy. Cao et al. [8] introduced GPS data into the process of road centerline extraction. Manandhar et al. [9] introduced volunteered geographic information to assist road network extraction.
Traditional road extraction methods have achieved satisfactory results. However, because they require manually designed road feature extraction algorithms and manually adjusted threshold parameters, they are not suitable for road extraction from large-scale and complex remote sensing data. To address these problems, some scholars used machine learning algorithms to quickly extract road networks from remote sensing images. Mokhtarzade et al. [10] used BP neural networks with different numbers of iterations and different hidden layer sizes for road extraction. M. Song et al. [11] used the support vector machine method for road extraction. Maurya et al. [12] first removed non-road areas based on morphological features and then extracted road areas by K-means clustering. Seppke et al. [13] proposed a parallel super-pixel-based road tracking method combining geometric and topological representations. Within an adaptive mean shift analysis framework, Huang et al. [14] used mean shift to obtain an object-oriented representation of hyperspectral data and then used a support vector machine for feature interpretation.
Recently, deep learning has been widely used in remote sensing and has shown good performance in image classification [15][16][17] and segmentation tasks [18][19][20][21][22]. Wei et al. [23] proposed using a patch-based convolutional neural network for road extraction. Alshehhi et al. [24] further developed a neural network that can extract roads and buildings simultaneously. Zhong et al. [25] proposed using a fully convolutional neural network for road extraction. Based on the fully convolutional neural network, a series of encoder-decoder neural networks have been developed. Panboonyuen et al. [26] used the SegNet structure as the benchmark network for road segmentation and used deconvolution to replace the residual up-sampling of the fully convolutional neural network. Zhang et al. [27] combined the ability of the U-Net structure to extract scale information with the ability of the residual structure to connect different feature maps. To make the network easier to train, Xu et al. [28] combined the U-Net structure with a densely connected network. In 2017, LinkNet was proposed by Chaurasia et al. [29]. Zhou et al. [30] used the LinkNet structure with dilated convolution to segment roads, and their method was the best solution in DeepGlobe-2018. Subsequently, a variety of improved methods based on the U-Net structure have been proposed. He et al. [31] added a pyramid pooling structure to the U-Net structure and used a structural similarity loss function to constrain the network. Wulamu et al. [32] extracted features for roads at different scales through different convolutional layers.
In addition, some scholars focused on the impact of the loss function on road extraction accuracy. He et al. [31] used a structural similarity loss function to constrain the network training process, which enhanced the detail of the road extraction results and avoided over-smoothing them. Wei et al. [23] used road structure information to adjust the cross-entropy loss function. Mosinska et al. [33] proposed a topology-aware loss function to address abnormal road network topology in road extraction results, which effectively reduced topological errors. These loss functions only strengthened the use of road network structural information and did not fully consider the structural differences among road training samples [34].
In recent years, as a way to recalibrate feature maps, the attention mechanism has been widely used in the field of computer vision [35][36][37][38][39][40]. The attention mechanism can give a higher weight to discriminative features and reduce the weight of unnecessary information, effectively improving recognition accuracy.
Therefore, we propose a Road Extraction convolutional neural Network with an embedded Attention mechanism (RENA) for remote sensing images. In the proposed network, we embed a spatial attention mechanism and a channel attention mechanism into the U-Net. The spatial attention mechanism is used to improve the spatial detail of the road extraction results, and the channel attention mechanism is used to recalibrate the spectral features of remote sensing images. Residual densely connected blocks [41] are also embedded in this U-Net framework to exploit their advantages in information flow transmission. In addition, we introduce a residual dilated convolution module to enhance the ability to extract multiscale information.
The main contributions of this study are summarized as follows:
1. This study uses the U-Net structure to achieve end-to-end road network extraction from remote sensing images.
2. This study designs an attention module that combines spatial attention and channel attention to enhance spatial detail information and promote the use of spectral features.
3. Based on the U-Net structure, this study embeds residual dense connection blocks to achieve information flow transfer and feature reuse, and uses a residual dilated convolution module to achieve multiscale information extraction.
The rest of the paper is organized as follows. Section 2 presents the details of the proposed method. Section 3 describes the experimental configuration and the experimental results, including the ablation experiments. Section 4 concludes the paper.

U-Net Framework with Embedded Attention Mechanism
The U-Net framework is an encoder-decoder architecture that is widely used in semantic segmentation [42][43][44][45] and classification tasks [46] for remote sensing images.
The U-Net framework consists of two parts, namely the encoder and the decoder. The encoder performs pixel-by-pixel semantic feature extraction on remote sensing images and extracts the context information of the image. The decoder decodes the feature maps and locates the region of interest (ROI) in the image to finally obtain the semantic segmentation results. To effectively transfer the information in low-level feature maps to high-level feature maps, the feature maps extracted by the encoder are combined with those extracted by the decoder through skip connections, which achieves an effective fusion of low-level and high-level feature maps.
The attention mechanism is mainly used to recalibrate the weights of the feature maps. At present, three main categories of attention mechanisms are used in the field of remote sensing: spatial attention mechanisms, channel attention mechanisms, and temporal attention mechanisms. The spatial attention mechanism recalibrates the spatial information of remote sensing images. It compresses the feature maps to two dimensions and obtains a two-dimensional weight through a normalization function, which guides the spatial recalibration of the feature maps so that high-frequency information receives a higher weight. The channel attention mechanism recalibrates the weights along the channel direction of the image. It obtains a global feature for each channel through a global pooling operation and uses it as a weight to recalibrate the feature maps along the channel direction, so that discriminative features receive higher weights. In this paper, we extract the global mean weight and the global maximum weight of the image through mean pooling and max pooling, respectively. The temporal attention mechanism assigns weights to temporal information and is mainly used to process multi-temporal remote sensing data. Because multi-temporal data are not used in this paper, only the spatial attention mechanism and the channel attention mechanism are used.
In this paper, the U-Net framework is used as the backbone. An RGB three-channel remote sensing image is the input to the network, and a binary road network map is obtained through end-to-end processing; the network output has two classes, road and non-road. The U-Net framework includes four encoder layers and four decoder layers. The four encoder layers extract image features at different scales, the four decoder layers restore image details, and skip connections realize information flow transmission and the reuse of feature maps at different levels. The spatial attention block and the channel attention block are combined in a cascaded form and jointly embedded into the U-Net framework, which enhances the utilization of spectral information while improving spatial details. Residual dense connection blocks are embedded in each encoder layer and decoder layer to connect feature information at different levels and achieve efficient transmission of information flow. In addition, the residual dilated convolution module is added at the last encoder layer to further expand the receptive field. The flowchart of the proposed method is shown in Figure 1.
Remote Sens. 2022, 14, x FOR PEER REVIEW

U-Net Framework
In the proposed U-Net road extraction framework, we use four encoder layers and four decoder layers, and add the feature maps obtained by the corresponding encoder layers and decoder layers through skip connections.
For the four encoder layers, we adopt the pre-trained ResNet34 model to speed up model convergence. Each encoder layer contains multiple residual blocks (ResBlocks). Each ResBlock consists of two convolutional layers, two batch normalization layers, and a rectified linear unit (ReLU). The convolutional layers extract feature information from the remote sensing image. The batch normalization layers normalize the data distribution to speed up the convergence of network training, avoid vanishing or exploding gradients, and at the same time reduce overfitting of the trained model. The ReLU applies a nonlinearity to the data to enhance the model's ability to fit nonlinear relationships. In the second to fourth encoder layers, the first ResBlock also includes a down-sampling step, which down-samples the feature maps and increases the number of feature map channels. The U-Net framework also contains four decoder layers. Each decoder layer has roughly the same structure as an encoder layer and includes a transposed convolution layer, three batch normalization layers, and three ReLU activation functions. The transposed convolution is an up-sampling operation on the feature maps with learnable parameters.
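The decoder layer described above (one transposed convolution, three batch normalization layers, and three ReLU activations) can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the class name `DecoderBlock`, the two 1 × 1 convolutions around the transposed convolution, the channel reduction factor of 4, and the kernel sizes are illustrative choices that the text does not specify.

```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """Hypothetical decoder layer: 1x1 conv -> transposed conv (x2) -> 1x1 conv."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        mid_channels = in_channels // 4  # assumed bottleneck width
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            # Learnable up-sampling: doubles the height and width.
            nn.ConvTranspose2d(mid_channels, mid_channels, kernel_size=3,
                               stride=2, padding=1, output_padding=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)


decoder = DecoderBlock(512, 256)
deep = torch.randn(2, 512, 16, 16)  # stand-in for the deepest encoder output
skip = torch.randn(2, 256, 32, 32)  # stand-in for the matching encoder feature map
fused = decoder(deep) + skip        # skip connection by element-wise addition
print(fused.shape)                  # torch.Size([2, 256, 32, 32])
```

The element-wise addition mirrors the statement that encoder and decoder feature maps are added through skip connections; it requires the decoder output and the encoder feature map to have the same shape.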

Channel and Spatial Attention Module (CSAM)
In the U-Net framework, after processing by an encoder layer, the feature maps become smaller and the number of channels increases. Conversely, after processing by a decoder layer, the number of channels decreases, the feature maps become larger, and the spatial information increases. Therefore, to maintain the channel information and spatial information of the encoder and decoder layers, we embed the channel and spatial attention module (Figure 2) after each of them.
The channel attention block (CAB) includes pooling layers, fully connected layers, and convolutional layers. Maximum value pooling and mean value pooling extract the maximum value feature and the mean value feature from the feature maps, respectively. Each of the two features is then converted into a one-dimensional vector by a fully connected layer and normalized by the sigmoid function to obtain the maximum channel direction weight and the mean channel direction weight. Subsequently, element-wise multiplication of each channel weight with the input feature maps yields two recalibrated feature maps. Finally, the two recalibrated feature maps are concatenated, and convolutional layers reduce the dimensionality of the result.
The channel attention block can be expressed in the following form:
$$ fm_{ca} = w_{ca} \bullet \mathrm{cat}\left(\xi\left(p_{avg}(fm_{input})\right) \otimes fm_{input},\ \xi\left(p_{max}(fm_{input})\right) \otimes fm_{input}\right) + b_{ca} \tag{1} $$
where $fm_{ca}$ represents the feature maps processed by the channel attention block, $fm_{input}$ represents the input feature maps, $w_{ca}$ represents the convolution kernel of the channel attention block, $b_{ca}$ represents the bias term of the channel attention block, $\xi(\cdot)$ represents the sigmoid function, $\bullet$ represents the convolution operator, $\otimes$ represents the element-wise multiplication operation, $p_{avg}(\cdot)$ represents the mean value pooling operator, $p_{max}(\cdot)$ represents the maximum value pooling operator, and $\mathrm{cat}(\cdot)$ represents the map cascade operation.
The spatial attention block (SAB) includes convolutional layers and 1 × 1 convolutional layers. Features are first extracted by a convolutional layer and then reduced to a single channel by a 1 × 1 convolutional layer; the single-channel feature map is normalized by the sigmoid function to obtain the spatial weight. Then, element-wise multiplication of the spatial weight with the input feature maps yields the recalibrated feature maps. Finally, a convolutional layer is applied to the recalibrated feature maps.
The spatial attention block can be expressed in the following form:
$$ fm_{sa} = w_{sa} \bullet \left(\xi(fm_{input}) \otimes fm_{input}\right) + b_{sa} \tag{2} $$
where $fm_{sa}$ represents the feature maps processed by the spatial attention block, $fm_{input}$ represents the input feature maps, $w_{sa}$ represents the convolution kernel of the spatial attention block, and $b_{sa}$ represents the bias term of the spatial attention block.
After processing by the channel attention block and the spatial attention block, the feature maps obtained by the two blocks are fused in a cascaded manner, and 1 × 1 convolutional layers are used for dimensionality reduction to obtain the final feature map.
The fusion process of channel attention and spatial attention can be expressed as:
$$ fm_{fuse} = \delta\left(w_{fuse} \bullet \mathrm{cat}(fm_{ca}, fm_{sa}) + b_{fuse}\right) \tag{3} $$
where $fm_{fuse}$ represents the fused feature maps, $w_{fuse}$ represents the convolution kernel of the fusion layer, $b_{fuse}$ represents the bias term of the fusion layer, and $\delta(\cdot)$ represents the ReLU function.
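The channel attention, spatial attention, and fusion steps described in this section can be sketched in PyTorch as follows. This is a hedged illustration, not the authors' code: the class names, the use of a single `nn.Linear` per pooled descriptor, and the 3 × 3 kernel sizes are assumptions, since the text only names the layer types.

```python
import torch
import torch.nn as nn


class ChannelAttentionBlock(nn.Module):
    """Pooled descriptors -> FC -> sigmoid -> reweight -> concatenate -> conv."""

    def __init__(self, channels: int):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)  # global mean value pooling
        self.max_pool = nn.AdaptiveMaxPool2d(1)  # global maximum value pooling
        # One fully connected layer per descriptor (layer sizes are assumed).
        self.fc_avg = nn.Linear(channels, channels)
        self.fc_max = nn.Linear(channels, channels)
        # 1x1 convolution reduces the concatenated maps back to `channels`.
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w_avg = torch.sigmoid(self.fc_avg(self.avg_pool(x).flatten(1))).view(b, c, 1, 1)
        w_max = torch.sigmoid(self.fc_max(self.max_pool(x).flatten(1))).view(b, c, 1, 1)
        # Channel weights broadcast over height and width.
        return self.reduce(torch.cat([w_avg * x, w_max * x], dim=1))


class SpatialAttentionBlock(nn.Module):
    """Conv -> 1x1 conv to one channel -> sigmoid -> reweight -> conv."""

    def __init__(self, channels: int):
        super().__init__()
        self.features = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.squeeze = nn.Conv2d(channels, 1, kernel_size=1)
        self.out = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = torch.sigmoid(self.squeeze(self.features(x)))  # shape (B, 1, H, W)
        return self.out(weight * x)  # spatial weight broadcasts over channels


class CSAM(nn.Module):
    """Concatenate both attention outputs, then 1x1 conv and ReLU."""

    def __init__(self, channels: int):
        super().__init__()
        self.cab = ChannelAttentionBlock(channels)
        self.sab = SpatialAttentionBlock(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.cab(x), self.sab(x)], dim=1)
        return torch.relu(self.fuse(fused))


features = torch.randn(2, 64, 32, 32)
csam = CSAM(64)
recalibrated = csam(features)
print(recalibrated.shape)  # torch.Size([2, 64, 32, 32])
```

Because the module preserves both the channel count and the spatial size, it can be placed after any encoder or decoder layer without changing the surrounding tensor shapes.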

Residual Densely Connected Blocks (RDCB)
To address the information flow transfer problem, Huang et al. proposed densely connected blocks (DB) [41]. Densely connected blocks have been widely used in computer vision because of their powerful information transfer and information extraction capabilities [41,47]. Multiple skip connections link the low-level feature maps with the high-level feature maps, so that, while the feedforward character of the layers is maintained, low-level feature information is effectively transferred to the high-level feature maps. In the proposed network, the feature maps become smaller after being processed by the encoder layer, and their spatial information is further compressed after being processed by the channel attention block. To make further use of the information in the input feature maps, we use residual densely connected blocks, in which a residual structure computes the residual between the input feature map and the DB-processed feature map. The residual densely connected block is shown in Figure 3.
In the proposed network, RDCB is used for feature extraction in order to preserve the spatial features of the feature maps. As shown in Figure 1, after CSAM recalibrates the features of the encoder layer and the decoder layer, we use RDCB to perform feature extraction on the recalibrated feature maps and extract information at different scales.
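A residual densely connected block of the kind described above can be sketched in PyTorch as follows. The sketch assumes a growth rate of 32, four dense layers, 3 × 3 convolutions, and a 1 × 1 fusion convolution; none of these values is given in the text, and the class name `ResidualDenseBlock` is illustrative.

```python
import torch
import torch.nn as nn


class ResidualDenseBlock(nn.Module):
    """Dense block wrapped in an identity shortcut: output = input + DB(input)."""

    def __init__(self, channels: int, growth: int = 32, num_layers: int = 4):
        super().__init__()
        # Layer i receives the block input plus the outputs of all earlier layers.
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, kernel_size=3, padding=1),
                nn.ReLU(),
            )
            for i in range(num_layers)
        ])
        # 1x1 convolution maps the concatenated features back to `channels`.
        self.fuse = nn.Conv2d(channels + num_layers * growth, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        collected = [x]
        for layer in self.layers:
            collected.append(layer(torch.cat(collected, dim=1)))
        dense_out = self.fuse(torch.cat(collected, dim=1))
        return x + dense_out  # residual connection with the block input


block = ResidualDenseBlock(64)
x = torch.randn(2, 64, 32, 32)
y = block(x)
print(y.shape)  # torch.Size([2, 64, 32, 32])
```

Concatenating every earlier output gives each layer direct access to low-level features, and the final addition keeps the input information available even if the dense path compresses it.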

Residual Dilated Convolution Module (RDCM)
Dilated convolutional layers are widely used to extract features at different scales by varying the receptive field. A larger receptive field is more favorable for extracting large objects, whereas a small receptive field can extract the feature information of small objects. Dilated convolution does not reduce the spatial resolution of the image, and the dilation rate controls how far the receptive field is expanded.
In this paper, after feature extraction by all of the encoder layers, we use a residual dilated convolution module to extract the feature information of targets at different scales, as shown in Figure 4. The module mainly includes four rectified linear units and four dilated convolution layers with dilation rates of 1, 2, 4, and 8, which extract feature information under four receptive fields. In this module, the residual structure performs element-wise addition on the four kinds of extracted feature information, so as to fully integrate the feature information at different scales.
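To make the effect of the four dilation rates concrete, the following plain Python sketch computes the receptive fields they produce. It assumes 3 × 3 kernels with stride 1, which the text does not state, and it reports both the case where each layer is applied on its own and the case where the layers are stacked one after another.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Span covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


rates = [1, 2, 4, 8]
kernel_size = 3  # assumed; the kernel size is not given in the text

# Receptive field of each dilated layer applied directly to the input.
single = [effective_kernel(kernel_size, d) for d in rates]

# Receptive field after stacking the layers in sequence (stride 1):
# every layer adds (effective kernel - 1) pixels.
cumulative = []
field = 1
for d in rates:
    field += effective_kernel(kernel_size, d) - 1
    cumulative.append(field)

print(single)      # [3, 5, 9, 17]
print(cumulative)  # [3, 7, 15, 31]
```

In both cases the receptive field grows without any down-sampling, which is why dilated convolution can enlarge the context while keeping the spatial resolution of the feature maps.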
The residual dilated convolution module can be defined as:
$$ fm_{rdcm1} = f_{rdcm}(fm_{input}),\quad dilation = 1 \tag{4} $$
$$ fm_{rdcm2} = f_{rdcm}(fm_{rdcm1}),\quad dilation = 2 \tag{5} $$
$$ fm_{rdcm3} = f_{rdcm}(fm_{rdcm2}),\quad dilation = 4 \tag{6} $$
$$ fm_{rdcm4} = f_{rdcm}(fm_{rdcm3}),\quad dilation = 8 \tag{7} $$
$$ fm_{rdcm} = fm_{input} + fm_{rdcm1} + fm_{rdcm2} + fm_{rdcm3} + fm_{rdcm4} \tag{8} $$
where $fm_{input}$ represents the initial feature maps output by the encoder, $f_{rdcm}(\cdot)$ represents the dilated convolution layer, and $dilation = 1, 2, 4, 8$ denotes dilation rates of 1, 2, 4, and 8, respectively. $fm_{rdcm1}$, $fm_{rdcm2}$, $fm_{rdcm3}$, and $fm_{rdcm4}$ represent the output feature maps at the different dilation rates, and $fm_{rdcm}$ represents the final output of the residual dilated convolution module.
Figure 5 shows schematic diagrams of dilated convolution layers with dilation rates of 1 and 2, respectively.
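A residual dilated convolution module with dilation rates of 1, 2, 4, and 8 might be implemented in PyTorch along the following lines. This is a sketch under explicit assumptions: the four layers are applied in cascade, the module input is added to the sum through an identity shortcut, and 3 × 3 kernels are used. The text fixes only the dilation rates, the four ReLU units, and the element-wise addition, so the class name and these wiring choices are illustrative.

```python
import torch
import torch.nn as nn


class ResidualDilatedConvModule(nn.Module):
    """Four dilated 3x3 convolutions whose outputs are summed with the input."""

    def __init__(self, channels: int, rates=(1, 2, 4, 8)):
        super().__init__()
        # padding == dilation keeps the spatial size unchanged for 3x3 kernels.
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r),
                nn.ReLU(),
            )
            for r in rates
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        total = x    # identity shortcut (assumed)
        current = x
        for layer in self.layers:
            current = layer(current)  # cascade: each layer feeds the next (assumed)
            total = total + current   # element-wise addition of every scale
        return total


rdcm = ResidualDilatedConvModule(512)
bottleneck = torch.randn(2, 512, 32, 32)  # stand-in for the last encoder output
out = rdcm(bottleneck)
print(out.shape)  # torch.Size([2, 512, 32, 32])
```

Setting the padding equal to the dilation rate is what lets the four outputs be added element by element, since every branch then has the same height and width as the input.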

Experimental Dataset Information
To verify the effectiveness of the proposed method, we conduct comparative experiments using the DeepGlobe Road Extraction dataset, which was used in the CVPR DeepGlobe 2018 road extraction challenge. The dataset includes 6226 training images, 1243 validation images, and 1101 testing images. Each image is 1024 × 1024 pixels, and the image resolution is 0.5 m.

Road Extraction Network Training Configuration Information
During network training, two loss functions are used as constraints: binary cross entropy loss and dice coefficient loss.
The parameter configuration of the main modules of the proposed network is listed in Table 1. The PyTorch framework is used for network training, with Adam as the optimizer. The learning rate is 1 × 10⁻³, the batch size is 4, and the network is trained for 40 epochs. Training uses PyTorch v1.8 in a Windows environment, with an AMD 5600X CPU at 4.5 GHz and an NVIDIA RTX 3090 GPU with 24 GB of memory. The training process took 40 h.
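The training configuration above (binary cross entropy plus dice coefficient loss, Adam, learning rate 1 × 10⁻³, batch size 4) can be illustrated with a single optimization step in PyTorch. This is a minimal sketch: the equal weighting of the two losses, the smoothing constant, the function names, and the one-layer stand-in model are assumptions made only so that the example runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def dice_loss(probs: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """1 - dice coefficient, averaged over the batch (eps is an assumed smoothing term)."""
    intersection = (probs * target).sum(dim=(1, 2, 3))
    total = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1 - (2 * intersection + eps) / (total + eps)).mean()


def bce_dice_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Sum of binary cross entropy and dice loss (equal weighting is assumed)."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    return bce + dice_loss(torch.sigmoid(logits), target)


# Stand-in for the road extraction network: RGB input, one-channel road logits.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(4, 3, 64, 64)                 # batch size 4; small crops for speed
masks = (torch.rand(4, 1, 64, 64) > 0.9).float()   # synthetic binary road masks

loss = bce_dice_loss(model(images), masks)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```

Binary cross entropy penalizes each pixel independently, while the dice term scores the overlap of the whole road mask, which helps when road pixels are a small fraction of the image.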

Comparison and Quantitative Evaluation Metrics
To better verify the effect of the proposed method, we selected three mainstream road network extraction algorithms for remote sensing images for comparison: U-Net [30], LinkNet34 [29], and D-LinkNet [30], the last of which was the best solution in DeepGlobe-2018 and a state-of-the-art method. All comparison algorithms were retrained with the same data and the same number of training epochs as in this study.
In the quantitative evaluation experiments, five mainstream metrics are used to evaluate the performance of the road extraction algorithms: accuracy, precision, recall, F1 score, and IoU (Intersection over Union) [48].
Accuracy is the proportion of correctly predicted samples among all samples. It is an overall performance indicator, defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$

where TP (true positive) is the number of positive samples predicted as positive, FP (false positive) is the number of negative samples predicted as positive, TN (true negative) is the number of negative samples predicted as negative, and FN (false negative) is the number of positive samples predicted as negative.

Precision is the proportion of correctly predicted positive samples among all samples predicted as positive, defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{10}$$

Recall is the proportion of samples predicted as positive among the samples that are actually positive, defined as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{11}$$

Accuracy, precision, and recall all range from 0 to 1, and the higher the value, the better the performance of the binary classification model.

The F1 score is a statistical indicator of the performance of a binary classification model. It is the harmonic mean of precision and recall and is used to balance the two. It ranges from 0 to 1, and the higher the value, the better the performance of the binary classification model. The F1 score is defined as follows:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{12}$$

IoU is a standard measure of how accurately the corresponding objects are detected in a specific dataset. It measures the overlap between the true value and the predicted value: the greater the overlap, the higher the IoU value. The IoU is defined as follows:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{13}$$
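The five metrics defined above can be computed directly from the four confusion counts. The following is a minimal sketch rather than the authors' evaluation code; the function name `road_metrics`, the zero-denominator convention, and the example counts are illustrative assumptions.

```python
def road_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall, F1, and IoU from confusion counts."""

    def ratio(numerator, denominator):
        # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "iou": ratio(tp, tp + fp + fn),
    }


# Made-up counts for illustration only (not taken from the paper's tables).
scores = road_metrics(tp=80, fp=20, tn=880, fn=20)
```

With these made-up counts, accuracy is 0.96 while IoU is about 0.67, which illustrates why accuracy alone can look high when the background class dominates.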

Visual Evaluation Results
The effect of the proposed method was first verified by visual evaluation. According to the density of the road network, the evaluation covered two cases, dense road networks and sparse road networks, to verify the extraction ability of the proposed method in densely built-up areas and open areas, respectively.

Visual Evaluation Results for Dense Road Network
Three samples were selected for analysis in the densely built-up area (Figures 6 to 8). In Figure 6, LinkNet34 and D-LinkNet effectively identified all roads, but both methods produced a large number of false detections, that is, non-road areas identified as road areas, and the roads they extracted were wider than ground truth. U-Net and the proposed method obtained results similar to ground truth with relatively few false detections, but could not identify areas occluded by trees. In the lower-left corner of Figure 6, the roads extracted by U-Net and the proposed method were broken; the remote sensing image shows that trees are present in this area. In addition, the U-Net extraction results contained multiple fine, falsely detected roads.
As shown in Figure 7, the three comparison algorithms had different degrees of false detection in the upper-left corner, whereas the proposed method had fewer false detections. In terms of missed detection, LinkNet34 and D-LinkNet missed relatively few roads, while U-Net and the proposed method missed relatively more.
In Figure 8, D-LinkNet had many false detections in the large-scale road network in the upper part of the figure, and some buildings were identified as roads. The LinkNet34 method identified roads more completely, but many false detections remained, and the identified roads were wider than ground truth. The proposed method had few false detections, although some missed detections existed, and it was closer to ground truth in terms of road shape and road width.

Visual Evaluation Results for Sparse Road Network
Another three samples in the open area were selected for analysis. As shown in Figure 9, the U-Net, LinkNet34, and RENA methods could extract roads effectively, while D-LinkNet could not extract complete roads. In addition, the roads extracted using the U-Net and RENA methods were closer to ground truth in morphology, while LinkNet34 had false detections. As shown in Figure 10, compared with ground truth, U-Net, LinkNet34, and D-LinkNet had different degrees of false detection, while RENA had no false detection. As shown in Figure 11, the remote sensing image showed that there were two parallel roads. The U-Net and RENA methods could effectively extract the road and divide it into two roads, while LinkNet34 and D-LinkNet recognized the two parallel roads as a single road. In addition, LinkNet34 and D-LinkNet had false detections in the upper left corner.
From the six sets of visual experiments in densely built-up areas and open areas, it was seen that LinkNet34 and D-LinkNet could completely extract the road network, but many false detections were present, and the shape of the extracted road was different from that of ground truth. U-Net and RENA had fewer false detections, but cases of missed detection were present, and U-Net had some fine false detection roads. In addition, the roads extracted by U-Net and RENA were closer to ground truth in morphology, and the road width was also closer to ground truth.

Quantitative Evaluation Results
This study used 101 images with a size of 1024 × 1024 for the quantitative road extraction experiments, and the quantitative evaluation indicators were calculated against the ground-truth labels. The results are shown in Table 2. The RENA method achieved the highest score in four out of five metrics, indicating that the proposed method outperformed the compared algorithms in road extraction accuracy. In the recall indicator, D-LinkNet obtained the highest score, indicating that it had fewer missed detections, which was consistent with the conclusion of the visual evaluation.
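The text does not state how the results on the individual test images were combined into the reported values. As an illustration only, the sketch below contrasts two common conventions, pooling the confusion counts over all images versus averaging per-image scores, using IoU as the example metric. Both conventions, the function names, and the tiny 2 × 2 masks are assumptions and are not taken from the paper.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, TN, FN for two equally sized binary masks (1 = road)."""
    tp = fp = tn = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif not p and t:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def iou(tp, fp, fn):
    """Intersection over Union; reported as 0.0 when undefined (assumption)."""
    denominator = tp + fp + fn
    return tp / denominator if denominator else 0.0


def pooled_and_mean_iou(pairs):
    """Return (IoU of summed counts, mean of per-image IoUs) for (pred, truth) pairs."""
    total_tp = total_fp = total_fn = 0
    per_image = []
    for pred, truth in pairs:
        tp, fp, _, fn = confusion_counts(pred, truth)
        total_tp += tp
        total_fp += fp
        total_fn += fn
        per_image.append(iou(tp, fp, fn))
    return iou(total_tp, total_fp, total_fn), sum(per_image) / len(per_image)


# Two made-up 2 x 2 masks standing in for the 1024 x 1024 test images.
image_a = ([[1, 1], [0, 0]], [[1, 1], [0, 0]])  # perfect prediction
image_b = ([[1, 1], [1, 0]], [[1, 0], [0, 0]])  # two false detections
pooled, mean = pooled_and_mean_iou([image_a, image_b])
```

The two conventions give different numbers on the same predictions (0.6 pooled versus about 0.67 averaged here), so the aggregation rule should be fixed before comparing methods.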
In the precision indicator, U-Net and RENA scored higher than LinkNet34 and D-LinkNet, indicating that U-Net and RENA had fewer false detections. In the recall indicator, LinkNet34 and D-LinkNet scored higher than U-Net and RENA, indicating that LinkNet34 and D-LinkNet had fewer missed detections. Because precision and recall influence each other, it is difficult to achieve high precision and high recall at the same time. Therefore, the methods were compared in terms of the overall indicators: accuracy, F1-score, and IoU. The proposed method scored higher than the compared algorithms on all three overall indicators, indicating that it performed better than the compared algorithms.
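As a side note on the overall indicators, a short derivation shows how F1 and IoU are related when both are computed from the same confusion counts. It follows directly from the standard definitions of precision, recall, F1, and IoU and is not a result reported by the authors. Substituting precision and recall into the F1 definition gives

$$F1 = \frac{2\,TP}{2\,TP + FP + FN}, \qquad \mathrm{IoU} = \frac{TP}{TP + FP + FN},$$

and therefore

$$\mathrm{IoU} = \frac{F1}{2 - F1}, \qquad F1 = \frac{2\,\mathrm{IoU}}{1 + \mathrm{IoU}}.$$

Because this mapping is monotonically increasing on the interval from 0 to 1, F1 and IoU computed from the same counts rank methods in the same order; for example, F1 = 0.8 corresponds to IoU = 0.8 / 1.2 ≈ 0.67. The two can still disagree if they are averaged per image before being compared.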

Discussion
Ablation experiments on the channel attention block, spatial attention block, residual densely connected block, and residual dilated convolution module were conducted to verify the actual effect of each proposed module in road extraction.
As shown in Table 3, the full RENA, with no module removed, achieved the highest score on four metrics and the second-best score on one metric. Apart from an increase in precision when the channel attention block was removed, removing any module reduced the scores to some degree, indicating that the modules proposed in this study had a positive effect on accuracy. In terms of recall, every variant with a module removed scored lower than the full RENA, indicating that the ablated variants produced more erroneous selections than RENA. Combined with Figure 12, it was seen that, apart from the original RENA results, the results of the ablated variants had varying degrees of false selection, which was in line with the conclusion indicated by the quantitative indicators.
Removing the residual dilated convolution module or the residual densely connected blocks caused larger drops in accuracy, F1-score, and IoU, showing that these two modules contributed more to the model in feature extraction. The spatial attention block and channel attention block were used as feature recalibration modules, and the model accuracy decreased to a lesser extent after their removal. These results showed that all the proposed modules positively affected the experimental results; the residual dilated convolution module and residual densely connected blocks contributed more to the network, and the attention modules contributed relatively less. In the overall indicators accuracy, F1-score, and IoU, the original RENA results were higher than the results with any module removed, indicating that the proposed modules could effectively enhance the performance of road extraction.
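The ablation design described above, the full model plus one variant per removed module, can be enumerated with a small configuration helper. This is a hypothetical sketch: the class `ModuleSwitches`, its field names, and the variant labels are illustrative and do not correspond to any published RENA code.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ModuleSwitches:
    """Hypothetical on/off switches for the four modules studied in the ablation."""

    channel_attention: bool = True
    spatial_attention: bool = True
    residual_dense_block: bool = True
    residual_dilated_conv: bool = True


def ablation_variants(full=ModuleSwitches()):
    """Yield (label, switches): the full model, then one variant per removed module."""
    yield "full", full
    for switch in fields(full):
        # Copy the full configuration with exactly one module turned off.
        yield f"without_{switch.name}", replace(full, **{switch.name: False})


variants = dict(ablation_variants())
```

Each of the five entries in `variants` would then be trained and evaluated under identical settings, so that any change in the metrics can be attributed to the single removed module.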

Conclusions
In this study, we used U-Net as the main framework to construct an end-to-end road extraction network with an embedded attention mechanism. The network takes remote sensing images as input and produces road and non-road binary images, from which the road network is extracted. In this network framework, the use of the spectral information of remote sensing images was strengthened through the channel attention mechanism, and the detailed information in the road extraction results was enhanced through the spatial attention mechanism. Residual densely connected blocks were introduced in the network to enhance the information flow transfer and feature reuse of feature maps at different levels. At the same time, we introduced a residual dilated convolution module to enhance the extraction ability for road networks of different scales. Visual and quantitative experiments showed that the proposed method had higher overall accuracy and a lower false detection rate than the comparison algorithms. Ablation experiments showed that the proposed modules could effectively improve the accuracy of road extraction.
However, the proposed method still had shortcomings in terms of missed detection, particularly when roads were occluded by objects such as vegetation. Future studies should employ hyperspectral imagery to make better use of spectral information and address these issues. In addition, future studies should introduce more input information, such as Open Street Map road information, to assist road extraction and improve its accuracy.

Figure 1. Schematic diagram of the proposed road extraction network.
$ca_{fm}$ represents the feature maps processed by the channel attention block, $input_{fm}$ represents the input feature maps, $ca_w$ represents the convolution kernel of the channel attention block, $ca_b$ represents the bias term of the channel attention block, and $\sigma(\cdot)$ represents the sigmoid function. The other two operators are the convolution operator and the element-wise multiplication operation.
feature maps processed by the spatial attention block, $input_{fm}$ represents the input feature maps, $sa_w$ represents the convolution kernel of the spatial attention block, $sa_b$ represents the bias term of the spatial attention block, $fuse_w$ represents the convolution kernel of the spatial attention block, and $fuse_b$ represents the bias term of the spatial attention block.
Figure 6. Visual evaluation results of the first group of dense road networks: (a) input; (b) result using U-Net; (c) result using LinkNet34; (d) result using D-LinkNet; (e) result using RENA; (f) ground truth.

Figure 7. Visual evaluation results of the second group of dense road networks.

Figure 8. Visual evaluation results of the third group of dense road networks.

Figure 9. Visual evaluation results of the first group of sparse road networks.

Figure 10. Visual evaluation results of the second group of sparse road networks.

Figure 11. Visual evaluation results of the third group of sparse road networks.

Figure 12. Visual evaluation results of the second group of sparse road networks.

Table 1. Parameter configuration of the proposed network main module.
Note: Bold font indicates column maximum.

Table 3. Quantitative evaluation results of ablation experiments.
Note: Bold font indicates column maximum.