1. Introduction
With the development of remote sensing and computer science, high-resolution remote-sensing images are extensively applied in disaster management, urban planning, and other fields [1,2,3]. Obtaining target information quickly and intelligently from high-resolution remote-sensing images is an urgent challenge for the remote-sensing community today. Road-based geographic information serves city planning [4], vehicle navigation [5], geographic information management [6,7], etc., and road extraction is one of the key tasks in target extraction from remote-sensing images.
Road segmentation of remote-sensing images is a very challenging task [8] that is essentially a pixel-classification problem: each pixel in the image is classified and recognized. Compared with general target segmentation, road segmentation is unique and complex. As important geographic information, roads are often affected by various factors [9], resulting in low segmentation accuracy. For example, (1) the narrowness and connectivity of roads mean that they occupy only a small proportion of the whole image; (2) long and narrow roads are often occluded by vegetation, buildings, and their shadows, making them more difficult to extract from high-resolution remote-sensing images; and (3) desert, bare soil, etc., have texture and spectral features similar to those of roads, which further increases the difficulty of extraction.
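The pixel-classification view described above can be illustrated with a minimal sketch. The probability map, the 0.5 threshold, and the helper names `classify_pixels` and `road_fraction` are illustrative assumptions, not values or functions from this study; the example only shows how a per-pixel decision produces a binary road mask and how small the road class can be in a patch.

```python
# Hypothetical per-pixel road probabilities for a 4x6 image patch (made-up values).
prob_map = [
    [0.1, 0.2, 0.9, 0.1, 0.0, 0.1],
    [0.0, 0.1, 0.8, 0.2, 0.1, 0.0],
    [0.1, 0.0, 0.7, 0.1, 0.2, 0.1],
    [0.2, 0.1, 0.9, 0.0, 0.1, 0.0],
]


def classify_pixels(probs, threshold=0.5):
    """Label each pixel 1 (road) or 0 (background) by thresholding (assumed rule)."""
    return [[1 if p >= threshold else 0 for p in row] for row in probs]


def road_fraction(mask):
    """Share of pixels labelled as road in a binary mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


mask = classify_pixels(prob_map)
fraction = road_fraction(mask)
print(mask)
print(f"road pixels: {fraction:.1%}")
```

In this toy patch a one-pixel-wide road covers 4 of 24 pixels, which hints at the class imbalance that narrow roads introduce.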
Among the traditional road-extraction methods, Sun et al. [10] proposed a high-resolution remote-sensing-image road-extraction method based on the fast-progress and mean-shift methods. Road nodes are provided as input, the mean-shift method performs an initial segmentation, and the road between the given nodes is then extracted by the fast-progress method. Anil et al. [11] proposed a method based on the active-contour model. This method uses a median filter to pre-process the image, takes initial seed points as input, and then uses the active-contour model to extract the road. Chen et al. [12] established the global features of the road by automatically merging the road vector and road skeleton, and then extracted the local features of the image under the global feature constraints. Traditional methods thus mostly start from the morphological structure of the road; as the resolution of remote-sensing images improves, however, increasingly complex scenes reduce the effectiveness of such methods.
With the rapid development of deep learning, an increasing number of segmentation networks have been proposed in recent years. As the first end-to-end learning networks, fully convolutional networks (FCNs) [13] use convolution, up-sampling, and skip structures to achieve pixel-level classification. However, their segmentation performance under complex conditions is poor because of the limited receptive field. Later, multi-scale [14] context semantic fusion modules were proposed, such as the Spatial Pyramid Pooling (SPP) module of the Pyramid Scene Parsing Network (PSPNet) [15] and the Atrous Spatial Pyramid Pooling (ASPP) module of DeepLabv3 [16], which make fuller use of context information [16]. Compared with traditional methods, neural networks can automatically extract multiple features other than color, such as textures, shapes, and lines. With this ability to automatically extract high-dimensional features, neural networks have been widely used in image-related fields, such as image classification, scene recognition, target detection, and semantic segmentation. Several scholars have also applied them to the field of remote sensing. Hong et al. provided a baseline solution for remote-sensing image classification tasks using multimodal data by developing a multimodal deep learning framework [17]. Hong et al. proposed a mini-batch graph convolutional network (miniGCN) that enables the combination of CNN and GCN for hyperspectral image classification [18]. Wang et al. proposed a new tensor low-rank and sparse representation method for hyperspectral anomaly detection [19]. Zhu et al. effectively extracted and fused global and local environmental information through an attention-enhanced multi-path network. The network uses multiple parallel paths to learn multi-scale spatial features and attention modules to learn channel features for accurate extraction of building footprints and precise boundaries [20].
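The multi-scale pooling idea behind pyramid-style context modules can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the PSPNet or DeepLabv3 implementation: the helper `adaptive_avg_pool`, the 4x4 single-channel feature map, and the bin sizes (1 and 2) are all hypothetical, and the convolution, up-sampling, and concatenation steps of a full module are omitted.

```python
def adaptive_avg_pool(feature, bins):
    """Average a 2D feature map into a bins x bins grid of region means."""
    height, width = len(feature), len(feature[0])
    pooled = []
    for i in range(bins):
        # Row range of bin i: floor for the start, ceiling for the end.
        r0 = (i * height) // bins
        r1 = ((i + 1) * height + bins - 1) // bins
        row = []
        for j in range(bins):
            c0 = (j * width) // bins
            c1 = ((j + 1) * width + bins - 1) // bins
            region = [feature[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(region) / len(region))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 single-channel feature map with values 0..15.
feature = [[4 * r + c for c in range(4)] for r in range(4)]

# Pool at two assumed scales: one global bin and a 2x2 grid.
pyramid = {bins: adaptive_avg_pool(feature, bins) for bins in (1, 2)}
print(pyramid[1])  # [[7.5]]
print(pyramid[2])  # [[2.5, 4.5], [10.5, 12.5]]
```

The single-bin level summarizes the whole map, while finer grids keep coarse spatial layout; combining several such levels is what gives a pixel access to context at more than one scale.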
Some researchers have applied neural networks to road extraction. For example, Chen et al. reviewed methods for automatic road extraction from remote-sensing data and proposed a tree structure to analyze the progress of road-extraction methods from different aspects [21]. Tamara et al. [22] proposed a road-segmentation model that combined a residual network with a U-Net network and used the residual structure to deepen the network and extract features with strong semantic information. Zhang et al. [23] defined a DCGAN with specific conditions and achieved road segmentation by continuously optimizing the relationship between the generator and the discriminator. Zhou et al. [24] proposed D-Linknet, which adds dilated convolutional layers to LinkNet [25], using dilated convolution to enlarge the receptive field while retaining spatial information, and fusing contextual information at multiple scales. Zhou et al. proposed a fusion network that combines remote-sensing images with location data, exploiting the location data for road-connectivity inference. A reinforced loss function is also proposed to control the accuracy of the predicted road output, which improves the accuracy of road extraction [26]. Yan et al. [27] proposed HsgNet based on global higher-order spatial information, modeled by bilinear pooling to obtain the feature distribution of weighted spatial information. Wan et al. [28] proposed a dual-attention road-extraction network and constructed a new attention module to extract road-related features in the spatial and channel dimensions, which can effectively alleviate the problem of discontinuous road extraction and maintain the integrity of roads. Li et al. [29] proposed a cascaded attention enhancement module that considers multi-scale spatial details of roads to extract boundary-refined roads from remote-sensing images. Liu et al. [30] proposed a road-extraction network based on channel and spatial attention (RSANet). Huo et al. [31] proposed a remote-sensing image road-extraction method with a completion UNet, which introduces multi-scale dense dilated convolution to capture road regions.
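How dilated convolution enlarges the receptive field without adding weights can be checked with simple arithmetic. The following is a minimal one-dimensional sketch; the dilation rates (1, 2, 4, 8), the kernel size of 3, and all function names are assumptions chosen for illustration and are not taken from this study or from any specific network configuration.

```python
def effective_kernel(kernel_size, dilation):
    """Span covered by one dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of stride-1 dilated convolutions applied in sequence."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


def dilated_conv1d(signal, weights, dilation):
    """Valid (no padding) 1D dilated convolution in cross-correlation form."""
    span = effective_kernel(len(weights), dilation)
    return [
        sum(w * signal[start + k * dilation] for k, w in enumerate(weights))
        for start in range(len(signal) - span + 1)
    ]


rates = [1, 2, 4, 8]  # assumed example rates
fields = [stacked_receptive_field(3, rates[:n]) for n in range(1, len(rates) + 1)]
print(fields)  # [3, 7, 15, 31]

# A 3-tap kernel with dilation 2 spans 5 samples but still has only 3 weights.
signal = list(range(10))
response = dilated_conv1d(signal, [1, 0, -1], dilation=2)
print(response)  # [-4, -4, -4, -4, -4, -4]
```

With stride 1, each additional layer adds (k - 1) times its dilation rate to the receptive field, so doubling the rate at each layer grows the field roughly exponentially while the spatial resolution is preserved.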
High-resolution remote-sensing images provide detailed road information but also contain significant noise [21]. In addition, the road structure is more complex. The global information of a road affects its structure and continuity, whereas the local information affects its details. Extracting and combining global and local information is therefore very important for road segmentation. Among encoder-decoder networks, U-Net [32], LinkNet, D-Linknet, and other networks use only simple convolution and pooling operations to extract features. The global information of the road is not fully taken into account, and no further attention is paid to the dependence between channels at the same level. Although HsgNet considers global information, it does not differ from the above networks in how global and local information are combined: all of them concatenate features through a skip structure. Compared with deep features, shallow features carry more location information, but their semantics are weaker and road features are less distinct [33]. The features directly supplied by the skip structure are therefore vague and ambiguous, which is not conducive to refining the details. This study therefore aims to efficiently extract and fuse global and local context information to reduce the interference of ambiguous features [34,35], ensure the representativeness and usefulness of road features, and improve the accuracy of target segmentation. To this end, a high-resolution remote-sensing road-extraction network (GMR-Net) is proposed. The specific contributions of this study are as follows.
A new segmentation network for road extraction, called GMR-Net, is proposed, in which the encoding part uses the GC block attention module to enhance the focus on global information, and the decoding part filters out useless features by gating units to refine the segmentation details.
To verify the accuracy and generalization ability of the model, experiments were conducted on the DeepGlobe Road Extraction dataset [36] and Massachusetts Roads dataset [37]. Experimental results show that, compared with D-Linknet, U-Net, RSANet, and PSPNet, the method proposed in this study achieves the expected results and shows better performance.
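The global-context attention mentioned in the first contribution can be illustrated with a simplified sketch. It follows the general pattern of a global context block (attention-pool all positions into one context vector, then add that vector to every position), but it is not the GMR-Net implementation: the bottleneck transform that a full GC block applies to the context vector is omitted, and `key_weights`, the toy feature values, and the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def global_context(features, key_weights):
    """Attention-pool all positions into one context vector.

    features: list of N positions, each a list of C channel values.
    key_weights: C weights standing in for a 1x1 convolution (assumed).
    """
    logits = [sum(w * x for w, x in zip(key_weights, pos)) for pos in features]
    attention = softmax(logits)  # one weight per position, summing to 1
    channels = len(features[0])
    return [
        sum(a * pos[c] for a, pos in zip(attention, features))
        for c in range(channels)
    ]


def add_context(features, context):
    """Broadcast-add the shared context vector to every position."""
    return [[x + g for x, g in zip(pos, context)] for pos in features]


# Three positions with two channels each (made-up values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = global_context(features, key_weights=[0.5, -0.5])
enhanced = add_context(features, context)
print(context)
print(enhanced)
```

Because the same context vector is added at every position, each pixel's feature receives a summary of the whole image, which is the sense in which such a block strengthens the focus on global information.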
The rest of this article is organized as follows. In Section 2, the GMR-Net high-resolution remote-sensing road-extraction method is introduced. Experimental details and results are presented in Section 3 and discussion in Section 4. Conclusions are given in Section 5.
5. Conclusions
Roads are narrow, complex, and connected, which poses different problems for road extraction in different environments. It is particularly important to focus on the information in global and local contexts and to remove interference from other features. Therefore, a neural network for remote-sensing road extraction based on the fusion of local and global information is proposed in this study. The proposed network, GMR-Net, consists of three parts. The first part is the GC-resnet, in which a GCblock is used while extracting deep features to realize global context modeling and capture the relationship between channels. The second part is the MDC, through which the context information of different regions can be aggregated. The third part efficiently combines the global and local information through GR, filters the ambiguous features from the encoding stage, and gradually refines the segmentation details. The model is tested on the DeepGlobe Roads and Massachusetts Roads datasets separately, and its extraction results are compared with those of U-Net, PSPNet, RSANet, and D-Linknet. It is found that GMR-Net can effectively extract road features, ensure the continuity and integrity of road extraction, and show good generalization ability. Although the proposed method improves the accuracy of road segmentation, road extraction is slow and the network is not computationally lightweight. Therefore, there is still room for improvement in the proposed method. Ensuring high-precision road extraction while accelerating segmentation is worth further investigation.
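The idea of filtering ambiguous encoder features with a gate before fusing them with decoder features can be sketched generically. This is a plain sigmoid gate written under stated assumptions and is not the GR unit of GMR-Net, whose exact form is not given in this section; `gate_weight`, `gate_bias`, the function names, and the feature values are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(shallow, deep, gate_weight=4.0, gate_bias=-2.0):
    """Scale each shallow feature by a gate derived from the deep feature, then add.

    The gate parameters are assumed constants here; in a trained network
    they would be learned.
    """
    fused = []
    for s, d in zip(shallow, deep):
        gate = sigmoid(gate_weight * d + gate_bias)  # near 0 suppresses, near 1 passes
        fused.append(d + gate * s)
    return fused


shallow = [0.9, 0.8, 0.7]  # detailed but noisy encoder features (made up)
deep = [1.0, 0.0, 0.5]     # semantic decoder features (made up)
fused = gated_fusion(shallow, deep)
print(fused)
```

Where the deep feature gives little support (the second element), the shallow feature is largely suppressed instead of being passed on unchanged, which is the contrast with plain concatenation through a skip structure.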
Transformer structures have excellent global information modeling capability and are currently among the most competitive architectures in computer vision. In future work, we will design a Transformer-based model for road-extraction tasks and investigate the potential of Transformer structures for road extraction in remote-sensing images.