RoadFormer: Road Extraction Using a Swin Transformer Combined with a Spatial and Channel Separable Convolution

Abstract: The accurate detection and extraction of roads using remote sensing technology are crucial to the development of the transportation industry and intelligent perception tasks. Recently, in view of the advantages of CNNs in feature extraction, a series of CNN-based road extraction methods have been proposed. However, due to the limitation of kernel size, they are less effective at capturing long-range information and global context, which are crucial for road targets that are distributed over long distances and highly structured. To deal with this problem, a novel model named RoadFormer, with a Swin Transformer as the backbone, is developed in this paper. Firstly, to extract long-range information effectively, a Swin Transformer multi-scale encoder is adopted in our model. Secondly, to enhance the feature representation capability of the model, we design an innovative bottleneck module, in which a spatial and channel separable convolution is employed to obtain fine-grained and global features, and a dilated block is connected after the spatial convolution module to capture more integrated road structures. Finally, a lightweight decoder consisting of transposed convolution and skip connections generates the final extraction results. Extensive experimental results confirm the advantages of RoadFormer on the Deepglobe and Massachusetts datasets. The comparative visualization and quantitative results demonstrate that our model outperforms comparable methods.


Introduction
The extraction of roads from remote sensing images has long been a hot research topic owing to its essential role in applications including automatic driving, vehicle navigation, and road monitoring [1,2]. In the past decades, researchers have achieved good results on high-contrast images using traditional methods involving mathematical morphology and texture analysis [3][4][5]. However, these methods are usually limited by fixed parameters and have been proven to underperform when applied to low-contrast images [6][7][8].
From the machine learning perspective, road extraction can be regarded as a classification task with two categories (road and background), which is equivalent to a binary segmentation task. Considering the excellent performance of deep learning in recent years on computer vision tasks, researchers nowadays prefer deep learning methods for road extraction. Some recent works have explored CNN-based road extraction techniques [9][10][11][12][13], which outperform traditional methods by overcoming the shortcomings mentioned above. However, these works simplify road extraction to a semantic segmentation problem and ignore the inherent structure of the road. Road extraction is not a standard segmentation problem for two reasons. First, the resolution of remote sensing images is usually lower than that of images in general tasks, which means that road segmentation networks should have a large receptive field. Second, since the road areas in remote sensing images are often slender and complicated, the network is supposed to retain the fine-grained features of the image. CNN-based models are not effective enough to solve these problems because the receptive field is usually determined by the convolution kernel size. Current CNN-based models mainly use a 3 × 3 convolution kernel, which is far from satisfying the demands of road extraction tasks, while further increasing the kernel size raises the computational cost with little improvement. Moreover, pooling loses image details during downsampling. Therefore, a new structure is still needed for road extraction tasks.
Fortunately, the Vision Transformer (ViT) [14] shows that the transformer architecture has excellent potential to address the problems mentioned above. The attention mechanism enables the transformer to better build long-range dependence so that global information can be utilized at both deep and shallow layers [15]. An increasing number of transformer structures have been developed for different computer vision applications, especially the Swin Transformer [16], which has made important achievements in semantic segmentation tasks. Compared with CNN-based models, the Swin Transformer has stronger contextual semantic relevance and a wider receptive field, owing to its shifted windowing scheme and hierarchical architecture. Therefore, the motivation of our model is that introducing the transformer mechanism into the road extraction task may help to further improve segmentation.
Based on the above discussion, a new road segmentation network with a Swin Transformer as the backbone is proposed, named RoadFormer. Considering the distribution and morphological characteristics of roads, an innovative bottleneck is designed. The bottleneck generates spatial and channel features through a separable convolution, and a multi-scale dilated convolution module is deployed to capture more integrated road structures. The major contributions of this paper can be described as follows: (1) The proposed model is the first to apply the Swin Transformer as the backbone network for road extraction, achieving an effective perception of global and local road features.
(2) A bottleneck merging the spatial and channel separable convolution and dilated convolution is designed, which enables our model to capture the local details and global structures of roads more effectively.
The remaining parts of this paper are structured as follows. In Section 2, an overview of previous road extraction works is provided, and the differences between our method and the related methods are analyzed. In Section 3, the architecture and design of the proposed model are described in detail. In Section 4, implementation details of the experiments are presented, and comparative experiments are conducted and analyzed. Finally, conclusions are given in Section 5.

Related Works
In this section, the related road extraction works are reviewed. Then, the structure of the transformer is introduced, and its advantages in the road extraction task are analyzed.

Road Extraction Methods
Numerous approaches for extracting roads from remote sensing images have been presented in recent years, and they may be divided into two primary categories: traditional and deep learning-based methods [30]. Early traditional methods relied heavily on manually designed features or morphological features. Among these methods, an advanced directional morphological operator was presented to prevent the introduction of form biases and successfully retrieve road shape features [3]. In addition, linear features that resemble ribbons or ridges were extracted to categorize the road regions, which performs more robustly than previous methods [31]. However, these traditional methods usually lack robustness to incomplete structure, illumination, and contrast changes [6,7].
To solve the difficulties of the traditional methods mentioned above, deep learning-based approaches were employed for road extraction. As a representative of CNN-based methods, a patch-based CNN model was proposed for road extraction from high-resolution remote sensing data [32]. Later, RoadNet [13] was presented to extract the road surface, centerlines, and edges in several tasks. In order to preserve more spatial detail and enhance road integrity, a superpixel segmentation and graph convolutional network was recently developed [33]. The CNN-based methods above can achieve high accuracy, but their processing speed needs to be improved.
In order to address the shortcomings of CNNs, the fully convolutional network (FCN) substitutes the fully connected layer with deconvolution, which achieves end-to-end pixel-level classification. In early works, it was established that the FCN approach was successful in maintaining the continuity and integrity of roads for road extraction tasks [17]. Later, UFCN was suggested to extract roads from aerial images taken by UAVs [34]. Subsequently, FCN-32 was applied to extract roads in high-resolution images [35].
To comprehensively utilize multi-scale information from images, the U-Net series equipped with skip connection modules was developed [18,19,36]. SegNet [22] adopted the encoder-decoder structure, where the edge position can be restored in the decoder by the index values reserved in the encoder. Recently, to obtain better segmentation results, the DeepLab series methods [21,37,38] employed dilated convolution to capture long-range information and developed a pyramid-shaped pooling layer to retain the spatial structure.
Although FCN models improve the efficiency of road extraction, they often misclassify road areas and backgrounds in highly complex scenarios. Meanwhile, FCN-based models lose edge position information due to the existence of pooling layers. In addition, missing long-range information limits the segmentation accuracy of U-Net and SegNet. Additionally, dilated convolution makes DeepLab perform well on large targets but poorly on small ones. To solve the problems above, we introduce the transformer structure into our road segmentation task.

Transformer-Based Approaches
Lately, the transformer architecture [39] has become vibrant in the computer vision field in view of its special attention mechanism. The transformer's attention mechanism enables it to learn long-range features and model global information, in contrast to CNNs' emphasis on local features. The Vision Transformer (ViT) [14] accomplished satisfactory results in image classification and showed great potential in computer vision, where image patches are treated as the tokens of the transformer module. Although the design is feasible, there are still many apparent disadvantages [29]. The quadratic computational load imposed by transformers brings a considerable cost that is intolerable in segmentation tasks for large-size images. Furthermore, although the transformer can capture long-range information and global context, it struggles to capture the low-level information needed in segmentation [40].
To reduce the memory requirements of transformers, Liu et al. [16] conceived the Swin Transformer, which adopts a strategy of merging neighboring patches to build a hierarchical representation structure. With these hierarchical representations, the model can easily make dense predictions using a feature pyramid network. Meanwhile, the Swin Transformer computes self-attention in non-overlapping windows with only linear computational complexity. These advantages make it suitable as a segmentation backbone. In view of the global feature-capturing capability and lower computational complexity of the Swin Transformer, we introduce it as the encoder of our network.

Feature Separation
For the road extraction task, an obvious challenge is that the distribution of roads requires the model to have strong long-range information acquisition ability, while the slender and complex road characteristics require the model to have sufficient detail processing ability. Achieving both of these capabilities with general convolution operations is contradictory. According to Tao et al. [41], the spatial and channel features of roads exhibit apparent differences, and thus processing the features of different dimensions separately can improve the accuracy of segmentation. From the perspective of information representation, the channel features can reflect the image's local details, and spatial features can help the network capture long-range information. Therefore, for road extraction, it is necessary to distinguish spatial and channel properties.
In previous works, depth-wise (DW) separable convolution was intended to divide the conventional convolution into depth-wise and point-wise parts, effectively reducing the computational complexity [42]. Compared with traditional convolution, DW separable convolution has fewer parameters and a lower operational cost but still achieves almost the same results. Zhou et al. [28] used DW separable convolution combined with a graph convolution network (GCN) to achieve feature separation. Motivated by the previous work above, we replaced the original DW separable convolution series structure with a parallel structure to obtain the channel and spatial features.

Method
This section provides a detailed description of the proposed model's architecture. In Section 3.1, the overall design of RoadFormer is introduced. Then, the workflow of the encoder is described in Section 3.2, and the design of the bottleneck for road feature refining is presented in Section 3.3. Lastly, Section 3.4 provides the decoder and loss function.

RoadFormer Overall Design
We provide a road extraction model called RoadFormer to overcome the receptive field constraints and capture detailed information in remote sensing images. The architecture of RoadFormer is divided into three sections, as displayed in Figure 1:

(1) Swin Transformer-based encoder: the encoder downsamples and encodes the input RGB image into multi-scale high-dimensional feature maps, which are necessary inputs for the decoder and bottleneck.
(2) Feature separation bottleneck: the bottleneck separates the high-dimensional input feature maps into channel and spatial features. Meanwhile, a dilated block consisting of four dilated convolution layers is applied to the spatial features to expand the receptive field.
(3) Lightweight decoder: bottleneck-generated feature maps are alternately upsampled and merged with encoder-generated feature maps up to the top decoder block. Then, the segmentation result is obtained from the top decoder by using transposed convolution and a sigmoid.
In the subsequent sections, each network component will be described in detail.

Encoder
Without loss of generality, the distribution of roads should be continuous throughout the whole image, so the model is supposed to have a great capacity to collect long-range information. We adopted the Swin Transformer as the encoder of the proposed model because of its prowess in modeling long-range information relationships. Different from the original transformer, the Swin Transformer replaces the multi-head self-attention (MSA) module with a block made up of shifted window-based MSA, MLP, LayerNorm, and residual connections. The W-MSA and SW-MSA (MSA with regular and shifted windowing configurations) are applied continually and alternately in consecutive blocks. The structure of the Swin Transformer blocks is presented in Figure 2.

The encoder of the proposed model is composed of four stacked Swin Transformer modules. The original image (H × W × 3) is transported to the patch partition in the first layer and divided into patches (H/4 × W/4 × 48). Then, these patches are converted into tokens by a linear embedding layer. After that, the tokens are fed successively and alternately into Swin Transformer blocks and patch merging layers to create a hierarchical representation. To be specific, Swin Transformer blocks produce feature maps at the current layer scale, while patch merging layers downsample these maps. Notably, the output of each patch merging layer is simultaneously supplied by skip connection to the corresponding layer of the decoder and handled as the input of the next Swin Transformer block.
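As a minimal sketch of the patch partition and linear embedding step described above (assuming a PyTorch implementation; the embedding dimension of 96 follows the common Swin-T default and is an illustrative choice, not a value stated in this paper):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into 4x4 patches and map them to embedding tokens,
    mirroring the 'patch partition + linear embedding' step described above."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A strided convolution implements patch partition and linear
        # embedding in one step: each 4x4x3 patch (48 values) -> embed_dim.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                 # x: (B, 3, H, W)
        x = self.proj(x)                  # (B, C, H/4, W/4)
        x = x.flatten(2).transpose(1, 2)  # (B, H/4 * W/4, C) token sequence
        return self.norm(x)

x = torch.randn(1, 3, 1024, 1024)
print(PatchEmbed()(x).shape)  # torch.Size([1, 65536, 96])
```

The resulting token sequence is what the first Swin Transformer block consumes; subsequent patch merging layers halve the spatial resolution at each stage.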

Bottleneck
To obtain the spatial and channel features effectively, a parallel structure combining DW and PW convolution is developed in RoadFormer. The process of the separable convolution module is shown in Figure 3. Specifically, parallel connections between spatial convolution and channel convolution are made after the encoder. In the channel convolution part, a 1D convolution kernel is used to convolve the feature map along the channel direction. In the spatial convolution part, each feature map is convolved by a k × 1 × 1 kernel, and the results are concatenated as spatial feature maps. The feature maps refined by spatial convolution and channel convolution have a size of H × W × N, which is consistent with the input.

Previous works have proved that traditional convolution tends to have a finite receptive field, which does not perform well in segmentation tasks [20,43]. Fortunately, dilated convolution can effectively expand the receptive field while keeping the resolution of the feature maps. Referring to D-LinkNet [24], we set a cascade-and-parallel structure of dilated convolution after the spatial convolution module. The receptive fields of the layers will be 3, 7, 15, and 31 if the dilation rates are set to 1, 2, 4, and 8, as demonstrated in Figure 4. The Swin Transformer encoder downsamples the original input with one reduction of 1/4 and three reductions of 1/2. For an image with a size of 1024 × 1024, the output feature map size of the encoder will be 32 × 32. In this case, the receptive field of the dilated block can cover almost the entire feature map. Through this design, the architecture considerably enhances our model's capacity to capture long-range information.
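The following is a hedged sketch of the bottleneck's two components in PyTorch. Interpreting the spatial branch as a depthwise (per-feature-map) k × k convolution and the channel branch as a pointwise 1 × 1 convolution is our reading of the description above, not a confirmed implementation detail:

```python
import torch
import torch.nn as nn

class SeparableBottleneck(nn.Module):
    """Parallel spatial/channel feature separation (sketch).
    Both branches preserve the H x W x N input size, as in the text."""
    def __init__(self, channels, k=3):
        super().__init__()
        # Spatial branch: one k x k kernel per feature map (depthwise),
        # an assumption based on "each feature map is convolved ...".
        self.spatial = nn.Conv2d(channels, channels, kernel_size=k,
                                 padding=k // 2, groups=channels)
        # Channel branch: a pointwise 1x1 kernel along the channel axis.
        self.channel = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.spatial(x), self.channel(x)  # two parallel outputs

class DilatedBlock(nn.Module):
    """Cascade-and-parallel dilated block after D-LinkNet: stacked 3x3
    convolutions with dilation rates 1, 2, 4, 8 give receptive fields of
    3, 7, 15, 31, and every intermediate scale is summed into the output."""
    def __init__(self, channels):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 4, 8))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out, feat = x, x
        for conv in self.stages:
            feat = self.relu(conv(feat))  # cascade: widen the receptive field
            out = out + feat              # parallel: accumulate each scale
        return out
```

In RoadFormer, the dilated block follows the spatial branch only; as the ablation in Section 4 confirms, expanding the receptive field of the pointwise channel features is not beneficial.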

Decoder and Loss Function
To recover the segmentation details, a decoder is employed in RoadFormer. Symmetrically with the encoder, four decoder blocks and a final convolution layer are adopted to upsample the feature maps. Figure 5 depicts the decoder block's structural layout. Specifically, in each decoder block, the features are filtered by a 3 × 3 convolution layer first and upsampled by a transposed convolution layer subsequently. Then, the features are filtered by a 3 × 3 convolution layer again. After the convolution, the upsampled features are added to the encoder results at the corresponding scale. After going through four decoder blocks, one transposed convolution layer and two convolution layers with 3 × 3 kernels process the feature maps to the same size as the source image. Lastly, a sigmoid classifier is applied to extract road areas by mapping the output to a range of 0 to 1, where a threshold of 0.5 separates road areas from the background.
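A minimal sketch of one decoder block following this description (assuming PyTorch; the channel widths and ReLU activations are illustrative assumptions, as the paper's exact choices are not restated here):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block per the description above: 3x3 conv, transposed
    conv for 2x upsampling, another 3x3 conv, then addition of the
    encoder skip feature at the matching scale."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, in_ch, 3, padding=1)
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, skip):
        x = self.relu(self.conv1(x))  # filter at the current scale
        x = self.relu(self.up(x))     # double the spatial resolution
        x = self.relu(self.conv2(x))  # filter again after upsampling
        return x + skip               # merge with the encoder feature
```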
Binary cross entropy (BCE) loss and dice coefficient loss make up RoadFormer's loss function. The BCE loss, which is most frequently employed in binary segmentation tasks, is defined as follows:

$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[t_i \log(o_i) + (1 - t_i)\log(1 - o_i)\right]$$

where $o$ indicates the predicted results after the sigmoid, $t$ indicates the true label, and $N$ indicates the batch size. Road segmentation is a particular scenario where the foreground and background are severely imbalanced. Therefore, the loss function should adapt to unbalanced data distributions. Dice loss focuses more on the mining of foreground regions during training, and its supervised contribution to the network does not vary with the size of the image. Therefore, it is suitable for situations where the foreground accounts for a relatively small amount. The formulation of the dice loss is:

$$L_{Dice} = 1 - \frac{2\sum_{i} o_i t_i}{\sum_{i} o_i + \sum_{i} t_i}$$

To prevent a zero in the denominator, we added a smooth parameter $s$. The optimized $L_{Dice}$ can be described as follows:

$$L_{Dice} = 1 - \frac{2\sum_{i} o_i t_i + s}{\sum_{i} o_i + \sum_{i} t_i + s}$$

The smooth parameter avoids the zero division problem and prevents overfitting of the model. The total loss can be computed as:

$$L = \alpha L_{BCE} + \beta L_{Dice}$$

where $\alpha$ and $\beta$ denote the weights that balance the two loss functions. The loss function designed above conveys the feature information to the segmentation result most effectively, ensuring road extraction accuracy while retaining road connectivity.
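A sketch of the combined loss under the formulation above (assuming PyTorch; the weights $\alpha$, $\beta$ and the smooth value $s$ are illustrative defaults, since their exact values are not restated here):

```python
import torch
import torch.nn.functional as F

def roadformer_loss(pred, target, alpha=1.0, beta=1.0, s=1.0):
    """Weighted BCE + smoothed dice loss: L = alpha*L_BCE + beta*L_Dice.
    pred: sigmoid outputs in [0, 1]; target: binary road labels."""
    bce = F.binary_cross_entropy(pred, target)
    inter = (pred * target).sum()
    dice = 1.0 - (2.0 * inter + s) / (pred.sum() + target.sum() + s)
    return alpha * bce + beta * dice
```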

Experimental Results and Analysis
In this section, the dataset and model training details are introduced first. Subsequently, the evaluation metrics commonly used in road extraction tasks are presented. Next, the ablation experimental results are analyzed, which confirms the validity of our model design. Finally, visualization and quantitative results of our approach in comparison to other SOTA methods are shown.

Datasets and Experiment Implementation
Datasets: In this paper, the Deepglobe dataset and the Massachusetts road dataset are used for the experiments, as shown in Figure 6. The following is a detailed description of the two datasets:
1. Deepglobe dataset: Deepglobe is the dataset prepared for the 2018 Deepglobe road extraction challenge. It includes 6226 images with a resolution of 0.5 m and a size of 1024 × 1024 pixels. These RGB images in JPG format cover Thailand, India, and Indonesia, and include cement, asphalt, and mountain roads. Each annotation image is a three-channel binary image in PNG format, which uses (255, 255, 255) and (0, 0, 0) to represent roads and backgrounds, respectively. In our experiments, the dataset was split into a training set (4987 images) and a test set (1246 images).
2. Massachusetts dataset: The Massachusetts road dataset consists of 1108 images for training, 14 for validation, and 49 for testing, all of which are 1500 × 1500 pixels in size. According to [44], the resolution of the Massachusetts dataset can be inferred to be about 1.5 m. The source images in TIF format are three-channel color images, and the labels in TIFF format are binary images that use white and black to distinguish roads from backgrounds. Cement and asphalt roads are the main types in this dataset.

Data augmentation:
In order to demonstrate that our model works effectively on large-size remote sensing images, we directly use uncropped 1024 × 1024 images as the input of the network. To comprehensively utilize the limited training set, we employ geometric transformations and photometric distortion to augment the data. The geometric transformations include random cropping and horizontal and vertical flipping. In the photometric distortion part, random luminance and random contrast transformations are used, and saturation and hue transformations are applied after the RGB image is converted to HSV space. In addition, test-time augmentation, including horizontal and vertical flips, is adopted in the testing phase. In this phase, the predicted results are restored to match the original orientation, and the final prediction is given as the average of the augmentation outputs.
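A sketch of such a training-time pipeline, assuming the albumentations library; the probabilities and the crop size are illustrative choices rather than the paper's exact settings:

```python
import albumentations as A

# Geometric transformations plus photometric distortion, as described above.
train_aug = A.Compose([
    A.RandomCrop(height=1024, width=1024),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.5),  # random luminance/contrast
    A.HueSaturationValue(p=0.5),        # saturation/hue via HSV space
])

# Applied jointly to the image and its road mask:
# out = train_aug(image=img, mask=label)
```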
Implementation details: All the experiments are implemented on an NVIDIA GeForce RTX 3090 GPU using PyTorch in a Linux environment. To obtain better results, a learning rate schedule strategy is employed. Specifically, we adopt a poly strategy to modify the learning rate dynamically, giving the model a better convergence speed. An adaptive moment estimation (Adam) optimizer is applied in the training phase of our model. Meanwhile, multiple sets of learning rate parameters were tested, and according to the convergence of the model, 2e-4 was selected as the initial learning rate. In the ablation experiments, different pretrained Swin Transformers are employed to test the road extraction performance. We train RoadFormers with Swin-T, Swin-S, and Swin-B as the backbone using batch sizes of 4, 4, and 2, respectively.
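A sketch of the poly schedule mentioned above (the power of 0.9 is a common default for poly schedules and an assumption here, not a value from the paper):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly learning-rate decay: the rate shrinks smoothly to zero over
    training, e.g. starting from base_lr=2e-4 as chosen for RoadFormer."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# Example: the learning rate halfway through training.
print(poly_lr(2e-4, 5000, 10000))  # ~1.07e-4
```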

Evaluation Metrics
Road extraction can be approached as a segmentation problem with two classes, roads and backgrounds [30]. Therefore, the effectiveness of road extraction models is assessed using the evaluation metrics of binary segmentation. Precision (Pr), recall (Rc), F1-score, and intersection over union (IoU) are the four performance measures most frequently utilized. Precision reflects the percentage of road extraction results that are correctly classified, which can be formulated as:

$$Pr = \frac{TP_{re}}{TP_{re} + FP_{re}}$$

where the true positives and false positives of road extraction ($TP_{re}$ and $FP_{re}$) represent the numbers of pixels correctly and incorrectly classified as road areas, respectively. Different from precision, recall represents the percentage of properly recognized pixels in the whole road label, which can be formulated as:

$$Rc = \frac{TP_{re}}{TP_{re} + FN_{re}}$$

where the false negatives of road extraction ($FN_{re}$) denote the number of road pixels extracted as other areas. In addition, the F1-score, which offers a more thorough evaluation of the model's performance, is the harmonic mean of Pr and Rc. It can be calculated as follows:

$$F1\text{-}score = \frac{2TP_{re}}{2TP_{re} + FP_{re} + FN_{re}}$$

Without loss of generality, IoU is the intersection of the ground truth and the road extraction results divided by their union, which can be calculated as follows:

$$IoU = \frac{TP_{re}}{TP_{re} + FP_{re} + FN_{re}}$$

The four evaluation metrics mentioned above are adopted in our quantitative experiments.
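As a small sketch of how these metrics follow from pixel counts (using NumPy, with binary boolean masks assumed):

```python
import numpy as np

def road_metrics(pred, gt):
    """Pr, Rc, F1, and IoU from binary masks, per the formulas above."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    pr = tp / (tp + fp)
    rc = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return pr, rc, f1, iou
```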

Ablation Experiments
In this part, ablation experiments are carried out to evaluate the performance of encoder modules with different backbones. We use ResNet-50 and the Swin Transformer series as the encoder of the network. According to different configurations, the Swin Transformer can be formed as Swin-T (tiny size), Swin-S (small size), and Swin-B (base size). As shown in Table 1, Swin-T achieved better results with a number of parameters comparable to that of ResNet-50. Among them, Swin-B achieved the best performance with four times the number of parameters of ResNet-50. To trade off the performance and cost of the model, Swin-S was selected for the subsequent ablation experiments. In the comparison experiments, we mainly use the results of Swin-B because of its better performance. In practice, where computational efficiency matters, Swin-T is a good choice because of its small size and fast speed.

We conduct another ablation experiment to demonstrate that the bottleneck part is valid. The quantitative results of different module configurations are shown in Table 2. As seen from the results, the spatial and channel separable convolution significantly enhances the model's overall performance. We set dilated blocks after the spatial convolution and the channel convolution, respectively. Obviously, it makes sense to treat global and detailed features separately: the model with feature separation performs significantly better in recall, F1-score, and IoU than the model without such a structure. Meanwhile, a dilated block after the spatial convolution improves IoU and F1-score. In contrast, although adding a dilated block after the channel convolution increases the precision, the other performance metrics are reduced. This is because the effect of the dilated block is to expand the receptive field, which is compatible with the separated spatial features, whereas the channel features separated by a 1 × 1 convolution focus on the information of the pixel itself, so expanding their receptive field is meaningless. These results confirm that spatial convolution followed by dilated convolution improves the performance, while performance worsens when the dilated block follows the channel convolution. The results of the ablation experiment with different configurations are presented in Figure 7. Among the differently configured models, the road extraction results with feature separation and the dilated block have better continuity and detail. In summary, the feature separation module effectively enhances the comprehensive performance of the model.

Comparative Experiments
We conduct the experiments via a comparison with SOTA approaches on the Deepglobe dataset and the Massachusetts road dataset in terms of precision, recall, IoU, and F1-score to comprehensively evaluate the effectiveness of the proposed approach. Visualization results of the proposed model alongside five representative models are presented, and quantitative analysis and results are given in this section.

Experiments on the Deepglobe Dataset
On the Deepglobe dataset, RoadFormer was compared with FCN, U-Net, PSPNet [20], DeeplabV3, SegNet [22], LinkNet [23], D-LinkNet [24], HourGlass [26], Batra et al. [27], and SwinUnet [29]. Among the methods above, FCN and U-Net are representatives of classic segmentation models. PSPNet employed a pyramid pooling structure to gather context information. DeeplabV3 developed dilated convolution to enlarge the receptive field and aggregates multi-scale features using an ASPP module. SwinUnet is a novel transformer-based model originally used for medical image segmentation. We show the visualization results obtained by RoadFormer and the five representative methods above. For the other methods, we quoted the quantitative results from their sources, so their visualization results are missing as they were not available.
For an intuitive evaluation of road extraction performance, eight representative images with different scenes were chosen from the test set. Figure 8 shows the road extraction results of these images obtained by the six different methods. The extracted roads of the eight images are listed in eight rows and eight columns. The input images, ground truth images, and results of FCN, U-Net, PSPNet, DeeplabV3, SwinUnet, and RoadFormer are displayed in the left-to-right columns. For the images of the town scene (first to third rows), the results obtained by U-Net and DeeplabV3 miss much road information, while the other methods work well. In obscured scenes (fourth to sixth rows), discontinuous road structures appear in the results of the other methods, while RoadFormer's extraction results remain complete. For low-contrast scenes (the seventh and eighth rows), none of the five comparison methods can extract the road structure completely, while RoadFormer is able to extract road areas precisely. It is worth noting that the roads extracted by SwinUnet perform better than those of the other CNN-based models in terms of continuity, which is due to the long-range dependence established by the transformer. However, SwinUnet misses some of the slender roads, while RoadFormer still performs well in this case due to its bottleneck design. The visualization results above show that RoadFormer outperforms the other methods. The integrity and continuity of the roads are well preserved due to the long-range information-capturing ability and feature separation strategy of RoadFormer.
To make a more thorough evaluation of the proposed method, we quantitatively compared RoadFormer with SOTA methods, including FCN, U-Net, PSPNet, DeeplabV3, SegNet, LinkNet, D-LinkNet, HourGlass, Batra et al., and SwinUnet. Table 3 displays the quantitative performance results of these methods on the Deepglobe dataset. RoadFormer (with a Swin-B backbone) obtains the best results for precision (85.8%), IoU (73.1%), and F1-score (84.5%), and the second-best result for recall (83.2%), which is only lower than that of Batra et al. It is worth pointing out that the method of Batra et al. employed a multi-task learning strategy that considers road direction information. That method enhances the correlation between the extracted segments but also increases the cost. The lightweight RoadFormers (using Swin-T and Swin-S as the backbone) still outperform most SOTA methods in terms of performance metrics. This result substantiates the reliability of the suggested model structure.
Experiments on the Massachusetts Dataset
The visualization results in Figure 9 further demonstrate that RoadFormer has better adaptability than the other methods in complex scenes. To further evaluate the model performance, quantitative comparison results between RoadFormer and the other nine methods are given in Table 4. We can observe from the table that the other methods achieve a recall rate lower than 75%, except for PSPNet, CA-DUNet, and RoadFormer. More importantly, the other methods achieve IoU rates lower than 65% and F1-scores lower than 78%, except for SGCN and RoadFormer. Recall, IoU, and F1-score are all best on RoadFormer. Among these SOTA methods, SGCN also uses the technique of feature separation; RoadFormer achieves better results owing to its ability to capture long-range information and its larger receptive field. Notably, our model uses the entire image as the input, while the other models use cropped patches as the input. Thus, it is evident that RoadFormer is more capable of handling high-resolution images. In comparison to the other approaches, RoadFormer consistently outperforms them in criteria such as recall, IoU, and F1-score.

The visualization and quantitative comparison results above confirm that the proposed method has a higher capacity to extract roads. The visualization results intuitively show that RoadFormer is more adaptable to complex road and fine road scenarios. From Tables 3 and 4, we can see that the RoadFormer with a tiny-size backbone has lower computational complexity and fewer parameters but still performs well, while the RoadFormer with a base-size backbone has higher complexity but the best performance. For a 1024 × 1024 image input, the inference time of the base model on an RTX 3090 is less than 0.3 s per frame, which can meet practical application requirements. The following three factors mainly account for RoadFormer's superiority: (1) Swin Transformer modules help the model establish long-range dependence; (2) through the feature separation module, refined feature maps allow the model to perceive different features separately; (3) the model's receptive field is further expanded by the dilated block that follows the spatial convolution. Furthermore, according to the quantitative results, our model has the highest IoU and F1-score on the Deepglobe and Massachusetts datasets, further demonstrating the superiority of our approach.

Conclusions
For road extraction tasks, we present a novel model called RoadFormer that uses a Swin Transformer as its backbone. The spatial and channel separable convolution is combined into the design of RoadFormer to improve the feature representation of the model. In addition, a dilated block is adopted after the spatial convolution, which effectively helps the model capture better global contextual information and obtain larger receptive fields. Ablation experiments demonstrate the validity of our module design. The proposed method was thoroughly assessed in experiments on the Deepglobe and Massachusetts datasets and outperforms previous SOTA methods, as shown by the comparison of visualization and quantitative results, which supports the proposed model's superiority and effectiveness.
The proposed model was trained on RGB remote sensing image datasets (Deepglobe and Massachusetts). In practice, the accuracy of road extraction could be further improved by the fusion of multimodal data. Specifically, DEM information and geological background are very important, as they make it easier to extract road features in special scenarios. In addition, the multiple channels of information in satellite remote sensing and radar imagery can offer complementary information on roads. Moreover, the architecture of the Swin Transformer block could be further optimized and tweaked, and different loss functions could be investigated to improve the model performance. In the future, we will collect the multimodal remote sensing datasets mentioned above and further improve the model performance by optimizing the architecture and loss function.

Figure 1. RoadFormer architecture consists of an encoder, bottleneck, and decoder. Multi-scale feature representations are produced by the encoder. High-dimensional feature maps are obtained by the separable convolution and dilated block in the bottleneck. The final results are given by the decoder.

Figure 2. The architecture of Swin Transformer blocks.

Figure 3. Structure of the parallel separable convolution. The separable convolution splits the features into channel features and spatial features.

Figure 4. Structure of the dilated block. Multi-scale receptive fields are constructed through dilated convolution layers, which enables the network to extract features at different scales.

Figure 5. Structure of the decoder block.

Figure 6. Some samples of the Deepglobe dataset and the Massachusetts dataset; the first and second rows are from the Deepglobe dataset, and the third and fourth rows are from the Massachusetts dataset.

Figure 7. Visualization results of different configurations of the bottleneck. In the 1024 × 1024 images, the 350 × 320 red boxes highlight the areas where RoadFormer's road extraction results are better.

Table 1. Quantitative comparison of different backbones for RoadFormer on the Deepglobe dataset.

Table 2. Quantitative comparison of different configurations of the bottleneck.

Table 3. Quantitative performance results of different methods on the Deepglobe dataset.

Table 4. Quantitative performance results of different methods on the Massachusetts road dataset.