Article

A Road Extraction Algorithm for the Guided Fusion of Spatial and Channel Features from Multi-Spectral Images

1
School of Information Science and Engineering, Shenyang Ligong University, Shenyang 110159, China
2
School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(4), 1684; https://doi.org/10.3390/app15041684
Submission received: 11 December 2024 / Revised: 27 January 2025 / Accepted: 3 February 2025 / Published: 7 February 2025
(This article belongs to the Special Issue Intelligent Computing and Remote Sensing—2nd Edition)

Abstract

To address the low utilization of spectral features of high-resolution remote sensing images in the road extraction task, we propose a road extraction algorithm based on the Multi-spectral image-guided fusion of Spatial and Channel Features (SC-FMNet). The method uses a two-branch input network structure consisting of a Multi-spectral image branch and a fused image branch. Building on the original MSNet model, a Spatial and Channel Reconstruction Convolution (SCConv) module is introduced into the encoding part of each branch, and a Spatially Adaptive Feature Modulation Mechanism (SAFMM) module is introduced into the decoding structure. Experimental results on the GF2-FC and CHN6-CUG road datasets show that the method extracts road information more completely and improves road segmentation accuracy, verifying the effectiveness of SC-FMNet.

1. Introduction

In the current era of informatization, the extraction of road information from high-resolution remote sensing images (HR-RSIs) has become a hot research issue. Thanks to high-resolution remote sensing technology, the description of road information has become more detailed, covering not only basic attributes such as road geometry, radiometric characteristics, topology, and context but also detailed features such as texture and structure. As image resolution increases, roads become more clearly visible as distinct features. These rich road targets and detailed features provide strong support for the automatic extraction of road information [1].
Road extraction from HR-RSIs faces the following difficulties: (1) Confusing image elements. When performing the road extraction task, some image elements (pixels) may be incorrectly categorized as roads or non-roads in remote sensing images or other types of images. (2) Small inter-class variation and large intra-class variation. The spectral features of roads and some non-road objects (e.g., parking lots, buildings, shadows, etc.) may be very similar [2,3], making it difficult for classification algorithms to distinguish between them. (3) Shadow occlusion and visual occlusion in remotely sensed imagery can lead to missing information, affecting the accuracy [4] and completeness of road extraction.
In this paper, road extraction algorithms based on deep learning methods are further categorized into convolutional neural network methods based on “encoding-decoding” and neural network methods based on attention mechanisms.
Among convolutional neural network methods based on “encoding–decoding”, U-Net [5] consists of an encoder and a decoder, and its distinctive “U”-shaped structure gives it its name. Wang et al. [6] proposed a U-Net model that specifically emphasizes the importance of connectivity; the model addresses incomplete road extraction caused by occlusion by iteratively repairing interrupted road segments. Zhang et al. [7] designed the Res-UNet model, which combines the strengths of residual learning and the U-Net structure, optimizing and enhancing the accuracy of road network extraction. To improve the accuracy and reliability of extracting small roads from high-resolution remote sensing images, Wang et al. [8] proposed an improved deep neural network model, DDU-Net, which adds an auxiliary small decoder to U-Net to construct a dual-decoder structure for finer feature acquisition. Gao et al. [9] proposed an improved U-Net model incorporating deep residuals and dilation-aware mechanisms for the road extraction task. In the encoder stage, they introduced a residual module to mitigate the degradation problem during training; to enhance the model’s ability to extract multi-scale features, they added a dilation-aware unit between the encoder and the decoder, which expanded the model’s receptive field. The extraction of unpaved roads in rural areas is more challenging than that of urban roads; Kearney et al. identified and extracted unpaved roads in rural areas particularly effectively by using the SegNet model. Makhlouf et al. [10] proposed a convolutional neural network based on a down-sampling followed by up-sampling architecture for extracting roads from aerial images; the proposed encoder–decoder structure retains boundary information and is less complex in terms of depth, number of parameters, and memory size.
Neural network methods based on attention mechanisms: with the rapid development of the attention mechanism in computer vision, remote sensing road extraction methods have also begun to incorporate it. Combining the attention mechanism with CNNs can significantly improve the performance and efficiency of a model and enhance its interpretability and adaptability. Song et al. [11] proposed a novel lane line detection algorithm for driverless geographic information perception using a hybrid-attention ResNet and row anchor classification. The hybrid attention mechanism is added after the convolutional, normalization, and activation layers of the backbone network so that the model pays more attention to important lane line features, improving the relevance and efficiency of feature extraction. Li et al. [12] proposed CRAE-Net (Cascaded Residual Attention-Enhanced Network), which optimizes feature fusion by integrating dual-attention modules and handles multi-scale features effectively, thus coping with the challenge of recognizing narrow roads and improving the smoothness and connectivity of road boundaries. Vaswani et al. [13] proposed the Transformer model, which effectively mitigates the long-distance dependency challenge in processing inputs and outputs and significantly reduces the consumption of computational resources through its parallel processing nature. Dosovitskiy et al. [14] proposed the Vision Transformer (ViT), which relies on the attention mechanism to ensure that the global information of the image can be effectively utilized at different levels of the network; it alleviates the loss of detail information during road image extraction and improves the accuracy and detail-capturing ability of image processing. Liu et al. [15] proposed the Swin Transformer, a model that significantly improves the quality of road image extraction by reinforcing the connection between contextual semantics. Zhang et al. [16] designed a U-shaped dual-resolution road segmentation network which incorporates a feature fusion module. This framework employs a self-attention mechanism based on the CSwin Transformer [17] to construct its encoder, which effectively captures the global contextual links between pixels and helps to extract complete road information.
The contributions are as follows:
(1)
To address the low utilization of spectral features in HR-RSIs, we build on the MSNet network and propose SC-FMNet, a refined road extraction algorithm guided by the fusion of spatial and channel features from Multi-spectral images;
(2)
A Multi-spectral branch is designed alongside the fused image branch. The Spatial and Channel Reconstruction Convolution (SCConv) module is integrated into each of the two branches; it cascades a spatial reconstruction unit and a channel reconstruction unit to remove redundant features through reconstruction;
(3)
The Spatially Adaptive Feature Modulation Mechanism (SAFMM) module is embedded into the decoding structure, which mainly consists of a Spatially Adaptive Feature Modulation (SAFM) unit and a Convolutional Channel Mixer (CCM) unit. The SAFM unit learns the multi-scale features and uses the non-local information to adaptively modulate the features so as to select the most suitable modulation for each pixel position.

2. Materials and Methods

2.1. SC-FMNet

SC-FMNet is improved from the MSNet network proposed by Du et al. [18]. Figure 1 shows the structure of the whole network, which is divided into four parts. The first part is the upper input branch of the encoder, which extracts features from the Multi-spectral image; this image provides balanced spectral information about the road surface and, because it is smaller than the fused image, yields deeper semantic information at the same network depth. The second part is the lower input branch, which extracts features from the fused image; the fused image combines high spatial resolution with Multi-spectral information, provides accurate detail and texture information, and can be directly used for remote sensing image interpretation. In the third part, the features extracted from the Multi-spectral and fused images are combined using the Aggregated and Gated Operation (AGO), and a global asymmetric semantic fusion module is added to process deep abstract features. In the last part, the fused features from the upper and lower branches are reweighted in space and channel using the Convolutional Block Attention Module (CBAM), and the road prediction map is finally output by the decoder.
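To make the data flow of this two-branch design concrete, the following minimal PyTorch sketch (not the authors' released implementation) shows how the upper and lower encoders, the AGO fusion, the CBAM reweighting, and the decoder could be wired together; every sub-module here is a placeholder supplied by the caller.

```python
import torch.nn as nn

class DualBranchRoadNet(nn.Module):
    """Structural sketch of the SC-FMNet layout described above (hypothetical).

    ms_encoder / fused_encoder stand in for the U2Net-style encoders of the
    Multi-spectral and fused-image branches; ago, cbam and decoder are
    placeholders for the aggregation, attention and decoding stages.
    """
    def __init__(self, ms_encoder, fused_encoder, ago, cbam, decoder):
        super().__init__()
        self.ms_encoder = ms_encoder        # upper branch: Multi-spectral image
        self.fused_encoder = fused_encoder  # lower branch: fused (pan-sharpened) image
        self.ago = ago                      # Aggregated and Gated Operation
        self.cbam = cbam                    # Convolutional Block Attention Module
        self.decoder = decoder              # decoder containing SAFMM (Section 2.3)

    def forward(self, x_ms, x_fused):
        f_ms = self.ms_encoder(x_ms)            # balanced spectral features
        f_fused = self.fused_encoder(x_fused)   # detail and texture features
        f = self.ago(f_ms, f_fused)             # gated aggregation of the two branches
        f = self.cbam(f)                        # reweight fused features in space/channel
        return self.decoder(f)                  # road prediction map
```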

2.2. Space and Channel Reconstruction Convolution (SCConv)

Convolutional neural networks have demonstrated excellent performance across a wide range of computer vision tasks; however, this performance comes at the cost of significant computational resource consumption, partly due to the redundant features extracted by the convolutional layers. In SC-FMNet, U2Net [19], a two-layer nested U-Net structure, is used as the baseline. Because the network structure of U2Net is complex, the number of model parameters is large; especially when training on large-scale datasets, the training time is long and the training efficiency is low.
In this paper, we improve the MSNet network by introducing a Spatial and Channel Reconstruction Convolution (SCConv) module [20] in the encoding part of both branches, as shown in Figure 1. In each feature extraction stage, the output is obtained by summing the newly extracted features with the previous features, so the SCConv module is applied to the previous features. SCConv introduces two components, the Spatial Reconstruction Unit (SRU) and the Channel Reconstruction Unit (CRU), which learn the spatial correlation and channel correlation of the feature map, respectively. This reduces the spatial and channel redundancy of the feature map, improves model performance, and decreases computational complexity, enabling a more fine-grained feature extraction. The structure of SCConv is shown in Figure 2.
The SCConv module consists of two parts: the SRU optimizes spatial features, and the CRU optimizes channel features. First, the input feature X passes through the 1 × 1 convolution of the previous convolution block (ConvBlock) and enters the SCConv module. The SRU then generates the spatially optimized feature X_w, reducing spatial redundancy and enhancing useful spatial information through separation and reconstruction operations. The spatially optimized feature X_w is further passed into the CRU to generate the channel-optimized feature Y; the CRU reduces channel redundancy through split, transform, and fuse operations while preserving the expressive power of the features. Finally, the channel-optimized feature Y is processed by another 1 × 1 convolution and passed to the next convolution block.
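A minimal sketch of this composition is given below, assuming the SRU and CRU behave as described in Sections 2.2.1 and 2.2.2; the bracketing 1 × 1 convolutions and the shared channel width are illustrative assumptions rather than the authors' exact configuration.

```python
import torch.nn as nn

class SCConvBlock(nn.Module):
    """Hypothetical SCConv wrapper: 1x1 conv -> SRU -> CRU -> 1x1 conv."""
    def __init__(self, channels, sru, cru):
        super().__init__()
        self.conv_in = nn.Conv2d(channels, channels, kernel_size=1)   # from previous ConvBlock
        self.sru = sru        # Spatial Reconstruction Unit, yields X_w
        self.cru = cru        # Channel Reconstruction Unit, yields Y
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=1)  # to the next ConvBlock

    def forward(self, x):
        x = self.conv_in(x)
        x_w = self.sru(x)     # suppress spatially redundant responses
        y = self.cru(x_w)     # remove channel redundancy
        return self.conv_out(y)
```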

2.2.1. Spatial Reconstruction Unit (SRU)

SRU enhances feature representation in the spatial dimension and suppresses invalid spatial information by separating redundant spatial features and reconstructing valid information. The separation operation aims to distinguish those feature maps that are rich in information from those that have less spatial content information. The SRU structure is shown in Figure 3.
Using the parameters of the trainable GN (group normalization) layer, the variance of different feature maps over spatial pixels can be evaluated. Subsequently, by applying reweighting and reconstruction operations, the feature maps can be separated and reconstructed to optimize the feature representation. Specifically, given an intermediate feature map X ∈ ℝ^{N×C×H×W}, where N represents the batch dimension, C the channel dimension, and H and W the height and width of the spatial dimensions, respectively, the input feature X is first normalized by subtracting the mean μ and dividing by the standard deviation σ. The computational process can be formulated as Equation (1):
X_{out} = GN(X) = \gamma \frac{X - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + \vartheta      (1)
where μ and σ are the mean and standard deviation of X, ε is a small positive constant added for division stability, and γ and ϑ are trainable affine transformation parameters.
The variance of spatial pixels in each batch and channel is measured using the trainable parameter γ ∈ ℝ^C in the GN layer: a larger γ indicates greater variance across spatial pixels and thus richer spatial information. The normalized weights W_γ ∈ ℝ^C represent the importance of the different feature maps, as shown in Equation (2):
W_{\gamma} = \{ w_{i} \} = \frac{\gamma_{i}}{\sum_{j=1}^{C} \gamma_{j}}, \quad i, j = 1, 2, \ldots, C      (2)
The weights of the W_γ-weighted feature maps are mapped to the range (0, 1) using the Sigmoid function. Next, a threshold is set (0.5 in the experiments); the weights above the threshold are set to 1 to obtain the informative weights W_1, while the weights below the threshold are set to 0 to obtain the non-informative weights W_2. The whole process of obtaining W can be expressed as Equation (3):
W = \mathrm{Gate}\left( \mathrm{Sigmoid}\left( W_{\gamma}\left( GN(X) \right) \right) \right)      (3)
Finally, the input features X are multiplied by W_1 and W_2 to obtain two sets of weighted features: the information-rich features X_1^w and the information-poor features X_2^w.
To reduce the spatial redundancy of features, a further reconstruction operation is performed. The interaction between different channels (those with more information and those with less) is exploited to enhance the information flow between them, improving accuracy, reducing redundant features, and strengthening the feature representation capability of the CNN. Adding the information-rich features to the information-poor features generates more informative features while saving spatial resources.
Cross-reconstruction is used instead of simple direct addition, which more fully integrates the two differently weighted features and enhances the information exchange between them. The cross-reconstructed features X^{w1} and X^{w2} are then fused using element-wise multiplication, element-wise summation, and concatenation operations to obtain the spatially refined feature map X^w. The whole reconstruction process can be expressed as Equation (4):
X_{1}^{w} = W_{1} \otimes X, \quad X_{2}^{w} = W_{2} \otimes X, \quad X_{11}^{w} \oplus X_{22}^{w} = X^{w1}, \quad X_{21}^{w} \oplus X_{12}^{w} = X^{w2}, \quad X^{w1} \cup X^{w2} = X^{w}      (4)
where ⊗ denotes element-wise multiplication, ⊕ denotes element-wise summation, and ∪ is the concatenation operation.
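The SRU logic above can be sketched in PyTorch as follows. This is a hedged reconstruction from the description (GN-based weighting, sigmoid gating with the 0.5 threshold, and cross-reconstruction), not the authors' code; the group count and the exact treatment of sub-threshold weights in W_1 and W_2 are assumptions.

```python
import torch
import torch.nn as nn

class SRU(nn.Module):
    """Sketch of the Spatial Reconstruction Unit (Equations (1)-(4))."""
    def __init__(self, channels, groups=16, gate_threshold=0.5):
        super().__init__()
        # channels must be divisible by groups (and by 2 for the reconstruction split)
        self.gn = nn.GroupNorm(groups, channels)   # trainable gamma = self.gn.weight
        self.gate_threshold = gate_threshold

    def forward(self, x):
        gn_x = self.gn(x)                                        # Eq. (1)
        w_gamma = self.gn.weight / self.gn.weight.sum()          # Eq. (2)
        w = torch.sigmoid(gn_x * w_gamma.view(1, -1, 1, 1))      # Eq. (3)
        # One plausible reading of the gate: above-threshold weights become 1 (W_1)
        # or 0 (W_2); the remaining soft weights are kept.
        w1 = torch.where(w > self.gate_threshold, torch.ones_like(w), w)
        w2 = torch.where(w > self.gate_threshold, torch.zeros_like(w), w)
        x1, x2 = w1 * x, w2 * x                                  # weighted features
        # Cross-reconstruction (Eq. (4)): split, add crosswise, concatenate.
        x11, x12 = torch.split(x1, x1.size(1) // 2, dim=1)
        x21, x22 = torch.split(x2, x2.size(1) // 2, dim=1)
        return torch.cat([x11 + x22, x12 + x21], dim=1)
```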

2.2.2. Channel Reconstruction Unit (CRU)

The structure of the channel reconstruction unit, CRU, is shown in Figure 4. CRU consists of three parts: segmentation, transformation, and fusion. The rich representative features in the channel are extracted by segmenting the channel in the feature map and using the efficient convolution operation as well as reusing features.
Split: The spatially refined feature X^w ∈ ℝ^{c×h×w} is first split into two parts along the channel dimension, with αC and (1 − α)C channels, respectively, as shown in the split part of Figure 4, where 0 ≤ α ≤ 1 is the split ratio. The number of channels in each part is then compressed using a 1 × 1 convolution kernel to improve computational efficiency. After the split and compression operations, the spatially optimized feature X^w is divided into an upper part X_up and a lower part X_low.
Transform: A transform operation is performed on the compressed features. The upper part X_up is fed into the upper transform stage as a “rich feature extractor”. In this stage, efficient convolution operations, group-wise convolution (GWC) and point-wise convolution (PWC), are used instead of the expensive standard k × k convolution to extract high-level representative information and reduce the computational cost. Although GWC reduces the parameters and the computation cost, it also cuts off the information flow between different channel groups, so another path uses PWC to help information flow across channels. Therefore, a k × k GWC operation and a 1 × 1 PWC operation are performed on the same X_up, and their outputs are summed to form the merged representative feature map Y_1, as shown in the transform section of Figure 4. The upper transform stage can be expressed as Equation (5):
Y_{1} = M^{G} X_{up} + M^{P1} X_{up}      (5)
where M^G ∈ ℝ^{(αC/gr)×k×k×c} and M^{P1} ∈ ℝ^{(αC/r)×1×1×c} are the learnable weight matrices of GWC and PWC, respectively, and X_up ∈ ℝ^{(αC/r)×h×w} and Y_1 ∈ ℝ^{c×h×w} are the upper input and output feature maps, respectively. In short, the upper transformation stage combines the outputs of GWC and PWC on the same feature map X_up to form Y_1, extracting rich representative information while reducing the computational cost.
The lower part X_low is fed into the lower transformation stage. In this stage, a cheap 1 × 1 PWC operation is used to generate feature maps with shallow hidden details as a “complement to the rich feature extractor” (detail information). In addition, reusing X_low yields more feature maps without additional computational cost. Finally, the generated and reused feature maps are concatenated to form the output Y_2, as shown in Equation (6):
Y_{2} = M^{P2} X_{low} \cup X_{low}      (6)
where M^{P2} ∈ ℝ^{((1−α)C/r)×1×1×(1−(1−α)/r)C} is the learnable weight matrix of the PWC, ∪ is the concatenation operation, and X_low ∈ ℝ^{((1−α)C/r)×h×w} and Y_2 ∈ ℝ^{c×h×w} are the lower input and output feature maps, respectively. In short, the lower transformation stage reuses the feed-forward feature map X_low, applies a 1 × 1 PWC operation as a complement to the rich feature extractor, and then concatenates them to form the feature Y_2 with complementary detail information.
Fuse: Instead of simply concatenating or adding the two types of features after the transform operation, the output features Y_1 and Y_2 of the upper and lower transform stages are adaptively merged using a simplified SKNet [21] method, as shown in the fuse part of Figure 4. Global spatial information is first collected using global average pooling to obtain the channel statistics S_m ∈ ℝ^{c×1×1}, computed as shown in Equation (7):
S_{m} = \mathrm{Pooling}(Y_{m}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} Y_{c}(i, j), \quad m = 1, 2      (7)
Then, the upper global channel descriptor S_1 and the lower global channel descriptor S_2 are stacked. Based on this, the feature importance vectors β_1, β_2 ∈ ℝ^c are generated using a channel-wise soft attention operation, as shown in Equation (8):
\beta_{1} = \frac{e^{S_{1}}}{e^{S_{1}} + e^{S_{2}}}, \quad \beta_{2} = \frac{e^{S_{2}}}{e^{S_{1}} + e^{S_{2}}}, \quad \beta_{1} + \beta_{2} = 1      (8)
Finally, the upper feature Y_1 and the lower feature Y_2 are merged channel-wise under the guidance of the feature importance vectors β_1 and β_2 to obtain the channel-refined feature Y, as shown in Equation (9):
Y = \beta_{1} Y_{1} + \beta_{2} Y_{2}      (9)
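The split/transform/fuse pipeline of the CRU maps onto the following compact sketch. The split ratio α, squeeze ratio r, kernel size k, and group count g are illustrative assumptions, not the authors' settings; the channel arithmetic follows Equations (5)–(9).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRU(nn.Module):
    """Sketch of the Channel Reconstruction Unit (split / transform / fuse)."""
    def __init__(self, channels, alpha=0.5, r=2, k=3, groups=2):
        super().__init__()
        self.up_c = int(alpha * channels)           # alpha*C channels
        self.low_c = channels - self.up_c           # (1-alpha)*C channels
        self.squeeze_up = nn.Conv2d(self.up_c, self.up_c // r, 1)
        self.squeeze_low = nn.Conv2d(self.low_c, self.low_c // r, 1)
        # upper transform: GWC + PWC, both mapping to C channels (Eq. (5))
        self.gwc = nn.Conv2d(self.up_c // r, channels, k, padding=k // 2, groups=groups)
        self.pwc1 = nn.Conv2d(self.up_c // r, channels, 1)
        # lower transform: cheap PWC, concatenated with the reused input (Eq. (6))
        self.pwc2 = nn.Conv2d(self.low_c // r, channels - self.low_c // r, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x_w):
        x_up, x_low = torch.split(x_w, [self.up_c, self.low_c], dim=1)  # split
        x_up, x_low = self.squeeze_up(x_up), self.squeeze_low(x_low)    # 1x1 compression
        y1 = self.gwc(x_up) + self.pwc1(x_up)                           # Eq. (5)
        y2 = torch.cat([self.pwc2(x_low), x_low], dim=1)                # Eq. (6)
        # fuse: channel statistics (Eq. (7)) and channel soft attention (Eq. (8))
        s = torch.stack([self.pool(y1), self.pool(y2)], dim=0)
        beta = F.softmax(s, dim=0)
        return beta[0] * y1 + beta[1] * y2                              # Eq. (9)
```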

2.3. Spatial Adaptive Feature Modulation Mechanism (SAFMM)

Due to the low resolution of Multi-spectral images, blurring or distortion tends to occur when the image size is small or the compression is severe. The Spatially Adaptive Feature Modulation Mechanism [22] (SAFMM), on the other hand, can dynamically adjust the features at each pixel location to improve image quality, with significant advantages in preserving image details and textures. Based on this, the SAFMM module is added to the decoder, as shown in the decoder section of Figure 1.
The structure of SAFMM is shown in Figure 5. SAFMM consists of three parts: a shallow convolution, stacked Feature Mixing Modules (FMMs), and an up-sampler layer. Specifically, a 3 × 3 convolutional layer is first used to transform the input image into the feature space and generate the shallow feature F_0. Feature extraction is then performed using multiple stacked Feature Mixing Modules to generate finer deep features from F_0. The FMM consists of a Spatially Adaptive Feature Modulation (SAFM) unit, a Convolutional Channel Mixer (CCM), and two skip connections. Finally, the extracted features are reconstructed by the up-sampler module. The SAFM unit uses independent computations to learn multi-scale features and aggregates them for dynamic spatial modulation. Since SAFM preferentially exploits non-local feature dependencies, the CCM is further introduced to simultaneously fuse local context information and mix channels.

2.3.1. Space Adaptive Feature Modulation Unit (SAFM)

The SAFM module utilizes non-local information to adaptively modulate features, thus allowing the model to select the most appropriate modulation for each pixel location, enhancing adaptation to different regions. To introduce the ability to interact over long distances and model dynamics in convolution, a multi-head paradigm is followed using parallel and independent computations that allow each head to process different scales of information about the input, and then these features are aggregated to generate an attention map for spatially adaptive feature modulation.
The detailed structure of SAFM is shown in Figure 6. The normalized input features are first divided into four groups and fed into the Multi-Scale Feature Generation Unit (MFGU), which applies a 3 × 3 depth-wise convolution to the first group and, for the remaining groups, down-samples them, applies a 3 × 3 depth-wise convolution, and up-samples the result back to the original resolution. Given the input feature X, this process is formulated as Equation (10):
[X_{0}, X_{1}, X_{2}, X_{3}] = \mathrm{Split}(X), \quad \hat{X}_{0} = \mathrm{DWConv}_{3 \times 3}(X_{0}), \quad \hat{X}_{i} = \uparrow_{p}\left( \mathrm{DWConv}_{3 \times 3}\left( \downarrow_{p/2^{i}}(X_{i}) \right) \right), \ 1 \le i \le 3      (10)
where Split(·) denotes the channel splitting operation, DWConv_{3×3}(·) is a 3 × 3 depth-wise convolution, ↑_p(·) denotes up-sampling a given level of features back to the original resolution p via nearest-neighbor interpolation, and ↓_{p/2^i}(·) denotes pooling the input features down to the size p/2^i.
These extracted short- and long-range features are concatenated along the channel dimension, and a 1 × 1 convolution is applied to aggregate them, as shown in Equation (11):
\hat{X} = \mathrm{Conv}_{1 \times 1}\left( \mathrm{Concat}\left( [\hat{X}_{0}, \hat{X}_{1}, \hat{X}_{2}, \hat{X}_{3}] \right) \right)      (11)
where Concat(·) denotes the concatenation operation and Conv_{1×1}(·) is a 1 × 1 convolution.
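Equations (10) and (11) can be sketched as follows. The use of adaptive max pooling for the down-sampling ↓_{p/2^i}, nearest-neighbor up-sampling for ↑_p, and a GELU before the final modulation are assumptions based on the description above rather than confirmed details of the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAFM(nn.Module):
    """Sketch of the Spatially Adaptive Feature Modulation unit (Eqs. (10)-(11))."""
    def __init__(self, channels, levels=4):
        super().__init__()
        self.levels = levels
        chunk = channels // levels            # channels must be divisible by levels
        # one 3x3 depth-wise convolution per scale
        self.dwconvs = nn.ModuleList(
            nn.Conv2d(chunk, chunk, 3, padding=1, groups=chunk) for _ in range(levels)
        )
        self.aggregate = nn.Conv2d(channels, channels, 1)   # 1x1 aggregation, Eq. (11)
        self.act = nn.GELU()

    def forward(self, x):
        h, w = x.shape[-2:]
        parts = torch.chunk(x, self.levels, dim=1)           # Split(X)
        outs = [self.dwconvs[0](parts[0])]                   # level 0: original resolution
        for i in range(1, self.levels):
            p = F.adaptive_max_pool2d(parts[i], (h // 2**i, w // 2**i))  # downsample p/2^i
            p = self.dwconvs[i](p)
            outs.append(F.interpolate(p, size=(h, w), mode="nearest"))   # upsample back
        x_hat = self.aggregate(torch.cat(outs, dim=1))       # Eq. (11)
        return x * self.act(x_hat)                           # spatially adaptive modulation
```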

2.3.2. Feature Mixing Module (FMM)

To further combine local contextual information and perform channel blending simultaneously, a compact convolutional channel mixer (CCM) based on FMBConv [23] is introduced. The CCM interacts with features from different channels to further enhance image detail recovery. It normalizes the features through LayerNorm and subsequently performs cross-channel feature fusion.
The CCM consists of a 3 × 3 convolution and a 1 × 1 convolution. The preceding 3 × 3 convolution encodes the spatial local context and doubles the number of channels of the input features for channel mixing; the following 1 × 1 convolution reduces the number of channels back to the original input dimension.
The SAFM and CCM are combined into a unified Feature Mixing Module (FMM), which integrates different feature information through a series of mixing operations to enhance the capture and recovery of image details. The FMM can be written as Equation (12):
Y = \mathrm{SAFM}(\mathrm{LN}(X)) + X, \quad Z = \mathrm{CCM}(\mathrm{LN}(Y)) + Y      (12)
where LN(·) is the LayerNorm layer and X, Y, and Z are the intermediate features.
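A minimal sketch of the CCM and of the FMM composition in Equation (12) is shown below; it reuses the SAFM sketch from Section 2.3.1 and uses GroupNorm with a single group as a stand-in for the LayerNorm layer, which is an assumption.

```python
import torch.nn as nn

class CCM(nn.Module):
    """Sketch of the Convolutional Channel Mixer: 3x3 conv doubling the channels,
    then a 1x1 conv back to the input width. The GELU activation is an assumption."""
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(channels, channels * 2, 3, padding=1),  # local context, 2x channels
            nn.GELU(),
            nn.Conv2d(channels * 2, channels, 1),             # back to the input dimension
        )

    def forward(self, x):
        return self.mix(x)


class FMM(nn.Module):
    """Sketch of the Feature Mixing Module, Eq. (12): two pre-normalized residual steps."""
    def __init__(self, channels):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, channels)   # stand-in for LayerNorm
        self.norm2 = nn.GroupNorm(1, channels)
        self.safm = SAFM(channels)               # SAFM sketch from Section 2.3.1
        self.ccm = CCM(channels)

    def forward(self, x):
        y = self.safm(self.norm1(x)) + x   # Y = SAFM(LN(X)) + X
        z = self.ccm(self.norm2(y)) + y    # Z = CCM(LN(Y)) + Y
        return z
```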

3. Results and Analysis

3.1. Experimental Setup

All experiments in this paper are performed on a Linux system with an RTX 3090 GPU (24 GB), 45 GB of RAM, and an Intel(R) Xeon(R) Platinum 8362 CPU @ 2.80 GHz. The network models are built using the PyTorch 2.0.0 deep learning framework. Training is optimized with the Adam algorithm (β_1 = 0.99, β_2 = 0.999). The learning rate is set to 1 × 10−3 and the model is trained for 70 epochs; the learning rate is adjusted after every 15 epochs, with each adjustment setting the new learning rate to 0.95 times the old one.
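The stated schedule maps directly onto a standard PyTorch optimizer and step scheduler, as in the hypothetical fragment below; model, train_loader, and criterion are placeholders rather than parts of the authors' code.

```python
import torch

# Adam with the stated betas, initial learning rate 1e-3, multiplied by 0.95 every 15 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.99, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.95)

for epoch in range(70):
    for x_ms, x_fused, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x_ms, x_fused), labels)  # two-branch forward pass
        loss.backward()
        optimizer.step()
    scheduler.step()  # learning rate decays by a factor of 0.95 every 15 epochs
```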
In the road extraction task, this paper adopts both qualitative and quantitative dimensions for the evaluation of classification results. For qualitative assessment, it is mainly based on subjective evaluation, including the judgment of the completeness of the results, the clarity of the edges, and the consistency of the edges. For quantitative evaluation, this paper adopts several commonly used performance metrics to measure the effectiveness of the model, including recall, precision, F1-score, overall accuracy (OA), and intersection over union (IoU). TP stands for the number of pixels correctly predicted as roads; TN is the number of pixels correctly predicted as non-roads; FP stands for the number of pixels incorrectly predicted as roads, and FN stands for the number of pixels incorrectly predicted as non-roads.
(1)
Recall
Recall is used to evaluate the accuracy of the model in recognizing road pixels. It is the ratio of the number of pixels correctly predicted as roads to the total number of actual road pixels. It is calculated using the following formula:
\mathrm{Recall} = \frac{TP}{TP + FN}
(2)
Precision
Precision reflects the proportion of pixels predicted as roads that are actually roads, i.e., the ratio of correctly predicted road pixels to all pixels predicted as roads. The specific calculation formula is as follows:
\mathrm{Precision} = \frac{TP}{TP + FP}
(3)
F1-score
The F1-score is a composite measure of model performance that provides a consistent perspective for evaluating the overall effectiveness of the model by calculating the harmonic mean of precision and recall. A higher F1-score indicates an overall enhancement of the model’s performance, making it a key metric for evaluating the balance between precision and recall. The specific calculation formula is as follows:
\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}
(4)
OA
OA is the ratio of the number of correctly classified pixels to the total number of pixels. The specific calculation formula is as follows:
\mathrm{OA} = \frac{TP + TN}{TP + FN + FP + TN}
(5)
IoU
IoU is an important metric for measuring the extent to which the predicted road area overlaps with the real road area. Specifically, IoU is the ratio of the area of the intersection of the predicted and real areas to the area of their union. The specific calculation formula is as follows:
\mathrm{IoU} = \frac{TP}{TP + FP + FN}
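For reference, the five metrics above can be computed from a predicted binary road mask and its label as in the following sketch, which is a plain restatement of the definitions rather than the authors' evaluation script.

```python
import numpy as np

def road_metrics(pred, target):
    """Compute OA, precision, recall, F1-score and IoU for binary road masks.

    pred and target are arrays of 0/1 values with the same shape.
    """
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.logical_and(pred, target).sum()          # correctly predicted road pixels
    tn = np.logical_and(~pred, ~target).sum()        # correctly predicted non-road pixels
    fp = np.logical_and(pred, ~target).sum()         # non-road pixels predicted as road
    fn = np.logical_and(~pred, target).sum()         # road pixels predicted as non-road
    eps = 1e-8                                       # guard against division by zero
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * recall * precision / (recall + precision + eps)
    oa = (tp + tn) / (tp + tn + fp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    return {"OA": oa, "Precision": precision, "Recall": recall, "F1": f1, "IoU": iou}
```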

3.2. Datasets and Preprocessing

In order to investigate the effectiveness of the research method in this paper for road information extraction from high-resolution remote sensing images, two publicly available datasets are used for validation, namely the GF2-FC [24] road dataset and the CHN6-CUG [25] road dataset.
A.
GF2-FC road dataset
All images are from the GF2 satellite with a spatial resolution of 0.8 m. Compared with the existing dataset, 3.2 m multispectral images were added, covering the cities of Taiyuan, Shanghai, and Dalian. The prepared road dataset is then preprocessed to generate more training samples by flipping, cropping, and other operations, which enriches the diversity of the training data so that the model can better adapt to diverse scenes and changes. Finally, 6350 images of size 400 × 400 are selected as the experimental data and divided into training and test sets at an 8:2 ratio.
B.
CHN6-CUG road dataset
The CHN6-CUG dataset is from Google Earth, and each image has a resolution of 1 m and a size of 512 × 512 pixels. The dataset covers the areas of Shanghai, Macau, Hong Kong, Wuhan, Beijing, and Shenzhen in China. The road markings in the dataset take into account the pavement coverage, distinguishing between tracked and untracked pavements. Analyzed from the point of view of geographic and physical characteristics, the roads are classified to cover a wide range of categories such as railroads, highways, urban roads, as well as rural roads.
Similarly, multispectral images were obtained using down-sampling. Cropping, data augmentation, and other operations are applied before training the network model; finally, 6293 images of size 400 × 400 are selected as the experimental data and divided into training and test sets at an 8:2 ratio.
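A hypothetical preprocessing sketch consistent with the description above (random flips applied jointly to an image and its road mask, and an 8:2 split) is shown below; the exact augmentation operations used by the authors are not specified beyond flipping, cropping, and similar transforms.

```python
import random
import torch

def augment_pair(image, mask):
    """Apply random horizontal/vertical flips identically to an image and its road mask."""
    if random.random() < 0.5:
        image, mask = torch.flip(image, [-1]), torch.flip(mask, [-1])  # horizontal flip
    if random.random() < 0.5:
        image, mask = torch.flip(image, [-2]), torch.flip(mask, [-2])  # vertical flip
    return image, mask

def split_8_2(samples, seed=0):
    """Shuffle a list of sample identifiers and split it into 80% train / 20% test."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(0.8 * len(samples))
    return samples[:cut], samples[cut:]
```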

3.3. Ablation Experiments

(1)
GF2-FC road dataset
In this paper, for model training on the GF2-FC dataset, the image size of the training and test sets is uniformly set to 400 × 400 pixels, and the output results are ensured to be binarized images. All networks were set to the same experimental environment and parameters for a fairer comparison. Firstly, the unaltered MSNet network structure is used as a benchmark; then, the SCConv and SAFMM modules are added separately, and finally, the ablation experiments are conducted with the combination of MSNet, SCConv, and SAFMM.
From the ablation results on the GF2-FC dataset in Figure 7, it can be seen that Figure 7c, the unimproved MSNet model, produces incoherent road extraction results and is susceptible to interference from buildings and similar objects, resulting in poor extraction accuracy. Figure 7d shows the result after adding the SAFMM module: road connectivity is greatly improved, but the road boundaries are vaguer. Figure 7e shows the result after adding the SCConv module: the extraction accuracy is significantly improved and breaks are reduced, but there are more false extractions. Figure 7f shows the network proposed in this paper: after adding both the SCConv and SAFMM modules, the extracted road information is the most complete of all the results, road connectivity is greatly improved, the model is less susceptible to interference from buildings, and the road boundaries are clearer.
As can be seen in Table 1, the MSNet model without the SCConv and SAFMM modules has the lowest values for every evaluation metric, with 94.84% OA, 79.91% precision, 63.12% recall, a 70.18% F1-score, and 54.06% IoU. After incorporating the SAFMM module, all evaluation metrics improve. After adding the SCConv module, the metrics also improve, but the gains are relatively limited. The last row of the table shows the proposed SC-FMNet; with both the SCConv and SAFMM modules added, the road extraction performance is significantly improved and every metric reaches its highest value: compared with the original MSNet, OA improves by 3.73%, precision by 0.83%, recall by 9.35%, the F1-score by 5.83%, and IoU by 7.24%. Inserting both the SCConv and SAFMM modules into MSNet thus allows road information to be extracted from remote sensing images more accurately.
(2)
CHN6-CUG road dataset
In this paper, for model training on the CHN6-CUG dataset, the image size of the training and test sets is uniformly set to 400 × 400 pixels, and it is ensured that the output results are binarized images. All networks were set to the same experimental environment and parameters for a fairer comparison. Firstly, the unaltered MSNet network structure is used as a benchmark; then, SCConv and SAFMM are added separately, and finally, the combination of MSNet, SCConv, and SAFMM is used for ablation experiments.
As can be seen from the ablation results on the CHN6-CUG dataset in Figure 8, Figure 8c, the unimproved MSNet model, has low extraction accuracy in areas with dense roadside vegetation and buildings and for narrow roads, which are susceptible to background interference; the extracted roads also show breaks and lack detail. Figure 8d shows the result after adding the SAFMM module: the extraction accuracy is improved and dense, narrow areas are handled more accurately, but the road boundaries are not clear. Figure 8e shows the result after adding the SCConv module: main roads are extracted well, but branch roads and complex backgrounds still show discontinuities. Figure 8f shows the network proposed in this paper: after adding both the SCConv and SAFMM modules, the ability to obtain image features is strengthened, contextual feature information is captured more comprehensively, the model is not easily disturbed by vegetation and buildings, and the completeness of road extraction is higher.
As can be seen in Table 2, the MSNet model without the SCConv and SAFMM modules has the lowest values for all evaluation metrics, with an OA of 95.46%, a precision of 78.83%, a recall of 62.91%, an F1-score of 66.93%, and an IoU of 56.00%. With the addition of the SAFMM module, all evaluation metrics improve considerably; after adding the SCConv module alone, the improvement is less obvious. The last row of the table shows the proposed SC-FMNet; with both the SCConv and SAFMM modules added, all indexes reach their optimal values: compared with the original MSNet, OA improves by 0.7%, precision by 0.77%, recall by 3.95%, and the F1-score and IoU by 5.68% and 7.00%, respectively. Combining the SCConv and SAFMM modules can thus further improve the segmentation performance of the network model and enhance the segmentation effect.

3.4. Results and Comparison

(1)
GF2-FC road dataset
Figure 9 shows some of the prediction results of multiple comparison networks trained on the GF2-FC dataset. The figure shows, in order, the original high-resolution remote sensing image, the corresponding label map, and the prediction results of A2FPN, BANet, DCSwin, MANet, UNetFormer, ABCNet, and the proposed SC-FMNet, respectively.
As can be seen in Figure 9, in both simple and complex backgrounds, the results extracted by the networks in Figure 9d,e,h contain more isolated points, and road breakage is more obvious. During road segmentation, these networks are easily interfered with by nearby buildings, so the segmentation results fail to meet the expected target. The network in Figure 9c improves the accuracy of road segmentation and produces more accurate results, but breaks and burrs still occur. The networks in Figure 9f,g achieve higher extraction accuracy in simple backgrounds, but in complex and dense areas the extracted roads are not smooth, and the prediction results show impaired road integrity as well as misrecognition and omissions, which affect the extraction results. Comparative analysis shows that Figure 9j, the network proposed in this paper, is not easily interfered with by nearby buildings in either simple or complex backgrounds. Compared with Figure 9i, the extracted road information is more continuous, and misidentification and omission are reduced, which improves the robustness of the model as well as the accuracy of road extraction.
This section focuses on the comparison of experimental segmentation results using the network model SC-FMNet proposed in this paper with other network models on the GF2-FC road dataset. The experimental results of the commonly used evaluation metrics are shown in Table 3.
As shown in Table 3, compared with the other selected models, the proposed SC-FMNet achieves the best results in all metrics, with an OA of 98.57%, a precision of 80.74%, a recall of 72.47%, an F1-score of 76.01%, and an IoU of 61.30%; all evaluation indexes are improved relative to the other methods. Combined with the prediction results in Figure 9, it can be concluded that the proposed SC-FMNet improves the integrity of road extraction and can enhance the accuracy of road segmentation to some extent.
(2)
CHN6-CUG road dataset
As shown in Figure 10, some of the prediction results of the model trained using the CHN6-CUG dataset in multiple comparison networks are demonstrated. The original high-resolution remote sensing image, the corresponding labeled maps, and the prediction result maps of A2FPN, BANet, DCSwin, MANet, UNetFormer, ABCNet, and the SC-FMNet network proposed in this paper, respectively, are shown in the figure in order.
As can be seen from the comparison results in Figure 10, the networks in Figure 10e,g,h are ineffective for extracting both main and branch roads, especially the narrow, thin roads in areas covered by trees or buildings, and the extracted roads all have obvious breakpoints and discontinuities. The roads extracted in Figure 10c have relatively high accuracy and completeness, but omissions remain. Figure 10d,f are clearer for the main roads, but the roads extracted at road edges and narrow paths are interrupted and discontinuous, and the background is sometimes misidentified as road. Figure 10j, the network proposed in this paper, shows no isolated points compared with Figure 10i, and the extracted road boundaries are clearer: unoccluded roads are extracted with smooth edges, while occluded roads are extracted coherently without obvious omission. The extracted road detail is well represented, with good coherence.
This section focuses on the comparison of experimental segmentation results using the network model SC-FMNet proposed in this paper with other network models on the CHN6-CUG road dataset. The experimental results of the commonly used evaluation metrics are shown in Table 4.
As shown in Table 4, the proposed SC-FMNet achieves an OA of 96.16%, a precision of 79.60%, a recall of 66.86%, an F1-score of 72.61%, and an IoU of 63.10%. Overall, SC-FMNet has advantages in all evaluation indexes and extracts road features well, giving the model better segmentation of remote sensing images with higher extraction accuracy.

4. Conclusions

In this paper, a road extraction algorithm based on the Multi-spectral image-guided fusion of Spatial and Channel Features (SC-FMNet) is proposed. The SCConv module is first introduced into the encoding part of both branches, which yields higher road extraction accuracy and reduced computational complexity. The SAFMM module is then introduced into the decoding structure to improve image quality by adaptively adjusting features at each spatial location. SC-FMNet is compared with several network models on the GF2-FC and CHN6-CUG road datasets, and the experimental results show that SC-FMNet achieves good results on the road extraction task.
In summary, applying deep learning to road information extraction from high-resolution remote sensing images can significantly improve the recognition and understanding of road information in intelligent interpretation systems. However, its drawbacks and limitations should not be ignored, so it is necessary to continuously explore and optimize the structure and algorithms of deep learning models to improve their performance and adaptability in practical applications. In the future, we will continue to study the use of Multi-spectral images to improve road extraction from remote sensing images.

Author Contributions

Conceptualization, L.G.; Methodology, Y.Z.; Writing—original draft, L.G. and A.J. Foundation, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Liaoning Provincial Department of Education Youth Project, Gao Lin (NO.1030040000560), the Liaoning Province Applied Basic Research Program (Youth Special Project, 2023JH2/101600038), the Shenyang Youth Science and Technology Innovation Talent Support Program (RC220458), the Basic Research Special Funds for Undergraduate Universities in Liaoning Province (Guangxuan Program of Shenyang Ligong University (SYLUGXRC202216)), and the Basic Research Special Funds for Undergraduate Universities in Liaoning Province (LJ212410144067).

Data Availability Statement

The data presented in this study are openly available in the GF2-FC road dataset and the CHN6-CUG road dataset at http://cugurs5477.mikecrm.com/ZtMn5tR (accessed on 2 February 2025), reference numbers [24,25].

Acknowledgments

Thanks to all the editors and reviewers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lian, R.; Wang, W.; Mustafa, N.; Huang, L. Road extraction methods in high-resolution remote sensing images: A comprehensive review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 5489–5507. [Google Scholar] [CrossRef]
  2. Qian, W.; Li, Y.; Ye, Q.; Ding, W.; Shu, W. Disambiguation-based partial label feature selection via feature dependency and label consistency. Inf. Fusion 2023, 94, 152–168. [Google Scholar] [CrossRef]
  3. Parlak, B.; Uysal, A.K. A novel filter feature selection method for text classification: Extensive Feature Selector. J. Inf. Sci. 2023, 49, 59–78. [Google Scholar] [CrossRef]
  4. Gui, Y.; Li, D.; Fang, R. A fast adaptive algorithm for training deep neural networks. Appl. Intell. 2023, 53, 4099–4108. [Google Scholar] [CrossRef]
  5. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; part III 18. Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  6. Wang, B.; Chen, Z.; Wu, L. Road extraction from high-resolution remote sensing images with U-Net network taking connectivity into account. J. Remote Sens. 2020, 24, 1488–1499. [Google Scholar]
  7. Zhang, Z.; Liu, Q.; Wang, Y. Road extraction by deep residual u-net. IEEE Geosci. Remote Sens. Lett. 2018, 15, 749–753. [Google Scholar] [CrossRef]
  8. Wang, Y.; Peng, Y.; Li, W.; Alexandropoulos, G.C.; Yu, J.; Ge, D.; Xiang, W. DDU-Net: Dual-decoder-U-Net for road extraction using high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
  9. Gao, L.; Song, W.; Dai, J.; Chen, Y. Road extraction from high-resolution remote sensing imagery using refined deep residual convolutional neural network. Remote Sens. 2019, 11, 552. [Google Scholar] [CrossRef]
  10. Makhlouf, Y.; Daamouche, A.; Melgani, F. Convolutional Encoder-Decoder Network for Road Extraction from Remote Sensing Images. In Proceedings of the 2024 IEEE Mediterranean and Middle-East Geoscience and Remote Sensing Symposium (M2GARSS), Oran, Algeria, 15–17 April 2024; pp. 11–15. [Google Scholar]
  11. Song, Y.; Huang, T.; Fu, X.; Jiang, Y.; Xu, J.; Zhao, J.; Yan, W.; Wang, X. A novel lane line detection algorithm for driverless geographic information perception using mixed-attention mechanism ResNet and row anchor classification. ISPRS Int. J. Geo Inf. 2023, 12, 132. [Google Scholar] [CrossRef]
  12. Li, S.; Liao, C.; Ding, Y.; Hu, H.; Jia, Y.; Chen, M.; Xu, B.; Ge, X.; Liu, T.; Wu, D. Cascaded residual attention enhanced road extraction from remote sensing images. ISPRS Int. J. Geo Inf. 2022, 11, 9. [Google Scholar] [CrossRef]
  13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; 2017; Volume 30, Available online: https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 2 February 2025).
  14. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  15. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  16. Zhang, Z.; Miao, C.; Liu, C.; Tian, Q. DCS-TransUperNet: Road segmentation network based on CSwin transformer with dual resolution. Appl. Sci. 2022, 12, 3511. [Google Scholar] [CrossRef]
  17. He, B.; Song, Y.; Zhu, Y.; Sha, Q.; Shen, Y.; Yan, T.; Nian, R.; Lendasse, A. Local receptive fields based extreme learning machine with hybrid filter kernels for image classification. Multidimens. Syst. Signal Process. 2019, 30, 1149–1169. [Google Scholar] [CrossRef]
  18. Du, Y.; Sheng, Q.; Zhang, W.; Zhu, C.; Li, J.; Wang, B. From local context-aware to non-local: A road extraction network via guidance of multi-spectral image. ISPRS J. Photogramm. Remote Sens. 2023, 203, 230–245. [Google Scholar] [CrossRef]
  19. Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognit. 2020, 106, 107404. [Google Scholar] [CrossRef]
  20. Li, J.; Wen, Y.; He, L. Scconv: Spatial and channel reconstruction convolution for feature redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6153–6162. [Google Scholar]
  21. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 510–519. [Google Scholar]
  22. Sun, L.; Dong, J.; Tang, J.; Pan, J. Spatially-adaptive feature modulation for efficient image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 13190–13199. [Google Scholar]
  23. Tan, M.; Le, Q.V. Efficientnetv2: Smaller models and faster training. arXiv 2021, arXiv:2104.00298. [Google Scholar]
  24. Ren, B.; Ma, S.; Hou, B.; Hong, D.; Chanussot, J.; Wang, J.; Jiao, L. A dual-stream high resolution network: Deep fusion of GF-2 and GF-3 data for land cover classification. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102896. [Google Scholar] [CrossRef]
  25. Zhu, Q.; Zhang, Y.; Wang, L.; Zhong, Y.; Guan, Q.; Lu, X.; Zhang, L.; Li, D. A global context-aware and batch-independent network for road extraction from VHR satellite imagery. ISPRS J. Photogramm. Remote Sens. 2021, 175, 353–365. [Google Scholar] [CrossRef]
  26. Li, R.; Wang, L.; Zhang, C.; Duan, C.; Zheng, S. A2-FPN for semantic segmentation of fine-resolution remotely sensed images. Int. J. Remote Sens. 2022, 43, 1131–1155. [Google Scholar] [CrossRef]
  27. Tsai, F.J.; Peng, Y.T.; Tsai, C.C.; Lin, Y.Y.; Lin, C.W. Banet: A blur-aware attention network for dynamic scene deblurring. IEEE Trans. Image Process. 2022, 31, 6789–6799. [Google Scholar] [CrossRef]
  28. Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A novel transformer based semantic segmentation scheme for fine-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  29. He, P.; Jiao, L.; Shang, R.; Wang, S.; Liu, X.; Quan, D.; Yang, K.; Zhao, D. MANet: Multi-scale aware-relation network for semantic segmentation in aerial scenes. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  30. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  31. Liu, Y.; Chen, H.; Shen, C.; He, T.; Jin, L.; Wang, L. Abcnet: Real-time scene text spotting with adaptive bezier-curve network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9809–9818. [Google Scholar]
Figure 1. SC-FMNet network structure diagram.
Figure 2. SCConv structure diagram.
Figure 3. SRU structure diagram.
Figure 4. CRU structure diagram.
Figure 5. SAFMM structure diagram.
Figure 6. SAFM structure diagram.
Figure 7. Ablation results on the GF2-FC dataset.
Figure 8. Ablation results on the CHN6-CUG dataset.
Figure 9. Comparison chart of different network test results in the GF2-FC dataset.
Figure 10. Comparison of network test results in the CHN6-CUG dataset.
Table 1. Comparison of ablation results on the GF2-FC dataset.
Methods            OA (%)   P (%)    R (%)    F1 (%)   IoU (%)
MSNet              94.84    79.91    63.12    70.18    54.06
MSNet + SAFMM      95.95    79.64    70.35    73.77    58.44
MSNet + SCConv     95.02    79.50    66.69    71.41    55.53
SC-FMNet           98.57    80.74    72.47    76.01    61.30
Table 2. Comparison of ablation experiment results on the CHN6-CUG dataset.
Methods            OA (%)   P (%)    R (%)    F1 (%)   IoU (%)
MSNet              95.46    78.83    62.91    66.93    56.00
MSNet + SAFMM      96.08    79.44    65.38    71.80    57.00
MSNet + SCConv     95.54    76.98    62.96    69.37    56.25
SC-FMNet           96.16    79.60    66.86    72.61    63.10
Table 3. Comparison experiments of multiple models on the GF2-FC dataset.
Methods            OA (%)   P (%)    R (%)    F1 (%)   IoU (%)
A2FPN [26]         94.81    79.62    61.83    70.03    53.89
BANet [27]         93.96    79.03    51.05    62.21    45.15
DCSwin [28]        93.80    77.46    51.36    61.77    44.68
MANet [29]         94.77    78.62    58.56    69.45    53.20
UNetFormer [30]    94.40    72.46    57.37    66.62    49.95
ABCNet [31]        93.71    76.74    50.82    61.15    44.04
MSNet              94.84    79.91    63.12    70.18    54.06
SC-FMNet           98.57    80.74    72.47    76.01    61.30
Table 4. Comparison experiments of multiple models on the CHN6-CUG dataset.
Methods            OA (%)   P (%)    R (%)    F1 (%)   IoU (%)
A2FPN              95.38    77.22    59.20    63.25    55.29
BANet              94.99    54.03    52.81    58.13    47.98
DCSwin             93.77    69.39    42.50    52.71    45.79
MANet              94.11    72.98    44.19    55.05    49.20
UNetFormer         93.10    72.39    45.11    55.58    48.49
ABCNet             94.04    70.18    47.03    56.32    40.98
MSNet              95.46    78.83    62.91    66.93    56.00
SC-FMNet           96.16    79.60    66.86    72.61    63.10
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
