1. Introduction
Heterogeneous remote sensing image matching has emerged as a critical research frontier and serves as the fundamental process for geospatial positioning, navigation, and large-scale image stitching. With the rapid diversification of remote sensing platforms, robust cross-modal feature association among heterogeneous data (e.g., optical, SAR, and infrared) has become imperative. However, three inherent challenges persist: (1) nonlinear radiometric distortions caused by disparate imaging mechanisms, (2) divergent textural characteristics across modalities, and (3) nonlinear geometric deformations resulting from coupled sensor–environment interactions. These factors collectively violate the feature-consistency assumption of conventional matchers, especially under complex environmental disturbances and cross-sensor conditions [1,2].
Hand-crafted methods achieve local matching by constructing scale-invariant descriptors; however, they rely on the gradient-coherence assumption. Methods based on regional mutual information can mitigate differences in radiometric distributions, but they are sensitive to noise and local deformation, and their performance degrades markedly in low-overlap scenarios. In recent years, deep learning has partially bridged the modal gap through end-to-end feature embedding, but effective feature extraction networks are computationally complex and struggle to meet real-time constraints when processing large-format images [3].
To address these difficulties in heterogeneous image matching, we use a road vector image as the base map, which has the following advantages: (1) The vector image is a binary image generated from vector data and is therefore unaffected by sensors, imaging mechanisms, and similar factors. (2) Vector data can be transformed losslessly, so differences in viewpoint and changes in the scene between the vector image and the real-time image can be eliminated. (3) The vector data itself has no inherent resolution, so the generated vector image can be rendered at a resolution compatible with the real-time image. (4) For the same search range, the vector image requires less storage and less matching computation, so matching is faster.
For the real-time image, SAR images are well suited to matching navigation because of their all-day, all-weather imaging capability. Owing to their unique imaging mechanism, SAR images record backscattering characteristics. The backscatter of road targets is so weak that roads appear as dark bar-like areas in high-resolution SAR images [4], a feature that is easy to identify and match.
Therefore, we propose to match SAR images with road vector images for navigation. Several difficulties remain in this research: (1) Deep learning requires a large amount of data for support, and SAR–vector image datasets are currently scarce. (2) The SAR image contains rich texture information and coherent speckle noise, while the vector image contains only sparse road information, which makes matching difficult. (3) The road vector image represents the road centerline, so the width of the road is difficult to recover; matching therefore depends more on road direction and length and is insensitive to road width. (4) Some unnamed roads may be missing from the road network vector data, so it cannot correspond completely to the real-time image.
Deep learning methods require a large amount of training data. Since most SAR images come from satellite and airborne radar, their acquisition cost is significantly higher than that of optical images, and publicly available SAR datasets are relatively limited. The SEN1-2 dataset [5] published SAR–optical image pairs, after which deep learning research on alignment algorithms [6], matching algorithms [7], data fusion [8], and detection and recognition [9] developed rapidly. Deep learning algorithms for SAR ship target detection advanced further after the release of SSDD [10], OpenSARShip [11], SAR Dataset of Ship Detection [12], and AIR-SARShip-1.0 [13]. Only two datasets, SARroad and sar_dataset, currently contain SAR images with corresponding road vectors, and their data volume is still far from meeting the needs of deep learning.
In order to solve the above problems, we constructed a SAR-VEC dataset and proposed a Siamese U-Net dual-task supervised network. The main contributions of this paper are as follows:
From the perspective of image source selection, the use of SAR images and road vector images for matching navigation tasks is proposed.
A dual-task supervised network combining a matching loss and a segmentation loss is proposed to match SAR images with road vector images for navigation. This suggests that jointly supervising different tasks that extract similar features can effectively improve accuracy.
A SAR-VEC dataset containing SAR images, road vector images, and the corresponding labels is constructed for the experiments.
2. Related Work
2.1. SAR Image Matching
The matching of SAR images with optical images has been widely studied in heterogeneous image matching; SAR images generated in real time by the flight platform are usually matched against optical satellite reference images. Ma et al. [14] proposed a two-step alignment method based on deep features and local features, which first approximates the spatial relations using the feature layers of a deep network (e.g., VGG-16) and then applies a matching strategy that takes these spatial relationships into account to local SIFT features, improving alignment accuracy. Hughes et al. [15] proposed a pseudo-Siamese network for high-resolution images, which can partially solve the matching difficulty caused by the top-to-bottom inversion of high-rise buildings in SAR images. Hoffmann et al. [16] used a fully convolutional network (FCN) to measure image similarity; its input is not limited by image size, so it can be applied to images of various sizes without scaling during preprocessing. Hughes et al. [17] also designed a unified end-to-end framework covering matching-region selection, template matching, and mismatch elimination. The difficulty of feature matching caused by geometric deformations (e.g., asymmetric distortions) and texture anisotropy in heterogeneous image alignment still needs to be addressed [18].
2.2. Vector Image Matching
Traditional matching algorithms for vector images often extract road intersections, road skeletons, or edges as features and characterize them with geometric attributes such as points and lines to complete the matching. Costea [19] detected road intersections in optical images and matched them with the road intersections in vector images to align optical remote sensing images with vector images. Li et al. [20,21,22] proposed using road intersections and their tangents as global invariant features, together with edge features with projection invariance, for matching road network maps with aerial data.
Deep learning methods have also been applied to vector image matching. Shi [23] proposed a generative adversarial network-based image transformation method, which employs a loss function based on edge distribution so that the real-time image and the vector reference image have stylistically consistent edges; the real-time image is converted toward the vector reference image to reduce the difference between the two and complete the matching. Wei [24] proposed an end-to-end Siamese U-Net template-matching method (Siam-U), which uses the skip connections between the U-Net encoder and decoder to preserve the resolution and location information lost in repeated CNN pooling layers, providing convolutional features suitable for matching and addressing the problem of directly matching geographic vector images with optical images.
2.3. SAR Image Road Extraction
There are many kinds of traditional SAR image road segmentation methods; the most commonly used are edge detection [25] and region segmentation [26]. The former focuses on the linear features of road edges, while the latter focuses on the regional features of the road surface.
Deep learning has also been applied to SAR image road extraction. Li et al. [27] used a CNN model for feature extraction and matching, and post-processed the extracted road candidate regions using an MRF and an improved Radon algorithm. The MSPP model [28] proposes a Global-Attention Fusion (GAF) module with two branches: one global-average-pools the high-level feature maps to form an attention map, while the other first passes the low-level feature maps through a bottleneck architecture to obtain shallow features; the attention map is then fused with the shallow feature maps to obtain weighted feature maps (a sketch of this idea is given below).
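To make this description concrete, the following is a minimal PyTorch sketch of the GAF idea under our own assumptions; the channel widths, bottleneck depth, and sigmoid gating are illustrative choices, not the MSPP authors' exact design.

```python
import torch
import torch.nn as nn

class GlobalAttentionFusion(nn.Module):
    """Hypothetical sketch of the GAF idea: a global-average-pooled high-level
    map gates a bottleneck-refined low-level map (illustrative, not the exact
    MSPP design)."""

    def __init__(self, high_ch: int, low_ch: int):
        super().__init__()
        # Branch 1: squeeze high-level features into a per-channel attention vector.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.attn = nn.Sequential(nn.Conv2d(high_ch, low_ch, 1), nn.Sigmoid())
        # Branch 2: a small bottleneck that pre-refines the low-level features.
        self.bottleneck = nn.Sequential(
            nn.Conv2d(low_ch, low_ch // 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(low_ch // 2, low_ch, 1), nn.ReLU(inplace=True),
        )

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        attention = self.attn(self.pool(high))  # (B, low_ch, 1, 1) attention map
        shallow = self.bottleneck(low)          # refined shallow features
        return shallow * attention              # weighted feature map
```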
3. Dataset
To facilitate the development of deep learning methods for SAR–geo-vector data fusion, it is crucial to be able to acquire large datasets of fully aligned images or image patches. However, collecting large volumes of accurately aligned image pairs remains challenging, especially since SAR imagery is less accessible and more costly to obtain than optical data, and public datasets contain almost no SAR data targeting roads. To promote research on heterogeneous matching between SAR images and geographic vector images, this paper constructs a public sample dataset of SAR images and geographic vector images of the corresponding roads based on Gaofen-3 satellite data, named SAR-VEC.
3.1. Data Sources
The SAR strip (GF3_SAY_UFS_000605) was acquired on 20 September 2016 in Ultra-Fine Strip mode, C-band (5.4 GHz, λ ≈ 5.6 cm), HH-polarisation, 3 m ground-range resolution, incidence angle ≈25°, centred at 114.5°E, 30.5°N (Hubei Province).
The road geo-vector data come from OpenStreetMap (OSM), complemented and cross-checked against the road network vector data of WeMap to ensure the completeness of the road network.
3.2. Data Processing
SAR images were preprocessed using PIE-SAR 7.0 software, including SLC-to-intensity conversion, noise-reduction filtering, geometric correction, and geocoding. The geographic vector data were vectorized, geocoded, and rendered into vector images using QGIS. The geocoded SAR image was coarsely aligned to the vector image and then precisely aligned by hand. The tightly aligned SAR image and vector image are sliced into data pairs, and each data pair contains: (1) a 512 × 512 geographic vector image of the road network as the base map; (2) a 256 × 256 SAR image at 3 m resolution as the real-time image; (3) the coordinates of the upper-left corner of the SAR image within the vector image, used to generate the position label image, in which positions whose true Euclidean distance to the matching location is within three pixels are set to 1 and the rest to 0; (4) a 256 × 256 geographic vector image of the road network corresponding to the SAR image, used as the road segmentation supervision label.
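As an illustration of item (3), the following sketch generates the binary position label map; the function name, the coordinate convention, and the label-map size of (512 − 256 + 1)² valid offsets are our assumptions.

```python
import numpy as np

def make_position_label(true_xy, base_size=512, patch_size=256, radius=3.0):
    """Minimal sketch of the position-label construction described above.

    The label map covers every valid top-left placement of the 256x256 SAR
    patch inside the 512x512 vector base map; pixels whose Euclidean distance
    to the true top-left corner is within `radius` are set to 1, the rest to 0.
    """
    side = base_size - patch_size + 1                # 257 valid offsets per axis
    ys, xs = np.mgrid[0:side, 0:side]
    tx, ty = true_xy
    dist = np.sqrt((xs - tx) ** 2 + (ys - ty) ** 2)  # distance to the true match
    return (dist <= radius).astype(np.float32)

label = make_position_label(true_xy=(120, 84))       # hypothetical ground-truth offset
```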
The preliminary set of data pairs may have the following problems: (i) the SAR image itself is prone to severe coherent speckle noise, and some images also suffer from defocusing; (ii) road targets in the SAR image are easily disturbed by roadside trees, buildings, green belts, and other objects, so some roads appear severely broken, occluded, or mixed with the background; (iii) some roads are missing in the vector image and do not fully correspond to the SAR image. After removing these three types of bad data pairs, the final SAR-VEC dataset contains 2503 data pairs. An example data pair is shown in Figure 1; see Appendix A.1 for detailed dataset preparation.
4. Methods
In this paper, a Siamese U-Net dual-task supervised network (SUDS) is proposed, as shown in Figure 2. The network consists of three parts: feature extraction, matching supervision, and road extraction supervision.
4.1. Feature Extraction
Road targets exhibit regular geometric properties, such as edges, orientation, shape, and network topology. Owing to their unique imaging mechanism, SAR images record backscattering characteristics, and the backscatter magnitude depends on the complex dielectric constant, surface roughness, and other target parameters. The backscatter of road targets is very weak, so roads appear as dark bar-like areas in high-resolution SAR images [4]. Water bodies such as rivers often show similar dark bar-like regions and need to be distinguished from roads. In addition, under complex road conditions, ground moving targets appear blurred and shifted in the image, forming highlight interference that covers part of the road and makes it appear broken, as shown in Figure 3. Meanwhile, the vector image is a simple binary image containing only sparse road network information. How to extract features common to the two modalities is therefore the key problem.
The difficulty of heterogeneous image matching lies in the large differences between the images and the difficulty of extracting homogeneous features. Some studies [23,29] convert one type of image into the other by learning the pixel distribution between heterogeneous images, and then apply a homogeneous matching algorithm once the images share the same style. Following this idea, we could convert SAR images into vector-style binary images and then match them with vector images, which resembles a road extraction task. In reality, however, road extraction is only an intermediate means to complete the matching: we do not pursue the accuracy of the extracted roads but care about whether it contributes to the matching result. Therefore, we add road segmentation as an additional supervision to the network structure.
Previous research has shown that Siamese networks can better extract the common features of heterogeneous images, while the encoder–decoder structure of U-Net can better extract road network features from SAR images [30]. We therefore propose the Siamese U-Net dual-task supervised network. Using the Siamese structure, the SAR real-time image and the vector reference image are fed into the two branches of the convolutional network separately so as to extract similar structural features. Although the feature extraction network has two branches, their weights are shared. Weight sharing forces both branches to learn common spatial descriptors, which reduces the risk of overfitting to SAR-specific speckle patterns without requiring additional regularization terms.
The feature extraction consists of nine modules: a convolutional input module, four downsampling modules, and four upsampling modules. Each upsampling module receives both the output of the preceding layer and, through a skip connection, the feature map of the corresponding downsampling module. Through this skip-connected cascade, the deep features contain both high-level and low-level information. The same encoder feature maps are reused by the segmentation head, adding only one 1 × 1 convolution kernel's worth of memory at inference. A minimal sketch of this trunk follows.
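Below is a minimal, self-contained PyTorch sketch of such a weight-shared U-Net trunk; the channel widths, normalization layers, and block layout are illustrative assumptions, while the actual layer parameters are listed in Appendix A.2.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class UNetTrunk(nn.Module):
    """Illustrative U-Net: one input block, four down- and four up-sampling
    modules with skip connections (channel widths are assumptions)."""

    def __init__(self, in_ch=1, base=32):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8, base * 16]
        self.inc = conv_block(in_ch, chs[0])
        self.downs = nn.ModuleList(conv_block(chs[i], chs[i + 1]) for i in range(4))
        self.pool = nn.MaxPool2d(2)
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(chs[4 - i], chs[3 - i], 2, stride=2) for i in range(4))
        self.decs = nn.ModuleList(conv_block(chs[4 - i], chs[3 - i]) for i in range(4))

    def forward(self, x):
        skips, h = [], self.inc(x)
        for down in self.downs:                        # encoder: keep skip features
            skips.append(h)
            h = down(self.pool(h))
        for up, dec, skip in zip(self.ups, self.decs, reversed(skips)):
            h = dec(torch.cat([up(h), skip], dim=1))   # skip connection
        return h                                       # full-resolution feature map

# Siamese use: the SAME trunk (shared weights) embeds both modalities.
trunk = UNetTrunk()
sar_feat = trunk(torch.randn(1, 1, 256, 256))          # SAR real-time image
vec_feat = trunk(torch.randn(1, 1, 512, 512))          # vector reference image
```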
The specific network layer parameters are given in Appendix A.2.
4.2. Dual-Task Supervised Loss Function
In the network output section, we divide the final output into two parts, representing two different tasks, to supervise the network's feature extraction.
The first part is the matching output. A 1 × 1 convolutional layer is applied to extract the SAR feature image and the vector feature image, respectively; the correlation heat map of the two is then obtained by convolution, and the peak of the correlation heat map is taken as the matching result. In the matching task, we use the class-balanced cross-entropy loss function $L_m$:

$$L_m = -\frac{1}{|Y|}\Big(\beta \sum_{(i,j)\in Y_+} \log \hat{y}_{ij} + (1-\beta) \sum_{(i,j)\in Y_-} \log\big(1-\hat{y}_{ij}\big)\Big), \qquad \beta = \frac{|Y_-|}{|Y_+| + |Y_-|},$$

where $Y$ represents the binary map of the matching labels of the image pair, in which all positions whose Euclidean distance to the true matching position is within three pixels are set to 1 and the rest to 0; $W_v$, $H_v$ and $W_s$, $H_s$ denote the width and height of the vector image and the SAR image, respectively, so the heat map has size $(H_v - H_s + 1) \times (W_v - W_s + 1)$; $Y_+$ and $Y_-$ denote the sets of correct and wrong match position labels, respectively; $\hat{Y}$ represents the predicted heat map output by the network, and $\hat{y}_{ij}$ denotes its value at position $(i,j)$, with $y_{ij}$ defined analogously for the label map.
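The following sketch illustrates this matching head and loss in PyTorch; the tensor shapes, the sigmoid on the heat map, and the numerical epsilon are our assumptions, written to be consistent with the reconstruction above.

```python
import torch
import torch.nn.functional as F

def match_heatmap(vec_feat, sar_feat):
    """Cross-correlate the SAR feature map (used as the kernel) over the vector
    feature map to obtain the correlation heat map (single-batch assumption)."""
    # vec_feat: (1, C, Hv, Wv); sar_feat: (1, C, Hs, Ws)
    heat = F.conv2d(vec_feat, sar_feat)          # (1, 1, Hv-Hs+1, Wv-Ws+1)
    return torch.sigmoid(heat)

def class_balanced_bce(pred, label, eps=1e-7):
    """Class-balanced cross-entropy: the few positive pixels near the true
    match are up-weighted by the negative fraction beta."""
    pos, neg = label > 0.5, label <= 0.5
    beta = neg.float().sum() / label.numel()     # |Y-| / (|Y+| + |Y-|)
    loss = -(beta * torch.log(pred[pos] + eps).sum()
             + (1 - beta) * torch.log(1 - pred[neg] + eps).sum())
    return loss / label.numel()
```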
The second part is the segmentation output. The road segmentation map of the SAR image is obtained using a 3 × 3 convolutional layer and a sigmoid activation function, which is used as the road segmentation result. The sum of the binary cross-entropy loss ($L_{bce}$) and the Dice coefficient loss ($L_{dice}$) is used as the loss function $L_s$, and the model is optimized using the Adam optimizer:

$$L_{bce} = -\frac{1}{n}\sum_{k=1}^{n}\big[y_k \log p_k + (1-y_k)\log(1-p_k)\big], \qquad L_{dice} = 1 - \frac{2\,|A \cap B|}{|A| + |B|}, \qquad L_s = L_{bce} + L_{dice},$$

where $n$ denotes the total number of pixels and $\log p_k$ denotes the log probability of the predicted category for pixel $k$; $|A|$ and $|B|$ denote the number of road pixels in the binary prediction map $A$ and the label map $B$, respectively, and $|A \cap B|$ denotes the number of pixels in their intersection.
The final loss function is the weighted sum of the matching loss $L_m$ and the segmentation loss $L_s$:

$$L = b\,L_m + (1-b)\,L_s,$$

where $b$ is a weighting factor. Unlike single-task losses, the joint formulation allows gradients from the segmentation task to regularize the matching features, eliminating the need for extra post-processing filters. The value of $b$ determines the proportion of the matching loss relative to the segmentation loss during training; we discuss its choice in the subsequent sections.
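A companion sketch of the segmentation and combined losses follows; the soft-Dice formulation and epsilon smoothing are common choices assumed here, and `class_balanced_bce` refers to the matching-loss sketch above.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2|A∩B| / (|A| + |B|), computed on probabilities."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def dual_task_loss(heat_pred, heat_label, seg_pred, seg_label, b):
    """L = b * Lm + (1 - b) * Ls; the choice of b is discussed in Section 6.
    `class_balanced_bce` is the matching-loss sketch given above."""
    l_m = class_balanced_bce(heat_pred, heat_label)
    l_s = F.binary_cross_entropy(seg_pred, seg_label) + dice_loss(seg_pred, seg_label)
    return b * l_m + (1 - b) * l_s
```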
5. Experiments and Results
In this section, we design experiments to verify the feasibility of matching SAR images with vector images using the Siamese U-Net dual-task supervised network (SUDS). Our proposed SUDS is evaluated against existing matching methods: Normalized Cross-Correlation (NCC), MatchNet [31], Unet++ [32], and Siam-U [7].
5.1. Experimental Settings and Evaluation Metrics
In terms of hardware and environment, an Intel Core i9-13900K CPU and an NVIDIA RTX 4090 GPU were used throughout the experiments for training and testing; Python 3.7 was used for programming, and PyTorch 1.13.0+cu116 was used to build the network. The network was optimized with the Adaptive Moment Estimation (Adam) optimizer, with the initial learning rate set to 0.0001, the weight decay coefficient to 0.00005, and the number of training epochs to 50.
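For reference, a minimal sketch of this optimizer configuration; the one-layer `model` here is only a stand-in for the SUDS network.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 1, 3, padding=1)  # placeholder for the SUDS network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-5)

for epoch in range(50):  # 50 training epochs, as stated above
    # ...forward pass, dual-task loss, loss.backward(), optimizer.step()...
    pass
```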
We randomly divide the 2503 image pairs of the SAR-VEC dataset presented in the previous section into a training set of 2003 pairs and a test set of 500 pairs. In the training set, the 256 × 256 SAR images are used as real-time images and the 512 × 512 vector images as reference images; label positions within three pixels (Euclidean distance) of the true matching location are set to 1 and the rest to 0, forming the binary matching label map, and the 256 × 256 vector images are used as road segmentation labels.
In this paper, we use the root-mean-square error $RMSE$, the correct matching rate within an error threshold $T$, $CMR(T)$, and the standard deviation of the matching accuracy $\sigma$ to evaluate the matching performance of the model:

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right]}, \qquad CMR(T) = \frac{N_T}{N},$$

where $N$ is the total number of test samples, $(\hat{x}_i, \hat{y}_i)$ denotes the coordinates obtained by network matching for the $i$-th group of test data, $(x_i, y_i)$ denotes the true coordinates of the $i$-th group, and $N_T$ is the number of groups among the $N$ test samples whose matching error is less than the threshold $T$.
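A compact sketch of these metrics follows; the function name and the (N, 2) coordinate layout are our assumptions.

```python
import numpy as np

def evaluate_matching(pred_xy: np.ndarray, true_xy: np.ndarray, thresholds=(5, 10)):
    """Compute RMSE, CMR(T), and the standard deviation of per-sample errors.
    pred_xy, true_xy: (N, 2) arrays of predicted and true match coordinates."""
    err = np.linalg.norm(pred_xy - true_xy, axis=1)          # per-sample Euclidean error
    rmse = float(np.sqrt(np.mean(err ** 2)))
    cmr = {t: float(np.mean(err < t)) for t in thresholds}   # fraction with error < T
    return rmse, cmr, float(np.std(err))
```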
5.2. Matching Comparison
To compare different methods and demonstrate the superiority of our method (SUDS), we evaluated four related matching methods: NCC, MatchNet, Unet++, and Siam-U. To ensure that the experimental results are objective and fair, we retrained the Siam-U network under the same conditions as SUDS.
An example of a successfully matched heat map is shown in Figure 4. In this example, the SAR image has clear road features and the vector image has strong road distinctiveness, making false matches unlikely; it is an ideal matching input. In the NCC heat map, high values appear as regions; in the MatchNet heat map, high values appear as lines and the peak is not obvious. In contrast, the Unet++, Siam-U, and SUDS heat maps show clear point-like peaks, indicating that U-Net-series networks perform better on this task. The peaks in the SUDS heat map are the most concentrated, with the most prominent contrast, indicating the best matching performance.
Matching results for three sets of typical interference images are shown in Figure 5. NCC fails to produce correct matches in all three sets. The MatchNet heat maps show obvious line features with too low a peak ratio; the matching results are easily disturbed, and the matching error is significantly higher than that of the U-Net-based methods. We therefore focus on comparing the Unet++, Siam-U, and SUDS methods.
In group 1, the vector image contains straight roads with multiple similar intersections, which easily produce false matches. Both the Unet++ and Siam-U results have multiple local extrema, while the SUDS result has only a single peak. In group 2, the SAR image contains water bodies that resemble roads, causing interference. The Unet++ result has multiple local extrema, and the Siam-U result has high local heat values caused by the water bodies, while the SUDS result has a prominent, clearly defined peak. In group 3, there is highlight interference on the right side of the road in the SAR image and similar highlighted buildings at the top of the image. The Unet++ result has high local heat values with no prominent peak; Siam-U is disturbed by the interference, showing two similar local extrema and an inaccurate match; SUDS suppresses the interference well, with a clear single peak and an accurate matching result.
The above results show that the proposed dual-task supervision helps the network focus on road information through the segmentation supervision. When the image contains interference, the network can extract the key features more accurately. Through the Siamese connection, the matching module shares the field of view of the segmentation module, which improves matching performance. SUDS can therefore overcome the water-body and highlight interference caused by the SAR imaging principle, correctly extract road information, and distinguish similar roads, obtaining more accurate matching results with better robustness.
The test results of all methods are shown in Table 1. The SUDS network is able to complete the task of matching SAR images with vector images with a high correct matching rate, outperforming the other methods in terms of root-mean-square error and correct matching rate, with an 80.2% correct rate within 5 pixels and 91.0% within 10 pixels.
As for matching time, although NCC and MatchNet are faster, their matching errors are larger and their correct rates lower. While maintaining accuracy, SUDS is faster than Unet++ and Siam-U and can better meet the real-time matching demands of navigation.
Figure 6 presents the distribution of localization errors, consistent with the previous conclusions. The histogram shows that most samples have errors between 0 and 5 pixels, with the largest share of samples, over 30%, having errors between 0 and 1 pixel. This indicates that the model achieves high localization accuracy within small error ranges.
The CDF curve shows that 80.2% of samples have errors no greater than 5 pixels, and 91.0% have errors no greater than 10 pixels. This further substantiates the model’s capability to achieve high-precision localization in most cases.
The KDE curve smoothly outlines the shape of the error distribution, further emphasizing the concentration of errors in a smaller range and the pronounced decrease in sample counts with increasing error.
Limitations of the current study and representative failure cases are summarised in Appendix A.3.
6. Discussion
In this section, we discuss how the weighting parameter $b$ affects network training and how to choose an appropriate value. The networks were trained and tested with the same settings as in the previous section, with the weighting parameter $b$ set to 0, 0.2, 0.4, 0.6, 0.8, and 1, respectively. Two examples of network outputs are shown in Figure 7.
From the road segmentation results, when $b = 1$ the network degrades to a U-Net matching network, which can hardly segment the road. When $0 < b < 1$, as $b$ increases, the weight of the matching loss increases, the jagged interference in the road segmentation image gradually decreases, and the segmented image becomes smoother. This indicates that increasing the weight of the matching loss is beneficial to the road segmentation task.
From the matching results, when $b = 0$ the network degrades to a U-Net segmentation network, which can hardly produce a correlation heat map with clear peaks. When $b = 1$ the network degrades to a U-Net matching network, which is able to obtain relatively accurate correlation heat maps. When $0 < b < 1$, the addition of the segmentation loss enhances the boundary alignment of the network, resulting in more prominent heat-map peaks and less background noise. This suggests that the presence of the segmentation loss is beneficial to the matching task.
In fact, we are more concerned with the matching performance of the model in the navigation task; detailed matching statistics are given in Table 2.
As can be seen from Table 2, when $b = 0$ the network contains only the road segmentation loss and degrades to a U-Net segmentation network. At this point the root-mean-square error is the largest and $CMR$ is close to 0, so the network can hardly complete the matching task.
When $b = 1$, the network contains only the matching loss and degrades to a U-Net matching network, achieving a correct matching rate of 69.2% within 5 pixels and 86.6% within 10 pixels.
When $0 < b < 1$, the network contains both the road segmentation loss and the matching loss, and the correct matching rate within 5 pixels is higher than in the single-task cases. This shows that adding segmentation loss supervision to the matching network can effectively improve the correct matching rate.
For some intermediate values of $b$, although the correct matching rate within 5 pixels is higher, the root-mean-square error is also larger than at other settings (Table 2). These effects may be related to the behavior of the matching and segmentation losses. The matching loss directly adjusts the gradient ratio of positive and negative samples through a fixed weighting parameter; the larger gradient contribution of the positive samples accelerates the model's learning of the minority class. When the proportion of the matching loss is small, the network may fall into a local optimum and lack the ability to match some difficult samples. The settings at which both the correct matching rate and the root-mean-square error perform well can be read from Table 2.
In addition, in terms of gradient propagation and convergence speed, the matching-loss gradient is linearly related to the error with a clear direction, so its optimization is more stable, whereas the segmentation-loss gradient is inversely proportional to the prediction error, which may cause instability in early training and slower convergence. The convergence curves of the two losses over the training epochs are shown in Figure 8.
In our experiments, the best matching result is obtained at an intermediate value of $b$ (Table 2). When facing different tasks and loss functions, the value of the parameter can be adjusted according to the ratio of the two losses in the initial epochs.
7. Conclusions
In this paper, we propose matching SAR images with road vector images, construct the SAR-VEC dataset, and propose a Siamese U-Net dual-task supervised network for this matching task. We add a segmentation loss to the matching task and use a weight-sharing U-Net to extract the road information common to the SAR and vector images. The weighted sum of the segmentation loss and the matching loss is used as the network loss, and this joint supervision improves the feature extraction capability of the network and thus the matching accuracy. Comparing against NCC, MatchNet, Unet++, and Siam-U, we find that SUDS better suppresses the interference in SAR images, identifies the differences between similar roads, and obtains more accurate and more robust matching results. Experiments show that SUDS outperforms the other methods in terms of root-mean-square error and correct matching rate, with 80.2% of matches correct within 5 pixels and 91.0% within 10 pixels.
By exploring the value of the weighting factor $b$, we find that dual-task loss supervision benefits not only the matching task but also the segmentation task, with the best matching results obtained at an intermediate value of $b$. When facing different tasks and loss functions, the value of $b$ can be adjusted according to the ratio of the two losses in the initial epochs. This suggests a direction for future research: jointly supervising different tasks that extract similar features can effectively improve accuracy.
The proposed SUDS approach requires further refinement to cope with more complex scenarios; matching becomes more challenging when roads in the SAR image are occluded or deformed, and the matching algorithm can be further optimized to handle such cases. In addition, in this paper we rasterize the vector data into vector images before matching; constructing a matching network for heterogeneous data that matches vector data directly with raster images may achieve better results.