Article

Robust Optical and SAR Image Matching via Attention-Guided Structural Encoding and Confidence-Aware Filtering

1 School of Environment and Spatial Informatics, China University of Mining and Technology, Xuzhou 221116, China
2 Advanced Remote Sensing Research Institute, Moganshan Geospatial Information Laboratory, Huzhou 313299, China
3 National Geomatics Center of China, Beijing 100830, China
4 Chinese Academy of Surveying & Mapping, Beijing 100830, China
5 School of Geomatics and Urban Spatial Informatics, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(14), 2501; https://doi.org/10.3390/rs17142501
Submission received: 23 June 2025 / Revised: 15 July 2025 / Accepted: 17 July 2025 / Published: 18 July 2025
(This article belongs to the Special Issue Advancements of Vision-Language Models (VLMs) in Remote Sensing)

Abstract

Accurate feature matching between optical and synthetic aperture radar (SAR) images remains a significant challenge in remote sensing due to substantial modality discrepancies in texture, intensity, and geometric structure. In this study, we proposed an attention-context-aware deep learning framework (ACAMatch) for robust and efficient optical–SAR image registration. The proposed method integrates a structure-enhanced feature extractor, RS2FNet, which combines dual-stage Res2Net modules with a bi-level routing attention mechanism to capture multi-scale local textures and global structural semantics. A context-aware matching module refines correspondences through self- and cross-attention, coupled with a confidence-driven early-exit pruning strategy to reduce computational cost while maintaining accuracy. Additionally, a match-aware multi-task loss function jointly enforces spatial consistency, affine invariance, and structural coherence for end-to-end optimization. Experiments on public datasets (SEN1-2 and WHU-OPT-SAR) and a self-collected Gaofen (GF) dataset demonstrated that ACAMatch significantly outperformed existing state-of-the-art methods in terms of the number of correct matches, matching accuracy, and inference speed, especially under challenging conditions such as resolution differences and severe structural distortions. These results indicate the effectiveness and generalizability of the proposed approach for multimodal image registration, making ACAMatch a promising solution for remote sensing applications such as change detection and multi-sensor data fusion.

1. Introduction

Remote sensing image registration serves as a fundamental prerequisite for advanced applications such as multi-source data fusion, change detection, and object recognition. It is particularly critical in scenarios such as natural resource surveys, environmental monitoring, meteorological forecasting, homeland security, and disaster response [1]. With the rapid development of various remote sensing platforms—especially optical and SAR sensors—multimodal image processing has become a growing trend in remote sensing analysis. Optical imagery, due to its high visual fidelity and alignment with human perception, has been extensively used for surface interpretation and object identification [2]. However, it is often hindered by adverse weather conditions, such as cloud cover, which limits its usability during rainy or cloudy seasons. In contrast, SAR imagery, which relies on active microwave emission, is capable of penetrating clouds and vegetation to capture stable surface observations under all-weather and all-day conditions [3]. Given their complementary characteristics, the joint use of optical and SAR images can provide a more comprehensive and robust understanding of the Earth’s surface, offering significant advantages in tasks such as image fusion [4], change detection [5], and land cover classification [6]. High-precision registration between optical and SAR images is thus crucial for effective multi-source data integration.
Nevertheless, achieving accurate optical–SAR registration remains a challenging task due to substantial differences in imaging mechanisms, geometric structures, and radiometric properties. These disparities result in non-linear intensity differences, geometric distortions, and the presence of speckle noise in SAR images, which collectively hinder reliable matching [7]. Figure 1 illustrates three representative optical–SAR image pairs from diverse landscapes and resolutions, clearly showing their significant discrepancies in texture, edge definition, and radiometric patterns. Figure 1a shows the considerable radiometric discrepancy in water regions between optical and SAR modalities; Figure 1b highlights the inconsistent texture patterns in urban areas; and Figure 1c illustrates the structural blurring caused by speckle noise, which is prevalent in SAR images. These discrepancies substantially increase the difficulty of accurate registration across modalities.
Traditional image registration methods can be broadly categorized into intensity-based approaches (e.g., mutual information (MI) [8], normalized cross-correlation (NCC) [9], histogram of orientated phase congruency (HOPC) [10], channel features of orientated gradients (CFOG) [11]) and feature-based approaches (e.g., scale-invariant feature transform (SIFT) [12], speeded-up robust features (SURF) [13], phase congruency-based rotation-invariant feature transform (RIFT) [14], and optical-to-SAR SIFT (OS-SIFT) [15]). The former methods depend on statistical measures derived from image intensity or phase information to estimate optimal transformation parameters. The latter detect local features such as corners or edges and construct descriptors for geometric model estimation. However, due to the limited expressiveness of handcrafted features and their susceptibility to non-linear modality differences, these methods struggle to deliver satisfactory performance in complex cross-modal settings.
To overcome these challenges, deep learning has been widely adopted for remote sensing image matching, offering significantly enhanced feature representation capabilities. Current deep learning-based methods fall into two main categories: (1) descriptor learning methods that directly extract modality-invariant features from optical and SAR images (e.g., MatchNet [16], HardNet [17], CMM-Net [18], ADRNet [19]); and (2) modality conversion methods that leverage generative adversarial networks (GANs) [20,21] to translate images from one modality to another, effectively reducing domain gaps before matching. Although both categories have improved matching accuracy and robustness to varying degrees, they each have limitations. Descriptor-based methods often struggle to model both local details and global context simultaneously and typically involve high computational redundancy. Modality-conversion methods, on the other hand, may introduce artifacts or lead to loss of critical information, and they usually rely on large paired datasets, limiting their generalizability.
This study aims to address the performance and efficiency bottlenecks of cross-modal matching by proposing a novel deep learning framework, called ACAMatch. It consists of two main modules: a structure-aware feature extraction network (RS2FNet) and a context-guided matching network (ACAMatcher). RS2FNet integrates a two-stage Res2Net [22] backbone with a bi-level routing attention mechanism to capture both fine-grained local features and region-level structural information, enabling stable and repeatable keypoint detection across modalities. ACAMatcher incorporates prior attention cues to guide bidirectional attention-based matching and introduces a confidence-driven early-exit pruning strategy to significantly reduce computational overhead without compromising matching precision. Furthermore, we design a match-aware multi-task loss function that jointly enforces spatial consistency, affine invariance, and structural loop closure constraints, enabling unified modeling from feature learning to matching inference.
The main contributions of this paper are summarized as follows:
(1)
A novel feature extraction network, RS2FNet, is proposed, which for the first time integrates the multi-scale residual structure (Res2Net) [22] with the region-aware sparse attention mechanism (BRA) [23] for optical–SAR image matching. This integration enhances cross-modal feature representation and structural robustness.
(2)
The ACAMatcher module is developed, incorporating a context-guided bidirectional attention mechanism and a confidence-driven dynamic pruning strategy. This design improves matching accuracy while significantly reducing computational overhead.
(3)
A match-aware multi-task loss function is formulated to jointly optimize keypoint detection, descriptor learning, and structural consistency, contributing to a unified and systematic cross-modal matching framework.
(4)
Extensive experiments on multiple public optical–SAR datasets validate the superiority of ACAMatch in terms of the number of correct matches, matching accuracy, and registration precision compared to existing state-of-the-art methods.

2. Related Works

With the increasing application of remote sensing imagery in fine-grained land cover classification, disaster monitoring, and change detection, accurate geometric registration between images has become a fundamental prerequisite. Image matching, as the core of registration, aims to identify corresponding points with consistent spatial positions across multi-temporal, multi-sensor, or multi-view images. Brown’s seminal work in 1992 outlined four critical components of image matching—feature space selection, similarity metrics, search space definition, and search strategies—which laid the theoretical foundation for subsequent developments [24]. As matching tasks grow more complex, especially in multimodal settings, research has shifted from handcrafted features toward deep model-driven learning and representation approaches. Current matching methods can be broadly divided into two categories: traditional methods and deep learning-based methods, which we summarize and analyze below.

2.1. Traditional Remote Sensing Image Matching Methods

Traditional matching methods typically fall into two types: intensity-based similarity matching and feature-based structural matching. Both approaches rely on handcrafted features and predefined similarity metrics to identify correspondences and spatial relationships between images. The former emphasizes direct pixel-wise comparison using statistical measures, while the latter aims to enhance robustness through geometry-aware descriptors.
Intensity-based methods often utilize sliding-window template matching with metrics such as MI [8] and NCC [9]. While effective in similar modalities, they are sensitive to radiometric differences and computationally intensive. Variants such as mutual information with contrast measure (MIC) [25] and maximization of mutual information (MMI) [26] offer some improvements, yet they struggle under significant modality discrepancies or geometric distortions.
To reduce reliance on raw gray-level distributions, researchers have explored structure-aware descriptors. Examples include HOPC [10], which combines phase congruency and histogram of oriented gradients (HOG) [27], and CFOG [11], a pixel-level extension of HOG using 3D Gaussian convolution to enhance continuity. However, these methods remain sensitive to multiplicative noise and often underperform in weakly structured regions. Alternative approaches such as local self-similarities with MI (LSS+MI) [28] aim to balance local-global robustness in complex scenes. Overall, intensity-based methods offer simple implementation and dense matching but are limited by poor adaptability to modality variance.
Feature-based matching methods extract stable geometric primitives (e.g., points, lines, regions) to build descriptors and estimate transformations [3]. Keypoint-based methods like SIFT [12] introduced scale-space and orientation normalization to achieve rotation and scale invariance. Extensions such as SURF [13], RIFT [14], and oriented FAST and rotated BRIEF (ORB) [29] improve efficiency and descriptor robustness. For multimodal settings, SAR-SIFT [30] replaces differential operators with gradient ratios and integrates multi-scale corner detection to enhance SAR point stability. OS-SIFT [15] incorporates Sobel [31] and ratio of exponentially weighted averages (ROEWA) [32] operators to improve consistency across modalities, demonstrating the potential of adapting traditional methods to multimodal tasks.
Line-feature methods focus on structural elements like roads, rivers, or building edges. Algorithms such as Hough Transform and LSD [33] extract line segments globally or locally. Enhancements like ROEWA [32], histogram of angle and maximal edge orientation distribution (HAED) [34], and Voronoi integrated spectral point matching (VSPM) [35] improve SAR robustness through region merging, orientation statistics, or topology constraints, though challenges remain in endpoint localization and segment continuity.
Region-based methods match closed, geometrically stable high-contrast regions such as water bodies or buildings. These often rely on image segmentation and shape descriptors like Hu moments [36] or maximally stable extremal regions (MSER) [37]. While effective in well-structured images, performance degrades in scenes with poor segmentation or strong modality differences.
In summary, traditional matching methods retain value for homogeneous or weakly heterogeneous datasets, particularly where resources are limited or end-to-end learning is not feasible. Despite their limitations in modeling complex cross-modal discrepancies, they provide the conceptual groundwork for modern deep learning techniques.

2.2. Deep Learning-Based Matching Methods

Recent years have seen a growing adoption of deep learning techniques to address the limitations of handcrafted features in multimodal image matching. Deep models, trained in an end-to-end fashion, can extract robust, modality-invariant features that better accommodate radiometric differences, geometric distortions, and scale or rotation variations inherent in optical–SAR image pairs. These models leverage multi-layer convolutional architectures to capture geometric structures, texture patterns, and semantic cues, offering a more flexible framework for non-linear mapping across modalities.
Based on their core strategies, deep learning-based methods can be broadly categorized into: (1) descriptor learning methods, which directly extract discriminative deep features from optical and SAR images; and (2) style transfer methods, which convert images from one modality to another to unify domains before registration.
(1) Descriptor Learning Methods
These methods focus on learning deep feature representations directly from heterogeneous image pairs. A common direction is to enhance structural awareness and contextual representation. For example, SuperPoint [38] integrates keypoint detection and descriptor learning into a unified framework. MatchosNet [39] extracts hierarchical structural features and fuses them with geometric constraints for improved consistency. AECF [40] introduces multi-scale gradient convolution to boost alignment stability in urban and rural scenes. Li et al. [41] propose a phase-aware attention-based descriptor that enhances structural consistency under heavy SAR noise. CMM-Net [18] and ADRNet [19] incorporate structure-aware encoders and graph-based reasoning modules to reinforce matching robustness. Other strategies such as Bi-level Routing Attention (BRA) [23] and Res2Net [22] are also adopted for multi-scale perception and salient region modeling. Our proposed ACAMatch follows a similar path by integrating contextual attention with structural guidance to improve cross-modal feature alignment.
Another line of work treats matching as a similarity learning problem using Siamese or pseudo-Siamese architectures. MatchNet [16] and HardNet [17] train deep encoders under contrastive loss to distinguish matched and unmatched pairs. FDANet [42] improves robustness through frequency domain fusion and contrastive learning. CRS-Net [43] introduces cross-domain decoupling and rotation-invariant design to further enhance generalization across modalities. Lightweight alternatives like LM-Net [44] and F3Net [45] aim for deployment efficiency, combining frequency-aware modules or knowledge distillation to reduce model size without sacrificing accuracy.
To further improve inference speed and matching quality, efficient attention-based matchers such as SuperGlue [46] and LightGlue [47] introduce graph neural networks and sparse attention strategies, respectively. These enable scalable, high-performance matching even in resource-constrained environments.
In addition, many methods integrate contrastive or triplet loss [48,49] and attention-aware supervision to enhance the discriminative power of features, especially under limited training data conditions.
(2) Style Transfer-Based Methods
These methods aim to unify domain appearance by translating one modality (e.g., optical) into another (e.g., SAR) using generative models, often based on GANs [50]. Although early attempts like KCG-GAN [21] improved matching performance, they frequently suffered from artifact generation or geometric distortion. Later approaches such as CDA-GAN [51] incorporate geometry-preserving losses, while Diffusion-GAN [52] enhances transformation quality and stability through hybrid strategies. ADD-UNet [53] proposes an adjacent dual-decoder UNet architecture based on cGAN, which introduces a multi-scale feature aggregation mechanism while maintaining a lightweight parameter size, effectively enhancing the structural integrity and edge sharpness of the generated images.
While style transfer simplifies domain alignment, it risks losing structural details and heavily depends on paired training samples. In contrast, descriptor learning methods offer better structural consistency and generalization.
Inspired by these findings, our work proposes ACAMatch, a hybrid framework that integrates multi-scale structural encoding, context-aware matching, and dynamic inference strategies to address the challenges of optical–SAR registration.

3. Methodology

In cross-modal remote sensing image matching, optical and SAR images exhibit fundamental differences in their imaging mechanisms, causing significant inconsistencies in textures, intensities, and noise characteristics. These discrepancies present major challenges for reliable keypoint detection and accurate feature matching. To effectively address the uncertainty arising from such modality differences, we propose a unified framework for feature extraction and attention-guided matching, called ACAMatch. The overall pipeline is illustrated in Figure 2.
As shown in the figure, the ACAMatch framework consists of two core modules: RS2FNet for feature extraction and ACAMatcher for feature matching. RS2FNet introduces a two-stage Res2Net backbone combined with a BRA mechanism to enhance multi-scale feature representation and improve cross-modal invariance, enabling the early encoding stage to generate highly repeatable and robust keypoints and descriptors. The ACAMatcher module leverages attention priors generated during feature extraction to guide channel-wise modulation. To strengthen contextual information aggregation, it integrates a self- and cross-attention fusion mechanism that models both intra- and inter-image relationships. Furthermore, a dynamic gating mechanism is introduced to adaptively predict matching confidence and prune unreliable correspondences, thereby improving matching accuracy and enhancing robustness against cross-modal interference.
To enhance the robustness and cross-modal generalization of matching in complex remote sensing registration tasks, we propose a matching-aware multi-task optimization strategy. Building upon conventional supervision for keypoint detection and descriptor learning, this strategy introduces a structural consistency loss to improve the geometric stability of extracted features and incorporates a matchability prediction branch to support early-exit inference. Together, these components improve both matching efficiency and reliability.
This chapter provides a detailed introduction to the proposed RS2FNet feature extraction network, the ACAMatcher feature matching module, and the design of the matching-aware loss function.

3.1. RS2FNet Feature Extraction Network

Remote sensing image matching demands high-quality feature representations that are capable of modeling cross-scale geometric structures and semantic information while maintaining sensitivity to local details. Most existing methods, such as SuperPoint [38], R2D2 [54], and DISK [55], rely on shallow CNNs or fixed backbone networks to extract local descriptors, struggling to achieve a balance between computational efficiency and rich semantic-structural representations. Recent approaches, including LoFTR [56] and SuperGlue [46], introduce Transformer-based architectures to model global context. However, these methods suffer from high computational overhead and limited spatial selectivity, making them less effective when applied to high-resolution remote sensing images.
To address these challenges, we propose RS2FNet, a feature extraction network that combines lightweight multi-scale convolutions with dynamic sparse attention mechanisms. It employs a dual-stage Res2Net backbone to enhance semantic fusion and structural perception across scales. Each Res2Net stage is followed by a dual-layer BRA module, which introduces regional guidance and efficient sparse attention modeling. Finally, a lightweight decoder head produces descriptors and keypoint probability maps. RS2FNet supports end-to-end training and can be flexibly integrated with various matching modules, thus making it suitable for a wide range of cross-modal remote sensing matching tasks. The overall network architecture is illustrated in Figure 3.

3.1.1. Dual-Stage Res2Net Backbone

To enhance multi-scale structural modeling and cross-modal semantic perception in remote sensing imagery, we design a dual-stage backbone based on Res2Net modules. Unlike conventional shallow convolutional stacks such as VGG, Res2Net introduces grouped convolutions and cross-scale residual fusion within each block, enabling fine-grained feature decomposition and integration, and thereby providing strong adaptability to complex structures.
Specifically, each Res2Net stage is followed by a BRA module, which compresses intra-region information and models inter-region relationships. The integration of Res2Net and BRA is illustrated in Figure 4.
In the first stage, the input feature map of size $H \times W \times 64$ is first processed by a $1 \times 1$ convolution to reduce the channel dimension. It is then split into four equal-width groups ($X_1$ to $X_4$), each with 16 channels. Among them, $X_1$ serves as a skip connection without convolution, while $X_2$ to $X_4$ are sequentially processed by $3 \times 3$ convolutions, each taking the previous output as a residual input. The outputs ($Y_1$ to $Y_4$) are then concatenated and fused via another $1 \times 1$ convolution, followed by residual addition with the input feature map:

$$Y_1 = X_1,\quad Y_2 = \mathrm{Conv}_{3\times 3}(X_2),\quad Y_3 = \mathrm{Conv}_{3\times 3}(X_3 + Y_2),\quad Y_4 = \mathrm{Conv}_{3\times 3}(X_4 + Y_3)$$
$$F_{\mathrm{out}} = \mathrm{ReLU}\big(\mathrm{BN}\big(\mathrm{Conv}_{1\times 1}([Y_1, Y_2, Y_3, Y_4])\big) + F_{\mathrm{in}}\big)$$
In the second stage, a similar structure is adopted, with the first $3 \times 3$ convolution applying a stride of 2 for spatial downsampling. The feature map size is reduced to $H/2 \times W/2$ and the channel dimension is increased to 128, thus enhancing semantic abstraction while reducing computational complexity. Outputs from both stages are subsequently passed to BRA modules for region-guided sparse attention modeling.
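For clarity, the following is a minimal PyTorch sketch of a first-stage Res2Net-style block as described above (four 16-channel groups, hierarchical 3 × 3 convolutions, 1 × 1 fusion, residual addition); it is an illustrative reconstruction from the text, not the authors' released code.

```python
import torch
import torch.nn as nn

class Res2NetBlock(nn.Module):
    """Minimal Res2Net-style block: split into 4 groups, hierarchical 3x3 convs,
    1x1 fusion, and residual addition (sketch of the first-stage block described above)."""
    def __init__(self, channels=64, groups=4):
        super().__init__()
        self.width = channels // groups            # 64 / 4 = 16 channels per group
        self.reduce = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        # X2..X4 each receive a 3x3 convolution; X1 is passed through unchanged
        self.convs = nn.ModuleList([
            nn.Conv2d(self.width, self.width, kernel_size=3, padding=1, bias=False)
            for _ in range(groups - 1)
        ])
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        xs = torch.split(self.reduce(x), self.width, dim=1)        # X1..X4
        ys = [xs[0]]                                               # Y1 = X1 (skip)
        prev = None
        for i, conv in enumerate(self.convs):
            inp = xs[i + 1] if prev is None else xs[i + 1] + prev  # X_i + Y_{i-1}
            prev = conv(inp)                                       # Y_i
            ys.append(prev)
        out = self.fuse(torch.cat(ys, dim=1))                      # 1x1 fusion
        return self.relu(self.bn(out) + identity)                  # residual addition

# e.g. feats = Res2NetBlock(64)(torch.randn(1, 64, 256, 256))
```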

3.1.2. Bi-Level Routing Attention (BRA)

To mitigate the overhead of fully connected attention in high-resolution remote sensing images and to enhance focus on structurally salient regions, we introduce a BRA module after each Res2Net stage. Specifically, BRA follows a coarse-to-fine strategy that first performs region-level attention routing and then applies fine-grained token-level attention. This mechanism constructs a sparse, structure-aware attention graph, well-suited for cross-modal and structure-dense image matching. The BRA computation includes the following steps:
Region Partition and Token Construction: The input feature map $X \in \mathbb{R}^{H \times W \times C}$ is divided into $S \times S$ non-overlapping spatial regions, each containing $N = HW/S^2$ spatial tokens (one per location). These features are then reshaped into a tensor $X^r \in \mathbb{R}^{S^2 \times N \times C}$. Three linear projections are then applied to obtain query ($Q$), key ($K$), and value ($V$) representations:

$$Q = X^r W^q,\qquad K = X^r W^k,\qquad V = X^r W^v$$

where $W^q, W^k, W^v \in \mathbb{R}^{C \times C}$ are learnable projection matrices, and $Q, K, V \in \mathbb{R}^{S^2 \times N \times C}$ are the projected token-level tensors.
Routing Attention Construction: To avoid the high cost of full token-to-token attention, BRA first computes region-level embeddings by average pooling the token-level query and key features within each region, resulting in $Q^r, K^r \in \mathbb{R}^{S^2 \times C}$. A semantic adjacency matrix $A^r \in \mathbb{R}^{S^2 \times S^2}$ is then constructed by computing the dot product between these region-level representations:

$$A^r = Q^r (K^r)^T,\qquad A^r \in \mathbb{R}^{S^2 \times S^2}$$

where $A^r[i, j]$ reflects the semantic correlation between region $i$ and region $j$.
Top-k Routing and Sparse Token Gathering: For each region, the top-$k$ most semantically relevant regions are selected from the adjacency matrix $A^r$ to construct a routing index matrix:

$$I^r = \mathrm{topkIndex}(A^r),\qquad I^r \in \mathbb{N}^{S^2 \times k}$$

Based on $I^r$, we gather all token-level keys and values from the selected regions:

$$K^g = \mathrm{gather}(K, I^r),\qquad K^g \in \mathbb{R}^{S^2 \times (kN) \times C}$$
$$V^g = \mathrm{gather}(V, I^r),\qquad V^g \in \mathbb{R}^{S^2 \times (kN) \times C}$$

The $\mathrm{gather}(\cdot)$ operation compacts tokens from non-contiguous regions into dense tensors, enabling efficient sparse attention computation on GPUs.
Token-wise Sparse Attention and Fusion: For each query token, standard dot-product attention is applied over its corresponding sparse key-value set, and the final output $Y$ is obtained by fusing the attention output $O$ with the original feature $X$:

$$O = \mathrm{Softmax}\big(Q (K^g)^T\big) V^g,\qquad Y = O + X$$

where $O \in \mathbb{R}^{S^2 \times N \times C}$ is the attention-enhanced feature, and $Y$ is the fused representation. All computations are confined to the top-$k$ routed regions, significantly reducing global attention costs while maintaining sensitivity to important structural patterns.
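The four steps above (region partition, region-level routing, top-k gathering, token-wise sparse attention) can be summarized in the following PyTorch sketch. The default values of S and topk, the channels-last layout, and the random projection matrices (stand-ins for learned linear layers) are assumptions for illustration; the 1/sqrt(C) scaling is a standard numerical addition not written in the formula above.

```python
import torch
import torch.nn.functional as F

def bi_level_routing_attention(x, S=8, topk=4):
    """Sketch of the coarse-to-fine BRA computation described above.
    x: (B, H, W, C) feature map; S: regions per side; topk: routed regions per query region.
    Shapes follow the text: S^2 regions of N = HW/S^2 tokens each."""
    B, H, W, C = x.shape
    n_h, n_w = H // S, W // S                       # tokens per region along each axis
    # Region partition: (B, S^2, N, C) region-grouped tokens
    xr = x.view(B, S, n_h, S, n_w, C).permute(0, 1, 3, 2, 4, 5).reshape(B, S * S, n_h * n_w, C)

    Wq = torch.randn(C, C) / C ** 0.5               # learnable nn.Linear projections in practice
    Wk = torch.randn(C, C) / C ** 0.5
    Wv = torch.randn(C, C) / C ** 0.5
    q, k, v = xr @ Wq, xr @ Wk, xr @ Wv             # (B, S^2, N, C)

    # Region-level routing: pooled queries/keys -> adjacency A^r -> top-k routed regions I^r
    qr, kr = q.mean(dim=2), k.mean(dim=2)           # (B, S^2, C)
    Ar = qr @ kr.transpose(1, 2)                    # (B, S^2, S^2) semantic adjacency
    idx = Ar.topk(topk, dim=-1).indices             # (B, S^2, k)

    # Gather token-level keys/values from the routed regions (K^g, V^g)
    idx_exp = idx[..., None, None].expand(-1, -1, -1, n_h * n_w, C)
    kg = torch.gather(k.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx_exp)
    vg = torch.gather(v.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx_exp)
    kg = kg.reshape(B, S * S, topk * n_h * n_w, C)  # (B, S^2, kN, C)
    vg = vg.reshape(B, S * S, topk * n_h * n_w, C)

    # Token-wise sparse attention restricted to the routed key/value set, plus residual fusion
    attn = F.softmax(q @ kg.transpose(-1, -2) / C ** 0.5, dim=-1)
    return attn @ vg + xr                           # Y = O + X, shape (B, S^2, N, C)
```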

3.1.3. Lightweight Decoding Module

Following the structure-aware encoding process, the output features of RS2FNet are passed to two parallel decoding branches to generate the final keypoint probability map and descriptor map. A lightweight decoding module is employed, consisting of a keypoint detection head that predicts the spatial distribution of keypoints and a descriptor reconstruction head that recovers semantic feature representations.
The keypoint detection branch takes as input the output from the shared encoding layer, which has a resolution of $H/8 \times W/8 \times 128$. A $1 \times 1$ convolution reduces the channel dimension to 65, where 64 channels correspond to specific grid locations within an $8 \times 8$ window, and the 65th channel represents background or non-keypoint regions. A softmax activation function is then applied for spatial normalization, and the resulting tensor is reshaped back to the original resolution $H \times W$, producing a probability map that reflects the likelihood of each pixel being a keypoint.
The descriptor branch first upsamples the encoded feature map to the original resolution $H \times W \times 256$ using bicubic interpolation, which helps minimize feature drift and ensures effective resolution recovery. To enhance the scale invariance of the descriptors, L2 normalization is applied to each per-pixel channel vector:

$$\hat{F}(x, y) = \frac{F(x, y)}{\sqrt{\sum_{c=1}^{C} F(x, y, c)^2 + \varepsilon}}$$

where $F \in \mathbb{R}^{H \times W \times C}$ is the input feature map and $\hat{F}(x, y)$ is the normalized descriptor. The normalization is applied over the channel dimension $C$ at each spatial location. The constant $\varepsilon$ is added for numerical stability and is typically set to $10^{-6}$. This normalization improves robustness to scale variations and facilitates reliable similarity comparison in the subsequent matching stage.
In summary, RS2FNet outputs a keypoint probability map of size $H \times W \times 1$ and a descriptor map of size $H \times W \times D$. These outputs enable full-scene keypoint coverage with strong spatial localization and discriminative feature representation, supporting robust cross-modal matching.
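As a reference for how the two heads fit together, the following PyTorch sketch reproduces the decoding step described above (a 65-channel softmax detection head with an 8 × 8 cell unfold, bicubic descriptor upsampling, and channel-wise L2 normalization); the exact layer composition is an assumption, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightDecoder(nn.Module):
    """Sketch of the two decoding branches described above (layer sizes taken from the text)."""
    def __init__(self, in_ch=128, desc_dim=256, cell=8):
        super().__init__()
        self.cell = cell
        self.kpt_head = nn.Conv2d(in_ch, cell * cell + 1, kernel_size=1)   # 64 cells + 1 background channel
        self.desc_head = nn.Conv2d(in_ch, desc_dim, kernel_size=1)

    def forward(self, feat):                          # feat: (B, 128, H/8, W/8)
        # Keypoint branch: softmax over 65 channels, drop the background channel,
        # then unfold the 8x8 cells back to full resolution
        prob = F.softmax(self.kpt_head(feat), dim=1)[:, :-1]       # (B, 64, H/8, W/8)
        heatmap = F.pixel_shuffle(prob, self.cell)                  # (B, 1, H, W) keypoint probability map
        # Descriptor branch: bicubic upsampling + channel-wise L2 normalization
        desc = self.desc_head(feat)
        desc = F.interpolate(desc, scale_factor=self.cell, mode="bicubic", align_corners=False)
        desc = F.normalize(desc, p=2, dim=1, eps=1e-6)              # unit-norm descriptor per pixel
        return heatmap, desc

# e.g. heatmap, desc = LightweightDecoder()(torch.randn(1, 128, 32, 32))
```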

3.1.4. Multi-Level Loss Function for Feature Extraction

To enhance the structural awareness and geometric robustness of the RS2FNet feature extraction network in cross-modal remote sensing image registration, we design a multi-branch weakly supervised loss strategy. This strategy leverages spatial correspondence from co-registered image pairs and incorporates affine-transformed image pairs constructed in a self-supervised manner to improve the network’s invariance to scale, rotation, and translation. The training process uses two types of image pairs:
Registered image pairs $(I^A, I^B)$: each pixel $(x, y)$ in image $I^A$ has a spatially corresponding point $(x', y')$ in image $I^B$, forming a set of semantically aligned locations.
Affine-transformed image pairs $(I, \tilde{I})$: $\tilde{I} = T(I)$ is generated by applying a random affine transformation (including rotation, scaling, and translation) to the original image $I$.
For each input, RS2FNet outputs a descriptor map $F \in \mathbb{R}^{H \times W \times D}$ and a keypoint probability map $H \in \mathbb{R}^{H \times W \times 1}$.
(1) Pixel-wise Contrastive Loss: To enforce geometric consistency and discriminative representation at corresponding locations, we define a pixel-wise contrastive loss. This loss encourages high similarity between matched descriptors while penalizing similarity with non-matching ones:

$$L_{\mathrm{pixel}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\!\big(f_i^A \cdot f_i^B / t\big)}{\sum_{j=1}^{N} \exp\!\big(f_i^A \cdot f_j^B / t\big)}$$

where $N$ is the number of matched point pairs; $f_i^A, f_j^B \in \mathbb{R}^{C}$ are normalized descriptors from the feature maps of images $I^A$ and $I^B$, with $C$ the channel dimension; the numerator term $f_i^A \cdot f_i^B$ denotes the similarity between the $i$-th pixel in image A and its correctly matched counterpart in image B, while the denominator terms $f_i^A \cdot f_j^B$ cover the similarities between the $i$-th pixel in image A and all candidate pixels in image B, including the true match ($j = i$) and all non-matching negatives ($j \neq i$); $t$ is a temperature parameter, empirically set to 0.07 [57]. To reduce computational complexity, the negative set in the denominator can be restricted to a local $w \times w$ neighborhood.
(2) Affine Consistency Loss: To ensure feature stability under geometric perturbations, we apply an affine consistency loss that constrains the descriptors of an image and its affine-transformed version to be consistent:

$$L_{\mathrm{affine}} = \frac{1}{|P|}\sum_{(x, y) \in P} \big(1 - f(x, y) \cdot \tilde{f}(T(x, y))\big)$$

where $f$ and $\tilde{f}$ denote descriptors from the original image $I$ and the transformed image $\tilde{I}$, respectively; $T(x, y)$ represents the affine-transformed coordinate of pixel $(x, y)$; and $P$ is the set of positions selected for supervision (e.g., the full image or salient regions). This loss penalizes deviations in descriptor similarity caused by geometric transformations.
(3) Heatmap Consistency Loss: To improve the robustness of the keypoint detection branch under geometric variations, we apply a heatmap-level consistency loss:

$$L_{\mathrm{heat}} = \frac{1}{HW}\sum_{x=1}^{H}\sum_{y=1}^{W} \mathrm{BCE}\big(H(x, y), \tilde{H}(T(x, y))\big)$$

where $H$ and $\tilde{H}$ denote the keypoint probability maps of the original image and its affine-transformed version, respectively, and BCE denotes the binary cross-entropy loss measuring similarity at each pixel.
(4) Overall Loss Function: The overall training loss for RS2FNet is the weighted sum of the three components:

$$L_{\mathrm{RS2FNet}} = \lambda_1 L_{\mathrm{pixel}} + \lambda_2 L_{\mathrm{affine}} + \lambda_3 L_{\mathrm{heat}}$$

where $\lambda_1, \lambda_2, \lambda_3$ are weighting coefficients that balance the contributions of each loss term. Based on empirical tuning, we set $\lambda_1 = 1.0$, $\lambda_2 = 0.5$, and $\lambda_3 = 0.1$.
This multi-level loss jointly leverages both spatial alignment from registered pairs and robustness to affine transformations, guiding RS2FNet to learn structure-aware and transformation-invariant feature representations, which form a strong foundation for the downstream matching module.
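A compact PyTorch sketch of these three terms is given below. It mirrors the equations for L_pixel, L_affine, and L_heat under the assumption that matched descriptors are supplied as aligned rows, and it uses the global negative set rather than the optional w × w neighborhood restriction.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(fa, fb, t=0.07):
    """L_pixel: InfoNCE over matched descriptor pairs.
    fa, fb: (N, C) L2-normalized descriptors at corresponding locations; row i of fa matches
    row i of fb, and all other rows act as negatives (sketch under this layout assumption)."""
    sim = fa @ fb.t() / t                               # (N, N) similarity logits
    target = torch.arange(fa.size(0), device=fa.device)
    return F.cross_entropy(sim, target)                 # -log softmax at the true match, averaged

def affine_consistency_loss(f, f_warp):
    """L_affine: 1 - cosine similarity between descriptors of I and T(I) at corresponding pixels."""
    return (1.0 - (f * f_warp).sum(dim=-1)).mean()

def heatmap_consistency_loss(h, h_warp):
    """L_heat: pixel-wise BCE between the keypoint maps of I and its affine-warped version."""
    return F.binary_cross_entropy(h, h_warp)

def rs2fnet_loss(fa, fb, f, f_warp, h, h_warp, w=(1.0, 0.5, 0.1)):
    """Weighted sum of the three terms with the weights quoted in Section 3.1.4."""
    return (w[0] * pixel_contrastive_loss(fa, fb)
            + w[1] * affine_consistency_loss(f, f_warp)
            + w[2] * heatmap_consistency_loss(h, h_warp))
```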

3.2. ACAMatcher Feature Matching Network

In remote sensing image registration tasks, feature matching is a critical component that significantly affects both accuracy and robustness. Traditional methods typically rely on handcrafted descriptors and Euclidean distance criteria, which struggle to maintain stable performance under scale variations and non-linear radiometric distortions. Although recent end-to-end methods such as SuperGlue [46] have achieved promising progress, their high computational cost, redundant structure, and single pruning mechanism still limit their applicability to high-resolution remote sensing scenarios. To address these limitations, we propose a lightweight, context-aware, and dynamically inferable feature matching network, called ACAMatcher. Taking as input the keypoint probability maps and descriptor maps generated by RS2FNet, ACAMatcher constructs a hierarchical attention interaction module and a confidence-aware sparse inference mechanism to enable precise and efficient matching within a modular pipeline. As shown in Figure 5, the ACAMatcher framework consists of two main components: a context-guided bidirectional attention module and a confidence-driven dynamic pruning mechanism.

3.2.1. Context-Guided Bidirectional Attention Matching

To effectively model structural correspondences between image pairs, ACAMatcher adopts a context-aware bidirectional attention module that jointly captures local dependencies within each image and inter-image matching patterns. Initially, each keypoint feature extracted by RS2FNet is encoded by fusing its spatial coordinates $(x, y)$, detection confidence $p$, and descriptor $d$ through a multilayer perceptron (MLP):

$$z = \mathrm{MLP}([x, y, p, d])$$

where $z$ denotes the initial context-augmented representation that serves as the input to the attention module.
In each attention layer, ACAMatcher performs intra-image self-attention followed by inter-image cross-attention to progressively model structural relations and semantic alignment. The process is defined as:

$$h^A = \mathrm{SelfAttn}(z^A),\qquad h^B = \mathrm{SelfAttn}(z^B)$$
$$z'^A = \mathrm{CrossAttn}(h^A, h^B),\qquad z'^B = \mathrm{CrossAttn}(h^B, h^A)$$

where $z^A, z^B$ are the context-enhanced features from images A and B, $h^A, h^B$ are their local structure-enhanced features, and $z'^A, z'^B$ are the outputs after bidirectional cross-attention fusion.
Unlike conventional attention modules, ACAMatcher integrates prior attention maps from the BRA module in RS2FNet to guide attention focus toward semantically salient regions. The resulting attention weight matrix $\alpha$ modulates the softmax-normalized similarity scores during attention computation. Specifically, channel-wise attention priors are fused into the attention score computation:

$$\alpha = \mathrm{Softmax}\big(Q K^T \odot w_{\mathrm{BRA}}\big),\qquad z' = \alpha V$$

where $\odot$ denotes element-wise channel multiplication, $w_{\mathrm{BRA}}$ is the attention prior from the BRA module, and $z'$ is the context-weighted output. This integration reinforces semantically important channels and suppresses irrelevant ones, improving structural alignment and match quality.
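To illustrate the prior-guided attention step, the sketch below applies the BRA-derived prior as a multiplicative weight on the cross-attention scores before the softmax. Because the equation above leaves the exact broadcasting of w_BRA open, treating it as a per-keypoint saliency weight is only one plausible reading, and the random projection matrices stand in for learned linear layers; the 1/sqrt(C) scaling is a standard addition not written in the formula.

```python
import torch
import torch.nn.functional as F

def prior_guided_cross_attention(zA, zB, w_bra_B):
    """Sketch of one prior-modulated cross-attention step.
    zA: (Na, C) query features from image A; zB: (Nb, C) key/value features from image B;
    w_bra_B: (Nb,) saliency prior for the keypoints of B taken from the BRA attention maps."""
    C = zA.size(-1)
    Wq = torch.randn(C, C) / C ** 0.5          # stand-ins for learned projections
    Wk = torch.randn(C, C) / C ** 0.5
    Wv = torch.randn(C, C) / C ** 0.5
    Q, K, V = zA @ Wq, zB @ Wk, zB @ Wv
    scores = (Q @ K.t()) / C ** 0.5            # (Na, Nb) similarity scores
    scores = scores * w_bra_B.unsqueeze(0)     # modulate scores by the BRA prior
    alpha = F.softmax(scores, dim=-1)
    return alpha @ V                           # context-weighted output z'
```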

3.2.2. Confidence-Aware Layer-Wise Pruning Inference

To balance accuracy and computational efficiency, ACAMatcher introduces a confidence-guided layer-wise pruning mechanism. This design adaptively allocates computational resources by determining, at each layer, whether a given keypoint feature should continue through the inference pipeline. A lightweight gating module is employed after each attention layer to assess the feature’s importance. The gating operation consists of two MLP layers and outputs a channel-wise weight vector:
$$z_g = \sigma(\mathrm{MLP}(z)) \odot z$$

where $\sigma$ is the sigmoid function and $z_g$ is the gated feature representation. Global average pooling and a binary classifier are then used to predict the matchability confidence $c$ for each point:

$$c = \mathrm{Sigmoid}\big(\mathrm{MLP}(\mathrm{GlobalAvgPool}(z_g))\big)$$

If $c > \theta$, where $\theta$ is a confidence threshold, the point is considered stable and exits early from further attention updates; otherwise, it proceeds to the next layer for continued refinement.
This forms an early-exit pruning mechanism that eliminates low-confidence features from deeper layers, reducing computation and improving inference efficiency. The final retained features are passed to a matching solver that computes a similarity matrix:
$$S = d^A (d^B)^T$$
The matchability predictions are used to normalize S , and a constrained optimal assignment algorithm is applied to produce the final matching matrix and the effective correspondence set. This mechanism not only significantly improves inference efficiency but also enhances matching robustness, particularly in complex remote sensing scenarios with high outlier ratios and unstructured regions.
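The following minimal PyTorch sketch shows one way to realize the gating and early-exit test described above. The layer widths are illustrative, and the global-pooling step is simplified into a per-point classifier, so it should be read as an interpretation rather than the exact released module.

```python
import torch
import torch.nn as nn

class ConfidenceGate(nn.Module):
    """Sketch of the per-layer gating and early-exit test of Section 3.2.2 (simplified)."""
    def __init__(self, dim=256, theta=0.5):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim), nn.Sigmoid())
        self.cls = nn.Linear(dim, 1)                # binary matchability classifier
        self.theta = theta                          # confidence threshold (0.5 by default)

    def forward(self, z):                           # z: (N, dim) keypoint features after one attention layer
        zg = self.gate(z) * z                       # channel-wise gating: z_g = sigma(MLP(z)) * z
        c = torch.sigmoid(self.cls(zg)).squeeze(-1) # (N,) matchability confidence per point
        exit_mask = c > self.theta                  # stable points exit early; the rest keep refining
        return zg, c, exit_mask

# Usage (sketch): points with exit_mask=True are frozen and skip the remaining attention
# layers; the surviving points are forwarded to the next layer before the final assignment.
```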

3.2.3. Matching-Aware Multi-Task Loss Function

To further enhance the discriminative ability and structural adaptability of the matching network under complex conditions, ACAMatcher adopts a matching-aware loss strategy that integrates multi-task modeling and structural consistency constraints. This design jointly optimizes point-level matching accuracy, matchability prediction, and spatial alignment to improve the overall matching performance.
Matching Similarity Loss: Given matched feature pairs from the similarity matrix S , we define a cross-entropy loss that encourages one-to-one assignment of correspondences:
$$L_{\mathrm{sim}} = -\frac{1}{|M|}\sum_{(i, j) \in M} \log S_{ij}$$

where $M$ is the set of ground-truth matching pairs and $S_{ij}$ is the predicted similarity score between point $i$ in image A and point $j$ in image B. These pseudo ground-truth correspondences are automatically generated by applying affine perturbations to pre-registered optical–SAR image pairs, without requiring manual annotation.
Matchability Confidence Loss: We introduce a binary cross-entropy loss to supervise the predicted matchability scores $c$, using ground-truth match labels $y$:

$$L_{\mathrm{conf}} = -\frac{1}{N}\sum_{i=1}^{N} \big[y_i \log c_i + (1 - y_i)\log(1 - c_i)\big]$$

where $N$ is the number of candidate points.
Cycle Consistency Loss: To improve the mutual coherence of the matching results, we define a cycle consistency loss by enforcing that a point matched from image A to image B and back to image A should return to its original position. This loss computes the Euclidean distance between the starting and returning points across all valid round-trip matching paths and takes the average as a measure of structural consistency error:

$$L_{\mathrm{cycle}} = \frac{1}{|P|}\sum_{(i \to j \to i') \in P} \lVert i - i' \rVert_2$$

Here, $P$ denotes the set of keypoints in image A that can form a round-trip matching path $i \to j \to i'$, where $i$ is the original point in image A, $j$ is its corresponding match in image B, and $i'$ is the returned match back in image A. This constraint ensures geometric consistency and encourages local structural alignment.
The final loss function is defined as:
$$L_{\mathrm{ACAMatcher}} = \lambda_4 L_{\mathrm{sim}} + \lambda_5 L_{\mathrm{conf}} + \lambda_6 L_{\mathrm{cycle}}$$

where $\lambda_4, \lambda_5, \lambda_6$ are empirically determined weighting factors. In our experiments, we set $\lambda_4 = 1.0$, $\lambda_5 = 0.5$, and $\lambda_6 = 0.3$, following a commonly adopted ratio between primary objectives and auxiliary regularization terms in multi-task matching frameworks. This training strategy relies entirely on pseudo supervision derived from controlled geometric transformations, enabling a fully automated and annotation-free optimization process. The proposed configuration demonstrates stable convergence and effective pruning performance, whose impact is further analyzed in Section 5.
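For reference, a compact sketch of the combined matcher loss is given below; the tensor layouts (score matrix, index pairs, confidence vector, round-trip coordinates) are assumptions chosen to mirror the three terms defined above.

```python
import torch

def matcher_loss(S, matches, c, y, src_pts, cyc_pts, w=(1.0, 0.5, 0.3), eps=1e-8):
    """Sketch of L_ACAMatcher = w4*L_sim + w5*L_conf + w6*L_cycle with the weights from the text.
    S: (Na, Nb) predicted similarity/assignment scores in (0, 1);
    matches: (M, 2) long tensor of pseudo ground-truth index pairs (i, j);
    c: (N,) predicted matchability confidences; y: (N,) binary match labels;
    src_pts, cyc_pts: (P, 2) start and round-trip coordinates of cycle-consistent keypoints."""
    l_sim = -torch.log(S[matches[:, 0], matches[:, 1]] + eps).mean()        # cross-entropy on true pairs
    l_conf = torch.nn.functional.binary_cross_entropy(c, y.float())         # matchability supervision
    l_cycle = (src_pts - cyc_pts).norm(dim=-1).mean()                       # round-trip displacement
    return w[0] * l_sim + w[1] * l_conf + w[2] * l_cycle
```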

4. Experiments

4.1. Datasets

To comprehensively evaluate the performance and robustness of the proposed method on multimodal remote sensing image matching, we constructed a dataset comprising two public datasets and one self-built dataset. This combined dataset covers a wide range of typical scenarios and contains significant modality discrepancies, diverse land cover types, and challenging imaging conditions. It includes inter-resolution differences, SAR-specific speckle noise, lighting variations, low-texture regions, local occlusions, and view-angle deviations, enabling a rigorous assessment of model generalization and stability in real-world remote sensing contexts.
To evaluate the model’s capability in large-scale cross-region scenarios, we selected two seasonal subsets (summer and winter) from the SEN1-2 dataset, containing 48,158 and 60,104 images, respectively. Released by the German Aerospace Center (DLR) in 2017 [58], SEN1-2 consists of 282,384 pairs of Sentinel-1 (SAR) and Sentinel-2 (optical) image patches, each with a size of 256 × 256 pixels and a spatial resolution of 10–20 m. It covers diverse geographic regions under various meteorological and seasonal conditions.
The WHU-OPT-SAR dataset, provided by Wuhan University, contains high-resolution paired optical and SAR images with detailed land cover annotations [59]. The optical data are from the GF-1 satellite (RGB and NIR bands), while the SAR data are from the GF-3 satellite. Both are resampled to a uniform resolution of 5 m. This dataset includes 100 optical–SAR image pairs, each approximately 5556 × 3704 pixels, and is ideal for evaluating registration and recognition under complex cross-modal conditions. For compatibility with our training and testing pipeline, we cropped the original scenes into 256 × 256 patches.
Our self-constructed dataset consists of 48,000 paired image patches (256 × 256 pixels) from China’s high-resolution GF-series optical images and GF-3 SAR images. All image pairs were manually aligned using professional remote sensing processing software, with the registration error strictly controlled within 2 pixels. The alignment results were further verified by experienced remote sensing specialists to ensure accuracy. The dataset covers diverse scenes such as urban, industrial, farmland, water bodies, and mountainous areas, with spatial resolution ranging from 5 to 10 m.
To enhance robustness to scale and rotation variations, we applied random affine transformations (including rotation, scaling, and translation) during training. Using the original keypoints and their transformed coordinates, we automatically generated pseudo-labels to form positive and negative sample pairs. This supervision strategy allows the network to learn geometry-consistent keypoint heatmaps and descriptors, improving model stability under geometric variations.
Considering the annotation challenges in cross-modal settings, we adopted the above self-supervised strategy to ensure controllable and geometry-consistent training without manual intervention. The dataset was split into training, validation, and testing subsets in a 7:1:2 ratio.
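The pseudo-label generation can be sketched with OpenCV and NumPy as follows: a random rotation, scale, and translation is applied to an image, and the same affine matrix is used to warp the keypoint coordinates, yielding positive pairs without manual annotation. The parameter ranges shown are illustrative rather than the exact values used in training.

```python
import numpy as np
import cv2

def random_affine_pair(image, keypoints, max_rot=20, scale_range=(0.8, 1.2), max_shift=20):
    """Generate an affine-warped copy of an image together with warped keypoint coordinates.
    image: (H, W) or (H, W, 3) array; keypoints: (N, 2) array of (x, y) positions."""
    h, w = image.shape[:2]
    angle = np.random.uniform(-max_rot, max_rot)
    scale = np.random.uniform(*scale_range)
    tx, ty = np.random.uniform(-max_shift, max_shift, size=2)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)   # 2x3 affine matrix
    M[:, 2] += (tx, ty)                                          # add random translation
    warped = cv2.warpAffine(image, M, (w, h))
    pts = np.hstack([keypoints, np.ones((len(keypoints), 1))])   # homogeneous coordinates
    warped_pts = pts @ M.T                                       # transformed keypoint positions
    return warped, warped_pts                                    # (image', (N, 2) positive targets)
```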

4.2. Evaluation Metrics and Implementation Details

4.2.1. Evaluation Metrics

(1) Feature Similarity: We visualize descriptor value curves along dimension indices for matched point pairs, comparing the shape and overlap across methods to qualitatively assess cross-modal consistency.
(2) Number of Correct Matches (NCM): The number of correctly matched pairs verified by geometric consistency, reflecting the algorithm’s recall. A higher NCM indicates more reliable correspondences for subsequent registration.
(3) Correct Match Rate (CMR): The proportion of correct matches among all predicted matches, calculated as:
$$\mathrm{CMR} = \frac{\mathrm{NCM}}{\mathrm{NCM} + \mathrm{NFM}}$$

where NFM is the number of incorrect matches. A higher CMR denotes better matching precision.
(4) Root Mean Square Error (RMSE): Measures the spatial deviation of matched pairs after transformation. Given estimated homography H, RMSE is defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert \hat{p}_i - p_i \rVert^2}$$

where $\hat{p}_i$ is the transformed coordinate of the predicted point, $p_i$ is the reference point, and $N$ is the number of matches. Lower RMSE indicates higher spatial accuracy. A short computational sketch of CMR and RMSE is given after this list.
(5) Inference Time (T): The average runtime (in seconds) from feature extraction to matching for a single image pair, measuring overall algorithmic efficiency.
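The CMR and RMSE metrics above can be computed as in the following NumPy sketch, where the estimated homography is applied in homogeneous coordinates before measuring residuals; the array layouts are assumptions for illustration.

```python
import numpy as np

def cmr(ncm, nfm):
    """Correct match rate: NCM / (NCM + NFM)."""
    return ncm / (ncm + nfm)

def rmse(H, pts_src, pts_ref):
    """RMSE of matched points after applying the estimated 3x3 homography H to pts_src.
    pts_src, pts_ref: (N, 2) arrays of corresponding (x, y) coordinates."""
    pts_h = np.hstack([pts_src, np.ones((len(pts_src), 1))])   # homogeneous coordinates
    proj = pts_h @ H.T
    proj = proj[:, :2] / proj[:, 2:3]                           # back to Cartesian coordinates
    return float(np.sqrt(np.mean(np.sum((proj - pts_ref) ** 2, axis=1))))
```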

4.2.2. Implementation Details

In the experiments, our method is implemented using PyTorch 2.1.0 and trained/tested on an Ubuntu 20.04 system with an NVIDIA GeForce RTX 4090 GPU (24 GB VRAM). RS2FNet is optimized using Adam with an initial learning rate of $1 \times 10^{-3}$, a batch size of 64, and 50 training epochs. ACAMatcher is trained using AdamW with a learning rate of $1 \times 10^{-4}$, a batch size of 16, and 30 epochs.
We employ a two-stage training strategy. First, RS2FNet is trained using pseudo-labels from random affine augmentation and a composite loss including spatial consistency, affine invariance, and heatmap supervision (weights $\lambda_1 = 1.0$, $\lambda_2 = 1.0$, $\lambda_3 = 0.3$). Then, RS2FNet parameters are frozen, and ACAMatcher is trained using a combined loss of similarity, matchability confidence, and loop consistency (weights $\lambda_4 = 1.0$, $\lambda_5 = 0.5$, $\lambda_6 = 0.3$).
ACAMatcher employs a 4-layer Transformer with dynamic pruning. The pruning threshold $\theta$ is set to 0.5. To balance accuracy and efficiency, we retain the top 500 keypoints per image based on response score and reduce the descriptor dimension to 256.
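A minimal sketch of the two-stage optimizer setup is shown below; the two small modules are placeholders standing in for RS2FNet and ACAMatcher and serve only to illustrate the freezing step and the optimizer choices quoted above.

```python
import torch
import torch.nn as nn

# Placeholders standing in for RS2FNet and ACAMatcher; only the optimizer settings
# and the stage-2 freezing step reflect the training schedule described in the text.
rs2fnet = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU())   # placeholder backbone
acamatcher = nn.Sequential(nn.Linear(256, 256), nn.ReLU())           # placeholder matcher

opt_stage1 = torch.optim.Adam(rs2fnet.parameters(), lr=1e-3)         # stage 1: batch size 64, 50 epochs
opt_stage2 = torch.optim.AdamW(acamatcher.parameters(), lr=1e-4)     # stage 2: batch size 16, 30 epochs

# Stage 2: freeze the feature extractor before optimizing the matcher
for p in rs2fnet.parameters():
    p.requires_grad = False
rs2fnet.eval()
```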

4.3. Experimental Results and Analysis

To comprehensively evaluate the performance and robustness of the proposed ACAMatch method for multimodal remote sensing image matching, three categories of experiments were designed: feature similarity analysis, rotation and scale invariance tests, and overall registration performance evaluation. These experiments systematically compare different methods from the perspectives of feature representation, geometric adaptability, and registration effectiveness.
(1) Feature Similarity Analysis
The goal of the feature similarity experiment is to assess the consistency of cross-modal feature representations extracted by different algorithms. We selected a pair of matching keypoints from an optical–SAR image pair and compared their descriptors generated by different methods. The descriptor values were plotted against feature dimension indices to observe their trends and alignment. Three representative algorithms were included for comparison: the traditional handcrafted RIFT [14] method and two deep learning-based approaches—CMM-Net [18] and MatchosNet [39].
Line plots were used to visualize the normalized descriptor values of the same keypoints across dimensions, intuitively reflecting the continuity, principal component distribution, and structural alignment of features in high-dimensional space. A higher degree of overlap between the two curves indicates greater feature consistency, suggesting stronger cross-modal alignment and higher matching confidence. To enhance interpretability, we visualized selected feature dimensions, as shown in Figure 6.
As shown in Figure 6, ACAMatch (Figure 6d) produces highly synchronized trends in the descriptor values across modalities, with consistent peaks and troughs, demonstrating strong modality invariance and feature stability. In contrast, RIFT (Figure 6a) shows significant dispersion across multiple dimensions, indicating limited adaptability to complex radiometric variations. MatchosNet (Figure 6b) exhibits slightly better trend alignment but still suffers from local mismatches, while CMM-Net (Figure 6c) displays highly divergent curves with intense fluctuations, suggesting weak structural consistency. Overall, ACAMatch demonstrates superior ability in extracting structurally stable and modality-invariant descriptors, performing robustly even under significant modality discrepancies and noise interference in remote sensing scenarios.
(2) Scale and Rotation Invariance Test
To validate the robustness of ACAMatch against image scale variation and rotational perturbations, we designed typical simulation experiments and real-world experiments. Two representative cross-modal optical–SAR image pairs were selected—one from our custom-built test set and another from an unseen Gaofen (GF) Series Satellites dataset—to assess the generalization ability of ACAMatch on previously unseen data. The first image pair covers an urban area, while the second involves a complex terrain region with significant resolution differences (optical: 2 m; SAR: 5 m). The image sizes are 3190 × 3465 (optical) and 1306 × 1419 (SAR), respectively. As shown in Figure 7, five experimental setups were used to simulate various geometric perturbations: rotation by 5° and 10°, scaling by 0.8×, resolution difference (2 m vs. 5 m), and resolution difference combined with a 20° rotation.
To evaluate robustness under scale and rotation transformations, three representative baseline methods were selected for comparison with the proposed ACAMatch: the traditional handcrafted method RIFT [14] (combined with RANSAC-based geometric correction), the end-to-end deep matching framework MatchosNet [39], and the sparse keypoint-based approach SuperPoint+LightGlue [38,47]. LightGlue was specifically chosen in this setting due to its superior efficiency and robustness in handling geometric variations. All methods were implemented based on their official open-source code, using recommended pretrained weights and default parameter settings to ensure fairness and reproducibility. Detailed results are presented in Table 1.
The five experimental setups correspond to: (1) rotation of 5°, (2) rotation of 10°, (3) scaling by 0.8×, (4) resolution difference (2 m vs. 5 m), and (5) resolution difference combined with a 20° rotation. As shown in Table 1, ACAMatch consistently achieves the best matching performance across all scenarios. It detects the highest number of correct matches (NCM), significantly outperforming both traditional and deep learning-based methods. Its correct match rate (CMR) exceeds 0.80 in all cases and reaches up to 0.85, demonstrating high accuracy. RMSE values remain the lowest under all conditions, with 1.54 in the resolution difference group and 2.01 in the resolution difference + rotation group, indicating excellent geometric consistency. In terms of efficiency, ACAMatch leverages lightweight feature extraction and dynamic inference to maintain high accuracy with competitive speed. Although SuperPoint+LightGlue achieves slightly faster inference in some cases, it sacrifices accuracy, failing to balance precision and efficiency. To better illustrate the trade-off between inference efficiency and matching accuracy, Figure 8 presents a scatter plot of inference time (T) versus the number of correct matches (NCM) for all methods across the five test conditions.
For the RIFT method, no valid matches were produced in Groups 4 and 5. This is mainly due to the larger image sizes and the significant modality differences between optical and SAR images caused by speckle noise and structural distortions. These factors impaired the gradient-based keypoint detection mechanism of RIFT, rendering it ineffective under such challenging conditions. As a result, its performance is not reported in the corresponding table entries or in the figure. Visual comparisons of feature matching results under the five transformations are shown in Figure 9.
From the visual results, under mild perturbations (5°, 10° rotation and 0.8× scaling), RIFT detects some correct correspondences but struggles in low-texture or complex regions due to its reliance on gradient-based features. MatchosNet and SuperPoint+LightGlue provide richer matches, but the former lacks strong descriptor representation, and the latter degrades under combined large-scale and rotation distortions. Particularly in the 4th and 5th scenarios, both RIFT and MatchosNet perform poorly, with reduced correct matches and increased false matches. In contrast, ACAMatch consistently detects a large number of accurate and spatially distributed matches, confirming its robustness and stability under multimodal matching with scale and rotation perturbations.
(3) Registration Performance Evaluation
In this section, a traditional image registration method, CFOG [11], and three representative deep learning-based approaches, MatchosNet [39], SuperPoint+SuperGlue [38,46] (SuperGlue is included in this evaluation due to its superior accuracy in dense feature matching tasks), and ADRNet [19], are selected as baselines. All baseline methods are implemented using their publicly available source code and executed with the officially recommended parameters and pretrained weights. Six test image pairs were used: Group 1 from the SEN1-2 dataset, Group 2 from the WHU-OPT-SAR dataset, Groups 3 and 4 from our custom dataset, and Groups 5 and 6 from unseen scenes with 5 m and 10 m resolutions, respectively. Table 2 and Table 3 present the quantitative evaluation results in terms of the number of correct matches (NCM), correct match rate (CMR), root mean square error (RMSE), and average runtime (T). ACAMatch outperforms all baseline methods across all evaluation metrics.
Figure 10 presents four bar charts, where (a–d) correspond to the NCM, CMR, RMSE, and runtime metrics, respectively, providing a more intuitive view of the comprehensive advantages of ACAMatch across all indicators.
ACAMatch consistently demonstrates superior matching and registration capability across all six scenes. It extracts a large number of reliable and spatially distributed correspondences, achieving the highest CMR values—mostly above 0.80—while maintaining the lowest RMSE (mostly ≤ 2 pixels). CFOG struggles with limited correct matches and longer runtimes, particularly under strong speckle noise. Although other deep learning methods occasionally extract more matches, their performance in accuracy and robustness is inferior to ACAMatch. Visual matching results are shown in Figure 11.
ACAMatch yields dense and well-distributed correspondences across the entire image extent, even under weak texture and strong noise conditions (e.g., Groups 3 and 5). To further highlight the precision of registration, we provide checkerboard visualizations in Figure 12, where edges from registered images align seamlessly, confirming the subpixel accuracy of ACAMatch.

5. Discussion

To evaluate the effectiveness of the proposed architectural components, loss functions, and hyperparameter settings in robust optical–SAR image matching, a series of ablation studies and parameter analyses were conducted based on multiple benchmark datasets.

5.1. Effectiveness of Core Modules

Ablation studies were conducted using test sets from three datasets (SEN1-2, WHU-OPT-SAR, and our custom dataset) to assess the impact of key components in RS2FNet and ACAMatcher. These include the following:
(a)
The dual-stage Res2Net hierarchical feature extractor.
(b)
The BRA bi-branch attention fusion module.
(c)
Context-guided bidirectional attention in ACAMatcher.
(d)
Confidence-based dynamic pruning mechanism.
Each module was removed independently while keeping all other components and training settings unchanged. Results are shown in Table 4.
All four components significantly contribute to performance. Removing context attention leads to the largest RMSE increase and the lowest NCM, showing its importance for structural consistency in cross-modal matching. The Res2Net structure notably boosts the number of reliable matches, and dynamic pruning enhances efficiency with only minor impact on accuracy.

5.2. Contribution of Loss Function Components

The role of loss components in RS2FNet and ACAMatcher was evaluated by progressively removing terms from the full loss setup. Table 5 presents the results.
In RS2FNet, both affine and spatial consistency terms greatly enhance feature robustness and alignment. Heatmap supervision alone is insufficient. In ACAMatcher, matching similarity is the primary supervision signal, while matchability and loop consistency regularizers improve correspondence reliability and bidirectional geometric consistency.
The following default loss weights are adopted to balance gradients: λ1 = 1.0 (spatial), λ2 = 1.0 (affine), λ3 = 0.3 (heatmap), λ4 = 1.0 (similarity), λ5 = 0.5 (matchability), and λ6 = 0.3 (loop consistency). The smaller weights on the weaker constraints prevent them from dominating training. We further examine the sensitivity of λ6.
As shown in Figure 13, RMSE decreases and NCM increases as λ6 grows from 0 to 0.3, peaking at λ6 = 0.3, and then slightly degrading. Thus, a moderate λ6 best balances precision and generalization.
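As a minimal sketch of how these weights would combine the six terms into a single training objective, assuming each term has already been computed as a scalar tensor for the current batch (the dictionary interface and names are illustrative, not the paper's implementation):

```python
import torch

# Default weights from the text: lambda_1 ... lambda_6.
LOSS_WEIGHTS = {
    "spatial": 1.0, "affine": 1.0, "heatmap": 0.3,        # RS2FNet terms
    "similarity": 1.0, "matchability": 0.5, "loop": 0.3,  # ACAMatcher terms
}

def total_loss(terms: dict) -> torch.Tensor:
    """Weighted sum of the individual loss terms.

    `terms` maps each component name to its scalar loss tensor,
    e.g. {"spatial": l_sp, "affine": l_af, ...}.
    """
    return sum(LOSS_WEIGHTS[name] * value for name, value in terms.items())
```

With this weighting, the weaker regularizers (heatmap, matchability, loop consistency) contribute proportionally smaller gradients, consistent with the balance described above.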

5.3. Sensitivity of Key Hyperparameters

Two critical hyperparameters were analyzed: the pruning threshold θ and the Top-k limit on the number of keypoints admitted to matching. Table 6 summarizes the results.
The dynamic pruning threshold θ (confidence threshold) governs early termination and pruning of keypoints, influencing the quality of preserved matches and overall computational efficiency. As θ increases from 0.3 to 0.7, inference time is significantly reduced while RMSE remains relatively stable. However, NCM shows a noticeable decline at θ = 0.7 due to premature pruning of valid matches. The default value θ = 0.5 achieves the best balance between accuracy and efficiency.
Top-k defines the maximum number of keypoints allowed into the matching phase. Setting Top-k too high (e.g., 800) increases matching density and NCM but results in longer inference time. Conversely, a low Top-k (e.g., 300) reduces runtime but leads to significantly fewer matches. The default Top-k = 500 yields the optimal trade-off in accuracy and efficiency.
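For illustration only, the interplay between θ and Top-k during keypoint selection could look like the following sketch, assuming a one-dimensional tensor of per-keypoint confidences (the function name and interface are assumptions, not the actual ACAMatcher code):

```python
import torch

def select_keypoints(confidence: torch.Tensor, theta: float = 0.5, top_k: int = 500) -> torch.Tensor:
    """Return indices of keypoints kept for matching: discard those whose
    confidence is at or below theta, then cap the survivors at top_k,
    ranked by confidence."""
    keep = torch.nonzero(confidence > theta, as_tuple=False).squeeze(1)
    if keep.numel() > top_k:
        best = torch.argsort(confidence[keep], descending=True)[:top_k]
        keep = keep[best]
    return keep
```

Raising θ or lowering Top-k trims more candidates before the attention stages, which is consistent with the runtime savings and the drop in NCM reported in Table 6.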
In summary, the above analysis verifies the necessity of each architectural and training component. The combination of multi-scale feature extraction, attention-guided matching, confidence-aware filtering, and task-specific loss functions ensures high matching precision and robustness under challenging multimodal conditions.

6. Conclusions

In this article, we proposed ACAMatch, a deep learning-based matching framework for optical and SAR remote sensing image registration, composed of two major components: a structurally enhanced feature extraction network, RS2FNet, and a context-guided matching module, ACAMatcher. RS2FNet leverages multi-scale structural encoding via Res2Net and region-aware attention modeling via BRA to generate robust and distinctive descriptors. ACAMatcher fuses self- and cross-attention mechanisms with a confidence-driven pruning strategy to achieve efficient and accurate correspondence estimation. Additionally, a match-aware multi-task loss is introduced to reinforce the stability and geometric consistency of feature representations, improving both the interpretability and the generalization capacity of the matching process. Extensive experiments on several public datasets and a self-collected Gaofen (GF) dataset demonstrate that ACAMatch consistently outperforms existing state-of-the-art methods in matching quantity, accuracy, and inference speed. The method is notably robust to structural distortions, resolution inconsistencies, and the speckle interference commonly found in SAR imagery. This adaptability makes ACAMatch a reliable solution for large-scale remote sensing tasks, including emergency response, change detection, and scene interpretation under challenging all-weather conditions.
In future work, we plan to extend the applicability of ACAMatch to other multimodal settings, such as infrared, thermal infrared, or hyperspectral-to-SAR matching. We will also explore the incorporation of auxiliary data sources, such as DEMs or vector maps, to enhance geometric reasoning in complex terrain. Moreover, we aim to further optimize the model for lightweight deployment and efficient inference, facilitating its integration into real-world remote sensing systems operating under resource constraints.

Author Contributions

Q.K., J.Z., G.H. and F.L. jointly conceived this paper. Q.K. conducted the experiments and analysis and wrote the manuscript. J.Z., G.H. and F.L. revised the manuscript, supervised the work, and provided critical comments. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Department of Science and Technology of Zhejiang Province, China (No. 2025C01073), under the Project “Key Technologies and Application Demonstration of Precise Air-Space Information Extraction in Three-Dimensional Spatiotemporal Scenarios”.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lu, S.; Guo, J.; Zimmer-Dauphinee, J.R.; Nieusma, J.M.; Wang, X.; van Valkenburgh, P.; Wernke, S.A.; Huo, Y. Vision Foundation Models in Remote Sensing: A Survey. IEEE Geosci. Remote Sens. Mag. 2025, 2–27. [Google Scholar] [CrossRef]
  2. Peng, D.; Liu, X.; Zhang, Y.; Guan, H.; Li, Y.; Bruzzone, L. Deep Learning Change Detection Techniques for Optical Remote Sensing Imagery: Status, Perspectives and Challenges. Int. J. Appl. Earth Obs. Geoinf. 2025, 136, 104282. [Google Scholar] [CrossRef]
  3. Jiang, X. A Review of Multimodal Image Matching: Methods and Applications. Inf. Fusion 2021, 73, 22–71. [Google Scholar] [CrossRef]
  4. Zhang, W.; Mei, J.; Wang, Y. DMDiff: A Dual-Branch Multimodal Conditional Guided Diffusion Model for Cloud Removal Through SAR-Optical Data Fusion. Remote Sens. 2025, 17, 965. [Google Scholar] [CrossRef]
  5. Zhu, Q.; Guo, X.; Li, Z.; Li, D. A Review of Multi-Class Change Detection for Satellite Remote Sensing Imagery. Geo-Spa. Inf. Sci. 2022, 27, 1–15. [Google Scholar] [CrossRef]
  6. Gao, F.; Jin, X.; Zhou, X.; Dong, J.; Du, Q. MSFMamba: Multiscale Feature Fusion State Space Model for Multisource Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5504116. [Google Scholar] [CrossRef]
  7. Zhu, B.; Zhou, L.; Pu, S.; Fan, J.; Ye, Y. Advances and Challenges in Multimodal Remote Sensing Image Registration. IEEE J. Miniaturiz. Air Space Syst. 2023, 4, 165–174. [Google Scholar] [CrossRef]
  8. Cole-Rhodes, A.A.; Johnson, K.L.; LeMoigne, J.; Zavorin, I. Multiresolution Registration of Remote Sensing Imagery by Optimization of Mutual Information Using a Stochastic Gradient. IEEE Trans. Image Process. 2003, 12, 1495–1511. [Google Scholar] [CrossRef]
  9. Martinez, A.; Garcia-Consuegra, J.; Abad, F. A Correlation-Symbolic Approach to Automatic Remotely Sensed Image Rectification. In Proceedings of the IEEE 1999 International Geoscience and Remote Sensing Symposium. IGARSS’99 (Cat. No.99CH36293), Hamburg, Germany, 28 June–2 July 1999; Volume 1, pp. 336–338. [Google Scholar]
  10. Ye, Y.; Shen, L. HOPC: A Novel Similarity Metric Based on Geometric Structural Properties for Multi-Modal Remote Sensing Image Matching. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, III–1, 9–16. [Google Scholar] [CrossRef]
  11. Ye, Y.; Bruzzone, L.; Shan, J.; Bovolo, F.; Zhu, Q. Fast and Robust Matching for Multimodal Remote Sensing Image Registration. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9059–9070. [Google Scholar] [CrossRef]
  12. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vision 2004, 60, 91–110. [Google Scholar] [CrossRef]
  13. Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar] [CrossRef]
  14. Li, J.; Hu, Q.; Ai, M. RIFT: Multi-Modal Image Matching Based on Radiation-Variation Insensitive Feature Transform. IEEE Trans. Image Process. 2020, 29, 3296–3310. [Google Scholar] [CrossRef] [PubMed]
  15. Xiang, Y.; Wang, F.; You, H. OS-SIFT: A Robust SIFT-Like Algorithm for High-Resolution Optical-to-SAR Image Registration in Suburban Areas. IEEE Trans. Geosci. Remote Sens. 2018, 56, 3078–3090. [Google Scholar] [CrossRef]
  16. Han, X.; Leung, T.; Jia, Y.; Sukthankar, R.; Berg, A.C. MatchNet: Unifying Feature and Metric Learning for Patch-Based Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3279–3286. [Google Scholar]
  17. Mishchuk, A.; Mishkin, D.; Radenovic, F.; Matas, J. Working Hard to Know Your Neighbor's Margins: Local Descriptor Learning Loss. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
  18. Lan, C.; Lu, W.; Yu, J.; Xu, Q. Deep Learning Algorithm for Feature Matching of Cross Modality Remote Sensing Images. Cehui Xuebao/Acta Geod. Cartogr. Sin. 2021, 50, 189–202. [Google Scholar]
  19. Xiao, Y.; Zhang, C.; Chen, Y.; Jiang, B.; Tang, J. ADRNet: Affine and Deformable Registration Networks for Multimodal Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  20. Quan, D.; Wang, S.; Liang, X.; Wang, R.; Fang, S.; Hou, B.; Jiao, L. Deep Generative Matching Network for Optical and SAR Image Registration. In Proceedings of the IGARSS 2018–2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 6215–6218. [Google Scholar]
  21. Du, W.-L.; Zhou, Y.; Zhao, J.; Tian, X. K-Means Clustering Guided Generative Adversarial Networks for SAR-Optical Image Matching. IEEE Access 2020, 8, 217554–217572. [Google Scholar] [CrossRef]
  22. Gao, S.-H.; Cheng, M.-M.; Zhao, K.; Zhang, X.-Y.; Yang, M.-H.; Torr, P. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652–662. [Google Scholar] [CrossRef]
  23. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R. BiFormer: Vision Transformer with Bi-Level Routing Attention. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 10323–10333. [Google Scholar]
  24. Brown, L.G. A Survey of Image Registration Techniques. ACM Comput. Surv. 1992, 24, 325–376. [Google Scholar] [CrossRef]
  25. Shu, L.; Tan, T. SAR and SPOT Image Registration Based on Mutual Information with Contrast Measure. In Proceedings of the 2007 IEEE International Conference on Image Processing, San Antonio, TX, USA, 16–19 September 2007; Volume 5, pp. V-429–V-432. [Google Scholar]
  26. Fan, X.; Rhody, H.; Saber, E. Automatic Registration of Multisensor Airborne Imagery. In Proceedings of the 34th Applied Imagery and Pattern Recognition Workshop (AIPR’05), Washington, DC, USA, 19 October–21 December 2005; pp. 6–86. [Google Scholar]
  27. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  28. Liu, X.; Chen, S.; Zhuo, L.; Li, J.; Huang, K. Multi-Sensor Image Registration by Combining Local Self-Similarity Matching and Mutual Information. Front. Earth Sci. 2018, 12, 779–790. [Google Scholar] [CrossRef]
  29. Li, S.; Wang, Q.; Li, J. Improved ORB Matching Algorithm Based on Adaptive Threshold. J. Phys. Conf. Ser. 2021, 1871, 012151. [Google Scholar] [CrossRef]
  30. Dellinger, F.; Delon, J.; Gousseau, Y.; Michel, J.; Tupin, F. SAR-SIFT: A SIFT-Like Algorithm for SAR Images. IEEE Trans. Geosci. Remote Sens. 2015, 53, 453–466. [Google Scholar] [CrossRef]
  31. Sobel, I. Neighborhood Coding of Binary Images for Fast Contour Following and General Binary Array Processing. Comput. Graph. Image Process. 1978, 8, 127–135. [Google Scholar] [CrossRef]
  32. Fjortoft, R.; Lopes, A.; Marthon, P.; Cubero-Castan, E. An Optimal Multiedge Detector for SAR Image Segmentation. IEEE Trans. Geosci. Remote Sens. 1998, 36, 793–802. [Google Scholar] [CrossRef]
  33. Burns, J.B.; Hanson, A.R.; Riseman, E.M. Extracting Straight Lines. IEEE Trans. Pattern Anal. Mach. Intell. 1986, PAMI-8, 425–455. [Google Scholar] [CrossRef]
  34. Xu, G.; Wu, Q.; Cheng, Y.; Yan, F.; Li, Z.; Yu, Q. A Robust Deformed Image Matching Method for Multi-Source Image Matching. Infrared Phys. Technol. 2021, 115, 103691. [Google Scholar] [CrossRef]
  35. Sui, H.; Xu, C.; Liu, J.; Hua, F. Automatic Optical-to-SAR Image Registration by Iterative Line Extraction and Voronoi Integrated Spectral Point Matching. IEEE Trans. Geosci. Remote Sens. 2015, 53, 6058–6072. [Google Scholar] [CrossRef]
  36. Teague, M.R. Image Analysis via the General Theory of Moments. JOSA 1980, 70, 920–930. [Google Scholar] [CrossRef]
  37. Matas, J.; Chum, O.; Urban, M.; Pajdla, T. Robust Wide-Baseline Stereo from Maximally Stable Extremal Regions. Image Vision Comput. 2004, 22, 761–767. [Google Scholar] [CrossRef]
  38. DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-Supervised Interest Point Detection and Description. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Piscataway, NJ, USA, 2018. [Google Scholar]
  39. Liao, Y.; Di, Y.; Zhou, H.; Li, A.; Liu, J.; Lu, M.; Duan, Q. Feature Matching and Position Matching Between Optical and SAR with Local Deep Feature Descriptor. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 448–462. [Google Scholar] [CrossRef]
  40. Ye, Y.; Yang, C.; Gong, G.; Yang, P.; Quan, D.; Li, J. Robust Optical and SAR Image Matching Using Attention-Enhanced Structural Features. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–12. [Google Scholar] [CrossRef]
  41. Li, Y.; Wu, D.; Cui, Y.; He, P.; Zhang, Y.; Wang, R. A Robust Multisource Remote Sensing Image Matching Method Utilizing Attention and Feature Enhancement Against Noise Interference. arXiv 2024, arXiv:2410.11848. [Google Scholar] [CrossRef]
  42. Lv, C.; Wang, W.; Quan, D.; Wang, S.; Dong, L.; Jiang, X.; Gu, Y.; Jiao, L. Fourier Domain Adaptive Multi-Modal Remote Sensing Image Template Matching Based on Siamese Network. In Proceedings of the IGARSS 2024–2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 7325–7329. [Google Scholar]
  43. Han, Q.; Zhi, X.; Jiang, S.; Chen, W.; Huang, Y.; Yu, L.; Zhang, W. A Siamese Network via Cross-Domain Robust Feature Decoupling for Multi-Source Remote Sensing Image Registration. Remote Sens. 2025, 17, 646. [Google Scholar] [CrossRef]
  44. Quan, D.; Wang, Z.; Lv, C.; Wang, S.; Li, Y.; Ren, B.; Chanussot, J.; Jiao, L. LM-Net: A Lightweight Matching Network for Remote Sensing Image Matching and Registration. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  45. Quan, D.; Wang, Z.; Wang, S.; Li, Y.; Ren, B.; Kang, M.; Chanussot, J.; Jiao, L. F3Net: Adaptive Frequency Feature Filtering Network for Multimodal Remote Sensing Image Registration. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  46. Sarlin, P.-E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning Feature Matching with Graph Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 4937–4946. [Google Scholar]
  47. Lindenberger, P.; Sarlin, P.-E.; Pollefeys, M. LightGlue: Local Feature Matching at Light Speed. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar]
  48. van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2019, arXiv:1807.03748. [Google Scholar] [CrossRef]
  49. Zhang, H.; Lei, L.; Ni, W.; Tang, T.; Wu, J.; Xiang, D.; Kuang, G. Optical and SAR Image Matching Using Pixelwise Deep Dense Features. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  50. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Curran Associates, Inc.: New York, NY, USA, 2014; Volume 27. [Google Scholar]
  51. Wu, B.; Wang, H.; Zhang, C.; Chen, J. Optical-to-SAR Translation Based on CDA-GAN for High-Quality Training Sample Generation for Ship Detection in SAR Amplitude Images. Remote Sens. 2024, 16, 3001. [Google Scholar] [CrossRef]
  52. Bai, X.; Xu, F. Accelerating Diffusion for SAR-to-Optical Image Translation via Adversarial Consistency Distillation. arXiv 2024, arXiv:2407.06095. [Google Scholar]
  53. Luo, Q.; Li, H.; Chen, Z.; Li, J. ADD-UNet: An Adjacent Dual-Decoder UNet for SAR-to-Optical Translation. Remote Sens. 2023, 15, 3125. [Google Scholar] [CrossRef]
  54. Revaud, J.; De Souza, C.; Humenberger, M.; Weinzaepfel, P. R2D2: Reliable and Repeatable Detector and Descriptor. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: New York, NY, USA, 2019; Volume 32. [Google Scholar]
  55. Tyszkiewicz, M.; Fua, P.; Trulls, E. DISK: Learning Local Features with Policy Gradient. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: New York, NY, USA, 2020; Volume 33, pp. 14254–14265. [Google Scholar]
  56. Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-Free Local Feature Matching with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 8918–8927. [Google Scholar]
  57. Chen, X.; Fan, H.; Girshick, R.; He, K. Improved Baselines with Momentum Contrastive Learning. arXiv 2020, arXiv:2003.04297. [Google Scholar] [CrossRef]
  58. Schmitt, M.; Hughes, L.H.; Zhu, X.X. The Sen1-2 Dataset for Deep Learning in Sar-Optical Data Fusion. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2018, IV–1, 141–146. [Google Scholar] [CrossRef]
  59. Li, X.; Zhang, G.; Cui, H.; Hou, S.; Wang, S.; Li, X.; Chen, Y.; Li, Z.; Zhang, L. MCANet: A Joint Semantic Segmentation Framework of Optical and SAR Images for Land Use Classification. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102638. [Google Scholar] [CrossRef]
Figure 1. Example of optical and SAR image pairs. (a) Radiometric discrepancy in water regions; (b) Inconsistent texture patterns in urban areas; (c) Structural blurring due to speckle noise in SAR images.
Figure 2. Overall Flowchart of the Matching Framework ACAMatch.
Figure 3. Overall architecture of RS2FNet.
Figure 4. Structure diagram of the two-stage Res2Net-BRA encoding module. s2 denotes stride = 2.
Figure 5. Overall architecture of ACAMatcher. It comprises two main components: a bidirectional attention module that integrates self- and cross-attention guided by BRA features, and a dynamic pruning module that predicts confidence via gated fusion for sparse and reliable matching.
Figure 6. Comparison of feature vectors from the same matched points under different methods.
Figure 7. Optical (left) and SAR (right) image pairs used for scale and rotation invariance testing.
Figure 8. Trade-off between inference time and matching performance (NCM) under five test conditions.
Figure 9. Feature matching results of different methods.
Figure 10. Comparison of metrics across six test groups using bar charts. (a) Number of correct matches (NCM). (b) Correct matching ratio (CMR). (c) Root mean square error (RMSE) of the estimated transformation. (d) Running time (in seconds) of each method.
Figure 11. Visual registration results across six test pairs.
Figure 12. Registration results and zoom-in checkerboard overlays for the six test pairs using ACAMatch. (a) Group 1: Farmland area with strip noise and low texture. (b) Group 2: Urban-rural transition zone with structural distortion. (c) Group 3: Suburban region with weak textures and SAR speckle. (d) Group 4: Riverine area with vegetation boundaries. (e) Group 5: Dense built-up area with severe SAR noise. (f) Group 6: Agricultural scene with complex patterns and slight misalignment.
Figure 13. The impact of hyperparameter λ6 on registration performance.
Table 1. Matching performance under rotational and scale variations. Each cell lists NCM/CMR/RMSE/T; "–" indicates no result reported.

Group | RIFT              | MatchosNet         | SuperPoint+LightGlue | ACAMatch
1     | 29/0.56/2.45/3.65 | 32/0.58/2.43/2.56  | 60/0.64/2.19/2.34    | 87/0.76/1.91/1.79
2     | 17/0.49/2.49/3.66 | 65/0.67/2.37/2.57  | 67/0.63/2.29/2.34    | 90/0.85/1.98/1.79
3     | 60/0.66/2.33/2.99 | 71/0.68/2.30/2.52  | 163/0.71/2.11/2.01   | 202/0.81/1.89/1.76
4     | –                 | 169/0.69/1.77/5.11 | 367/0.72/1.70/4.02   | 878/0.83/1.54/4.34
5     | –                 | 82/0.70/2.15/5.01  | 289/0.71/2.34/4.77   | 677/0.81/2.01/4.81
Table 2. NCM and CMR evaluation results of different methods on six test image pairs. Each cell lists NCM/CMR.

Group | CFOG    | MatchosNet | Sp+Sl    | ADRNet   | ACAMatch
1     | 27/0.49 | 52/0.55    | 109/0.75 | 214/0.80 | 288/0.83
2     | 72/0.56 | 98/0.69    | 157/0.78 | 277/0.80 | 355/0.88
3     | 11/0.40 | 42/0.49    | 38/0.47  | 149/0.78 | 273/0.86
4     | 33/0.50 | 43/0.52    | 63/0.53  | 141/0.71 | 234/0.80
5     | 19/0.49 | 33/0.50    | 58/0.67  | 109/0.71 | 208/0.81
6     | 58/0.56 | 69/0.56    | 73/0.62  | 125/0.70 | 184/0.78
Table 3. RMSE and time evaluation results of different methods on six test image pairs. Each cell lists RMSE (pixels)/T (s).

Group | CFOG       | MatchosNet | Sp+Sl     | ADRNet    | ACAMatch
1     | 11.93/3.94 | 8.03/3.66  | 5.99/2.34 | 2.36/2.15 | 1.96/1.91
2     | 7.93/4.02  | 5.10/3.87  | 5.60/2.89 | 1.98/2.25 | 1.69/1.93
3     | 13.45/3.91 | 8.88/3.14  | 8.27/2.93 | 2.22/2.21 | 1.81/1.92
4     | 9.80/4.56  | 7.45/4.43  | 6.69/3.21 | 2.69/2.17 | 2.20/1.89
5     | 10.38/4.14 | 9.15/4.13  | 7.65/3.68 | 3.38/2.14 | 2.25/1.98
6     | 8.95/5.67  | 8.25/5.16  | 7.13/3.59 | 3.47/1.93 | 2.13/2.06
Table 4. Impact of core module removal on registration performance.

Model Variant       | RMSE | NCM | T (ms)
Full model          | 1.89 | 285 | 90
– Res2Net           | 2.34 | 198 | 85
– BRA Attention     | 2.11 | 170 | 88
– Context Attention | 2.59 | 164 | 86
– Dynamic Pruning   | 2.07 | 258 | 118
Table 5. Loss component ablation for RS2FNet and ACAMatcher.

RS2FNet loss combination | RMSE | NCM
Full Loss                | 1.89 | 285
– Affine Consistency     | 2.17 | 266
– Spatial Consistency    | 2.25 | 252
Heatmap Supervision Only | 2.42 | 220

ACAMatcher loss combination    | RMSE | NCM
Full Loss                      | 1.89 | 285
– Matchability Loss            | 2.06 | 255
– Loop Consistency Regularizer | 2.13 | 242
Matching Similarity Only       | 2.31 | 211
Table 6. Effect of pruning threshold θ and Top-k selection.

Hyperparameter | Value | RMSE | NCM | T (ms)
θ              | 0.3   | 1.93 | 289 | 98
θ              | 0.5   | 1.89 | 285 | 90
θ              | 0.7   | 1.85 | 269 | 92
Top-k          | 300   | 2.04 | 257 | 78
Top-k          | 500   | 1.89 | 285 | 90
Top-k          | 800   | 1.87 | 309 | 147
