MMARNet: Two-Stage Remote Sensing Image Registration with Multimodal Attention Mechanism

Liu, Xiangzeng; Shi, Guanglu; Huang, Zhipeng; Ji, Jian; Miao, Qiguang

doi:10.3390/rs18121983

by Xiangzeng Liu *, Guanglu Shi, Zhipeng Huang, Jian Ji and Qiguang Miao

School of Computer Science and Technology, Xidian University, Xi’an 710071, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(12), 1983; https://doi.org/10.3390/rs18121983 (registering DOI)

Submission received: 20 May 2026 / Revised: 10 June 2026 / Accepted: 11 June 2026 / Published: 15 June 2026

Highlights

What are the main findings?

A geometric transformation prediction (GTP) module is developed by utilizing a dynamic adaptive sparse attention mechanism to capture prominent feature regions, thereby enabling accurate estimation and compensation for large-scale geometric transformations between the input images.
A local feature refinement (LFR) module is constructed by leveraging a feature extraction network with a super token transformer attention mechanism. The high-precision keypoint-level features extracted by this module can then be used to establish accurate correspondences across highly variable modalities.

What are the implications of the main findings?

The proposed model can be applied in the field of multimodal remote sensing image matching and registration for pixel-level spatial coordinate alignment in multimodal image fusion and change detection.
The proposed method can also be applied in the field of unmanned aerial vehicle navigation in restricted environments, providing a technical foundation for its cross-modal geographic positioning.

Abstract

Multimodal image registration is a fundamental yet challenging task, particularly in remote sensing scenarios involving cross-platform, multi-temporal, and cross-modal data. The primary difficulty arises from the coexistence of large-scale geometric distortions and complex local appearance variations across modalities, which makes it difficult for a single-stage model to achieve both global alignment and fine-grained correspondence simultaneously. To address this issue, we propose MMARNet, a task-driven coarse-to-fine registration framework that explicitly decomposes multimodal registration into global geometric alignment and local correspondence refinement. Instead of treating registration as a unified problem, the proposed framework sequentially resolves distinct sources of error, leading to improved robustness and accuracy under challenging conditions. In the first stage, MMARNet learns geometry-aware global alignment by identifying structurally reliable regions across modalities and estimating large-scale transformations, effectively reducing the initial misalignment and normalizing the geometric space. In the second stage, the model focuses on residual local discrepancies by learning context-enhanced feature representations, enabling robust keypoint-level matching even under severe modality differences and nonlinear distortions. The two stages are designed to work in a complementary manner, where global alignment significantly simplifies the subsequent local matching process. Extensive experiments on three challenging multimodal datasets demonstrate that MMARNet achieves superior performance in both accuracy and robustness compared to existing methods. The results validate the effectiveness of the proposed problem decomposition and highlight the advantage of the coarse-to-fine optimization strategy for multimodal remote sensing image registration.

Keywords: multimodal feature learning; multimodal image registration; detection and description; multimodal remote sensing image

1. Introduction

Multimodal image registration is the process of spatially aligning and superimposing images of the same scene acquired from different sensors, at different times, from various viewpoints, or across diverse platforms, aiming to achieve high-precision spatial alignment of multi-source information. This technology holds significant research value in the fields of image processing and computer vision and is widely applied in real-world scenarios, such as visual navigation [1], autonomous driving [2], image fusion [3], geographic information systems [4], and medical imaging analysis [5]. Due to the substantial nonlinear differences between multimodal images, including variations in spectral characteristics, geometric structure, scale, and resolution, multimodal image registration presents considerable challenges, making it a key area of focus in both academic research and engineering applications.

Currently, multimodal remote sensing image (MRSI) registration faces several significant challenges [6,7,8]. Remote sensing images acquired from different viewpoints or platforms (such as airborne or spaceborne sensors) often exhibit large-scale geometric distortions, making it extremely difficult to extract and maintain geometrically invariant features. Additionally, images captured at different times or by different sensors may differ significantly in contrast, which leads to inconsistent brightness and tonal information, degrades feature-based representations, and further complicates precise registration. To address these challenges, researchers have developed various MRSI registration methods that aim to improve robustness and accuracy by adapting to the characteristics and variability inherent to multimodal images.

Traditional feature-based image registration methods [9,10,11,12] achieve geometric alignment of images by extracting edges, corner points, and other geometric features and then solving the matching relationships to complete the registration process. These methods typically consist of four main stages: feature detection, feature description [12,13], feature matching [9], and outlier removal [14]. In these stages, handcrafted feature descriptors are used to identify and align key points in the images. Traditional feature-based methods are effective in handling complex scenarios, such as rotation, scale changes, and illumination variations, by utilizing geometric features like corners and edges for image matching. However, in multimodal image registration, significant differences in contrast, texture, and other factors can cause local features at the same physical location to distort or disappear due to variations in lighting or imaging mechanisms, which increases the difficulty of image registration.

In contrast, deep learning-based methods [15,16,17,18,19,20,21,22] have significantly advanced image registration by learning discriminative representations and more robust transformation priors. The early approaches primarily improve feature extraction or matching efficiency through hybrid designs, such as combining handcrafted cues with CNN-based representations or introducing detector-free matching pipelines. Building on this line, LoFTR [15] formulates dense correspondence estimation in a coarse-to-fine manner using self- and cross-attention, while SE2-LoFTR [21] further improves robustness to rotation variations. More recent methods increasingly emphasize stronger contextual modeling and cross-modal adaptability, including TopicFM [16], XFeat [17], and XoFTR [18]. In multimodal remote sensing scenarios, SwinMatcher [23], DGIM [24], and MINIMA [25] further improve matching robustness through transformer-based interaction, data generation, and modality-invariant learning, respectively, while DS-MAR [26] and GPDRNet [27] explore semantic-level alignment and geometry-preserving refinement for more challenging registrations. Despite these advances, the existing methods still struggle to jointly handle large-scale geometric transformations and residual local misalignments in highly heterogeneous multimodal images.

In this paper, we propose MMARNet, a task-driven coarse-to-fine framework for multimodal remote sensing image registration. As illustrated in Figure 1, the proposed framework explicitly decomposes registration into two complementary stages: global geometric alignment and local correspondence refinement. In the first stage, MMARNet performs geometry-aware coarse alignment by leveraging multi-scale structural consistency and sparse attention to estimate large-scale transformations and reduce severe initial misalignment. In the second stage, the framework refines residual local discrepancies through context-enhanced feature learning and robust keypoint correspondence estimation, enabling accurate matching under complex modality variations. By progressively resolving distinct sources of registration error, MMARNet achieves improved robustness and accuracy for multimodal remote sensing image registration. To better summarize the distinctions between representative registration methods and our approach, Table 1 compares their main strategies, strengths, and limitations. Our contributions are summarized as follows:

We propose a task-driven coarse-to-fine framework for multimodal remote sensing image registration, which explicitly decomposes the registration process into global geometric alignment and local correspondence refinement to address the distinct error characteristics of heterogeneous image pairs.
We introduce a geometry-aware global alignment mechanism that leverages multi-scale structural consistency and dynamic sparse attention to estimate large-scale transformations under severe modality discrepancies.
We develop a context-enhanced local refinement mechanism based on super token aggregation to learn robust feature representations and recover accurate keypoint correspondences for fine-grained multimodal registration.

The remainder of this paper is structured as follows: Section 2 introduces related work on registration. Section 3 provides a detailed description of our method. Section 4 presents the experimental results and offers relevant analysis, while Section 5 discusses the limitations and future work. Finally, Section 6 concludes the paper.

2. Related Work

2.1. Image Rectification

Image rectification aims to reduce geometric discrepancies before feature matching and plays an important role in improving registration robustness under large viewpoint, scale, and rotation variations. In multimodal remote sensing scenarios, significant geometric distortions often coexist with severe radiometric differences, making accurate rectification particularly challenging. Traditional rectification methods mainly rely on geometric transformations and handcrafted priors. For example, ASIFT [38] simulates multiple affine views to achieve full affine invariance, significantly improving matching performance under viewpoint changes. MODS [39] further improves efficiency through an adaptive iterative framework that progressively generates intermediate views and transformation hypotheses. Although effective, these approaches often require extensive geometric sampling and may still suffer from reduced robustness in complex multimodal scenarios.

To overcome these limitations, learning-based rectification methods have been increasingly explored. Early studies focused on learning geometric normalization directly from image patches. Yi et al. [40] proposed learning canonical orientations for local patches to facilitate descriptor extraction, while AffNet [41] introduced a learned affine shape estimator that improves geometric normalization and matching repeatability. Rau et al. [42] further demonstrated that overlap-aware scale correction can effectively improve matching accuracy. More recently, Dai et al. [43] proposed an affine estimation unit (AEU) that predicts relative affine transformations for feature pairs, enhancing correspondence robustness. Park et al. [44] designed a two-stream symmetric network with bidirectional ensemble learning for aerial image matching, while ScaleNet [45] introduced scene-independent scale estimation through keypoint distance ratio supervision.

Meanwhile, affine correspondence-based geometric estimation has recently attracted considerable attention. Compared with point correspondences, affine correspondences encode additional local geometric information, providing stronger constraints for transformation recovery. Guan et al. [46] demonstrated that affine correspondences enable efficient relative pose estimation for multi-camera systems using significantly fewer correspondences. Hruby et al. [47] further combined affine correspondences with monocular depth cues for semi-calibrated pose estimation. Yu et al. [48] incorporated affine corrections into monocular depth priors to improve geometric estimation accuracy, while Sun et al. [49] integrated dense matching and geometric constraints to learn high-quality affine correspondences. More recently, Guan et al. [50] proposed a complete solution to generalized relative pose estimation from affine correspondences. These studies highlight the effectiveness of affine correspondences as compact geometric representations that bridge local feature matching and global transformation estimation, providing valuable insights for robust image rectification and registration.

Recent deep learning-based rectification frameworks further integrate geometric correction and registration into unified architectures. ADRNet [51] combines affine and deformable registration modules with spatial transformation learning to jointly address global and local misalignment. Sun et al. [49] further introduce a dense matching framework that incorporates geometric constraints together with learned scale and orientation estimation, achieving accurate affine correspondence generation. Despite these advances, the existing methods still face challenges when simultaneously handling severe modality gaps, large-scale geometric transformations, and residual local misalignments in multimodal remote sensing imagery.

2.2. Registration Methods

Multimodal image registration aims to establish spatial correspondence between images acquired from different sensors, viewpoints, or time instances. The existing methods can be broadly categorized into traditional methods and deep learning-based methods. Traditional methods include region-based and feature-based approaches, while deep learning-based methods can be divided into hybrid approaches and end-to-end methods.

Traditional registration methods rely on handcrafted priors to estimate geometric relationships. Region-based methods directly measure similarity between local patches or image regions using metrics such as cross-correlation (CC) [28], mutual information (MI) [29], and phase correlation [30]. These methods avoid explicit feature extraction but are sensitive to noise, local distortions, and illumination variations, and they are computationally expensive for high-resolution remote sensing images. Feature-based methods instead detect salient structures and establish correspondences via descriptors. Recent advances focus on improving robustness under nonlinear radiometric differences and geometric distortions. For example, RIFT [31] leverages phase congruency and an index map for radiation-invariant matching, while WSSF [32] constructs structural saliency maps using scale-space analysis and phase-aware filtering. Fan et al. [33] further improve robustness by introducing a modality-invariant feature transform with a global-to-local search strategy and adaptive rotation-invariant descriptors. Despite these improvements, handcrafted methods still rely on manually designed features and often struggle under large geometric deformations and severe appearance variations.

Deep learning has significantly advanced image registration by enabling data-driven feature representation and robust transformation estimation. The existing approaches can be divided into hybrid methods and end-to-end methods. Hybrid methods introduce deep models into specific components of conventional pipelines, such as feature extraction or similarity modeling, while retaining geometric optimization. Early works improve robustness by learning feature representations or similarity metrics [34]. More recent approaches further integrate geometric reasoning with deep features, such as learning task-specific keypoints or incorporating semantic-aware and multi-scale representations for multimodal registration [35]. However, their loosely coupled design may limit global consistency. End-to-end methods perform registration by directly predicting correspondences or transformation parameters. Early approaches such as MatchNet [36] and LIFT [37] demonstrate the effectiveness of joint feature learning and matching. More recent works incorporate attention mechanisms and transformer architectures to model long-range dependencies. LoFTR [15] proposes a detector-free coarse-to-fine framework for dense matching via self- and cross-attention. TopicFM [16] introduces topic-aware representations, while XFeat [17] focuses on efficient feature extraction. XoFTR [18] further enhances cross-modal matching through modality-aware pretraining and augmentation strategies. SwinMatcher [23] models cross-modal interactions using a Swin transformer, DGIM [24] improves robustness via dynamic data generation, and MINIMA [25] learns modality-invariant representations through synthetic cross-modal training pairs. More recently, DS-MAR [26] leverages dual-stream multiscale attention with adaptive deformation refinement for semantic-level alignment, while GPDRNet [27] introduces geometry-preserving dense registration to reduce structural distortions. 
Although these methods achieve improved performance under modality gaps, they still face challenges in jointly handling large-scale geometric transformations and residual local misalignments.

Motivated by these limitations, we propose a coarse-to-fine registration framework that explicitly separates global geometric correction from local correspondence refinement, enabling more accurate and stable multimodal remote sensing image registration.

3. Method

3.1. Coarse-to-Fine Registration Framework

To address the challenges in multimodal image registration, we propose a coarse-to-fine multimodal feature registration strategy named MMARNet, as illustrated in Figure 2. It consists of two phases: the geometric transformation prediction (GTP) phase and the local feature refinement (LFR) phase. The coarse registration phase predicts the global geometric transformation through the GTP module, which uses multi-scale local consistency features and dynamic sparse attention to reduce the initial misalignment and lay a foundation for geometric consistency. The fine registration phase then applies a super token transformer-enhanced network for keypoint-level feature extraction and matching, resolving residual local discrepancies. By refining alignment from global to local, MMARNet handles complex geometric variations in multimodal remote sensing images while maintaining computational efficiency.

3.2. Geometric Transformation Prediction (GTP) Module

The top part of Figure 2 illustrates the two-stage process of the proposed geometric transformation prediction (GTP) module for multimodal remote sensing image registration. In the feature extraction stage, we employ a multi-scale feature extractor that captures both local and global features, enhancing the representation of spatial dependencies across different scales. This stage integrates sparse attention mechanisms and dynamically adjusts channel weights, amplifying meaningful features while suppressing irrelevant ones, which allows for more effective feature recalibration and spatial coherence. In the feature regression stage, the extracted features are progressively processed to estimate the geometric transformation parameters required for accurate registration, producing a geometrically consistent output image.

3.2.1. Feature Extraction

Let $F_{in}$ denote the input feature map of the GTP module. It is first processed by the SE-ResNeXt module [52], which introduces a squeeze-and-excitation (SE) block to dynamically adjust channel weights. The SE block enhances inter-channel dependencies through two steps: squeeze (global average pooling) and excitation (nonlinear mapping with sigmoid). The output feature map $F_{se}$ is computed as

$$F_{se} = F_{in} \cdot \sigma\left(W_{2} \cdot \mathrm{ReLU}\left(W_{1} \cdot \mathrm{GAP}(F_{in})\right)\right), \tag{1}$$

where GAP denotes global average pooling, and $W_{1}$, $W_{2}$ are learnable weights for dimensionality reduction and expansion.
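As a rough illustration of Eq. (1), the following NumPy sketch recalibrates the channels of a single feature map. The function name `se_recalibrate`, the reduction ratio `r`, and the omission of bias terms are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_recalibrate(f_in, w1, w2):
    """Channel recalibration in the spirit of Eq. (1).

    f_in: (C, H, W) feature map
    w1:   (C // r, C) reduction weights   (hypothetical shapes, no bias)
    w2:   (C, C // r) expansion weights
    """
    squeezed = f_in.mean(axis=(1, 2))          # squeeze: GAP over H and W -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)    # ReLU(W1 . GAP(F_in))
    gates = sigmoid(w2 @ hidden)               # excitation: one gate per channel
    return f_in * gates[:, None, None]         # rescale every channel by its gate

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
f_in = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
f_se = se_recalibrate(f_in, w1, w2)
```

Because every gate lies in (0, 1), the block can only attenuate a channel relative to its input, which is how it suppresses less informative channels.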

Next, to enhance multi-scale feature extraction, we apply a multi-scale attention mechanism. The feature map $F_{se}$ is divided into $G$ groups along the channel dimension, with each group $F_{se}^{(g)} \in \mathbb{R}^{C/G \times H \times W}$. For each group, adaptive pooling along height and width yields spatial attention maps $A_{h}^{(g)}$ and $A_{w}^{(g)}$:

$$A_{h}^{(g)} = \mathrm{Softmax}\left(\mathrm{Pool}_{h}(F_{se}^{(g)})\right), \quad A_{w}^{(g)} = \mathrm{Softmax}\left(\mathrm{Pool}_{w}(F_{se}^{(g)})\right), \tag{2}$$

where $\mathrm{Pool}_{h}$ and $\mathrm{Pool}_{w}$ are adaptive pooling operations along height and width. The enhanced group features are obtained by element-wise multiplication:

$$\tilde{F}_{se}^{(g)} = F_{se}^{(g)} \odot \left(A_{h}^{(g)} \cdot A_{w}^{(g)}\right). \tag{3}$$
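One possible reading of Eqs. (2) and (3) is sketched below in NumPy. It assumes that $\mathrm{Pool}_{h}$ averages over the width axis, that $\mathrm{Pool}_{w}$ averages over the height axis, that each softmax runs along the remaining spatial axis, and that the product $A_{h}^{(g)} \cdot A_{w}^{(g)}$ is a broadcast outer product. None of these details are stated in the text, and the function name is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_directional_attention(f_se, groups):
    """Apply Eq. (2)-(3) per channel group; f_se has shape (C, H, W)."""
    c, h, w = f_se.shape
    assert c % groups == 0, "channels must split evenly into groups"
    enhanced = []
    for group in np.split(f_se, groups, axis=0):                    # (C/G, H, W)
        a_h = softmax(group.mean(axis=2, keepdims=True), axis=1)    # (C/G, H, 1)
        a_w = softmax(group.mean(axis=1, keepdims=True), axis=2)    # (C/G, 1, W)
        enhanced.append(group * (a_h * a_w))                        # Eq. (3)
    return np.concatenate(enhanced, axis=0)                         # back to (C, H, W)

features = np.random.default_rng(0).standard_normal((8, 6, 5))
f_multi = grouped_directional_attention(features, groups=4)
```

For a constant input the two attention maps are uniform, so every element is scaled by $1/(HW)$, which gives a simple sanity check for the broadcasting.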

All groups are concatenated to form the multi-scale feature map $F_{multi}$. To handle high-resolution images efficiently, we integrate a dynamic adaptive sparse attention mechanism (Figure 3). The input feature map $F_{multi}$ is first divided into $M$ coarse regions. For each region $R_{m}$, we compute an importance score based on the aggregated attention response:

$$s_{m} = \frac{1}{|R_{m}|} \sum_{x \in R_{m}} \max \left(\mathrm{Softmax}\left(Q K^{\top} / \sqrt{d}\right)\right)_{x}, \tag{4}$$

where $Q$, $K$, and $V$ are the query, key, and value matrices derived from $F_{multi}$. We then select the top-$k$ regions with the highest scores:

$$I^{r} = \mathrm{TopK}(s_{1}, s_{2}, \dots, s_{M}), \quad k = \lceil \rho M \rceil, \tag{5}$$

where $\rho$ is the retention ratio. In all experiments, we set $\rho = 0.25$; i.e., the top 25% most informative regions are retained for fine-grained attention. The corresponding keys and values are gathered as

$$K^{g} = \mathrm{gather}(K, I^{r}), \quad V^{g} = \mathrm{gather}(V, I^{r}). \tag{6}$$

The final attention output is

$$F_{dynamic} = \mathrm{Attention}(Q, K^{g}, V^{g}) + \mathrm{LCE}(V), \tag{7}$$

where $\mathrm{LCE}(V)$ denotes a local convolutional embedding that enhances local dependencies.
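The region selection in Eqs. (4) to (7) can be sketched as follows. This is a simplified illustration under several assumptions: regions are contiguous runs of tokens in a flattened sequence rather than 2D windows, a single attention head is used, and the $\mathrm{LCE}(V)$ term is omitted. The function name and arguments are hypothetical.

```python
import math
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_region_attention(q, k, v, num_regions, rho=0.25):
    """q, k, v: (N, d) token matrices; returns (output, kept region indices)."""
    n, d = q.shape
    assert n % num_regions == 0, "tokens must split evenly into regions"
    size = n // num_regions
    attn = softmax(q @ k.T / math.sqrt(d), axis=1)            # (N, N)
    # Eq. (4): average, over each region, of the peak attention of its queries.
    scores = attn.max(axis=1).reshape(num_regions, size).mean(axis=1)
    # Eq. (5): keep the ceil(rho * M) highest-scoring regions.
    top_k = math.ceil(rho * num_regions)
    kept = np.sort(np.argsort(-scores)[:top_k])
    # Eq. (6): gather the keys and values that belong to the kept regions.
    token_idx = (kept[:, None] * size + np.arange(size)[None, :]).ravel()
    k_g, v_g = k[token_idx], v[token_idx]
    # Eq. (7) without the LCE(V) term: all queries attend to the gathered tokens.
    out = softmax(q @ k_g.T / math.sqrt(d), axis=1) @ v_g
    return out, kept

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 4)) for _ in range(3))
out, kept = sparse_region_attention(q, k, v, num_regions=8, rho=0.25)
```

With $\rho = 1$ every region is kept and the sketch reduces to ordinary dense attention, so the saving comes entirely from shrinking the key and value sets.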

3.2.2. Feature Regression

To establish dense correspondences between source and target images, we compute a dense correspondence map. Let $F_{s}$ and $F_{t}$ denote the source and target feature maps obtained from the extraction stage (both of size $h' \times w' \times C'$). The correspondence map $C_{geo} \in \mathbb{R}^{h' \times w' \times (h' \times w')}$ is defined as

$$C_{geo}(i, j, k) = \langle F_{s}(i_{k}, j_{k}), F_{t}(i, j) \rangle, \tag{8}$$

where $\langle \cdot, \cdot \rangle$ is the inner product, $(i, j)$ denotes a spatial position in the target feature map, and $(i_{k}, j_{k})$ denotes the spatial position in the source feature map indexed by $k$. Each element of $C_{geo}$ represents the similarity between a source point and a target point.

The dense correspondence map is then fed into a regression network $\mathcal{R}$ that directly estimates the geometric transformation parameters. The regression network maps the correspondence map to the degrees of freedom (DoFs) of the transformation model:

$$\mathcal{R} : \mathbb{R}^{h' \times w' \times (h' \times w')} \to \mathbb{R}^{DoF}. \tag{9}$$

This design enables the network to capture the geometric structure within the feature maps and infer the spatial transformation between images, providing a robust foundation for multimodal registration.
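The correspondence map of Eq. (8) can be computed with a single tensor contraction. The sketch below assumes that the third index $k$ enumerates source positions in row-major order, which is consistent with the stated shape $h' \times w' \times (h' \times w')$ but is not spelled out in the text. The regression network of Eq. (9) is not sketched, since its architecture is not described in this section.

```python
import numpy as np

def correlation_map(f_s, f_t):
    """Dense inner-product similarity between all source and target positions.

    f_s, f_t: (h, w, c) feature maps; returns an array of shape (h, w, h * w).
    """
    h, w, c = f_t.shape
    src = f_s.reshape(h * w, c)                  # row-major: k = i_k * w + j_k
    return np.einsum("ijc,kc->ijk", f_t, src)    # <F_s(i_k, j_k), F_t(i, j)>

rng = np.random.default_rng(0)
f_s = rng.standard_normal((3, 4, 5))
f_t = rng.standard_normal((3, 4, 5))
c_geo = correlation_map(f_s, f_t)
```

For example, with $w' = 4$ the entry `c_geo[1, 2, 9]` compares target position (1, 2) with source position (2, 1), because $9 = 2 \cdot 4 + 1$.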

3.2.3. Loss Function for the GTP Module

During training, we adopt the transformed grid loss [53] as the basic supervision for the GTP module. Given the predicted transformation parameters $\hat{\theta}$ and the ground-truth transformation parameters $\theta^{gt}$, the baseline loss is defined as

$$L_{grid}(\hat{\theta}, \theta^{gt}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} d\left(T_{\hat{\theta}}(x_{i}, y_{j}), T_{\theta^{gt}}(x_{i}, y_{j})\right)^{2}, \tag{10}$$

where $N$ denotes the number of grid points, $T_{\hat{\theta}}(\cdot)$ and $T_{\theta^{gt}}(\cdot)$ are the transformation operations parameterized by $\hat{\theta}$ and $\theta^{gt}$, respectively, and $d(\cdot, \cdot)$ denotes the Euclidean distance. To enable bidirectional learning, we further introduce supervision in both source-to-target and target-to-source directions. Let $\hat{\theta}_{S \to T}$ and $\hat{\theta}_{T \to S}$ denote the predicted transformations from source to target and from target to source, respectively. The bidirectional loss for the original image pair is defined as

$$L_{org} = L_{grid}\left(\hat{\theta}_{S \to T}, \theta_{S \to T}^{gt}\right) + L_{grid}\left(\hat{\theta}_{T \to S}, (\theta_{S \to T}^{gt})^{-1}\right), \tag{11}$$

where the ground truth for the reverse direction is the inverse of $\theta_{S \to T}^{gt}$, which is available because the affine transformation is invertible, so no additional ground-truth annotation is required for the reverse direction.
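A minimal NumPy sketch of the grid loss and its bidirectional use is given below. It assumes that each transformation is a 2×3 affine matrix acting on a regular grid over $[-1, 1]^{2}$ and that the loss is averaged over all grid points; the grid size, the normalization, and all function names are choices made for this example.

```python
import numpy as np

def make_grid(n):
    """Regular n x n grid of (x, y) points over [-1, 1]^2, shape (n * n, 2)."""
    axis = np.linspace(-1.0, 1.0, n)
    xs, ys = np.meshgrid(axis, axis, indexing="ij")
    return np.stack([xs.ravel(), ys.ravel()], axis=1)

def apply_affine(theta, pts):
    """Apply a 2x3 affine matrix to points of shape (P, 2)."""
    return pts @ theta[:, :2].T + theta[:, 2]

def invert_affine(theta):
    """Invert a non-singular 2x3 affine matrix via its 3x3 homogeneous form."""
    full = np.vstack([theta, [0.0, 0.0, 1.0]])
    return np.linalg.inv(full)[:2]

def grid_loss(theta_pred, theta_gt, n=20):
    """Mean squared Euclidean distance between the two transformed grids."""
    pts = make_grid(n)
    diff = apply_affine(theta_pred, pts) - apply_affine(theta_gt, pts)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def bidirectional_loss(pred_st, pred_ts, gt_st, n=20):
    """Supervise both directions; the reverse target is the inverted ground truth."""
    return grid_loss(pred_st, gt_st, n) + grid_loss(pred_ts, invert_affine(gt_st), n)

gt = np.array([[0.9, -0.2, 0.1], [0.2, 1.1, -0.3]])
identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
shift = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]])
perfect = bidirectional_loss(gt, invert_affine(gt), gt)
shifted = grid_loss(shift, identity)
```

A pure translation of 0.5 along one axis moves every grid point by the same amount, so the loss is $0.5^{2} = 0.25$ regardless of the grid size.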

For regularization, we also exploit the augmented image pair generated during training. The corresponding loss is formulated as

$$L_{aug} = L_{grid}\left(\hat{\theta}_{S \to T}, \theta_{S \to T}^{gt}\right) + L_{grid}\left(\hat{\theta}_{T \to S}, (\theta_{S \to T}^{gt})^{-1}\right), \tag{12}$$

where the augmented pair shares the same geometric relation as the original pair. In addition, we introduce an identity loss to encourage the predicted transformations for the original pair and the augmented pair to remain consistent:

$$L_{id} = L_{grid}\left(\hat{\theta}_{S \to T}, \theta_{S \to T}\right) + L_{grid}\left(\hat{\theta}_{T \to S}, \theta_{T \to S}\right), \tag{13}$$

where $\theta_{S \to T}$ and $\theta_{T \to S}$ denote the identity transformations for the original and augmented pairs, respectively. The final loss function for the GTP module is defined as

$$L_{GTP} = \alpha L_{org} + \beta L_{aug} + \gamma L_{id}, \tag{14}$$

where $\alpha$, $\beta$, and $\gamma$ are the balance coefficients of the three loss terms. Following the original setting in [44], we set $\alpha = 0.5$, $\beta = 0.3$, and $\gamma = 0.2$. These values are retained to maintain consistency with the baseline design and enable a fair comparison.

3.3. Local Feature Refinement (LFR) Module

The local feature refinement module enhances registration accuracy through three stages: feature point detection, descriptor extraction, and feature matching. The feature point detection layer leverages spatial selection mechanisms and multi-scale large kernels to capture key points while suppressing redundancy. The descriptor extraction layer employs clustering to strengthen global and local representations, making descriptors resilient to nonlinear radiation differences (NRDs). Finally, the feature matching layer predicts a similarity matrix between descriptors to achieve precise alignment.

3.3.1. Feature Detection Layer

The feature point detection layer combines SCconv [54] and LSKNet [55] to extract robust keypoints. Let $G_{in}$ be the input feature map to this layer.

SCconv Module: It consists of a spatial reconstruction unit (SRU) and a channel reconstruction unit (CRU). The SRU applies group normalization (GN) [56] to normalize spatially variant features:

$$G_{sc} = \mathrm{GN}(G_{in}) = \gamma \frac{G_{in} - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta, \tag{15}$$

where $\mu$ and $\sigma$ are the mean and standard deviation, and $\gamma$, $\beta$ are trainable parameters.
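Group normalization as written in Eq. (15) can be sketched for a single feature map as follows. The sketch assumes per-channel scale and shift parameters and statistics computed jointly over the channels and spatial positions of each group, as in the standard GN formulation; the remaining SRU and CRU steps are not shown, and the function name is hypothetical.

```python
import numpy as np

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """Group normalization of a (C, H, W) feature map, following Eq. (15).

    gamma, beta: (C,) trainable scale and shift (per channel, by assumption).
    """
    c, h, w = x.shape
    assert c % num_groups == 0, "channels must split evenly into groups"
    grouped = x.reshape(num_groups, -1)              # one row per channel group
    mu = grouped.mean(axis=1, keepdims=True)         # group mean
    var = grouped.var(axis=1, keepdims=True)         # group variance (sigma^2)
    normed = ((grouped - mu) / np.sqrt(var + eps)).reshape(c, h, w)
    return gamma[:, None, None] * normed + beta[:, None, None]

x = np.random.default_rng(0).standard_normal((8, 5, 5))
g_sc = group_norm(x, num_groups=4, gamma=np.ones(8), beta=np.zeros(8))
```

With unit scale and zero shift, each group of the output has approximately zero mean and unit variance, independent of the contrast of the input.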

LSKNet Module: It captures multi-scale contextual information using large-kernel convolutions and a spatial selection mechanism. Starting from the input $G_{sc}$, a series of convolutions with varying kernel sizes $k$ and dilations $d$ produce multi-scale features,

$$G_{i+1} = F_{i}^{k, d}(G_{i}), \quad \tilde{G}_{i} = F_{i}^{1 \times 1}(G_{i}), \tag{16}$$

where $F_{i}^{k, d}$ denotes a convolution with kernel $k$ and dilation $d$, and $\tilde{G}_{i}$ is the result of a $1 \times 1$ convolution. In our implementation, the large-kernel branch consists of a $5 \times 5$ depth-wise convolution followed by a $7 \times 7$ depth-wise convolution with dilation rate 3. A spatial attention map $SA_{i}$ is computed for each scale, and the final output $G_{lsk}$ is obtained by weighted summation,

$$G_{lsk} = F\left(\sum_{i=1}^{N} \left(SA_{i} \cdot \tilde{G}_{i}\right)\right), \tag{17}$$

where $F$ is a convolution layer. The output $G_{lsk}$ serves as the detected feature points for subsequent descriptor extraction.

3.3.2. Feature Descriptor Layer

The architecture of the descriptor generator is shown in Figure 4. Inspired by the super token transformer (STT) [57], it consists of three sequential components: super token sampling (STS), multi-head self-attention (MHSA), and token upsampling (TU). STS compresses dense visual tokens into a compact set of super tokens to reduce redundancy and computational cost. MHSA captures global contextual dependencies among super tokens, while TU projects the enhanced representations back to the original token space. Through this process, both local details and global semantic information are incorporated into the generated descriptors.

Let $X_{tok} \in \mathbb{R}^{N \times C}$ be the input visual tokens (with $N = H \times W$). STS aggregates them into super tokens $S_{sup} \in \mathbb{R}^{m \times C}$, where $m = \frac{H}{h} \times \frac{W}{w}$ (grid size $h \times w$). Initial super tokens $S_{sup}^{0}$ are obtained by averaging tokens within each grid cell. The association map from tokens to super tokens at iteration $t$ is

$$A_{map}^{t} = \mathrm{Softmax}\left(\frac{X_{tok} (S_{sup}^{t-1})^{\top}}{\sqrt{d}}\right), \tag{18}$$

where $d$ is the channel dimension and $A_{map}^{t} \in \mathbb{R}^{N \times m}$. The super tokens are then updated as $S_{sup}^{t} = (A_{map}^{t})^{\top} X_{tok}$, where the transpose is required for the product to have the super token shape $m \times C$.

In the MHSA stage, self-attention is applied on the super tokens $S_{sup}$ to capture global dependencies,

$$\tilde{S}_{sup} = \mathrm{Softmax}\left(\frac{q(S_{sup}) \, k(S_{sup})^{\top}}{\sqrt{d}}\right) v(S_{sup}), \tag{19}$$

where $q(\cdot)$, $k(\cdot)$, $v(\cdot)$ are linear projections.

Finally, TU remaps the enhanced super tokens back to the original token space using the association map:

$$X_{enhanced} = A_{map}^{t} \tilde{S}_{sup}. \tag{20}$$

The output $X_{enhanced}$ is the refined descriptor map, which we denote as $D_{enhanced}$. This process integrates both local and global context, yielding robust descriptors for matching.
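The three steps of the descriptor generator (STS, MHSA, TU) can be sketched in NumPy as follows. Several simplifications are assumed: the association map is dense over all super tokens rather than restricted to neighboring grid cells, the super token update uses a column-normalized association map transposed to shape $m \times N$ so that each super token is a weighted average of tokens, and a single attention head with bias-free projections `wq`, `wk`, `wv` replaces MHSA. All names are hypothetical.

```python
import math
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def init_super_tokens(x_map, gh, gw):
    """Average the tokens inside each gh x gw grid cell; x_map is (H, W, C)."""
    H, W, C = x_map.shape
    cells = x_map.reshape(H // gh, gh, W // gw, gw, C)
    return cells.mean(axis=(1, 3)).reshape(-1, C)               # (m, C)

def super_token_attention(x_map, gh, gw, wq, wk, wv, iters=1):
    """Return enhanced tokens of shape (H * W, C); iters must be at least 1."""
    H, W, C = x_map.shape
    x_tok = x_map.reshape(H * W, C)
    s = init_super_tokens(x_map, gh, gw)
    for _ in range(iters):
        # Eq. (18): soft assignment of each token to the super tokens, (N, m).
        a = softmax(x_tok @ s.T / math.sqrt(C), axis=1)
        # Update: weighted average of tokens per super token (assumed normalization).
        s = (a / a.sum(axis=0, keepdims=True)).T @ x_tok        # (m, C)
    # Eq. (19): single-head self-attention among the super tokens.
    q, k, v = s @ wq, s @ wk, s @ wv
    s_tilde = softmax(q @ k.T / math.sqrt(C), axis=1) @ v
    # Eq. (20): token upsampling through the association map.
    return a @ s_tilde

rng = np.random.default_rng(0)
x_map = rng.standard_normal((4, 4, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
x_enhanced = super_token_attention(x_map, 2, 2, wq, wk, wv, iters=2)
```

The attention in Eq. (19) operates on $m$ super tokens instead of $N$ tokens, so its cost scales with $m^{2}$ rather than $N^{2}$; in this toy setting that is 4 super tokens versus 16 tokens.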

3.3.3. Feature Matching Layer

Given source descriptors $D_{s}$ and target descriptors $D_{t}$ (both derived from $D_{enhanced}$), we compute a similarity matrix $M_{sim} \in \mathbb{R}^{N_{s} \times N_{t}}$:

$$M_{sim}^{i, j} = \langle D_{s}^{i}, D_{t}^{j} \rangle, \tag{21}$$

where $N_{s} = N_{t} = (1 / \delta^{2}) H W$, with $\delta$ being the downsampling factor. The similarity matrix is then normalized using dual-softmax [58] to obtain the correspondence matrix $C_{match}$:

$$C_{match}^{i, j} = \mathrm{softmax}\left(M_{sim}^{i, :}\right)_{j} \cdot \mathrm{softmax}\left(M_{sim}^{:, j}\right)_{i}. \tag{22}$$

This matrix serves as a supervisory signal to train the detection and descriptor modules, enabling high-precision multimodal image registration.
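Eqs. (21) and (22) translate directly into a few lines of NumPy. The match extraction step in `mutual_matches` (mutual nearest neighbors above a confidence threshold) is a common way to read matches off a dual-softmax matrix; it is included here as an assumption and is not described in the text.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_softmax(d_s, d_t):
    """d_s: (N_s, C), d_t: (N_t, C) descriptors; returns C_match of shape (N_s, N_t)."""
    m_sim = d_s @ d_t.T                                        # Eq. (21)
    return softmax(m_sim, axis=1) * softmax(m_sim, axis=0)     # Eq. (22)

def mutual_matches(c_match, threshold=0.2):
    """Keep pairs that are each other's best match and exceed the threshold."""
    rows = np.arange(c_match.shape[0])
    best_j = c_match.argmax(axis=1)          # best target for every source
    best_i = c_match.argmax(axis=0)          # best source for every target
    keep = (best_i[best_j] == rows) & (c_match[rows, best_j] > threshold)
    return [(int(i), int(j)) for i, j in zip(rows[keep], best_j[keep])]

# Toy example: the target descriptors are a permutation of the source descriptors.
d_s = 10.0 * np.eye(3)
d_t = d_s[[2, 0, 1]]
c_match = dual_softmax(d_s, d_t)
matches = mutual_matches(c_match)
```

Multiplying the row-wise and column-wise softmax values means an entry is large only when the pair is a strong match from both the source and the target side.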

3.3.4. Loss Function for the LFR Module

The LFR module is supervised by multiple complementary objectives to jointly optimize keypoint localization, descriptor discriminability, and correspondence estimation. During training, given a source keypoint p_{s}, we warp it to the target image using the known geometric transformation T_{S \to T} and search for its nearest neighbor p_{t} among the detected target keypoints. If the distance between them is smaller than a predefined threshold ϵ, the pair is regarded as a valid correspondence. Based on these pairs, the location loss is defined as

L_{loc} = \frac{1}{N} \sum_{i = 1}^{N} {∥T_{S \to T} (p_{s}^{i}) - p_{t}^{i}∥}_{2},

(23)

where N denotes the number of valid keypoint pairs. To learn robust and discriminative descriptors, we employ a pointwise triplet loss. For each anchor descriptor D_{i}, its matched descriptor D_{i}^{+} is treated as the positive sample, while the hardest unmatched descriptor D_{i}^{-} is selected as the negative sample. The loss is formulated as

L_{tri} = \frac{1}{N} \sum_{i = 1}^{N} {[{∥D_{i} - D_{i}^{+}∥}_{2} - {∥D_{i} - D_{i}^{-}∥}_{2} + m]}_{+},

(24)

where m is the margin parameter and {[\cdot]}_{+} denotes the hinge function \max(\cdot, 0). Furthermore, the predicted correspondence matrix C_{match} is optimized using a log-likelihood objective. Let M_{pos} and M_{neg} denote the sets of positive and negative matches, respectively. The correspondence loss is defined as

L_{corr} = - (\frac{1}{| M_{pos} |} \sum_{(i, j) \in M_{pos}} \log C_{match}^{i, j} + γ \frac{1}{| M_{neg} |} \sum_{(i, j) \in M_{neg}} \log (1 - C_{match}^{i, j})),

(25)

where γ is a balancing coefficient. The final loss for the LFR module is given by

L_{LFR} = λ_{loc} L_{loc} + λ_{tri} L_{tri} + λ_{corr} L_{corr},

(26)

where the weighting coefficients are set to λ_{loc} = 0.1, λ_{tri} = 4.0, and λ_{corr} = 0.5, following the default configuration of the LFR backbone proposed in [59]. We retain these values to ensure consistency with the original design and facilitate fair performance evaluation.
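The four objectives in Eqs. (23)–(26) can be sketched in plain Python as follows. This is an illustrative sketch rather than the training code: the helper names, the toy inputs, the margin m = 1.0, the coefficient γ = 1.0, and the small `eps` guard inside the logarithms are assumptions, since the text does not report these values. Only the three λ weights are taken from the text.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def location_loss(warped_src, tgt):
    """Eq. (23): mean distance between warped source keypoints and matched targets."""
    return sum(l2(p, q) for p, q in zip(warped_src, tgt)) / len(tgt)

def triplet_loss(anchors, positives, negatives, margin):
    """Eq. (24): hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    terms = [max(0.0, l2(a, p) - l2(a, n) + margin)
             for a, p, n in zip(anchors, positives, negatives)]
    return sum(terms) / len(terms)

def correspondence_loss(c_match, pos, neg, gamma, eps=1e-12):
    """Eq. (25): log-likelihood over positive and negative index pairs.

    eps is an assumed numerical guard against log(0); it is not part of Eq. (25).
    """
    pos_term = sum(math.log(c_match[i][j] + eps) for i, j in pos) / len(pos)
    neg_term = sum(math.log(1.0 - c_match[i][j] + eps) for i, j in neg) / len(neg)
    return -(pos_term + gamma * neg_term)

def lfr_loss(l_loc, l_tri, l_corr, w_loc=0.1, w_tri=4.0, w_corr=0.5):
    """Eq. (26) with the weights reported in the text."""
    return w_loc * l_loc + w_tri * l_tri + w_corr * l_corr

# Toy inputs (made-up values, chosen so each term is easy to check by hand).
l_loc = location_loss([(10.0, 10.0), (20.0, 5.0)], [(13.0, 14.0), (20.0, 5.0)])
l_tri = triplet_loss([[0.0, 0.0]], [[3.0, 4.0]], [[0.0, 1.0]], margin=1.0)
l_corr = correspondence_loss([[0.9, 0.1], [0.2, 0.8]],
                             pos=[(0, 0), (1, 1)], neg=[(0, 1), (1, 0)], gamma=1.0)
total = lfr_loss(l_loc, l_tri, l_corr)
```

With these weights the triplet term dominates the total whenever the margin is violated, so descriptor discriminability receives the strongest gradient signal in this configuration.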

4. Experiments

In this section, we first describe the datasets used and the method for generating image pairs. Next, we systematically compare the proposed method with existing state-of-the-art approaches. Finally, through a series of ablation experiments, we examine the contribution of each strategy and module introduced in the model.

4.1. Datasets and Implementation Details

Training Datasets

Google Earth Dataset [60]: This dataset contains 9042 pairs of images with a size of 1080 × 1080 pixels. Each pair consists of a visible light remote sensing image and a high-resolution multispectral and panchromatic remote sensing image of the same location taken at different times. It is used to train our geometric transformation prediction network.
MegaDepth Dataset [61]: The dataset is a large-scale image matching and 3D reconstruction dataset released in 2018. MegaDepth contains nearly one million internet photos covering multiple cities and landmark buildings around the world. These photos exhibit a wide range of viewpoints, lighting, scale differences, and occlusions, all of which pose significant challenges for image matching. Our feature matching network is trained on a subset of the MegaDepth dataset.

Test Datasets

Google Earth Dataset: Used for both training and testing.
High-Resolution Remote Sensing Dataset (PatternNet) [62]: PatternNet is a large-scale high-resolution remote sensing dataset specifically collected for remote sensing image retrieval. The dataset contains 38 categories, with 800 images per category, each of size 256 × 256 pixels. The images in PatternNet are sourced from Google Earth or collected via the Google Map API from certain cities in the United States.
Visible Light–Infrared Image Dataset (VIS–NIR) [63]: This dataset contains 319 pairs of visible light and infrared images, mainly focusing on urban and coastal scenes.

Training Details

All experiments were conducted on an Nvidia GeForce RTX 4090, and the two modules were trained independently. For the geometric transformation prediction network, we implemented the model using PyTorch 2.0.1 and trained it with the Adam optimizer [64] (learning rate 5 \times 10^{-4}, batch size 8) for 20 epochs (about 2 days), applying random affine transformations as data augmentation and resizing input images to 540 \times 540. For the feature matching network, we used 118k images from COCO 2017 [65] without human annotations, applying spatial augmentation (scaling, rotation, and perspective) to generate self-supervised signals for Siamese training. Input images were resized to 240 \times 320, and the network was trained with Adam (learning rate 3 \times 10^{-4}, batch size 4) for 20 epochs (about 3 days); the learning rate was halved after 4 and 8 epochs.

In the testing stage, image pairs are generated following the procedure shown in Figure 5. Specifically, an original image is randomly selected from the test dataset, and a random affine transformation (rotation in [-90^{\circ}, 90^{\circ}], scaling in [0.85, 1.3]) is applied to produce the corresponding target image. The generated image pair is then used as input for registration prediction, while the applied transformation serves as the ground-truth geometric relationship for quantitative evaluation. For evaluation, PCK is computed with a threshold of 5 pixels, and MAE/RMSE are measured in pixel units. SSIM and PSNR are calculated between the warped image and the ground-truth target image.
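The test-pair generation described above can be illustrated with a short sketch that samples a ground-truth transform from the stated rotation and scaling ranges. It assumes the transform is a rotation plus isotropic scaling about the image centre, with no translation or shear, and it uses a 540 × 540 image size for illustration; the text does not specify these details, and the function names are hypothetical.

```python
import math
import random

def random_affine(width, height, rng,
                  rot_range=(-90.0, 90.0), scale_range=(0.85, 1.3)):
    """Sample a 3x3 homogeneous matrix: rotation + isotropic scale about the centre."""
    theta = math.radians(rng.uniform(*rot_range))
    s = rng.uniform(*scale_range)
    cx, cy = width / 2.0, height / 2.0
    a, b = s * math.cos(theta), s * math.sin(theta)
    # p' = s * R(theta) * (p - c) + c, expanded into matrix form.
    return [[a, -b, cx - a * cx + b * cy],
            [b, a, cy - b * cx - a * cy],
            [0.0, 0.0, 1.0]]

def apply_affine(matrix, point):
    """Map a single (x, y) point through the homogeneous matrix."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])

rng = random.Random(0)  # fixed seed so the sampled pair is reproducible
T_gt = random_affine(540, 540, rng)

# Ground-truth correspondences for a coarse grid of source points.
grid = [(x, y) for x in range(0, 541, 135) for y in range(0, 541, 135)]
gt_targets = [apply_affine(T_gt, p) for p in grid]
```

Warping the source image with `T_gt` would yield the target image, and `gt_targets` gives the reference positions against which predicted correspondences can be scored.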

4.2. Quantitative Comparisons

4.2.1. Image Similarity Evaluation

Table 2 reports the SSIM and PSNR results on the Google Earth, PatternNet, and VIS–NIR datasets. These two metrics evaluate the structural fidelity and visual quality of the registered images, and they provide a complementary assessment of registration performance.
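For reference, PSNR between a warped image and its ground-truth target can be computed as in the sketch below. It assumes 8-bit grayscale images stored as lists of rows and a peak value of 255; the text does not state the dynamic range or color handling used in the experiments, and SSIM is omitted here because it requires windowed local statistics.

```python
import math

def psnr(img_a, img_b, max_value=255.0):
    """PSNR in dB between two equally sized grayscale images (lists of rows)."""
    sq_diffs = [(a - b) ** 2
                for row_a, row_b in zip(img_a, img_b)
                for a, b in zip(row_a, row_b)]
    mse = sum(sq_diffs) / len(sq_diffs)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy example: the warped image differs from the target by a constant offset of 16.
target = [[100, 120], [140, 160]]
warped = [[116, 136], [156, 176]]
value = psnr(warped, target)
```

Since PSNR depends only on the mean squared error, residual misalignment after registration lowers it directly, which is why it complements the structure-oriented SSIM.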

The structural similarity index measure (SSIM) reflects the degree of structural preservation after registration. As shown in Table 2, traditional feature-based methods generally obtain lower SSIM values. For example, SIFT and BRISK achieve SSIM scores of 0.41 and 0.40 on Google Earth, respectively, while ORB only reaches 0.29. Although these methods perform better on PatternNet, their results on VIS–NIR remain limited, with ORB achieving only 0.23 and ReDfeat and XoFTR producing even lower values of 0.08. Recent learning-based methods show improvements over earlier approaches. TopicFM, LoFTR, ASpanFormer, and XFeat achieve moderate SSIM values, ranging from 0.45 to 0.62 on the three datasets. More recent methods further improve structural consistency: MINIMA achieves 0.71, 0.90, and 0.83 on Google Earth, PatternNet, and VIS–NIR, respectively; SwinMatcher obtains 0.73, 0.88, and 0.86; and DGIM reaches 0.75, 0.91, and 0.81. Among all the compared methods, MMARNet consistently achieves the best SSIM performance, with 0.79 on Google Earth, 0.93 on PatternNet, and 0.92 on VIS–NIR. These results indicate that MMARNet preserves structural information more effectively, especially in challenging multimodal scenarios.

The peak signal-to-noise ratio (PSNR) measures the visual quality and noise suppression capability of the registered results. Traditional feature-based methods produce relatively low PSNR values on the three datasets. For instance, ORB obtains only 13.55 on Google Earth and 10.45 on VIS–NIR, while ReDfeat and XoFTR yield comparably low values of 14.36/14.40 on Google Earth and even lower values of 9.36/9.37 on VIS–NIR, respectively. Compared with these methods, recent learning-based approaches exhibit clearer improvements. TopicFM, LoFTR, ASpanFormer, and XFeat reach PSNR values of 21.16–22.88 on Google Earth, 20.44–22.60 on PatternNet, and 16.32–17.24 on VIS–NIR. More recent methods improve the results further: MINIMA achieves 23.15, 27.45, and 24.35; SwinMatcher obtains 23.40, 27.20, and 25.10; and DGIM reaches 23.70, 27.80, and 24.00 on the three datasets, respectively. MMARNet achieves the highest PSNR values across all the datasets, with 24.75 on Google Earth, 28.24 on PatternNet, and 26.19 on VIS–NIR. These results demonstrate that MMARNet provides stronger noise suppression and better visual consistency after registration.

In summary, MMARNet achieves the best overall performance in both SSIM and PSNR. The results show that the proposed method not only preserves structural information more effectively but also maintains higher image quality after registration, which confirms its robustness in multimodal remote sensing scenarios.

4.2.2. Registration Accuracy Evaluation

PCK (percentage of correct keypoints) [68], MAE (mean absolute error) [28], and RMSE (root mean squared error) are three widely used metrics for evaluating image matching performance [69]. Table 3 reports these metrics to provide a more comprehensive assessment of geometric registration accuracy. The results on the Google Earth, PatternNet, and VIS–NIR datasets demonstrate the effectiveness of MMARNet under both intra-modal and cross-modal scenarios.
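The three metrics can be sketched as follows, assuming each is computed from the per-point Euclidean error between predicted and ground-truth positions and that PCK uses the 5-pixel threshold stated in the implementation details. The exact definitions used for Table 3 may differ (for example, MAE could be taken per coordinate), so this is an illustration only.

```python
import math

def registration_metrics(pred_pts, gt_pts, pck_threshold_px=5.0):
    """Return (PCK in %, MAE in px, RMSE in px) from per-point Euclidean errors."""
    errors = [math.dist(p, g) for p, g in zip(pred_pts, gt_pts)]
    n = len(errors)
    pck = 100.0 * sum(e <= pck_threshold_px for e in errors) / n
    mae = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return pck, mae, rmse

# Toy points (made-up values): errors are 0, 5, 10, and 6 pixels.
pred = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0), (0.0, 6.0)]
gt = [(0.0, 0.0)] * 4
pck, mae, rmse = registration_metrics(pred, gt)
```

RMSE is never smaller than MAE and grows faster when a few points have large errors, so a low RMSE alongside a low MAE indicates that large registration deviations are rare rather than merely averaged out.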

Traditional methods such as SIFT [9] and BRISK [12] achieve moderate performance on the benchmark datasets. For example, SIFT obtains relatively high PCK scores on PatternNet (94.96) and VIS–NIR (84.09), but its MAE and RMSE on VIS–NIR remain 2.78 and 5.83, respectively, indicating limited robustness under severe modality variation. BRISK shows a similar tendency, with a PCK of 96.43 on PatternNet but a much larger MAE of 6.29 and RMSE of 8.96 on VIS–NIR. ORB [11] performs poorly on all three datasets, particularly on Google Earth, where it only achieves a PCK of 22.70 and the highest MAE/RMSE values among the traditional methods, confirming its limited effectiveness in multimodal registration.

Deep learning-based methods generally improve matching performance, but their results vary significantly across datasets. TopicFM [16], LoFTR [15], ASpanFormer [67], and XFeat [17] achieve strong performance on Google Earth, with PCK values above 98.00 and relatively low MAE/RMSE values. However, their performance drops substantially on PatternNet and VIS–NIR, especially for TopicFM and LoFTR, whose PCK values fall below 45.00 on PatternNet and below 42.00 on VIS–NIR. This indicates that, although these methods are effective for relatively easier alignment cases, their robustness remains limited when facing larger cross-modal discrepancies and more challenging geometric transformations.

Recent methods, including MINIMA [25], SwinMatcher [23], and DGIM [24], show improved robustness over earlier deep models. MINIMA achieves PCK scores of 99.92 on Google Earth and 97.50 on VIS–NIR, while SwinMatcher and DGIM further improve the overall registration quality across the three datasets. In particular, DGIM obtains the highest PCK on PatternNet (99.11), indicating strong performance on this more challenging dataset. Nevertheless, MMARNet still delivers the most consistent overall results. It achieves the best PCK on Google Earth (99.99) and VIS–NIR (98.00) and remains highly competitive on PatternNet (99.11), where it is only slightly lower than DGIM. More importantly, MMARNet achieves the lowest MAE on all three datasets, with values of 0.32, 0.75, and 0.96, respectively, and also the lowest RMSE, with 1.28 on Google Earth, 2.35 on PatternNet, and 3.78 on VIS–NIR. These results indicate that MMARNet not only improves the average alignment accuracy but also reduces large registration deviations, leading to more reliable geometric consistency.

Overall, the experimental results verify that MMARNet provides the strongest overall registration performance across the three datasets. Compared with both traditional methods and recent learning-based baselines, the proposed method achieves the most favorable trade-off between keypoint accuracy, average error, and overall geometric stability, demonstrating its robustness in multimodal remote sensing image registration.

4.2.3. Qualitative Comparison

Figure 6, Figure 7 and Figure 8 provide qualitative comparisons on the Google Earth, PatternNet, and VIS–NIR datasets. For clearer visualization, checkerboard overlays and zoomed-in regions are used to highlight local alignment details. Overall, MMARNet achieves the most consistent registration results across all three datasets, particularly in challenging cases involving large geometric transformations and strong modality differences.

On the Google Earth dataset, the geometric discrepancy between image pairs is relatively moderate, and the modality gap is also less pronounced. As a result, most methods are able to produce visually acceptable alignment, and the performance differences are comparatively limited. Nevertheless, several baseline methods still exhibit slight boundary shifts and local discontinuities in the enlarged regions, whereas MMARNet produces cleaner structural continuity and more accurate boundary alignment.

In contrast, the PatternNet dataset involves more significant rotation variations, which makes the registration task considerably more difficult. Under such conditions, traditional methods and earlier deep learning approaches begin to suffer from noticeable ghosting, edge distortion, and incomplete alignment. Although MINIMA, SwinMatcher, and DGIM improve the overall results to some extent, residual misalignment remains visible in the red-box regions. MMARNet, by comparison, better suppresses these artifacts and maintains more precise structural correspondence.

The VIS–NIR dataset is the most challenging one as it simultaneously presents substantial modality differences, scale variation, and rotation changes. On this dataset, most competing methods fail to achieve reliable alignment and produce obvious misregistration or severe structural inconsistency. Even the recent baselines still show clear ghosting and incomplete correspondence in the zoomed-in regions. In contrast, MMARNet consistently preserves geometric structure and achieves much better alignment quality. These results clearly demonstrate the effectiveness of the proposed coarse-to-fine strategy in progressively reducing global misalignment and refining local correspondence under highly complex multimodal conditions.

4.3. Ablation Study

4.3.1. Geometric Transformation Prediction Module

Table 4 presents the ablation study results on the Google Earth dataset, using PCK as the evaluation metric to measure the contribution of each module at three thresholds (τ = 0.05, 0.03, 0.01). In this experiment, the multi-scale local consistency feature extraction module is denoted as “A”, the dynamic adaptive sparsity attention mechanism as “B”, and the unmodified network as “baseline”. The baseline model shows a gradual decrease in PCK as the threshold decreases, reaching only 31.5% at τ = 0.01, which indicates its limitations in high-precision registration tasks. When the multi-scale local consistency feature extraction module is introduced (baseline + A), the PCK at the lowest threshold (τ = 0.01) increases slightly to 32.9%, showing a modest benefit for matching fine details. When the dynamic adaptive sparsity attention mechanism is introduced (baseline + B), the PCK increases across all thresholds, reaching 86.2% and 37.1% at τ = 0.03 and 0.01, respectively. This suggests that the dynamic adaptive sparsity attention mechanism is effective at focusing on key regions and suppressing redundant information.

Finally, integrating both the multi-scale local consistency feature extraction module and the dynamic adaptive sparsity attention mechanism into the baseline to form the complete model yields the best performance across all thresholds, reaching 39.6% at τ = 0.01 and outperforming the other configurations. This supports the complementarity of the two modules, enabling MMARNet to achieve more accurate image matching.
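To illustrate why PCK falls as τ is tightened, the sketch below evaluates one fixed set of errors at the three thresholds. It assumes the common convention that a point is correct when its error is at most τ · max(H, W); the text does not spell out the normalization, and the error values and the 540 × 540 size are made up for illustration, not taken from Table 4.

```python
def pck_at(errors_px, tau, height, width):
    """Percentage of points whose error is within tau * max(height, width) pixels."""
    limit = tau * max(height, width)
    return 100.0 * sum(e <= limit for e in errors_px) / len(errors_px)

# Made-up per-point errors in pixels (not results from the paper).
errors_px = [2.0, 4.0, 8.0, 12.0, 20.0, 30.0]
sweep = {tau: pck_at(errors_px, tau, 540, 540) for tau in (0.05, 0.03, 0.01)}
```

The same predictions score lower at τ = 0.01 than at τ = 0.05, so gains at the smallest threshold reflect improvements in fine alignment rather than in coarse alignment.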

4.3.2. Local Feature Refinement Module

Table 5 presents the ablation study results on the HPatches dataset, evaluating the performance impact of incorporating the enhanced feature extraction module (denoted as “A”) and the super token transformer (denoted as “B”) on the baseline model. As shown in the table, the baseline model demonstrates moderate performance across all the metrics, with an Re value of 0.690, LE of 1.019, MS of 0.538, H-3 of 0.869, and H-5 of 0.926, indicating certain limitations when dealing with complex transformations. After introducing the feature extraction module (baseline + A), the LE metric improves by approximately 0.4% and MS improves by 4.3%, indicating that the feature extraction module helps to refine local features and enhance local matching accuracy. Furthermore, adding the super token transformer (baseline + B) results in more substantial improvements across all the metrics. Compared to the baseline model, the Re value decreases by about 3.5%, the LE value decreases by 3.4%, the MS metric improves by 9.6%, and the H-3 and H-5 metrics increase by 0.6% and 0.2%, respectively. These results demonstrate that the super token transformer plays a significant role in focusing on key feature regions and enhancing matching consistency, thereby improving robustness to complex transformations.

When both the feature extraction module and the super token transformer are combined, the model achieves its best overall performance. Compared to the baseline model, the LE metric decreases by 6.7% and MS improves by 6.9%. These results indicate that the two modules are complementary and together enhance the accuracy and stability of image matching.

5. Discussion

The experimental results demonstrate that MMARNet effectively addresses multimodal remote sensing image registration through a coarse-to-fine strategy. By decomposing the task into global geometric alignment and local feature refinement, the framework achieves robust performance under large geometric distortions and significant cross-modal appearance variations. The global alignment stage compensates for large-scale transformations and reduces the search space for matching, while the local refinement stage further establishes accurate correspondences at the feature level. Compared with existing single-stage approaches, this hierarchical design improves both registration accuracy and robustness across different multimodal datasets.

Despite these advantages, several limitations remain. As a sequential framework, MMARNet is still sensitive to errors in the initial global transformation estimation, which may propagate to the local refinement stage and affect the final registration result. This issue becomes more pronounced in cases involving extremely large rotations, severe viewpoint changes, or limited overlap between image pairs. Future work will focus on jointly optimizing the two stages to reduce error propagation, improve robustness under extreme conditions, and extend the framework to more challenging multimodal and multi-temporal remote sensing applications.

6. Conclusions

In this paper, we introduce a new multimodal remote sensing image registration method called MMARNet, which uses a two-stage framework to achieve high-precision registration. In the first stage, a geometric transformation prediction (GTP) module estimates and compensates for large-scale transformations, such as scaling and rotation, providing a geometrically consistent foundation for subsequent registration. In the second stage, a local feature refinement (LFR) module, built on a deep learning-based feature matching network, performs pixel-level feature extraction and robust matching, enabling precise local alignment across modalities. Extensive experimental results demonstrate that MMARNet achieves state-of-the-art accuracy and robustness in multimodal remote sensing image registration.

Author Contributions

Conceptualization, X.L. and G.S.; Methodology, G.S.; Validation, G.S. and J.J.; Visualization, Z.H.; Writing—original draft preparation, X.L., G.S. and Z.H.; Writing—review and editing, J.J.; Supervision, Q.M.; Funding acquisition, X.L. and Q.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Basic Research Program of Shaanxi (2024JCYBMS-467) and the Aeronautical Science Foundation of China (D023030002).

Data Availability Statement

The datasets used in our work are derived from the following resources available in the public domain: the Google Earth dataset from ref. [60] and the MegaDepth dataset from ref. [61].

Conflicts of Interest

The authors declare no conflicts of interest.

References

Shah, D.; Sridhar, A.; Dashora, N.; Stachowicz, K.; Black, K.; Hirose, N.; Levine, S. ViNT: A foundation model for visual navigation. arXiv 2023, arXiv:2306.14846.
Kozłowski, M.; Racewicz, S.; Wierzbicki, S. Image Analysis in Autonomous Vehicles: A Review of the Latest AI Solutions and Their Comparison. Appl. Sci. 2024, 14, 8150.
Kaur, H.; Koundal, D.; Kadyan, V. Image fusion techniques: A survey. Arch. Comput. Methods Eng. 2021, 28, 4425–4447.
Kamel Boulos, M.; Peng, G.; VoPham, T. An overview of GeoAI applications in health and healthcare. Int. J. Health Geogr. 2019, 18, 7.
Zeng, Q.; Sun, W.; Xu, J.; Wan, W.; Pan, L. Machine Learning-Based Medical Imaging Detection and Diagnostic Assistance. Int. J. Comput. Sci. Inf. Technol. 2024, 2, 36–44.
Paul, S.; Pati, U. A comprehensive review on remote sensing image registration. Int. J. Remote Sens. 2021, 42, 5396–5432.
Zhang, X.; Leng, C.; Hong, Y.; Pei, Z.; Cheng, I.; Basu, A. Multimodal remote sensing image registration methods and advancements: A survey. Remote Sens. 2021, 13, 5128.
Zhao, Y.; Liang, J.; Ma, H.; Huang, P.; Dong, Y.; Li, J. Semantic-Guided Hierarchical Consistency Domain Adaptation for Open-Set Remote Sensing Scene Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2026, 19, 2088–2102.
Lowe, D. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded up robust features. In Proceedings of the European Conference on Computer Vision (ECCV), Graz, Austria, 7–13 May 2006; pp. 404–417.
Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2564–2571.
Leutenegger, S.; Chli, M.; Siegwart, R. BRISK: Binary robust invariant scalable keypoints. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2548–2555.
Levi, G.; Hassner, T. LATCH: Learned arrangements of three patch codes. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–9 March 2016; pp. 1–9.
Chum, O.; Matas, J. Matching with PROSAC – progressive sample consensus. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–26 June 2005; pp. 220–226.
Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8922–8931.
Giang, K.; Song, S.; Jo, S. TopicFM: Robust and interpretable topic-assisted feature matching. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, USA, 7–14 February 2023; pp. 2447–2455.
Potje, G.; Cadar, F.; Araujo, A.; Martins, R.; Nascimento, E. XFeat: Accelerated Features for Lightweight Image Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 2682–2691.
Tuzcuoğlu, Ö.; Köksal, A.; Sofu, B.; Kalkan, S.; Alatan, A. XoFTR: Cross-modal Feature Matching Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 4275–4286.
Barroso-Laguna, A.; Mikolajczyk, K. Key.Net: Keypoint detection by handcrafted and learned CNN filters revisited. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 698–711.
Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
Bökman, G.; Kahl, F. A case for using rotation invariant features in state of the art feature matchers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5110–5119.
Giang, K.; Song, S.; Jo, S. TopicFM+: Boosting accuracy and efficiency of topic-assisted feature matching. IEEE Trans. Image Process. 2024, 33, 6016–6028.
Li, W.; Weng, D.; Gao, C.; Du, Q. SwinMatcher: Universal Cross-Modal Remote Sensing Image Matching With Interactive Swin Transformer. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4705916.
Weng, D.; Li, W.; Gao, C.; Xia, X.G.; Shi, Z.; Cui, B. DGIM: Cascaded Dynamic Data Generation for Robust Cross-Modal Image Matching. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4708616.
Ren, J.; Jiang, X.; Li, Z.; Liang, D.; Zhou, X.; Bai, X. MINIMA: Modality Invariant Image Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 23059–23068.
He, Y.; Yang, C.; Sun, C.; Song, P. A Multimodal Remote Sensing Image Registration Framework with Dual-Stream Multiscale Attention and Adaptive Deformation Refinement. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2026, 19, 13909–13923.
Zhou, L.; Peng, T.; Han, Z.; Li, L.; Zhu, Q.; Ye, Y. Robust Pixel-by-Pixel Multimodal Remote Sensing Image Registration Using Geometry Preserving Dense Registration Network. IEEE Trans. Geosci. Remote Sens. 2026, 64, 4701814.
Viola, P.; Wells, W., III. Alignment by maximization of mutual information. Int. J. Comput. Vis. 1997, 24, 137–154.
Maes, F.; Collignon, A.; Vandermeulen, D.; Marchal, G.; Suetens, P. Multimodality image registration by maximization of mutual information. IEEE Trans. Med. Imaging 1997, 16, 187–198.
Alba, A.; Aguilar-Ponce, R.; Vigueras-Gómez, J.; Arce-Santana, E. Phase correlation based image alignment with subpixel accuracy. In Proceedings of the Mexican International Conference on Artificial Intelligence (MICAI), San Luis Potosí, Mexico, 27 October–4 November 2012; pp. 171–182.
Li, J.; Hu, Q.; Ai, M. RIFT: Multi-Modal Image Matching Based on Radiation-Variation Insensitive Feature Transform. IEEE Trans. Image Process. 2020, 29, 3296–3310.
Wan, G.; Ye, Z.; Xu, Y.; Huang, R.; Zhou, Y.; Xie, H.; Tong, X. Multimodal Remote Sensing Image Matching Based on Weighted Structure Saliency Feature. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4700816.
Fan, Z.; Pi, Y.; Han, J.; Kan, Y.; Tan, K. GS–MIFT: A modality invariant feature transform with global-to-local searching. Inf. Fusion 2024, 105, 102252.
Islam, K.T.; Wijewickrema, S.; O’Leary, S. A deep learning based framework for the registration of three dimensional multi-modal medical images of the head. Sci. Rep. 2021, 11, 1860.
Liu, C.; Sui, H.; Zhou, M.; Xu, C. Large-scale multimodal remote sensing image registration with semantic guidance and multi-scale contextual matching. Expert Syst. Appl. 2026, 323, 132455.
Han, X.; Leung, T.; Jia, Y.; Sukthankar, R.; Berg, A. MatchNet: Unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2015; pp. 3279–3286.
Yi, K.; Trulls, E.; Lepetit, V.; Fua, P. LIFT: Learned invariant feature transform. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 467–483.
Yu, G.; Morel, J. ASIFT: An algorithm for fully affine invariant comparison. Image Process. Line 2011, 1, 11–38.
Mishkin, D.; Matas, J.; Perdoch, M. MODS: Fast and robust method for two-view matching. Comput. Vis. Image Underst. 2015, 141, 81–93.
Yi, K.; Verdie, Y.; Fua, P.; Lepetit, V. Learning to assign orientations to feature points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 107–116.
Mishkin, D.; Radenovic, F.; Matas, J. Repeatability is not enough: Learning affine regions via discriminability. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 284–300.
Rau, A.; Garcia-Hernando, G.; Stoyanov, D.; Brostow, G.; Turmukhambetov, D. Predicting visual overlap of images through interpretable non-metric box embeddings. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 629–646.
Dai, J.; Jin, S.; Zhang, J.; Nguyen, T.Q. Boosting Feature Matching Accuracy With Pairwise Affine Estimation. IEEE Trans. Image Process. 2020, 29, 8278–8291.
Park, J.H.; Nam, W.J.; Lee, S.W. A Two-Stream Symmetric Network with Bidirectional Ensemble for Aerial Image Matching. Remote Sens. 2020, 12, 465.
Barroso-Laguna, A.; Tian, Y.; Mikolajczyk, K. ScaleNet: A shallow architecture for scale estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12808–12818.
Guan, B.; Zhao, J.; Barath, D.; Fraundorfer, F. Minimal Solvers for Relative Pose Estimation of Multi-Camera Systems Using Affine Correspondences. Int. J. Comput. Vis. 2023, 131, 324–345.
Hruby, P.; Pollefeys, M.; Barath, D. Semicalibrated Relative Pose from an Affine Correspondence and Monodepth. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2024; pp. 39–57.
Yu, Y.; Liu, S.; Pautrat, R.; Pollefeys, M.; Larsson, V. Relative Pose Estimation through Affine Corrections of Monocular Depth Priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025.
Sun, P.; Guan, B.; Yu, Z.; Shang, Y.; Yu, Q.; Barath, D. Learning Affine Correspondences by Integrating Geometric Constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 27038–27048.
Guan, B.; Zhao, J.; Kneip, L. A Complete Solution to Generalized Relative Pose Estimation from Affine Correspondences. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 8598–8612.
Xiao, Y.; Zhang, C.; Chen, Y.; Jiang, B.; Tang, J. ADRNet: Affine and Deformable Registration Networks for Multimodal Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5207613.
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
Rocco, I.; Arandjelovic, R.; Sivic, J. Convolutional Neural Network Architecture for Geometric Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6148–6157.
Li, J.; Wen, Y.; He, L. SCConv: Spatial and channel reconstruction convolution for feature redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 6153–6162.
Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 16794–16805.
Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
Huang, H.; Zhou, X.; Cao, J.; He, R.; Tan, T. Vision transformer with super token sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 22690–22699.
Rocco, I.; Cimpoi, M.; Arandjelović, R.; Torii, A.; Pajdla, T.; Sivic, J. Neighbourhood consensus networks. Adv. Neural Inf. Process. Syst. (NeurIPS) 2018, 31, 1658–1669.
Wang, C.; Zhang, G.; Cheng, Z.; Zhou, W. Rethinking low-level features for interest point detection and description. In Proceedings of the Asian Conference on Computer Vision (ACCV), Macao, China, 4–8 December 2022; pp. 2059–2074.
Kim, D.; Nam, W.; Lee, S. A robust matching network for gradually estimating geometric transformation on remote sensing imagery. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 3889–3894.
Li, Z.; Snavely, N. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2041–2050.
Zhou, W.; Newsam, S.; Li, C.; Shao, Z. PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval. ISPRS J. Photogramm. Remote Sens. 2018, 145, 197–209. [Google Scholar] [CrossRef]
Pielawski, N.; Wetzer, E.; Öfverstedt, J.; Lu, J.; Wählby, C.; Lindblad, J.; Sladoje, N. CoMIR: Contrastive multimodal image representation for registration. Adv. Neural Inf. Process. Syst. (NeurIPS) 2020, 33, 18433–18444. [Google Scholar]
Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
Deng, Y.; Ma, J. ReDFeat: Recoupling detection and description for multimodal feature learning. IEEE Trans. Image Process. 2022, 32, 591–602. [Google Scholar] [CrossRef]
Chen, H.; Luo, Z.; Zhou, L.; Tian, Y.; Zhen, M.; Fang, T.; Mckinnon, D.; Tsin, Y.; Quan, L. Aspanformer: Detector-free image matching with adaptive span transformer. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 20–36. [Google Scholar]
Yang, Y.; Ramanan, D. Articulated human detection with flexible mixtures of parts. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 2878–2890. [Google Scholar] [CrossRef]
Kim, S.; Min, D.; Ham, B.; Jeon, S.; Lin, S.; Sohn, K. FCSS: Fully convolutional self-similarity for dense semantic correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6560–6569. [Google Scholar]

Figure 1. Two-stage image registration framework: coarse and fine registration phases for enhanced alignment. Red boxes denote the input image pair.

Figure 2. Workflow of the proposed two-stage framework for multimodal image registration, consisting of geometric consistency learning and feature matching.

Figure 3. Dynamic adaptive sparsity attention mechanism.

Figure 4. Descriptor generator.

Figure 5. Overview of generating image pairs.

Figure 6. Comparative visualization of different methods on the Google Earth dataset.

Figure 7. Comparative visualization of different methods on the PatternNet dataset.

Figure 8. Comparative visualization of different methods on the VIS–NIR dataset.

Table 1. Comparison of representative multimodal registration methods and the proposed MMARNet.

Category	Representative Methods	Main Idea	Strengths	Key Difference
Traditional region-based	CC [28], MI [29], and phase correlation [30]	Directly optimize image similarity at the region level	Simple and label-free; effective for roughly aligned pairs	Lack explicit feature modeling and are sensitive to modality gaps, noise, and large deformations
Traditional feature-based	RIFT [31], WSSF [32], and GSMIFT [33]	Use handcrafted detectors/descriptors and feature matching	Improved geometric robustness and radiation invariance	Rely on manually designed features and still struggle under severe appearance variation and large misalignment
Hybrid deep learning methods	Deep feature/similarity learning [34,35]	Replace part of the classical pipeline with learned representations	Improved robustness while preserving conventional optimization steps	Loosely coupled design may limit global consistency and end-to-end adaptability
End-to-end dense matching	MatchNet [36], LIFT [37], LoFTR [15], TopicFM [16], XFeat [17], XoFTR [18], and SwinMatcher [23]	Directly learn correspondences or dense matches from data	Strong automation and matching capability	Typically rely on unified matching pipelines and do not explicitly separate global correction from local refinement
Multimodal remote sensing methods	DGIM [24], MINIMA [25], DS-MAR [26], and GPDRNet [27]	Improve modality invariance, semantic alignment, or geometry-preserving refinement	Better suited for cross-modal remote sensing data	Usually emphasize one aspect of the problem, while MMARNet explicitly decomposes registration into global alignment and local correspondence refinement
Proposed MMARNet	Ours	Task-driven coarse-to-fine registration with global transformation prediction and local feature refinement	Handles both large geometric distortions and residual local misalignment	Explicitly separates global geometric correction from local correspondence refinement

Table 2. Comparison of SSIM and PSNR metrics on different datasets. Bold values indicate the best result.

Method	Google Earth PSNR↑	Google Earth SSIM↑	PatternNet PSNR↑	PatternNet SSIM↑	VIS–NIR PSNR↑	VIS–NIR SSIM↑
SIFT [9]	18.34	0.41	27.44	0.86	21.78	0.75
BRISK [12]	17.17	0.40	28.16	0.90	19.20	0.59
ORB [11]	13.55	0.29	24.48	0.80	10.45	0.23
ReDFeat [66]	14.36	0.27	13.01	0.26	9.36	0.08
XoFTR [18]	14.40	0.27	12.99	0.24	9.37	0.08
TopicFM [16]	22.82	0.60	21.16	0.59	16.90	0.45
LoFTR [15]	22.38	0.61	22.32	0.59	16.32	0.45
ASpanFormer [67]	22.88	0.60	22.60	0.62	17.24	0.47
XFeat [17]	21.18	0.52	20.44	0.58	16.57	0.50
MINIMA [25]	23.15	0.71	27.45	0.90	24.35	0.83
SwinMatcher [23]	23.40	0.73	27.20	0.88	25.10	0.86
DGIM [24]	23.70	0.75	27.80	0.91	24.00	0.81
MMARNet	24.75	0.79	28.24	0.93	26.19	0.92
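
For readers unfamiliar with the PSNR column, the following is a minimal sketch of how the metric is commonly computed between a reference image and a registered (warped) image. It assumes 8-bit grayscale images given as nested lists and the standard definition based on mean squared error; the function name `psnr` and the `max_value` parameter are illustrative and are not taken from the paper, whose exact evaluation protocol may differ.

```python
import math


def psnr(reference, registered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images.

    Both images are nested lists of pixel intensities (rows of values).
    This is the textbook definition, assumed here for illustration only.
    """
    ref = [p for row in reference for p in row]
    reg = [p for row in registered for p in row]
    if not ref or len(ref) != len(reg):
        raise ValueError("images must be non-empty and have the same size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref, reg)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    reference = [[10, 20], [30, 40]]
    registered = [[11, 21], [31, 41]]  # every pixel off by one intensity level
    print(round(psnr(reference, registered), 2))
```

A higher PSNR means the registered image is closer to the reference in intensity, which is why the column is marked with an upward arrow.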

Table 3. Comparison of PCK, MAE, and RMSE metrics on different datasets. Bold values indicate the best result, and underlined values indicate the second-best result.

Method	Google Earth PCK↑	Google Earth MAE↓	Google Earth RMSE↓	PatternNet PCK↑	PatternNet MAE↓	PatternNet RMSE↓	VIS–NIR PCK↑	VIS–NIR MAE↓	VIS–NIR RMSE↓
SIFT	78.72	3.64	5.21	94.96	0.83	3.45	84.09	2.78	5.83
BRISK	60.21	10.63	14.70	96.43	0.85	3.52	67.83	6.29	8.96
ORB	22.70	16.82	18.31	93.59	1.21	4.98	42.15	12.35	14.22
TopicFM	99.70	0.80	2.28	42.20	13.01	15.02	36.05	11.76	14.31
LoFTR	99.49	0.85	2.36	44.12	11.42	13.89	41.35	12.02	13.88
ASpanFormer	99.69	0.81	2.30	59.84	8.23	9.56	46.81	10.86	12.26
XFeat	98.87	1.48	3.11	50.26	10.73	12.93	51.25	10.58	11.92
MINIMA	99.92	0.52	1.88	97.80	0.82	3.01	97.50	2.10	5.54
SwinMatcher	99.85	0.36	1.56	98.50	0.85	3.28	96.80	1.95	5.12
DGIM	99.90	0.38	1.49	99.11	0.78	2.92	97.20	2.20	5.85
MMARNet	99.99	0.32	1.28	99.11	0.75	2.35	98.00	0.96	3.78
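
As a reading aid for the PCK, MAE, and RMSE columns, the sketch below shows one common way these point-based registration errors are computed. It rests on assumptions that are not confirmed by the tables themselves: errors are Euclidean distances between transformed points and their ground-truth positions, and a point counts as correct for PCK when its distance is at most τ · max(height, width). The function name `registration_errors` and its parameters are hypothetical.

```python
import math


def registration_errors(predicted, ground_truth, image_size, tau=0.05):
    """Return (PCK in percent, MAE, RMSE) for paired 2D points.

    predicted, ground_truth: lists of (x, y) tuples in pixel coordinates.
    image_size: (height, width) used to scale the PCK threshold (assumed convention).
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("point lists must be non-empty and of equal length")
    height, width = image_size
    threshold = tau * max(height, width)
    # Euclidean distance of each predicted point from its ground-truth location.
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    n = len(distances)
    pck = 100.0 * sum(d <= threshold for d in distances) / n
    mae = sum(distances) / n
    rmse = math.sqrt(sum(d * d for d in distances) / n)
    return pck, mae, rmse


if __name__ == "__main__":
    pred = [(0, 0), (3, 4), (10, 10), (100, 100)]
    truth = [(0, 0), (0, 0), (10, 10), (100, 130)]
    print(registration_errors(pred, truth, image_size=(240, 240), tau=0.05))
    print(registration_errors(pred, truth, image_size=(240, 240), tau=0.01))
```

Under these assumptions, lowering τ tightens the pixel tolerance, so PCK can only stay the same or drop, while MAE and RMSE do not depend on τ. RMSE weights large outliers more heavily than MAE, which is why the two can rank methods differently.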

Table 4. Ablation results on the Google Earth dataset.

Method	PCK (%), $\tau = 0.05$	PCK (%), $\tau = 0.03$	PCK (%), $\tau = 0.01$
Baseline	94.5	82.4	31.5
Baseline + A	94.5	81.7	32.9
Baseline + B	96.2	86.2	37.1
Baseline + A + B	95.2	86.8	39.6

Table 5. Ablation results on the HPatches dataset.

Method	Re↑	LE↓	MS↑	H-1↑	H-3↑	H-5↑
Baseline	0.690	1.019	0.538	0.595	0.869	0.926
Baseline + A	0.689	1.015	0.561	0.536	0.866	0.919
Baseline + B	0.666	0.984	0.522	0.588	0.874	0.928
Baseline + A + B	0.689	0.951	0.575	0.617	0.881	0.929

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
