MMARNet: Two-Stage Remote Sensing Image Registration with Multimodal Attention Mechanism
Highlights
- A geometric transformation prediction (GTP) module uses a dynamic adaptive sparse attention mechanism to capture prominent feature regions, enabling accurate estimation of, and compensation for, large-scale geometric transformations between the input images.
- A local feature refinement (LFR) module is built on a feature extraction network with a super token transformer attention mechanism. The high-precision keypoint-level features it extracts are used to establish accurate correspondences across highly variable modalities.
- The proposed model can be applied in the field of multimodal remote sensing image matching and registration for pixel-level spatial coordinate alignment in multimodal image fusion and change detection.
- The proposed method can also be applied in the field of unmanned aerial vehicle navigation in restricted environments, providing a technical foundation for its cross-modal geographic positioning.
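To make the idea of sparse attention in the first highlight more concrete, the following is a minimal, generic sketch of top-k sparse attention in NumPy. It is not the GTP module's mechanism: the function name `topk_sparse_attention`, the fixed `top_k` value, and the random inputs are all assumptions for illustration, whereas the paper describes a dynamic adaptive scheme whose details are not given in this outline.

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k):
    """Scaled dot-product attention that keeps only the top_k keys per query.

    Hypothetical illustration: a fixed top_k stands in for whatever
    adaptive sparsity rule the actual module uses.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # (n_q, n_k)
    # Per-row threshold: the top_k-th largest score.
    kth = np.partition(scores, -top_k, axis=-1)[:, -top_k][:, None]
    # Scores below the threshold are masked out before the softmax.
    masked = np.where(scores >= kth, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)                            # exp(-inf) == 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # 4 query tokens
k = rng.normal(size=(6, 8))   # 6 key tokens
v = rng.normal(size=(6, 8))
out, w = topk_sparse_attention(q, k, v, top_k=2)
```

Each query attends to only two keys here, so the output is driven by the most salient regions instead of a dense average over all tokens.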
Abstract
1. Introduction
- We propose a task-driven coarse-to-fine framework for multimodal remote sensing image registration, which explicitly decomposes the registration process into global geometric alignment and local correspondence refinement to address the distinct error characteristics of heterogeneous image pairs.
- We introduce a geometry-aware global alignment mechanism that leverages multi-scale structural consistency and dynamic sparse attention to estimate large-scale transformations under severe modality discrepancies.
- We develop a context-enhanced local refinement mechanism based on super token aggregation to learn robust feature representations and recover accurate keypoint correspondences for fine-grained multimodal registration.
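The coarse-to-fine decomposition described in these contributions can be sketched with plain point geometry. The example below is an assumption-laden toy, not the MMARNet pipeline: the "coarse" stage is a hand-picked, slightly wrong similarity transform standing in for a network prediction, and the "fine" stage is a least-squares affine fit on noise-free correspondences. The helper names (`similarity`, `warp_points`, `fit_affine`) are hypothetical.

```python
import numpy as np

def similarity(theta_deg, scale, tx, ty):
    """3x3 homogeneous matrix for rotation, isotropic scale, and translation."""
    t = np.deg2rad(theta_deg)
    c, s = scale * np.cos(t), scale * np.sin(t)
    return np.array([[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]])

def warp_points(T, pts):
    """Apply a 3x3 affine matrix to an (n, 2) array of points."""
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    return (homog @ T.T)[:, :2]

def fit_affine(src, dst):
    """Least-squares affine transform mapping src points onto dst points."""
    A = np.hstack([src, np.ones((len(src), 1))])
    X, *_ = np.linalg.lstsq(A, dst, rcond=None)   # X has shape (3, 2)
    T = np.eye(3)
    T[:2, :] = X.T
    return T

rng = np.random.default_rng(0)
src = rng.uniform(0, 256, size=(20, 2))           # keypoints in the source image
T_true = similarity(35.0, 1.4, 40.0, -25.0)       # ground-truth misalignment
dst = warp_points(T_true, src)                    # matching points in the target

# Stage 1 (coarse): an imperfect global estimate, assumed for illustration.
T_coarse = similarity(33.0, 1.35, 36.0, -22.0)
coarse = warp_points(T_coarse, src)

# Stage 2 (fine): fit the residual transform from local correspondences.
T_fine = fit_affine(coarse, dst)
T_full = T_fine @ T_coarse                        # compose the two stages

err_coarse = np.linalg.norm(coarse - dst, axis=1).mean()
err_full = np.linalg.norm(warp_points(T_full, src) - dst, axis=1).mean()
```

The point of the split is that the second stage only has to explain a small residual, which is a much easier matching problem than recovering the full 35-degree rotation and 1.4x scale change directly.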
2. Related Work
2.1. Image Rectification
2.2. Registration Methods
3. Method
3.1. Coarse-to-Fine Registration Framework
3.2. Geometric Transformation Prediction (GTP) Module
3.2.1. Feature Extraction
3.2.2. Feature Regression
3.2.3. Loss Function for the GTP Module
3.3. Local Feature Refinement (LFR) Module
3.3.1. Feature Detection Layer
3.3.2. Feature Descriptor Layer
3.3.3. Feature Matching Layer
3.3.4. Loss Function for the LFR Module
4. Experiments
4.1. Datasets and Implementation Details
- Google Earth Dataset [60]: This dataset contains 9042 pairs of images with a size of 1080 × 1080 pixels. Each pair consists of a visible light remote sensing image and a high-resolution multispectral and panchromatic remote sensing image of the same location taken at different times. It is used to train our geometric transformation prediction network.
- MegaDepth Dataset [61]: MegaDepth is a large-scale image matching and 3D reconstruction dataset released in 2018. It contains nearly one million internet photos covering multiple cities and landmark buildings around the world. These photos exhibit a wide range of viewpoints, lighting conditions, scale differences, and occlusions, all of which pose significant challenges for image matching. Our feature matching network is trained on a subset of MegaDepth.
- Google Earth Dataset [60]: Used for both training and testing.
- High-Resolution Remote Sensing Dataset (PatternNet) [62]: PatternNet is a large-scale high-resolution remote sensing dataset specifically collected for remote sensing image retrieval. The dataset contains 38 categories, with 800 images per category, each of size 256 × 256 pixels. The images in PatternNet are sourced from Google Earth or collected via the Google Map API from certain cities in the United States.
- Visible Light–Infrared Image Dataset (VIS–NIR) [63]: This dataset contains 319 pairs of visible light and infrared images, mainly focusing on urban and coastal scenes.
4.2. Quantitative Comparisons
4.2.1. Image Similarity Evaluation
4.2.2. Registration Accuracy Evaluation
4.2.3. Qualitative Comparison
4.3. Ablation Study
4.3.1. Geometric Transformation Prediction Module
4.3.2. Local Feature Refinement Module
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Shah, D.; Sridhar, A.; Dashora, N.; Stachowicz, K.; Black, K.; Hirose, N.; Levine, S. ViNT: A foundation model for visual navigation. arXiv 2023, arXiv:2306.14846.
2. Kozłowski, M.; Racewicz, S.; Wierzbicki, S. Image Analysis in Autonomous Vehicles: A Review of the Latest AI Solutions and Their Comparison. Appl. Sci. 2024, 14, 8150.
3. Kaur, H.; Koundal, D.; Kadyan, V. Image fusion techniques: A survey. Arch. Comput. Methods Eng. 2021, 28, 4425–4447.
4. Kamel Boulos, M.; Peng, G.; VoPham, T. An overview of GeoAI applications in health and healthcare. Int. J. Health Geogr. 2019, 18, 7.
5. Zeng, Q.; Sun, W.; Xu, J.; Wan, W.; Pan, L. Machine Learning-Based Medical Imaging Detection and Diagnostic Assistance. Int. J. Comput. Sci. Inf. Technol. 2024, 2, 36–44.
6. Paul, S.; Pati, U. A comprehensive review on remote sensing image registration. Int. J. Remote Sens. 2021, 42, 5396–5432.
7. Zhang, X.; Leng, C.; Hong, Y.; Pei, Z.; Cheng, I.; Basu, A. Multimodal remote sensing image registration methods and advancements: A survey. Remote Sens. 2021, 13, 5128.
8. Zhao, Y.; Liang, J.; Ma, H.; Huang, P.; Dong, Y.; Li, J. Semantic-Guided Hierarchical Consistency Domain Adaptation for Open-Set Remote Sensing Scene Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2026, 19, 2088–2102.
9. Lowe, D. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
10. Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded up robust features. In Proceedings of the European Conference on Computer Vision (ECCV), Graz, Austria, 7–13 May 2006; pp. 404–417.
11. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2564–2571.
12. Leutenegger, S.; Chli, M.; Siegwart, R. BRISK: Binary robust invariant scalable keypoints. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2548–2555.
13. Levi, G.; Hassner, T. LATCH: Learned arrangements of three patch codes. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–9 March 2016; pp. 1–9.
14. Chum, O.; Matas, J. Matching with PROSAC—progressive sample consensus. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–26 June 2005; pp. 220–226.
15. Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8922–8931.
16. Giang, K.; Song, S.; Jo, S. TopicFM: Robust and interpretable topic-assisted feature matching. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, USA, 7–14 February 2023; pp. 2447–2455.
17. Potje, G.; Cadar, F.; Araujo, A.; Martins, R.; Nascimento, E. XFeat: Accelerated Features for Lightweight Image Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 2682–2691.
18. Tuzcuoğlu, Ö.; Köksal, A.; Sofu, B.; Kalkan, S.; Alatan, A. XoFTR: Cross-modal Feature Matching Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 4275–4286.
19. Barroso-Laguna, A.; Mikolajczyk, K. Key.Net: Keypoint detection by handcrafted and learned CNN filters revisited. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 698–711.
20. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
21. Bökman, G.; Kahl, F. A case for using rotation invariant features in state of the art feature matchers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5110–5119.
22. Giang, K.; Song, S.; Jo, S. TopicFM+: Boosting accuracy and efficiency of topic-assisted feature matching. IEEE Trans. Image Process. 2024, 33, 6016–6028.
23. Li, W.; Weng, D.; Gao, C.; Du, Q. SwinMatcher: Universal Cross-Modal Remote Sensing Image Matching With Interactive Swin Transformer. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4705916.
24. Weng, D.; Li, W.; Gao, C.; Xia, X.G.; Shi, Z.; Cui, B. DGIM: Cascaded Dynamic Data Generation for Robust Cross-Modal Image Matching. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4708616.
25. Ren, J.; Jiang, X.; Li, Z.; Liang, D.; Zhou, X.; Bai, X. MINIMA: Modality Invariant Image Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 23059–23068.
26. He, Y.; Yang, C.; Sun, C.; Song, P. A Multimodal Remote Sensing Image Registration Framework with Dual-Stream Multiscale Attention and Adaptive Deformation Refinement. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2026, 19, 13909–13923.
27. Zhou, L.; Peng, T.; Han, Z.; Li, L.; Zhu, Q.; Ye, Y. Robust Pixel-by-Pixel Multimodal Remote Sensing Image Registration Using Geometry Preserving Dense Registration Network. IEEE Trans. Geosci. Remote Sens. 2026, 64, 4701814.
28. Viola, P.; Wells, W., III. Alignment by maximization of mutual information. Int. J. Comput. Vis. 1997, 24, 137–154.
29. Maes, F.; Collignon, A.; Vandermeulen, D.; Marchal, G.; Suetens, P. Multimodality image registration by maximization of mutual information. IEEE Trans. Med. Imaging 1997, 16, 187–198.
30. Alba, A.; Aguilar-Ponce, R.; Vigueras-Gómez, J.; Arce-Santana, E. Phase correlation based image alignment with subpixel accuracy. In Proceedings of the Mexican International Conference on Artificial Intelligence (MICAI), San Luis Potosí, Mexico, 27 October–4 November 2012; pp. 171–182.
31. Li, J.; Hu, Q.; Ai, M. RIFT: Multi-Modal Image Matching Based on Radiation-Variation Insensitive Feature Transform. IEEE Trans. Image Process. 2020, 29, 3296–3310.
32. Wan, G.; Ye, Z.; Xu, Y.; Huang, R.; Zhou, Y.; Xie, H.; Tong, X. Multimodal Remote Sensing Image Matching Based on Weighted Structure Saliency Feature. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4700816.
33. Fan, Z.; Pi, Y.; Han, J.; Kan, Y.; Tan, K. GS–MIFT: A modality invariant feature transform with global-to-local searching. Inf. Fusion 2024, 105, 102252.
34. Islam, K.T.; Wijewickrema, S.; O’Leary, S. A deep learning based framework for the registration of three dimensional multi-modal medical images of the head. Sci. Rep. 2021, 11, 1860.
35. Liu, C.; Sui, H.; Zhou, M.; Xu, C. Large-scale multimodal remote sensing image registration with semantic guidance and multi-scale contextual matching. Expert Syst. Appl. 2026, 323, 132455.
36. Han, X.; Leung, T.; Jia, Y.; Sukthankar, R.; Berg, A. MatchNet: Unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2015; pp. 3279–3286.
37. Yi, K.; Trulls, E.; Lepetit, V.; Fua, P. LIFT: Learned invariant feature transform. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 467–483.
38. Yu, G.; Morel, J. ASIFT: An algorithm for fully affine invariant comparison. Image Process. Line 2011, 1, 11–38.
39. Mishkin, D.; Matas, J.; Perdoch, M. MODS: Fast and robust method for two-view matching. Comput. Vis. Image Underst. 2015, 141, 81–93.
40. Yi, K.; Verdie, Y.; Fua, P.; Lepetit, V. Learning to assign orientations to feature points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 107–116.
41. Mishkin, D.; Radenovic, F.; Matas, J. Repeatability is not enough: Learning affine regions via discriminability. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 284–300.
42. Rau, A.; Garcia-Hernando, G.; Stoyanov, D.; Brostow, G.; Turmukhambetov, D. Predicting visual overlap of images through interpretable non-metric box embeddings. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 629–646.
43. Dai, J.; Jin, S.; Zhang, J.; Nguyen, T.Q. Boosting Feature Matching Accuracy With Pairwise Affine Estimation. IEEE Trans. Image Process. 2020, 29, 8278–8291.
44. Park, J.H.; Nam, W.J.; Lee, S.W. A Two-Stream Symmetric Network with Bidirectional Ensemble for Aerial Image Matching. Remote Sens. 2020, 12, 465.
45. Barroso-Laguna, A.; Tian, Y.; Mikolajczyk, K. ScaleNet: A shallow architecture for scale estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12808–12818.
46. Guan, B.; Zhao, J.; Barath, D.; Fraundorfer, F. Minimal Solvers for Relative Pose Estimation of Multi-Camera Systems Using Affine Correspondences. Int. J. Comput. Vis. 2023, 131, 324–345.
47. Hruby, P.; Pollefeys, M.; Barath, D. Semicalibrated Relative Pose from an Affine Correspondence and Monodepth. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2024; pp. 39–57.
48. Yu, Y.; Liu, S.; Pautrat, R.; Pollefeys, M.; Larsson, V. Relative Pose Estimation through Affine Corrections of Monocular Depth Priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025.
49. Sun, P.; Guan, B.; Yu, Z.; Shang, Y.; Yu, Q.; Barath, D. Learning Affine Correspondences by Integrating Geometric Constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 27038–27048.
50. Guan, B.; Zhao, J.; Kneip, L. A Complete Solution to Generalized Relative Pose Estimation from Affine Correspondences. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 8598–8612.
51. Xiao, Y.; Zhang, C.; Chen, Y.; Jiang, B.; Tang, J. ADRNet: Affine and Deformable Registration Networks for Multimodal Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5207613.
52. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
53. Rocco, I.; Arandjelovic, R.; Sivic, J. Convolutional Neural Network Architecture for Geometric Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6148–6157.
54. Li, J.; Wen, Y.; He, L. SCConv: Spatial and channel reconstruction convolution for feature redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 6153–6162.
55. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 16794–16805.
56. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
57. Huang, H.; Zhou, X.; Cao, J.; He, R.; Tan, T. Vision transformer with super token sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 22690–22699.
58. Rocco, I.; Cimpoi, M.; Arandjelović, R.; Torii, A.; Pajdla, T.; Sivic, J. Neighbourhood consensus networks. Adv. Neural Inf. Process. Syst. (NeurIPS) 2018, 31, 1658–1669.
59. Wang, C.; Zhang, G.; Cheng, Z.; Zhou, W. Rethinking low-level features for interest point detection and description. In Proceedings of the Asian Conference on Computer Vision (ACCV), Macao, China, 4–8 December 2022; pp. 2059–2074.
60. Kim, D.; Nam, W.; Lee, S. A robust matching network for gradually estimating geometric transformation on remote sensing imagery. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 3889–3894.
61. Li, Z.; Snavely, N. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2041–2050.
62. Zhou, W.; Newsam, S.; Li, C.; Shao, Z. PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval. ISPRS J. Photogramm. Remote Sens. 2018, 145, 197–209.
63. Pielawski, N.; Wetzer, E.; Öfverstedt, J.; Lu, J.; Wählby, C.; Lindblad, J.; Sladoje, N. CoMIR: Contrastive multimodal image representation for registration. Adv. Neural Inf. Process. Syst. (NeurIPS) 2020, 33, 18433–18444.
64. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
65. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
66. Deng, Y.; Ma, J. ReDFeat: Recoupling detection and description for multimodal feature learning. IEEE Trans. Image Process. 2022, 32, 591–602.
67. Chen, H.; Luo, Z.; Zhou, L.; Tian, Y.; Zhen, M.; Fang, T.; Mckinnon, D.; Tsin, Y.; Quan, L. ASpanFormer: Detector-free image matching with adaptive span transformer. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 20–36.
68. Yang, Y.; Ramanan, D. Articulated human detection with flexible mixtures of parts. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 2878–2890.
69. Kim, S.; Min, D.; Ham, B.; Jeon, S.; Lin, S.; Sohn, K. FCSS: Fully convolutional self-similarity for dense semantic correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6560–6569.








| Category | Representative Methods | Main Idea | Strengths | Key Difference |
|---|---|---|---|---|
| Traditional region-based | CC [28], MI [29], and phase correlation [30] | Directly optimize image similarity at the region level | Simple and label-free; effective for roughly aligned pairs | Lack explicit feature modeling and are sensitive to modality gaps, noise, and large deformations |
| Traditional feature-based | RIFT [31], WSSF [32], and GS-MIFT [33] | Use handcrafted detectors/descriptors and feature matching | Improved geometric robustness and radiation invariance | Rely on manually designed features and still struggle under severe appearance variation and large misalignment |
| Hybrid deep learning methods | Deep feature/similarity learning [34,35] | Replace part of the classical pipeline with learned representations | Improved robustness while preserving conventional optimization steps | Loosely coupled design may limit global consistency and end-to-end adaptability |
| End-to-end dense matching | MatchNet [36], LIFT [37], LoFTR [15], TopicFM [16], XFeat [17], XoFTR [18], and SwinMatcher [23] | Directly learn correspondences or dense matches from data | Strong automation and matching capability | Typically rely on unified matching pipelines and do not explicitly separate global correction from local refinement |
| Multimodal remote sensing methods | DGIM [24], MINIMA [25], DS-MAR [26], and GPDRNet [27] | Improve modality invariance, semantic alignment, or geometry-preserving refinement | Better suited for cross-modal remote sensing data | Usually emphasize one aspect of the problem, while MMARNet explicitly decomposes registration into global alignment and local correspondence refinement |
| Proposed MMARNet | Ours | Task-driven coarse-to-fine registration with global transformation prediction and local feature refinement | Handles both large geometric distortions and residual local misalignment | Explicitly separates global geometric correction from local correspondence refinement |

| Method | Google Earth PSNR↑ | Google Earth SSIM↑ | PatternNet PSNR↑ | PatternNet SSIM↑ | VIS–NIR PSNR↑ | VIS–NIR SSIM↑ |
|---|---|---|---|---|---|---|
| SIFT [9] | 18.34 | 0.41 | 27.44 | 0.86 | 21.78 | 0.75 |
| BRISK [12] | 17.17 | 0.40 | 28.16 | 0.90 | 19.20 | 0.59 |
| ORB [11] | 13.55 | 0.29 | 24.48 | 0.80 | 10.45 | 0.23 |
| ReDFeat [66] | 14.36 | 0.27 | 13.01 | 0.26 | 9.36 | 0.08 |
| XoFTR [18] | 14.40 | 0.27 | 12.99 | 0.24 | 9.37 | 0.08 |
| TopicFM [16] | 22.82 | 0.60 | 21.16 | 0.59 | 16.90 | 0.45 |
| LoFTR [15] | 22.38 | 0.61 | 22.32 | 0.59 | 16.32 | 0.45 |
| ASpanFormer [67] | 22.88 | 0.60 | 22.60 | 0.62 | 17.24 | 0.47 |
| XFeat [17] | 21.18 | 0.52 | 20.44 | 0.58 | 16.57 | 0.50 |
| MINIMA [25] | 23.15 | 0.71 | 27.45 | 0.90 | 24.35 | 0.83 |
| SwinMatcher [23] | 23.40 | 0.73 | 27.20 | 0.88 | 25.10 | 0.86 |
| DGIM [24] | 23.70 | 0.75 | 27.80 | 0.91 | 24.00 | 0.81 |
| MMARNet | 24.75 | 0.79 | 28.24 | 0.93 | 26.19 | 0.92 |
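
For readers unfamiliar with the two similarity scores reported in this table, the sketch below computes PSNR and a simplified SSIM between a reference image and a registered image. It rests on several assumptions: 8-bit images (`max_val=255`), and a single global window for SSIM, whereas the standard SSIM averages the same expression over local windows. The function names and the synthetic test images are hypothetical, and the paper's exact evaluation settings are not given in this outline.

```python
import numpy as np

def psnr(ref, img, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means more similar."""
    diff = ref.astype(np.float64) - img.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def global_ssim(ref, img, max_val=255.0):
    """SSIM evaluated over the whole image as one window (a simplification)."""
    x = ref.astype(np.float64)
    y = img.astype(np.float64)
    c1 = (0.01 * max_val) ** 2          # stabilizers from the SSIM definition
    c2 = (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

rng = np.random.default_rng(0)
reference = rng.integers(0, 200, size=(64, 64)).astype(np.uint8)
shifted = reference + 10                 # uniform brightness offset of 10 levels
psnr_same = psnr(reference, reference)
psnr_shifted = psnr(reference, shifted)
ssim_same = global_ssim(reference, reference)
ssim_shifted = global_ssim(reference, shifted)
```

Both scores compare the warped image against the reference pixel by pixel, so they reward good alignment only indirectly; the keypoint-based errors in the next table measure geometric accuracy more directly.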

| Method | Google Earth PCK↑ | Google Earth MAE↓ | Google Earth RMSE↓ | PatternNet PCK↑ | PatternNet MAE↓ | PatternNet RMSE↓ | VIS–NIR PCK↑ | VIS–NIR MAE↓ | VIS–NIR RMSE↓ |
|---|---|---|---|---|---|---|---|---|---|
| SIFT | 78.72 | 3.64 | 5.21 | 94.96 | 0.83 | 3.45 | 84.09 | 2.78 | 5.83 |
| BRISK | 60.21 | 10.63 | 14.70 | 96.43 | 0.85 | 3.52 | 67.83 | 6.29 | 8.96 |
| ORB | 22.70 | 16.82 | 18.31 | 93.59 | 1.21 | 4.98 | 42.15 | 12.35 | 14.22 |
| TopicFM | 99.70 | 0.80 | 2.28 | 42.20 | 13.01 | 15.02 | 36.05 | 11.76 | 14.31 |
| LoFTR | 99.49 | 0.85 | 2.36 | 44.12 | 11.42 | 13.89 | 41.35 | 12.02 | 13.88 |
| ASpanFormer | 99.69 | 0.81 | 2.30 | 59.84 | 8.23 | 9.56 | 46.81 | 10.86 | 12.26 |
| XFeat | 98.87 | 1.48 | 3.11 | 50.26 | 10.73 | 12.93 | 51.25 | 10.58 | 11.92 |
| MINIMA | 99.92 | 0.52 | 1.88 | 97.80 | 0.82 | 3.01 | 97.50 | 2.10 | 5.54 |
| SwinMatcher | 99.85 | 0.36 | 1.56 | 98.50 | 0.85 | 3.28 | 96.80 | 1.95 | 5.12 |
| DGIM | 99.90 | 0.38 | 1.49 | 99.11 | 0.78 | 2.92 | 97.20 | 2.20 | 5.85 |
| MMARNet | 99.99 | 0.32 | 1.28 | 99.11 | 0.75 | 2.35 | 98.00 | 0.96 | 3.78 |
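
The three accuracy metrics in this table can be illustrated on a handful of keypoints. The sketch below assumes a common convention: PCK is the percentage of keypoints whose registered position lies within `alpha * max(height, width)` of the ground truth, and MAE and RMSE are taken over per-point Euclidean distances. The threshold `alpha=0.05`, the function name, and the toy coordinates are assumptions, since the paper's exact settings are not stated in this outline.

```python
import numpy as np

def registration_errors(pred, gt, image_size, alpha=0.05):
    """Return (PCK in percent, MAE, RMSE) for (n, 2) keypoint arrays.

    alpha is a hypothetical tolerance expressed as a fraction of the
    larger image side.
    """
    dist = np.linalg.norm(pred - gt, axis=1)     # per-keypoint error in pixels
    threshold = alpha * max(image_size)
    pck = 100.0 * np.mean(dist <= threshold)
    mae = dist.mean()
    rmse = np.sqrt(np.mean(dist ** 2))
    return pck, mae, rmse

gt = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 20.0], [30.0, 30.0]])
offsets = np.array([[3.0, 0.0], [0.0, 0.0], [0.0, 0.0], [30.0, 40.0]])
pred = gt + offsets                               # one gross outlier (50 px off)
pck, mae, rmse = registration_errors(pred, gt, image_size=(100, 100))
```

With these toy values the single outlier lowers PCK by one keypoint's share but dominates RMSE, which is why the table reports all three: PCK reflects how often registration succeeds, while MAE and RMSE reflect how far off it is.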
| Method | PCK (%) | ||
|---|---|---|---|
| Baseline | 94.5 | 82.4 | 31.5 |
| Baseline + A | 94.5 | 81.7 | 32.9 |
| Baseline + B | 96.2 | 86.2 | 37.1 |
| Baseline + A + B | 95.2 | 86.8 | 39.6 |
| Method | Re↑ | LE↓ | MS↑ | H-1↑ | H-3↑ | H-5↑ |
|---|---|---|---|---|---|---|
| Baseline | 0.690 | 1.019 | 0.538 | 0.595 | 0.869 | 0.926 |
| Baseline + A | 0.689 | 1.015 | 0.561 | 0.536 | 0.866 | 0.919 |
| Baseline + B | 0.666 | 0.984 | 0.522 | 0.588 | 0.874 | 0.928 |
| Baseline + A + B | 0.689 | 0.951 | 0.575 | 0.617 | 0.881 | 0.929 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Liu, X.; Shi, G.; Huang, Z.; Ji, J.; Miao, Q. MMARNet: Two-Stage Remote Sensing Image Registration with Multimodal Attention Mechanism. Remote Sens. 2026, 18, 1983. https://doi.org/10.3390/rs18121983

