Image Matching: Foundations, State of the Art, and Future Directions
Abstract
1. Introduction
1.1. What Is Image Matching?
1.2. Applications of Image Matching
1.3. Evolution of Image-Matching Algorithms
2. Traditional Handcrafted Image Matching
2.1. Template Matching-Based Image Matching
2.2. Feature-Based Image-Matching Pipeline
2.3. SIFT-Based Image Matching
2.4. SURF-Based Image Matching
3. Deep Learning Neural Network-Based Image Matching
3.1. SuperPoint-Based Image Matching
3.2. SuperGlue-Based Image Matching
3.3. LoFTR-Based Image Matching
- The backbone network is constructed using a ResNet [24] variant with an FPN (Feature Pyramid Network) [25] to extract multiscale features from input image pairs. The backbone network can output feature maps of different resolutions at the same time. The coarse-level feature maps, with low resolution and a large number of channels, reduce the length of the sequence to be processed by the subsequent transformer module. The fine-level feature maps, with high resolution and fewer channels, are used for final matching refinement.
- The Local Feature Transformer (LoFTR) module integrates the context of coarse-level features. The coarse-level feature maps obtained from the backbone network are flattened into one-dimensional vector sequences. Positional encoding is added to these vector sequences so that LoFTR can perceive the spatial location information of features. This module consists of four cascaded LoFTR encoder layers. Each LoFTR encoder layer contains a multi-head self-attention layer and a multi-head cross-attention layer. By alternately applying self-attention and cross-attention four times, LoFTR learns globally consistent relationships and outputs transformed coarse-level features.
- The matching module establishes the initial matches. Matrix multiplication is applied to the transformed coarse-level features to compute the score matrix, in which each element represents the similarity between a feature from one input image and a feature from the other. The confidence matrix is then obtained by applying dual softmax or optimal transport to the score matrix. Reliable coarse-level matches are retained according to a confidence threshold and the mutual nearest-neighbor rule (a minimal code sketch of this step, together with the fine-level refinement, follows this bullet list).
- The fine-level refinement module obtains accurate matches at the sub-pixel level. Coarse-level matches come from low-resolution feature maps and therefore have limited localization accuracy. For each coarse-level match, its coordinates are first mapped back to the fine-level feature map. Fixed-size local windows (w × w) are then cropped, centered on the projected coordinates. These fine-level local features are further transformed by a smaller LoFTR module containing one encoder layer. Next, a heatmap of matching probabilities is generated by correlating the center feature vector of one window with all the feature vectors of the other window. Finally, the sub-pixel matching coordinates are obtained as the expectation of this probability distribution.
1. Aggregated attention mechanism: ELoFTR observes that computing attention over all tokens of the feature map in LoFTR’s transformer module is redundant, because neighboring tokens usually carry similar local information. ELoFTR therefore introduces an aggregated attention mechanism by inserting a token-aggregation step before each self-attention and cross-attention computation: convolution is applied independently to each channel of the coarse-level feature map (i.e., depthwise convolution) to generate the query, while the key and value are extracted through max-pooling to achieve spatial downsampling. This significantly reduces the number of tokens involved in the attention computation, dramatically lowering computational complexity and memory usage. In this way, ELoFTR focuses computation on the more informative regions while retaining the advantage of a global receptive field.
2. Efficient backbone network: ELoFTR replaces LoFTR’s ResNet-plus-FPN backbone with a lightweight RepVGG network. RepVGG uses a multi-branch structure during training to enhance representational capability and can be reparameterized into a simple single-branch structure at inference time, improving inference speed without loss of accuracy.
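To make the coarse matching and fine-level refinement described above concrete, the following minimal NumPy sketch illustrates the dual-softmax confidence computation with mutual nearest-neighbor filtering, and the expectation-based sub-pixel refinement. It is not taken from the LoFTR codebase; the function names, temperature, and confidence threshold are illustrative assumptions.

```python
import numpy as np

def dual_softmax_matches(feat_a, feat_b, temperature=0.1, conf_thresh=0.2):
    """Coarse matching: score matrix -> dual-softmax confidence -> mutual
    nearest-neighbor + threshold filtering.

    feat_a: (Na, C) flattened coarse features of image A (L2-normalized).
    feat_b: (Nb, C) flattened coarse features of image B (L2-normalized).
    Returns index pairs (i, j) of reliable coarse matches.
    """
    scores = feat_a @ feat_b.T / temperature              # (Na, Nb) score matrix

    def softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    # Dual softmax: softmax over rows times softmax over columns.
    conf = softmax(scores, axis=1) * softmax(scores, axis=0)

    row_best = conf.argmax(axis=1)                        # best j for each i
    col_best = conf.argmax(axis=0)                        # best i for each j
    matches = []
    for i, j in enumerate(row_best):
        if col_best[j] == i and conf[i, j] > conf_thresh:
            matches.append((i, j))
    return matches

def refine_to_subpixel(center_feat, window_feats, window_coords, temperature=0.1):
    """Fine-level refinement: correlate the center feature of window A with all
    features of window B, softmax into a probability heatmap, and return the
    expectation of the coordinates as the sub-pixel match location.

    center_feat:   (C,)     center feature vector from image A's window.
    window_feats:  (w*w, C) feature vectors of image B's local window.
    window_coords: (w*w, 2) pixel coordinates of those features.
    """
    corr = window_feats @ center_feat / temperature
    prob = np.exp(corr - corr.max())
    prob /= prob.sum()                                    # matching-probability heatmap
    return (prob[:, None] * window_coords).sum(axis=0)    # expected (x, y)
```

As noted above, the same confidence matrix could instead be produced with optimal transport (Sinkhorn iterations [22]) rather than dual softmax.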
3.4. LoFTR Speedup
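The main speedup of ELoFTR comes from the aggregated attention described above. The rough NumPy sketch below (illustrative shapes and pooling factor; the depthwise-convolution query aggregation used by ELoFTR is omitted for brevity) shows the core idea: max-pooling the source feature map before attention reduces the number of key/value tokens by roughly a factor of s², which shrinks the attention matrix accordingly.

```python
import numpy as np

def max_pool_tokens(feat_map, s=4):
    """Spatially downsample an (H, W, C) feature map by taking the channel-wise
    maximum over non-overlapping s x s cells, then flatten to (H*W/s^2, C) tokens."""
    H, W, C = feat_map.shape
    cropped = feat_map[:H - H % s, :W - W % s]
    pooled = cropped.reshape(H // s, s, W // s, s, C).max(axis=(1, 3))
    return pooled.reshape(-1, C)

def aggregated_cross_attention(query_map, source_map, s=4):
    """Cross-attention in which keys/values come from a max-pooled (aggregated)
    version of the source map, so attention runs over ~1/s^2 of the tokens.
    Note: ELoFTR additionally aggregates queries with depthwise convolution,
    which this sketch skips."""
    C = query_map.shape[-1]
    q = query_map.reshape(-1, C)            # one query per coarse-grid location
    kv = max_pool_tokens(source_map, s)     # aggregated keys/values
    attn = q @ kv.T / np.sqrt(C)
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ kv                        # messages gathered from the reduced token set
```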
4. Latest Developments in Image Matching
4.1. Heavyweight Models
4.2. Lightweight Models
5. H-Matrix Optimization
5.1. GA Algorithm-Based H-Matrix Calculation
- Step 1: Randomly initialize N individuals for the first generation. Each individual represents a possible homography matrix.
- Step 2: New individual generation. In each generation, new individuals are generated from those of the previous generation, so that useful information learned so far is preserved while new variations are still introduced. Two commonly used operators, crossover and mutation, are applied: crossover combines information from parent individuals to generate a new individual, while mutation is applied with low probability (i.e., a small mutation rate) and randomly perturbs an individual’s parameters.
- Step 3: Selection. To ensure that more suitable individuals are generated in subsequent generations, individuals with higher fitness values (i.e., those that reduce the distances between estimated mapping points and ground-truth mapping points more effectively) are selected, while those with lower fitness values are removed from the current generation. Typically, the number of individuals in each generation is kept constant.
- Step 4: Termination. Steps 2 and 3 are repeated until the fitness value of the best individual converges (i.e., it no longer improves compared with the best individual of the previous generation). A minimal sketch of this loop is given below.
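The following minimal NumPy sketch instantiates Steps 1–4 under the common parameterization that fixes h₃₃ = 1 and evolves the remaining eight entries of H. The population size, mutation rate, and the fixed generation budget are illustrative assumptions, not recommended settings.

```python
import numpy as np

def reprojection_error(h_params, src_pts, dst_pts):
    """Mean distance between H-mapped source points and their target points.
    h_params holds the 8 free entries of H (h33 is fixed to 1)."""
    H = np.append(h_params, 1.0).reshape(3, 3)
    src_h = np.hstack([src_pts, np.ones((len(src_pts), 1))])   # homogeneous coords
    proj = src_h @ H.T
    proj = proj[:, :2] / proj[:, 2:3]
    return np.linalg.norm(proj - dst_pts, axis=1).mean()

def ga_homography(src_pts, dst_pts, pop_size=50, generations=200,
                  mutation_rate=0.05, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    # Step 1: random initial population; each individual = 8 free parameters of H,
    # initialized near the identity homography for illustration.
    identity = np.array([1, 0, 0, 0, 1, 0, 0, 0], dtype=float)
    pop = identity + rng.normal(0.0, 0.5, size=(pop_size, 8))
    for _ in range(generations):
        # Step 2: crossover (mix two random parents) plus occasional mutation.
        parents = pop[rng.integers(0, pop_size, size=(pop_size, 2))]
        mask = rng.random((pop_size, 8)) < 0.5
        children = np.where(mask, parents[:, 0], parents[:, 1])
        mutate = rng.random((pop_size, 8)) < mutation_rate
        children = children + mutate * rng.normal(0.0, 0.1, size=(pop_size, 8))
        # Step 3: selection -- keep the pop_size fittest of parents + children
        # (lowest reprojection error = highest fitness).
        candidates = np.vstack([pop, children])
        errors = np.array([reprojection_error(c, src_pts, dst_pts) for c in candidates])
        pop = candidates[np.argsort(errors)[:pop_size]]
        # Step 4 (termination on convergence) is replaced here by a fixed budget.
    return np.append(pop[0], 1.0).reshape(3, 3)
```

The resulting estimate can then be refined further, for example with the Levenberg–Marquardt step discussed in Section 5.2.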
5.2. H-Matrix Optimization with the Levenberg–Marquardt Algorithm
6. Current Challenges
6.1. Challenges in 2D Image Matching
6.2. Challenges in Stereo Image Matching
7. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Durrant-Whyte, H.; Bailey, T. Simultaneous localization and mapping: Part I. IEEE Robot. Autom. Mag. 2006, 13, 99–110. [Google Scholar] [PubMed]
- Stewart, D. A platform with six degrees of freedom. Proc. Inst. Mech. Eng. 1965, 180, 371–386. [Google Scholar] [CrossRef]
- Yang, M.; Li, J.; Li, Z.; Li, W.; Zhang, K. Self-adaptive SURF for image-to-video matching. Signal Image Video Process. 2024, 18, 751–759. [Google Scholar] [CrossRef]
- Schmidhuber, J. Multi-column deep neural networks for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3642–3649. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar]
- O’Mahony, N.; Campbell, S.; Carvalho, A.; Harapanahalli, S.; Hernandez, G.V.; Krpalkova, L.; Riordan, D.; Walsh, J. Deep learning vs. traditional computer vision. In Advances in Computer Vision. CVC 2019; Advances in Intelligent Systems and Computing; Springer: Cham, Switzerland, 2020; Volume 943, pp. 128–144. [Google Scholar]
- Voulodimos, A.; Doulamis, N.; Doulamis, A.; Protopapadakis, E. Deep learning for computer vision: A brief review. Comput. Intell. Neurosci. 2018, 2018, 7068349. [Google Scholar] [CrossRef] [PubMed]
- Hassaballah, M.; Awad, A.I. Deep Learning in Computer Vision: Principles and Applications; CRC Press: Boca Raton, FL, USA, 2020. [Google Scholar]
- Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–25 September 1999; Volume 2, pp. 1150–1157. [Google Scholar]
- Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-up robust features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar] [CrossRef]
- Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
- Xu, S.; Chen, S.; Xu, R.; Wang, C.; Lu, P.; Guo, L. Local feature matching using deep learning: A survey. Inf. Fusion 2024, 107, 102344. [Google Scholar] [CrossRef]
- Ma, J.; Jiang, X.; Fan, A.; Jiang, J.; Yan, J. Image matching from handcrafted to deep features: A survey. Int. J. Comput. Vis. 2021, 129, 23–79. [Google Scholar] [CrossRef]
- Huang, Q.; Guo, X.; Wang, Y.; Sun, H.; Yang, L. A survey of feature matching methods. IET Image Process. 2024, 18, 1385–1410. [Google Scholar] [CrossRef]
- Brunelli, R. Template Matching Techniques in Computer Vision: Theory and Practice; John Wiley & Sons: Hoboken, NJ, USA, 2009. [Google Scholar]
- Cantzler, H. Random Sample Consensus (Ransac); Institute for Perception, Action and Behaviour, Division of Informatics, University of Edinburgh: Edinburgh, UK, 1981; Volume 3, pp. 1–4. [Google Scholar]
- Zhang, Z. Flexible camera calibration by viewing a plane from unknown orientations. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–25 September 1999; Volume 1, pp. 666–673. [Google Scholar]
- DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar]
- Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4938–4947. [Google Scholar]
- Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-Free Local Feature Matching with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8918–8927. [Google Scholar] [CrossRef]
- Sinkhorn, R. A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Stat. 1964, 35, 876–879. [Google Scholar] [CrossRef]
- Li, X.; Han, K.; Li, S.; Prisacariu, V. Dual-resolution correspondence networks. Adv. Neural Inf. Process. Syst. 2020, 33, 17346–17357. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Wang, Y.; He, X.; Peng, S.; Tan, D.; Zhou, X. Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 21666–21675. [Google Scholar] [CrossRef]
- Zhang, Y.; Wu, R.; Dascalu, S.M.; Harris, F.C., Jr. Sparse transformer with local and seasonal adaptation for multivariate time series forecasting. Sci. Rep. 2024, 14, 15909. [Google Scholar] [CrossRef] [PubMed]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
- Nam, J.; Lee, G.; Kim, S.; Kim, H.; Cho, H.; Kim, S.; Kim, S. Diffusion Model for Dense Matching. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
- Youssef, A.; Vasconcelos, F. NeRF-Supervised Feature Point Detection and Description. In Computer Vision—ECCV 2024 Workshops: Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXIII; Springer: Cham, Switzerland, 2025; pp. 103–119. [Google Scholar]
- Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
- Xue, F.; Budvytis, I.; Cipolla, R. SFD2: Semantic-Guided Feature Detection and Description. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 5206–5216. [Google Scholar]
- Strecha, C.; Lindner, A.; Ali, K.; Fua, P. Training for task specific keypoint detection. In Proceedings of the Pattern Recognition: 31st DAGM Symposium, Jena, Germany, 9–11 September 2009; Proceedings 31; Springer: Berlin/Heidelberg, Germany, 2009; pp. 151–160. [Google Scholar]
- Revaud, J.; De Souza, C.; Humenberger, M.; Weinzaepfel, P. R2D2: Reliable and repeatable detector and descriptor. Adv. Neural Inf. Process. Syst. 2019, 32, 12414–12424. [Google Scholar]
- Edstedt, J.; Sun, Q.; Bökman, G.; Wadenbäck, M.; Felsberg, M. RoMa: Robust Dense Feature Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–24 June 2024; pp. 19790–19800. [Google Scholar]
- Barroso-Laguna, A.; Munukutla, S.; Prisacariu, V.A.; Brachmann, E. Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–24 June 2024; pp. 4852–4863. [Google Scholar]
- Arsalan Soltani, A.; Huang, H.; Wu, J.; Kulkarni, T.D.; Tenenbaum, J.B. Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1511–1519. [Google Scholar]
- Lin, C.H.; Ma, W.C.; Torralba, A.; Lucey, S. BARF: Bundle-adjusting neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 5741–5751. [Google Scholar]
- Lindenberger, P.; Sarlin, P.E.; Pollefeys, M. LightGlue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 17627–17638. [Google Scholar]
- Mehta, S.; Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
- Wu, K.; Zhang, J.; Peng, H.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. TinyViT: Fast pretraining distillation for small vision transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 68–85. [Google Scholar]
- Potje, G.; Cadar, F.; Araujo, A.; Martins, R.; Nascimento, E.R. XFeat: Accelerated Features for Lightweight Image Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 2682–2691. [Google Scholar]
- Teed, Z.; Deng, J. RAFT: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II; Springer: Cham, Switzerland, 2020; pp. 402–419. [Google Scholar]
- Jiang, W.; Trulls, E.; Hosang, J.; Tagliasacchi, A.; Yi, K.M. COTR: Correspondence transformer for matching across images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6207–6217. [Google Scholar]
Table: comparison of the self-attention formulations used by the standard Transformer [12], Dozer [27], Swin [28], and the combined Swin + Dozer attention.
Model | Full | Dozer | Swin | Dozer and Swin |
---|---|---|---|---|
Params (M) | 5.251 | 5.251 | 5.251 | 5.251 |
FLOPs (G) | 239.32 | 152.83 | 97.71 | 76.12 |
Memory (M) | 2.98 | 2.37 | 0.38 | 0.34 |
Condition | Mean X Error | Mean Y Error | Mean Reprojection Error |
---|---|---|---|
With LM Refinement | 0.5706 | 1.1727 | 1.3838 |
Without LM Refinement | 0.9587 | 1.1873 | 1.6236 |
| Method | Type | Model Size | Descriptor Type | Accuracy | Computational Complexity | Robustness | Application Scenarios |
|---|---|---|---|---|---|---|---|
| Template Matching | Handcrafted | N/A | N/A | Low | Medium | Very low | Very limited |
| SIFT | Handcrafted | N/A | Local | Medium | Fast | Low | Limited |
| SURF | Handcrafted | N/A | Local | Medium | Fast | Low | Limited |
| SuperPoint | Learning-based | Large | Local | High | High | High | Broad |
| SuperGlue | Learning-based | Large | Global | High | High | High | Broad |
| LoFTR | Learning-based | Large | Global | High | High | High | Broad |
| ELoFTR | Learning-based | Large | Regional | High | Medium–high | High | Broad |
| LightGlue | Learning-based | Large | Regional | High | Medium–high | High | Broad |