Beyond Handcrafted Features: A Deep Learning Framework for Optical Flow and SLAM
Abstract
1. Introduction
- This research uses CNN features to minimize reprojection error in estimating trajectory maps.
- This suggests that minimal reprojection errors can be obtained without optimizing the map; hence, the need for loop closure detection and map optimization could be eliminated in the future.
- This paper explores the ConvNeXtXLarge, EfficientNetV2L, and NASNetLarge models, performing a detailed layer-wise and filter-wise analysis to identify features that yield minimal odometry error in SLAM tasks (a minimal feature-extraction sketch is given after this list).
- The CNN-based features show improved robustness to common environmental challenges in SLAM, such as viewpoint changes, occlusions, and illumination variations, which often degrade the performance of traditional handcrafted features.
- The use of CNN features in place of handcrafted features results in a significant improvement by reducing odometry error, which suggests that CNN features can be used to estimate efficient trajectory maps without the need for map optimization in the future.
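The layer-wise and filter-wise analysis mentioned above requires reading out individual filter activations from the pretrained backbones. The sketch below shows one way to do this with the standard Keras applications API; the layer name is taken from the results table later in this page, while the input size, preprocessing, and the `filter_response` helper are illustrative assumptions rather than the authors' exact pipeline.

```python
import tensorflow as tf

# Load a pretrained backbone without the classification head.
model = tf.keras.applications.ConvNeXtXLarge(weights="imagenet", include_top=False)

# Layer name as listed in the per-filter results table; any layer can be swapped in.
layer_name = "convnext_xlarge_stem"
extractor = tf.keras.Model(inputs=model.input,
                           outputs=model.get_layer(layer_name).output)

def filter_response(image, filter_idx):
    """Return the activation map of a single filter for one RGB image (H, W, 3)."""
    img = tf.image.resize(image, (224, 224))                       # assumed input size
    img = tf.keras.applications.convnext.preprocess_input(img[tf.newaxis, ...])
    fmap = extractor(img, training=False)                          # (1, h, w, channels)
    return fmap[0, :, :, filter_idx].numpy()                       # one channel's response
```

Sweeping `layer_name` and `filter_idx` over a model's layers and channels yields the kind of per-layer, per-filter comparison summarized in the results tables below.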
2. Related Work
3. Methodology
3.1. Selection of Pre-Trained CNN Models for Feature Extraction
3.2. Analysing Filters and Estimating Odometry Error
3.3. Datasets Used for Testing the Proposed Method
KITTI Sequence Datasets
3.4. Estimating Transformation Matrix from CNN Features
3.5. Plotting the Map of the Trajectory
3.5.1. Estimating Transformation Matrix from Homography
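For the homography-based case, a common recipe is to fit a planar homography to the matched points and decompose it into candidate rotations and translations. The sketch below follows that generic recipe with OpenCV; the matched point arrays are assumed to come from the CNN features described above, and the function name and RANSAC threshold are illustrative choices rather than the authors' exact settings.

```python
import cv2
import numpy as np

def pose_from_homography(pts_prev, pts_curr, K):
    """Candidate (R, t) pairs from a planar homography.

    pts_prev, pts_curr: (N, 2) float arrays of matched pixel coordinates
    K: (3, 3) camera intrinsic matrix
    """
    H, inliers = cv2.findHomography(pts_prev, pts_curr, cv2.RANSAC, 3.0)
    # Decomposition yields up to four (R, t, n) solutions; translations are
    # up to scale, and the physically valid solution is normally selected
    # with cheirality / plane-normal checks.
    n_solutions, Rs, ts, normals = cv2.decomposeHomographyMat(H, K)
    return Rs, ts, normals, inliers
```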
3.5.2. Estimating Transformation Matrix from Fundamental Matrix
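For the fundamental-matrix case, a standard approach is to estimate F from the matched points, convert it to an essential matrix with the camera intrinsics, and recover the relative rotation and translation. The sketch below uses OpenCV's routines and is an assumption about, not a statement of, the authors' implementation.

```python
import cv2
import numpy as np

def pose_from_fundamental(pts_prev, pts_curr, K):
    """Relative (R, t) from the fundamental matrix of two views."""
    F, inliers = cv2.findFundamentalMat(pts_prev, pts_curr, cv2.FM_RANSAC, 1.0, 0.99)
    E = K.T @ F @ K                               # essential matrix from F and intrinsics
    # recoverPose resolves the four-fold ambiguity via the cheirality check;
    # the returned translation is only defined up to scale.
    _, R, t, _ = cv2.recoverPose(E, pts_prev, pts_curr, K)
    return R, t, inliers
```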
3.6. Generating the Trajectory Map
3.7. When Initial Pose Is Already Detected
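Sections 3.6 and 3.7 build the trajectory by chaining per-frame transformations, starting either from the identity or from an already detected initial pose. A minimal sketch of this chaining, under the assumption that each relative motion is expressed as a 4x4 homogeneous matrix, is shown below; plotting the x-z components of the resulting poses gives the usual KITTI-style bird's-eye trajectory map.

```python
import numpy as np

def build_trajectory(relative_transforms, T0=np.eye(4)):
    """Chain per-frame relative transforms into global poses.

    relative_transforms: list of 4x4 matrices [R | t; 0 0 0 1] between
    consecutive frames; T0 is the initial pose (identity when unknown).
    """
    poses = [T0]
    for T_rel in relative_transforms:
        poses.append(poses[-1] @ T_rel)           # compose onto the latest pose
    # For KITTI, the driven path is conventionally plotted in the x-z plane.
    xs = [T[0, 3] for T in poses]
    zs = [T[2, 3] for T in poses]
    return poses, xs, zs
```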
4. Experimental Results
Offset Error Estimation on Datasets
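The comparison tables below report average offset errors in translation and rotation against the KITTI ground truth. One plausible formulation of such an offset error, averaging per-frame translation distances and rotation angles between estimated and ground-truth poses, is sketched below; this is an assumption rather than necessarily the paper's exact metric.

```python
import numpy as np

def average_offset_error(est_poses, gt_poses):
    """Average translation (m) and rotation (deg) offsets between
    equally long lists of estimated and ground-truth 4x4 poses."""
    t_err, r_err = [], []
    for T_est, T_gt in zip(est_poses, gt_poses):
        # Euclidean distance between camera positions.
        t_err.append(np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3]))
        # Geodesic angle between the two rotation matrices.
        dR = T_est[:3, :3].T @ T_gt[:3, :3]
        cos_a = np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0)
        r_err.append(np.degrees(np.arccos(cos_a)))
    return float(np.mean(t_err)), float(np.mean(r_err))
```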
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Athira, K.A.; Divya Udayan, J.U.S. A Systematic Literature Review on Multi-Robot Task Allocation. ACM Comput. Surv. 2024, 57, 1–28. [Google Scholar] [CrossRef]
- Haider, Z.; Sardar, M.Z.; Azar, A.T.; Ahmed, S.; Kamal, N.A. Exploring reinforcement learning techniques in the realm of mobile robotics. Int. J. Autom. Control 2024, 18, 655–697. [Google Scholar] [CrossRef]
- Azar, A.T.; Sardar, M.Z.; Ahmed, S.; Hassanien, A.E.; Kamal, N.A. Autonomous Robot Navigation and Exploration Using Deep Reinforcement Learning with Gazebo and ROS. Lect. Notes Data Eng. Commun. Technol. 2023, 2023, 287–299. [Google Scholar] [CrossRef]
- Liu, J.; Gao, W.; Xie, C.; Hu, Z. Implementation and observability analysis of visual-inertial-wheel odometry with robust initialization and online extrinsic calibration. Robot. Auton. Syst. 2024, 176, 104686. [Google Scholar] [CrossRef]
- Junaedy, A.; Masuta, H.; Sawai, K.; Motoyoshi, T.; Takagi, N. Real-Time 3D Map Building in a Mobile Robot System with Low-Bandwidth Communication. Robotics 2023, 12, 157. [Google Scholar] [CrossRef]
- Petrakis, G.; Partsinevelos, P. Keypoint Detection and Description Through Deep Learning in Unstructured Environments. Robotics 2023, 12, 137. [Google Scholar] [CrossRef]
- Memon, A.R.; Iqbal, M.; Almakhles, D. DisView: A Semantic Visual IoT Mixed Data Feature Extractor for Enhanced Loop Closure Detection for UGVs During Rescue Operations. IEEE Internet Things J. 2024, 11, 36214–36224. [Google Scholar] [CrossRef]
- Iqbal, M.; Memon, A.R.; Almakhles, D.J. Accelerating Resource-Constrained Swarm Robotics with Cone-Based Loop Closure and 6G Communication. IEEE Trans. Intell. Transp. Syst. 2025; early access. [Google Scholar] [CrossRef]
- Li, D.; Shi, X.; Long, Q.; Liu, S.; Yang, W.; Wang, F.; Wei, Q.; Qiao, F. DXSLAM: A Robust and Efficient Visual SLAM System with Deep Features. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 4958–4965. [Google Scholar]
- Lianos, K.N.; Schönberger, J.L.; Pollefeys, M.; Sattler, T. VSO: Visual Semantic Odometry; Springer: Berlin/Heidelberg, Germany, 2018; pp. 246–263. [Google Scholar]
- Lin, S.; Wang, J.; Xu, M.; Zhao, H.; Chen, Z. Topology Aware Object-Level Semantic Mapping Towards More Robust Loop Closure. IEEE Robot. Autom. Lett. 2021, 6, 7041–7048. [Google Scholar] [CrossRef]
- Waheed, S.R.; Suaib, N.M.; Rahim, M.S.M.; Khan, A.R.; Bahaj, S.A.; Saba, T. Synergistic Integration of Transfer Learning and Deep Learning for Enhanced Object Detection in Digital Images. IEEE Access 2024, 12, 13525–13536. [Google Scholar] [CrossRef]
- Tahir, N.U.A.; Zhang, Z.; Asim, M.; Chen, J.; Elaffendi, M. Object Detection in Autonomous Vehicles under Adverse Weather: A Review of Traditional and Deep Learning Approaches. Algorithms 2024, 17, 103. [Google Scholar] [CrossRef]
- Memon, A.R.; Wang, H.; Hussain, A. Loop closure detection using supervised and unsupervised deep neural networks for monocular SLAM systems. Robot. Auton. Syst. 2020, 126, 103470. [Google Scholar] [CrossRef]
- Ul Islam, Q.; Khozaei, F.; Salah Al Barhoumi, E.M.; Baig, I.; Ignatyev, D. Advancing Autonomous SLAM Systems: Integrating YOLO Object Detection and Enhanced Loop Closure Techniques for Robust Environment Mapping. Robot. Auton. Syst. 2024, 185, 104871. [Google Scholar] [CrossRef]
- Chen, X.; Milioto, A.; Palazzolo, E.; Giguere, P.; Behley, J.; Stachniss, C. SuMa++: Efficient LiDAR-based Semantic SLAM. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 4530–4537. [Google Scholar] [CrossRef]
- Sualeh, M.; Kim, G.W. Simultaneous Localization and Mapping in the Epoch of Semantics: A Survey. Int. J. Control. Autom. Syst. 2019, 17, 729–742. [Google Scholar] [CrossRef]
- Mahmoud, A.; Atia, M. Improved Visual SLAM Using Semantic Segmentation and Layout Estimation. Robotics 2022, 11, 91. [Google Scholar] [CrossRef]
- Giubilato, R.; Vayugundla, M.; Schuster, M.J.; Sturzl, W.; Wedler, A.; Triebel, R.; Debei, S. Relocalization with Submaps: Multi-Session Mapping for Planetary Rovers Equipped with Stereo Cameras. IEEE Robot. Autom. Lett. 2020, 5, 580–587. [Google Scholar] [CrossRef]
- Müller, C.J. Map Point Selection for Hardware Constrained Visual Simultaneous Localisation and Mapping; Technical Report; Stellenbosch University: Stellenbosch, South Africa, 2024. [Google Scholar]
- Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
- Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
- Memon, A.R.; Liu, Z.; Wang, H. Viewpoint-Invariant Loop Closure Detection Using Step-Wise Learning with Controlling Embeddings of Landmarks. IEEE Trans. Intell. Transp. Syst. 2022, 23, 20148–20159. [Google Scholar] [CrossRef]
- Mahattansin, N.; Sukvichai, K.; Bunnun, P.; Isshiki, T. Improving Relocalization in Visual SLAM by using Object Detection. In Proceedings of the 19th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, ECTI-CON 2022, Prachuap Khiri Khan, Thailand, 24–27 May 2022; pp. 17–20. [Google Scholar] [CrossRef]
- Zins, M.; Simon, G.; Berger, M.O. OA-SLAM: Leveraging Objects for Camera Relocalization in Visual SLAM. In Proceedings of the 2022 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Singapore, 17–21 October 2022; pp. 720–728. [Google Scholar]
- Ming, D.; Wu, X. Research on Monocular Vision SLAM Algorithm for Multi-map Fusion and Loop Detection. In Proceedings of the 2022 6th International Conference on Automation, Control and Robots (ICACR), Shanghai, China, 23–25 September 2022. [Google Scholar]
- Lim, H. Outlier-robust long-term robotic mapping leveraging ground segmentation. arXiv 2024, arXiv:2405.11176. [Google Scholar]
- Milford, M.J.; Wyeth, G.F. SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. In Proceedings of the IEEE International Conference on Robotics and Automation, Saint Paul, MN, USA, 14–18 May 2012; pp. 1643–1649. [Google Scholar]
- Xu, Z.; Rong, Z.; Wu, Y. A survey: Which features are required for dynamic visual simultaneous localization and mapping? Vis. Comput. Ind. Biomed. Art 2021, 4, 20. [Google Scholar] [CrossRef]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
- Yu, Z.; Jiang, X.; Liu, Y. Pose estimation of an aerial construction robot based on motion and dynamic constraints. Robot. Auton. Syst. 2024, 172, 104591. [Google Scholar] [CrossRef]
- Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision, 2nd ed.; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
- Faugeras, O.; Lustman, F. Motion and Structure From Motion in a Piecewise Planar Environment. Int. J. Pattern Recognit. Artif. Intell. 1988, 2, 485–508. [Google Scholar] [CrossRef]
- Wang, S.; Clark, R.; Wen, H.; Trigoni, N. DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 2043–2050. [Google Scholar]
- Wang, W.; Zhu, D.; Wang, X.; Hu, Y.; Qiu, Y.; Wang, C.; Hu, Y.; Kapoor, A.; Scherer, S. TartanAir: A Dataset to Push the Limits of Visual SLAM. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021. [Google Scholar]
| Model | Top-1 Accuracy | Top-5 Accuracy | Parameters | Layers | Total Filters | Filters Used | Filters Not Used |
|---|---|---|---|---|---|---|---|
| ConvNeXtXLarge | 86.7% | - | 350.1 Million | 297 | 332,806 | 36,096 | 296,710 |
| EfficientNetV2L | 85.7% | 97.5% | 119.0 Million | 1031 | 1,253,254 | 261,216 | 992,038 |
| NASNetLarge | 82.5% | 96.0% | 88.9 Million | 1041 | 502,407 | 16,434 | 485,973 |
| S. # | Layer Name–Layer #–Filter # | Avg. Offset Error: Translation (m) | Avg. Offset Error: Rotation (deg) |
|---|---|---|---|
| **ConvNeXtXLarge Model** | | | |
| 1 | Input Layer–00–00 | 24.54 | 127.21 |
| 2 | Input Layer–00–01 | 32.22 | 125.14 |
| 3 | Input Layer–00–02 | 15.01 | 108.37 |
| 4 | convnext_xlarge_stem–2–0 | Pose not detected | Pose not detected |
| 5 | convnext_xlarge_stem–2–1 | 442.27 | 126.01 |
| 6 | convnext_xlarge_stem–2–2 | Pose not detected | Pose not detected |
| 7 | convnext_xlarge_stem–2–3 | Pose not detected | Pose not detected |
| 8 | convnext_xlarge_stem–2–4 | 31.60 | 97.35 |
| 9 | convnext_xlarge_stem–2–5 | 51.54 | 130.41 |
| 10 | convnext_xlarge_stem–2–6 | 3,580,159.43 | 104.76 |
| 11 | convnext_xlarge_stem–2–7 | Pose not detected | Pose not detected |
| 12 | convnext_xlarge_stem–2–8 | 2809.92 | 127.59 |
| 13 | convnext_xlarge_stem–2–9 | Pose not detected | Pose not detected |
| 14 | convnext_xlarge_stem–2–10 | 7.28 | 78.02 |
| 15 | convnext_xlarge_stem–2–11 | 545,608.38 | 104.77 |
| 16 | convnext_xlarge_stem–2–12 | Pose not detected | Pose not detected |
| 17 | convnext_xlarge_stem–2–13 | 660,566.05 | 105.23 |
| 18 | convnext_xlarge_stem–2–14 | Pose not detected | Pose not detected |
| 19 | convnext_xlarge_stem–2–15 | 955,025.13 | 104.47 |
| 20 | convnext_xlarge_stem–2–16 | 4,441,462.76 | 104.42 |
| 21 | convnext_xlarge_stem–2–17 | Pose not detected | Pose not detected |
| 22 | convnext_xlarge_stem–2–18 | 270,065,492.5 | 93.92 |
| 23 | convnext_xlarge_stem–2–19 | 305,075.56 | 106.48 |
| 24 | convnext_xlarge_stem–2–20 | Pose not detected | Pose not detected |
| **EfficientNetV2L Model** | | | |
| 1 | Input Layer–00–00 | 10.38 | 124.41 |
| 2 | Input Layer–00–01 | 13.50 | 113.19 |
| 3 | Input Layer–00–02 | 21.73 | 147.26 |
| 4 | stem_conv–02–00 | 293,395.01 | 105.52 |
| 5 | stem_conv–02–01 | Pose not detected | Pose not detected |
| 6 | stem_conv–02–02 | Pose not detected | Pose not detected |
| **NASNetLarge Model** | | | |
| 1 | input_1 Layer–00–00 | 4.85 | 105.52 |
| 2 | input_1 Layer–00–01 | 17.36 | 96.62 |
| 3 | input_1 Layer–00–02 | 2,454,372.65 | 113.61 |
| 4 | stem_bn1–02–00 | 1,149,842.15 | 110.19 |
| 5 | stem_bn1–02–01 | 8.58 | 96.36 |
| 6 | stem_bn1–02–02 | 9.25 | 105.60 |
| Dataset | No. of Images | Image Size | Ground Truth Available? | Contains Loops? | Widely Used? |
|---|---|---|---|---|---|
| KITTI Sequence 00 | 4541 | 1241 × 376 | Yes | Yes | Yes |
| KITTI Sequence 02 | 4661 | 1241 × 376 | Yes | Yes | Yes |
| KITTI Sequence 05 | 2761 | 1226 × 370 | Yes | Yes | Yes |
| KITTI Sequence 06 | 1101 | 1226 × 370 | Yes | Yes | Yes |
| KITTI Sequence 08 | 4071 | 1226 × 370 | Yes | Yes | Yes |
| KITTI Sequence 09 | 1591 | 1226 × 370 | Yes | Yes | Yes |
| Dataset | Number of Frames | Avg. Offset Error: DeepVO [34] | Avg. Offset Error: TartanVO [35] | Avg. Offset Error: Ours |
|---|---|---|---|---|
| KITTI Sequence 00 | 4541 | 13,227.92 | 124,361.71 | 476.40 |
| KITTI Sequence 02 | 4661 | 13,005.68 | 317,924.60 | 1451.84 |
| KITTI Sequence 05 | 2761 | 7250.48 | 133,977.30 | 286.65 |
| KITTI Sequence 06 | 1101 | 7774.23 | 45,526.16 | 310.44 |
| KITTI Sequence 08 | 4071 | 23,942.91 | 203,107.88 | 374.23 |
| KITTI Sequence 09 | 1591 | 4379.55 | 110,614.15 | 536.42 |