TOSD: A Hierarchical Object-Centric Descriptor Integrating Shape, Color, and Topology
Abstract
1. Introduction
- We propose TOSD, a unified object-centric descriptor framework that jointly encodes shape, color, and topology for robust visual matching.
- We introduce a hierarchical representation strategy that extends from low-level pixel information to mid-level object semantics and high-level scene understanding.
- TOSD encodes shape, color, and scene information robustly across viewpoint changes through relation-based encoding, while attention-based filtering and fusion improve efficiency by pruning irrelevant relations.
- TOSD supports diverse vision tasks and provides a foundation for the Semantic Modeling Framework, which describes the environment through a hierarchical representation.
2. Related Works
2.1. Keypoint Descriptors and Dense Matching
2.2. Object-Centric Representation and Pooling
2.3. Multimodality Feature Extraction and Fusion
2.4. Graph-Based Feature Representation
2.5. Global Embedding for Retrieval and Matching
3. Method
3.1. Overview of TOSD Architecture
- Shape Descriptor that represents geometric structure.
- Color Descriptor that represents visual appearance.
- Topology Descriptor that represents inter-object relationships. (A structural sketch of how the three modules could be combined follows this list.)
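The following is a minimal, illustrative sketch of how the three descriptor modules listed above could be combined into a single object embedding. It is not the published implementation: the encoder classes, embedding dimensions, and the attention-weighted fusion are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TOSDObjectDescriptor(nn.Module):
    """Illustrative sketch: fuse shape, color, and topology cues into one object embedding.
    The dimensions and the attention-weighted fusion are assumptions, not the paper's design."""

    def __init__(self, shape_dim=256, color_dim=128, topo_dim=128, out_dim=512):
        super().__init__()
        # Placeholder linear encoders; the paper's shape/color/topology modules would replace these.
        self.shape_enc = nn.Linear(shape_dim, out_dim)
        self.color_enc = nn.Linear(color_dim, out_dim)
        self.topo_enc = nn.Linear(topo_dim, out_dim)
        # One attention logit per modality, predicted from the concatenated embeddings.
        self.attn = nn.Linear(3 * out_dim, 3)

    def forward(self, shape_feat, color_feat, topo_feat):
        s = self.shape_enc(shape_feat)   # geometric structure
        c = self.color_enc(color_feat)   # visual appearance
        t = self.topo_enc(topo_feat)     # inter-object relationships
        w = torch.softmax(self.attn(torch.cat([s, c, t], dim=-1)), dim=-1)
        fused = w[..., 0:1] * s + w[..., 1:2] * c + w[..., 2:3] * t
        return F.normalize(fused, dim=-1)  # unit-norm descriptor for cosine matching
```

Normalizing the fused vector keeps cosine-similarity scores comparable across objects, and dropping any single branch gives a single-descriptor variant of the same interface.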
3.2. Descriptor Modules
3.2.1. Center Pooling and Graph Construction
- Directed edges from the object center to salient points, forming a hierarchical structure.
- Relevance-based connections between salient points to capture contextual or semantic relationships. (A construction sketch follows this list.)
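As a rough sketch of the graph construction described above: the object center (here approximated by the keypoint centroid) is connected by directed edges to every salient point, and salient points are additionally linked when their local descriptors are sufficiently similar. The centroid-as-center simplification and the cosine-similarity threshold are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def build_object_graph(keypoints, descriptors, relevance_thresh=0.8):
    """Illustrative sketch of Section 3.2.1: directed center-to-keypoint edges plus
    relevance-based edges between salient points. The centroid-as-center choice and the
    cosine-similarity threshold are assumptions, not the published procedure."""
    center = keypoints.mean(axis=0)                                 # stand-in for center pooling
    center_edges = [("center", i) for i in range(len(keypoints))]   # hierarchical star edges

    # Relevance-based connections: link salient points whose local descriptors are similar.
    d = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    sim = d @ d.T
    relevance_edges = [(i, j) for i in range(len(keypoints))
                       for j in range(i + 1, len(keypoints))
                       if sim[i, j] > relevance_thresh]
    return center, center_edges, relevance_edges

# Toy usage with random salient points and 64-D local descriptors.
kps = np.random.rand(10, 2) * 100
descs = np.random.rand(10, 64)
center, c_edges, r_edges = build_object_graph(kps, descs)
```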
3.2.2. Shape Descriptor
3.2.3. Color Descriptor
3.2.4. Descriptor Fusion and Topology Embedding
3.3. Loss Functions and Training Objectives
3.3.1. Low-Level Loss
3.3.2. Object-Level Loss
3.3.3. High-Level Loss
4. Experiments
4.1. Datasets and Implementation Details
4.2. Performance Evaluation
- MMA@r: The percentage of keypoint matches with a reprojection error below a threshold of r pixels. We report MMA@3px in our results. (A computation sketch for the matching-level metrics follows this list.)
- AUC@r: The area under the cumulative curve of correct matches with respect to the reprojection error threshold, up to r pixels (e.g., AUC@2 and AUC@5).
- Success Rate (SR): Proportion of frames in which the Intersection over Union (IoU) between predicted and ground truth bounding boxes exceeds a given threshold.
- Precision Rate (PR): Percentage of frames in which the center distance between predicted and ground truth bounding boxes is below a defined threshold (typically 20 pixels).
- Expected Average Overlap (EAO): The expected average overlap between predicted and ground-truth regions over typical-length sequences, combining accuracy and robustness into a single score.
- Accuracy (A): The average overlap between predicted and ground-truth bounding boxes during periods of successful tracking.
- Robustness (R): The number of tracking failures or reinitializations during a sequence; lower values indicate more stable tracking.
- ATE (Absolute Trajectory Error) [m]: The global consistency of the estimated trajectory, measured as the root mean square error (RMSE) between estimated and ground-truth positions.
- RPE (Relative Pose Error) [m]: The local accuracy of the motion estimate, measured as the difference in relative pose over a fixed time interval.
- Translation Drift [%]: The accumulated translational error relative to the traveled distance.
- Rotation Drift [deg/m]: The average rotational error per meter traveled, indicating the angular stability of the estimated motion.
- Precision: The proportion of predicted matches that are correct among all predicted matches.
- Recall: The proportion of correct matches that are successfully retrieved by the model among all ground-truth matches.
- F1-score: The harmonic mean of precision and recall, providing a balanced measure between the two.
- mAP (mean Average Precision): The mean of the average precision (AP) over all queries. AP is computed as the area under the precision-recall curve for each query. Higher mAP values indicate better retrieval performance.
- Easy/Medium/Hard Protocols: The benchmarks are evaluated under three difficulty levels based on viewpoint and appearance variations in the query-target pairs. Performance is reported separately for each level (E, M, H).
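To make the matching-level metrics above concrete, the snippet below sketches MMA@r and precision/recall/F1 under simple assumptions: matches are given as index pairs, and reprojection error is the Euclidean distance between a warped keypoint and its ground-truth location. The input names are hypothetical.

```python
import numpy as np

def mma_at_r(pred_pts, gt_pts, r=3.0):
    """MMA@r: fraction of matches whose reprojection error is below r pixels."""
    err = np.linalg.norm(pred_pts - gt_pts, axis=1)
    return float((err < r).mean())

def precision_recall_f1(pred_matches, gt_matches):
    """Precision, recall, and F1 over sets of (query_idx, target_idx) match pairs."""
    pred, gt = set(pred_matches), set(gt_matches)
    tp = len(pred & gt)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```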
4.3. Descriptor Contribution Analysis Under Appearance and Viewpoint Changes
4.3.1. Similarity Gap Analysis
4.3.2. Object Matching
4.4. Computation Cost Analysis by Hierarchical Module
5. Discussion
6. Conclusions
Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Gledhill, D.; Tian, G.Y.; Taylor, D.; Clarke, D. Panoramic imaging—A review. Comput. Graph. 2003, 27, 435–445. [Google Scholar] [CrossRef]
- Bayoudh, K.; Knani, R.; Hamdaoui, F.; Mtibaa, A. A survey on deep multimodal learning for computer vision: Advances, trends, applications, and datasets. Vis. Comput. 2022, 38, 2939–2970. [Google Scholar] [CrossRef] [PubMed]
- Joo, S.; Bae, S.; Choi, J.; Park, H.; Lee, S.; You, S.; Uhm, T.; Moon, J.; Kuc, T. A flexible semantic ontological model framework and its application to robotic navigation in large dynamic environments. Electronics 2022, 11, 2420. [Google Scholar] [CrossRef]
- Bae, S.; Joo, S.; Choi, J.; Pyo, J.; Park, H.; Kuc, T. Semantic knowledge-based hierarchical planning approach for multi-robot systems. Electronics 2023, 12, 2131. [Google Scholar] [CrossRef]
- Choi, J.H.; Bae, S.H.; Gilberto, G.G.; Seo, D.S.; Kwon, S.W.; Kwon, G.H.; Ahn, Y.C.; Joo, K.J.; Kuc, T.Y. A Multi-robot Navigation Framework using Semantic Knowledge for Logistics Environment. In Proceedings of the 2024 24th International Conference on Control, Automation and Systems (ICCAS), Jeju, Republic of Korea, 29 October–1 November 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 927–932. [Google Scholar]
- Pyo, J.W.; Choi, J.H.; Kuc, T.Y. An Object-Centric Hierarchical Pose Estimation Method Using Semantic High-Definition Maps for General Autonomous Driving. Sensors 2024, 24, 5191. [Google Scholar] [CrossRef] [PubMed]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
- Zhao, X.; Ding, W.; An, Y.; Du, Y.; Yu, T.; Li, M.; Tang, M.; Wang, J. Fast segment anything. arXiv 2023, arXiv:2306.12156. [Google Scholar]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Bay, H.; Tuytelaars, T.; Van Gool, L. Surf: Speeded up robust features. In Proceedings of the Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Proceedings, Part I 9. Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar]
- Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 2564–2571. [Google Scholar]
- DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar]
- Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-net: A trainable cnn for joint description and detection of local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8092–8101. [Google Scholar]
- Revaud, J.; De Souza, C.; Humenberger, M.; Weinzaepfel, P. R2d2: Reliable and repeatable detector and descriptor. Adv. Neural Inf. Process. Syst. 2019, 32, 13665–13675. [Google Scholar]
- Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8922–8931. [Google Scholar]
- Zhou, Q.; Sattler, T.; Leal-Taixe, L. Patch2pix: Epipolar-guided pixel-level correspondences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4669–4678. [Google Scholar]
- Russakovsky, O.; Lin, Y.; Yu, K.; Fei-Fei, L. Object-centric spatial pooling for image classification. In Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Proceedings, Part II 12. Springer: Berlin/Heidelberg, Germany, 2012; pp. 1–15. [Google Scholar]
- Garg, K.; Puligilla, S.S.; Kolathaya, S.; Krishna, M.; Garg, S. Revisit Anything: Visual Place Recognition via Image Segment Retrieval. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 326–343. [Google Scholar]
- Locatello, F.; Weissenborn, D.; Unterthiner, T.; Mahendran, A.; Heigold, G.; Uszkoreit, J.; Dosovitskiy, A.; Kipf, T. Object-centric learning with slot attention. Adv. Neural Inf. Process. Syst. 2020, 33, 11525–11538. [Google Scholar]
- Vikström, O.; Ilin, A. Learning explicit object-centric representations with vision transformers. arXiv 2022, arXiv:2210.14139. [Google Scholar]
- Sajjadi, M.S.; Duckworth, D.; Mahendran, A.; Van Steenkiste, S.; Pavetic, F.; Lucic, M.; Guibas, L.J.; Greff, K.; Kipf, T. Object scene representation transformer. Adv. Neural Inf. Process. Syst. 2022, 35, 9512–9524. [Google Scholar]
- Engelcke, M.; Kosiorek, A.R.; Jones, O.P.; Posner, I. Genesis: Generative scene inference and sampling with object-centric latent representations. arXiv 2019, arXiv:1907.13052. [Google Scholar]
- Loghmani, M.R.; Planamente, M.; Caputo, B.; Vincze, M. Recurrent convolutional fusion for RGB-D object recognition. IEEE Robot. Autom. Lett. 2019, 4, 2878–2885. [Google Scholar] [CrossRef]
- Brenner, M.; Reyes, N.H.; Susnjak, T.; Barczak, A.L. RGB-D and thermal sensor fusion: A systematic literature review. IEEE Access 2023, 11, 82410–82442. [Google Scholar] [CrossRef]
- Sun, C.; Zhang, C.; Xiong, N. Infrared and visible image fusion techniques based on deep learning: A review. Electronics 2020, 9, 2162. [Google Scholar] [CrossRef]
- Xiao, Y.; Gao, G.; Wang, L.; Lai, H. Optical flow-aware-based multi-modal fusion network for violence detection. Entropy 2022, 24, 939. [Google Scholar] [CrossRef] [PubMed]
- Sun, W.; Cao, L.; Guo, Y.; Du, K. Multimodal and multiscale feature fusion for weakly supervised video anomaly detection. Sci. Rep. 2024, 14, 22835. [Google Scholar] [CrossRef] [PubMed]
- Yang, A.; Li, M.; Wu, Z.; He, Y.; Qiu, X.; Song, Y.; Du, W.; Gou, Y. CDF-net: A convolutional neural network fusing frequency domain and spatial domain features. IET Comput. Vis. 2023, 17, 319–329. [Google Scholar] [CrossRef]
- Chen, L.; Fu, Y.; Gu, L.; Yan, C.; Harada, T.; Huang, G. Frequency-aware feature fusion for dense image prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10763–10780. [Google Scholar] [CrossRef] [PubMed]
- Peng, S.; Cai, Y.; Yao, Z.; Tan, M. Weakly-supervised video anomaly detection via temporal resolution feature learning. Appl. Intell. 2023, 53, 30607–30625. [Google Scholar] [CrossRef]
- Chen, H.; Li, Y.; Fang, H.; Xin, W.; Lu, Z.; Miao, Q. Multi-scale attention 3D convolutional network for multimodal gesture recognition. Sensors 2022, 22, 2405. [Google Scholar] [CrossRef] [PubMed]
- Liu, F.; Zhang, Y.; Lu, T.; Wang, J.; Wang, L. Hierarchical in-out fusion for incomplete multimodal brain tumor segmentation. Sci. Rep. 2025, 15, 23017. [Google Scholar] [CrossRef] [PubMed]
- Kim, J.H.; Lee, S.W.; Kwak, D.; Heo, M.O.; Kim, J.; Ha, J.W.; Zhang, B.T. Multimodal residual learning for visual qa. Adv. Neural Inf. Process. Syst. 2016, 29, 361–369. [Google Scholar]
- Prakash, A.; Chitta, K.; Geiger, A. Multi-modal fusion transformer for end-to-end autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7077–7087. [Google Scholar]
- Gudhe, N.R.; Behravan, H.; Sudah, M.; Okuma, H.; Vanninen, R.; Kosma, V.M.; Mannermaa, A. Multi-level dilated residual network for biomedical image segmentation. Sci. Rep. 2021, 11, 14105. [Google Scholar] [CrossRef] [PubMed]
- Shi, L.; Zhao, S.; Niu, W. A welding defect detection method based on multiscale feature enhancement and aggregation. Nondestruct. Test. Eval. 2024, 39, 1295–1314. [Google Scholar] [CrossRef]
- Li, Y.; Daho, M.E.H.; Conze, P.H.; Zeghlache, R.; Le Boité, H.; Tadayoni, R.; Cochener, B.; Lamard, M.; Quellec, G. A review of deep learning-based information fusion techniques for multimodal medical image classification. Comput. Biol. Med. 2024, 177, 108635. [Google Scholar] [CrossRef] [PubMed]
- Han, X.; Chen, S.; Fu, Z.; Feng, Z.; Fan, L.; An, D.; Wang, C.; Guo, L.; Meng, W.; Zhang, X.; et al. Multimodal fusion and vision-language models: A survey for robot vision. arXiv 2025, arXiv:2504.02477. [Google Scholar]
- Yang, B.; Li, J.; Zeng, T. A review of environmental perception technology based on multi-sensor information fusion in autonomous driving. World Electr. Veh. J. 2025, 16, 20. [Google Scholar] [CrossRef]
- Xu, D.; Zhu, Y.; Choy, C.B.; Li, F.-F. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5410–5419. [Google Scholar]
- Yang, J.; Lu, J.; Lee, S.; Batra, D.; Parikh, D. Graph r-cnn for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 670–685. [Google Scholar]
- Ren, Y.; Zhao, Z.; Jiang, J.; Jiao, Y.; Yang, Y.; Liu, D.; Chen, K.; Yu, G. A Scene Graph Similarity-Based Remote Sensing Image Retrieval Algorithm. Appl. Sci. 2024, 14, 8535. [Google Scholar] [CrossRef]
- Lin, Z.; Zhu, F.; Wang, Q.; Kong, Y.; Wang, J.; Huang, L.; Hao, Y. RSSGG_CS: Remote sensing image scene graph generation by fusing contextual information and statistical knowledge. Remote Sens. 2022, 14, 3118. [Google Scholar] [CrossRef]
- Gao, G.; Xiong, Z.; Zhao, Y.; Zhang, L. Landmark Topology Descriptor-Based Place Recognition and Localization under Large View-Point Changes. Sensors 2023, 23, 9775. [Google Scholar] [CrossRef] [PubMed]
- Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
- Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
- Hu, W.; Liu, B.; Gomes, J.; Zitnik, M.; Liang, P.; Pande, V.; Leskovec, J. Strategies for pre-training graph neural networks. arXiv 2019, arXiv:1905.12265. [Google Scholar]
- Min, E.; Chen, R.; Bian, Y.; Xu, T.; Zhao, K.; Huang, W.; Zhao, P.; Huang, J.; Ananiadou, S.; Rong, Y. Transformer for graphs: An overview from architecture perspective. arXiv 2022, arXiv:2202.08455. [Google Scholar]
- Rong, Y.; Bian, Y.; Xu, T.; Xie, W.; Wei, Y.; Huang, W.; Huang, J. Self-supervised graph transformer on large-scale molecular data. Adv. Neural Inf. Process. Syst. 2020, 33, 12559–12571. [Google Scholar]
- Pan, C.H.; Qu, Y.; Yao, Y.; Wang, M.J.S. HybridGNN: A Self-Supervised Graph Neural Network for Efficient Maximum Matching in Bipartite Graphs. Symmetry 2024, 16, 1631. [Google Scholar] [CrossRef]
- Wang, D.; Lin, M.; Zhang, X.; Huang, Y.; Zhu, Y. Automatic Modulation Classification Based on CNN-Transformer Graph Neural Network. Sensors 2023, 23, 7281. [Google Scholar] [CrossRef] [PubMed]
- Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Noh, H.; Araujo, A.; Sim, J.; Weyand, T.; Han, B. Large-Scale Image Retrieval With Attentive Deep Local Features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Gordo, A.; Almazán, J.; Revaud, J.; Larlus, D. Deep Image Retrieval: Learning Global Representations for Image Search. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 241–257. [Google Scholar]
- Wang, R.; Shen, Y.; Zuo, W.; Zhou, S.; Zheng, N. TransVPR: Transformer-Based Place Recognition With Multi-Level Attention Aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13648–13657. [Google Scholar]
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2024, arXiv:2304.07193. [Google Scholar]
- Xu, K.; Wang, C.; Chen, C.; Wu, W.; Scherer, S. Aircode: A robust object encoding method. IEEE Robot. Autom. Lett. 2022, 7, 1816–1823. [Google Scholar] [CrossRef]
- Keetha, N.V.; Wang, C.; Qiu, Y.; Xu, K.; Scherer, S. AirObject: A Temporally Evolving Graph Embedding for Object Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8407–8416. [Google Scholar]
- Opanin Gyamfi, E.; Qin, Z.; Mantebea Danso, J.; Adu-Gyamfi, D. Hierarchical Graph Neural Network: A Lightweight Image Matching Model with Enhanced Message Passing of Local and Global Information in Hierarchical Graph Neural Networks. Information 2024, 15, 602. [Google Scholar] [CrossRef]
- Zhong, Z.; Li, C.T.; Pang, J. Hierarchical message-passing graph neural networks. Data Min. Knowl. Discov. 2023, 37, 381–408. [Google Scholar] [CrossRef]
- Chen, J.; Luo, Z.; Zhang, Z.; Huang, F.; Ye, Z.; Takiguchi, T.; Hancock, E.R. Polar transformation on image features for orientation-invariant representations. IEEE Trans. Multimed. 2018, 21, 300–313. [Google Scholar] [CrossRef]
- Matungka, R.; Zheng, Y.F.; Ewing, R.L. Image registration using adaptive polar transform. IEEE Trans. Image Process. 2009, 18, 2340–2354. [Google Scholar] [CrossRef] [PubMed]
- Rolínek, M.; Swoboda, P.; Zietlow, D.; Paulus, A.; Musil, V.; Martius, G. Deep graph matching via blackbox differentiation of combinatorial solvers. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 407–424. [Google Scholar]
- Lin, Y.C.; Wang, C.H.; Lin, Y.C. GAT TransPruning: Progressive channel pruning strategy combining graph attention network and transformer. PeerJ Comput. Sci. 2024, 10, e2012. [Google Scholar] [CrossRef] [PubMed]
- Mapelli, D.; Behrmann, M. The role of color in object recognition: Evidence from visual agnosia. Neurocase 1997, 3, 237–247. [Google Scholar] [CrossRef]
- Cucchiara, R.; Grana, C.; Piccardi, M.; Prati, A.; Sirotti, S. Improving shadow suppression in moving object detection with HSV color information. In Proceedings of the 2001 IEEE Intelligent Transportation Systems (ITSC 2001), Singapore, 28–30 May 2001; Proceedings (Cat. No. 01TH8585). IEEE: Piscataway, NJ, USA, 2001; pp. 334–339. [Google Scholar]
- Hdioud, B.; Tirari, M.E.H.; Thami, R.O.H.; Faizi, R. Detecting and shadows in the HSV color space using dynamic thresholds. Bull. Electr. Eng. Inform. 2018, 7, 70–79. [Google Scholar] [CrossRef]
- Xhonneux, L.P.; Qu, M.; Tang, J. Continuous graph neural networks. In Proceedings of the International Conference on Machine Learning, Online, 13–18 July 2020; pp. 10432–10441. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 13–18 July 2020; Daumé III, H., Singh, A., Eds.; PMLR—Proceedings of Machine Learning Research. Volume 119, pp. 1597–1607. [Google Scholar]
- Balntas, V.; Lenc, K.; Vedaldi, A.; Mikolajczyk, K. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5173–5182. [Google Scholar]
- Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2411–2418. [Google Scholar]
- Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Čehovin Zajc, L.; Vojir, T.; Bhat, G.; Lukezic, A.; Eldesokey, A.; et al. The sixth visual object tracking vot2018 challenge results. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018; pp. 3–53. [Google Scholar]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
- Radenović, F.; Iscen, A.; Tolias, G.; Avrithis, Y.; Chum, O. Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Luo, Z.; Zhou, L.; Bai, X.; Chen, H.; Zhang, J.; Yao, Y.; Li, S.; Fang, T.; Quan, L. Aslfeat: Learning local features of accurate shape and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6589–6598. [Google Scholar]
- Mishchuk, A.; Mishkin, D.; Radenovic, F.; Matas, J. Working hard to know your neighbor’s margins: Local descriptor learning loss. Adv. Neural Inf. Process. Syst. 2017, 30, 4826–4837. [Google Scholar]
- Barroso-Laguna, A.; Riba, E.; Ponsa, D.; Mikolajczyk, K. Key.Net: Keypoint detection by handcrafted and learned CNN filters. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5836–5844. [Google Scholar]
- Suwanwimolkul, S.; Komorita, S.; Tasaka, K. Learning of Low-Level Feature Keypoints for Accurate and Robust Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Virtual, 5–9 January 2021; pp. 2262–2271. [Google Scholar]
- Zhao, X.; Wu, X.; Miao, J.; Chen, W.; Chen, P.C.; Li, Z. Alike: Accurate and lightweight keypoint detection and descriptor extraction. IEEE Trans. Multimed. 2022, 25, 3101–3112. [Google Scholar] [CrossRef]
- Balntas, V.; Riba, E.; Ponsa, D.; Mikolajczyk, K. Learning local feature descriptors with triplets and shallow convolutional neural networks. In Proceedings of the BMVC, York, UK, 19–22 September 2016; Volume 1, p. 3. [Google Scholar]
- Wang, C.; Xu, R.; Zhang, Y.; Xu, S.; Meng, W.; Fan, B.; Zhang, X. MTLDesc: Looking wider to describe better. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 2388–2396. [Google Scholar]
- Luo, Z.; Shen, T.; Zhou, L.; Zhang, J.; Yao, Y.; Li, S.; Fang, T.; Quan, L. Contextdesc: Local descriptor augmentation with cross-modality context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2527–2536. [Google Scholar]
- Li, K.; Wang, L.; Liu, L.; Ran, Q.; Xu, K.; Guo, Y. Decoupling makes weakly supervised local feature better. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 15838–15848. [Google Scholar]
- Tyszkiewicz, M.; Fua, P.; Trulls, E. DISK: Learning local features with policy gradient. Adv. Neural Inf. Process. Syst. 2020, 33, 14254–14265. [Google Scholar]
- Xue, F.; Budvytis, I.; Cipolla, R. SFD2: Semantic-Guided Feature Detection and Description. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 5206–5216. [Google Scholar]
- Wang, Z.; Wu, C.; Yang, Y.; Li, Z. Learning transformation-predictive representations for detection and description of local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11464–11473. [Google Scholar]
- Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4310–4318. [Google Scholar]
- Xu, T.; Feng, Z.H.; Wu, X.J.; Kittler, J. Joint group feature selection and discriminative filter learning for robust visual object tracking. In Proceedings of the IEEE/CVF international Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7950–7960. [Google Scholar]
- Ma, C.; Yang, X.; Zhang, C.; Yang, M.H. Long-term correlation tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5388–5396. [Google Scholar]
- Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. ATOM: Accurate Tracking by Overlap Maximization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4660–4669. [Google Scholar]
- Bertinetto, L.; Valmadre, J.; Golodetz, S.; Miksik, O.; Torr, P.H. Staple: Complementary learners for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1401–1409. [Google Scholar]
- Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6668–6677. [Google Scholar]
- Yu, B.; Tang, M.; Zheng, L.; Zhu, G.; Wang, J.; Lu, H. High-Performance Discriminative Tracking with Target-Aware Feature Embeddings. In Proceedings of the Pattern Recognition and Computer Vision: 4th Chinese Conference, PRCV 2021, Beijing, China, 29 October–1 November 2021; Proceedings, Part I 4. Springer: Berlin/Heidelberg, Germany, 2021; pp. 3–15. [Google Scholar]
- Bhat, G.; Danelljan, M.; Van Gool, L.; Timofte, R. Know your surroundings: Exploiting scene information for object tracking. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 205–221. [Google Scholar]
- Yun, S.; Choi, J.; Yoo, Y.; Yun, K.; Young Choi, J. Action-decision networks for visual tracking with deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2711–2720. [Google Scholar]
- Mayer, C.; Danelljan, M.; Paudel, D.P.; Van Gool, L. Learning target candidate association to keep track of what not to track. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13444–13454. [Google Scholar]
- Lukezic, A.; Vojir, T.; Čehovin Zajc, L.; Matas, J.; Kristan, M. Discriminative correlation filter with channel and spatial reliability. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6309–6318. [Google Scholar]
- Yu, B.; Tang, M.; Zheng, L.; Zhu, G.; Wang, J.; Feng, H.; Feng, X.; Lu, H. High-performance discriminative tracking with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9856–9865. [Google Scholar]
- Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980. [Google Scholar]
- Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135. [Google Scholar]
- Wang, Q.; Gao, J.; Xing, J.; Zhang, M.; Hu, W. Dcfnet: Discriminant correlation filters network for visual tracking. arXiv 2017, arXiv:1704.04057. [Google Scholar]
- Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12549–12556. [Google Scholar]
- Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4938–4947. [Google Scholar]
- Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3651–3660. [Google Scholar]
Methods (HPatches) | MMA@3 ↑ | AUC@2 ↑ | AUC@5 ↑ | Methods (HPatches) | MMA@3 ↑ | AUC@2 ↑ | AUC@5 ↑ |
---|---|---|---|---|---|---|---|
SIFT [9] | 50.1 | 39.6 | 49.6 | ASLFeat [77] | 72.1 | 52.4 | 66.5 |
HardNet [78] | 62.1 | 42.6 | 59.9 | LoFTR [15] | 81.2 | 58.7 | 74.5 |
DELF [53] | 56.7 | 41.0 | 54.2 | Key.Net [79] | 73.2 | 54.0 | 68.5 |
LLF [80] | 56.2 | 41.2 | 50.4 | ALIKE [81] | 70.5 | 51.7 | 66.9 |
Lf-net [82] | 53.2 | 38.7 | 48.7 | MTLDesc [83] | 77.6 | 56.5 | 71.4 |
ContextDesc [84] | 63.3 | 44.6 | 59.0 | PoSFeat [85] | 76.9 | 56.0 | 69.9 |
DISK [86] | 72.2 | 52.3 | 66.4 | SFD2 [87] | 77.8 | 56.1 | 70.6 |
R2D2 [14] | 64.4 | 45.8 | 61.6 | TPR [88] | 79.8 | 57.1 | 73.0 |
D2Net [13] | 40.3 | 31.6 | 39.5 | SuperPoint [12] | 63.0 | 44.1 | 59.6 |
TOSD (shape) | 66.3 | 49.4 | 64.1 | TOSD (fusion) | 71.1 | 52.7 | 65.3 |
Tracker | OTB50 SR ↑ | OTB50 PR ↑ | OTB100 SR ↑ | OTB100 PR ↑ | Tracker | VOT2018 EAO ↑ | VOT2018 A ↑ | VOT2018 R ↓ |
---|---|---|---|---|---|---|---|---|
SRDCF [89] | 0.726 | 0.81 | 0.605 | 0.729 | GFS-DCF [90] | 0.397 | 0.511 | 0.143 |
LCT [91] | 0.711 | 0.780 | 0.61 | 0.655 | ATOM [92] | 0.401 | 0.590 | 0.204 |
Staple [93] | 0.745 | 0.766 | 0.593 | 0.848 | SiamBAN [94] | 0.452 | 0.597 | 0.178 |
HDT [95] | 0.603 | 0.889 | 0.539 | 0.848 | KYS [96] | 0.446 | 0.598 | 0.191 |
ADNet [97] | 0.659 | 0.903 | 0.590 | 0.803 | KeepTrack [98] | 0.476 | 0.615 | 0.172 |
CSR-DCF [99] | 0.678 | 0.773 | 0.587 | 0.733 | DTT [100] | 0.449 | 0.615 | 0.176 |
SiamRPN [101] | 0.663 | 0.800 | 0.631 | 0.853 | TransT [102] | 0.447 | 0.616 | 0.201 |
DCFNet [103] | 0.618 | 0.716 | 0.618 | 0.804 | SiamFC++ [104] | 0.426 | 0.587 | 0.183 |
TOSD (shape) | 0.617 | 0.767 | 0.562 | 0.767 | TOSD (shape) | 0.351 | 0.515 | 0.166 |
TOSD (color) | 0.501 | 0.708 | 0.443 | 0.664 | TOSD (color) | 0.307 | 0.410 | 0.147 |
TOSD (fusion) | 0.678 | 0.836 | 0.600 | 0.781 | TOSD (fusion) | 0.391 | 0.546 | 0.191 |
Method | ATE (m) ↓ | RPE (m) ↓ | Trans. Drift (%) ↓ | Rot. Drift (deg/m) ↓ |
---|---|---|---|---|
SIFT | 158.9 | 0.385 | 9.63 | 0.549 |
ORB | 392.2 | 0.31 | 10.5 | 1.08 |
SuperPoint | 338.8 | 0.262 | 9.1 | 0.82 |
TOSD | 156.5 | 0.263 | 4.2 | 0.778 |
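As a rough illustration of the ATE metric reported in the table above, the snippet below computes the RMSE between time-associated estimated and ground-truth positions. A full evaluation pipeline would normally also perform a rigid (Umeyama) trajectory alignment first, which is omitted here for brevity.

```python
import numpy as np

def absolute_trajectory_error(est_xyz, gt_xyz):
    """ATE as the RMSE of per-frame position differences.
    Assumes the two trajectories are already time-associated and aligned;
    standard tooling would first apply a rigid (Umeyama) alignment."""
    diff = est_xyz - gt_xyz
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```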
Sequence | Method | Precision ↑ | Recall ↑ | F1 ↑ | Method | Precision ↑ | Recall ↑ | F1 ↑ |
---|---|---|---|---|---|---|---|---|
00 | NetVLAD [52] | 0.211 | 0.413 | 0.279 | TOSD | 0.799 | 0.578 | 0.638 |
05 | NetVLAD [52] | 0.190 | 0.253 | 0.217 | TOSD | 0.720 | 0.399 | 0.490 |
06 | NetVLAD [52] | 0.390 | 0.274 | 0.322 | TOSD | 0.705 | 0.441 | 0.452 |
Method | Dimensions | ROxford Easy | ROxford Medium | ROxford Hard | RParis Easy | RParis Medium | RParis Hard |
---|---|---|---|---|---|---|---|
V-[O]-MAC [76] | 512 | 0.587 | 0.446 | 0.198 | 0.592 | 0.359 | 0.176 |
V-[O]-SPoC [76] | 512 | 0.601 | 0.459 | 0.212 | 0.598 | 0.324 | 0.158 |
V-[O]-CroW [76] | 512 | 0.612 | 0.472 | 0.225 | 0.629 | 0.369 | 0.184 |
V-[O]-GeM [76] | 512 | 0.623 | 0.483 | 0.237 | 0.632 | 0.388 | 0.196 |
V-[O]-R-MAC [76] | 512 | 0.635 | 0.496 | 0.248 | 0.662 | 0.409 | 0.208 |
TOSD (topology) | 512+@ | 0.620 | 0.449 | 0.185 | 0.624 | 0.324 | 0.147 |
TOSD (topology) | 256+@ | 0.591 | 0.429 | 0.183 | 0.594 | 0.319 | 0.142 |
Descriptor | Mean (Pos) | Std (Pos) | Mean (Neg) | Std (Neg) | Gap (Pos − Neg) |
---|---|---|---|---|---|
Shape | 0.815 | 0.061 | 0.335 | 0.063 | 0.752 |
Color | 0.783 | 0.079 | 0.369 | 0.071 | 0.712 |
Fusion | 0.852 | 0.043 | 0.270 | 0.051 | 0.801 |
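The statistics in the similarity-gap table above could be obtained along the lines of the sketch below: cosine similarities are computed for matched (positive) and non-matched (negative) descriptor pairs, and the gap summarizes the separation between the two distributions. How pairs are sampled, and the exact gap statistic used in the table, are not specified here and are treated as assumptions; a plain difference of means is shown for illustration.

```python
import numpy as np

def similarity_gap(pos_pairs, neg_pairs):
    """Sketch of the similarity-gap statistics: cosine similarity over matched (positive)
    and non-matched (negative) descriptor pairs, then the gap between the two means.
    Pair sampling and the exact gap statistic are assumptions, not the paper's protocol."""
    def cos_sims(pairs):
        a = np.stack([p[0] for p in pairs])
        b = np.stack([p[1] for p in pairs])
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        return (a * b).sum(axis=1)

    pos, neg = cos_sims(pos_pairs), cos_sims(neg_pairs)
    return {"mean_pos": pos.mean(), "std_pos": pos.std(),
            "mean_neg": neg.mean(), "std_neg": neg.std(),
            "gap": pos.mean() - neg.mean()}
```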
Descriptor | Recall@1 | Recall@5 | Recall@10 | AUC |
---|---|---|---|---|
Shape | 0.571 | 0.598 | 0.633 | 0.699 |
Color | 0.470 | 0.487 | 0.522 | 0.646 |
Fusion | 0.631 | 0.662 | 0.693 | 0.735 |
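For the object-matching results above, Recall@K can be computed from a query-by-candidate similarity matrix as sketched below; the input format (one correct candidate per query) is an assumption made for illustration.

```python
import numpy as np

def recall_at_k(sim_matrix, gt_index, k):
    """Recall@K for object matching: fraction of queries whose ground-truth target
    appears among the K most similar candidates. sim_matrix[q, c] is a similarity
    score; gt_index[q] is the correct candidate for query q (illustrative inputs)."""
    topk = np.argsort(-sim_matrix, axis=1)[:, :k]
    hits = [gt_index[q] in topk[q] for q in range(sim_matrix.shape[0])]
    return float(np.mean(hits))
```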
Model | Parameters (M) | FPS | Description |
---|---|---|---|
SuperPoint + SuperGlue [12,105] | 13.30 | 24.28 | Sparse keypoint detection and matching |
LoFTR [15] | 11.56 | 8.32 | Dense feature-level matching |
SAM [7] | 641.09 | 2.10 | Object-level zero-shot segmentation |
DINOv2-base [56] | 86.58 | 122.06 | Patch-level ViT feature embedding |
DINOv2-giant [56] | 1136.48 | 22.79 | Large-scale scene transformer |
DETR [106] | 43.04 | 18.48 | End-to-end object detection |
Ours | 25.37 | 12.70 | Hierarchical Object-Centric Descriptor |
Module | Parameters (M) | Time (ms) | Description |
---|---|---|---|
Preprocess | 13.10 | 34.10 | Segmentation and keypoint detection |
Low-level | 6.73 | 2.85 | Salient point abstraction and local encoding |
Object-level | 3.56 | 35.20 | Object-wise pooling and relational encoding |
Scene-level | 1.98 | 6.59 | Scene-level aggregation via object graph |
Total | 25.37 | 78.74 | Full hierarchical descriptor |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).