Monocular 3D Tooltip Tracking in Robotic Surgery—Building a Multi-Stage Pipeline
Abstract
1. Introduction
1.1. Robotic Surgery
1.2. Tool Tracking and Skill Assessment
1.3. Prior Approaches
1.4. Zero-Shot Tool Tracking
2. Materials and Methods
2.1. JIGSAWS Suturing Dataset
2.2. Computer Vision Pipeline
- Zero-Shot Tool Localization: A foundation model, Florence-2 [29], is employed to perform zero-shot localization of the surgical tools. Guided by a text prompt, the model identifies the position of the tools within each video frame and generates bounding box annotations, yielding detections even in the absence of task-specific training data.
- Zero-Shot Tool Segmentation: A second foundation model, Segment Anything 2 (SAM 2) [30], refines the localization. SAM 2 takes the detected bounding boxes as input and outputs segmentation masks for the tools present in the scene. The tooltip coordinates are then extracted from the mask geometry: the top-left mask pixel is selected for the right tool and the bottom-right mask pixel for the left tool, giving the tooltip positions in the 2D image plane (a minimal sketch of this extraction follows this list).
- Depth Estimation: To lift the 2D tooltip coordinates into 3D space, we employ a monocular depth estimation model, Metric3D [31], which computes a relative depth value in the range [0.0, 1.0] for each pixel. The 2D tooltip coordinates (xc, yc) are combined with the estimated depth (zc) to produce the 3D tooltip position (xc, yc, zc) in the camera coordinate frame.
- Transformation to World Coordinates: To reconstruct the 3D tooltip trajectories in the world coordinate frame, we apply a rigid transformation defined by a rotation matrix (R) and translation vector (T), the camera extrinsic parameters relative to the surgical robot. Applying this transformation maps the camera coordinates (xc, yc, zc) to world coordinates (xw, yw, zw), enabling trajectory reconstruction. The reconstructed trajectories are then compared against the ground-truth kinematic annotations of the JIGSAWS dataset to evaluate the approach (a sketch of this alignment also follows this list).
- Comparison with a Supervised Model: A subset of the dataset is annotated with precise tool masks by healthcare professionals. These annotations are used to train an independent, task-specific tool segmentation model (YOLO11 [32]). The supervised model achieves higher tool segmentation accuracy and lower inference time than the foundation models, making it better suited for real-time surgical applications. The trained model, combined with the same depth estimation process, is then applied to the remaining dataset for 3D tooltip tracking. Its results are compared with the zero-shot approach to quantify the benefits of domain-specific supervised training and to highlight the trade-offs between general-purpose foundation models and specialized supervised models for robotic-assisted surgery tasks.
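A minimal sketch of the tooltip-extraction and 2D-to-3D lifting steps above, written with NumPy. It assumes a binary segmentation mask per tool (from SAM 2 or YOLO11) and a per-pixel depth map (from Metric3D); the x + y scoring used to pick the "top-left"/"bottom-right" mask pixel and the helper names are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def tooltip_from_mask(mask: np.ndarray, side: str) -> tuple[int, int]:
    """Pick the tooltip pixel from a binary tool mask.

    One possible reading of the heuristic above: the top-left-most mask
    pixel for the right tool and the bottom-right-most for the left tool,
    ranked by the x + y pixel sum (an illustrative assumption).
    """
    ys, xs = np.nonzero(mask)            # coordinates of all mask pixels
    if xs.size == 0:
        raise ValueError("empty mask: no tool pixels detected")
    score = xs + ys                      # small = top-left, large = bottom-right
    idx = np.argmin(score) if side == "right" else np.argmax(score)
    return int(xs[idx]), int(ys[idx])

def lift_to_camera(x: int, y: int, depth_map: np.ndarray) -> np.ndarray:
    """Combine a 2D tooltip pixel with its estimated depth into (xc, yc, zc)."""
    zc = float(depth_map[y, x])          # relative depth in [0.0, 1.0] (e.g., Metric3D)
    return np.array([float(x), float(y), zc])
```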
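The camera-to-world mapping is a rigid transform. Below is a sketch of how R and T might be recovered by least-squares (Kabsch) alignment of camera-frame tooltip points against corresponding kinematic ground-truth points; the alignment procedure and function names are assumptions for illustration, since the text only states that the extrinsic parameters are estimated.

```python
import numpy as np

def estimate_extrinsics(P_cam: np.ndarray, P_world: np.ndarray):
    """Estimate R (3x3) and T (3,) such that P_world ≈ P_cam @ R.T + T.

    Least-squares rigid alignment (Kabsch); P_cam and P_world are (N, 3)
    arrays of corresponding tooltip positions.
    """
    mu_c, mu_w = P_cam.mean(axis=0), P_world.mean(axis=0)
    H = (P_cam - mu_c).T @ (P_world - mu_w)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    T = mu_w - R @ mu_c
    return R, T

def to_world(p_cam: np.ndarray, R: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Map a camera-frame point (xc, yc, zc) to world coordinates (xw, yw, zw)."""
    return R @ p_cam + T
```

With R and T estimated once per recording or setup, every camera-frame tooltip point is mapped through to_world before comparison with the kinematic ground truth.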
2.3. Data Preprocessing
3. Results
3.1. Evaluation Metrics
- APD: The 3D-APD metric measures the percentage of predicted points that lie within a threshold distance of the corresponding ground-truth 3D points. The threshold is adapted dynamically to the depth of each point, so that points closer to the camera (smaller depth) are held to a stricter tolerance. The metric uses the Euclidean norm and an indicator function to assess tracking accuracy, and is computed over the ground-truth points that are visible in each frame.
- 3D-Average Jaccard (AJ): The 3D-AJ metric is the ratio of true positives (points correctly predicted as visible within a distance threshold) to the sum of true positives, false positives (points incorrectly predicted as visible within the threshold), and false negatives (visible points incorrectly predicted as occluded or falling outside the threshold). The metric rewards both tracking accuracy and depth correctness; with perfect depth estimation it reduces to its 2D counterpart. It is particularly suited to evaluating 3D trajectory predictions across video frames (an illustrative sketch of both metrics follows this list).
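An illustrative, simplified rendering of the two metrics, loosely following the TAPVid-3D formulation (Koppula et al.). The depth-scaled threshold, the single-threshold evaluation (the benchmark averages over several thresholds), and the visibility handling are simplifying assumptions for this sketch, not the exact benchmark code.

```python
import numpy as np

def apd_3d(pred, gt, gt_visible, base_thresh=0.1):
    """Percentage of visible ground-truth points whose prediction lies within
    a depth-scaled Euclidean threshold (closer points get a tighter threshold)."""
    thresh = base_thresh * gt[:, 2]                  # threshold shrinks with proximity
    dist = np.linalg.norm(pred - gt, axis=1)         # per-point Euclidean error
    hits = (dist <= thresh) & gt_visible
    return 100.0 * hits.sum() / max(int(gt_visible.sum()), 1)

def aj_3d(pred, gt, gt_visible, pred_visible, base_thresh=0.1):
    """Jaccard-style score: TP / (TP + FP + FN), where a true positive is a
    point predicted visible and lying within the depth-scaled threshold."""
    within = np.linalg.norm(pred - gt, axis=1) <= base_thresh * gt[:, 2]
    tp = int((gt_visible & pred_visible & within).sum())
    fp = int((pred_visible & ~(gt_visible & within)).sum())
    fn = int((gt_visible & ~(pred_visible & within)).sum())
    return 100.0 * tp / max(tp + fp + fn, 1)
```

Here pred and gt are (N, 3) arrays of predicted and ground-truth 3D points, and gt_visible / pred_visible are boolean visibility flags per point.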
3.2. Performance Comparison
4. Discussion
4.1. Performance Insights
4.2. Considerations and Future Work
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Picozzi, P.; Nocco, U.; Labate, C.; Gambini, I.; Puleo, G.; Silvi, F.; Pezzillo, A.; Mantione, R.; Cimolin, V. Advances in Robotic Surgery: A Review of New Surgical Platforms. Electronics 2024, 13, 4675. [Google Scholar] [CrossRef]
- Sheetz, K.H.; Claflin, J.; Dimick, J.B. Trends in the Adoption of Robotic Surgery for Common Surgical Procedures. JAMA Netw. Open. 2020, 3, e1918911. [Google Scholar] [CrossRef] [PubMed]
- Anderberg, M.; Larsson, J.; Kockum, C.C.; Arnbjörnsson, E. Robotics versus laparoscopy—An experimental study of the transfer effect in maiden users. Ann. Surg. Innov. Res. 2010, 4, 3. [Google Scholar] [CrossRef] [PubMed]
- Lim, C.; Barragan, J.A.; Farrow, J.M.; Wachs, J.P.; Sundaram, C.P.; Yu, D. Physiological Metrics of Surgical Difficulty and Multi-Task Requirement During Robotic Surgery Skills. Sensors 2023, 23, 4354. [Google Scholar] [CrossRef]
- Mason, J.D.; Ansell, J.; Warren, N.; Torkington, J. Is motion analysis a valid tool for assessing laparoscopic skill? Surg. Endosc. 2013, 27, 1468–1477. [Google Scholar] [CrossRef]
- Ghasemloonia, A.; Maddahi, Y.; Zareinia, K.; Lama, S.; Dort, J.C.; Sutherland, G.R. Surgical Skill Assessment Using Motion Quality and Smoothness. J. Surg. Educ. 2017, 74, 295–305. [Google Scholar] [CrossRef]
- Ebina, K.; Abe, T.; Yan, L.; Hotta, K.; Shichinohe, T.; Higuchi, M.; Iwahara, N.; Hosaka, Y.; Harada, S.; Kikuchi, H.; et al. A surgical instrument motion measurement system for skill evaluation in practical laparoscopic surgery training. PLoS ONE 2024, 19, e0305693. [Google Scholar] [CrossRef]
- Gerull, W.D.; Kulason, S.; Shields, M.C.; Yee, A.; Awad, M.M. Impact of robotic surgery objective performance indicators: A systematic review. J. Am. Coll. Surg. 2025, 240, 201–210. [Google Scholar] [CrossRef]
- Mattingly, A.S.; Chen, M.M.; Divi, V.; Holsinger, F.C.; Saraswathula, A. Minimally Invasive Surgery in the United States, 2022: Understanding Its Value Using New Datasets. J. Surg. Res. 2023, 281, 33–36. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
- Garrow, C.R.; Kowalewski, K.; Li, L.; Wagner, M.; Schmidt, M.W.; Engelhardt, S.; Hashimoto, D.A.; Kenngott, H.G.; Bodenstedt, S.; Speidel, S.; et al. Machine learning for surgical phase recognition: A systematic review. Ann. Surg. 2021, 273, 684–693. [Google Scholar] [CrossRef]
- Chadebecq, F.; Vasconcelos, F.; Mazomenos, E.; Stoyanov, D. Computer Vision in the Surgical Operating Room. Visc. Med. 2020, 36, 456–462. [Google Scholar] [CrossRef] [PubMed]
- Dick, L.; Boyle, C.P.; Skipworth, R.J.E.; Smink, D.S.; Tallentire, V.R.; Yule, S. Automated analysis of operative video in surgical training: Scoping review. BJS Open 2024, 8, zrae124. [Google Scholar] [CrossRef]
- Nema, S.; Vachhani, L. Surgical instrument detection and tracking technologies: Automating dataset labeling for surgical skill assessment. Front. Robot. AI 2022, 9, 1030846. [Google Scholar] [CrossRef]
- Chen, Z.; Zhao, Z.; Cheng, X. Surgical instruments tracking based on deep learning with lines detection and spatio-temporal context. In Proceedings of the Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; pp. 2711–2714. [Google Scholar] [CrossRef]
- Deol, E.S.; Henning, G.; Basourakos, S.; Vasdev, R.M.S.; Sharma, V.; Kavoussi, N.L.; Karnes, R.J.; Leibovich, B.C.; Boorjian, S.A.; Khanna, A. Artificial intelligence model for automated surgical instrument detection and counting: An experimental proof-of-concept study. Patient Saf. Surg. 2024, 18, 24. [Google Scholar] [CrossRef]
- Nwoye, C.I.; Padoy, N. SurgiTrack: Fine-Grained Multi-Class Multi-Tool Tracking in Surgical Videos. arXiv 2024, arXiv:2405.20333. [Google Scholar]
- Ye, M.; Zhang, L.; Giannarou, S.; Yang, G. Real-Time 3D Tracking of Articulated Tools for Robotic Surgery. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2016: Proceedings of the 19th International Conference, Athens, Greece, 17–21 October 2016; Springer International Publishing: New York, NY, USA, 2016; pp. 386–394. [Google Scholar] [CrossRef]
- Hao, R.; Özgüner, O.; Çavuşoğlu, M.C. Vision-Based Surgical Tool Pose Estimation for the da Vinci® Robotic Surgical System. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018; pp. 1298–1305. [Google Scholar] [CrossRef]
- Franco-González, I.T.; Lappalainen, N.; Bednarik, R. Tracking 3D motion of instruments in microsurgery: A comparative study of stereoscopic marker-based vs. Deep learning method for objective analysis of surgical skills. Inform. Med. Unlocked 2024, 51, 101593. [Google Scholar] [CrossRef]
- Gerats, B.G.; Wolterink, J.M.; Mol, S.P.; Broeders, I.A. Neural Fields for 3D Tracking of Anatomy and Surgical Instruments in Monocular Laparoscopic Video Clips. arXiv 2024, arXiv:2403.19265. [Google Scholar]
- Gao, Y.; Vedula, S.S.; Reiley, C.E.; Ahmidi, N.; Varadarajan, B.; Lin, H.C.; Tao, L.; Zappella, L.; Béjar, B.; Yuh, D.D.; et al. The JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS): A Surgical Activity Dataset for Human Motion Modeling. In Modeling and Monitoring of Computer Assisted Interventions (M2CAI)—MICCAI Workshop; Johns Hopkins University: Baltimore, MD, USA, 2014. [Google Scholar]
- Li, C.; Gan, Z.; Yang, Z.; Yang, J.; Li, L.; Wang, L.; Gao, J. Multimodal Foundation Models: From Specialists to General-Purpose Assistants. arXiv 2023, arXiv:2309.10020. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. arXiv 2023, arXiv:2304.08485. [Google Scholar]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 3992–4003. [Google Scholar] [CrossRef]
- Papp, D.; Elek, R.N.; Haidegger, T. Surgical Tool Segmentation on the JIGSAWS Dataset for Autonomous Image-Based Skill Assessment. In Proceedings of the 2022 IEEE 10th Jubilee International Conference on Computational Cybernetics and Cyber-Medical Systems (ICCC), Reykjavík, Iceland, 6–9 July 2022; pp. 000049–000056. [Google Scholar] [CrossRef]
- Lefor, A.K.; Harada, K.; Dosis, A.; Mitsuishi, M. Motion analysis of the JHU-ISI Gesture and Skill Assessment Working Set using Robotics Video and Motion Assessment Software. Int. J. Comput. Assist. Radiol. Surg. 2020, 15, 2017–2025. [Google Scholar] [CrossRef] [PubMed]
- Carciumaru, T.Z.; Tang, C.M.; Farsi, M.; Bramer, W.M.; Dankelman, J.; Raman, C.; Dirven, C.M.; Gholinejad, M.; Vasilic, D. Systematic review of machine learning applications using nonoptical motion tracking in surgery. NPJ Digit. Med. 2025, 8, 28. [Google Scholar] [CrossRef] [PubMed]
- Xiao, B.; Wu, H.; Xu, W.; Dai, X.; Hu, H.; Lu, Y.; Zeng, M.; Liu, C.; Yuan, L. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. arXiv 2023, arXiv:2311.06242. [Google Scholar]
- Ravi, N.; Gabeur, V.; Hu, Y.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment Anything in Images and Videos. arXiv 2024, arXiv:2408.00714. [Google Scholar]
- Yin, W.; Zhang, C.; Chen, H.; Cai, Z.; Yu, G.; Wang, K.; Chen, X.; Shen, C. Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image. arXiv 2023, arXiv:2307.10984. [Google Scholar]
- Jocher, G.; Qiu, J. Ultralytics Yolo11. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 2 November 2024).
- Lou, A.; Li, Y.; Zhang, Y.; Labadie, R.F.; Noble, J. Zero-Shot Surgical Tool Segmentation in Monocular Video Using Segment Anything Model 2. arXiv 2024, arXiv:2408.01648. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Liu, Z.; Mao, H.; Wu, C.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. arXiv 2022, arXiv:2201.03545. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
- Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
- Koppula, S.; Rocco, I.; Yang, Y.; Heyward, J.; Carreira, J.; Zisserman, A.; Brostow, G.; Doersch, C. TAPVid-3D: A Benchmark for Tracking Any Point in 3D. arXiv 2024, arXiv:2407.05921. [Google Scholar]
- Ahmidi, N.; Tao, L.; Sefati, S.; Gao, Y.; Lea, C.; Haro, B.B.; Zappella, L.; Khudanpur, S.; Vidal, R.; Hager, G.D. A Dataset and Benchmarks for Segmentation and Recognition of Gestures in Robotic Surgery. IEEE Trans. Bio-Med. Eng. 2017, 64, 2025. [Google Scholar] [CrossRef] [PubMed]
- Pelanis, E.; Teatini, A.; Eigl, B.; Regensburger, A.; Alzaga, A.; Kumar, R.P.; Rudolph, T.; Aghayan, D.L.; Riediger, C.; Kvarnström, N.; et al. Evaluation of a novel navigation platform for laparoscopic liver surgery with organ deformation compensation using injected fiducials. Med. Image Anal. 2021, 69, 101946. [Google Scholar] [CrossRef] [PubMed]
- Ozbulak, U.; Mousavi, S.A.; Tozzi, F.; Rashidian, N.; Willaert, W.; De Neve, W.; Vankerschaver, J. Less is More? Revisiting the Importance of Frame Rate in Real-Time Zero-Shot Surgical Video Segmentation. arXiv 2025, arXiv:2502.20934. [Google Scholar]
| Tooltip Detection | Tool Type | MAE (m) | MSE (m) | Disp. Error (m) | AJ (%) | APD (%) |
|---|---|---|---|---|---|---|
| Zero-Shot | Left | 0.0086 | 0.0122 | 0.0176 | - | - |
| | Right | 0.0063 | 0.0081 | 0.0128 | - | - |
| | Both | 0.0074 | 0.0101 | 0.0152 | 84.5 | 88.2 |
| Supervised (YOLO11x-seg) | Left | 0.0076 | 0.0104 | 0.0149 | - | - |
| | Right | 0.0054 | 0.0069 | 0.0110 | - | - |
| | Both | 0.0065 | 0.0086 | 0.0130 | 91.2 | 93.4 |
| Tooltip Detection | Component | Use | # Parameters (M) | Inference Time (ms) |
|---|---|---|---|---|
| Zero-Shot | Florence-2 | Bounding Box Proposals | 230 | 34.4 |
| | SAM 2 | Tooltip Detection | 224.4 | 24.1 |
| | Metric3D | Depth Estimation / 3D Coordinate Calculation | 142 | 28.2 |
| | Total | - | - | 86.7 |
| Supervised | YOLO11x-seg | Tooltip Detection | 62 | 14.2 |
| | Metric3D | Depth Estimation / 3D Coordinate Calculation | 142 | 28.2 |
| | Total | - | - | 42.4 |