TriQuery: A Query-Based Model for Surgical Triplet Recognition
Abstract
1. Introduction
1. We propose TriQuery, a query-centric multi-task framework for surgical video recognition. It jointly models instruments, verbs, targets, and their triplet combinations through task-specific queries and a dedicated MQ-DH module, enabling semantic disentanglement and mitigating long-tailed class imbalance (a minimal implementation sketch of such a multi-query head is given after this list).
2. We introduce the TKQ module, which reuses high-confidence queries from the previous frame to guide current decoding, enhancing temporal consistency with minimal overhead (see the second sketch after this list).
3. Our framework achieves superior performance on the CholecT45 dataset, outperforming baseline methods while providing attention visualizations that facilitate clinical applicability.
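To make the first contribution concrete, the sketch below shows one plausible PyTorch realization of a multi-query decoding head in the spirit of MQ-DH: a separate group of learnable queries per task cross-attends to the backbone's frame features, and each decoded query is scored for class presence. The embedding size, layer counts, one-query-per-class design, and the CholecT45-style label-space sizes (6 instruments, 10 verbs, 15 targets, 100 triplets) are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MultiQueryDecodingHead(nn.Module):
    """Illustrative multi-query decoding head: one query set per task."""

    def __init__(self, d_model=256, n_heads=8, n_layers=2, n_classes=None):
        super().__init__()
        # Assumed CholecT45-style label-space sizes; adjust to the actual dataset.
        n_classes = n_classes or {"instrument": 6, "verb": 10, "target": 15, "triplet": 100}
        # One learnable query per class and task (a DETR-style choice assumed here).
        self.queries = nn.ParameterDict({
            task: nn.Parameter(torch.randn(num, d_model) * 0.02)
            for task, num in n_classes.items()
        })
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.score = nn.Linear(d_model, 1)  # per-query presence logit

    def forward(self, feats):
        # feats: (B, N, d_model) flattened frame features from the visual backbone.
        logits, decoded = {}, {}
        for task, q in self.queries.items():
            q = q.unsqueeze(0).expand(feats.size(0), -1, -1)  # (B, C_task, d_model)
            dec = self.decoder(q, feats)                      # cross-attend to frame features
            decoded[task] = dec                               # kept for temporal reuse (TKQ)
            logits[task] = self.score(dec).squeeze(-1)        # (B, C_task) multi-label logits
        return logits, decoded
```

Each task's logits would then be supervised with a multi-label objective (e.g., binary cross-entropy) against the frame annotations; the per-task query separation is what allows the head to disentangle instrument, verb, and target semantics.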
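For the TKQ idea, the sketch below shows one plausible way to reuse the previous frame's high-confidence decoded queries: the K most confident class queries from the last frame are blended into the current frame's initial queries. The per-class correspondence across frames, the convex blend with weight `alpha`, and the function name `topk_guided_queries` are assumptions for illustration; the paper's exact update rule may differ.

```python
import torch

def topk_guided_queries(curr_queries, prev_decoded, prev_scores, k=5, alpha=0.5):
    """Blend the current frame's task queries with the previous frame's
    K most confident decoded queries (illustrative TKQ-style update).

    curr_queries: (C, d)    learnable queries for one task
    prev_decoded: (B, C, d) decoded query embeddings from the previous frame
    prev_scores:  (B, C)    sigmoid confidences from the previous frame
    returns:      (B, C, d) guided queries for the current frame
    """
    B, C, d = prev_decoded.shape
    guided = curr_queries.unsqueeze(0).expand(B, -1, -1).clone()  # (B, C, d)
    topk = prev_scores.topk(min(k, C), dim=1).indices             # (B, k) most confident classes
    idx = topk.unsqueeze(-1).expand(-1, -1, d)                    # (B, k, d) gather/scatter index
    prev_top = prev_decoded.gather(1, idx)                        # previous high-confidence queries
    curr_top = guided.gather(1, idx)
    guided.scatter_(1, idx, (1 - alpha) * curr_top + alpha * prev_top)
    return guided
```

The guided queries would simply replace the initial queries in the decoding head for the next frame; since only K embeddings are mixed per task, the added cost per frame is negligible, which matches the "minimal overhead" claim above.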
2. Related Work
3. Methods
3.1. TriQuery Architecture Overview
3.2. Swin-T Backbone for Hierarchical Visual Feature Extraction
3.3. Multi-Query Decoding Head for Structured Multi-Task Learning
3.4. Top-K Guided Query Update for Temporal Coherence
3.5. Loss Function
4. Experimental Setup
4.1. Dataset
4.2. Evaluation Metrics
4.3. Implementation Details
5. Experimental Results
5.1. Comparison with State-of-the-Art Methods
5.2. Attention Map Visualization
5.3. Prediction Result Visualization
6. Discussion
6.1. Ablation Study: Task Queries and Temporal Guidance
6.2. Ablation Study: Module Design Analysis
6.3. Opportunities for Enhancement
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Bain, A.P.; Holcomb, C.N.; Zeh III, H.J.; Sankaranarayanan, G. Artificial intelligence for improving intraoperative surgical care. Glob. Surg. Educ.-J. Assoc. Surg. Educ. 2024, 3, 73.
2. Lukács, E.; Levendovics, R.; Haidegger, T. Enhancing autonomous skill assessment of robot-assisted minimally invasive surgery: A comprehensive analysis of global and gesture-level techniques applied on the JIGSAWS dataset. Acta Polytech. Hung. 2023, 20, 133–153.
3. Aspart, F.; Bolmgren, J.L.; Lavanchy, J.L.; Beldi, G.; Woods, M.S.; Padoy, N.; Hosgor, E. ClipAssistNet: Bringing real-time safety feedback to operating rooms. Int. J. Comput. Assist. Radiol. Surg. 2022, 17, 5–13.
4. Islam, M.M.; Islam, M.Z.; Asraf, A.; Al-Rakhami, M.S.; Ding, W.; Sodhro, A.H. Diagnosis of COVID-19 from X-rays using combined CNN-RNN architecture with transfer learning. BenchCouncil Trans. Benchmarks Stand. Eval. 2022, 2, 100088.
5. Liu, H.; Tu, J.; Liu, M. Two-stream 3D convolutional neural network for skeleton-based action recognition. arXiv 2017, arXiv:1705.08106.
6. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 3202–3211.
7. Bao, F.; Nie, S.; Xue, K.; Cao, Y.; Li, C.; Su, H.; Zhu, J. All are worth words: A ViT backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023.
8. Jiao, E.; Leng, Q.; Guo, J.; Meng, X.; Wang, C. Vision Transformer with window sequence merging mechanism for image classification. Appl. Soft Comput. 2025, 171, 112811.
9. Luvizon, D.C.; Picard, D.; Tabia, H. Multi-task deep learning for real-time 3D human pose estimation and action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2752–2764.
10. Gan, Z.; Jin, L.; Nie, L.; Wang, Z.; Zhou, L.; Li, L.; Wang, Z.; Li, J.; Xing, J.; Zhao, J. ASQuery: A query-based model for action segmentation. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6.
11. Zhang, Y.; Yang, Q. An overview of multi-task learning. Natl. Sci. Rev. 2018, 5, 30–43.
12. Jia, Y.; Dang, R.; Wang, D.; Wu, Z.; Yang, T.; Tian, Z.; Yin, J. Visual SLAM for dynamic environment using pre-frame semantic. In Proceedings of the 2024 IEEE International Conference on Unmanned Systems (ICUS), Nanjing, China, 18–20 October 2024; pp. 37–42.
13. Wang, L.; Zang, J.; Zhang, Q.; Niu, Z.; Hua, G.; Zheng, N. Action recognition by an attention-aware temporal weighted convolutional neural network. Sensors 2018, 18, 1979.
14. Kim, H.W.; Choi, Y.S. Fusion attention for action recognition: Integrating sparse-dense and global attention for video action recognition. Sensors 2024, 24, 6842.
15. Wei, W.; Zhu, C.; Hu, L.; Liu, P. Application of a transfer learning model combining CNN and self-attention mechanism in wireless signal recognition. Sensors 2025, 25, 4202.
16. Xu, Y.; Wang, Y. Single-image super-resolution via cascaded non-local mean network and dual-path multi-branch fusion. Sensors 2025, 25, 4044.
17. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Computer Vision—ECCV 2020, Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229.
18. Nwoye, C.I.; Yu, T.; Gonzalez, C.; Seeliger, B.; Mascagni, P.; Mutter, D.; Marescaux, J.; Padoy, N. Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos. Med. Image Anal. 2022, 78, 102433.
19. Tufail, H.; Naseer, A.; Tamoor, M.; Ali, A.R. Advancements in query-based tabular data retrieval: Detecting image data tables and extracting text using convolutional neural networks. Preprints 2024, 2024080108.
20. Liu, C.; Zhang, B.; Bo, C.; Wang, D. Query-based object visual tracking with parallel sequence generation. Sensors 2024, 24, 4802.
21. Zhou, S.; Yang, P.; Wang, J.; Luo, Y.; Loy, C.C. Upscale-a-Video: Temporal-consistent diffusion model for real-world video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 2535–2545.
22. Hori, S.; Omi, K.; Tamaki, T. Query matching for spatio-temporal action detection with query-based object detector. arXiv 2024, arXiv:2409.18408.
23. Nwoye, C.I.; Padoy, N. Data splits and metrics for benchmarking methods on surgical action triplet datasets. arXiv 2022, arXiv:2204.05235.
24. Wang, Z.; Song, Y.; Pang, L.; Li, S.; Sun, G. Attention-enhanced CNN-LSTM model for exercise oxygen consumption prediction with multi-source temporal features. Sensors 2025, 25, 4062.
25. Gong, S.; Yan, X.; Fang, Y.; Paul, A.; Wu, Z.; Chen, J. A dual-stream CNN-BiLSTM for human motion recognition with raw radar data. IEEE Sens. J. 2024, 24, 25094–25105.
Comparison with state-of-the-art methods: triplet recognition Acc, F1, and AP on cross-validation Folds 1–3 and the overall mean.

| Method | Fold 1 Acc | Fold 1 F1 | Fold 1 AP | Fold 2 Acc | Fold 2 F1 | Fold 2 AP | Fold 3 Acc | Fold 3 F1 | Fold 3 AP | Mean Acc | Mean F1 | Mean AP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CNN+RNN [4] | 71.86 | 53.46 | 41.44 | 70.52 | 56.57 | 42.91 | 69.73 | 58.59 | 44.78 | 69.59 | 54.52 | 41.37 |
| CNN-LSTM [24] | 72.52 | 54.87 | 42.29 | 69.02 | 57.91 | 43.86 | 68.96 | 58.01 | 43.83 | 70.54 | 54.18 | 41.17 |
| Dual-stream CNN [25] | 73.06 | 54.65 | 43.28 | 70.39 | 56.41 | 43.06 | 70.14 | 56.69 | 43.16 | 70.78 | 53.56 | 41.35 |
| 3D CNN [5] | 74.05 | 50.65 | 39.22 | 69.73 | 50.62 | 38.32 | 72.96 | 53.52 | 41.64 | 70.97 | 49.49 | 38.06 |
| Rendezvous [18] | 72.18 | 51.23 | 34.93 | 69.88 | 52.14 | 39.13 | 49.41 | 50.11 | 36.87 | 69.11 | 50.97 | 37.98 |
| Swin-T (Baseline) [6] | 72.24 | 54.47 | 43.05 | 67.58 | 55.57 | 41.21 | 68.06 | 56.75 | 42.08 | 68.10 | 53.69 | 40.15 |
| TriQuery (ours) | 75.55 | 59.19 | 48.63 | 71.79 | 60.77 | 47.33 | 71.57 | 59.85 | 46.70 | 71.53 | 57.79 | 45.19 |
Per-task recognition results (Acc / F1 / AP) for the instrument, verb, and target components, and for the full triplet.

| Method | Instrument Acc | Instrument F1 | Instrument AP | Verb Acc | Verb F1 | Verb AP | Target Acc | Target F1 | Target AP | Triplet Acc | Triplet F1 | Triplet AP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CNN+RNN [4] | 95.99 | 81.00 | 85.80 | 94.81 | 71.47 | 69.55 | 84.67 | 63.00 | 54.89 | 69.59 | 54.52 | 41.37 |
| CNN-LSTM [24] | 96.04 | 81.04 | 85.95 | 94.79 | 71.48 | 69.38 | 84.09 | 62.81 | 54.60 | 70.54 | 54.18 | 41.17 |
| Dual-stream CNN [25] | 95.99 | 80.98 | 85.76 | 94.82 | 71.39 | 69.73 | 84.78 | 62.73 | 54.67 | 70.78 | 53.56 | 41.35 |
| 3D CNN [5] | 95.96 | 80.97 | 85.64 | 94.69 | 70.29 | 68.77 | 84.93 | 60.75 | 53.09 | 70.97 | 49.49 | 38.06 |
| Rendezvous [18] | 95.48 | 79.23 | 84.08 | 94.51 | 69.28 | 67.67 | 82.65 | 53.22 | 46.33 | 69.11 | 50.97 | 37.98 |
| Swin-T (Baseline) [6] | 95.91 | 81.14 | 85.41 | 94.70 | 71.53 | 69.05 | 83.59 | 62.39 | 53.54 | 68.10 | 53.69 | 40.15 |
| TriQuery (ours) | 96.88 | 83.51 | 88.67 | 98.29 | 74.30 | 71.82 | 84.00 | 66.47 | 57.69 | 71.53 | 57.79 | 45.19 |
Ablation of the MQ-DH and TKQ modules (✓ = module enabled; the first row is the Swin-T baseline).

| MQ-DH | TKQ | Instrument Acc | Instrument F1 | Instrument AP | Verb Acc | Verb F1 | Verb AP | Target Acc | Target F1 | Target AP | Triplet Acc | Triplet F1 | Triplet AP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | 95.91 | 81.14 | 85.41 | 94.70 | 71.53 | 69.05 | 83.59 | 62.39 | 53.54 | 68.10 | 53.69 | 40.15 |
| ✓ | | 96.90 | 83.50 | 88.82 | 95.28 | 74.29 | 72.29 | 83.91 | 66.48 | 57.60 | 71.32 | 57.78 | 45.18 |
| ✓ | ✓ | 96.88 | 83.51 | 88.67 | 98.29 | 74.30 | 71.82 | 84.00 | 66.47 | 57.69 | 71.53 | 57.79 | 45.19 |
Ablation of the task-specific queries (✓ = the corresponding query is enabled; dashes mark tasks without a dedicated query in that configuration).

| Q-Instrument | Q-Verb | Q-Target | Q-Triplet | Instrument Acc | Instrument F1 | Instrument AP | Verb Acc | Verb F1 | Verb AP | Target Acc | Target F1 | Target AP | Triplet Acc | Triplet F1 | Triplet AP | Triplet (Best) Acc | Triplet (Best) F1 | Triplet (Best) AP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | 95.91 | 81.14 | 85.41 | 94.70 | 71.53 | 69.05 | 83.59 | 62.39 | 53.54 | 68.10 | 53.69 | 40.15 | 72.24 | 54.47 | 43.05 |
| ✓ | | | | 96.89 | 83.41 | 88.83 | – | – | – | – | – | – | 70.88 | 57.73 | 45.07 | 73.52 | 58.84 | 47.21 |
| | ✓ | | | – | – | – | 95.23 | 74.08 | 72.03 | – | – | – | 71.04 | 57.15 | 45.08 | 74.21 | 58.13 | 48.33 |
| | | ✓ | | – | – | – | – | – | – | 83.81 | 66.26 | 57.39 | 71.07 | 57.52 | 44.97 | 74.47 | 58.77 | 47.77 |
| ✓ | ✓ | | | 96.91 | 83.47 | 88.78 | 95.28 | 74.21 | 72.28 | – | – | – | 71.49 | 57.79 | 45.12 | 75.43 | 58.49 | 48.04 |
| ✓ | ✓ | ✓ | | 96.88 | 83.51 | 88.67 | 98.29 | 74.30 | 71.82 | 84.00 | 66.47 | 57.69 | 71.53 | 57.79 | 45.19 | 75.55 | 59.19 | 48.63 |
| ✓ | ✓ | ✓ | ✓ | 96.79 | 83.32 | 88.35 | 95.18 | 73.92 | 71.74 | 83.75 | 66.06 | 56.83 | 70.55 | 57.77 | 44.84 | 73.92 | 59.18 | 47.40 |
Yao, M.; Zhang, W.; Wang, L.; Zhao, Z.; Jia, X. TriQuery: A Query-Based Model for Surgical Triplet Recognition. Sensors 2025, 25, 5306. https://doi.org/10.3390/s25175306