Hybrid Spatiotemporal Contrastive Representation Learning for Content-Based Surgical Video Retrieval
Abstract
1. Introduction
- To support surgical quality assessment, surgical teaching and training, and related applications, we propose a Content-Based Surgical Video Retrieval (CBSVR) system built on a contrastive learning framework.
- We propose a hybrid temporal embedding approach with an adaptive fusion layer to enhance spatiotemporal features from different modalities.
- We propose a supervised contrastive learning approach for surgical video representations, which extends the general contrastive loss to treat samples sharing the same label as positives, in addition to augmented versions of the anchor.
- In addition, we design frame-level self-supervised learning to strengthen visual feature learning when combined with spatiotemporal supervised contrastive learning.
- Through extensive experiments, we validate the proposed methodology on two publicly available surgical video datasets and present an ablation study on a surgical phase recognition task.
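As a minimal sketch of the supervised contrastive idea named in the contributions above, the following pure-Python function assumes the widely used supervised contrastive formulation, in which every other sample carrying the anchor's label (including its augmented versions) counts as a positive. The function name, the temperature value, and the toy embeddings are illustrative assumptions, not the exact loss or settings used in this work.

```python
import math


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Average supervised contrastive loss over anchors that have positives.

    Hypothetical helper: `temperature` and the averaging scheme are assumptions.
    """
    # L2-normalise so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled similarity between anchor i and every other sample.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log of the softmax denominator.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        # Mean negative log-probability assigned to the positives.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy clip embeddings: the first two and the last two point in similar directions.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

When the labels agree with the geometry of the embeddings (`loss_matching`), the loss is close to zero; when same-label clips lie far apart (`loss_mismatched`), the loss is large, which is the signal that pulls same-label clips together during training.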
2. Related Work
3. Background
4. Proposed Methodology
4.1. Video Representation Learning
4.1.1. Hybrid Spatiotemporal Contrastive Embedding Learning
Adaptive Fusion
Contrastive Embedding Learning
4.1.2. Training Details
4.2. Feature Extraction and Query Matching
5. Dataset
6. Experimental Results and Analysis
6.1. Analysis of Training with Different Temporal Length Video Sequences
6.2. Analysis of Different Temporal Pooling Strategies
6.3. Effectiveness of Combined Visual and Sequential Learning
6.4. Comparison with State of the Art
6.5. Ablation Study in Context of Surgical Phase Recognition
7. Discussion
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Münzer, B.; Schoeffmann, K.; Böszörmenyi, L. Content-based processing and analysis of endoscopic images and videos: A survey. Multimed. Tools Appl. 2018, 77, 1323–1362. [Google Scholar] [CrossRef] [Green Version]
- Green, J.L.; Suresh, V.; Bittar, P.; Ledbetter, L.; Mithani, S.K.; Allori, A. The Utilization of Video Technology in Surgical Education: A Systematic Review. J. Surg. Res. 2019, 235, 171–180. [Google Scholar] [CrossRef] [PubMed]
- Anh, N.X.; Nataraja, R.M.; Chauhan, S. Towards near real-time assessment of surgical skills: A comparison of feature extraction techniques. Comput. Methods Programs Biomed. 2020, 187, 105234. [Google Scholar] [CrossRef]
- Husslein, H.; Shirreff, L.; Shore, E.M.; Lefebvre, G.G.; Grantcharov, T.P. The Generic Error Rating Tool: A Novel Approach to Assessment of Performance and Surgical Education in Gynecologic Laparoscopy. J. Surg. Educ. 2015, 72, 1259–1265. [Google Scholar] [CrossRef] [PubMed]
- Ritter, E.M.; Gardner, A.K.; Dunkin, B.J.; Schultz, L.; Pryor, A.D.; Feldman, L. Video-based assessment for laparoscopic fundoplication: Initial development of a robust tool for operative performance assessment. Surg. Endosc. 2020, 34, 3176–3183. [Google Scholar] [CrossRef] [PubMed]
- van Dalen, A.S.H.M.; Legemaate, J.; Schlack, W.S.; Legemate, D.A.; Schijven, M.P. Legal perspectives on black box recording devices in the operating environment. Br. J. Surg. 2019, 106, 1433–1441. [Google Scholar] [CrossRef] [Green Version]
- Bezemer, J.; Cope, A.; Korkiakangas, T.; Kress, G.; Murtagh, G.; Weldon, S.M.; Kneebone, R. Microanalysis of video from the operating room: An underused approach to patient safety research. BMJ Qual. Saf. 2017, 7, 583–587. [Google Scholar] [CrossRef] [Green Version]
- Grenda, T.R.; Pradarelli, J.C.; Dimick, J.B. Using surgical video to improve technique and skill. Ann. Surg. 2016, 264, 32–33. [Google Scholar] [CrossRef]
- Lavanchy, J.L.; Zindel, J.; Kirtac, K.; Twick, I.; Hosgor, E.; Candinas, D.; Beldi, G. Automation of surgical skill assessment using a three-stage machine learning algorithm. Sci. Rep. 2021, 11, 5197. [Google Scholar] [CrossRef]
- Loukas, C. Video content analysis of surgical procedures. Surg. Endosc. 2018, 32, 553–568. [Google Scholar] [CrossRef]
- Blum, T.; Feußner, H.; Navab, N. Modeling and segmentation of surgical workflow from laparoscopic video. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2010; Lecture Notes in Computer Science; Jiang, T., Navab, N., Pluim, J.P.W., Viergever, M.A., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6363, pp. 400–407. [Google Scholar]
- Lalys, F.; Riffaud, L.; Bouget, D.; Jannin, P. A framework for the recognition of high-level surgical tasks from video images for cataract surgeries. IEEE Trans. Biomed. Eng. 2012, 59, 966–976. [Google Scholar] [CrossRef] [Green Version]
- Lalys, F.; Riffaud, L.; Morandi, X.; Jannin, P. Automatic phases recognition in pituitary surgeries by microscope images classification. In Information Processing in Computer-Assisted Interventions—IPCAI 2010; Lecture Notes in Computer Science; Navab, N., Jannin, P., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6135, pp. 34–44. [Google Scholar]
- Zia, A.; Sharma, Y.; Bettadapura, V.; Sarin, E.L.; Ploetz, T.; Clements, M.A.; Essa, I. Automated video-based assessment of surgical skills for training and evaluation in medical schools. Int. J. Comput. Assist. Radiol. Surg. 2016, 11, 1623–1636. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Weede, O.; Dittrich, F.; Worn, H.; Jensen, B.; Knoll, A.; Wilhelm, D.; Kranzfelder, M.; Schneider, A.; Feussner, H. Workflow analysis and surgical phase recognition in minimally invasive surgery. In Proceedings of the 2012 IEEE International Conference on Robotics and Biomimetics, ROBIO 2012—Conference Digest, Guangzhou, China, 11–14 December 2012; pp. 1068–1074. [Google Scholar]
- Sharma, Y.; Bettadapura, V.; Ploetz, T.; Hammerla, N.; Mellor, S.; McNaney, R.; Olivier, P.; Deshmukh, S.; McCaskie, A.; Essa, I. Video Based Assessment of OSATS Using Sequential Motion Textures. In Proceedings of the Fifth Workshop on Modeling and Monitoring of Computer Assisted Interventions (M2CAI), Boston, MA, USA, 14 September 2014; Forestier, G., Giannarou, S., Lin, H., Masamune, K., Speidel, S., Stauder, R., Penet, C., Eds.; Springer: Cham, Switzerland, 2014. [Google Scholar]
- Allan, M.; Ourselin, S.; Thompson, S.; Hawkes, D.J.; Kelly, J.; Stoyanov, D. Toward detection and localization of instruments in minimally invasive surgery. IEEE Trans. Biomed. Eng. 2013, 60, 1050–1058. [Google Scholar] [CrossRef]
- Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.W.M.; van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [CrossRef] [Green Version]
- Shen, D.; Wu, G.; Suk, H.-I. Deep Learning in Medical Image Analysis. Annu. Rev. Biomed. Eng. 2017, 19, 221–248. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Pandey, B.; Pandey, D.K.; Mishra, B.P.; Rhmann, W. A Comprehensive Survey of Deep Learning in the field of Medical Imaging and Medical Natural Language Processing: Challenges and research directions. J. King Saud Univ.-Comput. Inf. Sci. 2021; in press. [Google Scholar] [CrossRef]
- Blum, T.; Padoy, N.; Feußner, H.; Navab, N. Modeling and online recognition of surgical phases using hidden Markov models. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2008; Lecture Notes in Computer Science; Metaxas, D., Axel, L., Fichtinger, G., Székely, G., Eds.; Springer: Berlin/Heidelberg, Germany, 2008; Volume 5242, pp. 627–635. [Google Scholar]
- Lalys, F.; Riffaud, L.; Morandi, X.; Jannin, P. Surgical phases detection from microscope videos by combining SVM and HMM. In Medical Computer Vision. Recognition Techniques and Applications in Medical Imaging—MCV 2010; Lecture Notes in Computer Science; Menze, B., Langs, G., Tu, Z., Criminisi, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6533, pp. 54–62. [Google Scholar]
- Tao, L.; Elhamifar, E.; Khudanpur, S.; Hager, G.D.; Vidal, R. Sparse hidden Markov models for surgical gesture classification and skill evaluation. In Information Processing in Computer-Assisted Interventions—IPCAI 2012; Lecture Notes in Computer Science; Abolmaesumi, P., Joskowicz, L., Navab, N., Jannin, P., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7330, pp. 167–177. [Google Scholar]
- Charrière, K.; Quellec, G.; Lamard, M.; Martiano, D.; Cazuguel, G.; Coatrieux, G.; Cochener, B. Real-time analysis of cataract surgery videos using statistical models. Multimed. Tools Appl. 2017, 76, 22473–22491. [Google Scholar] [CrossRef] [Green Version]
- Lea, C.; Hager, G.D.; Vidal, R. An improved model for segmentation and recognition of fine-grained activities with application to surgical training tasks. In Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision WACV, Waikoloa, HI, USA, 5–9 January 2015; pp. 1123–1129. [Google Scholar]
- Zappella, L.; Béjar, B.; Hager, G.; Vidal, R. Surgical gesture classification from video and kinematic data. Med. Image Anal. 2013, 17, 732–745. [Google Scholar] [CrossRef]
- Padoy, N.; Blum, T.; Ahmadi, S.A.; Feussner, H.; Berger, M.O.; Navab, N. Statistical modeling and recognition of surgical workflow. Med. Image Anal. 2012, 16, 632–641. [Google Scholar] [CrossRef]
- Cadène, R.; Robert, T.; Thome, N.; Cord, M. M2CAI Workflow Challenge: Convolutional Neural Networks with Time Smoothing and Hidden Markov Model for Video Frames Classification. arXiv 2016, arXiv:1610.05541. [Google Scholar]
- Jalal, N.A.; Alshirbaji, T.A.; Möller, K. Evaluating convolutional neural network and hidden Markov model for recognising surgical phases in sigmoid resection. Curr. Dir. Biomed. Eng. 2018, 4, 415–418. [Google Scholar] [CrossRef]
- Twinanda, A.P.; Shehata, S.; Mutter, D.; Marescaux, J.; De Mathelin, M.; Padoy, N. EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos. IEEE Trans. Med. Imaging 2017, 36, 86–97. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Al Hajj, H.; Lamard, M.; Conze, P.H.; Cochener, B.; Quellec, G. Monitoring tool usage in surgery videos using boosted convolutional and recurrent neural networks. Med. Image Anal. 2018, 47, 203–218. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Jin, Y.; Dou, Q.; Chen, H.; Yu, L.; Qin, J.; Fu, C.W.; Heng, P.A. SV-RCNet: Workflow recognition from surgical videos using recurrent convolutional network. IEEE Trans. Med. Imaging 2018, 37, 1114–1126. [Google Scholar] [CrossRef] [PubMed]
- Jin, Y.; Li, H.; Dou, Q.; Chen, H.; Qin, J.; Fu, C.W.; Heng, P.A. Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Med. Image Anal. 2020, 59, 101572. [Google Scholar] [CrossRef] [PubMed]
- Shi, X.; Jin, Y.; Dou, Q.; Heng, P.A. LRTD: Long-range temporal dependency based active learning for surgical workflow recognition. Int. J. Comput. Assist. Radiol. Surg. 2020, 15, 1573–1584. [Google Scholar] [CrossRef] [PubMed]
- Kreuzer, D.; Munz, M. Deep Convolutional and LSTM Networks on Multi-Channel Time Series Data for Gait Phase Recognition. Sensors 2021, 21, 789. [Google Scholar] [CrossRef]
- Donahue, J.; Hendricks, L.A.; Rohrbach, M.; Venugopalan, S.; Guadarrama, S.; Saenko, K.; Darrell, T. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 677–691. [Google Scholar] [CrossRef]
- Kumar, V.; Tripathi, V.; Pant, B. Learning Compact Spatio-Temporal Features for Fast Content based Video Retrieval. Int. J. Innov. Technol. Explor. Eng. 2019, 9, 2404–2409. [Google Scholar]
- Majd, M.; Safabakhsh, R. Correlational Convolutional LSTM for human action recognition. Neurocomputing 2019, 396, 224–229. [Google Scholar] [CrossRef]
- Li, Z.; Zhang, X.; Müller, H.; Zhang, S. Large-scale retrieval for medical image analytics: A comprehensive review. Med. Image Anal. 2018, 43, 66–84. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Carlos, J.R.; Lux, M.; Giro-I-Nieto, X.; Munoz, P.; Anagnostopoulos, N. Visual information retrieval in endoscopic video archives. In Proceedings of the 2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI), Prague, Czech Republic, 10–12 June 2015; pp. 1–6. [Google Scholar]
- Beecks, C.; Schoeffmann, K.; Lux, M.; Uysal, M.S.; Seidl, T. Endoscopic Video Retrieval: A Signature-Based Approach for Linking Endoscopic Images with Video Segments. In Proceedings of the 2015 IEEE International Symposium on Multimedia (ISM), Miami, FL, USA, 14–16 December 2015; pp. 33–38. [Google Scholar]
- Schoeffmann, K.; Beecks, C.; Lux, M.; Uysal, M.S.; Seidl, T. Content-based retrieval in videos from laparoscopic surgery. In Medical Imaging 2016: Image-Guided Procedures, Robotic Interventions and Modeling; Webster, R.J., III, Yaniv, Z.R., Eds.; SPIE: Bellingham, WA, USA, 2016; Volume 9786, p. 97861V. [Google Scholar]
- André, B.; Vercauteren, T.; Buchner, A.M.; Wallace, M.B.; Ayache, N. A smart atlas for endomicroscopy using automated video retrieval. Med. Image Anal. 2011, 15, 460–476. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Beecks, C.; Kletz, S.; Schoeffmann, K. Large-Scale Endoscopic Image and Video Linking with Gradient-Based Signatures. In Proceedings of the 2017 IEEE 3rd International Conference on Multimedia Big Data (BigMM), Laguna Hills, CA, USA, 19–21 April 2017; pp. 17–21. [Google Scholar]
- Droueche, Z.; Lamard, M.; Cazuguel, G.; Quellec, G.; Roux, C.; Cochener, B. Motion-based video retrieval with application to computer-assisted retinal surgery. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS), San Diego, CA, USA, 28 August–1 September 2012; pp. 4962–4965. [Google Scholar]
- Quellec, G.; Lamard, M.; Droueche, Z.; Cochener, B.; Roux, C.; Cazuguel, G. A polynomial model of surgical gestures for real-time retrieval of surgery videos. In Medical Content-Based Retrieval for Clinical Decision Support—MCBR-CDS 2012; Lecture Notes in Computer Science; Greenspan, H., Müller, H., Syeda-Mahmood, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; Volume 7723, pp. 10–20. [Google Scholar]
- Syeda-Mahmood, T.; Ponceleon, D.; Yang, J. Validating cardiac echo diagnosis through video similarity. In Proceedings of the 13th ACM International Conference on Multimedia (MM), Singapore, 6–11 November 2005; pp. 527–530. [Google Scholar]
- Quellec, G.; Charrière, K.; Lamard, M.; Droueche, Z.; Roux, C.; Cochener, B.; Cazuguel, G. Real-time recognition of surgical tasks in eye surgery videos. Med. Image Anal. 2014, 18, 579–590. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Quellec, G.; Lamard, M.; Cazuguel, G.; Droueche, Z.; Roux, C.; Cochener, B. Real-time retrieval of similar videos with application to computer-aided retinal surgery. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS), Boston, MA, USA, 30 August–3 September 2011; pp. 4465–4468. [Google Scholar]
- Droueche, Z.; Lamard, M.; Cazuguel, G.; Quellec, G.; Roux, C.; Cochener, B. Content-based medical video retrieval based on region motion trajectories. In Proceedings of the 5th European Conference of the International Federation for Medical and Biological Engineering, Budapest, Hungary, 14–18 September 2011; Jobbágy, Á., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; Volume 37, pp. 622–625. [Google Scholar]
- Muenzer, B.; Primus, M.J.; Kletz, S.; Petscharnig, S.; Schoeffmann, K. Static vs. Dynamic Content Descriptors for Video Retrieval in Laparoscopy. In Proceedings of the 2017 IEEE International Symposium on Multimedia (ISM), Taichung, Taiwan, 11–13 December 2017; pp. 216–223. [Google Scholar]
- Kletz, S.; Schoeffmann, K.; Munzer, B.; Primus, M.J.; Husslein, H. Surgical action retrieval for assisting video review of laparoscopic skills. In Proceedings of the MultiEdTech 2017—Proceedings of the 2017 ACM Workshop on Multimedia-Based Educational and Knowledge Technologies for Personalized and Social Online Training, Co-Located with MM 2017, Mountain View, CA, USA, 27 October 2017; pp. 11–19. [Google Scholar]
- Amanat, S.; Idrees, M.; Khan, M.U.G.; Rehman, Z.; Chang, H.; Mehmood, I.; Baik, S.W. Video retrieval system for meniscal surgery to improve health care services. J. Sens. 2018, 2018, 4390703. [Google Scholar] [CrossRef] [Green Version]
- Schoeffmann, K.; Husslein, H.; Kletz, S.; Petscharnig, S.; Muenzer, B.; Beecks, C. Video retrieval in laparoscopic video recordings with dynamic content descriptors. Multimed. Tools Appl. 2018, 77, 16813–16832. [Google Scholar] [CrossRef] [Green Version]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27; Curran Associates, Inc.: Red Hook, NY, USA, 2014; pp. 3104–3112.
- Lea, C.; Flynn, M.D.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal convolutional networks for action segmentation and detection. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1003–1012. [Google Scholar] [CrossRef] [Green Version]
- Czempiel, T.; Paschali, M.; Keicher, M.; Simson, W.; Feussner, H.; Kim, S.T.; Navab, N. TeCNO: Surgical Phase Recognition with Multi-stage Temporal Convolutional Networks. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2020; Lecture Notes in Computer Science; Martel, A.L., Abolmaesumi, P., Stoyanov, D., Mateus, D., Zuluaga, M.A., Zhou, K.S., Racoceanu, D., Joskowicz, L., Eds.; Springer: Cham, Switzerland, 2020; Volume 12263, pp. 343–352. [Google Scholar] [CrossRef]
- Ramesh, S.; Dall’Alba, D.; Gonzalez, C.; Yu, T.; Mascagni, P.; Mutter, D.; Padoy, N. Multi-task temporal convolutional networks for joint recognition of surgical phases and steps in gastric bypass procedures. Int. J. Comput. Assist. Radiol. Surg. 2021, 16, 1111–1119. [Google Scholar] [CrossRef]
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning (ICML), Vienna, Austria, 12–18 July 2020; pp. 1597–1607. [Google Scholar]
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef] [Green Version]
- Saxe, A.M.; McClelland, J.L.; Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Kingma, D.P.; Ba, J.L. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Schoeffmann, K.; Taschwer, M.; Sarny, S.; Münzer, B.; Primus, M.J.; Putzgruber, D. Cataract-101—Video dataset of 101 cataract surgeries. In Proceedings of the 9th ACM Multimedia Systems Conference (MMSys), Amsterdam, The Netherlands, 12–15 June 2018; pp. 421–425. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25; Curran Associates, Inc.: Red Hook, NY, USA, 2012; pp. 1097–1105. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
- Twinanda, A.P. Vision-Based Approaches for Surgical Activity Recognition Using Laparoscopic and RGBD Videos. Ph.D. Thesis, Université de Strasbourg, Strasbourg, France, 2017. [Google Scholar]
Surgery Phase/Label | #Videos (Total) | #Videos (Training) | #Videos (Testing) |
---|---|---|---|
Incision | 104 | 52 | 52 |
Viscous agent injection | 232 | 116 | 116 |
Rhexis | 104 | 52 | 52 |
Hydrodissection | 101 | 51 | 50 |
Phacoemulsification | 104 | 52 | 52 |
Irrigation and aspiration | 120 | 60 | 60 |
Capsule polishing | 110 | 55 | 55 |
Lens implant setting-up | 105 | 53 | 52 |
Viscous agent removal | 106 | 53 | 53 |
Tonifying and antibiotics | 179 | 90 | 89 |
TOTAL | 1265 | 634 | 631 |
Dataset | Temporal Factor 8 | Temporal Factor 16 | Temporal Factor 24 | Temporal Factor 32 |
---|---|---|---|---|
Surgical Actions 160 | 27.11 ± 1.21 | 27.89 ± 1.12 | 30.012 ± 1.778 | 30.09 ± 1.14 |
Cataract-101 | 78.43 ± 0.673 | 79.25 ± 0.556 | 81.13 ± 1.28 | 81.11 ± 1.08 |
Dataset | Max Pool | Mean Pool | Adaptive Fusion |
---|---|---|---|
Surgical Actions 160 | 27.64 ± 1.17 | 28.63 ± 1.76 | 30.012 ± 1.778 |
Cataract-101 | 76.71 ± 1.21 | 79.72 ± 0.28 | 81.13 ± 1.28 |
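To make the two baseline columns of the pooling comparison concrete, the sketch below shows temporal max pooling and mean pooling over a sequence of per-frame feature vectors. It is only an illustration on made-up numbers; the function names are hypothetical, and the adaptive fusion layer is not sketched because its definition is not given in this section.

```python
def max_pool(frame_features):
    """Element-wise maximum over the time axis (one vector per frame)."""
    return [max(column) for column in zip(*frame_features)]


def mean_pool(frame_features):
    """Element-wise average over the time axis (one vector per frame)."""
    return [sum(column) / len(column) for column in zip(*frame_features)]


# Two frames with two feature dimensions each (illustrative values only).
toy_frames = [[1.0, 4.0], [3.0, 2.0]]
pooled_max = max_pool(toy_frames)    # strongest response per dimension
pooled_mean = mean_pool(toy_frames)  # average response per dimension
```

Both operations collapse a variable-length clip into a single fixed-length descriptor, which is what allows clips of different durations to be compared at query time.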
Split | RDCN−TCN−CLS | RDCN−TCN−CL | ||||
---|---|---|---|---|---|---|
1 | 14.71 | 25.59 | 18.52 | 18.81 | 14.78 | 28.62 |
2 | 19.35 | 30.47 | 20.87 | 21.42 | 18.14 | 32.34 |
3 | 14.97 | 25.14 | 19.14 | 19.25 | 14.21 | 28.67 |
4 | 14.36 | 26.37 | 18.12 | 18.42 | 13.98 | 27.78 |
5 | 17.42 | 28.16 | 22.42 | 22.51 | 17.88 | 30.71 |
6 | 15.65 | 26.17 | 21.13 | 22.24 | 15.65 | 29.42 |
7 | 16.75 | 29.18 | 22.21 | 23.11 | 17.23 | 31.58 |
8 | 14.41 | 26.09 | 20.17 | 21.12 | 14.34 | 27.48 |
9 | 18.27 | 26.74 | 21.36 | 22.46 | 19.14 | 32.67 |
10 | 17.24 | 26.11 | 21.54 | 21.82 | 17.21 | 30.85 |
Mean ± std. | 16.31 ± 1.660 | 27.002 ± 1.623 | 20.548 ± 1.431 | 21.12 ± 1.601 | 16.256 ± 1.787 | 30.012 ± 1.778 |
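The "Mean ± std." row can be reproduced from the per-split values. The short check below uses the last column of the table above and suggests that the reported spread is the population standard deviation (dividing by the number of splits) rather than the sample standard deviation; this reading is inferred from the numbers, not stated in the table.

```python
from statistics import mean, pstdev, stdev

# Per-split scores copied from the last column of the table above (splits 1 to 10).
last_column = [28.62, 32.34, 28.67, 27.78, 30.71, 29.42, 31.58, 27.48, 32.67, 30.85]

column_mean = mean(last_column)
population_std = pstdev(last_column)  # divides by N; agrees with the reported 1.778
sample_std = stdev(last_column)       # divides by N - 1; gives a larger value

summary = f"{column_mean:.3f} ± {population_std:.3f}"
```

The same check can be applied to the other columns by swapping in their per-split values.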
Split | RDCN−TCN−CLS | RDCN−TCN−CL | ||||
---|---|---|---|---|---|---|
1 | 46.48 | 73.75 | 53.89 | 54.32 | 55.26 | 81.35 |
2 | 46.25 | 74.66 | 54.25 | 55.29 | 57.45 | 79.25 |
3 | 42.56 | 73.02 | 53.83 | 54.96 | 50.96 | 82.73 |
4 | 48.68 | 74.67 | 54.13 | 56.01 | 52.90 | 80.14 |
5 | 51.35 | 75.81 | 54.37 | 56.15 | 51.96 | 82.2 |
Mean ± std. | 47.06 ± 2.91 | 74.38 ± 0.9438 | 54.27 ± 0.327 | 55.35 ± 0.677 | 53.70 ± 2.352 | 81.134 ± 1.28 |
Feature Descriptor | Dim. | mAP, Surgical Actions 160 (Mean ± std.) | mAP, Cataract-101 (Mean ± std.) |
---|---|---|---|
CNNA [66] | 4096 | 20.011 ± 1.531 | 20.44 ± 1.253 |
CNNG [67] | 1024 | 21.82 ± 1.064 | 22.16 ± 1.145 |
CNNR [55] | 512 | 22.67 ± 1.192 | 23.18 ± 1.212 |
FS [41] | 630 | 19.46 ± 1.631 | 20.11 ± 1.264 |
DFS [54] | 810 | 20.75 ± 1.253 | 25.18 ± 1.152 |
MIDD [54] | 25 | 22.54 ± 1.557 | 33.18 ± 1.311 |
RDCN-TCN-CE | 512 | 28.533 ± 1.142 | 75.27 ± 1.35 |
RDCN-TCN-Triplet | 512 | 28.741 ± 1.334 | 77.36 ± 1.18 |
RDCN-TCN-CL | 512 | 30.012 ± 1.778 | 81.134 ± 1.28 |
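The retrieval comparison above is reported as mean average precision (mAP). As a minimal sketch, assuming the standard definition in which each query's ranked result list is marked relevant or not relevant and precision is averaged at the ranks of the relevant items, mAP can be computed as follows. The function names and the toy rankings are hypothetical, and the evaluation protocol actually used (for example, how many results are ranked per query) is not specified in this section.

```python
def average_precision(ranked_relevance):
    """AP for one query; `ranked_relevance` holds 1 (relevant) or 0, best match first."""
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this relevant rank
    # Assumption: the ranking contains every relevant item for the query.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two toy queries: relevant clips at ranks 1 and 3, and a single relevant clip at rank 2.
toy_rankings = [[1, 0, 1, 0], [0, 1]]
toy_map = mean_average_precision(toy_rankings)
```

Because precision is averaged at the positions of the relevant items, mAP rewards systems that place same-label clips near the top of the ranked list, not merely somewhere in it.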
Methods | Accuracy | Precision | Recall |
---|---|---|---|
PhaseNet [30] | 78.8 ± 4.7 | 71.3 ± 15.6 | 76.6 ± 16.6 |
EndoNet * [30] | 81.7 ± 4.2 | 73.70 ± 16.1 | 79.60 ± 7.9 |
(EndoNet + LSTM) * [68] | 88.6 ± 9.6 | 84.4 ± 7.9 | 84.7 ± 7.9 |
SV-RCNet [32] | 85.3 ± 7.3 | 80.7 ± 7.0 | 83.5 ± 7.5 |
MTRCNet * [33] | 89.2 ± 7.6 | 86.9 ± 4.3 | 88.0 ± 6.9 |
TeCNO [59] | 88.6 ± 7.8 | 86.5 ± 7.0 | 87.6 ± 6.7 |
NL-RCNet [34] | 85.73 ± 6.96 | 82.94 ± 6.20 | 85.04 ± 5.15 |
RDCN-TCN-CE (R18) | 82.13 ± 6.7 | 75.14 ± 9.1 | 79.21 ± 7.1 |
RDCN-TCN-Triplet (R18) | 83.04 ± 7.1 | 77.65 ± 10.5 | 80.43 ± 7.4 |
RDCN-TCN-CL (R18) | 83.86 ± 6.67 | 78.89 ± 7.3 | 81.15 ± 7.2 |
RDCN-TCN-CL (R50) | 90.2 ± 6.93 | 87.52 ± 6.88 | 85.65 ± 6.91 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kumar, V.; Tripathi, V.; Pant, B.; Alshamrani, S.S.; Dumka, A.; Gehlot, A.; Singh, R.; Rashid, M.; Alshehri, A.; AlGhamdi, A.S. Hybrid Spatiotemporal Contrastive Representation Learning for Content-Based Surgical Video Retrieval. Electronics 2022, 11, 1353. https://doi.org/10.3390/electronics11091353