Surgical Video Understanding with Alignment-Preserving Temporal Adaptation and Action Triplet Text Alignment

Ikeido, Taiyo; Togo, Ren; Ogawa, Takahiro; Sugiyama, Taku; Poudel, Saseem; Sugimori, Hiroyuki; Tang, Minghui; Han, Feng; Koyano, Hidenori; Hirata, Kenji; Kudo, Kohsuke; Haseyama, Miki

doi:10.3390/bioengineering13060640

This is an early access version; the complete PDF, HTML, and XML versions will be available soon.

Open Access Article

by Taiyo Ikeido ¹, Ren Togo ², Takahiro Ogawa ², Taku Sugiyama ³, Saseem Poudel ⁴, Hiroyuki Sugimori ⁵, Minghui Tang ⁶, Feng Han ⁷, Hidenori Koyano ⁸, Kenji Hirata ⁶, Kohsuke Kudo ⁶ and Miki Haseyama ²,*

¹ Graduate School of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo 060-0814, Hokkaido, Japan

² Faculty of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo 060-0814, Hokkaido, Japan

³ Department of Neurosurgery, Graduate School of Medicine, Hokkaido University, Kita 15, Nishi 7, Kita-ku, Sapporo 060-8638, Hokkaido, Japan

⁴ Department of Gastroenterological Surgery II, Graduate School of Medicine, Hokkaido University, Kita 15, Nishi 7, Kita-ku, Sapporo 060-8638, Hokkaido, Japan

⁵ Faculty of Health Sciences, Hokkaido University, Kita 12, Nishi 5, Kita-ku, Sapporo 060-0812, Hokkaido, Japan

⁶ Department of Diagnostic Imaging, Graduate School of Medicine, Hokkaido University, Kita 15, Nishi 7, Kita-ku, Sapporo 060-8638, Hokkaido, Japan

⁷ Division of AI Support for Medical Research, Faculty of Medicine, Hokkaido University, Kita 15, Nishi 7, Kita-ku, Sapporo 060-8638, Hokkaido, Japan

⁸ Technical Support Center, Graduate School of Medicine, Hokkaido University, Kita 15, Nishi 7, Kita-ku, Sapporo 060-8638, Hokkaido, Japan

* Author to whom correspondence should be addressed.

Bioengineering 2026, 13(6), 640; https://doi.org/10.3390/bioengineering13060640 (registering DOI)

Submission received: 4 April 2026 / Revised: 23 May 2026 / Accepted: 24 May 2026 / Published: 29 May 2026

(This article belongs to the Section Biosignal Processing)

Abstract

Surgical workflow understanding requires recognizing procedural phases and fine-grained activities from long-horizon videos, yet acquiring dense annotations for surgical video analysis is costly and requires medical expertise. To address this challenge, we present a text-guided and annotation-efficient framework for surgical video understanding based on a frozen surgical vision–language-pretrained (VLP) encoder and a lightweight temporal adapter. The frozen SurgVLP image encoder provides frame-level visual embeddings, and the temporal adapter aggregates them into clip-level representations while preserving compatibility with the pretrained visual–text embedding space. We evaluate the proposed framework on CholecT50 using text-guided prototype matching for phase recognition and few-shot triplet recognition. Experiments show that temporal adaptation improves phase recognition while preserving the pretrained SurgVLP embedding space. In particular, among the evaluated methods, the proposed Text Contrastive method with rich phase prompts achieves the highest phase recognition performance, outperforming the phase-only baseline. Furthermore, the proposed framework enables classifier-free few-shot triplet recognition in the frozen text space without training a dedicated triplet classifier. These results suggest that effective surgical video understanding under limited annotation depends not only on temporal adaptation but also on preserving alignment with the pretrained text space and using semantically informative text prompts.

Keywords: surgical video understanding; vision–language pretraining; temporal adaptation; text prototype matching; few-shot recognition; CholecT50
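The text-guided prototype matching mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the article: the toy three-dimensional vectors stand in for SurgVLP frame and text embeddings, mean pooling stands in for the temporal adapter (whose actual design is not described on this page), and the labels `phase_a` and `phase_b` are hypothetical placeholders rather than CholecT50 phase names.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def mean_pool(frames: Sequence[Sequence[float]]) -> Vector:
    """Average frame-level embeddings into one clip-level embedding.

    Assumption: this is only a stand-in for a learned temporal adapter.
    """
    if not frames:
        raise ValueError("clip must contain at least one frame")
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def match_prototype(clip: Sequence[float],
                    text_prototypes: Dict[str, Sequence[float]]) -> str:
    """Return the label whose text prototype is most similar to the clip."""
    scores = {label: cosine(clip, proto)
              for label, proto in text_prototypes.items()}
    return max(scores, key=scores.get)


# Toy data (hypothetical): three frame embeddings and two text prototypes.
frames = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.0]]
prototypes = {"phase_a": [1.0, 0.0, 0.0], "phase_b": [0.0, 1.0, 0.0]}

predicted = match_prototype(mean_pool(frames), prototypes)
print(predicted)
```

Because classification reduces to a nearest-prototype lookup in the shared embedding space, no dedicated classifier head is trained in this sketch; adding a class only requires adding a text prototype. This also shows why an adapter that drifts away from the text space would degrade matching: the similarity scores are meaningful only while clip and text embeddings remain comparable.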
