
Understanding Surgical Triplet Videos Through Transferable Visual Models from Natural Language Supervision / Li, Yunhao; Wang, Aoying; Xie, Yu-Xi; Wang, Qiong; Ye, Xiucai; Savi, Patrizia; Ghu, Yin; Yanpang. - ELECTRONIC. - (2026), pp. 560-573. (Pattern Recognition and Computer Vision: 8th Chinese Conference, Shanghai, China, October 15-18, 2025) [10.1007/978-981-95-5764-6_38].

Understanding Surgical Triplet Videos Through Transferable Visual Models from Natural Language Supervision

Patrizia Savi;
2026

Abstract

Surgical triplet video understanding is essential for accurately recognizing and classifying surgical actions in real-time video data, enabling improved surgical planning and support. However, surgical triplet recognition is strongly affected by the varying frequency and duration of specific actions. Furthermore, the similarity of motion trajectories across different surgical actions complicates video-based fine-grained surgical action recognition, making similar actions hard to differentiate. To address these challenges, we propose a new comprehensive paired image-text surgical activity event dataset (SAE), consisting of 90,500 image-text pairs depicting surgical actions. Additionally, we introduce TriClip, a novel dual-branch contrastive multimodal framework that effectively bridges the gap between the visual and textual modalities in surgical action recognition. By leveraging transferable visual models learned from natural language supervision, TriClip was evaluated on the CholecT45 dataset, where it achieved an average precision of 42.1%, setting a new state of the art in surgical action recognition.
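The dual-branch contrastive image-text training the abstract refers to can be sketched as a CLIP-style symmetric InfoNCE objective. This is a minimal illustrative sketch only: the function name `clip_style_loss`, the temperature value, and the use of plain NumPy are assumptions, not TriClip's actual implementation.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired
    image/text embeddings, in the style of CLIP. Hypothetical sketch;
    TriClip's exact branches and loss are not specified here."""
    # L2-normalize so the dot products below are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) similarity matrix
    labels = np.arange(len(logits))           # matching pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

With perfectly aligned, mutually orthogonal pairs the loss approaches zero; mismatched pairs push it up, which is what drives the two branches toward a shared embedding space.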
978-981-95-5764-6
Files in this record:
File  Size  Format
2026-Understanding Surgical Triplet p560-573.pdf

Restricted access

Description: Understanding Surgical Triplet ... pp. 560-573
Type: 2a Post-print / Version of Record
License: Non-public - Private/restricted access
Size: 7.11 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3007767