Understanding Surgical Triplet Videos Through Transferable Visual Models from Natural Language Supervision / Li, Yunhao; Wang, Aoying; Xie, Yu-Xi; Wang, Qiong; Ye, Xiucai; Savi, Patrizia; Ghu, Yin; Yanpang,. - ELECTRONIC. - (2026), pp. 560-573. (Pattern Recognition and Computer Vision, 8th Chinese Conference, Shanghai, China, October 15-18, 2025) [10.1007/978-981-95-5764-6_38].
Understanding Surgical Triplet Videos Through Transferable Visual Models from Natural Language Supervision
Patrizia Savi
2026
Abstract
Surgical triplet video understanding is essential for accurately recognizing and classifying surgical actions in real-time video data, enabling improved surgical planning and support. However, the recognition of surgical video triplets is particularly affected by the varying frequency and duration of specific actions. Furthermore, the similarity of motion trajectories across different surgical actions poses another challenge for video-based fine-grained surgical action recognition, complicating the differentiation of similar actions. To address these challenges, we propose a new comprehensive paired image-text surgical activity event dataset (SAE), consisting of 90,500 pairs of images and text depicting surgical actions. Additionally, we introduce TriClip, a novel dual-branch contrastive multimodal framework, which effectively bridges the gap between the visual and textual modalities in surgical action recognition. By leveraging transferable visual models from natural language supervision, TriClip was evaluated on the CholecT45 dataset, where it achieved an average precision of 42.1%, setting a new state-of-the-art in surgical action recognition.

| File | Size | Format |
|---|---|---|
| 2026-Understanding Surgical Triplet p560-573.pdf (restricted access) | 7.11 MB | Adobe PDF |

Description: Understanding Surgical Triplet ... pp. 560-573
Type: 2a Post-print editorial version / Version of Record
License: Non-public - Private/restricted access
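The abstract describes TriClip as a dual-branch contrastive multimodal framework that aligns visual and textual embeddings. As a rough illustration only (not the paper's implementation; all names, shapes, and the temperature value below are assumptions), a standard CLIP-style symmetric contrastive objective over paired image and text embeddings can be sketched as:

```python
# Illustrative sketch of a dual-branch contrastive (CLIP-style) objective.
# Assumption: TriClip-like training uses a symmetric InfoNCE loss over
# image-text pairs; embedding sizes and temperature here are arbitrary.
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Matching image/text pairs sit on the diagonal; the loss pulls them
    together and pushes mismatched pairs apart, in both directions.
    """
    img = l2_normalize(np.asarray(img_emb, dtype=float))
    txt = l2_normalize(np.asarray(txt_emb, dtype=float))
    logits = img @ txt.T / temperature      # (N, N) cosine similarities
    labels = np.arange(len(logits))         # true pairs on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return float(-logp[labels, labels].mean())

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))                       # 4 toy image embeddings
loss_matched = symmetric_contrastive_loss(img, img.copy())
loss_shuffled = symmetric_contrastive_loss(img, img[::-1].copy())
```

With identical (perfectly aligned) branches the loss is near zero, while misaligned pairs drive it up, which is the signal a contrastive image-text model trains on.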
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/3007767
