SurgXBench: Explainable Vision-Language Model Benchmark for Surgery

Jiajun Cheng; Xianwu Zhao; Sainan Liu; Xiaofan Yu; Ravi Prakash; Patrick J. Codd; Jonathan Elliott Katz; Shan Lin

doi:10.1109/WACV61042.2026.00790

Conference proceeding

Proceedings / IEEE Workshop on Applications of Computer Vision, pp.8188-8198

2026-03-06

Keywords: benchmarks; Circuits; explainable ai; Feedback; Integrated circuits; Location awareness; Low earth orbit satellites; Mobile communication; Pixel; Product development; surgical instrument and action classification; Video equipment; Videos; vlm

Abstract

Innovations in digital intelligence are transforming robotic surgery through more informed decision-making. Real-time awareness of surgical instrument presence and actions (e.g., cutting tissue) is essential, yet despite decades of research, most machine learning models rely on small datasets and still struggle to generalize. Recently, Vision-Language Models (VLMs) have achieved transformative advances in multimodal reasoning, suggesting strong potential for intelligent robotic surgery. However, surgical VLMs remain underexplored, and existing models show limited performance, underscoring the need for systematic benchmarks to assess their capabilities, limitations, and future development. To this end, we benchmark the zero-shot performance of several advanced VLMs on two public robotic-assisted laparoscopic datasets for instrument and action classification. Beyond standard evaluation, we integrate explainable AI to visualize VLM attention and uncover causal explanations behind predictions, providing a previously underexplored perspective for assessing model reliability. We also propose explainability-based metrics to complement standard evaluations. Our analysis reveals that surgical VLMs, despite domain-specific training, often rely on weak contextual cues rather than clinically meaningful visual evidence, highlighting the need for stronger visual and reasoning supervision in surgical applications. The code is provided in our public repository at: https://github.com/jiajun344/SurgXBench-Explainable-Vision-Language-Model-Benchmark-for-Surgery.

Details

Title: SurgXBench: Explainable Vision-Language Model Benchmark for Surgery
Creators: Jiajun Cheng - Arizona State University
Xianwu Zhao - Arizona State University
Sainan Liu - Intel (United States)
Xiaofan Yu - Arizona State University
Ravi Prakash - Duke University
Patrick J. Codd - Duke University
Jonathan Elliott Katz - University of Miami
Shan Lin - Arizona State University
Publication Details: Proceedings / IEEE Workshop on Applications of Computer Vision, pp.8188-8198
Publisher: IEEE
Academic Unit: Miller School of Medicine; UMMG Department of Urology
Language: English
Resource Type: Conference proceeding
Record Identifier: 991033077253302976