Abstract
Purpose
The recent surge in artificial intelligence (AI)-related technologies presents an opportunity for revolutionary advances in traditional methods of medical education. ChatGPT is one such AI application, accepting free-form text as input and generating a human-like response. This study sought to evaluate ChatGPT’s performance on a simulated surgery shelf exam and to assess its potential as a learning tool for medical students.
Methods
Two 50-question tests were randomly selected from the National Board of Medical Examiners (NBME) practice surgery shelf exams. Each question was presented sequentially to ChatGPT (Generative Pre-trained Transformer 4o, September 2024); questions containing images were excluded. Responses were recorded, and a board-certified general surgeon evaluated each justification. Each justification was graded as having no errors, minor errors that do not significantly impact understanding of the topic, or major errors that significantly impact understanding of the topic.
Results
ChatGPT answered 96.6% of questions correctly. Minor errors were present in 9.2% of all responses, and major errors in 9.2%. Among correctly answered questions, 9.5% of justifications contained minor errors and 6.0% contained major errors. All major errors involved incorrect information presented as correct.
Conclusion
ChatGPT demonstrates high accuracy in answering multiple-choice questions of the type medical students encounter on a surgery shelf exam. However, caution is warranted when using ChatGPT as an adjunct to traditional education methods. Because 15.5% of ChatGPT’s correct responses contained errors, often confidently asserting false information, students who are unaware of this limitation risk learning incorrect information.