Abstract
Purpose
The recent surge in artificial intelligence (AI)-related technologies presents an opportunity for revolutionary advances in traditional methods of medical education. ChatGPT is one such AI application, accepting free-form text as input and generating a human-like response. This study sought to evaluate ChatGPT’s performance on a simulated surgery shelf exam and to assess its potential as a learning tool for medical students.
Methods
Two 50-question tests were randomly selected from the National Board of Medical Examiners (NBME) practice surgery shelf exams. Each question was presented sequentially to ChatGPT (Generative Pre-trained Transformer 4o, September 2024); questions containing images were excluded. Responses were recorded, and a board-certified general surgeon evaluated each justification. Each justification was graded as having no errors, minor errors that do not significantly impact understanding of the topic, or major errors that significantly impact understanding of the topic.
Results
ChatGPT answered 96.6% of questions correctly. Minor errors were present in 9.2% of all responses, and major errors in 9.2%. Among correctly answered questions, 9.5% of justifications contained minor errors and 6.0% contained major errors. All major errors involved incorrect information presented as correct.
Conclusion
ChatGPT demonstrates high accuracy in answering multiple-choice questions of the type medical students encounter on a surgery shelf exam. However, caution is warranted when using ChatGPT as an adjunct to traditional education methods. Because 15.5% of ChatGPT’s correct responses contained errors, often confidently asserting false information, students who are unaware of this limitation risk learning incorrect information.