Abstract
This study aimed to evaluate the diagnostic accuracy, comprehensiveness, and clinical relevance of two advanced artificial intelligence (AI) models, OpenAI's ChatGPT-4.0 and DeepSeek-R1, in the field of otolaryngology.
Five common otolaryngology procedures (adenotonsillectomy, tympanoplasty, endoscopic sinus surgery, parotidectomy, and total laryngectomy) were analyzed through standardized queries posed to both AI models. Because the prompts replicate questions that patients typically search online, our evaluation focuses on patient-facing informational adequacy. Responses were independently rated by two members of the study team for accuracy, clinical relevance, and comprehensiveness, with discrepancies resolved through consensus, and were also compared against established clinical guidelines.
ChatGPT-4.0 generally provided detailed procedural information, effectively covering indications, operative techniques, risks, and recovery. However, it occasionally recommended excessive diagnostic imaging and omitted subtle but clinically significant surgical nuances. DeepSeek-R1 delivered concise, well-structured responses that clearly categorized indications, treatment alternatives, and procedural risks; however, it frequently lacked depth, omitting important surgical techniques and minor complications. For instance, DeepSeek-R1 did not mention hemostatic techniques in adenotonsillectomy or graft stabilization in tympanoplasty. Neither model adequately addressed critical elements such as comprehensive staging, detailed surgical planning, and long-term recovery, particularly for complex procedures such as total laryngectomy.
Both ChatGPT-4.0 and DeepSeek-R1 demonstrated substantial diagnostic potential but showed limitations in precision, comprehensiveness, and nuanced clinical reasoning. Their clinical utility therefore remains limited, underscoring a continued need for refinement before these models can reliably support patient-specific decision-making in otolaryngology.