Evaluating the Reliability and Readability of AI Chatbot Responses for Microtia Patient Education

Supriya Dadi; Taylor Kring; Kyle Latz; David Cohen; Seth Thaller

doi:10.1097/SCS.0000000000011988

Journal article

Peer reviewed

The Journal of craniofacial surgery, Vol.37(3/4)

2025-10-02

PMID: 41037793

Appears in Miller School of Medicine - Latest Publications

Abstract

Microtia is a congenital deformity of the external ear that ranges from mild underdevelopment to complete absence. It is often unilateral and causes visible facial asymmetry, which leads to psychosocial distress for patients and families. Caregivers report feeling guilty and anxious, while patients experience increased rates of depression and social challenges. During this difficult period, patients and their families often turn to AI chatbots for guidance before and after receiving definitive surgical care. This study evaluates the quality and readability of responses from leading AI chatbots to patient-centered questions about the condition. Popular AI chatbots (ChatGPT 4o, Google Gemini, DeepSeek, and OpenEvidence) were asked 25 queries about microtia developed from the FAQ sections of hospital websites. Responses were evaluated using modified DISCERN criteria for quality and SMOG scoring for readability. ANOVA and post hoc analyses were performed to identify significant differences. Google Gemini achieved the highest DISCERN score (M=37.16, SD=2.58), followed by OpenEvidence (M=32.19, SD=3.54). DeepSeek (M=30.76, SD=4.29) and ChatGPT (M=30.32, SD=2.97) had the lowest DISCERN scores. OpenEvidence had the worst readability (M=18.06, SD=1.12), followed by ChatGPT (M=16.32, SD=1.41). DeepSeek was the most readable (M=14.63, SD=1.60), closely followed by Google Gemini (M=14.73, SD=1.27). Overall, the average DISCERN and SMOG scores across all platforms were 32.19 (SD=4.43) and 15.93 (SD=1.94), respectively, indicating good quality and an undergraduate reading level. None of the platforms consistently met both quality and readability standards, though Google Gemini performed relatively well. As reliance on AI for early health information grows, ensuring the accessibility of chatbot responses will be crucial for supporting informed decision-making and enhancing the patient experience.

Keywords: microtia; information quality; readability; health communication; patient education; chatbots; artificial intelligence
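
For readers unfamiliar with SMOG scoring, the sketch below shows how a grade is commonly computed from a count of polysyllabic words (three or more syllables) and a count of sentences, using McLaughlin's published formula. It is an illustration only: the abstract does not state which SMOG implementation or tool the authors used, and the function name `smog_grade` and its parameters are assumptions introduced here.

```python
import math


def smog_grade(polysyllable_count: int, sentence_count: int) -> float:
    """Return the SMOG reading grade for a text sample.

    Hypothetical helper, not taken from the study. It applies the
    standard SMOG formula, which normalizes the polysyllabic-word
    count to a 30-sentence sample:

        grade = 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291
    """
    if sentence_count <= 0:
        raise ValueError("sentence_count must be positive")
    if polysyllable_count < 0:
        raise ValueError("polysyllable_count must be non-negative")
    normalized = polysyllable_count * 30 / sentence_count
    return 1.0430 * math.sqrt(normalized) + 3.1291
```

Under this formula, a grade of about 16, close to the overall average reported above, corresponds to roughly 150 polysyllabic words per 30 sentences. Counting syllables reliably is the harder part in practice and is left out of this sketch.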

Details

Title: Evaluating the Reliability and Readability of AI Chatbot Responses for Microtia Patient Education
Creators: Supriya Dadi - University of Miami
Taylor Kring - University of Miami
Kyle Latz - University of Miami
David Cohen - University of Miami
Seth Thaller - University of Miami
Publication Details: The Journal of craniofacial surgery, Vol.37(3/4)
Publisher: LIPPINCOTT WILLIAMS & WILKINS; PHILADELPHIA
Number of pages: 4
Academic Unit: Miller School of Medicine; UMMG Department of Surgery; UMMG Dept of Surgery - Div of Plastic & Reconstructive Surgery
Language: English
Resource Type: Journal article
PMID: 41037793
Record Identifier: 991032825976502976