Accurate documentation is critical in ophthalmology, yet clinical notes often contain subtle errors that can affect decision-making. This study prospectively compared contemporary large language models (LLMs) for detecting clinically salient errors in emergency ophthalmology encounter notes and generating actionable corrections. A total of 129 de-identified notes, each seeded with a predefined target error, were independently audited, using a standardized prompt, by four LLMs: o3 (OpenAI, closed-source), DeepSeek-v3-r1 (DeepSeek, open-source), MedGemma-27B (Google, open-source), and GPT-4o (OpenAI, closed-source). Two masked ophthalmologists graded error localization, the relevance of additional issues flagged, and overall recommendation quality, with within-case analyses using appropriate nonparametric tests. Performance varied significantly across models (Cochran’s Q = 71.13, p = 2.44 × 10⁻¹⁵). o3 achieved the highest error localization accuracy at 95.7% (95% CI, 89.5–98.8), followed by DeepSeek-v3-r1 (90.3%), MedGemma-27B (80.9%), and GPT-4o (53.2%). Ordinal outcomes similarly favored o3 and DeepSeek-v3-r1 (both p < 10⁻⁹ vs. GPT-4o), with mean recommendation quality scores of 3.35, 3.05, 2.54, and 2.11, respectively. These findings demonstrate that LLMs can serve as an accurate “second pair of eyes” for ophthalmology documentation. A proprietary model led on all metrics, while a strong open-source alternative approached its performance, offering potential for privacy-preserving on-premise deployment. Clinical translation will require oversight, workflow integration, and careful attention to ethical considerations.
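The abstract does not reproduce the standardized audit prompt, so the sketch below is only an illustration of how a per-note audit of this kind might be issued to one of the closed-source comparators through the OpenAI chat-completions API; the prompt wording, note text, and variable names are all assumptions, not the study's materials.

```python
# Illustrative sketch only: the study's actual standardized prompt and output format
# are not given in the abstract, so AUDIT_PROMPT and note_text below are invented.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

AUDIT_PROMPT = (
    "You are auditing an emergency ophthalmology encounter note. "
    "Identify any clinically salient error, quote the offending text, "
    "explain why it is wrong, and propose a corrected wording."
)

# Toy note containing an internal laterality contradiction (OD vs. left eye).
note_text = "EXAM: IOP 18 mmHg OU. PLAN: start timolol 0.5% BID OD for the LEFT eye."

response = client.chat.completions.create(
    model="gpt-4o",  # one of the closed-source models compared; o3 would be called analogously
    messages=[
        {"role": "system", "content": AUDIT_PROMPT},
        {"role": "user", "content": note_text},
    ],
    temperature=0,  # deterministic output is typically preferred for audit-style tasks
)
print(response.choices[0].message.content)
```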
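For the reported statistics, a minimal sketch of the within-case nonparametric workflow is shown below: Cochran's Q as the omnibus test on per-note binary localization outcomes, McNemar's test for pairwise follow-up, and Friedman/Wilcoxon tests for the ordinal quality grades. The data are simulated to roughly match the reported accuracies, and every variable name is an assumption rather than the study's analysis code.

```python
# Hypothetical sketch of the within-case nonparametric comparison described above.
# `localized` is a (129 notes x 4 models) binary matrix: 1 if the model localized
# the seeded target error in that note. All data here are simulated placeholders.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)
models = ["o3", "DeepSeek-v3-r1", "MedGemma-27B", "GPT-4o"]
rates = [0.957, 0.903, 0.809, 0.532]  # approximate reported localization accuracies
localized = np.column_stack([rng.binomial(1, p, size=129) for p in rates])

# Omnibus within-case test for differences in localization accuracy across the 4 models.
q_res = cochrans_q(localized, return_object=True)
print(f"Cochran's Q = {q_res.statistic:.2f}, p = {q_res.pvalue:.2e}")

# Pairwise follow-up on the same notes, e.g. o3 vs. GPT-4o, via McNemar's exact test.
a, b = localized[:, 0], localized[:, 3]
table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
         [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
print(mcnemar(table, exact=True))

# Ordinal recommendation-quality grades (placeholder 1-4 scores) would instead use
# Friedman's test across models, with Wilcoxon signed-rank tests for pairwise contrasts.
scores = rng.integers(1, 5, size=(129, 4))
print(friedmanchisquare(*(scores[:, j] for j in range(4))))
print(wilcoxon(scores[:, 0], scores[:, 3]))
```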