Background: Papilledema, an ophthalmic finding associated with increased intracranial pressure, can be induced by medications commonly used in dermatology, including corticosteroids, isotretinoin, and tetracyclines. Early detection is crucial for preventing irreversible optic nerve damage, but access to ophthalmologic expertise is often limited in rural settings. Artificial intelligence (AI) may enable the automated and accurate detection of papilledema from fundus images, thereby supporting timely diagnosis and management.
Objective: The primary objective of this study was to explore the diagnostic capability of ChatGPT-4o, a general-purpose large language model with multimodal input, in identifying papilledema from fundus photographs. For context, its performance was compared with that of a ResNet-based convolutional neural network (CNN) specifically fine-tuned for ophthalmic imaging, as well as with the assessments of two human ophthalmologists. The focus was on applications relevant to dermatological care in resource-limited environments.
Methods: A dataset of 1094 fundus images (295 papilledema, 799 normal) was preprocessed and partitioned into a training set and a test set. The ResNet model was fine-tuned using discriminative learning rates and a one-cycle learning rate policy. GPT-4o and two human evaluators (a senior ophthalmologist and an ophthalmology resident) independently assessed the test images. Diagnostic metrics, including sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and Cohen’s Kappa, were calculated for each evaluator.
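As an illustration of how the diagnostic metrics named in the Methods follow from a binary confusion matrix, the sketch below computes them in plain Python. It is a minimal example under stated assumptions, not the study's analysis code: the class `ConfusionCounts`, the function `diagnostic_metrics`, and the counts in `example` are hypothetical and are not taken from the study's test set.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Binary confusion matrix with papilledema as the positive class."""
    tp: int  # papilledema images rated as papilledema
    fp: int  # normal images rated as papilledema
    fn: int  # papilledema images rated as normal
    tn: int  # normal images rated as normal


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Return sensitivity, specificity, PPV, NPV, accuracy, and Cohen's kappa.

    Assumes every row and column of the confusion matrix is non-empty,
    so none of the denominators below is zero.
    """
    n = c.tp + c.fp + c.fn + c.tn
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    ppv = c.tp / (c.tp + c.fp)
    npv = c.tn / (c.tn + c.fn)
    accuracy = (c.tp + c.tn) / n

    # Agreement expected by chance: product of the marginal "positive"
    # rates plus product of the marginal "negative" rates.
    p_pos = ((c.tp + c.fp) / n) * ((c.tp + c.fn) / n)
    p_neg = ((c.tn + c.fn) / n) * ((c.tn + c.fp) / n)
    p_chance = p_pos + p_neg

    # Cohen's kappa: observed agreement corrected for chance agreement.
    kappa = (accuracy - p_chance) / (1 - p_chance)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "kappa": kappa,
    }


# Hypothetical counts for illustration only (not the study's data).
example = ConfusionCounts(tp=40, fp=10, fn=5, tn=45)
metrics = diagnostic_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because kappa subtracts chance agreement, it can sit noticeably below raw accuracy when the two classes are imbalanced, which is one reason to report it alongside accuracy.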
Results: GPT-4o achieved an overall accuracy of 85.9% with substantial agreement beyond chance (Cohen’s Kappa = 0.72), but its specificity (78.9%) and positive predictive value (73.7%) were lower than those of the comparators. For context, the ResNet model, fine-tuned for ophthalmic imaging, reached near-perfect accuracy (99.5%, Kappa = 0.99), while the two human ophthalmologists achieved accuracies of 96.0% (Kappa ≈ 0.92).
Conclusions: This study explored the capability of GPT-4o, a large language model with multimodal input, for detecting papilledema from fundus photographs. GPT-4o achieved moderate diagnostic accuracy and substantial agreement with the ground truth, but it underperformed both a domain-specific ResNet model and human ophthalmologists. These findings underscore the distinction between generalist large language models and specialized diagnostic AI: while GPT-4o is not optimized for ophthalmic imaging, its accessibility, adaptability, and rapid evolution highlight its potential as a future adjunct in clinical screening, particularly in underserved settings. They also point to the need for validation on external datasets and in real-world clinical environments before such tools can be broadly implemented.