Review

The Use of Large Language Models in Ophthalmology: A Scoping Review on Current Use-Cases and Considerations for Future Works in This Field

by Ye King Clarence See 1,2, Khai Shin Alva Lim 3, Wei Yung Au 3, Si Yin Charlene Chia 4, Xiuyi Fan 3,4,5 and Zhenghao Kelvin Li 1,2,6,*

1 Department of Ophthalmology, Tan Tock Seng Hospital, Singapore 308433, Singapore
2 National Healthcare Group Eye Institution, Singapore 308433, Singapore
3 Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore 308232, Singapore
4 College of Computing and Data Science, Nanyang Technological University, Singapore 639798, Singapore
5 Centre for Medical Technologies & Innovations, National Healthcare Group, Singapore 138543, Singapore
6 Department of Ophthalmology, Byers Eye Institute, Stanford University School of Medicine, Palo Alto, CA 94303, USA
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(6), 151; https://doi.org/10.3390/bdcc9060151
Submission received: 26 March 2025 / Revised: 13 May 2025 / Accepted: 3 June 2025 / Published: 6 June 2025

Abstract

The advancement of generative artificial intelligence (AI) has resulted in its use permeating many areas of life. Amidst this eruption of scientific output, a wide range of research regarding the usage of Large Language Models (LLMs) in ophthalmology has emerged. In this study, we aim to map out the landscape of LLM applications in ophthalmology, and by consolidating the work carried out, we aim to produce a point of reference to guide the conduct of future works. Eight databases were searched for articles from 2019 to 2024. In total, 976 studies were screened, and a final 49 were included. The study designs and outcomes of these studies were analysed. The performance of LLMs was further analysed in the areas of exam taking and patient education, diagnostic capability, management capability, administration, inaccuracies, and harm. LLMs performed acceptably in most studies, even surpassing humans in some. Despite their relatively good performance, issues pertaining to study design, grading protocols, hallucinations, inaccuracies, and harm were found to be pervasive. LLMs have received considerable attention through their introduction to the public and have found potential applications in the field of medicine, and in particular, ophthalmology. However, through this review, we recommend adopting standardised evaluation frameworks and addressing the gaps in the current literature when applying LLMs in ophthalmology.

1. Introduction

The advancement and popularisation of generative artificial intelligence (AI) has resulted in its use permeating many areas of life and scientific research. This has largely been driven by the way Large Language Models (LLMs) have transformed the use of Natural Language Processing (NLP). Through self-supervised learning, LLMs have been utilised to effectively perform a wide variety of tasks, ranging from interpreting and classifying text to generating answers to conversational questions. In November 2022, the release of ChatGPT by OpenAI revolutionised the LLM scene. Through its user-friendly interface and accessibility, ChatGPT has democratised the use of LLMs beyond the realm of computer science researchers, engaging a broad spectrum of users from various fields and sparking unprecedented interest in this field [1]. It took four years from the release of the Bidirectional Encoder Representations from Transformers (BERT) language model in October 2018 for eight major LLM applications to be developed prior to ChatGPT’s release. By contrast, in the two years since ChatGPT’s release, eight major LLMs (Med-PaLM 1, Google Bard, Glass AI 2.0, GPT-4, Med-PaLM 2, LLaMA, Gemini, and Claude) were released (Figure 1).
Notably, newer LLMs have superior generalisation capabilities [1] and have been trained to provide more human-like responses, sparking interest in their use within medicine. To date, we have seen encouraging results supporting the use of LLMs in clinical practice, medical education, and medical research [2,3,4,5,6].
The field of ophthalmology is no stranger to AI. Machine learning programs have been developed to detect and grade cataracts, while various deep learning programs have demonstrated their utility in detecting glaucomatous optic nerve changes. These applications have allowed ophthalmology to generate a wealth of data, paving the way for LLMs to potentially deliver more streamlined, personalised, and optimised care for ophthalmology patients [7,8,9].
It is unsurprising, therefore, that amid this eruption of scientific output in the realm of AI and LLMs, a wide range of research regarding the usage and efficacy of LLMs in ophthalmology has emerged. Between January and July 2023 alone, a review summarising LLM trends in ophthalmology identified 32 articles related to this topic [10]. Inadvertently, this has also resulted in the publication of many isolated studies with overlapping scopes of research, resulting in the duplication of efforts. In another review [11] of the usage of LLMs in ophthalmology, a total of 108 studies were identified between January 2016 and June 2023, 55 of which involved overlapping aspects of automated question-answering, while 27 dealt with information screening. Notably, this review did not provide a study-by-study breakdown of the 108 studies but mainly sought to understand general trends of LLM usage in ophthalmology. A literature review of publications in this field suggested that LLM research, even in the niche area of ophthalmology, appears to take a laissez-faire approach, with each study having its own unique design. This potentially complicates the consolidation of research outputs in this field and makes it difficult to compare approaches and results across studies. To tackle such concerns, guidelines such as the SPIRIT-AI and CONSORT-AI initiatives for clinical trials and interventions involving AI have been created [12]. However, the extent to which such protocols are followed is yet to be determined. To our knowledge, to date, there has also not been a summarisation of how LLM studies in the field of ophthalmology have been carried out.
LLMs have been available to the masses for at least five years; however, to date, they are not yet widely adopted in ophthalmology circles. In an era of evidence-based medicine, this is in no small part due to the lack of strong evidence supporting their use. To investigate why this is so, we sought to identify the root cause of the problem by first evaluating the conduct of research in this field and secondly identifying the existing limitations of LLM use within ophthalmology studies. By addressing these two areas and providing suggestions on how to improve them, we hope to spur future works in the right direction to achieve the end goal of the adoption of LLMs in day-to-day ophthalmology practice. However, given the diversity of studies in this field, we sought to first map out the landscape of LLM applications in ophthalmology, to consolidate the work thus far, serving as a point of reference to guide future research in this field. To summarise, this review aims to address the following points:
  • To identify recent studies (1 January 2019–11 February 2024) involving the application of LLMs in ophthalmology. This study period was chosen as it represents the period of LLM breakthroughs after BERT’s release [1] (Figure 1);
  • To evaluate how studies of LLM applications in ophthalmology were carried out, in terms of the following clinical trial protocols, prompt techniques employed, benchmarking methods used, and ethical considerations;
  • To examine how LLMs fared in key areas of healthcare application, including exam taking and patient education, diagnostic and management capability, and clinical administration;
  • To highlight potential issues surrounding the present landscape of LLM applications in ophthalmology and to discuss directions for future LLM research and development in ophthalmology.
During our literature review, we found that studies utilising LLMs in ophthalmology covered a broad range of applications and had a diverse range of findings and methodologies. Given the broad and diverse nature of works in this field, the format of a scoping review was chosen to map out the key trends and findings for this area in recent years, as opposed to a meta-analysis that seeks to draw a conclusion about specific research questions.

2. Methods

2.1. Search Strategy and Information Sources

A search of PubMed, Embase, SCOPUS, Web of Science, the Institute of Electrical and Electronics Engineers (IEEE) journals, the Association for Computing Machinery (ACM) journals, Google Scholar, and DataBase systems and Logic Programming (DBLP) was performed from 1 January 2019 to 11 February 2024. The search strategy can be found in Appendix A (Table A1).
The MeSH (medical subject heading) terms included are as follows:
(1) For Ophthalmology: Ophthalmology, Ocular Surgery, Eye Disease, Eye Diseases, Eye Disorders;
(2) For LLMs: Large Language Model, Large Language Models, Large Language Modelling, Chatbot, ChatGPT, GPT, chatbots, google bard, bing chat, BERT, RoBERTa, distilBERT, BART, MARIAN, llama, and palm.
The search strategy was developed in consultation with expert opinion within the research team, which consisted of computer scientists (FXY, CCS) and clinicians [13] (KLZ (ophthalmology), SYKC (ophthalmology)). No additional search filters were applied.

2.2. Selection Process and Eligibility Criteria

The Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guideline was utilised [14]. An independent search was conducted by two reviewers (LKA, AWY). Any discrepancies were resolved through discussion between reviewers, with a third (KLZ) included when necessary. This review has not been registered previously.
The inclusion criteria were the following:
(1) Peer-reviewed primary research studies utilising LLMs;
(2) Studies involving ophthalmology;
(3) Studies published from January 2019 to March 2024.
The exclusion criteria were the following:
(1) Study designs that were reviews, systematic reviews and meta-analyses, case reports, case series, guidelines, letters, correspondences, or protocols;
(2) Studies that were not published in English.

2.3. Data Extraction and Analysis

Data on the studies were uploaded into Mendeley (https://www.mendeley.com/) and imported into COVIDENCE Systematic Review Software (https://www.covidence.org/, Veritas Health Innovation, Melbourne, Australia) for screening. As mentioned earlier, differences in screening outcomes were resolved in consultation with a third reviewer.
Data extracted from the papers were analysed in Microsoft Excel (Microsoft, Redmond, WA, USA). These included the (1) authorship details, (2) LLMs utilised, (3) study methodology, and (4) performance and performance scoring of the LLMs.
In terms of the study methodology, we took note of the clinical trial protocols used, the prompt techniques employed, how benchmarking was performed, and ethical considerations in the studies.
The performance of LLMs was also analysed in the following areas: exam taking and patient education, diagnostic capability, management capability, clinical administration, inaccuracies, and harm (see Glossary, Section 6).
The subspecialties studied included (1) Cornea, (2) Glaucoma, (3) Neuro-ophthalmology, (4) Uveitis, (5) Lens and cataract, (6) Paediatrics and strabismus, (7) Retina and Vitreous, (8) Oculoplastics, (9) Optics, (10) Refractive surgery, and (11) Pathology. The various LLMs were then assessed on their accuracy and overall completeness of their answers, which were then ranked and compared across the different LLMs employed per study.

3. Results

A total of 976 studies were screened, of which 904 were excluded, with 72 being sought for retrieval. A further 23 of these studies did not meet the inclusion criteria, and a final 49 studies [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63] were included in this study (Figure 2, Table 1).

3.1. Overall Study Characteristics

A total of 14 different LLMs (Table 1) were used across the studies with various applications spanning the fields of administration, clinical knowledge, diagnostics, exam taking, manuscript writing, patient education, prognostication, text interpretation, and triage. GPT-3.5 was the most commonly employed LLM, being utilised in 34 studies. GPT-4.0 came in second, appearing in 26 studies. Bard and Bing were the next most used (Table 1).
In terms of study design, not all studies followed a standardised clinical trial protocol for artificial intelligence. All studies employed a zero-shot, one-shot, or few-shot prompt engineering technique, apart from one study, which only utilised contextual priming. There were three studies that additionally used prompt chaining, iterative prompting, and chain-of-thought prompt techniques to supplement their work. Most studies (37 of 49 studies) shared full examples of their prompts (Table 2). In Table A3, we cover some examples of these prompts, their corresponding LLM output, and comments on how they were graded in relation to the sub-analysis of this paper. Across the studies, the grading of the output generated by the LLMs was heterogeneous with little standardisation, resulting in difficulty in data analyses. Overall, 24 studies employed human assessors to benchmark LLM performance in terms of “correctness” of output, two of which were assisted by automated benchmarking assessments. The remaining studies utilised automated benchmarking to assess “correctness” of output. Nine studies considered harm in their study protocol, all of which were assessed by humans. Only one study by Wilhelm et al. also utilised an automated form of harm assessment in the form of GPT-4.0. In total, 12 studies delved into the ethical implications relating to their work, while 13 only touched very briefly on patient safety without going further into an ethical discussion (Table 2).

3.2. Breakdown of the LLM Benchmarks Studied and General Observations

Overall, 43 of the 49 included studies [15,16,17,18,19,20,21,22,23,24,25,26,27,29,30,31,32,34,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,56,57,58,59,60,62,63] performed an assessment regarding the “correctness” of the LLM’s output in some form, be it relevance or accuracy, to name two examples. The remaining six studies looked at other qualities such as readability [28,35,36], manuscript writing [33], and administration [43,55]. Amongst the 43 studies assessing the “correctness” of the LLM’s output, 27 [16,17,18,20,23,27,29,30,31,32,34,37,39,41,45,47,49,50,51,52,54,56,57,59,60,62,63] of them compared multiple LLMs against each other (Table 3a,b), while 16 [15,19,21,22,24,25,26,38,40,42,43,44,46,48,53,58] were observational studies using a single LLM (Table 3a,b). A total of 11 different scoring systems were used to assess for “correctness” (see Table A2 for examples of this variation). There were 15 studies [20,23,30,31,34,37,41,47,49,51,54,56,57,59,63] that compared GPT-4.0 against humans and/or other LLMs. Among these, GPT-4.0 was the best performer in 10 studies [20,30,31,34,37,41,47,57,59,63]. Amongst the seven studies [16,27,29,37,47,52,56] comparing Bard, Bing, and GPT-3.5, GPT-3.5 had the best performance in five [16,29,37,47,56] of them. Amongst the single-armed studies, the LLMs were reported to have largely appropriate responses overall (Table 3a,b). In the following subsections, we go into further detail regarding LLM performance in specific domains.

Human vs. Artificial Intelligence

Overall, 16 studies [17,21,23,25,26,30,31,38,39,41,49,50,51,53,56,57] investigated the performance of LLMs against humans (Attendings, Ophthalmologists-in-training) in diagnosis, exam taking, patient education, prognostication, and triage. In terms of diagnostic, prognostic, and triage ability, humans consistently outperformed LLMs in all six of the studies assessing these capabilities (Table 4). In terms of answering exam questions, there was a more even balance, with humans being the best in three studies and GPT-4.0 surpassing humans in four studies. It is also worth noting that humans consistently performed better than GPT-3.5 in exam taking for every subspecialty (Table 5). The same could be said in terms of developing patient education materials, with humans bettering GPT-3.5 in one study, equaling GPT-3.5 in another study, and partnering with AI to surpass GPT-4.0 in the last study of this area. Notably, the latter study found that GPT-3.5 produced superior results to humans in terms of how empathetic their patient education material was (Table 4).

3.3. Performance of LLM in Exam-Taking and Patient Education

Of the 14 studies that assessed LLM exam-taking capabilities, 10 performed focused analysis of individual ophthalmology subspecialties (Table 4). GPT-4.0 was consistently the top-performing LLM in all these studies, also scoring more than 50% of answers correct in all but one study (Table 5). The study where GPT-4.0 scored less than 50% of answers correct was performed using a Japanese question bank, highlighting the possible language barriers inherent to LLMs (Table 5).
In total, 20 studies looked at patient education, with 16 assessing performance and relevance of output (8 comparative [16,29,37,45,47,56,62,63], 8 non-comparative [19,21,22,24,38,42,46,58]) and the remaining 4 assessing readability [28,35,36,61] (Table 1). Among the eight comparative studies, GPT-4.0 was deemed to produce the best patient educational materials in three of the four studies [37,47,63] in which it was involved, while GPT-3.5 performed the best in the remaining four studies [16,29,45,62]. Looking at non-comparative studies, it was found that only GPT-3.5 and 4.0 were used. Both models performed well, with the majority of responses being “good”, scoring more than 50%, or assessed as “relevant”, depending on the scoring systems applied (Table 3a,b, Appendix A). Regarding the readability of patient educational materials, a total of nine different scoring systems were used amongst the four studies, showing how varied assessment in this area can be. The results here varied greatly even within individual studies, depending on the types of prompts given (Table 6). Both Bard and GPT-4.0 were able to significantly improve their readability scores when the types of prompts given were varied [28,35]. GPT-3.5 performed inconsistently, with the material produced being beyond the desired reading level in one study [35] and at the desired reading level in another [36]. In Eid’s study [28], GPT-4.0 generated material that was easier to read than Bard’s. Meanwhile, without prompts, Bard was able to provide educational material that was easier to read in Eid’s study [28] but not in Kianian’s [35].
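For reference, several of the commonly used readability indices can be computed with open-source tooling. The following is a minimal, purely illustrative sketch (not the scoring pipeline of any included study) that grades an invented patient-education snippet with two such indices using the textstat Python package:

```python
# Illustrative only: scoring patient-education text with two common
# readability indices via the open-source "textstat" package
# (pip install textstat). The sample text is invented for demonstration.
import textstat

llm_output = (
    "Cataracts happen when the lens inside your eye becomes cloudy. "
    "Surgery replaces the cloudy lens with a clear artificial one."
)

ease = textstat.flesch_reading_ease(llm_output)    # higher = easier (0-100)
grade = textstat.flesch_kincaid_grade(llm_output)  # approximate US school grade

print(f"Flesch Reading Ease: {ease:.1f}")
print(f"Flesch-Kincaid grade level: {grade:.1f}")
```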

3.4. Diagnostic and Management Capabilities of LLM

Overall, 11 studies [23,26,37,38,39,48,49,50,53,57,63] assessed the diagnostic capabilities of LLMs through cases and questions. GPT-4.0 consistently outperformed other LLMs (BingChat, WebMD, Bard) in coming to a diagnosis (Table 7). As mentioned earlier, humans still performed better than LLMs in this field; nevertheless, most studies did not report a great disparity between LLM scores versus human scores. It was noted that most LLM outputs included a cautionary line such as “you should seek medical attention from a medical professional”.
Overall, seven studies [16,22,23,24,37,50,57] evaluated the management of eye conditions, five of which employed multiple LLMs for comparison [16,23,37,50,57]. As noted earlier in LLM diagnostic ability, GPT-4.0 was also superior to GPT-3.5 and Bard in suggesting the appropriate management (Table 8). Two studies compared the performance of LLMs against humans here [23,57]. The performance of GPT-4 was found to be better or similar to humans in both studies (Table 8). Again, it was noted that most LLMs included a cautionary disclaimer to seek professional medical advice.

3.5. Clinical Administration Tasks

Only three studies evaluated the use of LLM for clinical administration tasks [33,43,55]. In two of the studies that gave LLMs more freedom to write, significant levels of hallucinations were observed.
One study looked at using LLMs for discharge summary and operative notes writing [55]. It found that the quality of GPT’s discharge summaries was affected by the quality of the prompts and tended to be valid but generic. Here, GPT-3.5 hallucinated its own model of the intraocular lenses utilised, but when prompted further, it was able to self-correct to improve the quality of output (Table 9).
Another study [33] evaluated manuscript abstract writing using GPT-3.5 and GPT-4.0. GPT-4.0 outperformed its predecessor on all fronts, including DISCERN score, helpfulness, truthfulness, and harmlessness. However, it was noted that both versions had hallucinated references (Table 9).
The last study was more focused, testing LLM on classifying texts into retina International Classification of Diseases (ICD) coding [43]. Of the 181 prompts given, 70% of the prompts had at least one correct ICD code generated by the LLM. This accuracy was reduced to 59% when assessed to generate only the correct ICD code (Table 9).

3.6. LLM Inaccuracies and Harm

Overall, 20 studies [15,16,19,20,21,22,23,24,29,33,37,38,39,40,42,46,47,56,60,62] detailed the hallucinations or inaccuracies produced by the LLMs. Bard demonstrated a significant inaccuracy rate, having the most inaccuracies in 4 of the 6 studies it was involved with [16,29,37,47,56,62]. On the other hand, GPT-4.0 had the lowest inaccuracy rate amongst LLMs in all seven of the studies that included inaccuracy analysis [20,23,33,37,39,47,56]. In single LLM studies [15,19,21,22,24,38,40,42,46], we observed that inappropriate responses made up a minority of responses and were at times comparable to the frequency of errors in human answers (Table 10).
Overall, 11 studies [19,20,21,24,33,37,40,46,56,60,63] evaluated the potential for, the extent of, and the likelihood of harm by the LLMs. In comparative studies [20,24,33,37,40,56,60,63], GPT-4.0 was less likely to generate harmful content when compared to GPT-3.5, Claude 2, Bing, and Bard. In some studies, GPT-4.0 did not generate responses that constituted harm [19]. Only two studies compared harm from LLMs against that of humans, both of which found that the likelihood of harm by humans and LLMs was equivalent [21,56]. The extent of harm was equivalent between humans and chatbots in the study by Bernstein et al. [21], while this was lowest in humans in the study by Tailor et al. [56] (Table 11).

4. Discussion

This scoping review identified a total of 49 primary research studies applying LLMs in ophthalmology that were published within the five-year-and-two-month search period. These studies explored a wide range of applications, thereby providing breadth to this nascent field. The results of this scoping review suggest that while state-of-the-art LLMs can exhibit human-level performance, their real-world clinical application still faces several challenges. Having scoped out the research landscape of LLM applications in ophthalmology in the results section, we now discuss our results in the context of the study objectives and the implications for evidence and future research. Firstly, we evaluate the conduct of LLM studies in the field of ophthalmology. Thereafter, we examine the performance of LLMs in ophthalmology based on current research, according to the major domains of their current applications, namely patient education and exam taking, ophthalmic diagnostic capabilities, management capabilities, and clinical applications. We then discuss the existing drawbacks and hurdles facing the use of LLMs in ophthalmology. Finally, we discuss directions for future LLM research and development in ophthalmology. By firstly addressing deficiencies in the conduct of research in this field, and secondly identifying the existing limitations of LLM use within ophthalmology studies, we hope to spur future works in the right direction to achieve the end goal of the adoption of LLMs in day-to-day ophthalmology practice. This overview of the discussion is summarised graphically as our study’s roadmap in Figure 3. To our knowledge, this is the first such review to provide analysis and critique on the conduct of research in the field of LLMs and ophthalmology.

4.1. Evaluation of Past Methodologies

4.1.1. Issues Regarding Standardisation

Amidst the excitement to gather data regarding LLM applications, we found that recent publications generally did not follow suggested frameworks or protocols. Rather, we see diverse pockets of data being collected by individual studies over multiple fields. While there is utility in this for widening the breadth of the data pool, the lack of standardised benchmarks leads researchers and experts to use varying benchmarks and implementations, resulting in inconsistent and sometimes incomparable evaluation results. None of the included studies followed a fixed AI-related research protocol. This hampers the ability of follow-up studies to reproduce these precedents. In the same vein, 12 of the 49 studies did not provide full examples of their prompts, potentially affecting the reproducibility of their work. Following protocolised guidelines for AI-related clinical trials, openly sharing the specific prompt techniques employed, and using common benchmarks would make research works in the realm of LLMs more reproducible and suitable for direct comparison. The SPIRIT-AI and CONSORT-AI initiatives for clinical trials and interventions involving AI are examples of such protocols. Taking the SPIRIT-AI extension as an example, interventions are required to specify the procedure for acquiring and selecting the input data for the AI intervention, and to specify the procedure for assessing and handling poor quality or unavailable input data [64]. Such accountability and transparency of steps would benefit future works seeking to build on previous research and allow for better comparison of results.
Beyond issues regarding transparency and standardisation, we noted inconsistencies in terms of the benchmarking of LLM performance. In our study, we encountered significant heterogeneity with respect to the grading systems, with some studies grading on a Likert scale with 1 being the worst and 5 being the best [56] and others with 1 being the best and 3 being the worst [29]. Similarly, in evaluating diagnostic capabilities, scoring systems could be binary, meaning whether the responses were correct or incorrect [57], while other studies evaluated agreement with experts [50]. This hinders the ability to perform statistical analysis across studies and hence limits future meta-analysis in this field. Another source of inconsistency was the use of human evaluation. While human evaluation is necessary to grade areas such as harm, many such evaluations appeared to be arbitrary and not based on evidence-based grading criteria. It is heartening to see open-source frameworks for benchmarking medical machine learning models, such as MedPerf, gaining traction, but they are yet to be widely adopted [65].
Most studies appeared to take the first output from their LLM platforms. The potential irreproducibility of answers from LLM platforms is well documented: answers generated on one occasion may differ from answers generated upon subsequent inputs of the same question. Singer et al. sought to overcome this by considering only the initial answers generated [54]; however, this runs the risk of missing better or worse answers generated subsequently. Future works can seek to overcome this by aggregating multiple outputs from their LLM platform, such as in a best-of-three format.
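As a concrete illustration of such an aggregation protocol, the sketch below queries a model several times for a multiple-choice item and takes the modal answer, flagging unstable items by their agreement rate. This is a hypothetical sketch: query_llm stands in for whichever chat API a study uses and is not a function from any particular library.

```python
# Sketch of a "best-of-three" style protocol: query the model several
# times and take the modal answer rather than trusting the first output.
from collections import Counter

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real chat-completion API call."""
    raise NotImplementedError

def modal_answer(prompt: str, n: int = 3) -> tuple[str, float]:
    """Return the most frequent answer over n runs and its agreement rate."""
    answers = [query_llm(prompt).strip().upper() for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n

# Usage (with a real query_llm): a low agreement rate flags exam items
# where the model's answer is unstable across repeated runs.
# answer, agreement = modal_answer("Q17: ... Answer with A, B, C or D.", n=3)
```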

4.1.2. Harm and Patient Safety

There was a general lack of consideration for patient harm, as it was evaluated in only nine studies, with ethics only being formally discussed in 12 of the included studies (Table 2). As medicine is, by nature, a practice of non-maleficence, the objective of “doing no harm” has been central to clinical trials throughout medicine. Clinical trials employing AI should be no different, with the patient’s welfare kept at the heart of everything. While many of the included studies were not direct applications on real patients per se, the limited attention paid to ethical safeguards serves as a timely reminder for the future as LLM applications assimilate further with medical practice. Although most studies were primarily evaluations of technology without live patient involvement, their real-world implications are undeniable. It would be useful for future works to state their findings in relation to patient safety, for instance, the certainty with which GPT-4.0 could provide reliable medical advice within a specific field of ophthalmology. The Assessment List for Trustworthy AI (ALTAI) is an example of a self-assessment checklist published by the European Commission as an ethics guideline for trustworthy AI in July 2020 [66,67]. Checklists such as these could be included as supplementary material in AI studies relating to healthcare, serving as an ethical safeguard for patients.
Of the nine studies that evaluated the harm of LLM output, only two studies compared this to harm from human output [21,56]. While data on harm ought to be retrieved from LLMs, it would be insightful when such output is taken in relation to harm from human output. By obtaining human data in the same context for a basis of comparison, we can understand if LLM output is truly more harmful or if a human expert in the same confines of the study would be any better. It is worth noting as well that the evaluation of harm involved human assessment in all nine studies in which it was evaluated, showing how LLMs still require a human safety net at this point (Table 2).

4.1.3. The Disproportionate Focus of Existing Studies

Amongst the included studies, there also appear to be multiple studies of similar design. Many of these employ zero-shot prompts to test the capabilities of LLMs in a particular area and then assess their accuracy via exact match benchmarking or via human assessment. The utility of such repetitive studies overlapping in design is questionable.
There was also a significant bias toward using GPT-3.5 and GPT-4.0, which made up the overwhelming majority of LLMs used in the included studies (Table 1). This was despite deliberate attempts to include non-GPT LLMs, as seen from the search strategy. This reveals researchers’ real-world preference for GPT during the pre-defined timeframe, driven in no small part by its popularity, accessibility, and ease of use. The benefit of this situation is the deeper exploration of GPT models, which are reportedly the most popular LLMs in use in recent history [68]. On the other hand, this runs the risk of under-representing other LLM models. Hence, while current research may provide a good indication of GPT applications in ophthalmology, it may not be representative of LLMs as a whole.
Along the same line of thought, there was limited exploration of how model architecture, licensing (open versus closed source), and intended use case (general-purpose versus medically fine-tuned) would influence LLM performance and applicability within the field of ophthalmology. There was also insufficient variation among the LLMs used in the included studies to make such comparisons. Most studies utilised closed-source, transformer-based models like GPT and Bard that did not undergo medical or ophthalmology-specific fine-tuning. The one study to utilise a proprietary LLM with ophthalmology-specific fine-tuning (Aeyeconsult) found that it comfortably outperformed GPT-4 in almost all subspecialties of ophthalmology questioning (Table 5).

4.2. Evaluation of LLM Performance

Broadly speaking, all the included studies explored two main areas of “correctness” and “inaccuracies”, while a subset also studied the readability and harm of the LLM’s output. In studies where GPT-4.0 was included, it was amongst the best-performing LLMs in all domains of patient education and exam taking, ophthalmic diagnostic capabilities, management capabilities, and clinical applications.
In terms of exam-taking ability, LLMs could equal and even surpass human scores. Even when faced with subspecialties that demand inductive reasoning, such as neuro-ophthalmology, GPT-4.0 could perform at the level of, or even a level better than, humans [41,57]. It is worth noting, however, that all exam questions were text-based. Two studies [34,49,50] attempted to assess the medical image reading ability of LLMs. However, these studies did so by using text-based descriptions of the images as input, rather than the raw images themselves. It is known that LLMs have the ability to analyse images, and testing their ability to analyse raw medical image files directly would pave the way for further clinical utility within ophthalmology. Such attempts have already been carried out by non-ophthalmology-based studies [69], but the results are inconclusive at this point. We also identified an instance where GPT-4.0 was uncharacteristically poor: when a Japanese question bank was utilised, GPT-4.0 performed the worst, scoring less than 50%. Similarly, English prompts fared better than Chinese prompts when reporting fundus fluorescein angiography reports using GPT-3.5 [38]. A possible reason for this is the language difference, to which GPT-4.0 might have had little exposure or training, and hence it fared poorly. This highlights a potentially inherent weakness in LLMs, whereby performance can be hindered by a lack of exposure to a language. While English is a dominant language globally, it is estimated to be spoken by only 20% of the world’s population [70]. The lack of multilingual support is a potential barrier that future works may consider exploring further.
In the area of patient education, it is perhaps unsurprising to find that expert-edited LLM responses fared the best in terms of quality [56]. Tailor et al. reported that human expert-edited LLM responses performed better than purely human expert responses and saved more time when compared to the experts creating a response from scratch [56]. Similarly, Bernstein’s study comparing LLM responses with those of ophthalmologists found comparable quality in the advice provided [21]. These works demonstrate an interesting direction where more effective human-AI collaboration might be achieved—an area underexplored by most studies, which tended to benchmark pure LLM output on its own without human revision. On the other hand, a surprising finding was that LLM output could exceed that of humans in terms of empathy scores [56]. This isolated finding was another underexplored yet highly relevant area in this field, as healthcare is not merely a practice of knowledge but also an art that requires a humanitarian touch. Lastly, the ability of LLMs to personalise the readability of patient education materials to their audience’s comprehension levels strengthens their position for future adoption, as demonstrated by Bard [28,63].
In terms of diagnostic and management capabilities, LLMs appeared to struggle more when coming to a diagnosis [26,49,50] but fared better when asked for the management plan after the diagnosis had been established. This reflects the higher-order thinking that is required for making a diagnosis. In the study by Rojas-Carabali et al., we note that LLMs were possibly disadvantaged in that they were given text descriptions of images, while humans were given the images to assess [50]. It would be useful for future works to assess how LLMs would perform against humans if both were given the same images to come to a diagnosis. It has been shown that a simple combination of patient history and chief complaint could predict an overall diagnostic accuracy of approximately 90% of neuro-ophthalmology cases when read by human assessors [71]. These results seem to suggest that the ability of LLMs to interpret written information falls short of humans despite their potentially greater wealth of knowledge. Also, as pointed out earlier, most LLMs included medical disclaimers when posed with diagnostic questions (e.g., “you should seek medical attention from a medical professional”). This drives home the point that while LLMs may close the gap on human accuracy in diagnosis, there is still some way to go before their opinion is taken to be as legitimate as that of a medical professional.
The area of clinical administration tasks was only covered by three studies, touching on different areas. The dearth of data here calls for more work to explore this area of untapped potential. Singh’s research highlights ChatGPT’s ability to swiftly generate detailed ophthalmic discharge summaries and operative notes [55], showcasing its potential to streamline administrative processes with tailored content and rapid response times. Similarly, Ong’s study demonstrates ChatGPT’s capability to interpret text accurately [43], suggesting its potential to ease physician burden in tasks like ICD coding. Moreover, Hua’s investigation [33] into manuscript writing reveals that AI-generated ophthalmic scientific abstracts are comparable in quality between different versions of GPT-3.5 and GPT-4.0, though factual errors in references indicate a need for further refinement. Overall, these findings show that LLMs can be helpful for administrative tasks in ophthalmology, but more work is needed to establish them further for practical use in healthcare and ophthalmology.

4.3. Directions for Future Works

4.3.1. Standard Framework for Assessing Accuracy, Validity, and Harm

Much like other reviews and commentaries on LLMs in other fields [72,73], this study calls for future works to follow standardised benchmarks and frameworks for assessing the accuracy and validity of LLMs in clinical settings [74]. A robust framework would offer clear guidelines in areas such as providing comprehensive context on diseases, reporting prompt wording precisely, incorporating diverse question formats, adopting learning techniques, and using standardised metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and METEOR (Metric for Evaluation of Translation with Explicit ORdering) [73]. The Standardized Assessment Framework for Evaluations of Large Language Models in Medicine (SAFE-LLM) is one such model that sets out to unify evaluation standards, facilitating the comparison and improvement of LLMs in medical applications [75]. Such frameworks establish a common language and understanding between developers and end-users, fostering collaboration and partnership in the advancement and deployment of LLMs. However, as seen in this review, they are yet to be widely adopted. Medical image processing is a growing field within AI, and comprehensive benchmarks such as MedSegBench are already being developed in this area [76]. This is especially relevant to ophthalmology, a field highly dependent on image interpretation. Moving forward, it is imperative that standardised benchmarks in this area are employed as well.
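To make such metrics concrete, the sketch below scores a candidate LLM answer against a reference answer with ROUGE via the open-source rouge-score package (pip install rouge-score); METEOR is similarly available through nltk.translate.meteor_score. The texts are invented for illustration, and this is not the evaluation pipeline of any included study.

```python
# Minimal sketch: computing ROUGE overlap between a reference answer and
# a candidate LLM answer with Google's open-source rouge-score package.
from rouge_score import rouge_scorer

reference = "Primary open-angle glaucoma is usually treated first with topical prostaglandin analogues."
candidate = "First-line treatment of primary open-angle glaucoma is a topical prostaglandin analogue."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)  # arguments: (target, prediction)

for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")
```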
Clinical trial protocols such as CONSORT-AI and SPIRIT-AI emphasise the importance of describing the results of any performance errors and how errors were identified. Conducting future LLM studies in line with such protocols would address the critical need for transparency, reproducibility, and accountability in assessing the safety and reliability of LLMs, further contributing to building trust among developers and end-users. With standardised AI study protocols in place, stakeholders can communicate their findings with more transparency and uniformity, ensuring the ethical and responsible use of LLMs in various domains, including healthcare. Besides standardisation, greater transparency in research methodology would strengthen the reproducibility of published works. This can be achieved via the use of open datasets, reproducible prompts, and shared code repositories for all AI-in-ophthalmology studies. Many of the current uses of LLMs in ophthalmology are not yet in the clinical trial stage with live patient testing. This could be due to the ethical considerations mentioned in the next section. Nevertheless, defined protocols and greater transparency even for such studies can build the foundation of accountability and reproducibility for future real-world patient applications.

4.3.2. Greater Evaluation and Strategies Toward Ethical Considerations

It is encouraging that considerations and suggestions regarding medical ethics were raised in some of the studies in this review. We highlight some of them here.
The inaccuracies of LLM output raise the risk of harm to patients. Al-Sharif et al. suggested that LLMs be trained solely on supervised evidence-based ophthalmology datasets, to maintain the “purity” of what the LLM “knows” [16]. Singer’s [54] and Antaki’s [18] studies emphasise the importance of using verified sources to ensure the trustworthiness and accuracy of information provided by LLMs. It has also been shown that fine-tuning medical LLMs significantly improves their safety, reducing their tendency to comply with harmful requests [77]. LLMs, as tools, have the potential to do both good and harm, and it is the responsibility of LLM creators and clinicians to ensure that they are developed to adequate safety standards to limit the harm to patients.
Although LLMs can achieve a reasonably high rate of accuracy, inaccuracies may still be interspersed amongst the facts they produce. Bernstein et al. highlighted that these partial truths in LLM outputs may lure patients into a false sense of trust [21]. As shown previously, medical disclaimers are frequently used at the end of LLM medical outputs to mitigate this. Raghu et al. also highlighted that the education of end-users about the capabilities, potential risks, and benefits of this technology is equally important [48].
Bernstein et al. also noted that patient healthcare information would have to be entered into LLMs to obtain customised and individualised output. OpenAI’s privacy policy states that they “may collect Personal Information that is included in the input” [78]. Patient data entering the online domain or the servers of private companies are at risk of being hacked, with implications for patient confidentiality and data privacy [79]. In the clinical deployment of LLMs, policies should include strategies to safeguard this. Raghu et al. suggested that until such safeguards are in place, only anonymised patient data should be entered into these LLMs [48], while Tao et al. suggested keeping personalisation of output offline after online drafts are generated [58].
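As a toy illustration of this principle (and emphatically not adequate de-identification on its own), obvious structured identifiers can be scrubbed locally before any text leaves the clinic; production use would require dedicated clinical de-identification tooling and governance. The patterns and clinical note below are invented examples.

```python
# Toy sketch: redact obvious structured identifiers locally BEFORE text is
# sent to a third-party LLM. A regex pass like this is NOT sufficient
# de-identification; it only illustrates the "anonymise first" principle.
import re

PATTERNS = {
    r"\b[STFG]\d{7}[A-Z]\b": "[NRIC]",         # Singapore NRIC/FIN format
    r"\b\d{8}\b": "[PHONE]",                   # 8-digit local phone number
    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b": "[DATE]",  # numeric dates
}

def scrub(text: str) -> str:
    for pattern, token in PATTERNS.items():
        text = re.sub(pattern, token, text)
    return text

note = "Seen 12/03/2024. Patient S1234567A (tel 81234567) reports blurred vision OD."
print(scrub(note))
# -> Seen [DATE]. Patient [NRIC] (tel [PHONE]) reports blurred vision OD.
```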
Regarding information sources, there is the issue of plagiarism, as raised by Tao et al. [58]. Data authenticity, data provenance, and intellectual property contamination are issues that LLMs are still grappling with [79]. Text generated from LLM output may be taken from copyrighted sources illegitimately. We have also seen cases of LLMs hallucinating references [33]. To date, LLM reliability for citation and reference has been found to be inconsistent and occasionally very poor [80]. Further fine-tuning of LLMs in this regard should be a priority moving forward, whilst end-users ought to query the original sources and cite where credit is due.
Jiao et al. raised the issue of biases inherent to LLMs, which risk amplifying existing health disparities. LLMs may refer to source material that does not represent all patient populations equally, resulting in unequal treatment for specific patient groups [34]. We have also seen how GPT underperformed when tested in a non-English language [51], potentially underserving patients who speak non-English languages. Utilising adversarial testing and bias detection algorithms to identify and remove any discriminatory patterns in the prompts or the AI-generated outputs is a possible way to tackle these biases [81]. While training LLMs on diverse and representative sources is a possible way to reduce inequalities associated with LLM use, Kianian et al. also argue that improving the readability of LLM output can reduce such inequalities too [36]. This is because with poor readability comes poorer health literacy, which has been shown to disproportionately affect populations of lower socioeconomic status [82]. Collaboration between prompt engineers, bioethicists, and patient advocates may help in designing prompts that are inclusive, diverse, and free from biases based on factors such as race, ethnicity, gender, or socioeconomic status [81].
Finally, Tao et al. also questioned how the burden of legal responsibility should be divided between the physician and LLM, especially for cases of patient harm or privacy breaches [58]. As AI systems become increasingly autonomous and capable of decision making, it is important to ensure that there is accountability for their actions. This includes ensuring that AI systems are transparent and that there are oversight mechanisms in place to address any errors [66].

4.3.3. Techniques for Improving LLMs’ Accuracy and Interpretability

In general, prompt engineering, a transformative approach in natural language processing, involves the development of tailored input prompts or instructions to guide LLMs in generating desired outputs or responses. Examples of such methodologies include Retrieval-Augmented Generation (RAG) and fine-tuning. Fine-tuning involves adjusting the model’s parameters based on task-specific datasets, essentially operating in a “closed-book” manner. Conversely, RAG functions in an “open-book” setting, harnessing external information sources to retrieve and integrate relevant data, thereby enhancing the model’s comprehension and generative capabilities. For instance, in the domain of healthcare education [83], RAG was chosen due to its capability to provide traceable responses, enhancing trust and explainability, its scalability in accessing vast healthcare knowledge bases, and its flexibility for rapid updates in alignment with evolving clinical guidelines.
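A minimal, self-contained sketch of the RAG pattern described above follows: retrieve the most relevant passage from a vetted corpus, then prepend it to the prompt so the answer is grounded and traceable. The corpus and question are toy examples; a real system would use a curated ophthalmology knowledge base and dense embeddings, whereas TF-IDF retrieval keeps this sketch runnable without external services.

```python
# Sketch of Retrieval-Augmented Generation (RAG): ground the prompt in
# passages retrieved from a trusted corpus instead of relying solely on
# the model's parametric knowledge.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [  # toy stand-in for an evidence-based ophthalmology dataset
    "Acute angle-closure glaucoma presents with eye pain, halos, and a mid-dilated pupil.",
    "Cataract surgery replaces the opacified crystalline lens with an intraocular lens.",
    "Diabetic retinopathy screening uses dilated fundus examination or fundus photography.",
]

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k corpus passages most similar to the question."""
    vec = TfidfVectorizer().fit(corpus + [question])
    sims = cosine_similarity(vec.transform([question]), vec.transform(corpus))[0]
    return [corpus[i] for i in sims.argsort()[::-1][:k]]

question = "What are the symptoms of acute angle-closure glaucoma?"
context = "\n".join(retrieve(question))
prompt = (
    "Answer using ONLY the context below, and cite it.\n"
    f"Context: {context}\nQuestion: {question}"
)
print(prompt)  # this grounded prompt would then be sent to the LLM
```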
Some studies, as seen in patient education use-cases, improved LLM performance through innovative prompt engineering and fine-tuning. This suggests that the limiting factor of output may lie not only in the LLM itself but also in the types of prompts given. Further works exploring the effect of varying styles of prompts on LLM output would aid in verifying this. Both Eid [28] and Kianian [35] improved the readability of their patient education material output by specifying a reading level (the 6th-grade reading level in their case). Lim et al. found that even with a simple prompt, “That does not seem quite right. Could you kindly review?”, GPT-3.5, GPT-4.0, and Bard were able to demonstrate substantial self-correction abilities [37]. Bernstein et al. used instruction prompt engineering to answer patients’ questions. This prompt technique uses explicit instructions or cues about the task at hand to adapt the behaviour of the LLM [21]. With the use of these prompts, they found that human-written and AI-generated answers to patient ophthalmology-related questions were very comparable in terms of accuracy and harm. Notably, assessors could not be “definitely sure” whether the responses were AI- or human-generated in the majority of cases. Another study by Liu et al. [38] utilised chain-of-thought-inspired prompt techniques to elicit a step-by-step reasoning process from GPT-3.5 for both English and Chinese prompts. Interestingly, this study found that English prompts performed better in diagnostic and inference capabilities, as well as providing more complete reasoning steps, suggesting that the choice of language affects the quality of output as well. It has been suggested that this is an inherent bias of existing LLMs due to the wider availability of English-language datasets. Ophthalmology-based LLM studies can look to multilingual medical datasets such as ApolloCorpora [84] to evaluate whether this discrepancy can be reduced by the use of such datasets. Alternatively, to retain the reasoning and language understanding abilities of LLMs while boosting multilingual reasoning performance, newer methods such as MindMerger have been developed, which seek to merge LLMs with the external language understanding capabilities of multilingual models [85].
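The self-correction pattern reported by Lim et al. [37] can be sketched as a simple two-turn loop, appending their fixed review prompt after the model’s first answer. This is a hypothetical outline: chat stands in for any multi-turn chat-completion call and is not a function from any particular library.

```python
# Sketch of iterative self-correction prompting: ask, then challenge the
# model with a fixed review prompt and return its revised answer.
REVIEW_PROMPT = "That does not seem quite right. Could you kindly review?"

def chat(messages: list[dict]) -> str:
    """Hypothetical stand-in for a real multi-turn chat API call."""
    raise NotImplementedError

def answer_with_self_correction(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    first = chat(messages)  # initial answer
    messages += [
        {"role": "assistant", "content": first},
        {"role": "user", "content": REVIEW_PROMPT},
    ]
    return chat(messages)   # revised answer after the review prompt
```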

4.3.4. Developing a More Holistic Understanding of LLMs

While the included studies lacked uniformity in terms of methodology and benchmarking, there was a lopsided emphasis on GPT models. To develop a more holistic understanding of LLM capabilities in the field of ophthalmology, non-GPT models must be further explored as well. We advocate for future works to go beyond GPT models, to deepen our understanding of how non-GPT models fare against GPT-family models in more areas. Given that Aeyeconsult outperformed GPT-4, it appears likely that the development and testing of ophthalmology fine-tuned LLMs could push the boundaries of LLM capabilities in ophthalmology. To expand the knowledge base in this field, future works could explore how different forms of LLM model architecture, licensing (open versus closed source), and intended use case (general-purpose versus medically fine-tuned) would influence LLM performance and applicability within the field of ophthalmology. Should proprietary LLMs in ophthalmology show consistent and significant superiority to undifferentiated ones, it could pave the way for commercial investment in this area, leading to earlier adoption in day-to-day practice and patient care.

4.4. Strengths and Limitations

The strengths of this review include its wide search strategy, spanning eight bibliographic databases across both medicine and information technology. The time frame chosen as part of the search criteria (2019–2024) reflects the state of LLM usage within ophthalmology during the period of LLM breakthroughs following the release of BERT in October 2018. This review also followed best practices in the PRISMA-ScR for conducting a scoping review [14,86]. Expert opinions in the fields of LLMs and ophthalmology were also consulted. This was in line with best practice recommendations by the Institute of Medicine (US) Committee on Standards for Systematic Reviews of Comparative Effectiveness Research [87], as well as Arksey and O’Malley’s and Levac et al.’s frameworks for scoping reviews [13,88].
To our knowledge, this is the first scoping review to critique the methodology and conduct of LLM research in ophthalmology. Based on these findings describing the current landscape of LLM research in ophthalmology, this study puts forth key recommendations to address the lack of standardisation and ethical regulation amongst LLM-related studies, along with tangible steps to improve the conduct of future works in this field.
Nevertheless, there were shortcomings with regard to the conduct of this review. The search terms chosen aimed to capture all studies relating to LLMs and ophthalmology within the given timeframe. However, due to the rapidly evolving nature of LLMs, newer yet relevant search terms may inadvertently have been missed. MeSH terms were used with the aim of improving the reproducibility of results. However, this ran the risk of missing recent articles not yet indexed. The strict exclusion criteria on study design also sought to improve the quality of evidence collected in this review. Nonetheless, this also runs the risk of missing novel data, such as from case reports, which is especially possible in the growing field of LLMs. As a trending and growing field, advancements in LLMs are rapid, and recent developments are bound to be missed. For instance, promising and relevant LLM models such as DeepSeek were not covered in any of the included studies. Constantly updating the paper to chase each new publication would, however, hamper its progress. Overall, we believe that we have captured a significant portion of time and publications to represent this field at a time when interest in LLMs skyrocketed, while allowing the thoughtful evaluation and discussion of our findings.
The heterogeneity of measures employed in assessing LLMs and the wide range of study designs made it difficult to compare findings across studies and to provide firm conclusions. We therefore sought to summarise the assessment of LLMs by the various studies by placing these evaluations into the overarching categories of “exam taking and patient education”, “diagnostic capability”, “management capability”, “clinical administration”, and “inaccuracies and harm”. Many of the included studies utilised subjective modes of assessment that lacked strength of evidence, for instance, in determining the degree of “correctness” or the frequency of hallucinations and nonlogical reasoning. Nevertheless, such studies were included because this review did not discriminate based on the strength of study design, and to reflect the current climate of how LLMs are assessed.

5. Conclusions

LLMs have received considerable attention through their introduction to the general public and have found potential applications in the field of medicine, and in particular, ophthalmology. The main use cases are in exam taking, patient education, diagnosis and management, and clinical administration. We presented an overview of the landscape of LLM applications in ophthalmology. We also evaluated how past research studies of LLMs in ophthalmology have been carried out and summarised their findings. In our study, we found that the majority of LLMs perform acceptably, with GPT-4.0 having one of the best performances. However, issues pertaining to hallucination, inaccuracies, and harm still exist. We have also identified gaps in the current literature and have made suggestions for future works to improve on, with the hope that future works can form a more cohesive and clinically useful pool of knowledge that can be applied to patients in a safe and ethical manner. By doing so, we can accelerate the wider adoption of LLMs in day-to-day ophthalmology practice. We conclude by advocating for the adoption of standardised frameworks to assess LLMs in healthcare and recommend techniques to improve the performance of LLMs in niche fields such as ophthalmology.

6. Glossary

“Exam taking” refers to the ability of the LLMs to answer multiple-choice questions set for licensing examinations, which are taken by ophthalmology trainees.
“Patient education” refers to the ability of the LLM to produce material appropriate for the layman to introduce medical conditions and provide guidance on treatment and/or monitoring.
“Diagnostic capability” refers to the ability of the LLM to come to the right diagnosis, or at least relevant differentials when posed with questions describing clinical presentations, findings, and/or clinical images. Data were obtained regarding the source of input for the LLM, the number of questions by which the LLM was assessed, and the results of these studies.
“Management capability” refers to the ability of the LLM to manage and treat various eye conditions. Their proposed management plans were graded by trained ophthalmologists and scored accordingly. Scores were extracted and analysed to compare the LLMs in management.
“Clinical administration” refers to utilising the LLM to assist with clinical paperwork; this could be through simplifying clinical notes writing, discharge summaries, or optimising clinical scheduling.
“Inaccuracies” refer to the extent of incorrect answers displayed by the LLMs in response to questions. Data were obtained on the form of inaccuracy made, which varied from study to study. These included “Degree of correctness”, “Degree of inaccuracy”, “Hallucination frequency”, and “Degree of appropriateness” as stated by the individual studies based on their grading systems.
“Harm” refers to the possibility of the answers generated by LLMs, often for management purposes, causing potential harm to patients if used clinically. Data were obtained regarding the potential and likelihood of harm, as well as the extent of harm.

Author Contributions

Conceptualization, Y.K.C.S., X.F. and Z.K.L.; Data Curation, Y.K.C.S., K.S.A.L., W.Y.A., S.Y.C.C., X.F. and Z.K.L.; Formal Analysis, Y.K.C.S., K.S.A.L., W.Y.A., X.F. and Z.K.L.; Investigation, Y.K.C.S., K.S.A.L., W.Y.A., X.F. and Z.K.L.; Methodology, Y.K.C.S., K.S.A.L., W.Y.A., X.F. and Z.K.L.; Resources, Y.K.C.S., K.S.A.L., W.Y.A., S.Y.C.C., X.F. and Z.K.L.; Supervision, Y.K.C.S., X.F. and Z.K.L.; Validation, Y.K.C.S., K.S.A.L., X.F. and Z.K.L.; Writing—original draft, Y.K.C.S., K.S.A.L., W.Y.A., S.Y.C.C., X.F. and Z.K.L.; Writing—review and editing, Y.K.C.S., K.S.A.L., W.Y.A., S.Y.C.C., X.F. and Z.K.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Search strategy.

PubMed (427 results)
(Ophthalmology [MeSH Terms]) OR (Ocular Surgery) OR (Eye Disease) OR (Eye Diseases) OR (Eye Disorders)
AND
(Large Language Model) OR (large language models) OR (large language modelling) OR (Chatbot) OR (ChatGPT) OR (GPT) OR (chatbots) OR (google bard) OR (bing chat) OR (BERT) OR (RoBERTa) OR (distilBERT) OR (BART) OR (MARIAN) OR (llama) OR (palm)
Limits: 2019–2024. Retrieved 11 February 2024.

Embase (122 results)
((Ophthalmology) OR (Ocular Surgery) OR (Eye Disease) OR (Eye Diseases) OR (Eye Disorders)).mp.
AND
(Large Language Model) OR (large language models) OR (large language modelling) OR (Chatbot) OR (ChatGPT) OR (GPT) OR (chatbots) OR (google bard) OR (bing chat) OR (BERT) OR (RoBERTa) OR (distilBERT) OR (BART) OR (MARIAN) OR (llama) OR (palm).mp.
Limits: 2019–2024. Retrieved 11 February 2024.

Scopus (236 results)
TITLE-ABS-KEY ((ophthalmology) OR (ocular AND surgery) OR (eye AND disease) OR (eye AND diseases) OR (eye AND disorders))
AND
TITLE-ABS-KEY ((large AND language AND model) OR (large AND language AND models) OR (large AND language AND modelling) OR (chatbot) OR (chatgpt) OR (gpt) OR (chatbots) OR (google AND bard) OR (bing AND chat) OR (bert) OR (roberta) OR (distilbert) OR (bart) OR (marian))
Limits: 2019–2024. Retrieved 11 February 2024.

Web of Science (86 results)
(Ophthalmology) OR (Ocular Surgery) OR (Eye Disease) OR (Eye Diseases) OR (Eye Disorders) (Abstract)
AND
(Large Language Model) OR (large language models) OR (large language modelling) OR (Chatbot) OR (ChatGPT) OR (GPT) OR (chatbots) OR (google bard) OR (bing chat) OR (BERT) OR (RoBERTa) OR (distilBERT) OR (BART) OR (MARIAN) OR (llama) OR (palm) (Abstract)
Limits: 2019–2024. Retrieved 11 February 2024.

IEEE (8 results)
((“All Metadata”:Ophthalmology) OR (“All Metadata”:”Ocular Surgery”) OR (“All Metadata”:”Eye Disease”) OR (“All Metadata”:”Eye Diseases”) OR (“All Metadata”:”Eye disorders”))
AND
((“All Metadata”:”Large Language Model”) OR (“All Metadata”:large language models) OR (“All Metadata”:ChatGPT) OR (“All Metadata”:GPT) OR (“All Metadata”:chatbots) OR (“All Metadata”:Chatbot) OR (“All Metadata”:”google bard”) OR (“All Metadata”:”bing chat”) OR (“All Metadata”:BERT) OR (“All Metadata”:RoBERTa) OR (“All Metadata”:distilBERT) OR (“All Metadata”:BART) OR (“All Metadata”:MARIAN) OR (“All Metadata”:llama) OR (“All Metadata”:palm))
Limits: 2019–2024 and journals. Retrieved 11 February 2024.

ACM (78 results)
[[All: ophthalmology] OR [All: “ocular surgery”] OR [All: “eye disease”] OR [All: “eye diseases”] OR [All: “eye disorders”]]
AND
[[All: “large language model”] OR [All: or] OR [All: “large language models”] OR [All: “chatgpt”] OR [All: “gpt”] OR [All: “chatbots”] OR [All: “chatbot”] OR [All: “google bard”] OR [All: “bing chat”] OR [All: “bert”] OR [All: “roberta”] OR [All: “distilbert”] OR [All: “bart”] OR [All: “marian”] OR [All: “llama”] OR [All: “palm”]]
Limits: 2019–2024. Retrieved 11 February 2024.

Google Scholar (276 results)
Ophthalmology “Large Language Model” -preprint
Limits: 2019–2024. Retrieved 11 February 2024.

DBLP (69 results)
Ophthal * type:Journal_Articles:
Limits: 2019–2024. Retrieved 11 February 2024.

Total: 1302 records.
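For reproducibility, the searches in Table A1 can be rerun programmatically. A minimal sketch against NCBI's public E-utilities esearch endpoint, with the PubMed strategy abridged to a few terms for brevity (the full string is listed above); retrieved counts will differ from the 427 above because the index has grown since 11 February 2024:

import requests

# Abridged from the PubMed row of Table A1; extend with the remaining terms listed above.
TERM = ("((Ophthalmology[MeSH Terms]) OR (Ocular Surgery) OR (Eye Disease)) "
        "AND ((Large Language Model) OR (ChatGPT) OR (Chatbot))")

resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={
        "db": "pubmed",
        "term": TERM,
        "datetype": "pdat",  # filter on publication date
        "mindate": "2019",
        "maxdate": "2024",
        "retmode": "json",
        "retmax": 0,         # only the hit count is needed here
    },
    timeout=30,
)
print(resp.json()["esearchresult"]["count"])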
Table A2. Examples of variation in scoring systems.

Study | Description of Likert Scale Options
Biswas 2023 [22] | 1: very poor, 2: poor, 3: acceptable, 4: good, 5: very good
Nikdel 2023 [42] | Acceptable, Incomplete, Unacceptable
Al-Sharif 2024 [16] | Comprehensive; Correct but inadequate; Mixed with correct and incorrect/outdated data; Completely incorrect
Maywood 2024 [40] | Correct and comprehensive; Correct but inadequate; Incorrect
Pushpanathan 2023 [47] | Good, Borderline, Poor
Cappellani 2024 [24] | −3: potentially dangerous, −2: very poor, −1: poor, 0: no response, 1: good, 2: very good, 2 *: excellent
Patil 2024 [45] | 5-point scale from Very poor (harmful and incorrect) to Excellent (no errors or false claims)
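As Table A2 shows, grading scales differed in length, polarity, and labels, which complicates pooling results across studies. A minimal sketch of one possible harmonisation, assuming (purely for illustration) that each scale's options can be mapped onto a common 0–1 axis with equal spacing; the mappings below are our own assumptions, not those of the cited studies:

# Hypothetical mappings of three of the scales in Table A2 onto a common 0-1 axis.
SCALE_MAPS = {
    "Biswas 2023 [22]": {"very poor": 0.00, "poor": 0.25, "acceptable": 0.50,
                         "good": 0.75, "very good": 1.00},
    "Nikdel 2023 [42]": {"unacceptable": 0.0, "incomplete": 0.5, "acceptable": 1.0},
    "Pushpanathan 2023 [47]": {"poor": 0.0, "borderline": 0.5, "good": 1.0},
}

def normalise(study: str, grade: str) -> float:
    # Look up a study-specific grade and return its position on the shared axis.
    return SCALE_MAPS[study][grade.lower()]

print(normalise("Biswas 2023 [22]", "Good"))  # 0.75

Equal spacing is a strong assumption (a "poor" on one scale need not equal a "poor" on another), which is precisely why this review recommends standardised grading protocols.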
Table A3. Examples of LLM output and their grading. Each entry lists the study, the prompt input, the LLM output (model shown in << >>), and comments/elaboration.

Study: Delsoz 2023 [26]
Prompt input: An 80-year-old man presented to the clinic for three weeks of painless, acute vision loss in his right eye (OD). He had history of CRVO. On review of systems, he had occasional palpitations secondary to atrial fibrillation for which he was on warfarin and aspirin. He denied weight loss, fatigue, anorexia, chest pain, shortness of breath, mouth ulcers, fevers, nausea, or vomiting. He had been treated for early primary open angle glaucoma (POAG) for seventeen years and was taking latanoprost every night at bedtime in both eyes (OU) and timolol every morning OU for many years with stable visual fields OU. In anterior segment examination, engorged iris vessels were seen in right eye. Surrounding flame hemorrhages with enlarged cup and thin superior and inferior rim OD. What is the most likely diagnosis?
<< GPT 3.5 >>
Based on the information provided, the most likely diagnosis for the 80-year-old man’s acute painless vision loss in his right eye (OD) would be neovascular glaucoma secondary to central retinal vein occlusion (CRVO). Neovascular glaucoma can occur as a complication of CRVO, particularly if there is ischemia (lack of blood supply) in the retina. In response to the ischemic condition, abnormal new blood vessels may form in the anterior segment of the eye, leading to increased intraocular pressure (IOP) and neovascular glaucoma. The presence of engorged iris vessels, flame hemorrhages, an enlarged cup, and thin superior and inferior rim in the optic disc of the affected eye supports the diagnosis of neovascular glaucoma. These findings indicate compromised blood flow and subsequent neovascularization due to the previous CRVO. The man’s history of atrial fibrillation, use of warfarin and aspirin, and age may further contribute to the risk of vascular occlusions and subsequent neovascularization.
Comments: Example of a correct diagnosis (Table 7). In this example, GPT 3.5 provided the correct top differential for the condition in the vignette and provided a sound basis for its answer. Accordingly, it was marked as a correct answer by the study.
Study: Cai 2023 [23]
Prompt input: What anterior segment finding(s) is (are) characteristic of Axenfeld-Rieger syndrome? A. peripheral anterior synechiae B. fibrillar material in the anterior chamber angle C. corectopia and polycoria D. abnormal endothelial cells
<< GPT 4.0 >>
Answer A. peripheral anterior synechiae
Axenfeld-Rieger syndrome is a rare genetic disorder affecting the development of the eye, teeth, and facial structures. Characteristic anterior segment findings in the eye include peripheral anterior synechiae (adhesions between the iris and the cornea), iris stromal hypoplasia, and prominent Schwalbe’s line (posterior embryotoxon). Other ocular findings can include corectopia (displacement of the pupil), but polycoria (multiple pupils) is not a typical feature of Axenfeld-Rieger syndrome.
Comments: Example of an incorrect diagnosis (Table 7). In this example, the answer should be C, which lists characteristic features of Axenfeld-Rieger syndrome. The authors of this paper placed this question under “Diagnosis”, as they were attempting to match the correct clinical feature to a specific diagnosis.
Study: Lyons 2023 [39]
Prompt input: I am a 25-year-old man. My vision is blurry, and my eyes are red and sensitive to light. I have watery discharge. I have a fever and sore throat. My symptoms started 2 days ago.
<< Bing Chat >>
Sure, here are some references for home remedies for conjunctivitis:
Honey: Honey has antibacterial properties that can help fight off the infection. You can apply honey directly to your eye or mix it with warm water and use it as an eye wash.
Green tea: Green tea contains bioflavonoids that can help reduce inflammation and fight off infection. You can use green tea bags as a warm compress.
Aloe vera gel: Aloe vera gel has anti-inflammatory properties that can help reduce redness and swelling. You can apply aloe vera gel directly to your eye.
Neem oil: Neem oil has antibacterial properties that can help fight off the infection. You can apply neem oil directly to your eye or mix it with warm water and use it as an eye wash.
Artificial tears: Artificial tears can help relieve dryness and irritation.
Over-the-counter eye drops or ointments: Over-the-counter eye drops or ointments can help relieve symptoms.
It is important to note that home remedies are not a substitute for medical treatment. If your symptoms persist or worsen, it is important to see a doctor as soon as possible.
Comments: Example of a plausible diagnosis but with harmful output (Table 7). Without prompting, Bing Chat provided “home remedy” recommendations for conjunctivitis that are not appropriate and could potentially be harmful to the eye.
Study: Cai 2023 [23]
Prompt input: What is the procedure of choice in treating medically uncontrolled NVG when the patient has good visual potential? A. goniosynechialysis B. trabeculectomy C. aqueous shunt D. cyclophotocoagulation
<< GPT 4.0 >>
Answer B. trabeculectomy
In patients with NVG who have good visual potential, trabeculectomy is often the procedure of choice. This surgical procedure creates an alternate drainage pathway for aqueous humor, which can help lower intraocular pressure. However, it is important to note that managing NVG usually involves a combination of treatments, including anti-VEGF injections and pan-retinal photocoagulation to address the underlying neovascularization.
Comments: Example of wrong management (Table 8). In this example, the expected answer was option C. This question required subspecialty knowledge to answer, demonstrating that GPT 4.0 did not yet possess that level of knowledge at the point of testing.
Study: Biswas 2023 [22]
Prompt input: Which is the single most successful treatment strategy for myopia?
<< GPT 3.5 >>
There is no single most successful treatment strategy for myopia, as the best approach will depend on the individual’s age, severity of myopia, lifestyle and personal preferences. Each treatment option has its own benefits and risks and what works best for one person may not work as well for another person.
That being said, some treatments may be more appropriate for certain individuals or situations. For example, corrective lenses, such as glasses or contact lenses, are often the first line of treatment for myopia and can be effective in correcting vision. Refractive surgery, such as LASIK, PRK or SMILE, can also be highly effective in correcting myopia but it may not be appropriate for everyone.
Orthokeratology or ‘ortho-k’ is another treatment option that can be effective in reducing myopia, particularly in children. Atropine eye drops can also be effective in slowing the progression of myopia, particularly in children, but it may not be appropriate for everyone and may have side effects.
It is important to consult with an eye care professional to determine the best treatment option for your individual needs and situation. Regular eye examinations can help monitor the progression of myopia and detect any potential complications early on.
Comments: Example of reasonably correct management (Table 8). In this example, the output was given an average rating of “good” by the evaluators. GPT 3.5 did well in providing a safe answer while fleshing out certain valid treatment options. It did not, however, obtain the highest score, as it did not demonstrate a strong understanding of the various treatment modalities and missed out on some, such as myopia control glasses.
Study: Hua 2023 [33]
Prompt input: How effective are oral corticosteroids compared to intravenous corticosteroids in the treatment of optic neuritis? Write an abstract and give 10 references.
<< GPT 3.5 >>
A citation, “Beck RW, Trobe JD, Moke PS, et al. High-dose corticosteroids for optic neuritis: five-year visual function and disability outcomes. Ophthalmology. 2001;108(4):625–633”, was generated by GPT and was not a legitimate publication.
Comments: Example of a hallucination in performing administrative tasks—manuscript writing (Table 9). GPT 3.5 made up its own citation while passing it off as a verified source.
Study: Cai 2023 [23]
Prompt input: What is the eponym given to the iron deposit at the leading edge of a pterygium? A. Ferry line B. Stocker line C. Mannis line D. Fleischer line
<< GPT 4.0 >>
Answer A. Ferry line.
The eponym given to the iron deposit at the leading edge of a pterygium is called a Ferry line.
Comments: Example of an inaccuracy of basic facts (Table 10). The answer should be B (Stocker line); this demonstrates a straightforward factual error by GPT 4.0 on a fact that can easily be found online. While the answer was wrong, GPT went a step further and produced a wrong explanation as well.
Study: Bernstein 2023 [21]
Prompt input: My brother had a lens replacement and the size of his eye got shrinked. is there any solution to fix this problem.
<< GPT 3.5 >>
It is possible for the size of the eye to appear smaller after cataract surgery and lens replacement, as the removal of the cataract can cause a decrease in the size of the eye. However, it is unlikely that the actual size of the eye has changed. It is important for your brother to follow up with his eye doctor to determine the cause of his symptoms and to discuss any possible solutions. Your brother’s doctor may suggest options such as adjusting the power of his intraocular lens or providing specialized glasses to improve his vision. In some cases, additional surgery may be necessary to address any complications or issues related to the initial surgery. It is important for your brother to receive proper care and follow-up from his eye doctor to ensure the best possible outcome for his vision.
Comments: Example of an inappropriate answer with hallucination (Table 10). Cataract surgery does not cause the size of the eye to become smaller. It might cause ptosis, which may make the eye appear smaller, but the AI here claims a decrease in the size of the eye itself. In this case, the LLM created false information by producing its own theories.
Study: Bernstein 2023 [21]
Prompt input: About a year ago cataract was detected in my both eyes and recommended a cataract operation. I was hesitant to do it fearing the downsides of retinal detachment. But five days ago, while I was watching TV suddenly I saw something in my left eye vision but without flashes. Looks like ribbon in shape, reddish color waving from up to down. I was so shocked and went immediately to hospital to emergency section thinking that it was retinal detachment and the eye doctor who examined me told me that there was neither retinal detachment nor tear but there was PVD causing what I saw. And I was told nothing could be done but advised to go immediately if I see curtain like in my vision. But now my sight as result of this dramatically has decreased. And I see very little with this left eye. (My left eye was the strongest eye my right eye had a leaser operation for retinal hole 3 years ago due to that it is weak). My central vision is very much affected; my astigmatism on this eye went from −4.5 to −7.25. I see a lot of thread and huge clouds like floaters which prevent me to read. All my vision covered with dust like particles. I am worried and nervous. I made an appointment to see another doctor for another opinion on 13. 02. 08. But in the mean time I would be grateful to answer me the following questions: 1. Does PVD reduce vision drastically 2. Does PVD results high astigmatism 3. Is it normal to have PVD without flashes 4. Does PVD result glaucoma 5. Does PVD worsens cataract 6. After PVD, is the side effect of having retinal detachment due to cataract operation less or not Help me please. I am very much devastated. I am not able to work.
<< GPT 3.5 >>
I’m sorry to hear about your experience. I’ll try to answer your questions below:
PVD (posterior vitreous detachment) can sometimes cause a decrease in vision, especially if there is associated bleeding or swelling in the eye. However, in most cases, vision is not severely affected.
PVD can cause changes in the shape of the eye, which can result in an increase in astigmatism. This change can sometimes be temporary or may require a new glasses prescription.
Yes, it is possible to have PVD without experiencing flashes of light. Flashes of light are caused by the vitreous tugging on the retina, but in some cases, the vitreous can detach without causing flashes.
PVD does not typically result in glaucoma. However, glaucoma is a separate condition that can occur in some people, and it is important to have regular eye exams to monitor for this and other eye conditions.
PVD does not worsen cataracts. Cataracts are a separate condition that can develop over time, and they can be treated with surgery if they start to significantly affect vision.
PVD and cataract surgery are separate procedures that do not necessarily affect each other. However, it is important to inform your eye surgeon of any previous eye conditions before having cataract surgery to ensure the best possible outcome.
I hope this information is helpful, but it’s important to keep in mind that every case is unique, and your eye doctor will be able to provide you with the most accurate information and recommendations for your specific situation.
Comments: Example of harm due to an incorrect answer (Table 11). The prompt paints a worrying picture for retinal detachment, which warrants ophthalmological review and treatment; this was completely overlooked in the LLM’s reply. Should a real patient with the above input receive such a reply, a potentially missed retinal detachment would result in permanent loss of sight in that eye.
Study: Maywood 2024 [40]
Prompt input: Answer as if you were a vitreoretinal surgeon. In a patient with diabetic macular edema, at what point would you stop anti-VEGF treatment?
<< GPT 3.5 Turbo >>
As an AI language model, I do not have the expertise to provide medical advice. However, typically in a patient with diabetic macular edema, the decision to stop anti-VEGF treatment would depend on various factors such as the patient’s response to treatment, the severity of the condition, and the potential risks and benefits of continuing treatment. The decision should be made by the treating physician in consultation with the patient after carefully considering all relevant factors.
Comments: Example of a safe reply from the LLM that nonetheless adds little value. This is a real-world, day-to-day scenario in patients with diabetic macular edema and a decision retina doctors face regularly. The LLM’s reply demonstrates that it is, at this point, very much unable to replace doctors, especially in medical decision making.
Study: Cappellani 2024 [24]
Prompt input: How is retinoblastoma diagnosed?
<< GPT 3.5 >>
Retinoblastoma is typically diagnosed through a combination of techniques, including:
A physical examination of the eye, including a dilated eye exam to check for signs of the tumor.
Imaging tests such as an ultrasound, CT scan, or MRI, which can help to confirm the presence of the tumor and determine its size and location.
A biopsy, in which a small sample of tissue is taken from the eye and examined under a microscope to confirm the diagnosis.
Fluorescein angiography, in which a special dye is injected into the bloodstream and the eye is examined with a special camera to detect any abnormal blood vessels that may be associated with the tumor.
Electroretinography (ERG) which is a test that measure the electrical activity in the retina, it is used to detect retinoblastoma.
It is important to note that early diagnosis is crucial for successful treatment of retinoblastoma. Therefore, any parents or caregivers who suspect their child may have retinoblastoma should seek medical attention as soon as possible.
Comments: Example of a harmful reply (Table 11). The recommendation for a biopsy in the context of retinoblastoma is inaccurate and poses a significant risk. Biopsies are infrequently indicated for retinoblastoma due to the high potential of tumor seeding. This erroneous suggestion could lead to harm if directly followed.

References

  1. De Angelis, L.; Baglivo, F.; Arzilli, G.; Privitera, G.P.; Ferragina, P.; Tozzi, A.E.; Rizzo, C. ChatGPT and the rise of large language models: The new AI-driven infodemic threat in public health. Front. Public Health 2023, 11, 1166120. [Google Scholar] [CrossRef] [PubMed]
  2. Haupt, C.E.; Marks, M. AI-Generated Medical Advice-GPT and Beyond. JAMA 2023, 329, 1349–1350. [Google Scholar] [CrossRef]
  3. Kung, T.H.; Cheatham, M.; Medenilla, A.; Sillos, C.; De Leon, L.; Elepaño, C.; Madriaga, M.; Aggabao, R.; Diaz-Candido, G.; Maningo, J.; et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit. Health 2023, 2, e0000198. [Google Scholar] [CrossRef]
  4. Liu, Z.; He, X.; Liu, L.; Liu, T.; Zhai, X. Context Matters: A Strategy to Pre-train Language Model for Science Education. In Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky; Springer Nature: Cham, Switzerland, 2023; pp. 666–674. [Google Scholar]
  5. Potapenko, I.; Boberg-Ans, L.C.; Stormly Hansen, M.; Klefter, O.N.; van Dijk, E.H.C.; Subhi, Y. Artificial intelligence-based chatbot patient information on common retinal diseases using ChatGPT. Acta Ophthalmol. 2023, 101, 829–831. [Google Scholar] [CrossRef] [PubMed]
  6. Thirunavukarasu, A.J.; Hassan, R.; Mahmood, S.; Sanghera, R.; Barzangi, K.; El Mukashfi, M.; Shah, S. Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care. JMIR Med. Educ. 2023, 9, e46599. [Google Scholar] [CrossRef]
  7. Betzler, B.K.; Chen, H.; Cheng, C.Y.; Lee, C.S.; Ning, G.; Song, S.J.; Lee, A.Y.; Kawasaki, R.; van Wijngaarden, P.; Grzybowski, A.; et al. Large language models and their impact in ophthalmology. Lancet Digit. Health 2023, 5, e917–e924. [Google Scholar] [CrossRef]
  8. Nath, S.; Marie, A.; Ellershaw, S.; Korot, E.; Keane, P.A. New meaning for NLP: The trials and tribulations of natural language processing with GPT-3 in ophthalmology. Br. J. Ophthalmol. 2022, 106, 889–892. [Google Scholar] [CrossRef] [PubMed]
  9. Soh, Z.D.; Cheng, C.Y. Application of big data in ophthalmology. Taiwan J. Ophthalmol. 2023, 13, 123–132. [Google Scholar] [CrossRef]
  10. Wong, M.; Lim, Z.W.; Pushpanathan, K.; Cheung, C.Y.; Wang, Y.X.; Chen, D.; Tham, Y.C. Review of emerging trends and projection of future developments in large language models research in ophthalmology. Br. J. Ophthalmol. 2024, 108, 1362–1370. [Google Scholar] [CrossRef]
  11. Jin, K.; Yuan, L.; Wu, H.; Grzybowski, A.; Ye, J. Exploring large language model for next generation of artificial intelligence in ophthalmology. Front. Med. 2023, 10, 1291404. [Google Scholar] [CrossRef]
  12. Ibrahim, H.; Liu, X.; Rivera, S.C.; Moher, D.; Chan, A.W.; Sydes, M.R.; Calvert, M.J.; Denniston, A.K. Reporting guidelines for clinical trials of artificial intelligence interventions: The SPIRIT-AI and CONSORT-AI guidelines. Trials 2021, 22, 11. [Google Scholar] [CrossRef] [PubMed]
  13. Arksey, H.; O’Malley, L. Scoping studies: Towards a methodological framework. Int. J. Soc. Res. Methodol. 2005, 8, 19–32. [Google Scholar] [CrossRef]
  14. Tricco, A.C.; Lillie, E.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; Moher, D.; Peters, M.D.J.; Horsley, T.; Weeks, L.; et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann. Intern. Med. 2018, 169, 467–473. [Google Scholar] [CrossRef] [PubMed]
  15. Ali, M.J. ChatGPT and Lacrimal Drainage Disorders: Performance and Scope of Improvement. Ophthalmic Plast. Reconstr. Surg. 2023, 39, 221–225. [Google Scholar] [CrossRef]
  16. Al-Sharif, E.M.; Penteado, R.C.; Dib El Jalbout, N.; Topilow, N.J.; Shoji, M.K.; Kikkawa, D.O.; Liu, C.Y.; Korn, B.S. Evaluating the Accuracy of ChatGPT and Google BARD in Fielding Oculoplastic Patient Queries: A Comparative Study on Artificial versus Human Intelligence. Ophthalmic Plast. Reconstr. Surg. 2024, 40, 303–311. [Google Scholar] [CrossRef]
  17. Antaki, F.; Milad, D.; Chia, M.A.; Giguère, C.; Touma, S.; El-Khoury, J.; Keane, P.A.; Duval, R. Capabilities of GPT-4 in ophthalmology: An analysis of model entropy and progress towards human-level medical question answering. Br. J. Ophthalmol. 2024, 108, 1371–1378. [Google Scholar] [CrossRef]
  18. Antaki, F.; Touma, S.; Milad, D.; El-Khoury, J.; Duval, R. Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings. Ophthalmol. Sci. 2023, 3, 100324. [Google Scholar] [CrossRef]
  19. Balas, M.; Janic, A.; Daigle, P.; Nijhawan, N.; Hussain, A.; Gill, H.; Lahaie, G.L.; Belliveau, M.J.; Crawford, S.A.; Arjmand, P.; et al. Evaluating ChatGPT on Orbital and Oculofacial Disorders: Accuracy and Readability Insights. Ophthalmic Plast. Reconstr. Surg. 2024, 40, 217–222. [Google Scholar] [CrossRef]
  20. Barclay, K.S.; You, J.Y.; Coleman, M.J.; Mathews, P.M.; Ray, V.L.; Riaz, K.M.; De Rojas, J.O.; Wang, A.S.; Watson, S.H.; Koo, E.H.; et al. Quality and Agreement With Scientific Consensus of ChatGPT Information Regarding Corneal Transplantation and Fuchs Dystrophy. Cornea 2024, 43, 746–750. [Google Scholar] [CrossRef]
  21. Bernstein, I.A.; Zhang, Y.V.; Govil, D.; Majid, I.; Chang, R.T.; Sun, Y.; Shue, A.; Chou, J.C.; Schehlein, E.; Christopher, K.L.; et al. Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions. JAMA Netw. Open 2023, 6, e2330320. [Google Scholar] [CrossRef]
  22. Biswas, S.; Logan, N.S.; Davies, L.N.; Sheppard, A.L.; Wolffsohn, J.S. Assessing the utility of ChatGPT as an artificial intelligence-based large language model for information to answer questions on myopia. Ophthalmic Physiol. Opt. 2023, 43, 1562–1570. [Google Scholar] [CrossRef] [PubMed]
  23. Cai, L.Z.; Shaheen, A.; Jin, A.; Fukui, R.; Yi, J.S.; Yannuzzi, N.; Alabiad, C. Performance of Generative Large Language Models on Ophthalmology Board-Style Questions. Am. J. Ophthalmol. 2023, 254, 141–149. [Google Scholar] [CrossRef]
  24. Cappellani, F.; Card, K.R.; Shields, C.L.; Pulido, J.S.; Haller, J.A. Reliability and accuracy of artificial intelligence ChatGPT in providing information on ophthalmic diseases and management to patients. Eye 2024, 38, 1368–1373. [Google Scholar] [CrossRef] [PubMed]
  25. Ćirković, A.; Katz, T. Exploring the Potential of ChatGPT-4 in Predicting Refractive Surgery Categorizations: Comparative Study. JMIR Form. Res. 2023, 7, e51798. [Google Scholar] [CrossRef] [PubMed]
  26. Delsoz, M.; Raja, H.; Madadi, Y.; Tang, A.A.; Wirostko, B.M.; Kahook, M.Y.; Yousefi, S. The Use of ChatGPT to Assist in Diagnosing Glaucoma Based on Clinical Case Reports. Ophthalmol. Ther. 2023, 12, 3121–3132. [Google Scholar] [CrossRef]
  27. Sensoy, E.; Citirik, M. Assessing the Competence of Artificial Intelligence Programs in Pediatric Ophthalmology and Strabismus and Comparing their Relative Advantages. Rom. J. Ophthalmol. 2023, 67, 389–393. [Google Scholar]
  28. Eid, K.; Eid, A.; Wang, D.; Raiker, R.S.; Chen, S.; Nguyen, J. Optimizing Ophthalmology Patient Education via ChatBot-Generated Materials: Readability Analysis of AI-Generated Patient Education Materials and The American Society of Ophthalmic Plastic and Reconstructive Surgery Patient Brochures. Ophthalmic Plast. Reconstr. Surg. 2024, 40, 212–216. [Google Scholar] [CrossRef]
  29. Ferro Desideri, L.; Roth, J.; Zinkernagel, M.; Anguita, R. Application and accuracy of artificial intelligence-derived large language models in patients with age related macular degeneration. Int. J. Retin. Vitr. 2023, 9, 71. [Google Scholar] [CrossRef]
  30. Fowler, T.; Pullen, S.; Birkett, L. Performance of ChatGPT and Bard on the official part 1 FRCOphth practice questions. Br. J. Ophthalmol. 2024, 108, 1379–1383. [Google Scholar] [CrossRef]
  31. Haddad, F.; Saade, J.S. Performance of ChatGPT on Ophthalmology-Related Questions Across Various Examination Levels: Observational Study. JMIR Med. Educ. 2024, 10, e50842. [Google Scholar] [CrossRef]
  32. Hu, W.; Wang, S.Y. Predicting Glaucoma Progression Requiring Surgery Using Clinical Free-Text Notes and Transfer Learning With Transformers. Transl. Vis. Sci. Technol. 2022, 11, 37. [Google Scholar] [CrossRef] [PubMed]
  33. Hua, H.U.; Kaakour, A.H.; Rachitskaya, A.; Srivastava, S.; Sharma, S.; Mammo, D.A. Evaluation and Comparison of Ophthalmic Scientific Abstracts and References by Current Artificial Intelligence Chatbots. JAMA Ophthalmol. 2023, 141, 819–824. [Google Scholar] [CrossRef] [PubMed]
  34. Jiao, C.; Edupuganti, N.R.; Patel, P.A.; Bui, T.; Sheth, V. Evaluating the Artificial Intelligence Performance Growth in Ophthalmic Knowledge. Cureus 2023, 15, e45700. [Google Scholar] [CrossRef]
  35. Kianian, R.; Sun, D.; Crowell, E.L.; Tsui, E. The Use of Large Language Models to Generate Education Materials about Uveitis. Ophthalmol. Retin. 2024, 8, 195–201. [Google Scholar] [CrossRef]
  36. Kianian, R.; Sun, D.; Giaconi, J. Can ChatGPT Aid Clinicians in Educating Patients on the Surgical Management of Glaucoma? J. Glaucoma 2024, 33, 94–100. [Google Scholar] [CrossRef]
  37. Lim, Z.W.; Pushpanathan, K.; Yew, S.M.E.; Lai, Y.; Sun, C.H.; Lam, J.S.H.; Chen, D.Z.; Goh, J.H.L.; Tan, M.C.J.; Sheng, B.; et al. Benchmarking large language models’ performances for myopia care: A comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine 2023, 95, 104770. [Google Scholar] [CrossRef] [PubMed]
  38. Liu, X.; Wu, J.; Shao, A.; Shen, W.; Ye, P.; Wang, Y.; Ye, J.; Jin, K.; Yang, J. Uncovering Language Disparity of ChatGPT on Retinal Vascular Disease Classification: Cross-Sectional Study. J. Med. Internet Res. 2024, 26, e51926. [Google Scholar] [CrossRef]
  39. Lyons, R.J.; Arepalli, S.R.; Fromal, O.; Choi, J.D.; Jain, N. Artificial intelligence chatbot performance in triage of ophthalmic conditions. Can. J. Ophthalmol. 2024, 59, e301–e308. [Google Scholar] [CrossRef]
  40. Maywood, M.J.; Parikh, R.; Deobhakta, A.; Begaj, T. Performance Assessment of An Artificial Intelligence Chatbot in Clinical Vitreoretinal Scenarios. Retina 2024, 44, 954–964. [Google Scholar] [CrossRef]
  41. Moshirfar, M.; Altaf, A.W.; Stoakes, I.M.; Tuttle, J.J.; Hoopes, P.C. Artificial Intelligence in Ophthalmology: A Comparative Analysis of GPT-3.5, GPT-4, and Human Expertise in Answering StatPearls Questions. Cureus 2023, 15, e40822. [Google Scholar] [CrossRef]
  42. Nikdel, M.; Ghadimi, H.; Tavakoli, M.; Suh, D.W. Assessment of the Responses of the Artificial Intelligence-based Chatbot ChatGPT-4 to Frequently Asked Questions About Amblyopia and Childhood Myopia. J. Pediatr. Ophthalmol. Strabismus 2024, 61, 86–89. [Google Scholar] [CrossRef] [PubMed]
  43. Ong, J.; Kedia, N.; Harihar, S.; Vupparaboina, S.C.; Singh, S.R.; Venkatesh, R.; Vupparaboina, K.; Bollepalli, S.C.; Chhablani, J. Applying large language model artificial intelligence for retina International Classification of Diseases (ICD) coding. J. Med. Artif. Intell. 2023, 6. Available online: https://jmai.amegroups.org/article/view/8198/html (accessed on 25 March 2025). [CrossRef]
  44. Panthier, C.; Gatinel, D. Success of ChatGPT, an AI language model, in taking the French language version of the European Board of Ophthalmology examination: A novel approach to medical knowledge assessment. J. Fr. Ophtalmol. 2023, 46, 706–711. [Google Scholar] [CrossRef]
  45. Patil, N.S.; Huang, R.; Mihalache, A.; Kisilevsky, E.; Kwok, J.; Popovic, M.M.; Nassrallah, G.; Chan, C.; Mallipatna, A.; Kertes, P.J.; et al. The Ability of Artificial Intelligence Chatbots Chatgpt and Google Bard to Accurately Convey Preoperative Information for Patients Undergoing Ophthalmic Surgeries. Retina 2024, 44, 950–953. [Google Scholar] [CrossRef]
  46. Potapenko, I.; Malmqvist, L.; Subhi, Y.; Hamann, S. Artificial Intelligence-Based ChatGPT Responses for Patient Questions on Optic Disc Drusen. Ophthalmol. Ther. 2023, 12, 3109–3119. [Google Scholar] [CrossRef] [PubMed]
  47. Pushpanathan, K.; Lim, Z.W.; Er Yew, S.M.; Chen, D.Z.; Hui’En Lin, H.A.; Lin Goh, J.H.; Wong, W.M.; Wang, X.; Jin Tan, M.C.; Chang Koh, V.T.; et al. Popular large language model chatbots’ accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries. iScience 2023, 26, 108163. [Google Scholar] [CrossRef]
  48. Raghu, K.; Tamilselvi, S.; Devishamani, C.S.; Manikandan, S.; Rajalakshmi, R.; Raman, R. The Utility of ChatGPT in Diabetic Retinopathy Risk Assessment: A Comparative Study with Clinical Diagnosis. Clin. Ophthalmol. 2023, 17, 4021–4031. [Google Scholar] [CrossRef]
  49. Rojas-Carabali, W.; Sen, A.; Agarwal, A.; Tan, G.; Cheung, C.Y.; Rousselot, A.; Agrawal, R.; Liu, R.; Cifuentes-González, C.; Elze, T.; et al. Chatbots Vs. Human Experts: Evaluating Diagnostic Performance of Chatbots in Uveitis and the Perspectives on AI Adoption in Ophthalmology. Ocul. Immunol. Inflamm. 2024, 32, 1591–1598. [Google Scholar] [CrossRef]
  50. Rojas-Carabali, W.; Cifuentes-González, C.; Wei, X.; Putera, I.; Sen, A.; Thng, Z.X.; Agrawal, R.; Elze, T.; Sobrin, L.; Kempen, J.H.; et al. Evaluating the Diagnostic Accuracy and Management Recommendations of ChatGPT in Uveitis. Ocul. Immunol. Inflamm. 2024, 32, 1526–1531. [Google Scholar] [CrossRef]
  51. Sakai, D.; Maeda, T.; Ozaki, A.; Kanda, G.N.; Kurimoto, Y.; Takahashi, M. Performance of ChatGPT in Board Examinations for Specialists in the Japanese Ophthalmology Society. Cureus 2023, 15, e49903. [Google Scholar] [CrossRef]
  52. Sensoy, E.; Citirik, M. A comparative study on the knowledge levels of artificial intelligence programs in diagnosing ophthalmic pathologies and intraocular tumors evaluated their superiority and potential utility. Int. Ophthalmol. 2023, 43, 4905–4909. [Google Scholar] [CrossRef] [PubMed]
  53. Shemer, A.; Cohen, M.; Altarescu, A.; Atar-Vardi, M.; Hecht, I.; Dubinsky-Pertzov, B.; Shoshany, N.; Zmujack, S.; Or, L.; Einan-Lifshitz, A.; et al. Diagnostic capabilities of ChatGPT in ophthalmology. Graefes Arch. Clin. Exp. Ophthalmol. 2024, 262, 2345–2352. [Google Scholar] [CrossRef]
  54. Singer, M.B.; Fu, J.J.; Chow, J.; Teng, C.C. Development and Evaluation of Aeyeconsult: A Novel Ophthalmology Chatbot Leveraging Verified Textbook Knowledge and GPT-4. J. Surg. Educ. 2024, 81, 438–443. [Google Scholar] [CrossRef] [PubMed]
  55. Singh, S.; Djalilian, A.; Ali, M.J. ChatGPT and Ophthalmology: Exploring Its Potential with Discharge Summaries and Operative Notes. Semin. Ophthalmol. 2023, 38, 503–507. [Google Scholar] [CrossRef]
  56. Tailor, P.D.; Dalvin, L.A.; Chen, J.J.; Iezzi, R.; Olsen, T.W.; Scruggs, B.A.; Barkmeier, A.J.; Bakri, S.J.; Ryan, E.H.; Tang, P.H.; et al. A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Expert-Edited Large Language Models Alone. Ophthalmol. Sci. 2024, 4, 100485. [Google Scholar] [CrossRef]
  57. Taloni, A.; Borselli, M.; Scarsi, V.; Rossi, C.; Coco, G.; Scorcia, V.; Giannaccare, G. Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology. Sci. Rep. 2023, 13, 18562. [Google Scholar] [CrossRef]
  58. Tao, B.K.; Handzic, A.; Hua, N.J.; Vosoughi, A.R.; Margolin, E.A.; Micieli, J.A. Utility of ChatGPT for Automated Creation of Patient Education Handouts: An Application in Neuro-Ophthalmology. J. Neuroophthalmol. 2024, 44, 119–124. [Google Scholar] [CrossRef] [PubMed]
  59. Teebagy, S.; Colwell, L.; Wood, E.; Yaghy, A.; Faustina, M. Improved Performance of ChatGPT-4 on the OKAP Examination: A Comparative Study with ChatGPT-3.5. J. Acad. Ophthalmol. 2023, 15, e184–e187. [Google Scholar] [CrossRef]
  60. Wilhelm, T.I.; Roos, J.; Kaczmarczyk, R. Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study. J. Med. Internet Res. 2023, 25, e49324. [Google Scholar] [CrossRef]
  61. Wu, G.; Lee, D.A.; Zhao, W.; Wong, A.; Sidhu, S. ChatGPT: Is it good for our glaucoma patients? Front. Ophthalmol. 2023, 3, 1260415. [Google Scholar] [CrossRef]
  62. Yılmaz, I.B.E.; Doğan, L. Talking technology: Exploring chatbots as a tool for cataract patient education. Clin. Exp. Optom. 2025, 108, 56–64. [Google Scholar] [CrossRef] [PubMed]
  63. Zandi, R.; Fahey, J.D.; Drakopoulos, M.; Bryan, J.M.; Dong, S.; Bryar, P.J.; Bidwell, A.E.; Bowen, R.C.; Lavine, J.A.; Mirza, R.G. Exploring Diagnostic Precision and Triage Proficiency: A Comparative Study of GPT-4 and Bard in Addressing Common Ophthalmic Complaints. Bioengineering 2024, 11, 120. [Google Scholar] [CrossRef] [PubMed]
  64. Cruz Rivera, S.; Liu, X.; Chan, A.-W.; Denniston, A.K.; Calvert, M.J.; Darzi, A.; Holmes, C.; Yau, C.; Moher, D.; Ashrafian, H.; et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI extension. Nat. Med. 2020, 26, 1351–1363. [Google Scholar] [CrossRef]
  65. Karargyris, A.; Umeton, R.; Sheller, M.J.; Aristizabal, A.; George, J.; Wuest, A.; Pati, S.; Kassem, H.; Zenk, M.; Baid, U.; et al. Federated benchmarking of medical artificial intelligence with MedPerf. Nat. Mach. Intell. 2023, 5, 799–810. [Google Scholar] [CrossRef]
  66. European Commission: Directorate-General for Communications Networks, Content and Technology. Ethics Guidelines for Trustworthy AI; Publications Office: Luxembourg, 2019. [Google Scholar]
  67. Dave, T.; Athaluri, S.A.; Singh, S. ChatGPT in medicine: An overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front. Artif. Intell. 2023, 6, 1169595. [Google Scholar] [CrossRef]
  68. Dam, S.K.; Hong, C.S.; Qiao, Y.; Zhang, C. A Complete Survey on LLM-based AI Chatbots. arXiv 2024, arXiv:2406.16937. [Google Scholar]
  69. Waisberg, E.; Ong, J.; Masalkhi, M.; Zaman, N.; Sarker, P.; Lee, A.G.; Tavakkoli, A. GPT-4 and medical image analysis: Strengths, weaknesses and future directions. J. Med. Artif. Intell. 2023, 6, 29. [Google Scholar] [CrossRef]
  70. Eberhard, D.M.; Simons, G.F.; Fennig, C.D. Ethnologue: Languages of the World, 28th ed.; Ethnologue: Dallas, TX, USA, 2025. [Google Scholar]
  71. Wang, M.Y.; Asanad, S.; Asanad, K.; Karanjia, R.; Sadun, A.A. Value of medical history in ophthalmology: A study of diagnostic accuracy. J. Curr. Ophthalmol. 2018, 30, 359–364. [Google Scholar] [CrossRef] [PubMed]
  72. Reddy, S. Evaluating large language models for use in healthcare: A framework for translational value assessment. Inform. Med. Unlocked 2023, 41, 101304. [Google Scholar] [CrossRef]
  73. Park, Y.-J.; Pillai, A.; Deng, J.; Guo, E.; Gupta, M.; Paget, M.; Naugler, C. Assessing the research landscape and clinical utility of large language models: A scoping review. BMC Med. Inform. Decis. Mak. 2024, 24, 72. [Google Scholar] [CrossRef]
  74. Liu, F.; Li, Z.; Zhou, H.; Yin, Q.; Yang, J.; Tang, X.; Luo, C.; Zeng, M.; Jiang, H.; Gao, Y.; et al. Large Language Models in the Clinic: A Comprehensive Benchmark. arXiv 2024, arXiv:2405.00716. [Google Scholar]
  75. Mohammadi, I.; Firouzabadi, S.R.; Kohandel Gargari, O.; Habibi, G. Standardized Assessment Framework for Evaluations of Large Language Models in Medicine (SAFE-LLM). Preprints 2025, 2025010471. [Google Scholar]
  76. Kuş, Z.; Aydin, M. MedSegBench: A comprehensive benchmark for medical image segmentation in diverse data modalities. Sci. Data 2024, 11, 1283. [Google Scholar] [CrossRef]
  77. Han, T.; Kumar, A.; Agarwal, C.; Lakkaraju, H. MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models. arXiv 2024, arXiv:2403.03744. [Google Scholar]
  78. Privacy Policy. Available online: https://openai.com/policies/row-privacy-policy (accessed on 10 February 2025).
  79. Ong, J.C.L.; Chang, S.Y.; William, W.; Butte, A.J.; Shah, N.H.; Chew, L.S.T.; Liu, N.; Doshi-Velez, F.; Lu, W.; Savulescu, J.; et al. Ethical and regulatory challenges of large language models in medicine. Lancet Digit. Health 2024, 6, e428–e432. [Google Scholar] [CrossRef] [PubMed]
  80. Mugaanyi, J.; Cai, L.; Cheng, S.; Lu, C.; Huang, J. Evaluation of Large Language Model Performance and Reliability for Citations and References in Scholarly Writing: Cross-Disciplinary Study. J. Med. Internet Res. 2024, 26, e52935. [Google Scholar] [CrossRef] [PubMed]
  81. Patil, R.; Heston, T.F.; Bhuse, V. Prompt Engineering in Healthcare. Electronics 2024, 13, 2961. [Google Scholar] [CrossRef]
  82. Schillinger, D. Social Determinants, Health Literacy, and Disparities: Intersections and Controversies. Health Lit. Res. Pract. 2021, 5, e234–e243. [Google Scholar] [CrossRef]
  83. Al Ghadban, Y.; Lu, H.; Adavi, U.; Sharma, A.; Gara, S.; Das, N.; Kumar, B.; John, R.; Devarsetty, P.; Hirst, J.E. Transforming Healthcare Education: Harnessing Large Language Models for Frontline Health Worker Capacity Building using Retrieval-Augmented Generation. medRxiv 2023. [Google Scholar] [CrossRef]
  84. Wang, X.; Chen, N.; Chen, J.; Wang, Y.; Zhen, G.; Zhang, C.; Wu, X.; Hu, Y.; Gao, A.; Wan, X.; et al. Apollo: A Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People. arXiv 2024, arXiv:2403.03640. [Google Scholar]
  85. Huang, Z.; Zhu, W.; Cheng, G.; Li, L.; Yuan, F. MindMerger: Efficient Boosting LLM Reasoning in non-English Languages. arXiv 2024, arXiv:2405.17386. [Google Scholar]
  86. Peters, M.D.J.; Marnie, C.; Colquhoun, H.; Garritty, C.M.; Hempel, S.; Horsley, T.; Langlois, E.V.; Lillie, E.; O’Brien, K.K.; Tunçalp, Ö. Scoping reviews: Reinforcing and advancing the methodology and application. Syst. Rev. 2021, 10, 263. [Google Scholar] [CrossRef] [PubMed]
  87. Institute of Medicine (US) Committee on Standards for Systematic Reviews of Comparative Effectiveness Research. Finding What Works in Health Care: Standards for Systematic Reviews; Eden, J., Levit, L., Berg, A., Morton, S., Eds.; National Academies Press: Washington, DC, USA, 2011. [Google Scholar]
  88. Levac, D.; Colquhoun, H.; O’Brien, K.K. Scoping studies: Advancing the methodology. Implement. Sci. 2010, 5, 69. [Google Scholar] [CrossRef] [PubMed]
Figure 1. A timeline of major LLMs released since the inception of Google Brain in 2011.
Figure 2. PRISMA flowchart for study screening and selection.
Figure 3. Overview of the discussion section—a roadmap to achieve the end goal of the adoption of LLM in day-to-day Ophthalmology practice.
Table 1. Study characteristics.

Study | Clinical Application | LLM
SINGH 2023 [55] | Administrative | GPT 3.5
BARCLAY 2023 [20] | Clinical Knowledge | GPT 3.5, GPT 4
ROJAS-CARABALI (1) 2023 [49] | Diagnostic | GPT 3.5, GPT 4.0, Glass 1.0
ALI 2023 [15] | Diagnostic | GPT 3.5
SHEMER 2024 [53] | Diagnostic | GPT 3.5
ROJAS-CARABALI (2) 2023 [50] | Diagnostic | GPT 3.5, GPT 4
DELSOZ 2023 [26] | Diagnostic | GPT 3.5
SENSOY 2023 (1) [27] | Exam Taking | GPT 3.5, Bing, Bard
MOSHIRFAR 2023 [41] | Exam Taking | GPT 3.5, GPT 4
SENSOY 2023 (2) [52] | Exam Taking | GPT 3.5, Bing, Bard
ANTAKI 2023 (1) [17] | Exam Taking | GPT 3.5, GPT 4
TALONI 2023 [57] | Exam Taking | GPT 3.5, GPT 4
SINGER 2023 [54] | Exam Taking | Aeyeconsult, GPT 4
JIAO 2023 [34] | Exam Taking | GPT 3.5, GPT 4
ANTAKI 2023 (2) [18] | Exam Taking | ChatGPT Legacy and ChatGPT Plus
TEEBAGY 2023 [59] | Exam Taking | GPT 3.5, GPT 4
FOWLER 2023 [30] | Exam Taking | GPT 4, Bard
SAKAI 2023 [51] | Exam Taking | GPT 3.5, GPT 4
HADDAD 2024 [31] | Exam Taking | GPT 3.5, GPT 4
CAI 2023 [23] | Exam Taking | GPT 3.5, GPT 4, Bing Chat
PANTHIER 2023 [44] | Exam Taking | GPT 4
HUA 2023 [33] | Manuscript Writing | GPT 3.5, GPT 4
TAILOR 2024 [56] | Patient Education | GPT 3.5, GPT 4, Claude 2, Bing, Bard
FERRO DESIDERI 2023 [29] | Patient Education | GPT 3.5, Bard, Bing Chat
POTAPENKO 2023 [46] | Patient Education | GPT 4
BISWAS 2023 [22] | Patient Education | GPT 3.5
NIKDEL 2023 [42] | Patient Education | GPT 4
LIM 2023 [37] | Patient Education | GPT 3.5, GPT 4, Bard
KIANIAN 2023 (1) [36] | Patient Education | GPT 3.5
WU 2023 [61] | Patient Education | GPT 3.5
BERNSTEIN 2023 [21] | Patient Education | GPT 3.5
BALAS 2024 [19] | Patient Education | GPT 4
AL-SHARIF 2024 [16] | Patient Education | GPT 3.5, Bard
ZANDI 2024 [63] | Patient Education | GPT 4, Bard
EID 2023 [28] | Patient Education | GPT 4.0, Bard
PUSHPANATHAN 2023 [47] | Patient Education | GPT 3.5, GPT 4, Bard
CAPPELLANI 2024 [24] | Patient Education | GPT 3.5
YILMAZ 2024 [62] | Patient Education | GPT 3.5, Bard, Bing AI, AAO website
PATIL 2024 [45] | Patient Education | GPT 4, Bard
KIANIAN 2023 (2) [35] | Patient Education | GPT 4, Bard
LIU 2024 [38] | Patient Education | GPT 3.5
TAO 2024 [58] | Patient Education | GPT 3.5
WILHELM 2023 [60] | Patient Management | GPT 3.5 Turbo, Command-xlarge-nightly, Claude, Bloomz
MAYWOOD 2024 [40] | Patient Management | GPT 3.5 Turbo
CIRKOVIC 2023 [25] | Prognostication | GPT 4
HU 2022 [32] | Prognostication | BERT, RoBERTa, DistilBERT, BioBERT
RAGHU 2023 [48] | Prognostication | GPT 4
ONG 2023 [43] | Text interpretation | GPT 3.5
LYONS 2023 [39] | Triage | GPT 4, Bing Chat, WebMD
Table 2. Summary of study methodologies.

Study | Usage of a Research Protocol for AI | Ethical/Safety Safeguards Considered in Methodology | Ethics in Discussion | Prompt Techniques Employed | Prompt Examples Shared | Benchmarks on Correctness | Benchmarks on Harm
TAILOR 2024 [56] | No | Yes | Yes | Zero-shot (no prior context) | Yes | Human | Human
SENSOY 2023 (1) [27] | No | No | No | Zero-shot | No | Automated (Exact match) | Nil
FERRO DESIDERI 2023 [29] | No | No | No | Zero-shot | Yes | Human | Nil
ONG 2023 [43] | No | No | Yes | Zero-shot | Yes | Automated (Exact match) | Nil
LYONS 2023 [39] | No | No | Yes | Zero-shot | Yes | Human | Nil
MOSHIRFAR 2023 [41] | No | No | No | Zero-shot | Yes | Automated (Exact match) | Nil
POTAPENKO 2023 [46] | No | No | No | Zero-shot | Yes | Human | Nil
SENSOY 2023 (2) [52] | No | No | No | Zero-shot | No | Automated (Exact match) | Nil
BISWAS 2023 [22] | No | No | No | Zero-shot | Yes | Human | Nil
NIKDEL 2023 [42] | No | No | No | Zero-shot, Prompt Chaining | Yes | Human | Nil
LIM 2023 [37] | No | No | Yes | Zero-shot, Iterative Prompting | Yes | Human | Nil
KIANIAN 2023 (1) [36] | No | No | No (safety but not ethics) | One-shot, Few-shot | Yes | Automated and Human | Nil
ANTAKI 2023 (1) [17] | No | No | No (safety but not ethics) | Zero-shot | Yes | Human | Nil
ROJAS-CARABALI (1) 2023 [49] | No | No | No (safety but not ethics) | Zero-shot | Yes | Automated (Exact match) and Human | Nil
ALI 2023 [15] | No | No | No | Zero-shot | Yes | Human | Nil
SINGH 2023 [55] | No | No | No | Contextual Priming | Yes | Human | Nil
WU 2023 [61] | No | No | No | Zero-shot | Yes | Automated (Exact match, Readability) | Nil
TALONI 2023 [57] | No | No | No | Zero-shot | No | Automated (Exact match) | Nil
BERNSTEIN 2023 [21] | No | Yes | Yes | Zero-shot | Yes | Human | Human
SINGER 2023 [54] | No | No | No (safety but not ethics) | Zero-shot | No | Automated (Exact match) | Nil
SHEMER 2024 [53] | No | Yes | No | Zero-shot | Yes | Automated (Exact match) | Nil
BALAS 2024 [19] | No | No | No | Zero-shot | No | Human | Nil
AL-SHARIF 2024 [16] | No | No | Yes | Zero-shot | Yes | Human | Nil
JIAO 2023 [34] | No | No | Yes | Zero-shot | Yes | Automated (Exact match) | Nil
ROJAS-CARABALI (2) 2023 [50] | No | No | No | Zero-shot | Yes | Automated (Exact match) | Nil
ANTAKI 2023 (2) [18] | No | No | No (safety but not ethics) | Zero-shot | No | Automated (Exact match) | Nil
HUA 2023 [33] | No | No | Yes | Zero-shot | No | Human | Nil
ZANDI 2024 [63] | No | Yes | No (safety but not ethics) | Zero-shot | No | Human | Human
CIRKOVIC 2023 [25] | No | No? | No | Zero-shot | No | Automated (Exact match and F1-score) | Nil
TEEBAGY 2023 [59] | No | No | No | Zero-shot | No | Automated (Exact match) | Nil
WILHELM 2023 [60] | No | Yes | No (safety but not ethics) | Zero-shot | No | Automated and Human | Automated and Human
EID 2023 [28] | No | No | No | Zero-shot | Yes | Automated (readability) | Nil
MAYWOOD 2024 [40] | No | Yes | No (safety but not ethics) | Zero-shot | Yes | Human | Human
FOWLER 2023 [30] | No | No | No | Zero-shot | No | Automated (Exact match) | Nil
SAKAI 2023 [51] | No | No | No | Zero-shot, Few-shot | Yes | Automated (Exact match) | Nil
HADDAD 2024 [31] | No | No | No | Zero-shot | Yes | Automated (Exact match) | Nil
CAI 2023 [23] | No | No | No (safety but not ethics) | Zero-shot | Yes | Automated (Exact match) | Nil
PUSHPANATHAN 2023 [47] | No | No | No (safety but not ethics) | Zero-shot | Yes | Automated (Exact match) | Nil
HU 2022 [32] | No | No | No | Zero-shot | Yes | Automated (Exact match, F1 score) | Nil
BARCLAY 2023 [20] | No | Yes | No (safety but not ethics) | Zero-shot | Yes | Human | Human
CAPPELLANI 2024 [24] | No | Yes | No (safety but not ethics) | Zero-shot | Yes | Human | Human
PANTHIER 2023 [44] | No | No | No | Zero-shot | Yes | Automated (Exact match) | Nil
YILMAZ 2024 [62] | No | No | No (safety but not ethics) | Zero-shot | Yes | Automated | Nil
PATIL 2024 [45] | No | No | Yes | Zero-shot | Yes | Human | Human
DELSOZ 2023 [26] | No | No | No | Zero-shot | Yes | Human | Nil
KIANIAN 2023 (2) [35] | No | No | Yes | Zero-shot | Yes | Automated (readability) | Nil
RAGHU 2023 [48] | No | No | Yes | Zero-shot | Yes | Human | Nil
LIU 2024 [38] | No | No | No | Zero-shot, Chain-of-thought (inspired) | Yes | Automated | Nil
TAO 2024 [58] | No | Yes | Yes | Zero-shot | Yes | Human | Human
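Almost all studies in Table 2 used zero-shot prompting, with only a handful trying one-shot, few-shot, prompt chaining, or chain-of-thought variants. A minimal sketch of the zero-shot/few-shot distinction, with hypothetical example questions (not drawn from the included studies):

def zero_shot(question: str) -> str:
    # Zero-shot: the question is sent with no prior context or worked examples.
    return question

def few_shot(question: str, examples: list[tuple[str, str]]) -> str:
    # Few-shot: worked question-answer pairs are prepended before the new question.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

prompt = few_shot(
    "Which drug class does latanoprost belong to?",
    [("Which structures are inflamed in anterior uveitis?",
      "The iris and the ciliary body.")],
)

Because few-shot examples steer both the format and the reasoning of the response, sharing the exact prompt (as most, but not all, studies did under "Prompt Examples Shared") is essential for reproducibility.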
Table 3. (a) Overall performance of LLM responses—multiple LLMs studied. (b) Overall performance of LLM responses—one LLM studied.

(a)
Study | Setting | Scoring System | Result
BARCLAY 2023 [20] | Clinical Knowledge | 5 Point Scale | GPT 4 > GPT 3.5
ROJAS-CARABALI (1) 2023 [49] | Diagnostic | Correct or Incorrect | Experts > GPT 4 = GPT 3.5 > Glass 1.0
ROJAS-CARABALI (2) 2023 [50] | Diagnostic | Correct or Incorrect | Ophthalmologist > AI
SINGER 2023 [54] | Exam Taking | Correct or Incorrect | Aeyeconsult > GPT 4
ANTAKI 2023 (2) [18] | Exam Taking | Correct or Incorrect | ChatGPT Plus > ChatGPT Legacy
SENSOY 2023 (1) [27] | Exam Taking | Correct or Incorrect | Bard > Bing > GPT 3.5
SENSOY 2023 (2) [52] | Exam Taking | Correct, Incorrect or Unable to Answer | Bard > Bing > GPT 3.5
MOSHIRFAR 2023 [41] | Exam Taking | Correct or Incorrect | GPT 4 > Humans > GPT 3.5
ANTAKI 2023 (1) [17] | Exam Taking | Correct or Incorrect | GPT 4-0.3 > GPT 4-0.7 > GPT 4-1 = GPT 4-0 > GPT 3.5
TALONI 2023 [57] | Exam Taking | Correct or Incorrect | GPT 4 > Humans > GPT 3.5
JIAO 2023 [34] | Exam Taking | Correct or Incorrect | GPT 4 > GPT 3.5
TEEBAGY 2023 [59] | Exam Taking | Correct or Incorrect | GPT 4 > GPT 3.5
SAKAI 2023 [51] | Exam Taking | Correct or Incorrect | Humans > GPT 4 > GPT 3.5
HADDAD 2024 [31] | Exam Taking | Correct or Incorrect | GPT 4 > GPT 3.5
CAI 2023 [23] | Exam Taking | Correct or Incorrect | Humans > GPT 4 = Bing > GPT 3.5
FOWLER 2023 [30] | Exam Taking | Correct or Incorrect | GPT 4 > Bard
YILMAZ 2024 [62] | Patient Education | SOLO score | ChatGPT > Bard > Bing > AAO
PUSHPANATHAN 2023 [47] | Patient Education | 5 Point Scale | GPT 4 > GPT 3.5 > Bard
AL-SHARIF 2024 [16] | Patient Education | 4 Point Scale | GPT 3.5 > Bard
FERRO DESIDERI 2023 [29] | Patient Education | 3 Point Scale | GPT 3.5 > Bard = Bing
TAILOR 2024 [56] | Patient Education | 5 Point Scale | Expert + AI > GPT 3.5 > GPT 4 > Expert only > Claude > Bard > Bing
LIM 2023 [37] | Patient Education | 3 Point Scale | GPT 4 > GPT 3.5 > Bard
ZANDI 2024 [63] | Patient Education | Correct or Incorrect | GPT 4 > Bard
PATIL 2024 [45] | Patient Education | 5 Point Scale | ChatGPT > Bard
WILHELM 2023 [60] | Patient Management | mDISCERN | Claude-instant-v1.0 > GPT 3.5-Turbo > Command-xlarge-nightly > Bloomz
HU 2022 [32] | Prognostication | AUROC, F1 | BERT > RoBERTa > DistilBERT > BioBERT > Humans
LYONS 2023 [39] | Triage | 5 Point Scale | Ophthalmologists in training > ChatGPT > Bing Chat > WebMD
(b)
Study | LLMs | Setting | Scoring System | Result
ALI 2023 [15] | GPT 3.5 | Diagnostic | 3 Point Scale | 40% correct; 35% partially correct; 25% outright incorrect
SHEMER 2024 [53] | GPT 3.5 | Diagnostic | Correct or Incorrect | Residents > Attendings > GPT 3.5
DELSOZ 2023 [26] | GPT 3.5 | Diagnostic | Correct or Incorrect | ChatGPT performed similarly to 2 of 3 residents and better than 1 resident
PANTHIER 2023 [44] | GPT 4 | Exam Taking | Correct or Incorrect | 6188/6785 correct
BISWAS 2023 [22] | GPT 3.5 | Patient Education | 5 Point Scale | 66/275 responses rated very good; 134/275 good; 60/275 acceptable; 10/275 poor; 5/275 very poor
BERNSTEIN 2023 [21] | GPT 3.5 | Patient Education | Comparison to humans | GPT 3.5 = Humans
CAPPELLANI 2024 [24] | GPT 3.5 | Patient Education | 5 Point Scale | 93 responses scored ≥ 1; 27 responses scored ≤ −1; 9 responses scored −3
LIU 2024 [38] | GPT 3.5 | Patient Education | Correct or Incorrect | Ophthalmology Attendings > Ophthalmology Interns > English Prompting of ChatGPT > Chinese Prompting of ChatGPT
TAO 2024 [58] | GPT 3.5 | Patient Education | 4 Point Scale | 2.43 (95% CI 1.21–3.65)
POTAPENKO 2023 [46] | GPT 4 | Patient Education | Correct or Incorrect | 17/100 responses relevant without inaccuracies; 78/100 relevant with inaccuracies that were not harmful; 5/100 relevant with inaccuracies potentially harmful
NIKDEL 2023 [42] | GPT 4 | Patient Education | 3 Point Scale | 93/110 acceptable
BALAS 2024 [19] | GPT 4 | Patient Education | 7 Point Scale | 43/100 scored 6; 53/100 scored 5; 3/100 scored 4; 1/100 scored 3
MAYWOOD 2024 [40] | GPT 3.5 Turbo | Patient Management | Correct or Incorrect | 33/40 correct; 21/40 comprehensive
CIRKOVIC 2023 [25] | GPT 4 | Prognostication | Cohen's Kappa | 6 categories: k = 0.399; 2 categories: k = 0.610
RAGHU 2023 [48] | GPT 4 | Prognostication | Cohen's Kappa | With central subfield thickness: k = 0.263; without central subfield thickness: k = 0.351
ONG 2023 [43] | GPT 3.5 | Text interpretation | Correct: producing at least one correct ICD code; Correct only: only the correct ICD code; Incorrect: not generating any | Correct: 137/181; Correct only: 106/181; Incorrect: 54/181
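The two prognostication studies above (Cirkovic 2023 [25] and Raghu 2023 [48]) report agreement between the LLM and clinicians as Cohen's kappa, which corrects raw percentage agreement for the agreement expected by chance (k = 1 is perfect agreement, k = 0 is chance level; values of 0.21–0.40, 0.41–0.60, and 0.61–0.80 are conventionally read as fair, moderate, and substantial agreement). A minimal sketch with hypothetical paired labels, using scikit-learn:

from sklearn.metrics import cohen_kappa_score

# Hypothetical paired categorisations: LLM output vs. clinician judgement.
llm       = ["urgent", "routine", "urgent", "routine", "urgent", "routine"]
clinician = ["urgent", "routine", "routine", "routine", "urgent", "urgent"]

# Observed agreement is 4/6 and chance agreement is 1/2, so kappa = 1/3.
print(cohen_kappa_score(llm, clinician))  # 0.333...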
Table 4. Human vs. AI.

Study | Setting | Results
ROJAS-CARABALI (1) 2023 [49] | Diagnostic | Humans > GPT 4 > Glass
SHEMER 2024 [53] | Diagnostic | Humans > GPT 3.5
ROJAS-CARABALI (2) 2023 [50] | Diagnostic | Humans > GPT 3.5 and GPT 4 (collectively)
DELSOZ 2023 [26] | Diagnostic | Humans = GPT 3.5
MOSHIRFAR 2023 [41] | Exam Taking | GPT 4 > Humans > GPT 3.5
ANTAKI 2023 (1) [17] | Exam Taking | GPT 4 > Humans
TALONI 2023 [57] | Exam Taking | GPT 4 > Humans > GPT 3.5
FOWLER 2023 [30] | Exam Taking | GPT 4 > Humans > Bard
SAKAI 2023 [51] | Exam Taking | Humans > GPT 4 > GPT 3.5
HADDAD 2024 [31] | Exam Taking | Humans > GPT 4 > GPT 3.5
CAI 2023 [23] | Exam Taking | Humans > GPT 4 > Bing > GPT 3.5
TAILOR 2024 [56] | Patient Education | Quality: Expert + AI = GPT 3.5 = GPT 4 > Expert > Claude > Bard > Bing; Empathy: GPT 3.5 = Expert + AI = GPT 4 > Bard > Claude > Expert > Bing
BERNSTEIN 2023 [21] | Patient Education | GPT 3.5 = Humans
LIU 2024 [38] | Patient Education | Humans > GPT 3.5
CIRKOVIC 2023 [25] | Prognostication | Humans = GPT 4
LYONS 2023 [39] | Triage | Humans > GPT 4 > Bing > WebMD Symptom Checker
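Most of the exam-taking comparisons in Tables 3 and 4 reduce to exact-match accuracy: the model's chosen option letter is scored against the answer key, and the resulting accuracies are ranked. A minimal sketch with hypothetical data:

def exact_match_accuracy(model_answers: dict[str, str], answer_key: dict[str, str]) -> float:
    # One point when the model's option letter matches the key, zero otherwise.
    correct = sum(model_answers[q].strip().upper() == ans
                  for q, ans in answer_key.items())
    return correct / len(answer_key)

answer_key = {"q1": "C", "q2": "B", "q3": "A"}     # hypothetical answer key
model_answers = {"q1": "A", "q2": "B", "q3": "A"}  # hypothetical model output
print(exact_match_accuracy(model_answers, answer_key))  # 2 of 3 correct, i.e. 0.67 rounded

Exact match is objective but coarse: it cannot distinguish a near-miss from a dangerous error, which is why several studies paired it with human grading of inaccuracies and harm.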
Table 5. LLM performance by subspecialty.
[In the published version, this table ranks LLM performance within each subspecialty (Overall, Cornea, Glaucoma, Neuro-Ophthalmology, Uveitis, Lens and Cataract, Paediatrics-Strabismus, and Retina & Vitreous) for eleven studies: Moshirfar 2023 [41], Taloni 2023 [57], Singer 2023 [54], Jiao 2023 [34], Antaki 2023 (1) [17], Antaki 2023 (2) [18], Teebagy 2023 [59], Sakai 2023 [51], Haddad 2024 [31], and Cai 2023 [23] (all Exam Taking), and Patil 2024 [45] (Patient Education). The individual models in each ranking are denoted by inline image icons (Bdcc 09 00151 i001–i013) that do not survive text extraction, so the cell-level rankings are not reproduced here; the corresponding overall comparisons appear in Tables 3 and 4.]
Retina &VitreousBdcc 09 00151 i003 = Bdcc 09 00151 i002 = Bdcc 09 00151 i001Bdcc 09 00151 i002 = Bdcc 09 00151 i001 = Bdcc 09 00151 i003Bdcc 09 00151 i004 > Bdcc 09 00151 i002Bdcc 09 00151 i002 = Bdcc 09 00151 i003Bdcc 09 00151 i011 > Bdcc 09 00151 i010 > Bdcc 09 00151 i012 = Bdcc 09 00151 i009 > Bdcc 09 00151 i003Bdcc 09 00151 i008 > Bdcc 09 00151 i007Bdcc 09 00151 i002 > Bdcc 09 00151 i003Bdcc 09 00151 i002 = Bdcc 09 00151 i013 > Bdcc 09 00151 i003Bdcc 09 00151 i002 = Bdcc 09 00151 i003 @Bdcc 09 00151 i005 > Bdcc 09 00151 i002 > Bdcc 09 00151 i001 > Bdcc 09 00151 i003Bdcc 09 00151 i002 > Bdcc 09 00151 i006
OculoplasticsBdcc 09 00151 i002 > Bdcc 09 00151 i003 = Bdcc 09 00151 i001Bdcc 09 00151 i002 > Bdcc 09 00151 i001 = Bdcc 09 00151 i003Bdcc 09 00151 i004 > Bdcc 09 00151 i002Bdcc 09 00151 i003 > Bdcc 09 00151 i002Bdcc 09 00151 i010 = Bdcc 09 00151 i009 > Bdcc 09 00151 i012 = Bdcc 09 00151 i011 > Bdcc 09 00151 i003Bdcc 09 00151 i007 > Bdcc 09 00151 i008Bdcc 09 00151 i002 > Bdcc 09 00151 i003Bdcc 09 00151 i002 > Bdcc 09 00151 i013 > Bdcc 09 00151 i003 +Bdcc 09 00151 i002 = Bdcc 09 00151 i003Bdcc 09 00151 i002 > Bdcc 09 00151 i005 > Bdcc 09 00151 i001 > Bdcc 09 00151 i003Bdcc 09 00151 i002 > Bdcc 09 00151 i006
OpticsBdcc 09 00151 i002 > Bdcc 09 00151 i003 = Bdcc 09 00151 i001-Bdcc 09 00151 i004 > Bdcc 09 00151 i002-Bdcc 09 00151 i010 > Bdcc 09 00151 i011 > Bdcc 09 00151 i009 > Bdcc 09 00151 i012 > Bdcc 09 00151 i003Bdcc 09 00151 i007 > Bdcc 09 00151 i008 (BCSC)
Bdcc 09 00151 i008 > Bdcc 09 00151 i007
(Ophtho-Q)
Bdcc 09 00151 i002 > Bdcc 09 00151 i003-Bdcc 09 00151 i002 = Bdcc 09 00151 i003 #Bdcc 09 00151 i001 > Bdcc 09 00151 i002 = Bdcc 09 00151 i005 > Bdcc 09 00151 i003-
Refractive SurgeryBdcc 09 00151 i002 > Bdcc 09 00151 i003 = Bdcc 09 00151 i001Bdcc 09 00151 i002 > Bdcc 09 00151 i001 > Bdcc 09 00151 i003Bdcc 09 00151 i004 > Bdcc 09 00151 i002Bdcc 09 00151 i002 > Bdcc 09 00151 i003Bdcc 09 00151 i011 > Bdcc 09 00151 i012 = Bdcc 09 00151 i009 > Bdcc 09 00151 i010 > Bdcc 09 00151 i003Bdcc 09 00151 i008 = Bdcc 09 00151 i007Bdcc 09 00151 i002 > Bdcc 09 00151 i003Bdcc 09 00151 i002 > Bdcc 09 00151 i013 > Bdcc 09 00151 i003 *Bdcc 09 00151 i002 = Bdcc 09 00151 i003 #-Bdcc 09 00151 i006 > Bdcc 09 00151 i002
PathologyBdcc 09 00151 i002 > Bdcc 09 00151 i001 = Bdcc 09 00151 i003Bdcc 09 00151 i002 = Bdcc 09 00151 i001 = Bdcc 09 00151 i003Bdcc 09 00151 i004 > Bdcc 09 00151 i002Bdcc 09 00151 i002 > Bdcc 09 00151 i003Bdcc 09 00151 i011 = Bdcc 09 00151 i010 = Bdcc 09 00151 i009 > Bdcc 09 00151 i012 > Bdcc 09 00151 i003Bdcc 09 00151 i008 > Bdcc 09 00151 i007Bdcc 09 00151 i002 > Bdcc 09 00151 i003Bdcc 09 00151 i002 > Bdcc 09 00151 i013 > Bdcc 09 00151 i003 +Bdcc 09 00151 i002 = Bdcc 09 00151 i003 @--
*, +, @, #: Results categorized into same subspecialties in reporting. Legend: Bdcc 09 00151 i001 Human; Bdcc 09 00151 i002 GPT-4; Bdcc 09 00151 i003 GPT-3.5; Bdcc 09 00151 i004 Aeyconsult; Bdcc 09 00151 i005 Bing; Bdcc 09 00151 i006 Bard; Bdcc 09 00151 i007 ChatGPT Legacy; Bdcc 09 00151 i008 ChatGPT Plus; Bdcc 09 00151 i009 GPT 4-0; Bdcc 09 00151 i010 GPT 4-0.3; Bdcc 09 00151 i011 GPT 4-0.7; Bdcc 09 00151 i012 GPT 4-1; Bdcc 09 00151 i013 GPT 4 (refined).
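The numeric suffixes on the GPT-4 variants in Antaki 2023 (1) suggest runs of the same model under different decoding settings. Assuming (and this is our assumption, not something the table confirms) that those suffixes denote sampling temperatures, a benchmarking loop might look like the following sketch, which uses the openai Python client; the model identifier and question are placeholders.

```python
# Hypothetical sketch: posing one exam question at several sampling
# temperatures, under the ASSUMPTION that the "GPT 4-0 ... GPT 4-1"
# labels above index temperature settings. Requires the `openai`
# package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def answer_at_temperature(question: str, temperature: float) -> str:
    """Query the model once at the given decoding temperature."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model identifier
        messages=[{"role": "user", "content": question}],
        temperature=temperature,
    )
    return response.choices[0].message.content

question = "A 65-year-old presents with sudden painless vision loss. ...?"  # placeholder
for t in (0.0, 0.3, 0.7, 1.0):
    print(t, answer_at_temperature(question, t))
```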
Table 6. Readability.
Study | LLMs | Scoring Systems | Ranking by Readability
EID 2023 [28] | GPT 4, Bard | FKRE, FKGL, GFI, CLI, SMOG, Automated Readability Index (ARI), Linsear Write Readability Score (LWRS) | FKRE: GPT 4 > Bard > ASOPRS resource; FKGL: GPT 4 > Bard > ASOPRS resource; GFI: GPT 4 > Bard > ASOPRS resource; SMOG: GPT 4 > Bard > ASOPRS resource; CLI: GPT 4 > Bard > ASOPRS resource; ARI: GPT 4 = Bard > ASOPRS resource; LWRS: Bard > GPT 4 > ASOPRS resource. (* Only LLM scores after fine-tuning to a 6th-grade reading level are included.)
KIANIAN (1) 2023 [36] | GPT 3.5 | FKRE, FKGL, GFI, SMOG | GPT > online resources in all scoring systems
KIANIAN (2) 2023 [35] | GPT 3.5, Bard | FKGL | GPT > Bard
WU [61] | GPT 3.5 | FKGL, GFI, SMOG, Dale–Chall Score | AAO resource > GPT in all scoring systems
Table 6 acronyms: FKRE: Flesch–Kincaid Reading Ease; FKGL: Flesch–Kincaid Grade Level; GFI: Gunning Fog Index; CLI: Coleman–Liau Index; SMOG: Simple Measure of Gobbledygook; ARI: Automated Readability Index; LWRS: Linsear Write Readability Score; AAO: American Academy of Ophthalmology; ASOPRS: American Society of Ophthalmic Plastic and Reconstructive Surgery.
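The two Flesch–Kincaid metrics in Table 6 are closed-form functions of word, sentence, and syllable counts: FKRE = 206.835 − 1.015·(words/sentence) − 84.6·(syllables/word), and FKGL = 0.39·(words/sentence) + 11.8·(syllables/word) − 15.59. A minimal Python sketch follows; the vowel-group syllable counter is a crude heuristic and may differ from the counters used in the cited studies.

```python
import re

def count_syllables(word: str) -> int:
    # Crude approximation: count contiguous vowel groups, minimum one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid(text: str) -> tuple[float, float]:
    """Return (FKRE, FKGL) for a passage of English text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text) or ["x"]
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # mean words per sentence
    spw = syllables / len(words)   # mean syllables per word
    fkre = 206.835 - 1.015 * wps - 84.6 * spw  # higher = easier to read
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # approximate US school grade
    return fkre, fkgl

print(flesch_kincaid("Glaucoma damages the optic nerve. Early treatment helps."))
```

Note that a higher FKRE but a lower FKGL indicates easier text, which is why rankings "by readability" must state which direction of each score they treat as better.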
Table 7. Diagnostic capabilities of LLMs.
Study | LLMs | Evaluated Data | No. of Questions/Cases | Diagnostic Accuracy Rate
RAGHU 2023 [48] | GPT 4 | Clinical, biochemical, and ocular data | 111 | 65.6% of answers consistent with ophthalmologists on average
LIU 2024 [38] | GPT 3.5 | FFA reports | 1226 | Ophthalmologists (89.35%) > Ophthalmologist interns (82.69%) > GPT 3.5, English prompts (80.05%) > GPT 3.5, Chinese prompts (70.47%)
LYONS 2023 [39] | GPT 4, Bing Chat, WebMD | History only | 44 | Ophthalmologists in training (95%) > GPT 4 (93%) > Bing Chat (77%) > WebMD (33%)
ZANDI 2024 [63] | GPT 4, Bard | History only | 80 | GPT 4 (53.75%) > Bard (43.75%)
SHEMER 2024 [53] | GPT 3.5 | History only | 126 | Residents (75%) > Attendings (71%) > GPT 3.5 (54%)
LIM 2023 [37] | GPT 3.5, GPT 4, Bard | History only | 2 | GPT 4 (100%) = GPT 3.5 (100%) > Bard (50%)
ROJAS-CARABALI (1) 2023 [49] | GPT 3.5, GPT 4, Glass 1.0 | History and examination findings | 25 | Uveitis specialists (mean = 79%) > Fellow (76%) > GPT 4 (60%) = GPT 3.5 (60%)
DELSOZ 2023 [26] | GPT 3.5 | History and examination findings | 11 | GPT 3.5 (72.7%) > Ophthalmology trainees (mean = 66.6%)
ROJAS-CARABALI (2) 2023 [50] | GPT 3.5, GPT 4 | History, examination findings, and images | 6 | Experts (100%) > GPT 4 (50%) = GPT 3.5 (50%) > Glass 1.0 (33%)
TALONI 2023 [57] | GPT 3.5, GPT 4 | Question banks | 646 | GPT 4 (83.7%) > Humans (mean = 75.4%) > GPT 3.5 (68.1%)
CAI 2023 [23] | GPT 3.5, GPT 4, Bing Chat | Question banks | 250 | Humans (73.8%) > Bing (60.9%) > GPT 4 (59.4%) > GPT 3.5 (46.4%)
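The per-responder accuracy rates in Table 7 share a common derivation: each case has a reference diagnosis, and accuracy is the fraction of cases a responder answers correctly. A minimal sketch follows; the gradings are invented placeholders, not values from any included study.

```python
from collections import defaultdict

# (responder, case_id, diagnosis_correct) — illustrative placeholder gradings
graded = [
    ("GPT-4", 1, True), ("GPT-4", 2, False), ("GPT-4", 3, True),
    ("resident", 1, True), ("resident", 2, True), ("resident", 3, False),
]

tally = defaultdict(lambda: [0, 0])  # responder -> [correct, attempted]
for responder, _case_id, correct in graded:
    tally[responder][0] += int(correct)
    tally[responder][1] += 1

for responder, (correct, attempted) in sorted(tally.items()):
    print(f"{responder}: {correct}/{attempted} = {correct / attempted:.1%}")
```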
Table 8. Management.
Study | LLMs | Management Scores
BISWAS 2023 [22] | GPT 3.5 | Median score on management questions = 4.0, on a Likert scale (1: very poor; 2: poor; 3: acceptable; 4: good; 5: very good)
LIM 2023 [37] | GPT 3.5, GPT 4, Bard | Good management: GPT 4 (70%) > Bard (45%) > GPT 3.5 (40%); Borderline management: Bard (40%) > GPT 3.5 (35%) > GPT 4 (15%); Poor management: GPT 3.5 (25%) > Bard (15%) = GPT 4 (15%)
TALONI 2023 [57] | GPT 3.5, GPT 4 | Medical management: GPT 4 (83.4%) > Humans (76.9%) > GPT 3.5 (65.1%); Surgical management: Humans (74.7%) > GPT 4 (74.6%) > GPT 3.5 (57.0%)
AL-SHARIF 2024 [16] | GPT 3.5, Bard | Comprehensive management: GPT 3.5 (71.4%) > Bard (53.1%)
ROJAS-CARABALI (2) 2023 [50] | GPT 3.5, GPT 4 | Management and treatment plans consistent with the majority of ophthalmologists in all cases
CAI 2023 [23] | GPT 3.5, GPT 4, Bing Chat | GPT 4 (77.0%) > Humans (76.1%) > Bing (75.4%) > GPT 3.5 (58.3%)
CAPPELLANI 2024 [24] | GPT 3.5 | Management scores by subspecialty: General (2); Anterior segment and cornea (2); Glaucoma (−1); Neuro-Ophthalmology (2); Oncology (1); Pediatrics (1); Oculoplastics (2); Retina and Uveitis (1); overall median score 1, on a Likert scale (−3: potentially dangerous; −2: very poor; −1: poor; 0: no response; 1: good; 2: very good)
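The aggregation behind these scores is ordinal: each response is graded on a Likert scale and summarised by its median. The sketch below uses the −3 (potentially dangerous) to 2 (very good) scale reported for Cappellani 2024 [24]; the grades themselves are illustrative placeholders.

```python
from statistics import median

# Ordinal scale as reported for Cappellani 2024 [24].
SCALE = {-3: "potentially dangerous", -2: "very poor", -1: "poor",
         0: "no response", 1: "good", 2: "very good"}

# Placeholder per-response grades, grouped by subspecialty.
grades = {"Glaucoma": [-2, -1, -1, 1], "Neuro-Ophthalmology": [2, 2, 1]}

for subspecialty, scores in grades.items():
    m = median(scores)  # median of an even-length list can fall between categories
    print(f"{subspecialty}: median = {m} ({SCALE.get(m, 'between categories')})")
```

Medians are the appropriate summary here because Likert grades are ordinal, not interval, so a mean of −3 to 2 values would not be meaningful.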
Table 9. Clinical administration.
Study | LLMs | Clinical Administrative Task | Performance
SINGH 2023 [55] | GPT 3.5 | Discharge summary and operative note writing | Only a qualitative description of the output was provided: (1) discharge summaries were valid but very generic, and improved in specificity and detail upon further prompting; (2) operative notes contained inaccuracies and hallucinations, which could be corrected upon further prompting.
HUA 2023 [33] | GPT 3.5, GPT 4 | Research manuscript writing | GPT 4 > GPT 3.5 in helpfulness, truthfulness, and harmlessness scores and the Modified AI-DISCERN score; GPT 3.5 > GPT 4 in hallucination rate, mean Sapling AI Detector fake score, and mean GPT-2 Output Detector fake score.
ONG 2023 [43] | GPT 3.5 | Retinal ICD coding | Only correct ICD codes generated: 59%; at least one correct ICD code generated: 70%; no correct ICD codes generated: 30%
Table 10. Hallucination or inaccuracies.
Study | LLMs | Evaluation | Results
Multiple LLMs:
TAILOR 2024 [56] | GPT 3.5, GPT 4, Claude 2, Bing, Bard | Degree of inaccuracy or correctness | Inaccuracies of great clinical significance: Bard > Bing > Claude > GPT 3.5 > GPT 4 > GPT 4 + Expert > Expert
AL-SHARIF 2024 [16] | GPT 3.5, Bard | Degree of accuracy | Completely inaccurate answers: Bard (10.5%) > GPT 3.5 (5.1%); partially inaccurate answers: Bard (18.1%) > GPT 3.5 (10.5%)
FERRO DESIDERI 2023 [29] | GPT 3.5, Bard, Bing Chat | Degree of accuracy | Inaccurate and insufficient answers: Bard (13.3%) > Bing (6.7%) > GPT 3.5 (0.0%)
YILMAZ 2024 [62] | GPT 3.5, Bard, Bing AI, AAO website | Degree of accuracy | Accuracy (mean SOLO taxonomy score *): GPT 3.5 (3.1 ± 0.31) > Bard (2.9 ± 0.72) > Bing (2.65 ± 0.49) > AAO (2.4 ± 0.6). (* SOLO: Structure of Observed Learning Outcome)
BARCLAY 2023 [20] | GPT 3.5, GPT 4 | Degree of inaccuracy | Incorrect facts of little significance: GPT 3.5 (39%) > GPT 4 (22%); incorrect facts of great significance: GPT 3.5 (38%) > GPT 4 (8%)
LIM 2023 [37] | GPT 3.5, GPT 4, Bard | Degree of accuracy | Total accuracy score: GPT 4 (8.19) > GPT 3.5 (7.35) > Bard (7.13)
PUSHPANATHAN 2023 [47] | GPT 3.5, GPT 4, Bard | Degree of accuracy | Significant inaccuracies: Bard (13.5%) > GPT 3.5 (5.4%) > GPT 4 (0%)
LYONS 2023 [39] | GPT 4, Bing Chat, WebMD Symptom Checker | Degree of inaccuracy | Grossly inaccurate statements: WebMD (50%) > Bing (14%) > GPT 4 (0%)
WILHELM 2023 [60] | GPT 3.5 Turbo, Command-xlarge-nightly, Claude-instant-v1, Bloomz | Degree of falseness | Falseness: Claude-instant-v1.0 (51.7%) > Command-xlarge-nightly (16.7%) > Bloomz (15%) > GPT 3.5 Turbo (11.7%)
HUA 2023 [33] | GPT 3.5, GPT 4 | Hallucination frequency | Mean hallucination rate: GPT 3.5 (0.31) > GPT 4 (0.29)
CAI 2023 [23] | GPT 3.5, GPT 4, Bing Chat | Hallucination frequency | Hallucinations: GPT 3.5 (42.4%) > Bing (25.6%) > GPT 4 (18%)
Single LLM:
ALI 2023 [15] | GPT 3.5 | Degree of correctness | Partially incorrect answers: 35%; completely incorrect answers: 25%
CAPPELLANI 2024 [24] | GPT 3.5 | Degree of correctness | Incorrect answers: 22.5%
NIKDEL 2023 [42] | GPT 4 | Degree of appropriateness | Inappropriate response rate: amblyopia questions, 5.6%; childhood myopia questions, 5.4%
BALAS 2024 [19] | GPT 4 | Degree of appropriateness | No inappropriate responses
BERNSTEIN 2023 [21] | GPT 3.5 | Degree of inappropriateness | Rate of incorrect or inappropriate answers, (a) of little clinical significance: Humans (19.1%) > GPT 3.5 (18%); (b) of great clinical significance: Humans (5.5%) > GPT 3.5 (4.6%)
BISWAS 2023 [22] | GPT 3.5 | Degree of inaccuracy | Inaccurate answers: 3.6%; very inaccurate answers: 1.8%
POTAPENKO 2023 [46] | GPT 4 | Degree of inaccuracy | Rate of major inaccuracies: 5%
LIU 2024 [38] | GPT 3.5 | Hallucination frequency | Hallucination: Chinese prompts (5.88%) > English prompts (0.59%); misinformation: Chinese prompts (7.84%) > English prompts (1.96%)
MAYWOOD 2024 [40] | GPT 3.5 Turbo | Hallucination frequency | Hallucination rate: 30%
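Note that Table 10 mixes two distinct hallucination metrics: a mean rate per response (e.g., Hua 2023's 0.31) and the percentage of responses containing at least one hallucination (e.g., Cai 2023's 42.4%). The sketch below computes both from per-response grader counts; the data, and the mapping to each study's exact definition, are illustrative assumptions.

```python
# Grader-assigned counts of hallucinated statements per response (placeholders).
counts = [0, 2, 0, 1, 0]

mean_rate = sum(counts) / len(counts)                      # Hua-style mean rate
share_affected = sum(n > 0 for n in counts) / len(counts)  # Cai-style percentage

print(f"Mean hallucinations per response: {mean_rate:.2f}")       # 0.60
print(f"Responses with >= 1 hallucination: {share_affected:.0%}") # 40%
```

Because the two metrics can diverge (many responses with one minor hallucination each versus a few responses with many), rates from different studies should only be compared when the underlying definition is the same.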
Table 11. Harm.
Study | LLMs | Extent of Harm in Output | Measure of Harm
BERNSTEIN 2023 [21] | GPT 3.5 | GPT 3.5 = Humans | Likert scales on likelihood and extent of harm
CAPPELLANI 2024 [24] | GPT 3.5 | 7.5% of responses harmful | Likert scale on potential harm
MAYWOOD 2024 [40] | GPT 3.5 Turbo | 50% possible harm, 33.3% definitive harm | Likert scale on harm
WILHELM 2023 [60] | GPT 3.5 Turbo, Command-xlarge-nightly, Claude, Bloomz | Claude > Bloomz > Command-xlarge-nightly > GPT 3.5 Turbo | Binary variable (presence or absence of harm)
HUA 2023 [33] | GPT 3.5, GPT 4 | GPT 3.5 > GPT 4 | Modified AI-DISCERN criteria
BARCLAY 2023 [20] | GPT 3.5, GPT 4 | GPT 3.5 > GPT 4 | Likert scales on likelihood and severity of harm
LIM 2023 [37] | GPT 3.5, GPT 4, Bard | GPT 3.5 = Bard > GPT 4 | Likert scale on harm
TAILOR 2024 [56] | GPT 3.5, GPT 4, Claude 2, Bing, Bard | High-risk harm: Bard > Bing > GPT 3.5 > Claude > GPT 4 > GPT 4 + Human > Human | Likert scales on likelihood and extent of harm
POTAPENKO 2023 [46] | GPT 4 | 5% potential harm | Likert scales on potential for harm
BALAS 2024 [19] | GPT 4 | No responses constituting harm | Appropriateness rating scale
ZANDI 2024 [63] | GPT 4, Bard | Bard > GPT 4 | Binary variable (presence or absence of potential harm)