Evaluating the Reliability of ChatGPT for Health-Related Questions: A Systematic Review
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Dear Authors,
This is an important and timely paper, which I considered generally well written, with sound research methods.
Nevertheless, a few minor improvements would increase the quality of the paper.
a) Among the methods, the authors should justify the selected method of quantitative synthesis. I understand the challenges of synthesizing results from diverse metrics, but the lack of meta-analysis and the choice of the boxplot-based method should be explained.
b) The authors mention the diversity of evaluation criteria. It would be useful to educate the readers about the approaches used in the papers. Is there any meaningful taxonomy for the strategies applied?
c) I suggest the authors reconsider the terminology used for evaluation metrics. I was not sure if "adjusted accuracy" is the best term, if the evaluation metrics included a range of graded judgements about raters' perceptions, which are technically different from an accuracy metric.
d) The authors should also revise the terminology concerning the validation of ChatGPT's performance in the screening process. The 100 randomly chosen papers do not inform about reliability (i.e. agreement of repeated measurements), but the sensitivity / recall of ChatGPT in the screening. 0 relevant records in a random sample of 100 suggests that less than 3/100 relevant papers were excluded erroneously by ChatGPT with 95% confidence.
Altogether, I suggest a minor revision for this paper.
Author Response
Dear Authors,
This is an important and timely paper, which I considered generally well written, with sound research methods.
Nevertheless, a few minor improvements would increase the quality of the paper.
a) Among the methods, the authors should justify the selected method of quantitative synthesis. I understand the challenges of synthesizing results from diverse metrics, but the lack of meta-analysis and the choice of the boxplot-based method should be explained.
Thank you very much for your feedback. We acknowledge your comment and appreciate the opportunity to clarify our methodological choices. A meta-analysis was not performed due to significant heterogeneity in the data, with some categories having sparse data (only one or two entries) and others displaying considerable variability in accuracy metrics. Pooling such data would risk generating biased or misleading results.
Instead, we selected a boxplot-based approach to provide a transparent and descriptive synthesis of the data, allowing for visualization of central tendencies and variability across categories while respecting the limitations of the dataset. This method effectively highlights trends without imposing assumptions that may not hold under these conditions. Additionally, boxplots serve as an exploratory tool well-suited for analyzing heterogeneous data and identifying outliers or patterns.
We added this explanation to the manuscript to clarify our decision and address your feedback.
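As a minimal illustration of this descriptive approach (our own sketch using hypothetical category names and values, not the study's data or code), per-category adjusted-accuracy values can be summarized with boxplots as follows:

```python
# Illustrative sketch only: descriptive synthesis of heterogeneous per-category
# accuracy values via boxplots. Category names and numbers below are hypothetical.
import matplotlib.pyplot as plt

adjusted_accuracy_by_category = {
    "Patient education": [72, 85, 90, 64, 78],
    "Clinical decision support": [55, 61, 80],
    "Exam questions": [66, 70, 92, 88, 74, 81],
    "Sparse category": [50],  # only one entry: pooling would be unreliable
}

fig, ax = plt.subplots(figsize=(7, 4))
ax.boxplot(list(adjusted_accuracy_by_category.values()),
           labels=list(adjusted_accuracy_by_category.keys()),
           showfliers=True)  # keep outliers visible for exploratory reading
ax.set_ylabel("Adjusted accuracy (%)")
ax.set_title("Descriptive synthesis across categories (illustrative)")
plt.tight_layout()
plt.show()
```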
b) The authors mention the diversity of evaluation criteria. It would be useful to educate the readers about the approaches used in the papers. Is there any meaningful taxonomy for the strategies applied?
We greatly appreciate your valuable feedback. Most studies employed Likert-like scales to measure the accuracy of ChatGPT responses, while six studies utilized standard performance metrics such as accuracy, recall, precision, and specificity. However, the Likert scales varied significantly—some used 3-point, 4-point, or 5-point scales, among others, with starting points being either 0 or 1. This variability influenced the significance of each Likert point, making direct comparisons across the studies challenging.
To address this issue, we introduced a new measure that we named adjusted accuracy, which normalizes all evaluation metrics to a uniform 0-100% scale. This adjustment accounts for differences in scale range and starting points, enabling meaningful synthesis of results across diverse evaluation strategies. By implementing adjusted accuracy, we ensured consistency and comparability in our analysis, providing a clearer picture of ChatGPT’s performance. We have added an explanation that combines this feedback with comment C.
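For clarity, the normalization idea behind adjusted accuracy can be sketched as follows (a minimal illustration in our own notation; the function name and example scales are ours, not taken from the manuscript):

```python
# Minimal sketch: rescale a Likert-style rating onto a common 0-100% scale,
# accounting for the scale's minimum and maximum. Illustrative only.
def adjusted_accuracy(score: float, scale_min: float, scale_max: float) -> float:
    """Map a rating on [scale_min, scale_max] linearly onto 0-100%."""
    return (score - scale_min) / (scale_max - scale_min) * 100.0

# e.g. a rating of 4 on a 1-5 scale and a rating of 3 on a 0-4 scale both map to 75%
print(adjusted_accuracy(4, 1, 5))  # 75.0
print(adjusted_accuracy(3, 0, 4))  # 75.0
```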
c) I suggest the authors reconsider the terminology used for evaluation metrics. I was not sure if "adjusted accuracy" is the best term, if the evaluation metrics included a range of graded judgements about raters’ perceptions, which are technically different from an accuracy metric.
We appreciate your feedback and the opportunity to clarify our terminology. While we acknowledge that Likert-like scales represent graded judgments rather than direct accuracy measurements, the term "adjusted accuracy" was chosen to harmonize diverse evaluation metrics—including Likert-like scales and standard metrics such as precision and recall—into a consistent 0-100 scale.
Although traditional accuracy metrics strictly measure correctness, adjusted accuracy interprets Likert-like scales as proxies for accuracy judgments, reflecting the perceived correctness or appropriateness of ChatGPT’s responses. This interpretation aligns conceptually with a broader understanding of accuracy. To further clarify this to readers, we have added an explanation detailing the different metrics available in the data and what we mean by adjusted accuracy.
d) The authors should also revise the terminology concerning the validation of ChatGPT's performance in the screening process. The 100 randomly chosen papers do not inform about reliability (i.e. agreement of repeated measurements), but the sensitivity / recall of ChatGPT in the screening. 0 relevant record in a random sample of 100 suggests that less than 3/100 relevant papers were excluded erroneously by ChatGPT with 95% confidence.
Thank you for your valuable feedback and comment. We have revised the terminology in our methods to refer to performance, specifically sensitivity/recall-based performance, to more accurately describe the evaluation of ChatGPT in the screening process. This ensures clarity and avoids conflating it with reliability, which pertains to repeated measurement agreement.
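As a side note for interested readers, the bound the reviewer describes (the "rule of three") can be verified with a short calculation; the snippet below is our own illustrative check, not code from the study:

```python
# Observing 0 erroneously excluded papers in a random sample of n = 100 bounds the
# exclusion error rate at roughly 3/n with 95% confidence.
n = 100
confidence = 0.95
# Exact one-sided upper bound when 0 events are observed: the largest p with (1 - p)^n >= 1 - confidence.
upper_bound = 1 - (1 - confidence) ** (1 / n)
print(f"95% upper bound on miss rate: {upper_bound:.4f}")  # about 0.0295 (~3/100)
print(f"Rule-of-three approximation:  {3 / n:.4f}")        # 0.0300
```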
Altogether, I suggest a minor revision for this paper.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors of this study conducted a systematic literature review to summarize the consistency and reliability of using ChatGPT in the medical domain. Their search database includes PubMed only, and their search keyword only includes “ChatGPT”. Their so-called “high recall” keyword yielded about 1,101 articles in total from 2023 onwards, and they finally included 128 articles for detailed analysis.
In my opinion, the search strategy is not reliable. They used a single database and a single keyword. Moreover, the authors used ChatGPT to help with article screening, even though ChatGPT is notorious for its hallucination issue. Additionally, it is now near the end of 2024, but their search cutoff was the end of 2023, one year earlier. In such a fast-evolving field, I don’t think such a systematic review carries any scientific value.
Author Response
The authors of this study conducted a systematic literature review to summarize the consistency and reliability of using ChatGPT in the medical domain. Their search database includes PubMed only, and their search keyword only includes “ChatGPT”. Their so-called “high recall” keyword yielded about 1,101 articles in total from 2023 onwards, and they finally included 128 articles for detailed analysis.
In my opinion, the search strategy is not reliable. They used a single database and a single keyword. Moreover, the authors used ChatGPT to help with article screening, even though ChatGPT is notorious for its hallucination issue. Additionally, it is now near the end of 2024, but their search cutoff was the end of 2023, one year earlier. In such a fast-evolving field, I don’t think such a systematic review carries any scientific value.
We greatly appreciate your valuable and insightful feedback and comments. We selected PubMed because it is a widely recognized and comprehensive resource for peer-reviewed biomedical and healthcare literature. Given the focus of our systematic review on the healthcare applications of ChatGPT, we prioritized a database that would most likely include studies of high relevance to our topic. While we acknowledge that including additional databases may yield a broader range of results, we believe our focus on PubMed allowed us to maintain scientific rigor while effectively addressing our research questions. This rationale has been added to the methodology section.
We understand your concern about the use of a single keyword. Our approach was purposefully designed to ensure a focused and high-recall search for studies specifically related to ChatGPT's applications in healthcare. By using the term “ChatGPT” without additional qualifiers, we aimed to capture all relevant studies in the defined scope of our review.
To complement this focused approach, we incorporated a dual-phase screening process to enhance the reliability of the study selection:
A. Automated Screening with ChatGPT: The initial screening emphasized achieving 100% recall, ensuring no potentially relevant studies were missed.
B. Comprehensive Manual Validation: Following the automated screening, human reviewers conducted a detailed secondary screening, applying rigorous inclusion and exclusion criteria to ensure only the most relevant studies were selected.
We acknowledge that large language models, including ChatGPT, are prone to hallucination issues. To address this, we carefully designed our prompt to prioritize 100% recall, ensuring that the risk of excluding an eligible study was minimized. ChatGPT was instructed to err on the side of inclusion, passing any potentially relevant study to the second screening phase, which was independently conducted by human reviewers. As a result, while ChatGPT labeled many ineligible studies as eligible, these were subsequently excluded during the human review process. In essence, we intentionally sacrificed precision to maximize recall and safeguard the inclusion of all potentially relevant studies.
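To make this recall-first design concrete, a hypothetical sketch of such a screening call is shown below; the model name, prompt wording, and helper function are our own illustrative assumptions and not the exact prompt used in the study:

```python
# Hypothetical sketch of a recall-oriented screening call (illustrative only).
# The instruction deliberately tells the model to include a record whenever in
# doubt, trading precision for recall; ambiguous inclusions are resolved by the
# subsequent human screening phase.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are screening abstracts for a systematic review on the reliability of "
    "ChatGPT for health-related questions. If there is ANY chance the study is "
    "relevant, answer INCLUDE; answer EXCLUDE only when it is clearly irrelevant."
)

def screen_abstract(title: str, abstract: str) -> str:
    """Return 'INCLUDE' or 'EXCLUDE' for one record, biased toward inclusion."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name, not necessarily the one used
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Title: {title}\n\nAbstract: {abstract}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```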
Regarding the timeline of the study, our systematic review was designed to provide a snapshot of the state of research on ChatGPT in the medical domain at the time of the review, with a search period ending in December 2023. We acknowledge that new studies have likely emerged since then, and we have addressed this limitation in the discussion section, highlighting the need for future updates to capture the latest developments.
While we have not extended our review to include studies published through 2024, we believe the analysis of studies from 2023 provides valuable insights into trends, challenges, and opportunities associated with ChatGPT in the medical domain. This foundational understanding can serve as a basis for future systematic reviews or meta-analyses that incorporate more recent literature. We hope this clarification addresses your concerns and demonstrates the relevance of our findings despite the temporal constraints.
Reviewer 3 Report
Comments and Suggestions for Authors
(line 69) Why was the search for scientific articles conducted using only PubMed? Since, as can be seen later in the section "Limitations of the study", this choice had a precise reason, it would be necessary to introduce it also in this section as a specific premise.
(line 71) Was the search for scientific articles conducted on a single date? Is this the meaning of what is stated in line 71?
Author Response
Summary
The article offers an interesting reflection on the scientific literature that investigates the effectiveness of ChatGPT as a “medical consultant”. This analysis is important due to the ever-increasing use of artificial intelligence and the trust people place in it regarding the accuracy of the medical indications provided.
Article
The article does not appear to have fully complied with all the items indicated in the PRISMA method for systematic reviews. The abstract does not correspond, in its structure, to the requirements of the PRISMA 2020 guidelines that the authors declared to have followed. Furthermore, the information relating to the sources consulted is not made explicit: the PubMed database is cited as the only source consulted but without defining the reasons, except in the "Limitations of the study" section. Instead, it should be made explicit as an integral part of the methodology followed.
Review
The topic discussed and the innovative methodology introduced to make the analysis of the scientific literature on the topic more accurate are very interesting. A training session was conducted through the creation of a specific algorithm for the exact instrument under investigation, and this allowed the authors to perform a sort of meta-analysis, the accuracy of which could be interesting to evaluate.
Specific comments
(line 69) Why was the search for scientific articles conducted using only PubMed? Since, as can be seen later in the section "Limitations of the study", this choice had a precise reason, it would be necessary to introduce it also in this section as a specific premise.
Thank you very much for your valuable feedback and comment. We selected PubMed because it is a widely recognized and comprehensive resource for peer-reviewed biomedical and healthcare literature. Given the focus of our systematic review on the healthcare applications of ChatGPT, we prioritized a database that would most likely include studies of high relevance to our topic. While we acknowledge that including additional databases may yield a broader range of results, we believe our focus on PubMed allowed us to maintain scientific rigor while effectively addressing our research questions.
To address your feedback, we have added this explanation to the methodology section as a specific premise.
(line 71) Was the search for scientific articles conducted on a single date? Is this the meaning of what is stated in line 71?
Thank you for your feedback and comment. This is correct. The search was conducted on a single date, December 15th, 2023. This approach was chosen to establish a clear cutoff point for analysis and to focus the review on the state of the literature at that time. We have clarified this point in the manuscript to ensure transparency.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have clarified my questions in sufficient detail. I think this study can be accepted for publication in its current form.
Author Response
Thank you for your thoughtful review of our manuscript. We appreciate your valuable insights, which have significantly enhanced the quality of our paper.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors did not respond to my previous comments appropriately. They insisted that using the single word “ChatGPT” and conducting the search on PubMed is a “high recall” method. I don’t think the result is accurate and “high-recall”. The authors said they used PubMed because their focus was on the medical field; however, many existing literature reviews regarding large language models in the medical field used multiple databases for the initial search (e.g., https://pubmed-ncbi-nlm-nih-gov.ezp-prod1.hul.harvard.edu/38639098/, https://jamanetwork-com.ezp-prod1.hul.harvard.edu/journals/jama/fullarticle/10.1001/jama.2024.21700, https://www.cell.com/iscience/fulltext/S2589-0042(24)00935-0). I have personally never seen a peer-reviewed article regarding LLMs in medicine with acceptable quality that used only one database and one keyword for the initial search. Additionally, I conducted a search myself using the keyword “ChatGPT” on PubMed, but I found there were over 2,000 publications from 2023 to 2024 (i.e., papers published in 2024 were not included), and the number was largely different from what the authors reported, which was 1,101. Therefore, the results do not seem reliable to me.
Author Response
Using PubMed as a single database
Thanks for thoroughly reviewing the paper and providing your feedback. Regarding the use of PubMed as the sole database, we carefully selected this platform due to its extensive coverage of high-quality, peer-reviewed literature in the biomedical and healthcare domains. Our systematic review specifically focused on the healthcare applications of ChatGPT, and we prioritized a database most relevant to this field to ensure the relevance and rigor of our results. While we recognize the merits of including additional databases, our choice of PubMed was made with the intention of maintaining a focused and manageable scope, which is critical given the exploratory nature of this study. In future work, an updated study can be conducted with a larger scope, examining more recently developed models (GPT o1, o3, Claude 3, etc.).
To further substantiate our findings, we replicated the search in Scopus using the same parameters applied in PubMed (“ChatGPT,” year: 2023, subject area: Medicine, document type: Article or Conference Paper, English language only, and final publication stage). This yielded 552 results, of which 97 were not indexed in PubMed. An initial GPT-assisted screening suggested 27 of these might qualify for our study. However, upon manual review, only 6 ultimately met our inclusion criteria. Although relying solely on PubMed is a recognized limitation (as acknowledged in our manuscript’s limitations section), the addition of these 6 articles would not meaningfully alter our study’s findings or conclusions.
We appreciate your reference to existing literature reviews that utilized broader search strategies and multiple databases. While such methodologies are well-suited for different scopes, we believe our focused approach aligns with the specific research questions we sought to address. Nevertheless, we have revised the manuscript to explicitly acknowledge this limitation and suggest expanding the scope of future reviews to include multiple databases and broader keyword strategies.
Recall explanation
We acknowledge your observation regarding the use of a single keyword, "ChatGPT," in our search strategy. Allow us to clarify our rationale further and explain why we believe this approach provided a focused yet comprehensive dataset for our review. Our systematic review specifically aimed to analyze the application of ChatGPT, a specific large language model, in the medical domain. Therefore, we deliberately selected the single keyword "ChatGPT" to ensure our search results were narrowly tailored to this model while remaining general with respect to application and domain. We avoided studies on other large language models or general AI technologies that do not directly address the research questions. Using additional terms or broadening the keyword scope (e.g., including "large language models" or "LLMs") would have introduced significant noise and irrelevant studies into our dataset, diluting the specificity and focus of our review.
To support the validity of our approach:
- "ChatGPT" as a keyword directly aligns with our study's scope, focusing on this specific technology. Using this term exclusively minimized the inclusion of studies irrelevant to our objectives while still ensuring comprehensive coverage of research related to ChatGPT. Using a more complex PubMed query may result in including fewer ChatGPT papers thus reducing recall.
- While the initial search was focused on "ChatGPT" for high recall using the PubMed database, we implemented a dual-phase screening process:
1) Automated screening with ChatGPT: Again, we aimed for 100% recall in this automated screening step, which we achieved using carefully engineered prompts. We validated the 100% recall with a manual check in which we examined a randomly selected set of 100 ChatGPT-reviewed articles and noted no false negatives. We believe that 100 articles (out of ~1,000) is a representative sample, and we have high confidence that this step resulted in little to no false negatives (missed relevant articles). As described in the manuscript, this step excluded 625 articles, leaving 476 papers.
2) Manual screening: While we have high confidence that the first step achieved 100% recall, all 476 articles flagged as relevant by ChatGPT were then manually screened by the authors; 263 papers were excluded at this stage, leaving a total of 128 papers considered eligible for this study.
Discrepancy in numbers
We sincerely appreciate your observation regarding the discrepancy between the number of papers we reported and the number you found. Upon revisiting our search process, we identified the root cause of this discrepancy. Our initial PubMed search (conducted on 12/15/2023) yielded 1,940 results. However, 839 of these results were missing abstracts and were excluded from our screening process. Unfortunately, we had overlooked this initial exclusion in our manuscript. We are grateful for your feedback, which prompted us to identify and correct this oversight. The manuscript has been revised to address this issue.
This explanation and the revisions made to our manuscript address your concerns and provide greater clarity regarding our approach. We are grateful for your valuable feedback, which has significantly contributed to improving the presentation and scope of our study.
Reviewer 3 Report
Comments and Suggestions for Authors
Thank you for your response to the requests previously sent.
Author Response
Thank you for your thoughtful review of our manuscript. We appreciate your valuable insights, which have significantly enhanced the quality of our paper.
Round 3
Reviewer 2 Report
Comments and Suggestions for Authors
N/A