Responses of Artificial Intelligence Chatbots to Testosterone Replacement Therapy: Patients Beware!
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Overall, the authors ask a reasonable question about TRT and the related information available from various LLMs. They evaluate four LLMs for quality and readability.
The discussion section needs significant revision. Currently, it simply restates the results. It needs further context: what other evidence on this subject is already available in the literature, and how does this study add to or compare with it? What are the limitations of the study? What are the takeaways from the findings, and how can they be applied to patient care?
Author Response
We would like to thank the reviewer for their insightful comments. Please find attached our revised manuscript for publication in SIUJ. Please find below our detailed responses to the reviewer's comments.
We appreciate your time and effort in reviewing our manuscript and would be happy to make further revisions that you feel might be needed.
Reviewer 1
Comment 1: The discussion section needs significant revision. Currently, it simply restates the results. It needs further context: what other evidence on this subject is already available in the literature, and how does this study add to or compare with it?
Response: We would like to thank the reviewer for this suggestion. The discussion section has been expanded to include the reviewer’s suggestions.
Comment 2: What are the limitations of the study?
Response: We would like to thank the reviewer for this suggestion. We have modified the discussion section to include the limitations of the study:
“Our study is limited by the inherent stochastic nature of AI chatbots, as their responses can vary for similar prompts. This reflects their training process, which relies on probabilistic algorithms to generate responses, leading to inconsistencies. Further research is required to assess whether multiple answers to the same question show variability in the scores. Furthermore, the questions were asked sequentially, so the first response could influence subsequent responses. Our goal was to replicate the patient experience, in which multiple questions might be asked in a single interaction.”
Comment 3: What are the takeaways from the findings and how can it be applied to patient care?
Response: We thank the reviewer for this suggestion. We have modified the discussion section to include the clinical implications of this study:
“AI chatbots provide instant access to information and can offer support to patients across various settings, from addressing questions about treatment plans and providing crucial information before surgery to assisting with the interpretation of lab results. Multiple studies have demonstrated variability in quality scores across different AI chatbots, highlighting the differences in the information they provide across various healthcare topics. Relying on a single chatbot can yield incomplete or inaccurate information. Therefore, patients should be encouraged to validate chatbot-generated advice by consulting healthcare professionals. Furthermore, to improve readability, prompts such as ‘Explain it to me in simple terms’ may be useful [21].
From a physician’s perspective, it is essential to recognize the limitations of these chatbots and engage in thorough discussions with patients to dispel common myths and misconceptions.”
Reviewer 2 Report
Comments and Suggestions for Authors
I commend the authors for thoroughly analyzing the differences in reading level, comprehension, and quality among five AI chatbots providing information on Testosterone Replacement Therapy. In a time marked by the widespread development and use of chatbots, healthcare professionals need to assess the quality of these natural language processors to ensure that the general public receives accurate, readily comprehensible information.
While this study is well executed and presents its findings in a clear and concise manner, there are some major concerns that must be addressed before the manuscript can be accepted for publication:
1) How and by whom were the questions posed to the chatbots prepared?
2) Were the chatbots asked each question only once? Was all browser-related data deleted between questions? Chatbots may produce different answers when asked the same question repeatedly, and not deleting browser data may introduce bias into the responses.
3) Please indicate in the materials and methods section that ethics committee approval or an IRB number is not required.
4) The discussion section should provide a more detailed analysis of how these chatbots can be developed in the future.
5) In the discussion section, I also recommend that you take advantage of the following recent study that used the same analysis methods:
Şahin, M. F., Topkaç, E. C., Dogan, C., Şeramet, S., Özcan, R., Akgül, H. M., & Yazici, C. M. (2024). Still using only ChatGPT? The comparison of five different artificial intelligence chatbots’ answers to the most common questions about kidney stones. Journal of Endourology, (ja).
Author Response
We would like to thank the reviewer for their insightful comments. Please find attached our revised manuscript for publication in SIUJ. Please find below our detailed responses to the reviewer's comments.
We appreciate your time and effort in reviewing our manuscript and would be happy to make further revisions that you feel might be needed.
Reviewer 2
Comment 1: How and by whom were the questions asked to chatbots prepared?
Response: We would like to thank the reviewer for this suggestion and have incorporated the following in the materials and methods section:
“The questions were prepared after deliberation between a board-certified urologist and medical students, based on patient experiences and the questions about TRT commonly encountered in the outpatient clinic.”
Comment 2: Were the chatbots asked questions only once? Was all browser-related data deleted between questions? Chatbots may repeatedly produce different answers to the same questions, and not deleting browser data may create bias in the questions.
Response: We would like to thank the reviewer for this suggestion and have incorporated the following in the materials and methods section:
“To prevent bias, the chatbots were accessed in incognito mode so that browser history would not affect the answers. The questions were asked only once, in sequential order, and remained the same across all chatbot engines.”
We appreciate the bias that may be introduced and recognize it as a limitation of our study:
“Furthermore, the questions were asked sequentially, so the first response could influence subsequent responses. Our goal was to replicate the patient experience, in which multiple questions might be asked in a single interaction.”
Comment 3: Please indicate in the materials and methods section that ethics committee approval or an IRB number is not required.
Response: We would like to thank the reviewer for this suggestion and have incorporated the suggestion in the materials and methods section.
Comment 4: The discussion section should provide a more detailed analysis of how these chatbots can be developed in the future.
Response: We thank the reviewer for this suggestion. We have modified the discussion section to include the future development of chatbots:
“Future development
As technology advances, the quality and readability of information from chatbots are likely to improve. Of the three chatbots evaluated, only ChatGPT lacked citations in its responses, highlighting a critical area for improvement. Although many citations were from the Mayo Clinic, the Cleveland Clinic, Harvard Health Publishing, and WebMD, further progress could be achieved by referencing exclusively such trusted and regulated resources. Additionally, a user feedback portal for reporting misinformation could serve as an additional means of maintaining the accuracy and reliability of the information generated by chatbots.”
Comment 5: In the discussion section, I also recommend that you take advantage of the following recent study, which used the same analysis methods:
Şahin, M. F., Topkaç, E. C., Dogan, C., Şeramet, S., Özcan, R., Akgül, H. M., & Yazici, C. M. (2024). Still using only ChatGPT? The comparison of five different artificial intelligence chatbots’ answers to the most common questions about kidney stones. Journal of Endourology, (ja).
Response: We thank the reviewer for the recommendation. We have modified the discussion section further as per the suggestions.
Reviewer 3 Report
Comments and Suggestions for Authors
- The manuscript does not clearly state its novel contribution to the field.
- Make the abstract more concise and informative by including key results.
- Expand the introduction to address specific gaps in knowledge about TRT that motivated this study.
- Clarify why these four AI chatbots were selected and why other popular tools (e.g., newer versions of ChatGPT) were not considered.
- The reasoning behind chatbot question formulation is not well-explained. Do they represent common patient questions?
- There's no explanation of how the findings could affect patient comprehension or decision-making.
- The discussion of the chatbot results is inadequate. The findings need clearer explanation.
- The section on the hallucination effect and AI limitations is underdeveloped. More details are needed on specific chatbot errors, such as incorrect medical advice or missing critical safety information.
- Strengthen the conclusion by discussing the implications for future chatbot development and regulation.
- Offer actionable recommendations for clinicians and regulators based on the study’s findings.
- Add more detail to figure legends and ensure tables contain all necessary information.
- The discussion lacks sufficient exploration of the significance of the findings, particularly in providing clinically relevant insights into the differences between chatbots.
- Update the references to include recent studies on AI chatbots in healthcare.
Author Response
We would like to thank the reviewer for their insightful comments. Please find attached our revised manuscript for publication in SIUJ. Please find below our detailed responses to the reviewer's comments.
We appreciate your time and effort in reviewing our manuscript and would be happy to make further revisions that you feel might be needed.
Reviewer 3
Comment 1: The manuscript does not clearly state its novel contribution to the field.
Response: We would like to thank the reviewer for the comment. We have expanded the introduction to include the information that motivated our study. To the best of our knowledge, no prior study has evaluated responses from AI chatbots on Testosterone Replacement Therapy. We have made the necessary changes and believe the manuscript offers insights into the current technology, its strengths and limitations, and future improvements.
Comment 2: Make the abstract more concise and informative by including key results.
Response: We would like to thank the reviewer for this suggestion. We have made modifications to the abstract as per the reviewer’s suggestion.
Comment 3: Expand the introduction to address specific gaps in knowledge about TRT that motivated this study.
Response: We would like to thank the reviewer for this suggestion and have incorporated the following statement in the introduction section:
“Testosterone Replacement Therapy (TRT) is a subject clouded by misinformation, and there are gaps in patients’ knowledge about its benefits and risks, with one survey showing that 50% of respondents were unaware of the risks [10]. The Internet was found to be one of the top sources for accessing information on TRT [10]. The symptoms of low testosterone, including reduced energy, sex drive, and erectile function, can carry stigma, leading patients to resort to online resources to address their concerns.”
Comment 4: Clarify why these four AI chatbots were selected and why other popular tools (e.g., newer versions of ChatGPT) were not considered.
Response: We would like to thank the reviewer for this suggestion and have incorporated the following in the materials and methods section:
“The chatbots were selected based on feedback from urology attendings and residents, who identified those that were popular among the patient population. The latest free versions of these chatbots were selected, as they are readily available and widely accessible to patients.”
Comment 5: The reasoning behind chatbot question formulation is not well-explained. Do they represent common patient questions?
Response: We would like to thank the reviewer for this suggestion and have incorporated the following in the materials and methods section:
“The questions were prepared after deliberation between a board-certified urologist and medical students, based on patient experiences and the questions about TRT commonly encountered in the outpatient clinic (Figure 1). The questions were asked in simple language to simulate a patient’s perspective.”
Comment 6: There's no explanation of how the findings could affect patient comprehension or decision-making.
Response: We would like to thank the reviewer for this suggestion and have expanded the discussion section to include the clinical implications:
“AI chatbots provide instant access to information and can offer support to patients across various settings, from addressing questions about treatment plans and providing crucial information before surgery to assisting with the interpretation of lab results. Multiple studies have demonstrated variability in quality scores across different AI chatbots, highlighting the differences in the information they provide across various healthcare topics. Relying on a single chatbot can yield incomplete or inaccurate information. Therefore, patients should be encouraged to validate chatbot-generated advice by consulting healthcare professionals. Furthermore, to improve readability, prompts such as ‘Explain it to me in simple terms’ may be useful.”
Comment 7: The discussion of the chatbot results is inadequate. The findings need clearer explanation.
Response: We thank the reviewer for their suggestion. We have revised the results section to provide a clearer explanation of the findings.
Comment 8: The section on the hallucination effect and AI limitations is underdeveloped. More details are needed on specific chatbot errors, such as incorrect medical advice or missing critical safety information.
Response: We would like to thank the reviewer for this suggestion and have included the following in the introduction section:
“Alkaissi and McFarlane discuss this phenomenon in detail, highlighting an instance where ChatGPT provided inaccurate information on the mechanism of homocysteine-induced osteoporosis and, when prompted, provided incorrect citations [10].”
Comment 9: Strengthen the conclusion by discussing the implications for future chatbot development and regulation.
Response: We thank the reviewer for this suggestion. We have expanded the discussion section and modified the conclusion to include future chatbot development and regulation:
“As technology advances, the quality and readability of information from chatbots are likely to improve. Of the three chatbots evaluated, only ChatGPT lacked citations in its responses, highlighting a critical area for improvement. Although many citations were from the Mayo Clinic, the Cleveland Clinic, Harvard Health Publishing, and WebMD, further progress could be achieved by referencing exclusively such trusted and regulated resources. Additionally, a user feedback portal for reporting misinformation could serve as an additional means of maintaining the accuracy and reliability of the information generated by chatbots.”
Comment 10: Offer actionable recommendations for clinicians and regulators based on the study’s findings.
Response: We thank the reviewer for this suggestion. We have expanded the discussion section to provide recommendations for clinicians and regulators:
“From a physician’s perspective, it is essential to recognize the limitations of these chatbots and engage in thorough discussions with patients to dispel common myths and misconceptions.
Future development
As technology advances, the quality and readability of information from chatbots are likely to improve. Of the three chatbots evaluated, only ChatGPT lacked citations in its responses, highlighting a critical area for improvement. Although many citations were from the Mayo Clinic, the Cleveland Clinic, Harvard Health Publishing, and WebMD, further progress could be achieved by referencing exclusively such trusted and regulated resources. Additionally, a user feedback portal for reporting misinformation could serve as an additional means of maintaining the accuracy and reliability of the information generated by chatbots.”
Comment 11: Add more detail to figure legends and ensure tables contain all necessary information.
Response: We thank the reviewer for this suggestion. We have made the relevant changes to the figures and tables so that they convey the necessary information.
Comment 12: The discussion lacks sufficient exploration of the significance of the findings, particularly in providing clinically relevant insights into the differences between chatbots.
Response: We thank the reviewer for the suggestion. We have revised the discussion section to include how our research aligns with or differs from the available literature, the clinical implications, and the future development of chatbots.
Comment 13: Update the references to include recent studies on AI chatbots in healthcare.
Response: We would like to thank the reviewer for this suggestion. We have updated the discussion section to include references to recent studies on AI chatbots in healthcare.