Article

Comparative Evaluation of Artificial Intelligence Models for Contraceptive Counseling

by Anisha V. Patel, Sona Jasani, Abdelrahman AlAshqar, Rushabh H. Doshi, Kanhai Amin, Aisvarya Panakam, Ankita Patil and Sangini S. Sheth
1 Department of Obstetrics, Gynecology, and Reproductive Sciences, Yale School of Medicine, New Haven, CT 06510, USA
2 Department of Internal Medicine, Yale School of Medicine, New Haven, CT 06510, USA
3 Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CT 06520, USA
4 Department of Obstetrics and Gynecology, University of Pittsburgh Medical Center, Pittsburgh, PA 15219, USA
5 Department of Medicine, Division of Women’s Health, Brigham and Women’s Hospital, Boston, MA 02115, USA
* Author to whom correspondence should be addressed.
Digital 2025, 5(2), 10; https://doi.org/10.3390/digital5020010
Submission received: 21 January 2025 / Revised: 4 March 2025 / Accepted: 20 March 2025 / Published: 25 March 2025

Abstract

Background: As digital health resources become increasingly prevalent, assessing the quality of information provided by publicly available AI tools is vital for evidence-based patient education. Objective: This study evaluates the accuracy and readability of responses from four large language models—ChatGPT 4.0, ChatGPT 3.5, Google Bard, and Microsoft Bing—in providing contraceptive counseling. Methods: A cross-sectional analysis was conducted using standardized contraception questions, established readability indices, and a panel of blinded OB/GYN physician reviewers comparing model responses to an AAFP benchmark. Results: The models varied in readability and evidence adherence; notably, ChatGPT 3.5 provided more evidence-based responses than GPT-4.0, although all outputs exceeded the recommended 6th-grade reading level. Conclusions: Our findings underscore the need for the further refinement of LLMs to balance clinical accuracy with patient-friendly language, supporting their role as a supplement to clinician counseling.

1. Introduction

As digital resources increasingly become the first line of inquiry for many patients with medical concerns, the accuracy and comprehensibility of online health information are of paramount concern [1]. This is particularly true in the field of reproductive health, where decisions informed by online resources can have profound implications on individual health outcomes and public health at large. Contraception counseling, with its intricate balance of medical, personal, and social considerations, exemplifies a clinical area where the quality and accuracy of information is critical.
The integration of artificial intelligence (AI) into health communication has opened novel avenues for patient education [2,3,4]. Large language models (LLMs), such as OpenAI’s ChatGPT series and counterparts from other tech entities, have demonstrated impressive capabilities in generating human-like text responses [5,6,7,8]. These models are trained on vast corpora of data, enabling them to simulate a broad understanding of human language and knowledge. Their application in healthcare has the potential to revolutionize patient engagement by providing instant, personalized responses to health inquiries. Nonetheless, the deployment of such models in a clinical context demands rigorous evaluations to ensure that the information provided is both accurate and accessible [9,10,11].
Accuracy in medical information is multi-dimensional, encompassing not only the factual correctness of the content but also its alignment with current clinical guidelines and evidence-based practices [12]. The dynamic nature of medical knowledge, with continuous updates and revisions to guidelines, poses a substantial challenge for LLMs, whose training datasets may not reflect the most recent consensus. Moreover, the nuances of medical decision-making often require context-sensitive information that generalist AI models cannot fully provide, irrespective of their sophistication [13].
Readability, a measure of the ease of understanding a text, is another critical dimension that impacts the utility of information provided by LLMs. Health literacy, or the ability to obtain, process, and understand basic health information to make appropriate health decisions, is a fundamental right. Yet, it remains a challenge for a significant portion of the population. The American Medical Association recommends that patient education materials be written at a 6th-grade reading level to be widely comprehensible [14,15]. However, the advanced language capabilities of LLMs often produce content that surpasses this threshold, potentially alienating those with limited health literacy [16].
This study seeks to systematically evaluate the accuracy and readability of responses from four leading LLMs on a range of common contraception-related questions. By benchmarking these AI-generated responses against the evidence-based answers in a recent review on contraception published by the American Academy of Family Physicians (AAFP) [17], we aim to critically assess the readiness of LLMs for deployment in patient counseling scenarios. While prior studies have examined ChatGPT in OB/GYN, none have evaluated all four of these LLMs. This investigation is grounded in the aim of ensuring that emerging AI tools enhance, rather than hinder, patient autonomy in reproductive decision-making. The ultimate goal of this project is to inform the development of AI applications in healthcare that are not only technically proficient but also ethically aligned with the principles of patient-centered care.

2. Methods

This cross-sectional analysis was designed to evaluate and compare the performance of four contemporary LLMs in providing readable and accurate responses to common contraception questions. This study was determined to be exempt from review by the Yale University institutional review board and followed the STROBE reporting guidelines [18]. The LLMs included in this study were OpenAI’s ChatGPT 4.0, OpenAI’s ChatGPT 3.5, Google Bard, and Microsoft Bing. These models were selected based on their widespread use, advanced language-generation capabilities, and diverse underlying technologies. Each model was accessed via its publicly available interface on 10 June 2023.
The contraception questions, drawn from the AAFP review, represent a comprehensive range of topics commonly encountered in clinical practice. They were selected after a review of current clinical guidelines and in consultation with two OB/GYN physicians to ensure they address key aspects of contraceptive counseling, including efficacy, safety, and patient-specific considerations. Six questions were chosen to cover a spectrum of common contraceptive counseling scenarios encountered by healthcare providers (Table 1). Each LLM was queried with the six questions, and the resulting responses were recorded verbatim. To mitigate potential biases, each model’s responses were collected in a new chat session to prevent learning or tailoring responses based on previous interactions. Responses were anonymized and formatting was standardized to prevent identification of the source LLM by the reviewers.
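For teams replicating this design against a model that exposes a programmatic interface, the sketch below illustrates the two key controls described above: a fresh session per question and verbatim capture of each response. This is a hypothetical reconstruction for illustration only; the study itself queried each model through its public web interface, and the model name shown is a placeholder.

```python
# Hypothetical sketch of the fresh-session query protocol; the study itself
# used each model's public web chat interface, not an API.
from openai import OpenAI

QUESTIONS = [
    "What forms of emergency contraception are effective?",
    "Are fertility awareness methods of contraception effective?",
    "What contraceptive methods are less safe for people with migraines?",
    "How long does long-acting reversible contraception remain effective?",
    ("Can depot medroxyprogesterone acetate be self-administered "
     "subcutaneously? What are the effects on bone mineral density?"),
    ("What are contraception considerations in transgender and "
     "gender-diverse people with a uterus?"),
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

responses = {}
for question in QUESTIONS:
    # A brand-new message list per question plays the role of a new chat
    # session: no earlier answer can condition the next response.
    reply = client.chat.completions.create(
        model="gpt-4",  # placeholder; not necessarily the version studied
        messages=[{"role": "user", "content": question}],
    )
    responses[question] = reply.choices[0].message.content  # verbatim capture
```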
For the readability analysis, the LLMs’ outputs were assessed by employing four established readability indices: the Gunning–Fog Index (GF), Flesch–Kincaid Grade Level (FK), Automated Readability Index (AR), and Coleman–Liau Index (CL) [19,20]. These indices were chosen for their widespread recognition and use in assessing health education materials. Readability scores for all four indices align with U.S. school grade levels, with a score of 6 indicating a sixth-grade reading level. To standardize the outputs and ensure an equal comparison, all formatting was removed, including bullet points and numbered lists, as is consistent with previous readability studies [16,21]. Ancillary response information (e.g., “please note I am not a medical professional”) was also removed to focus the analysis on the clinical content.
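To make the readability computation concrete, the following sketch implements the four indices from their standard published formulas. It is a minimal illustration rather than the study’s actual pipeline; in particular, the syllable counter is a crude vowel-group heuristic, so GF and FK values may differ slightly from those of dedicated calculators.

```python
import re

def _sentences(text):
    # Crude sentence split on terminal punctuation, as readability formulas assume.
    return max(1, len([s for s in re.split(r"[.!?]+", text) if s.strip()]))

def _words(text):
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

def _syllables(word):
    # Heuristic: count vowel groups, subtracting one for a trailing silent 'e'.
    count = len(re.findall(r"[aeiouy]+", word.lower()))
    if word.lower().endswith("e") and count > 1:
        count -= 1
    return max(1, count)

def readability(text):
    words = _words(text)
    n_sent = _sentences(text)
    n_words = len(words)
    n_chars = sum(len(w) for w in words)
    n_syll = sum(_syllables(w) for w in words)
    n_complex = sum(1 for w in words if _syllables(w) >= 3)
    wps = n_words / n_sent  # average words per sentence
    return {
        # Gunning-Fog: 0.4 * (words per sentence + 100 * complex words / words)
        "GF": 0.4 * (wps + 100 * n_complex / n_words),
        # Flesch-Kincaid Grade Level
        "FK": 0.39 * wps + 11.8 * n_syll / n_words - 15.59,
        # Automated Readability Index
        "AR": 4.71 * n_chars / n_words + 0.5 * wps - 21.43,
        # Coleman-Liau: letters and sentences per 100 words
        "CL": 0.0588 * (100 * n_chars / n_words)
              - 0.296 * (100 * n_sent / n_words) - 15.8,
    }

sample = "Birth control pills are safe for most people. Ask your doctor what fits you."
scores = readability(sample)
scores["average"] = sum(scores.values()) / 4
print({k: round(v, 2) for k, v in scores.items()})
```

Averaging the four index scores, as in the final lines, mirrors the composite average reading grade level reported in Table 2.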
For the accuracy analysis, two OB/GYN physicians independently reviewed the entire response output of each LLM and the AAFP review. The physician reviewers were blinded to the source of the responses and assessed the accuracy, specificity, evidence basis, completeness, and readability of each response. The reviewers used a 5-point Likert scale, with anchors clearly defined for each criterion to maintain consistency in scoring. Descriptive statistics were used to summarize the Likert scale ratings and readability scores. All analyses were conducted using SAS version 9.4.
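The descriptive summary itself reduces to a two-way aggregation of reviewer ratings. The study used SAS version 9.4; the pandas sketch below reproduces an equivalent computation on toy data, with column names and scores chosen purely for illustration.

```python
# Illustrative pandas equivalent of the study's descriptive statistics
# (the study used SAS 9.4); the values below are toy data, not study data.
import pandas as pd

ratings = pd.DataFrame({
    "model":     ["GPT-4.0", "GPT-4.0", "GPT-3.5", "GPT-3.5"],
    "criterion": ["evidence based"] * 4,
    "reviewer":  [1, 2, 1, 2],
    "score":     [3, 2, 3, 4],  # 5-point Likert ratings from blinded reviewers
})

# Average the two reviewers' scores per model and criterion, yielding the
# kind of per-criterion summary reported in Table 3.
summary = ratings.groupby(["model", "criterion"])["score"].mean().round(2)
print(summary)
```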

3. Results

Readability analyses showed that, of the four LLMs, Google Bard had the lowest average reading grade level at 10.59, corresponding to a tenth-grade reading level. GPT-3.5 averaged 13.65, Microsoft Bing averaged 14.18, and GPT-4.0 averaged 15.44, all consistent with college-level reading (Table 2). The average reading grade level of the responses from the four LLMs exceeded the target of 6 by between 4.59 grade levels (Google Bard) and 9.44 grade levels (GPT-4.0).
The response accuracy assessment used the average of the physician reviewers’ scores for each individual LLM and AAFP response (Table 3). The AAFP reference set a high benchmark for responsiveness, with an average rating of 4.92 out of 5 for whether the question was responded to. Among the LLMs, GPT-4.0, GPT-3.5, and Google Bard were each rated 4.67 on this criterion, while Microsoft Bing was rated 4.42. In the assessment of each response’s evidence basis, the AAFP reference achieved a score of 5.0. The LLMs exhibited varying levels of adherence to evidence-based information, with GPT-3.5 at 3.17, Microsoft Bing at 2.92, and both GPT-4.0 and Google Bard at 2.83. When assessed for completeness and the presence of extraneous information, the AAFP reference scored 3.33, while GPT-4.0 scored 2.75, GPT-3.5 and Google Bard scored 2.67, and Microsoft Bing scored 2.42. In terms of referral to healthcare providers or resources, the LLMs generally showed higher referral rates, with GPT-3.5 and Google Bard both achieving a perfect score of 5.0 and GPT-4.0 close behind at 4.58. The AAFP reference garnered a lower score of 3.0, and Microsoft Bing lower still at 2.33. Finally, the use of absolutes was lowest for the AAFP reference at 1.0. Among the LLMs, GPT-3.5 best minimized absolutist language with the lowest score of 1.5, followed by Google Bard at 1.58, GPT-4.0 at 1.75, and Microsoft Bing at 2.25.

4. Comment

4.1. Principal Findings

The advent of large language models (LLMs), like OpenAI’s ChatGPT, Google Bard, and Microsoft Bing, offers a novel and rapidly expanding avenue for patient education in areas like contraception counseling. Our study aimed to systematically evaluate the accuracy and readability of responses from four leading LLMs—OpenAI’s ChatGPT 4.0 and 3.5, Google Bard, and Microsoft Bing—in providing contraception-related information. This evaluation is crucial in an era where accurate and comprehensible health information is a cornerstone of effective patient care and autonomy.

4.2. Results in the Context of What Is Known

Our results indicate that all LLMs generally responded to the question asked but differed notably in their adherence to evidence-based information. All LLMs fell short of the AAFP’s perfect score for evidence basis. This gap highlights a critical limitation of current LLMs: they may not always draw from the most current evidence-based guidelines or reliably reason through nuance, both crucial aspects of medical decision-making. Interestingly, GPT-3.5 scored highest among the LLMs in this regard, and its outperformance of GPT-4.0 suggests that newer versions of these models are not necessarily better at providing reliable information. When asked “What contraceptive methods are less safe for people with migraines?”, GPT-3.5 cited guidelines issued by the American College of Obstetricians and Gynecologists (ACOG) and the World Health Organization (WHO); however, GPT-4.0 failed to reference any established guidelines in its response. In other instances, responses from GPT-3.5 and GPT-4.0 were inconsistent with one another. When asked “Are fertility awareness methods (FAMs) of contraception effective?”, GPT-4.0 wrote “However, perfect use of FAMs can result in lower failure rates, typically around 1-5%,” while GPT-3.5 wrote “However, with perfect use, the failure rate can be much lower, around 2-3% for some methods.” The sources of these efficacy values are unclear, and the inconsistency raises concern about the training material from which LLMs develop their evidence basis.
An analysis of response completeness and the presence of extraneous information also revealed an unsatisfactory performance by LLMs compared to the AAFP review. When asked “What forms of emergency contraception are effective?”, all LLMs failed to address the impact of body mass index (BMI) on emergency contraception methods. The risk of pregnancy with common forms of emergency contraception varies significantly in obese and overweight patients [22]. This suggests that while LLMs can provide relevant information, they may also fail to include important details, potentially leading to misinformation or confusion.
Regarding referral to healthcare providers or resources, LLMs generally outperformed the AAFP reference. This is encouraging, as it demonstrates the models’ capacity to recognize the limits of their advice and the importance of professional medical consultation. The lower use of absolutes in responses, particularly by the AAFP reference and GPT-3.5, is a significant finding. Over-reliance on absolutes can be misleading in medicine, where uncertainties and patient-specific factors often play a substantial role. This highlights the nuanced understanding required in conveying medical information, an area where LLMs can be further optimized.
In terms of readability, Google Bard and GPT-3.5 offered content that was somewhat more accessible to the general public, as indicated by lower readability scores. In contrast, responses from Microsoft Bing and GPT-4.0 were at a higher grade level, which could be less approachable for individuals with limited health literacy. This finding is crucial, considering the recommendation for patient education materials to be at a 6th-grade reading level. It suggests a need for further refinement in the LLMs’ language generation algorithms to make health information more universally accessible.
The range of readability scores of LLM-generated content is a striking aspect of our results. The higher reading levels of LLM responses, which far surpassed the recommended standards for patient education materials, raise concerns about the equitable delivery of health information. This is critical, given that lower health literacy is associated with poorer health outcomes and greater use of health services. The challenge lies in developing LLMs capable of conveying complex medical information in simplified terms without compromising content quality, a balance that is yet to be fully achieved.

4.3. Clinical Implications

Our findings indicate significant variability in both the accuracy and readability of contraceptive information provided by large language models (LLMs), which limits their utility as standalone sources for patient counseling. If LLMs are to be integrated into the clinical space, they should supplement, rather than replace, direct patient–provider interactions. This approach ensures contextual relevance and maintains the nuances essential for individualized patient care. It is of utmost importance to adopt an equitable framework for any clinical use of LLMs, including verification of generated content against current, evidence-based clinical guidelines.

4.4. Research Implications

This work underscores the need for ongoing efforts to refine large language models for medical use. Future studies should focus on developing adaptive algorithms that can incorporate real-time clinical data, keeping the information aligned with current evidence-based standards. Broadening the training data to cover a wider array of clinical scenarios, and accounting for factors such as patient perspectives and healthcare equity, would make these tools more practical and ethically sound. Engaging a more diverse pool of reviewers and examining patient-centered outcomes in greater depth would also help sharpen the health information that LLMs deliver.

4.5. Strengths and Limitations

Our study has several strengths. By comparing four different LLMs, we offer a broad analysis of the currently dominant artificial intelligence systems and can assess trends across models. For our readability analysis, we used four different indices to derive a composite average, thereby minimizing the limitations of any one index. For our accuracy analysis, we utilized a peer-reviewed AAFP article as a benchmark, ensuring a vetted and rigorous standard for accurate responses to contraception questions. Furthermore, the OB/GYN physician evaluators were blinded to the source of each response, thereby minimizing preconceived biases about LLM performance. Throughout the evaluative process, a standardized grading scale with clearly defined anchors for each criterion maintained consistency in scoring.
The cross-sectional nature of this study limits our ability to assess the consistency of LLMs over time; given the rapid growth of the field, ongoing evaluations are needed. Second, our analysis was confined to a single set of contraception questions. The scope of questions assessed may not encompass the full breadth of scenarios encountered by patients, nor does it capture the interactive and iterative nature of patient–provider conversations. Additionally, the reliance on a small panel of reviewers may not represent a consensus across the broader medical community, which highlights the need for more diverse reviewer perspectives in future evaluations. Moreover, we were unable to explore broader assessment criteria, potentially overlooking critical dimensions of healthcare information quality. A further examination of patient perspectives would provide deeper insights into the real-world applicability and effectiveness of LLM-generated health information. Furthermore, our focus on readability indices, though standardized, does not fully capture the comprehensiveness and nuance of language comprehension. These reading-level indices are formulas for readability but do not account for factors such as cultural relevance, personal experience, or emotional tone, which are crucial to patient education. Lastly, the AAFP benchmark text was intended for an audience of healthcare providers, and some of its content may not have been as relevant for a general consumer audience.

5. Conclusions

This comprehensive study of LLM use for contraception counseling reveals both the potential and the limitations of AI in patient education. While LLMs like OpenAI’s ChatGPT, Google Bard, and Microsoft Bing can provide useful information and effectively guide users toward professional medical consultation, significant gaps remain in aligning their outputs with current evidence-based guidelines and ensuring readability at recommended levels. Newer versions of these LLMs do not consistently outperform older versions, highlighting that the organic evolution of these models does not necessarily equate to more accurate or accessible patient health information. Successive iterations of these LLMs must keep these goals in mind if healthcare applications are to see marked improvement.
The study highlights the importance of ensuring accurate and accessible health information in newly developed LLMs in order to provide equitable health communication and minimize misinformation to consumers. It also underscores the importance of AI tools as supplements, rather than replacements, for direct patient–physician interactions. Looking forward, enhancing LLMs’ algorithms for greater medical accuracy and adaptability in content complexity remains a priority. Simultaneously, integrating these tools into clinical practice calls for a collaborative effort among developers, clinicians, and patients to maximize their benefits and address inherent shortcomings. In light of rapid advancements in LLM technology, future studies should explore integrating multimodal data (e.g., voice and image inputs) and deploying these models in real-time clinical decision support systems. Such approaches may further personalize contraceptive counseling and improve patient outcomes. This study paves the way for future research focused on the real-world applications of LLMs in various clinical settings, aiming to improve patient outcomes and satisfaction in the evolving landscape of digital health.

Author Contributions

Conceptualization, A.V.P. and S.S.S.; methodology, A.V.P.; software, R.H.D. and A.P. (Ankita Patil); validation, A.A. and S.J.; formal analysis, A.V.P. and R.H.D.; resources, S.S.S.; writing—original draft preparation, A.V.P.; writing—review and editing, K.A. and A.P. (Aisvarya Panakam); supervision, S.S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study did not require IRB review per 45 CFR § 46.

Informed Consent Statement

Not applicable; this study did not involve human participants.

Data Availability Statement

Data are available upon request, subject to supervisor approval on a case-by-case basis.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rodriguez, J.A.; Clark, C.R.; Bates, D.W. Digital Health Equity as a Necessity in the 21st Century Cures Act Era. JAMA 2020, 323, 2381–2382. [Google Scholar] [CrossRef] [PubMed]
  2. Amin, K.; Khosla, P.; Doshi, R.; Chheang, S.; Forman, H.P. Artificial Intelligence to Improve Patient Understanding of Radiology Reports. Yale J. Biol. Med. 2023, 96, 407–417. [Google Scholar] [CrossRef] [PubMed]
  3. Piersson, A.D.; Dzefi-Tettey, K. OC01.02: Accuracy and readability of patient-focused information on obstetrics ultrasound imaging from online sources versus ChatGPT-generated. Ultrasound Obstet. Gynecol. 2023, 62, 1–2. [Google Scholar] [CrossRef]
  4. Ahn, S. The impending impacts of large language models on medical education. Korean J. Med. Educ. 2023, 35, 103–107. [Google Scholar] [CrossRef]
  5. Ray, P.P. Bridging the gap: Integrating ChatGPT into obstetrics and gynecology research—A call to action. Arch. Gynecol. Obstet. 2023, 309, 1111–1113. [Google Scholar] [CrossRef]
  6. Lyu, Q.; Tan, J.; Zapadka, M.E.; Ponnatapura, J.; Niu, C.; Myers, K.J.; Wang, G.; Whitlow, C.T. Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: Results, limitations, and potential. Vis. Comput. Ind. Biomed. Art. 2023, 6, 9. [Google Scholar]
  7. Li, H.; Moon, J.T.; Iyer, D.; Balthazar, P.; Krupinski, E.A.; Bercu, Z.L.; Newsome, J.M.; Banerjee, I.; Gichoya, J.W.; Trivedi, H.M. Decoding radiology reports: Potential application of OpenAI ChatGPT to enhance patient understanding of diagnostic reports. Clin. Imaging 2023, 101, 137–141. [Google Scholar] [CrossRef]
  8. Ali, S.R.; Dobbs, T.D.; Hutchings, H.A.; Whitaker, I.S. Using ChatGPT to write patient clinic letters. Lancet Digit. Health 2023, 5, e179–e181. [Google Scholar] [CrossRef]
  9. Grünebaum, A.; Chervenak, J.; Pollet, S.L.; Katz, A.; Chervenak, F.A. The exciting potential for ChatGPT in obstetrics and gynecology. Am. J. Obstet. Gynecol. 2023, 228, 696–705. [Google Scholar] [CrossRef]
  10. Wan, C.; Cadiente, A.; Khromchenko, K.; Friedricks, N.; Rana, R.A.; Baum, J.D. ChatGPT: An Evaluation of AI-Generated Responses to Commonly Asked Pregnancy Questions. Open J. Obstet. Gynecol. 2023, 13, 1528–1546. [Google Scholar] [CrossRef]
  11. Allahqoli, L.; Ghiasvand, M.M.; Mazidimoradi, A.; Salehiniya, H.; Alkatout, I. Diagnostic and Management Performance of ChatGPT in Obstetrics and Gynecology. Gynecol. Obstet. Investig. 2023, 88, 310–313. [Google Scholar] [CrossRef] [PubMed]
  12. Goodman, R.S.; Patrinely, J.R.; Stone, C.A.; Zimmerman, E.; Donald, R.R.; Chang, S.S.; Berkowitz, S.T.; Finn, A.P.; Jahangir, E.; Scoville, E.A.; et al. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw. Open. 2023, 6, e2336483. [Google Scholar] [CrossRef] [PubMed]
  13. Doshi, R.H.; Bajaj, S.S.; Krumholz, H.M. ChatGPT: Temptations of Progress. Am. J. Bioeth. 2023, 23, 6–8. [Google Scholar] [CrossRef] [PubMed]
  14. Weiss, B.D. Health Literacy and Patient Safety: Help Patients Understand. In Manual for Clinicians; AMA Foundation: Berkeley, CA, USA, 2007. [Google Scholar]
  15. Hansberry, D.R.; Agarwal, N.; Baker, S.R. Health literacy and online educational resources: An opportunity to educate patients. AJR Am. J. Roentgenol. 2015, 204, 111–116. [Google Scholar] [CrossRef]
  16. Doshi, R.; Amin, K.; Khosla, P.; Bajaj, S.; Chheang, S.; Forman, H.P. Utilizing Large Language Models to Simplify Radiology Reports: A comparative analysis of ChatGPT3.5, ChatGPT4.0, Google Bard, and Microsoft Bing. medRxiv 2023. [Google Scholar] [CrossRef]
  17. Paradise, S.L.; Landis, C.A.; Klein, D.A. Evidence-Based Contraception: Common Questions and Answers. Am. Fam. Physician 2022, 106, 251–259. [Google Scholar]
  18. Ayers, J.W.; Zhu, Z.; Poliak, A.; Leas, E.C.; Dredze, M.; Hogarth, M.; Smith, D.M. Evaluating Artificial Intelligence Responses to Public Health Questions. JAMA Netw. Open. 2023, 6, e2317517. [Google Scholar] [CrossRef]
  19. Coleman, M.; Liau, T.L. A computer readability formula designed for machine scoring. J. Appl. Psychol. 1975, 60, 283–284. [Google Scholar] [CrossRef]
  20. Sare, A.; Patel, A.; Kothari, P.; Kumar, A.; Patel, N.; Shukla, P.A. Readability Assessment of Internet-based Patient Education Materials Related to Treatment Options for Benign Prostatic Hyperplasia. Acad. Radiol. 2020, 27, 1549–1554. [Google Scholar] [CrossRef]
  21. Chen, L.; Zaharia, M.; Zou, J. How is ChatGPT’s behavior changing over time? Harv. Data Sci. Rev. 2024, 6. [Google Scholar] [CrossRef]
  22. Glasier, A.; Cameron, S.T.; Blithe, D.; Scherrer, B.; Mathe, H.; Levy, D.; Gainer, E.; Ulmann, A. Can we identify women at risk of pregnancy despite using emergency contraception? Data from randomized trials of ulipristal acetate and levonorgestrel. Contraception 2011, 84, 363–367. [Google Scholar] [CrossRef]
Table 1. Summary of average physician reviewer scores for all criteria used to evaluate the accuracy of responses to contraceptive counseling inquiries a.
Question | AAFP b | GPT-4.0 | GPT-3.5 | Google Bard | Microsoft Bing
What forms of emergency contraception are effective? | 3.5 | 3 | 3.1 | 3.1 | 2.5
Are fertility awareness methods of contraception effective? | 3.4 | 3.3 | 3.5 | 3.4 | 2.5
What contraceptive methods are less safe for people with migraines? | 3.5 | 3.6 | 3.5 | 3.2 | 3.2
How long does long-acting reversible contraception remain effective? | 3.5 | 3.5 | 3.3 | 3.1 | 2.8
Can depot medroxyprogesterone acetate be self-administered subcutaneously? What are the effects on bone mineral density? | 3.4 | 3.3 | 3.6 | 4 | 2.9
What are contraception considerations in transgender and gender-diverse people with a uterus? | 3.4 | 3.2 | 3.4 | 3.3 | 3.3
a Higher scores (maximum 5) are more favorable for the following criteria: Was the question responded to? Was the response evidence based? Was the response complete or did it include any extraneous information? Did the response refer the user to a healthcare provider or other resources? Lower scores (minimum 1) are more favorable for the following criterion: Does the response speak in absolutes? b American Academy of Family Physicians.
Table 2. Readability index score for each large language model.
Index | GPT-4.0 | GPT-3.5 | Google Bard | Microsoft Bing
Gunning–Fog Index | 13.80 | 13.32 | 10.10 | 13.91
Flesch–Kincaid Grade Level | 14.50 | 12.57 | 9.55 | 12.45
Automated Readability Index | 16.95 | 15.22 | 11.35 | 15.40
Coleman–Liau Index | 16.49 | 13.47 | 11.37 | 14.96
Average Reading Grade Level | 15.44 | 13.65 | 10.59 | 14.18
Table 3. Average physician reviewer scores for each criterion used to assess the accuracy of large language model responses.
Criterion | AAFP a | GPT-4.0 | GPT-3.5 | Google Bard | Microsoft Bing
Was the question responded to? | 4.92 | 4.67 | 4.67 | 4.67 | 4.42
Was the response evidence based? | 5.00 | 2.83 | 3.17 | 2.83 | 2.92
Was the response complete or did it include any extraneous information? | 3.33 | 2.75 | 2.67 | 2.67 | 2.42
Did the response refer the user to a healthcare provider or other resources? | 3.00 | 4.58 | 5.00 | 5.00 | 2.33
Does the response speak in absolutes? | 1.00 | 1.75 | 1.50 | 1.58 | 2.25
For criteria such as ‘response accuracy’ and ‘evidence basis’, higher scores (maximum 5) are more favorable; conversely, for the ‘use of absolutes’ criterion, lower scores (minimum 1) indicate better performance. a American Academy of Family Physicians.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
