An Empirical Evaluation of Large Language Models on Consumer Health Questions
Abstract
1. Introduction
- Evaluation of Five LLMs in Answering Consumer Health Queries: We systematically compare GPT-4o mini, Llama 3.1 (70B), Mistral-123B, Mistral-7B, and Gemini-Flash on a real-world dataset of consumer health questions.
- Cross-Model Evaluation Method: We use a cross-evaluation framework, assessing each model’s ability to judge both its own outputs and those of other models, and compare these findings with human judgments (a minimal sketch of this setup is given after this list).
- Dataset Limitations and Model Insights: We highlight limitations of MedRedQA and derive insights into model performance from the results.
- Discussion of Practical and Ethical Implications: We examine real-world issues such as model safety mechanisms, misinformation risks, and the challenges of regulatory compliance in healthcare contexts.
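To make the cross-evaluation framework concrete, the sketch below shows its basic loop: every model answers every question, and every model (including itself) judges every answer against the expert response. This is an illustrative outline only; the model identifiers mirror those used in the results tables, and `generate_response` / `judge_response` are hypothetical placeholders rather than the study’s actual code.

```python
# Minimal sketch of the cross-evaluation loop (illustrative only; not the study's code).
from itertools import product

MODELS = ["gpt4o-mini", "llama-3.1-70B", "mistral-123B", "mistral-7B", "gemini-flash"]

def generate_response(model: str, title: str, question: str) -> str:
    """Placeholder: query `model` with the consumer question (RQ1-style prompt)."""
    raise NotImplementedError

def judge_response(evaluator: str, question: str, expert_answer: str, candidate: str) -> str:
    """Placeholder: ask `evaluator` to return 'Agree' or 'Disagree' (RQ2-style prompt)."""
    raise NotImplementedError

def cross_evaluate(samples: list[dict]) -> dict:
    """Every model judges every model's answers, including its own."""
    verdicts = {pair: [] for pair in product(MODELS, MODELS)}
    for s in samples:  # each sample: {'title', 'question', 'expert_response'}
        answers = {m: generate_response(m, s["title"], s["question"]) for m in MODELS}
        for evaluator, responder in product(MODELS, MODELS):
            v = judge_response(evaluator, s["question"], s["expert_response"], answers[responder])
            verdicts[(evaluator, responder)].append(v == "Agree")
    # Agreement percentage per (evaluator, responder) pair
    return {pair: (100 * sum(v) / len(v) if v else 0.0) for pair, v in verdicts.items()}
```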
2. Related Work
3. Methods
3.1. Purpose and Scope
3.2. Model Selection
- GPT-4o mini: A compact variant of the GPT-4o model that prioritizes efficiency while maintaining strong reasoning capabilities.
- Llama 3.1 (70B): Part of the Llama model family, this 70-billion-parameter model performs competitively with other compact models and demonstrates robust language understanding across a range of benchmarks.
- Gemini-Flash: This model is part of the Gemini 1.5 series and is developed as a more cost-effective and faster alternative to Gemini-1.5 Pro [12].
- Mistral 7B and Mistral 123B: The Mistral family offers models known for outperforming larger models on several benchmarks, with Mistral 7B delivering competitive performance for its size and Mistral 123B serving as its larger counterpart (an illustrative API call is sketched after this list).
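As an illustration of how one of the selected models can be queried, the snippet below sends a system instruction and a user question to GPT-4o mini through the OpenAI Python SDK. This is a minimal sketch, not the study’s implementation: the other four models would require their own provider SDKs or an OpenAI-compatible gateway, the user message omits the full RQ1 instruction text shown later in the prompt table, and the temperature setting is an assumption.

```python
# Minimal sketch (not the study's code): querying GPT-4o mini with the OpenAI Python SDK.
# Requires `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def ask_model(title: str, question: str, model: str = "gpt-4o-mini") -> str:
    """Send a consumer health question to a chat model and return its text reply."""
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are able to understand medical questions and provide precise answers to them."},
            {"role": "user", "content": f"Title: {title}\nQuestion: {question}"},
        ],
        temperature=0,  # assumption: deterministic decoding for reproducibility
    )
    return completion.choices[0].message.content

# Example usage (hypothetical question):
# print(ask_model("Persistent cough", "I have had a dry cough for three weeks. Should I be concerned?"))
```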
3.3. Dataset Selection
3.3.1. Existing Datasets in Medical QA
3.3.2. MedRedQA
3.4. Prompt Generation
3.5. Evaluation of Model Responses
3.6. Assessment of Model Reliability as Evaluators
- Short and direct expert responses: For example, a consumer might ask, “Should I be concerned about this symptom?” and the expert might reply with an explicit “Yes” or “No”, or a strongly implied affirmative or negative stance. These samples were included because it is easy to verify whether the model response matches the expert response.
- Responses requesting additional information: Some expert responses requested more details rather than answering the question. These samples were included because they provide a clear decision rule: model responses that recognized the need for further details were labeled as “Agree”, and those that did not were labeled as “Disagree”.
- No long responses: Longer expert responses, i.e., those without a definitive “Yes” or “No”, were not included because determining whether a model response matches the expert response might have required domain knowledge. A rough sketch approximating these selection rules follows this list.
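The selection criteria above were applied manually; purely as an illustration, a heuristic approximation might look like the sketch below. The keyword cues and the word-count threshold are assumptions introduced for this example, not values used in the study.

```python
# Rough, illustrative approximation of the manual sample-selection criteria above.
# Keyword cues and the word-count threshold are assumptions, not the study's actual rules.
import re

INFO_REQUEST_CUES = ("more information", "more details", "can you provide", "please clarify")

def selection_category(expert_response: str, max_words: int = 60):
    """Return 'yes_no', 'needs_info', or None (excluded) for an expert response."""
    text = expert_response.strip().lower()
    is_short = len(text.split()) <= max_words
    if is_short and re.search(r"\b(yes|no)\b", text):
        return "yes_no"      # short, direct confirmation or denial
    if is_short and any(cue in text for cue in INFO_REQUEST_CUES):
        return "needs_info"  # expert asks for additional information instead of answering
    return None              # longer or ambiguous responses are excluded
```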
4. Results
5. Discussion
6. Limitations
- Structured Annotations: Labeling questions that require external information (e.g., images) to flag cases where responding to the question may not be possible due to missing contextual data.
- Follow-up Exchanges: Capturing multi-turn interactions where users provide additional details requested by experts, making the dataset more representative of real-world consultations.
- Multiple Expert Responses: Including multiple physician responses per question would strengthen the evaluation of LLM responses by providing more reference answers to compare against.
7. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Glossary
Term | Definition |
---|---|
Natural Language Processing | A field of AI that enables computers to understand and process human language. |
Large Language Models (LLMs) | AI models trained on vast amounts of text data to generate human-like language and answer complex queries. |
Retrieval Augmented Generation | A method combining LLMs with external information retrieval to improve response accuracy. |
Fine-Tuning | The process of training a pre-existing AI model on a specialized dataset to improve its performance in a specific domain. |
Electronic Health Records (EHRs) | Digital medical records containing patient history and treatment details. |
AskDocs subreddit | A Reddit forum where verified medical experts answer user questions. |
References
- Del Fiol, G.; Workman, T.E.; Gorman, P.N. Clinical questions raised by clinicians at the point of care: A systematic review. JAMA Intern. Med. 2014, 174, 710–718.
- Sarrouti, M.; Lachkar, A.; Ouatik, S.E.A. Biomedical question types classification using syntactic and rule based approach. In Proceedings of the 2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), Lisbon, Portugal, 12–14 November 2015; IEEE: New York, NY, USA, 2015; Volume 1, pp. 265–272.
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774.
- Li, L.; Zhou, J.; Gao, Z.; Hua, W.; Fan, L.; Yu, H.; Hagen, L.; Zhang, Y.; Assimes, T.L.; Hemphill, L.; et al. A scoping review of using large language models (llms) to investigate electronic health records (ehrs). arXiv 2024, arXiv:2405.03066.
- Yao, Z.; Zhang, Z.; Tang, C.; Bian, X.; Zhao, Y.; Yang, Z.; Wang, J.; Zhou, H.; Jang, W.S.; Ouyang, F.; et al. Medqa-cs: Benchmarking large language models clinical skills using an ai-sce framework. arXiv 2024, arXiv:2410.01553.
- Nguyen, V.; Karimi, S.; Rybinski, M.; Xing, Z. MedRedQA for Medical Consumer Question Answering: Dataset, Tasks, and Neural Baselines. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Bali, Indonesia, 1–4 November 2023; Volume 1: Long Papers, pp. 629–648.
- Reddit. AskDocs. 2024. Available online: https://www.reddit.com/r/AskDocs/ (accessed on 27 October 2024).
- OpenAI. GPT-4o Mini: Advancing Cost-Efficient Intelligence. 2024. Available online: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ (accessed on 27 October 2024).
- Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; Yang, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783.
- Mistral AI. News on Mistral Large. 2024. Available online: https://mistral.ai/news/mistral-large-2407/ (accessed on 27 October 2024).
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.D.L.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825.
- Team, G.; Georgiev, P.; Lei, V.I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530.
- Baydaroğlu, Ö.; Yeşilköy, S.; Sermet, Y.; Demir, I. A comprehensive review of ontologies in the hydrology towards guiding next generation artificial intelligence applications. J. Environ. Inform. 2023, 42, 90–107.
- Sajja, R.; Sermet, Y.; Cikmaz, M.; Cwiertny, D.; Demir, I. Artificial intelligence-enabled intelligent assistant for personalized and adaptive learning in higher education. Information 2024, 15, 596.
- Sajja, R.; Sermet, Y.; Demir, I. End-to-End Deployment of the Educational AI Hub for Personalized Learning and Engagement: A Case Study on Environmental Science Education. EarthArxiv 2024, 7566.
- Chi, N.C.; Nakad, L.; Fu, Y.K.; Demir, I.; Gilbertson-White, S.; Herr, K.; Demiris, G.; Burnside, L. Tailored Strategies and Shared Decision-Making for Caregivers Managing Patients’ Pain: A Web App. Innov. Aging 2020, 4 (Suppl. S1), 153.
- Chi, N.C.; Shanahan, A.; Nguyen, K.; Demir, I.; Fu, Y.K.; Herr, K. Usability Testing of the Pace App to Support Family Caregivers in Managing Pain for People with Dementia. Innov. Aging 2023, 7 (Suppl. S1), 857.
- Sermet, Y.; Demir, I. A semantic web framework for automated smart assistants: A case study for public health. Big Data Cogn. Comput. 2021, 5, 57.
- Pursnani, V.; Sermet, Y.; Kurt, M.; Demir, I. Performance of ChatGPT on the US fundamentals of engineering exam: Comprehensive assessment of proficiency and potential implications for professional environmental engineering practice. Comput. Educ. Artif. Intell. 2023, 5, 100183.
- Sajja, R.; Sermet, Y.; Cwiertny, D.; Demir, I. Platform-independent and curriculum-oriented intelligent assistant for higher education. Int. J. Educ. Technol. High. Educ. 2023, 20, 42.
- Sajja, R.; Sermet, Y.; Cwiertny, D.; Demir, I. Integrating AI and Learning Analytics for Data-Driven Pedagogical Decisions and Personalized Interventions in Education. arXiv 2023, arXiv:2312.09548.
- Samuel, D.J.; Sermet, M.Y.; Mount, J.; Vald, G.; Cwiertny, D.; Demir, I. Application of Large Language Models in Developing Conversational Agents for Water Quality Education, Communication and Operations. EarthArxiv 2024, 7056.
- Saab, K.; Tu, T.; Weng, W.H.; Tanno, R.; Stutz, D.; Wulczyn, E.; Zhang, F.; Strother, T.; Park, C.; Vedadi, E.; et al. Capabilities of gemini models in medicine. arXiv 2024, arXiv:2404.18416.
- Zhang, M.; Zhu, L.; Lin, S.Y.; Herr, K.; Chi, C.L.; Demir, I.; Dunn Lopez, K.; Chi, N.C. Using artificial intelligence to improve pain assessment and pain management: A scoping review. J. Am. Med. Inform. Assoc. 2023, 30, 570–587.
- Nori, H.; Lee, Y.T.; Zhang, S.; Carignan, D.; Edgar, R.; Fusi, N.; King, N.; Larson, J.; Li, Y.; Liu, W.; et al. Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv 2023, arXiv:2311.16452.
- Jin, Q.; Dhingra, B.; Liu, Z.; Cohen, W.W.; Lu, X. Pubmedqa: A dataset for biomedical research question answering. arXiv 2019, arXiv:1909.06146.
- Nachane, S.S.; Gramopadhye, O.; Chanda, P.; Ramakrishnan, G.; Jadhav, K.S.; Nandwani, Y.; Dinesh, R.; Joshi, S. Few shot chain-of-thought driven reasoning to prompt LLMs for open ended medical question answering. arXiv 2024, arXiv:2403.04890.
- Kanithi, P.K.; Christophe, C.; Pimentel, M.A.; Raha, T.; Saadi, N.; Javed, H.; Maslenkova, S.; Hayat, N.; Rajan, R.; Khan, S. Medic: Towards a comprehensive framework for evaluating llms in clinical applications. arXiv 2024, arXiv:2409.07314.
- Welivita, A.; Pu, P. A survey of consumer health question answering systems. Ai Mag. 2023, 44, 482–507.
- Demner-Fushman, D.; Mrabet, Y.; Ben Abacha, A. Consumer health information and question answering: Helping consumers find answers to their health-related information. J. Am. Med. Inform. Assoc. 2020, 27, 194–201.
- MedlinePlus. U.S. National Library of Medicine, National Institutes of Health. Available online: https://medlineplus.gov/ (accessed on 27 October 2024).
- Nguyen, V. Consumer Medical Question Answering: Challenges and Approaches. Doctoral Dissertation, The Australian National University, Canberra, Australia, 2024. ANU Open Research Repository.
- Abacha, A.B.; Mrabet, Y.; Sharp, M.; Goodwin, T.R.; Shooshan, S.E.; Demner-Fushman, D. Bridging the gap between consumers’ medication questions and trusted answers. In MEDINFO 2019: Health and Wellbeing e-Networks for All; IOS Press: Amsterdam, The Netherlands, 2019; pp. 25–29.
- Ayers, J.W.; Poliak, A.; Dredze, M.; Leas, E.C.; Zhu, Z.; Kelley, J.B.; Faix, D.J.; Goodman, A.M.; Longhurst, C.A.; Hogarth, M.; et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern. Med. 2023, 183, 589–596.
- Carnino, J.M.; Pellegrini, W.R.; Willis, M.; Cohen, M.B.; Paz-Lansberg, M.; Davis, E.M.; Grillone, G.A.; Levi, J.R. Assessing ChatGPT’s Responses to Otolaryngology Patient Questions. Ann. Otol. Rhinol. Laryngol. 2024, 133, 658–664.
- Chen, D.; Parsa, R.; Hope, A.; Hannon, B.; Mak, E.; Eng, L.; Liu, F.-F.; Fallah-Rad, N.; Heesters, A.M.; Raman, S. Physician and Artificial Intelligence Chatbot Responses to Cancer Questions from Social Media. JAMA Oncol. 2024, 10, 956–960.
- Jin, D.; Pan, E.; Oufattole, N.; Weng, W.H.; Fang, H.; Szolovits, P. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 2021, 11, 6421.
- Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring massive multitask language understanding. arXiv 2020, arXiv:2009.03300.
- Ben Abacha, A.; Demner-Fushman, D. A question-entailment approach to question answering. BMC Bioinform. 2019, 20, 511.
- Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81.
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv 2019, arXiv:1904.09675.
- Zhang, X.; Li, Y.; Wang, J.; Sun, B.; Ma, W.; Sun, P.; Zhang, M. Large language models as evaluators for recommendation explanations. In Proceedings of the 18th ACM Conference on Recommender Systems, Bari, Italy, 14–18 October 2024; pp. 33–42.
- Chern, S.; Chern, E.; Neubig, G.; Liu, P. Can large language models be trusted for evaluation? Scalable meta-evaluation of llms as evaluators via agent debate. arXiv 2024, arXiv:2401.16788.
- Xiong, G.; Jin, Q.; Lu, Z.; Zhang, A. Benchmarking retrieval-augmented generation for medicine. arXiv 2024, arXiv:2402.13178.
Study | Data Source | Scale | Contribution |
---|---|---|---|
[23] | MedQA [37]: MCQ-style questions based on the USMLE | Large (n = 1273) | Med-Gemini achieves state-of-the-art accuracy of 91.1% on MedQA. Questions are clinical and multiple-choice style. |
[27] | MedQA Open [27]: MedQA MCQ-style questions modified to be open-ended | Medium (n = 500) | Proposed MedQA Open; tested Llama 2-7B and Llama 2-70B on a sample of 500 questions from MedQA Open, with evaluation by medical students. Questions are clinical. |
[34] | AskDocs subreddit (195 questions) | Low/Medium (n = 195) | ChatGPT 3.5 is tested on a sample of consumer-based questions, and responses are compared with physician responses by human evaluators. |
[35] | AskDocs subreddit: domain-specific otolaryngology questions | Very small (n = 15) | ChatGPT is tested on a small sample of domain-specific consumer-based questions, and responses are evaluated by human evaluators. |
[36] | AskDocs subreddit: cancer questions | Medium (n = 200) | GPT-3.5, GPT-4, and Claude are tested on domain-specific (cancer-related) questions and evaluated by human evaluators. |
This study | AskDocs subreddit: MedRedQA | Large (n = 5000) | GPT-4o mini, Mistral-123B, Mistral-7B, Llama 3.1-70B, and Gemini-Flash are tested on a large number of consumer health questions and evaluated by LLMs. |
Dataset | Format | Type | Questions Source | Size | Pros | Cons |
---|---|---|---|---|---|---|
MedQA | Multiple-Choice | Clinical | USMLE study materials | ~12,000 | Structured clinical questions, useful for assessing factual knowledge | Not open-ended; clinical-style questions, not consumer-based |
PubMedQA-Labeled | Short/Yes-No | Clinical | PubMed abstracts | ~1000 | Clinical questions from biomedical literature, useful for assessing factual knowledge | Clinical-style questions, not consumer-based |
MMLU-Clinical [38] | Multiple-Choice | Clinical | Multi-domain | Varies | Covers multiple medical knowledge domains | Not open-ended; clinical-style questions, not consumer-based |
MedQA-Open | Open-Ended | Clinical | USMLE study materials | ~12,000 | Open-ended questions | Clinical-style questions, not consumer-based |
MedQA-CS [5] | Open-Ended | Clinical | USMLE Step 2 CS guidelines | ~1600 | Open-ended questions | Clinical-style questions, not consumer-based |
MedQuAD [39] | Open-Ended | Clinical | NIH websites | ~47,000 | Open-ended questions | Clinical-style questions, not consumer-based |
MedRedQA | Open-Ended | Consumer-based | AskDocs subreddit | ~51,000 | Open-ended, consumer-based questions | |
RQ1 Prompt | RQ2 Prompt |
---|---|
System Instructions: You are able to understand medical questions and provide precise answers to them. Prompt: Different user prompts were tested to make sure that the model responds as required, with answers that are precise and without additional commentary. The following ‘user’ prompt is given to each of the five models: “I will provide you with a medical question and the title associated with it. You will answer that question as precisely as possible, addressing only what is asked. There is no need to provide additional context and details.” | System Instructions: You are able to understand medical content and answer any queries regarding the content. Prompt: I will provide you with a medical question, its associated title, and two responses to that question: one from a medical expert (which is the correct answer or ground truth) and another from a different source. Your task is to compare the information in the other response with the expert’s, treating the expert’s answer as the ground truth. You are not evaluating the correctness of the other response directly. Instead, your focus is solely on how closely the information in the other response aligns with the information in the expert’s response (which is the correct answer). Your output must strictly be one of the following two words: 1. ‘Agree’ if the main information in both responses is the same. 2. ‘Disagree’ if the main information in the other response is not similar to the expert’s. Your output should consist of only one of these two terms, without any additional text. |
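The table above gives the exact prompts; the sketch below shows one way the RQ2 evaluation prompt could be assembled and the resulting verdict parsed. The `chat` callable, the abridged instruction text, and the rule that any output other than ‘Agree’ counts as ‘Disagree’ are assumptions for illustration, not details taken from the paper.

```python
# Illustrative only: assembling an RQ2-style evaluation prompt and parsing the verdict.
# `chat(system, user)` is a hypothetical helper returning a model's text reply.
from typing import Callable

SYSTEM_RQ2 = "You are able to understand medical content and answer any queries regarding the content."

def build_rq2_prompt(title: str, question: str, expert: str, candidate: str) -> str:
    # Instruction text abridged from the RQ2 prompt in the table above.
    instructions = (
        "I will provide you with a medical question, its associated title, and two responses to that "
        "question: one from a medical expert (which is the correct answer or ground truth) and another "
        "from a different source. Compare the information in the other response with the expert's, "
        "treating the expert's answer as the ground truth. Your output must strictly be one of the "
        "following two words: 'Agree' if the main information in both responses is the same, or "
        "'Disagree' if it is not, without any additional text."
    )
    return (
        f"{instructions}\n\nTitle: {title}\nQuestion: {question}\n"
        f"Expert response: {expert}\nOther response: {candidate}"
    )

def parse_verdict(raw_output: str) -> str:
    """Normalize the evaluator output; anything other than 'Agree' is treated as 'Disagree' (assumption)."""
    return "Agree" if raw_output.strip().lower().startswith("agree") else "Disagree"

def judge(chat: Callable[[str, str], str], title: str, question: str, expert: str, candidate: str) -> str:
    return parse_verdict(chat(SYSTEM_RQ2, build_rq2_prompt(title, question, expert, candidate)))
```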
Model | Score (%) |
---|---|
gpt4o-mini | 76.0 |
llama-3.1-70B | 72.0 |
gemini-flash | 77.2 |
mistral-123B | 74.4 |
mistral-7B | 51.6 |
Model | Agreement Score (%) |
---|---|
gpt4o-mini | 37.1 |
llama-3.1-70B | 26.4 |
gemini-flash | 28.4 |
mistral-123B | 32.3 |
mistral-7B | 16.4 |
Model | Average Agreement Scores (%) |
---|---|
gpt4o-mini | 51.2 |
llama-3.1-70B | 41.0 |
gemini-flash | 37.2 |
mistral-123B | 48.1 |
mistral-7B | 33.5 |
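For context on the tables above, an agreement score of this kind is typically the percentage of evaluated responses labeled ‘Agree’; the snippet below shows that computation under the assumption that this is the definition used, with hypothetical counts.

```python
# Illustrative computation of an agreement score as the percentage of 'Agree' verdicts.
def agreement_score(verdicts: list[str]) -> float:
    """Percentage of 'Agree' verdicts, rounded to one decimal place."""
    if not verdicts:
        return 0.0
    return round(100 * sum(v == "Agree" for v in verdicts) / len(verdicts), 1)

# Example (hypothetical counts): 82 'Agree' out of 250 evaluations -> 32.8
print(agreement_score(["Agree"] * 82 + ["Disagree"] * 168))
```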