Abstract
The use of large language models (LLMs) to automate the generation of medical case-based multiple-choice questions (MCQs) is increasing, but their accuracy, reliability, and educational validity remain poorly understood. This study compared nine LLMs under four prompting methods to evaluate LLM-produced MCQs for clinical coherence and assessment readiness. A uniform evaluation pipeline was constructed covering automatic text-similarity metrics (BLEU, ROUGE, and METEOR), structural and parsability measures, and operational effectiveness (latency, cost, and quality-efficiency ratios). Human validation focused on the model-prompt combination that demonstrated the best linguistic fidelity and clinically aligned reasoning (OpenBioLLM-70B with Chain-of-Thought). Two clinical experts independently reviewed 88 items using a five-domain rubric covering appropriateness, clarity, relevance, distractor quality, and cognitive level. Results indicated substantial variation across models and prompting strategies, with Chain-of-Thought yielding the best overall performance. OpenBioLLM-70B demonstrated the best overall balance of quality, parsability, and efficiency, achieving a prompt template quality score of 90.4, a consistency score of 88.8, and a response time of 3.28 s, with a quality-per-dollar value of 134.11. Expert ratings confirmed clinical alignment, but both raters agreed that distractor quality needed further improvement. These results provide evidence that, under optimal prompting conditions, LLMs can reliably support MCQ generation and provide large-scale, cost-effective support for medical assessment production.
1. Introduction
High-quality multiple-choice questions (MCQs), particularly those wrapped in realistic patient vignettes, remain the backbone of knowledge assessment in undergraduate and postgraduate medical curricula because they combine objectivity with broad content sampling [1]. Authoring clinical MCQs requires extensive manual effort in question preparation and revision. A recent audit of a high-stakes emergency medicine examination found that human experts required ≈96 person-hours to write 100 new MCQs, whereas first drafting with a large language model (LLM) cut that figure to 24.5 h [2]. Writing answer rationales is an additional bottleneck: course directors report spending >30 min per item, a burden halved with LLM assistance [3]. The time and expertise demanded often outstrip faculty capacity, limiting the size and freshness of item banks.
Early empirical studies suggest that modern LLMs can shoulder part of this workload. Multinational trials comparing ChatGPT-generated questions with faculty items showed comparable difficulty and discrimination indices in graduate-level exams, although LLM items tended to target lower-order cognition [4,5]. Subsequent expert review of AI-authored MCQs still uncovers factual slips and cueing flaws, confirming that the prompting strategy often matters as much as which model is used [2].
The growing prompt-engineering literature suggests that output quality depends not only on the underlying model but also on how the task is elicited. A recent survey catalogues techniques such as in-context learning, CoT, decomposition, and self-refinement, each requiring different token allocations that influence LLM reliability (i.e., repeatability of generated outputs given fixed prompts and inputs) during MCQ generation [6]. In the MCQ space, the MCQG-SRefine framework combines expert-designed prompts with iterative critique-and-correct loops that boost USMLE-style question quality by up to 30% over baseline prompting [7]. Recent work has also explored self-refinement as a means of improving output reliability by enabling the model to critique and update its own responses before final generation. However, empirical work on self-refinement demonstrates that its gains are highly sensitive to prompt wording, with small changes negating improvements and revealing the limited robustness of current prompting heuristics [8]. Moreover, comparative evidence remains fragmented: existing studies often evaluate a limited number of models, apply non-uniform prompting, or rely on evaluation measures that are not tailored to educational item quality (e.g., surface text overlap).
In spite of recent advances, to our knowledge, there is limited comparative evidence examining a broad spectrum of large language models (LLMs), ranging from general-purpose (e.g., GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, Command A, LLaMA-3-70B, Falcon 7B, Mistral Large) to domain-specific (e.g., OpenBioLLM-70B, BioGPT-Large), under a uniform set of prompt engineering paradigms for generating case-based medical MCQs. This study addresses that gap by:
- Benchmarking nine LLMs across four prompting strategies (zero-shot baseline, few-shot, CoT, and self-refinement) using standardized case-based medical scenarios.
- Evaluating outputs through a hybrid framework that combines automatic semantic metrics (BLEU, ROUGE, METEOR) and expert ratings of clinical accuracy, reasoning depth, and distractor plausibility.
- Assessing the cost-quality trade-offs to produce evidence-based findings for end-users, exam boards, and AI practitioners who wish to build scalable MCQ generation solutions.
By separating structural reliability, descriptive similarity, and expert-rated educational quality, this work provides a reproducible comparison of model–prompt configurations and identifies practical trade-offs relevant to institutions seeking scalable, human-in-the-loop approaches for MCQ drafting.
2. Related Work
2.1. MCQ Generation in Medical Education
Multiple-choice questions (MCQs) are a mainstay of medical assessment due to their scalability, objectivity, and psychometric precision [9]. When appropriate levels of cognition are considered, properly designed MCQs can evaluate both factual recall and higher-order thinking skills. Generating adequate MCQs, however, is labor-intensive and requires a high level of subject-matter expertise to guarantee clinical accuracy, fairness, and adherence to educational goals. Quality MCQs must be carefully written to minimize item-writing flaws, maintain uniform standards, and strike an adequate balance between testing factual recall and the application of reasoning in medical problem-solving. This demand for precision and educational value makes MCQ writing a laborious process, often requiring several revisions and agreement among experts.

Case-based MCQs, in which questions are linked to realistic patient situations, fill an important need in medical education because of their utility in measuring higher-order cognitive processes [10]. By combining biomedical knowledge with the clinical, social, and contextual aspects of patient care, such questions test not only factual recall but also diagnostic reasoning, clinical decision-making, and the ability to apply knowledge in real-life situations. They promote inquiry-based problem solving, encourage thorough understanding, and help students apply theoretical knowledge in practical clinical settings.

In recent years, attempts have been made to automate MCQ generation through natural language processing (NLP) techniques, ranging from rule-based templates to sequence-to-sequence architectures [11]. While these methods are promising, the earlier rule-based methods were limited by insufficient vocabulary, inflexible templates, and a lack of clinical depth.
Although earlier sequence-to-sequence models have enabled new capabilities for more flexible text generation, they struggle to maintain clinical accuracy and logical reasoning in more complex case-based scenarios. As a result, this gap has led to an increased emphasis on developing more advanced LLMs, which will have an enhanced understanding of medical language and a corresponding reasoning capability required to generate high-quality MCQs [12].
2.2. Large Language Models for Medical Question Generation
The use of large language models (LLMs) for medical question generation has grown rapidly in the past few years, producing a growing body of evidence on their strengths and weaknesses. Early studies suggested that state-of-the-art models like GPT-4 could create coherent, medically relevant, USMLE-style questions that were indistinguishable from human-generated questions [13,14]. In several assessments, GPT-4 consistently outperformed biomedical-specialized LLMs on overall fluency, reasoning, and distractor plausibility [15,16]. Domain-specific LLMs, such as BioGPT, Med-PaLM, and retrieval-augmented pipelines, are being developed to bolster medical reliability; Jeong et al. (2024) [17] showed that retrieval methods (e.g., Self-BioRAG) improved explanation quality and reasoning reliability, but distractor plausibility remained a major issue. In similar studies, such as Dorfner et al. (2024) [18], biomedically fine-tuned LLMs sometimes performed worse than general LLMs due to limited adaptability and lack of generality, suggesting that the scale and breadth of general linguistic coverage can outperform the narrowness of specialized coverage. More recent work emphasizes prompt engineering and refinement strategies as key performance factors. Iterative critique-revision pipelines have been shown to substantially improve distractor quality and reasoning reliability, as Yao et al. (2024) [7] demonstrated when producing USMLE-style questions with better difficulty calibration. Maharjan et al. (2024) [19] showed that open-source models could match fine-tuned biomedical benchmarks when supplied with CoT prompting and self-consistency strategies. Despite these improvements, many problems remain consistent across studies, with distractor weakness (implausible, ambiguous, or more than one right answer) being the leading problem in LLM-generated MCQs [5,20].
LLM reasoning, while improved by CoT prompting, is often impaired by hallucinations and superficial recall [21,22]. Psychometric alignment is another problem: while some studies [7,14] report good calibration against expert-evaluated difficulty levels, others report subject- and skill-dependent differences that cast doubt on the psychometric validity of the generated tests [23].
Using LLMs to generate medical case-based MCQs raises several evaluation and privacy challenges. LLMs trained on sensitive data are vulnerable to unintentional memorization, with reports indicating that 68.5% of anonymized patient records can be reconstructed from clinical data alone [24]. LLMs are also subject to many attack vectors (such as membership inference, training-data extraction, and prompt injection), with jailbreak attacks showing success rates above 20% [25]. To counter these risks, comprehensive evaluation frameworks such as the Priv-IQ benchmark (which evaluates LLMs on eight core competencies, including privacy) use metrics such as the Intraclass Correlation Coefficient (ICC) and Mean Absolute Error (MAE). Limitations remain, however, including the trade-off between utility and privacy and the lack of standardized methods for evaluating privacy [24,25]. Current best practices for safe deployment therefore include conducting Privacy Impact Assessments (PIAs), applying privacy-enhancing techniques such as differential privacy and federated learning to ensure compliance with the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA), and adopting a hybrid model approach.
2.3. Prompt Engineering and Reasoning Strategies
Prompt engineering has become increasingly necessary for the success of LLMs, particularly with important applications in medical education [26]. Although the model architecture and training corpus provide the model with certain basic capabilities, the particulars of how the prompts are presented—comprising context, examples, and reasoning guidelines—have a considerable effect on the accuracy and coherence of the MCQs generated as well as the psychometric validity of their results.
Zero-shot prompting provides a baseline method in which the models are prompted to generate case-based MCQs, without examples [27]. The method relies on the model’s own knowledge and reasoning, leading to results with varying plausibility of the distractors, shallow reasoning, and occasional hallucinations. In medical cases, zero-shot prompts may produce relevant clinical stems, but the distractors may be either implausible or completely incorrect.
Few-shot prompting introduces exemplar MCQs within the prompt to anchor the model’s output to the expected structure, style, and reasoning depth [28]. Empirical evidence suggests that few-shot prompting improves question clarity and distractor plausibility, especially when examples are carefully selected to represent diverse cognitive levels. However, its effectiveness is highly sensitive to the number and quality of examples, with diminishing returns or even bias if examples are too narrow.
Chain-of-Thought (CoT) prompting explicitly encourages stepwise reasoning before the final output. In biomedical tasks, CoT has been shown to enhance logical flow and reasoning fidelity, reducing errors in diagnostic or multi-step clinical reasoning tasks [29,30]. Using CoT reasoning for MCQ creation improves the relationship between stems, correct answers, and distractors by requiring the model to articulate the clinical reasoning behind each item. However, longer rationales increase verbosity and computational cost, creating a trade-off between question quality and the time and effort required to generate it.
Self-refinement and iterative prompting represent the most recent evolution of prompt engineering strategies [31,32]. Self-refinement introduces an iterative “draft–critique–revise” loop intended to reduce contradictions and improve compliance with item-writing constraints. This approach significantly improves distractor plausibility, reduces hallucinations, and enhances psychometric calibration (e.g., increasing medium- and hard-level items). Platforms such as QUEST-AI and OpenMedLM demonstrate that refinement-based prompting enables even open-source models to approach the quality of frontier LLMs [19,23].
In medical education, prompt strategies are important in determining the validity, quality, and clinical relevance of MCQs generated by LLMs [33]. Recent studies indicate that AI-generated MCQs may be comparable to human-written MCQs when specific prompt types (examination-style prompts, clinical-persona prompts, structured-instruction prompts) are used [34,35]. However, how closely AI-generated MCQs approximate human-written ones varies with both the model and the prompting method. Zero-shot prompting appears scalable for generating many MCQs, while few-shot and structured prompting methods have been shown to produce greater clinical alignment and depth of reasoning. Prior reviews also suggest that prompts based on standardized exams (i.e., NBME- or USMLE-style item formats) improve surface validity; even so, some clinical vignettes remain weak, and higher-order cognitive items discriminate less well among examinees. These findings demonstrate both the potential and the limitations of prompt-based MCQ generation and, as explored in our study, underscore the need for standardized benchmarking of multiple prompt methods in medical education.
2.4. Gap Addressed by This Study
Across prior work, comparative evaluation is often limited by differences in model selection, prompting protocols, and outcome measures, making it difficult to isolate the effects of model family versus prompt strategy. Moreover, many evaluations emphasize linguistic quality or perceived usefulness rather than standardized measures of structural compliance and expert-rated educational quality. To address these limitations, we compare nine LLMs across four prompting paradigms using a uniform generation template and a shared evaluation pipeline that integrates structural parsability, descriptive text-based similarity measures, operational latency/cost, and expert review of the top-performing configuration.
3. Methods
3.1. Study Design
The purpose of this research is to comprehensively assess the use of LLMs for generating clinically relevant and accurate case-based MCQs in medical education. The design integrates automated generation, model comparison, and expert evaluation. The research consists of four distinct steps: (1) dataset preparation, (2) model-based MCQ generation with four different prompts, (3) automated evaluation, and (4) human validation. Figure 1 shows the study pipeline.
Figure 1.
Study objectives pipeline.
3.2. Dataset
We employed 100 records from MedMCQA, a publicly available, large-scale multiple-choice question answering corpus introduced by Pal et al. (2022) [36]. The dataset contains 194,457 AIIMS and NEET-PG entrance-exam questions spanning 21 medical subjects and almost 2.4k healthcare topics, with an average length of 12.8 tokens per item.
Each record provides:
- A stem (realistic clinical or basic-science prompt);
- Four or five answer options;
- The single correct choice;
- An explanatory rationale authored by subject-matter experts.
The breadth of specialties (e.g., pharmacology, surgery, obstetrics, and pathology) tests more than ten distinct reasoning skills identified in prior medical-QA taxonomies, making the corpus well-suited for evaluating general-purpose and domain-tuned LLMs.
3.3. Large Language Models (LLMs) Selection and Criteria
Specifically, we benchmarked nine LLMs that collectively represent the full range of current-generation models used in medical question authoring, covering proprietary, enterprise, open-source, and biomedical-specialized systems. Model choice was based on the purpose of the study and the timing of our experiments rather than on novelty: we selected models that were widely used, had large user bases (and hence stability), offered documented API access for replication, and spanned both general-purpose and biomedical-specialized systems, enabling fair, reproducible comparisons across model types and prompting methods.
- Proprietary frontier models: GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro define the current state of the art in general reasoning, contextual understanding, and long-context coherence (supporting up to 1 million tokens). These models are often cited by medical licensing boards for prototype item generation because they are both reliable and able to perform contextual reasoning tasks [21,37,38].
- Enterprise, retrieval-centric system: Command A (Cohere) illustrates integration of retrieval-augmented generation (RAG) workflows, providing grounded factual consistency (i.e., similarity of outputs across repeated runs with the same prompt) for case-based MCQ stems and distractors. It is essential when prompts must integrate textbook-anchored clinical facts to produce accurate, evidence-based distractors [39,40]. Its design supports grounded medical question construction, where factual precision is prioritized.
- Open-source, self-hostable checkpoints: LLaMA-3-70B, Mistral Large, and Falcon 7B offer transparent architectures, efficient parameter scaling, and local deployability for reproducible experimentation. They provide transparency, adaptability, and cost-effective deployment. They enable on-premise or GPU-cluster deployment, reducing generation costs by approximately 20× relative to GPT-4o while maintaining high linguistic fidelity under optimized prompting [41,42,43,44,45]. Open access also supports reproducibility and fine-grained control of temperature, reasoning depth, and question difficulty.
- Domain-specialized biomedical models: OpenBioLLM-70B and BioGPT-Large contribute clinical and biomedical knowledge through pre-training on PubMed and domain corpora, enabling evaluation of terminology capture and domain-reasoning fidelity [46,47]. They combine large-scale biomedical pre-training with strong reasoning ability, bridging the gap between general and domain-specific performance. BioGPT-Large, though smaller, serves as a lightweight reference model to examine how prompt engineering can offset limited scale and context capacity.
Collectively, these nine models provide a balanced benchmark across scale, domain specialization, and prompting strategies, establishing the foundation for evaluating zero-shot, few-shot, CoT, and self-refinement paradigms in case-based medical MCQ generation. Table 1 summarizes the nine LLMs evaluated in this study, along with their key architectural characteristics and features relevant to MCQ generation.
Table 1.
The nine selected LLMs and their features.
3.4. Prompt Engineering Framework
The four prompt paradigms we tested are baseline (zero-shot), few-shot, CoT, and self-refinement.
- Baseline captures the simplest “instruction-only” setup that many users start with [27].
- Few-shot follows the in-context learning recipe introduced by Brown et al. [28] for GPT-3, where a handful of exemplars steer the model without parameter updates.
- CoT adopts the reasoning-trace prompting of Wei et al. [29], asking the model to articulate intermediate clinical logic before giving the final MCQ.
- Self-refine implements the iterative “draft → critique → rewrite” loop proposed by Madaan et al. [31], letting the same model improve its own output at inference time.
We instantiated every paradigm with real vignettes from MedMCQA [36], for example: “A 38-year-old woman comes to the physician for a follow-up examination. Two years ago, she was diagnosed with multiple sclerosis, etc.” Each clinical case scenario was entered into the LLM as a case consisting of demographic information, the reason for the visit, relevant medical history, and important clinical findings. The model was asked to convert each case into a fully structured case-based MCQ, supported by examples of a clinically coherent stem, four possible answers with one correct option, and a justification for why that answer is correct. All prompts were required to produce output reflecting standard medical exam style. This method provided consistency across models and prompts while preserving the clinical context of each case. Table 2 outlines the prompting framework used to standardize MCQ generation across models. Each prompt type applies a different reasoning or refinement strategy: Zero-Shot creates direct schema-based outputs, Few-Shot uses example cases to anchor generation, CoT templates support step-by-step reasoning to make clinical coherence explicit, and Self-Refine uses a two-phase revision loop to improve clarity, formatting, and distractor accuracy. This design ensures a uniform input structure so that all nine LLMs can be evaluated fairly and consistently.
Table 2.
Prompt template types and their skeleton.
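To make the framework concrete, the four paradigms can be sketched as simple prompt builders. This is an illustrative sketch only: the instruction wording, exemplar handling, and critique text below are hypothetical stand-ins, not the study's actual templates (which follow the skeletons in Table 2).

```python
# Illustrative prompt builders for the four paradigms (hypothetical wording).

BASE_INSTRUCTION = (
    "Convert the clinical case below into one exam-style MCQ with a stem, "
    "four options (A-D), one correct answer, and a brief rationale.\n"
)

def zero_shot_prompt(case: str) -> str:
    # Instruction-only: no exemplars, relies on the model's own knowledge.
    return BASE_INSTRUCTION + f"Case: {case}"

def few_shot_prompt(case: str, exemplars: list[str]) -> str:
    # Prepend worked exemplar MCQs to anchor structure, style, and depth.
    shots = "\n\n".join(f"Example:\n{e}" for e in exemplars)
    return f"{shots}\n\n{BASE_INSTRUCTION}Case: {case}"

def cot_prompt(case: str) -> str:
    # Ask for intermediate clinical reasoning before the final item.
    return (BASE_INSTRUCTION
            + "First reason step by step about the diagnosis, then output the MCQ.\n"
            + f"Case: {case}")

def self_refine_prompts(case: str) -> tuple[str, str]:
    # Two-phase loop: a draft prompt, then a critique-and-revise prompt
    # whose {draft_mcq} placeholder is filled with the first-phase output.
    draft = zero_shot_prompt(case)
    critique = ("Critique the MCQ below for clarity, formatting, and "
                "distractor plausibility, then output a revised version.\n"
                "{draft_mcq}")
    return draft, critique
```

In a pipeline, the same case string would be passed through each builder so that only the prompting strategy varies between runs.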
4. Metrics and Evaluation
The models were evaluated on four dimensions: automatic text similarity, structural quality, efficiency, and template-specific metrics [48,49,50,51,52].
4.1. Automatic Text Similarity
We used three automatic text-similarity metrics, BLEU, ROUGE, and METEOR, to assess how closely the generated MCQs matched the reference items from the original MedMCQA dataset [49,52].
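To make these metrics concrete, the ROUGE-L component can be illustrated with a minimal pure-Python sketch. The study itself used the rouge_scorer package with stemming enabled; this simplified version tokenizes on whitespace and omits stemming.

```python
# Minimal ROUGE-L F-score: longest common subsequence over whitespace tokens.

def lcs_len(a: list[str], b: list[str]) -> int:
    # Classic O(len(a) * len(b)) dynamic-programming LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def rouge_l(reference: str, candidate: str) -> float:
    # F-score of LCS-based precision and recall; 0.0 when there is no overlap.
    r, c = reference.split(), candidate.split()
    lcs = lcs_len(r, c)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

An identical reference and candidate score 1.0; disjoint token sequences score 0.0.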
The BLEU score is computed as
$\mathrm{BLEU}(R, C) = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right)$,
where $R$ and $C$ stand for the tokenized reference and candidate texts, respectively, $p_n$ is the modified $n$-gram precision, $w_n = 1/4$ are uniform weights, and $\mathrm{BP}$ is the brevity penalty. This uses NLTK’s smoothing function (method 4) to avoid zero-precision penalties for short outputs.
The ROUGE-N score is computed as
$\mathrm{ROUGE}\text{-}N(R, C) = \frac{\sum_{g_n \in R} \mathrm{Count}_{\mathrm{match}}(g_n)}{\sum_{g_n \in R} \mathrm{Count}(g_n)}$,
where $g_n$ ranges over the $n$-grams of the reference. Specifically, we report ROUGE-1 (unigram overlap capturing lexical similarity), ROUGE-2 (bigram overlap capturing local fluency and phrasing), and ROUGE-L (longest common subsequence capturing overall structural alignment). These were implemented using the rouge_scorer with stemming enabled to reduce inflectional bias [53].
The METEOR score is computed as
$\mathrm{METEOR}(R, C) = F_{\mathrm{mean}} \cdot (1 - \mathrm{Penalty})$, with $F_{\mathrm{mean}} = \frac{10\,P\,R_u}{R_u + 9P}$,
which combines unigram precision $P$ and recall $R_u$, adjusted for synonym and stemming matches. All values were averaged over the 100 MCQs generated per model–template pair:
$\bar{S}_{m,t} = \frac{1}{100} \sum_{i=1}^{100} S^{(i)}_{m,t}$, where $S \in \{\mathrm{BLEU}, \mathrm{ROUGE}, \mathrm{METEOR}\}$.
4.2. Structural Quality
Structural quality was measured by the Parsability Rate [50], defined as the proportion of responses that successfully matched the required JSON or text schema (question + 4 options + answer):
$\mathrm{Parsability\ Rate}_{m,t} = \frac{N_{\mathrm{parsable}}}{N_{\mathrm{total}}} \times 100\%$.
A custom parsing function (parse_mcq_response) identified the question, options (A–D), and the correct answer. Outputs that did not adhere to proper structural and formatting rules were excluded from downstream text similarity metrics.
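The parsing step can be illustrated with a simplified, hypothetical re-implementation; the actual parse_mcq_response function's accepted formats and field names may differ.

```python
import re

def parse_mcq_response(text):
    # Sketch of the structural parser: extracts the question, options A-D,
    # and the answer letter; returns None when the output does not match
    # the required schema (question + 4 options + answer).
    q = re.search(r"Question:\s*(.+)", text)
    opts = re.findall(r"^([A-D])[.)]\s*(.+)$", text, flags=re.M)
    ans = re.search(r"Answer:\s*([A-D])", text)
    if not (q and ans and len(opts) == 4):
        return None
    return {"question": q.group(1).strip(),
            "options": dict(opts),
            "answer": ans.group(1)}
```

Outputs that fail this schema check would be counted against the Parsability Rate and excluded from the similarity metrics, as described above.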
4.3. Operational Metrics
Latency and cost were also evaluated for each model–template pair [51].
- Average Response Time (seconds): $\bar{T}_{m,t} = \frac{1}{N} \sum_{i=1}^{N} T^{(i)}_{m,t}$
- Cost Per Response (USD): $\mathrm{Cost}_{m,t} = \frac{\mathrm{tokens}_{\mathrm{in}}}{1000} \cdot c_{\mathrm{in}} + \frac{\mathrm{tokens}_{\mathrm{out}}}{1000} \cdot c_{\mathrm{out}}$, where $c_{\mathrm{in}}$ and $c_{\mathrm{out}}$ denote the provider’s per-1000-token input and output prices.
Token counts were estimated using the approximation 1 token ≈ 4 characters, following the usage conventions of OpenAI and HuggingFace [54].
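The estimation can be sketched as follows; the per-1000-token prices here are placeholder arguments, not the actual provider rates used in the study.

```python
def estimate_tokens(text: str) -> int:
    # Heuristic from the paper: 1 token is approximately 4 characters.
    return max(1, len(text) // 4)

def cost_per_response(prompt: str, completion: str,
                      usd_per_1k_in: float, usd_per_1k_out: float) -> float:
    # Cost = (input tokens / 1000) * input price + (output tokens / 1000) * output price.
    # The price arguments are placeholders to be filled with provider rates.
    return (estimate_tokens(prompt) / 1000 * usd_per_1k_in
            + estimate_tokens(completion) / 1000 * usd_per_1k_out)
```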
4.4. Composite Utility Metrics
To evaluate quality, cost, and efficiency simultaneously, following ref. [55] two normalized ratios were introduced:
$\mathrm{Quality\ per\ Dollar}_{m,t} = \frac{\mathrm{Quality}_{m,t}}{\max(\mathrm{Cost}_{m,t},\ 0.001)}$
$\mathrm{Quality\ per\ Second}_{m,t} = \frac{\mathrm{Quality}_{m,t}}{\max(\bar{T}_{m,t},\ 0.1)}$
In these equations, the subscripts m and t refer to the model and template type, respectively. Specifically, m indexes the nine large language models analyzed (e.g., GPT-4o, Claude 3 Opus, LLaMA-3-70B, OpenBioLLM-70B), and t denotes the prompting strategy used (Zero-Shot, Few-Shot, CoT, or Self-Refine). Thus, each metric quantifies the combined quality, cost, and efficiency of a particular model–template combination, allowing standardized comparison across systems and reasoning modes. The two metrics were scaled by capping denominators at small constants (0.001 USD, 0.1 s).
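The two capped ratios reduce to a few lines of code (a sketch following the capping constants stated above):

```python
def quality_per_dollar(quality: float, cost_usd: float) -> float:
    # Denominator capped at 0.001 USD to avoid division blow-up on near-free runs.
    return quality / max(cost_usd, 0.001)

def quality_per_second(quality: float, latency_s: float) -> float:
    # Denominator capped at 0.1 s for the same reason.
    return quality / max(latency_s, 0.1)
```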
4.5. Aggregate Success Indicators
Additional derived quantities [56,57] include:
Success Rate—percentage of non-error API responses:
$\mathrm{Success\ Rate}_{m,t} = \frac{N_{\mathrm{non\text{-}error}}}{N_{\mathrm{total}}} \times 100\%$
Composite “Value Score” (cost efficiency):
$\mathrm{Value\ Score}_{m,t} = \frac{\mathrm{ROUGE\text{-}L}_{m,t} \times \mathrm{Success\ Rate}_{m,t}}{\max(\mathrm{Cost}_{m,t},\ 0.001)}$
Composite “Time-Value Score” (speed efficiency):
$\mathrm{Time\text{-}Value\ Score}_{m,t} = \frac{\mathrm{ROUGE\text{-}L}_{m,t} \times \mathrm{Success\ Rate}_{m,t}}{\max(\bar{T}_{m,t},\ 0.1)}$
In addition to the primary quality metrics, several derived indicators were computed to represent reliability and overall efficiency. The Success Rate is the percentage of valid, non-error API responses across all model–template runs, indicating the reliability and reproducibility of the system. Cost and speed trade-offs were captured by two composite indices: the Value Score, which combines text quality (ROUGE-L) and reliability as a function of cost, and the Time-Value Score, which normalizes both quality and reliability by average response time. Collectively, these three derived metrics give a comprehensive representation of overall model efficiency, quantifying the interplay between correctness, stability (i.e., performance robustness under varying conditions), and resource utilization across the experimental conditions.
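These derived indicators can be sketched directly from the descriptions above (ROUGE-L scaled by reliability, normalized by cost or latency); the paper's exact scaling constants may differ, so treat this as an illustrative sketch.

```python
def success_rate(non_error: int, total: int) -> float:
    # Percentage of valid, non-error API responses.
    return 100.0 * non_error / total

def value_score(rouge_l: float, success_pct: float, cost_usd: float) -> float:
    # Text quality weighted by reliability, normalized by (capped) cost.
    return (rouge_l * success_pct / 100.0) / max(cost_usd, 0.001)

def time_value_score(rouge_l: float, success_pct: float, latency_s: float) -> float:
    # Same numerator, normalized by (capped) average response time.
    return (rouge_l * success_pct / 100.0) / max(latency_s, 0.1)
```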
4.6. Prompt Strategy Evaluation Matrix
To analyze the effectiveness of the various prompt engineering templates, a scoring framework was devised; the resulting matrix quantitatively measures the quality, structure, and contextual correctness of each template’s output. A hybrid scoring method on a 0–100-point scale is employed.
4.6.1. Scoring Framework Overview
The evaluation function (enhanced_template_score) integrates four major dimensions: Answer Correctness, Prompt Strategy Adherence, Format Quality, and Clinical Relevance. The highest possible score across all four dimensions is 100 points (i.e., the total score is normalized to 100). We base our method on existing methodologies for holistic clinician evaluation of LLMs [58], benchmarks for evaluating clinical reasoning [59,60], and common practice in the design of automated scoring systems [61]. Table 3 shows our detailed scoring prompt strategy.
Table 3.
Scoring prompt strategy.
4.6.2. Performance Data Generation
To measure model performance across templates, a procedure (designated template_performance_data) was constructed. Each result combined model capability and template type to obtain the following principal metrics (Table 4):
Table 4.
Performance data generation evaluation scores and description.
4.6.3. Analytical Purpose
The resulting Template Evaluation Matrix enables:
- Cross-comparison of prompt strategies within and across LLM families;
- Measurement of quality–consistency–efficiency trade-offs;
- Quantitative linking of prompt reasoning depth (e.g., CoT, Self-Refine) with response correctness and format robustness.
This matrix forms the methodological basis for analyzing the effectiveness of prompts in automated medical MCQ generation before aggregating descriptive results, such as mean scores across prompts and models.
4.7. Human Evaluation Metrics
In addition to the automated metrics, a human evaluation framework was used to qualitatively assess the educational and clinical value of the generated MCQs. The evaluation followed criteria used in medical education research for MCQ validation [4,5,62,63]. Each generated question was rated by experts in the field on a 10-point Likert scale for each dimension (0 = extremely poor, 10 = gold-standard quality). In particular, the evaluation emphasized content validity and congruence, clarity, clinical relevance, and cognitive level to ensure that the generated questions meet the expected educational standards.
A human evaluation of the single best model-prompt combination identified during the quantitative phase was completed to augment the automated metrics. Twelve of the 100 MCQs were excluded from consideration because they referenced figures or illustrations that were not available. For the remaining 88 MCQs, two senior clinical educators with 7 and 9 years of experience in high-stakes examinations, respectively, served as raters. Both raters routinely create and evaluate MCQs for summative exams and are well-versed in international standards for item creation. Given their extensive backgrounds, no additional training workshops were required; instead, the raters received written directions on the scoring rubrics and the scoring process. Table 5 summarizes the human evaluation dimensions and their corresponding scoring ranges.
Table 5.
Human evaluation dimensions, definition, and scoring range.
Each criterion contributed equally to the Overall Quality Index, so items rated highly across all dimensions received proportionally higher overall scores. This multi-dimensional human evaluation ensured that the model’s output was not only linguistically accurate but also pedagogically sound, clinically valid, and consistent with the intended educational outcomes.
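Under this equal weighting, the Overall Quality Index reduces to a simple mean of the rubric dimensions. A sketch (the dimension names here are illustrative labels for the five rubric domains):

```python
def overall_quality_index(ratings: dict[str, float]) -> float:
    # Equal weighting of the rubric dimensions (each rated 0-10),
    # yielding an overall index on the same 0-10 scale.
    return sum(ratings.values()) / len(ratings)
```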
5. Results
5.1. Quantitative Evaluation Summary
Table 6.
Evaluation scores across models, averaged over all prompts.
Table 7.
Evaluation scores for each model across different prompt types.
- Success rate and reliability:
All models achieved similar levels of API stability (>99% completion across runs), indicating highly reliable request handling and response generation (Table 6).
- Operational efficiency:
The median response times showed wide variation (Table 6). The fastest responses were observed for Mistral-Large (2.2 s) and Gemini-1.5-Pro (~2.5 s), followed by Command-A (2.8 s) and GPT-4o (3.0 s). Larger and instruction-tuned models demonstrated longer latencies, including Claude-3-Opus (4.0 s), Falcon-7B (4.4 s), BioGPT-Large (5.5 s), LLaMA-3-70B (8.5 s), and OpenBioLLM-70B (9.5 s). Cost analysis indicated a corresponding spread in cost per response, ranging from approximately USD 0.011 (Gemini-1.5-Pro) to USD 0.225 (Claude-3-Opus), with LLaMA-3-70B and OpenBioLLM-70B averaging around USD 0.073.
- Structural quality (parsability rate):
As stated in Table 4, LLaMA-3-70B achieved the highest parsability (approximately 99.8%), indicating near-perfect adherence to the predetermined JSON/MCQ schema. OpenBioLLM-70B achieved the second-highest structural completeness, at approximately 87.5%. Falcon-7B and BioGPT-Large reached close to 100% structural completeness, although their outputs showed weaker semantic validity. Mistral-Large and Gemini-1.5-Pro showed only moderate structural reliability (approximately 40–60%), although structured prompting templates (Few-Shot or CoT) markedly improved it.
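A minimal sketch of such a parsability check, assuming a simple JSON schema with `stem`, `options` (keys A–D), and `answer` fields (the exact schema used in the pipeline is not reproduced here):

```python
import json

REQUIRED_KEYS = {"stem", "options", "answer"}  # assumed schema fields

def is_parsable(raw_output: str) -> bool:
    """Return True when a model response parses as JSON and matches the
    assumed MCQ schema: a stem, exactly four options labelled A-D, and
    an answer key that names one of those options."""
    try:
        item = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not REQUIRED_KEYS <= set(item):
        return False
    options = item["options"]
    return (isinstance(options, dict)
            and set(options) == {"A", "B", "C", "D"}
            and item["answer"] in options)

def parsability_rate(outputs):
    """Percentage of responses satisfying the schema, as reported per model."""
    return 100.0 * sum(map(is_parsable, outputs)) / len(outputs)

good = '{"stem": "A 54-year-old...", "options": {"A": "x", "B": "y", "C": "z", "D": "w"}, "answer": "B"}'
bad = 'Sure! Here is your question: ...'
print(parsability_rate([good, bad]))  # 50.0
```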
- Automatic text-similarity metrics:
Results for the evaluation metrics BLEU, ROUGE-1/2/L, and METEOR (Table 6) showed that OpenBioLLM-70B and LLaMA-3-70B achieved the highest textual fidelity relative to the references. OpenBioLLM-70B scored highest (BLEU ≈ 0.50, ROUGE-L ≈ 0.696, METEOR ≈ 0.613), followed closely by LLaMA-3-70B (BLEU ≈ 0.48, ROUGE-L ≈ 0.632, METEOR ≈ 0.598).
Mid-tier models, namely GPT-4o and Claude-3-Opus, showed good overlap scores but did not match the top models, while Gemini-1.5-Pro and Mistral-Large delivered comparatively moderate quality, prioritizing speed and cost over linguistic accuracy. Command-A, Falcon-7B, and BioGPT-Large produced the lowest BLEU/ROUGE/METEOR scores; their outputs often mismatched the reference stems and distractors, indicating difficulty aligning with those components.
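The pipeline used standard BLEU, ROUGE, and METEOR implementations; as a self-contained illustration of the token-overlap principle behind these scores, a simplified ROUGE-1 F1 can be computed as:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram ROUGE-1 F1: harmonic mean of precision and recall over
    the multiset overlap of candidate and reference tokens. A simplified
    stand-in for the full BLEU/ROUGE/METEOR toolchain, for illustration."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

ref = "which of the following is the most likely diagnosis"
cand = "which of the following is the best initial diagnosis"
print(round(rouge1_f1(cand, ref), 3))  # 0.778
```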
- Cost-effectiveness and composite utilities:
When normalized for response latency and estimated cost (Table 6), the cost-adjusted composite indices (Quality-per-Dollar and Quality-per-Second) favored models that balance accuracy and efficiency. The leading configurations clustered around LLaMA-3-70B + CoT, Claude-3-Opus + Self-Refine, and OpenBioLLM-70B + CoT/Zero-Shot. Among these, LLaMA-3-70B + CoT demonstrated the best trade-off between generation quality, structural compliance, and time efficiency, making it the top performer on the cost-adjusted indices.
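These composite utilities are simple ratios of quality score to cost and to latency; a sketch using illustrative values from Table 6 (the study's exact normalization, e.g., the ×10³ scaling shown later in Figure 4b, is not reproduced here):

```python
def quality_per_dollar(quality: float, cost_usd: float) -> float:
    """Cost-adjusted utility: quality score per US dollar spent per call."""
    return quality / cost_usd

def quality_per_second(quality: float, latency_s: float) -> float:
    """Time-adjusted utility: quality score per second of response latency."""
    return quality / latency_s

# Illustrative per-call values for OpenBioLLM-70B + CoT (Table 6).
q, cost, latency = 90.4, 0.073, 9.5
print(round(quality_per_dollar(q, cost), 2))   # 1238.36
print(round(quality_per_second(q, latency), 2))  # 9.52
```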
5.2. Effect of Prompt Strategy
Prompt engineering strategies had a consistent and significant effect on the quality of the MCQs produced by all models evaluated. The distribution of the global performance metrics for the different prompting methods, the model–template interactions, and the cost-adjusted efficiencies are shown in Figure 1, Figure 2 and Figure 3. Performance improved clearly with the prompting paradigm used:
Figure 2.
Effect of prompt engineering strategies on MCQ generation quality.

Figure 3.
Interaction between model and prompt engineering strategy.
CoT > Self-Refine > Few-Shot > Zero-Shot
5.2.1. Overall Prompt Strategy Performance Analysis
- (a) Average Quality Score by Prompt Strategy Type
As illustrated in Figure 2a, a clear hierarchical trend was observed across the four prompting paradigms. CoT achieved the highest mean quality score (68.9/100), followed by Self-Refine (64.0), Few-Shot (62.5), and Zero-Shot (53.2).
The large gap in performance between the CoT and other strategies indicates the importance of explicit reasoning for improving medical MCQ generation, particularly where contextual inference and key-selection logic are needed. Self-Refine improved output clarity and formatting accuracy through iterative self-correction, while Zero-Shot produced a poorer result, indicating limitations on unguided generation for complex structured output.
- (b) Prompt Strategy Performance Distribution Across All Models
In Figure 2b, the variation in performance across models is illustrated with box-plot statistics showing interquartile ranges. The CoT and Self-Refine templates produced not only higher median results but also smaller interquartile ranges, indicating that output quality was stable and predictable across models of varying capacity. Zero-Shot performance showed the broadest spread and several low outliers, indicating unstable behavior and a higher incidence of generation errors (e.g., incomplete responses unrelated to the prompt or case, and incorrectly formatted outputs). Few-Shot templates showed moderate variance, indicating that a few in-context examples increased response reliability but did not stabilize generative behavior to the degree achieved by reasoning-based prompts.
- (c) Quality vs. Consistency Relationship
As illustrated in Figure 2c, there is a positive, linear correlation between overall quality and consistency scores. The high-capacity models (OpenBioLLM-70B and LLaMA-3-70B) delivered high-quality outputs and showed highly consistent behavior under CoT and Self-Refine prompting. The CoT (red) and Self-Refine (yellow) clusters occupied the upper-right quadrant, indicating both high quality and high consistency. Zero-Shot (green) samples clustered in the lower-left region, confirming that the lack of contextual scaffolding leads to more erratic outputs. Finally, Few-Shot (blue) points formed a middle band, confirming moderate gains in both stability and quality when a few examples were given.
- (d) Quality vs. Efficiency Comparison
The efficiency and quality metrics across templates are depicted in Figure 2d. The CoT prompts demonstrated the most balanced profile, producing high-quality output with reasonably high efficiency. Self-Refine prompting was slightly less efficient because of the additional interactions involved in its iterative reasoning, but it retained near-equal quality scores. The Few-Shot prompt achieved efficiency similar to Self-Refine but with lower accuracy, while the Zero-Shot prompt generated responses fastest yet scored lowest overall once the overhead of corrections and parsing was accounted for. This reinforces the idea that concise, well-structured reasoning traces, rather than minimal prompts or long, unfocused text, are key to cost- and time-effective generation.
Overall, these analyses show that reasoning-oriented prompting (particularly CoT and Self-Refine) yields better and more consistent performance in medical case-based MCQ generation in both linguistic and structural terms. Few-Shot offers a practical middle ground for mid-tier models, while Zero-Shot tends to produce lower-quality outputs.
5.2.2. Model–Prompt Interaction Analysis
- (a) Model–Template Performance Heatmap
In Figure 3a, the performance heatmap (scores 0–100) shows distinct interaction patterns between model families and prompting strategies. OpenBioLLM-70B with the CoT template scored highest overall (90.4), followed by LLaMA-3-70B (84.1, CoT) and Gemini-1.5-Pro (79.1, Self-Refine). Mid-range performers, such as GPT-4o and Claude-3-Opus, achieved moderate scores (≈74–77) across multiple templates, while smaller architectures such as Mistral-Large, Command-A, and Falcon-7B lagged substantially, especially under Zero-Shot (scores < 45). Across all models, CoT produced the highest average quality (68.9), reflecting its stronger support for structured responses and correct answer-key selection, whereas Zero-Shot produced the lowest scores owing to the absence of guiding cues and incomplete reasoning behind the answers given.
- (b) Best Prompt Strategy Template per Model
The scores reported in Figure 3b reflect the best-performing prompt for each model and show that each model reached its maximum with a reasoning- or refinement-oriented template. The highest-capacity models, OpenBioLLM-70B (90.4) and LLaMA-3-70B (84.1), performed best with CoT, indicating a distinct advantage of CoT for long-context architectures. Gemini-1.5-Pro (79.1) and BioGPT-Large (78.5) performed best with Self-Refine, which produced syntactically correct outputs through iterative self-correction. GPT-4o (76.8) and Command-A (63.3) achieved their best results with Few-Shot, where sample-based structures stabilized the output format. The smaller models, Mistral-Large (51.7) and Falcon-7B (33.8), gained only limited improvement, although CoT slightly improved their logical coherence relative to unguided prompting.
- (c) Prompt Strategy Template Effectiveness by Model Group
Figure 3c aggregates the average template performance over the four capability tiers (Basic, Mid-Range, Strong, and Top Performers). The stratified pattern indicates how the advantages of structured prompting scale. Among the Top Performers, CoT averaged 87.2, beating Self-Refine (77.8) and Few-Shot (73.8), while Zero-Shot did worst at 70.2. Within the Strong models, Self-Refine (72.3) provided the most consistent improvements, reinforcing the role of iterative self-correction in medium-capacity architectures. In the Mid-Range and Basic models, the performance gaps between templates were small (<5 points), indicating that their limited reasoning depth prevents them from fully exploiting structured prompting.
This implies that CoT prompting is likely the most effective way to scale large biomedical LLMs for producing high-quality MCQ items, working particularly well when the model is guided through explicit reasoning steps. Self-Refine offers a useful alternative for models with limited context windows, helping them reach deeper reasoning levels and thereby improving both structural and semantic quality. Few-Shot prompting remains a reasonable compromise when in-context examples are available, whereas Zero-Shot is consistently disadvantaged because it cannot anchor generation in the specifics of the task. This stepwise progression also parallels educational theory on cognitive-process questions, in which structured reasoning and reflective refinement resemble the cues that expert human item writers rely on.
5.2.3. Cost-Effectiveness and Time-Efficiency Analysis
- (a) Cost vs. Quality Relationships
As shown in Figure 4a, the scatter distribution of cost per call against quality score reveals no correlation trend: higher-quality outputs are not necessarily associated with higher costs. In this plot, each color represents a prompting strategy, and each point within a color corresponds to a different model using that strategy. CoT and Self-Refine templates clustered in the upper-left region of the plot, reflecting strong quality at moderate cost, while Zero-Shot and Few-Shot prompts occupied lower regions owing to weaker output quality and higher correction overhead. This pattern highlights the cost-efficiency of structured reasoning frameworks, where incremental processing time is compensated by superior text quality, interpretability, and reusability of the generated MCQs. On average, the most cost-effective prompts yielded over 2.5× higher quality-per-dollar ratios than unguided generations. Here, cost is measured in milli-dollars (1 milli-dollar = 0.001 USD).

Figure 4.
Trade-offs between quality, cost, and latency across prompt strategies and models.
- (b) Top 10 Most Cost-Effective Model–Prompt Strategy Template Combinations
Figure 4b ranks the top ten configurations by quality score per U.S. dollar (×103), consolidating the findings across both performance and cost metrics.
The leading combinations were dominated by reasoning-oriented templates:
- LLaMA-3-70B + CoT (504);
- Claude-3-Opus + Self-Refine (415);
- OpenBioLLM-70B + Zero-Shot (358).
Interestingly, OpenBioLLM-70B achieved a favorable balance between high reasoning quality and low API cost, while Claude-3-Opus demonstrated strong cost-adjusted returns under Self-Refine prompting. Mid-tier systems, such as Command-A (220, Few-Shot) and Gemini-1.5-Pro (195, Self-Refine), also performed competitively, suggesting that structured prompting mitigates cost–performance disparities among differently sized models.
- (c) Time Efficiency by Prompt Strategy
The Quality-Per-Second analysis (Figure 4c) quantifies temporal efficiency. CoT again ranked first (39.0), substantially exceeding Self-Refine (23.3), Few-Shot (22.4), and Zero-Shot (16.9). Despite CoT’s longer reasoning traces, its high-quality outputs yield the best overall efficiency ratio once normalized by completion time.
The results imply that structured reasoning reduces the need for regeneration or post-editing, thus improving end-to-end workflow efficiency. Self-Refine templates, although slower, produce stable text quality that works well for semi-automated MCQ workflows where slower review times are acceptable.
- (d) Prompt Strategy recommendations by use case
The recommendation table (Table 8) distills this section’s findings into practical guidance for educational and clinical AI applications. For high-accuracy tasks, OpenBioLLM-70B + CoT (90.4), LLaMA-3-70B + CoT (84.1), and BioGPT-Large + Self-Refine (78.5) proved most effective. For cost-sensitive or time-critical situations, Claude-3-Opus + Self-Refine (70.5) and OpenBioLLM-70B + Zero-Shot (64.0) provided the best balance. This highlights that the right prompt strategy depends on the available computational resources and the educational or operational goals of the MCQ generation system.
Table 8.
Prompt Strategy recommendation by model.
In conclusion, CoT prompts achieve the highest utility once normalized for cost and time, confirming their place as the strongest overall candidate for generating high-fidelity, reasoning-based educational resources. Self-Refine prompts represent a strong second-best alternative that works particularly well with mid-capacity biomedical models, where the balance between output clarity, structure, and cost appears optimal. Few-Shot is a viable lightweight option for very low-cost generation, while Zero-Shot remains inefficient for these highly specialized reasoning tasks because of its instability and subsequent editing requirements. Overall, triangulating quality, cost, and time supports the finding that structured, iterative prompting produces the best trade-off among the three in automated medical MCQ development, alongside gains in resource efficiency.
5.2.4. Comparative Insights and Educational Implications
In summary, the comparative analysis across prompt strategies shows that structured reasoning (CoT) and iterative self-correction (Self-Refine) substantially improved the quality, consistency, and clinical validity of automatically generated medical case-based MCQs. These techniques not only achieved improved textual similarity and structural accuracy but also produced more consistent question–answer congruence across models of differing capability. From an educational design perspective, this mirrors the reasoning processes of human experts in question formulation, where explicit diagnostic justification and reflective review lead to high pedagogical quality. In contrast, Few-Shot prompting yields moderate benefits in both efficiency and response stability, making it a useful alternative for mid-range models or time-constrained applications. Zero-Shot prompting, while computationally cheaper, consistently shows deficits in diagnostic accuracy, distractor relevance, and alignment with Bloom’s taxonomy, all of which indicate limited applicability for professional medical assessment. Taken together, these results indicate that reasoning-augmented prompting paradigms, especially CoT and Self-Refine, are the most effective routes to LLM outputs aligned with human-written educational standards. The performance advantage extends beyond verbal fidelity to content validity, cognitive congruence, and cost-effectiveness, and therefore offers an effective way to scale evidence-supported automation in medical question generation systems.
5.3. Human Evaluation Results
Alongside the automated text-based metrics, a human validation study was performed on the single best-performing model–prompt combination, OpenBioLLM-70B with CoT prompting. This model–template pair received the highest overall quality score in the quantitative evaluation phase and consistently outperformed all other combinations in linguistic fidelity, clinical reasoning quality, and template robustness. Restricting the human validation to this single configuration ensured that expert time was focused on the most pedagogically meaningful output and that the evaluation represented the best-case scenario for the quality of LLM-generated MCQs. Each MCQ was scored on a 0–10 scale across five widely used educational quality dimensions:
- (1) Appropriateness of the question;
- (2) Clarity and specificity;
- (3) Relevance to clinical context;
- (4) Quality of alternatives (distractors);
- (5) Cognitive level (Bloom’s taxonomy).
- Expert-Level Scoring Trends
A consistent rating difference was observed between the two evaluators, as shown in Table 9.
Table 9.
Human expert evaluation of MCQs generated using OpenBioLLM-70B with Chain-of-Thought prompting (n = 88).
- Expert 1 (Higher Rater—a senior academic with 9 years’ experience in high-stakes medical examination construction and vetting) provided generally more favorable assessments, with average scores across metrics ranging from 7.47 to 8.69, and a total composite score of approximately 39.57/50.
- Expert 2 (Stricter Rater—a senior consultant and chair of an exam committee in a national residency program, with 7 years’ experience in exam blueprinting, item writing, and quality assurance) displayed more conservative scoring behavior, with mean scores ranging from 6.11 to 6.48, and a total composite score of 31.52/50.
Though there was a difference in the overall rating scores, both raters demonstrated similar rank order across the five criteria, indicating strong agreement with respect to the generated items’ relative strengths and weaknesses.
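One way this rank-order agreement could be quantified (not reported in the study) is a Spearman rank correlation over the raters' per-criterion means; the rating values below are hypothetical:

```python
def rankdata(values):
    """Ranks with ties averaged (1 = smallest value)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical per-criterion mean scores for the two raters.
expert1 = [8.1, 7.9, 8.7, 7.5, 7.8]
expert2 = [6.3, 6.2, 6.5, 6.1, 6.2]
print(round(spearman_rho(expert1, expert2), 3))  # 0.975
```

A value near 1 indicates that, despite the level difference, the two raters order the criteria almost identically.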
- Metric-Specific Findings
For both raters, relevance to clinical context received the highest average combined rating (7.57), indicating that OpenBioLLM-70B + CoT preserved clinical meaning relevant to the diagnostic, investigative, and management decisions a clinician would make in practice. The Appropriateness, Clarity, and Cognitive level ratings (7.25–7.35) showed that the generated items were structurally coherent within the MCQ stem and demonstrated moderately strong reasoning alignment with Bloom’s taxonomy. Both raters assigned the lowest rating to “Quality of alternatives (distractors)” (6.79), indicating that the plausibility and discriminatory value of distractors remain the primary area for improvement, even under the best configuration. This concern aligns with a recognized issue in the literature on automated MCQ generation, where distractors must reflect precise medical nuance and require discipline-specific reasoning.
- Observations and Implications
Comparing their scores, Expert 2 rated every item lower than Expert 1, a consistent difference in stringency. Because both raters produced similar rank orders across the criteria, this pattern appears to reflect a difference in scoring style rather than in judgment of content.
Future studies could consider a score normalization approach (e.g., z-scores) to control for rater-specific scale differences before calculating aggregate scores. From a content development perspective, because both raters identified distractor quality as the weakest dimension, future generation pipelines should prioritize distractor plausibility and clinical validity, balanced against cognitive challenge, before items are presented for review.
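A minimal sketch of the suggested per-rater z-score normalization (population standard deviation; the item scores below are hypothetical, not the study's data):

```python
def zscore_by_rater(scores):
    """Standardize one rater's scores to mean 0, SD 1, so that a strict
    and a lenient rater become directly comparable before aggregation."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return [(s - mean) / sd for s in scores]

# Hypothetical item scores from a lenient and a strict rater.
lenient = [8.0, 7.5, 9.0, 6.5]
strict = [6.5, 5.5, 7.5, 5.0]
z1, z2 = zscore_by_rater(lenient), zscore_by_rater(strict)
pooled = [(a + b) / 2 for a, b in zip(z1, z2)]  # rater-adjusted item scores
print([round(z, 2) for z in pooled])  # [0.33, -0.46, 1.41, -1.28]
```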
Overall, the human validation confirmed that the highest-rated configuration (OpenBioLLM-70B + CoT) produces MCQs of reasonably high quality, with clinical relevance, structural clarity, and strong cognitive alignment, while again identifying distractor construction as the primary area requiring improvement. This focused evaluation thus establishes a reliable upper bound on the quality of items the current pipeline can produce.
6. Limitations
The findings of this study should be interpreted with caution because of several limitations. First, although nine LLMs and four prompting techniques were assessed, the expert validation step evaluated only the single highest-performing model–prompt combination (OpenBioLLM-70B + CoT). This restriction made efficient use of expert time by focusing on the most clinically relevant results, but it precluded direct comparison of human ratings across all model–prompt combinations. Second, the expert review was performed by two raters with different baseline scoring patterns, indicating systematic differences in their stringency. Summary averages were provided, but future analyses could also include per-rater normalization of the ratings and inter-rater reliability calculations. Third, the automated metrics used in this study (BLEU, ROUGE, and METEOR) provide only a limited proxy for clinical reasoning quality, as they primarily measure surface-level text similarity. In particular, because these metrics do not consider whether distractor choices are plausible, qualitatively different but valid answer choices could not be compared. The metrics enabled objective and reproducible comparisons across models but were inadequate for measuring educational validity, which was therefore assessed through human expert evaluation. Finally, the evaluations were based on case vignettes collected from a single institutional dataset and may not be representative of other large clinical training programs or specialty-specific test items.
7. Future Work
Future studies could extend human validation to multiple model–prompt combinations, using expert assessment to establish validity across the full performance range. Such studies should recruit a larger sample of clinical educators and use standardized psychometric measures (e.g., discrimination index, distractor efficiency, cognitive level mapping) to increase generalizability and allow more precise benchmarking. In addition, future studies will incorporate cross-evaluation, whereby multiple independent clinical educators assess MCQs generated across different model–prompt combinations, to reduce evaluator bias and strengthen robustness and generalizability. Studies should also integrate retrieval-augmented generation (RAG) over domain sources (e.g., UpToDate, PubMed) to improve evidence-based distractor generation. A promising direction is to investigate multi-stage pipelines in which one LLM generates the MCQ, another model critiques or refines it, and a final stage applies quality filters before expert review. Longitudinal studies could also examine how LLM-generated items perform when deployed in real student assessments. Finally, we plan to extend the proposed evaluation framework so that newly released LLMs, once sufficiently documented and reproducible, can be systematically compared with currently available models using the same methodology.
8. Conclusions
This study, a multi-axis evaluation of nine LLMs and four prompting strategies, establishes that prompting strategy and model selection affect the quality of generated assessment items, as measured by textual fidelity metrics, structural parsability, cost–latency profiling, and expert validation. CoT prompting produced the best results for coherence, alignment with clinical reasoning, and parsability. The medical model OpenBioLLM-70B with CoT had the highest performance of all evaluated models, with optimal levels of language quality, clinical relevance, and operational efficiency. The expert reviewers’ assessment of this CoT configuration established a strong evidence base for its educational validity; however, distractor quality remains the primary area requiring further improvement, and the variability between expert reviewers shows the need for an established rubric and calibration process for human raters. In conclusion, we demonstrate that state-of-the-art LLMs, when guided by clean, structured prompting, can reliably generate medical questions and provide a cost-effective, scalable route to producing them from credentialed sources. These findings lay the groundwork for an expandable, reproducible evaluation framework that gives institutions actionable recommendations for integrating LLM-assisted MCQ generation into medical education and assessment workflows.
Author Contributions
Writing—original draft preparation, S.A.S.; review and editing, A.A.A. and A.A. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The MedMCQA dataset is publicly available at https://github.com/MedMCQA/MedMCQA (accessed on 22 June 2025). Our generated MCQs, prompting templates, and evaluation scripts are available from the corresponding author upon reasonable request.
Acknowledgments
This study would not have been possible without the invaluable contribution of Maha Al-Jabri. Her expert evaluation of the generated MCQs provided critical clinical insight, which enhanced the educational validity and rigor of the human validation component of the study.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| MCQ | Multiple Choice Question |
| CB | Case-Based |
| LLM | Large Language Model |
References
- Lee, H.Y.; Yune, S.J.; Lee, S.Y.; Im, S.; Kam, B.S. The impact of repeated item development training on the prediction of medical faculty members’ item difficulty index. BMC Med. Educ. 2024, 24, 599. [Google Scholar] [CrossRef]
- Law, A.K.; So, J.; Lui, C.T.; Choi, Y.F.; Cheung, K.H.; Kei-ching Hung, K.; Graham, C.A. AI versus human-generated multiple-choice questions for medical education: A cohort study in a high-stakes examination. BMC Med. Educ. 2025, 25, 208. [Google Scholar] [CrossRef] [PubMed]
- Ch’en, P.Y.; Day, W.; Pekson, R.C.; Barrientos, J.; Burton, W.B.; Ludwig, A.B.; Jariwala, S.P.; Cassese, T. GPT-4 generated answer rationales to multiple choice assessment questions in undergraduate medical education. BMC Med. Educ. 2025, 25, 333. [Google Scholar] [CrossRef] [PubMed]
- Laupichler, M.C.; Rother, J.F.; Grunwald Kadow, I.C.; Ahmadi, S.; Raupach, T. Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions. Acad. Med. 2024, 99, 508–512. [Google Scholar] [CrossRef] [PubMed]
- Cheung, B.H.H.; Lau, G.K.K.; Wong, G.T.C.; Lee, E.Y.P.; Kulkarni, D.; Seow, C.S.; Wong, R.; Co, M.T.-H. ChatGPT versus human in generating medical graduate exam multiple choice questions—A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom). PLoS ONE 2023, 18, e0290691. [Google Scholar] [CrossRef]
- Schulhoff, S.; Ilie, M.; Balepur, N.; Kahadze, K.; Liu, A.; Si, C.; Li, Y.; Gupta, A.; Han, H.; Schulhoff, S.; et al. The Prompt Report: A Systematic Survey of Prompt Engineering Techniques. arXiv 2025. [Google Scholar] [CrossRef]
- Yao, Z.; Parashar, A.; Zhou, H.; Jang, W.S.; Ouyang, F.; Yang, Z.; Yu, H. MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback. arXiv 2024. [Google Scholar] [CrossRef]
- Liu, F.; AlDahoul, N.; Eady, G.; Zaki, Y.; Rahwan, T. Self-Reflection Makes Large Language Models Safer, Less Biased, and Ideologically Neutral. arXiv 2024. [Google Scholar] [CrossRef]
- Palmer, E.J.; Devitt, P.G. Assessment of higher order cognitive skills in undergraduate education: Modified essay or multiple choice questions? Research paper. BMC Med. Educ. 2007, 7, 49. [Google Scholar] [CrossRef]
- Thistlethwaite, J.E.; Davies, D.; Ekeocha, S.; Kidd, J.M.; MacDougall, C.; Matthews, P.; Purkis, J.; Clay, D. The effectiveness of case-based learning in health professional education. A BEME systematic review: BEME Guide No. 23. Med. Teach. 2012, 34, e421–e444. [Google Scholar] [CrossRef]
- Kurdi, G.R. Generation and Mining of Medical, Case-Based Multiple Choice Questions. Ph.D. Thesis, The University of Manchester, Manchester, UK, 2020. [Google Scholar]
- Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Hou, L.; Clark, K.; Pfohl, S.; Cole-Lewis, H.; Neal, D.; et al. Towards Expert-Level Medical Question Answering with Large Language Models. arXiv 2023. [Google Scholar] [CrossRef]
- Fleming, S.L.; Morse, K.; Kumar, A.; Chiang, C.-C.; Patel, B.; Brunskill, E.; Shah, N. Assessing the Potential of USMLE-Like Exam Questions Generated by GPT-4. medRxiv 2023. [Google Scholar] [CrossRef]
- Nori, H.; King, N.; McKinney, S.M.; Carignan, D.; Horvitz, E. Capabilities of GPT-4 on Medical Challenge Problems. arXiv 2023. [Google Scholar] [CrossRef]
- Garber, M.; Feng, H.; Ronzano, F.; LaFleur, J.; De Oliveira, R.; Rough, K.; Roth, K.; Nanavati, J.; Zine El Abidine, K.; Mack, C. Evaluation of Large Language Model Performance on the Biomedical Language Understanding and Reasoning Benchmark: Comparative Study. JMIR 2024. preprints. [Google Scholar] [CrossRef]
- Van Uhm, J.; Van Haelst, M.M.; Jansen, P.R. AI-Powered Test Question Generation in Medical Education: The DailyMed Approach. medRxiv 2024. [Google Scholar] [CrossRef]
- Jeong, M.; Sohn, J.; Sung, M.; Kang, J. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics 2024, 40, i119–i129. [Google Scholar] [CrossRef] [PubMed]
- Dorfner, F.J.; Dada, A.; Busch, F.; Makowski, M.R.; Han, T.; Truhn, D.; Kleesiek, J.; Sushil, M.; Lammert, J.; Adams, L.C.; et al. Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical Data. arXiv 2024. [Google Scholar] [CrossRef]
- Maharjan, J.; Garikipati, A.; Singh, N.P.; Cyrus, L.; Sharma, M.; Ciobanu, M.; Barnes, G.; Thapa, R.; Mao, Q.; Das, R. OpenMedLM: Prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models. Sci. Rep. 2024, 14, 14156. [Google Scholar] [CrossRef]
- Grévisse, C.; Pavlou, M.A.S.; Schneider, J.G. Docimological Quality Analysis of LLM-Generated Multiple Choice Questions in Computer Science and Medicine. SN Comput. Sci. 2024, 5, 636. [Google Scholar] [CrossRef]
- Artsi, Y.; Sorin, V.; Konen, E.; Glicksberg, B.S.; Nadkarni, G.; Klang, E. Large language models for generating medical examinations: Systematic review. BMC Med. Educ. 2024, 24, 354. [Google Scholar] [CrossRef]
- Zhu, Y.; Tang, W.; Yang, H.; Niu, J.; Dou, L.; Gu, Y.; Wu, Y.; Zhang, W.; Sun, Y.; Yang, X. The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams. arXiv 2025. [Google Scholar] [CrossRef]
- Bedi, S.; Fleming, S.L.; Chiang, C.-C.; Morse, K.; Kumar, A.; Patel, B.; Jindal, J.A.; Davenport, C.; Yamaguchi, C.; Shah, N.H. QUEST-AI: A System for Question Generation, Verification, and Refinement using AI for USMLE-Style Exams. In Biocomputing 2025; World Scientific: Kohala Coast, HI, USA, 2024; pp. 54–69. [Google Scholar] [CrossRef]
- Shahriar, S.; Dara, R.; Akalu, R. A comprehensive review of current trends, challenges, and opportunities in text data privacy. Comput. Secur. 2025, 151, 104358. [Google Scholar] [CrossRef]
- Shahriar, S.; Dara, R. Priv-IQ: A Benchmark and Comparative Evaluation of Large Multimodal Models on Privacy Competencies. AI 2025, 6, 29. [Google Scholar] [CrossRef]
- Elzayyat, M.; Mohammad, J.N.; Zaqout, S. Assessing LLM-generated vs. expert-created clinical anatomy MCQs: A student perception-based comparative study in medical education. Med. Educ. Online 2025, 30, 2554678. [Google Scholar] [CrossRef] [PubMed]
- Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. arXiv 2023, arXiv:2205.11916. [Google Scholar] [CrossRef]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020. [Google Scholar] [CrossRef]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2023. [Google Scholar] [CrossRef]
- Miao, J.; Thongprayoon, C.; Suppadungsuk, S.; Krisanapan, P.; Radhakrishnan, Y.; Cheungpasitporn, W. Chain of Thought Utilization in Large Language Models and Application in Nephrology. Medicina 2024, 60, 148. [Google Scholar] [CrossRef]
- Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. Self-Refine: Iterative Refinement with Self-Feedback. arXiv 2023. [Google Scholar] [CrossRef]
- Yue, M.; Yao, W.; Mi, H.; Yu, D.; Yao, Z.; Yu, D. DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search. arXiv 2025. [Google Scholar] [CrossRef]
- Kıyak, Y.S.; Emekli, E. ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: A literature review. Postgrad. Med. J. 2024, 100, 858–865. [Google Scholar] [CrossRef]
- Saad, M.; Almasri, W.; Hye, T.; Roni, M.; Mohiyeddini, C. Analysis of ChatGPT-3.5’s Potential in Generating NBME-Standard Pharmacology Questions: What Can Be Improved? Algorithms 2024, 17, 469. [Google Scholar] [CrossRef]
- Kıyak, Y.S.; Kononowicz, A.A. Using a Hybrid of AI and Template-Based Method in Automatic Item Generation to Create Multiple-Choice Questions in Medical Education: Hybrid AIG. JMIR Form. Res. 2025, 9, e65726. [Google Scholar] [CrossRef] [PubMed]
- Pal, A.; Umapathi, L.K.; Sankarasubbu, M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. arXiv 2022. [Google Scholar] [CrossRef]
- Kipp, M. From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance. Information 2024, 15, 543. [Google Scholar] [CrossRef]
- Sonoda, Y.; Kurokawa, R.; Nakamura, Y.; Kanzawa, J.; Kurokawa, M.; Ohizumi, Y.; Gonoi, W.; Abe, O. Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in “Diagnosis Please” cases. Jpn. J. Radiol. 2024, 42, 1231–1235. [Google Scholar] [CrossRef]
- Franzen, C. Cohere Targets Global Enterprises with New Highly Multilingual Command A Model Requiring Only 2 GPUs. VentureBeat. Available online: https://venturebeat.com/ai/cohere-targets-global-enterprises-with-new-highly-multilingual-command-a-model-requiring-only-2-gpus (accessed on 27 October 2025).
- Cohere. Command A: An Enterprise-Ready Large Language Model. Technical Report. Available online: https://cohere.com/research/papers/command-a-technical-report.pdf (accessed on 26 April 2025).
- Singh, P. Llama 3.3 70B Is Here! 25x Cheaper than GPT-4o. Analytics Vidhya. Available online: https://www.analyticsvidhya.com/blog/2024/12/meta-llama-3-3-70b/ (accessed on 27 October 2025).
- Oketch, K.; Lalor, J.P.; Yang, Y.; Abbasi, A. Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring. arXiv 2025. [Google Scholar] [CrossRef]
- Jahan, I.; Laskar, M.T.R.; Peng, C.; Huang, J. Evaluating the Effectiveness of Cost-Efficient Large Language Models in Benchmark Biomedical Tasks. arXiv 2025. [Google Scholar] [CrossRef]
- Zhang, G.; Jin, Q.; Zhou, Y.; Wang, S.; Idnay, B.; Luo, Y.; Park, E.; Nestor, J.G.; Spotnitz, M.E.; Soroush, A.; et al. Closing the gap between open source and commercial large language models for medical evidence summarization. Npj Digit. Med. 2024, 7, 239. [Google Scholar] [CrossRef]
- Pan, G.; Wang, H. A Cost-Benefit Analysis of On-Premise Large Language Model Deployment: Breaking Even with Commercial LLM Services. arXiv 2025. [Google Scholar] [CrossRef]
- Luo, R.; Sun, L.; Xia, Y.; Qin, T.; Zhang, S.; Poon, H.; Liu, T.-Y. BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining. Brief. Bioinform. 2022, 23, bbac409. [Google Scholar] [CrossRef]
- Dorfner, F.J.; Dada, A.; Busch, F.; Makowski, M.R.; Han, T.; Truhn, D.; Kleesiek, J.; Sushil, M.; Adams, L.C.; Bressem, K.K. Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks. J. Am. Med. Inform. Assoc. 2025, 32, 1015–1024. [Google Scholar] [CrossRef] [PubMed]
- OpenAI. Optimizing LLM Accuracy: Context, Prompts, Sampling. Available online: https://platform.openai.com/docs/guides/optimizing-llm-accuracy (accessed on 28 April 2025).
- Mansuy, R. Evaluating NLP Models: A Comprehensive Guide to ROUGE, BLEU, METEOR, and BERTScore Metrics. Plainenglish. Available online: https://plainenglish.io/blog/evaluating-nlp-models-a-comprehensive-guide-to-rouge-bleu-meteor-and-bertscore-metrics-d0f1b1 (accessed on 11 November 2025).
- Li, Z.; Guo, W.; Gao, Y.; Yang, D.; Kang, L. A Large Language Model-Based Approach for Data Lineage Parsing. Electronics 2025, 14, 1762. [Google Scholar] [CrossRef]
- Baran, K. Understanding the Cost Economics of GenAI Systems: A Comprehensive Guide. Medium. Available online: https://medium.com/%40AI-on-Databricks/understanding-the-cost-economics-of-genai-systems-a-comprehensive-guide-24e3d4f22e4f (accessed on 11 November 2025).
- Al Shuraiqi, S.; Aal Abdulsalam, A.; Masters, K.; Zidoum, H.; AlZaabi, A. Automatic Generation of Medical Case-Based Multiple-Choice Questions (MCQs): A Review of Methodologies, Applications, Evaluation, and Future Directions. Big Data Cogn. Comput. 2024, 8, 139. [Google Scholar] [CrossRef]
- Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, 25–26 July 2004. [Google Scholar]
- Tănase, A.-V.; Pelican, E. SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance. arXiv 2025. [Google Scholar] [CrossRef]
- Sun, W.; Wang, J.; Guo, Q.; Li, Z.; Wang, W.; Hai, R. CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines. arXiv 2025. [Google Scholar] [CrossRef]
- Zhao, T.; Wei, M.; Preston, J.S.; Poon, H. Pareto Optimal Learning for Estimating Large Language Model Errors. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; Long Papers; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; Volume 1, pp. 10513–10529. [Google Scholar]
- Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; Casas, D.d.L.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training Compute-Optimal Large Language Models. arXiv 2022. [Google Scholar] [CrossRef]
- Liang, P.; Bommasani, R.; Lee, T.; Tsipras, D.; Soylu, D.; Yasunaga, M.; Zhang, Y.; Narayanan, D.; Wu, Y.; Kumar, A.; et al. Holistic Evaluation of Language Models. arXiv 2023. [Google Scholar] [CrossRef]
- Jin, D.; Pan, E.; Oufattole, N.; Weng, W.-H.; Fang, H.; Szolovits, P. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. arXiv 2020. [Google Scholar] [CrossRef]
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef]
- Zheng, Z.; Fiore, A.M.; Westervelt, D.M.; Milly, G.P.; Goldsmith, J.; Karambelas, A.; Curci, G.; Randles, C.A.; Paiva, A.R.; Wang, C.; et al. Automated Machine Learning to Evaluate the Information Content of Tropospheric Trace Gas Columns for Fine Particle Estimates Over India: A Modeling Testbed. J. Adv. Model. Earth Syst. 2023, 15, e2022MS003099. [Google Scholar] [CrossRef]
- Adams, N.E. Bloom’s taxonomy of cognitive learning objectives. J. Med. Libr. Assoc. 2015, 103, 152–153. [Google Scholar] [CrossRef]
- Al Shuriaqi, S.; Aal Abdulsalam, A.; Masters, K. Generation of Medical Case-Based Multiple-Choice Questions. Int. Med. Educ. 2023, 3, 12–22. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.