Abstract
The use of large language models (LLMs) to automate the generation of medical case-based multiple-choice questions (MCQs) is increasing, but their accuracy, reliability, and educational validity remain poorly understood. This study compared nine LLMs under four prompting methods to evaluate LLM-produced MCQs for clinical coherence and assessment readiness. A uniform evaluation pipeline was constructed covering automatic text-similarity metrics (BLEU, ROUGE, and METEOR), structural and parsability measures, and operational effectiveness (latency, cost, and quality-efficiency ratios). Human validation focused on the model-prompt combination that demonstrated the best linguistic fidelity and clinically aligned reasoning (OpenBioLLM-70B with Chain-of-Thought). Two clinical experts independently reviewed 88 items using a five-domain rubric covering appropriateness, clarity, relevance, distractor quality, and cognitive level. Results indicated substantial variation across models and prompting strategies, with Chain-of-Thought yielding the best overall performance. OpenBioLLM-70B demonstrated the best overall balance of quality, parsability, and efficiency, achieving a prompt template quality score of 90.4, a consistency score of 88.8, and a response time of 3.28 s, with a quality-per-dollar value of 134.11. Expert ratings confirmed clinical alignment, but both raters agreed that distractor quality needed further improvement. These results provide evidence that, under optimal prompting conditions, LLMs can reliably support MCQ generation and provide large-scale, cost-effective support for medical assessment production.
1. Introduction
High-quality multiple-choice questions (MCQs), particularly those wrapped in realistic patient vignettes, remain the backbone of knowledge assessment in undergraduate and postgraduate medical curricula because they combine objectivity with broad content sampling [1]. Authoring clinical MCQs requires extensive manual effort in question preparation and revision. A recent audit of a high-stakes emergency medicine examination found that human experts required ≈96 person-hours to write 100 new MCQs, whereas first drafting with a large language model (LLM) cut that figure to 24.5 h [2]. Writing answer rationales is an additional bottleneck: course directors report spending >30 min per item, a burden halved with LLM assistance [3]. The time and expertise demanded often outstrip faculty capacity, limiting the size and freshness of item banks.
Early empirical studies suggest that modern LLMs can shoulder part of this workload. Multinational trials comparing ChatGPT-generated questions with faculty items showed comparable difficulty and discrimination indices in graduate-level exams, although LLM items tended to target lower-order cognition [4,5]. Subsequent expert review of AI-authored MCQs still uncovers factual slips and cueing flaws, confirming that the prompting strategy often matters as much as which model is used [2].
The growing prompt-engineering literature suggests that output quality depends not only on the underlying model but also on how the task is elicited. A recent survey catalogues techniques such as in-context learning, CoT, decomposition, and self-refinement, each requiring different token allocations that influence LLM reliability (i.e., repeatability of generated outputs given fixed prompts and inputs) during MCQ generation [6]. In the MCQ space, the MCQG-SRefine framework combines expert-designed prompts with iterative critique-and-correct loops that boost USMLE-style question quality by up to 30% over baseline prompting [7]. Recent work has also explored self-refinement as a means of improving output reliability by enabling the model to critique and update its own responses before final generation. However, empirical work on self-refinement demonstrates that its gains are highly sensitive to prompt wording, with small changes negating improvements and revealing the limited robustness of current prompting heuristics [8]. Moreover, comparative evidence remains fragmented: existing studies often evaluate a limited number of models, apply non-uniform prompting, or rely on evaluation measures that are not tailored to educational item quality (e.g., surface text overlap).
In spite of recent advances, to our knowledge, there is limited comparative evidence examining a broad spectrum of large language models (LLMs), ranging from general-purpose (e.g., GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, Command A, LLaMA-3-70B, Falcon 7B, Mistral Large) to domain-specific (e.g., OpenBioLLM-70B, BioGPT-Large), under a uniform set of prompt engineering paradigms for generating case-based medical MCQs. This study addresses that gap by:
- Benchmarking nine LLMs across four prompting strategies (zero-shot baseline, few-shot, CoT, and self-refinement) using standardized case-based medical scenarios.
- Evaluating outputs through a hybrid framework that combines automatic semantic metrics (BLEU, ROUGE, METEOR) and expert ratings of clinical accuracy, reasoning depth, and distractor plausibility.
- Assessing the cost-quality trade-offs to produce evidence-based findings for end-users, exam boards, and AI practitioners who wish to build scalable MCQ generation solutions.
By separating structural reliability, descriptive similarity, and expert-rated educational quality, this work provides a reproducible comparison of model–prompt configurations and identifies practical trade-offs relevant to institutions seeking scalable, human-in-the-loop approaches for MCQ drafting.
2. Related Work
2.1. MCQ Generation in Medical Education
Multiple-choice questions (MCQs) are a mainstay of medical assessment due to their scalability, objectivity, and psychometric precision [9]. When appropriate levels of cognition are considered, properly designed MCQs can evaluate both factual recall and higher-order thinking skills. Generating adequate MCQs, however, is labor-intensive and requires a high level of subject-matter expertise to guarantee clinical accuracy, fairness, and adherence to educational goals. Quality MCQs must be carefully written to minimize item-writing flaws, maintain uniform standards, and strike an adequate balance between testing factual recall and the application of reasoning in medical problem-solving. This demand for precision and educational value makes MCQ writing a laborious process, often requiring several revisions and agreement among experts.

Case-based MCQs, in which questions are linked to realistic patient situations, fill an important need in medical education because of their utility in measuring higher-order cognitive processes [10]. By combining biomedical knowledge with the clinical, social, and contextual aspects of patient care, such questions test not only factual recall but also diagnostic reasoning, clinical decision-making, and the ability to apply knowledge in real-life situations. They promote inquiry-based problem solving, encourage thorough understanding, and help students apply theoretical knowledge in practical clinical settings.

In recent years, attempts have been made to automate MCQ generation through natural language processing (NLP) techniques, ranging from rule-based templates to sequence-to-sequence architectures [11]. While these methods are promising, the earlier rule-based methods were limited by insufficient vocabulary, inflexible templates, and a lack of clinical depth.
Although earlier sequence-to-sequence models have enabled new capabilities for more flexible text generation, they struggle to maintain clinical accuracy and logical reasoning in more complex case-based scenarios. As a result, this gap has led to an increased emphasis on developing more advanced LLMs, which will have an enhanced understanding of medical language and a corresponding reasoning capability required to generate high-quality MCQs [12].
2.2. Large Language Models for Medical Question Generation
The use of large language models (LLMs) for medical question generation has grown rapidly in the past few years, producing a growing body of evidence on their strengths and weaknesses. Early studies suggested that state-of-the-art models like GPT-4 could create coherent, medically relevant, USMLE-style questions that were indistinguishable from human-generated questions [13,14]. In several assessments, GPT-4 consistently outperformed biomedical-specialized LLMs on overall fluency, reasoning, and distractor plausibility [15,16]. Domain-specific LLMs, such as BioGPT, Med-PaLM, and retrieval-augmented pipelines, are being developed to bolster medical reliability; Jeong et al. (2024) [17] showed that retrieval methods (e.g., Self-BioRAG) improved explanation quality and reasoning reliability, but distractor plausibility remained a major issue. In similar studies, such as Dorfner et al. (2024) [18], biomedically fine-tuned LLMs sometimes performed worse than general LLMs due to limited adaptability and lack of generality, suggesting that the scale and breadth of general linguistic coverage can outperform the narrowness of specialized coverage. More recent work emphasizes prompt engineering and refinement strategies as key performance factors. Iterative critique-revision pipelines have been shown to substantially improve distractor quality and reasoning reliability, as Yao et al. (2024) [7] demonstrated when producing USMLE-style questions with better difficulty calibration. Maharjan et al. (2024) [19] showed that open-source models could match fine-tuned biomedical benchmarks when supplied with CoT prompting and self-consistency strategies. Despite these improvements, many problems remain consistent across studies, with distractor weakness (implausible, ambiguous, or more than one right answer) being the leading problem in LLM-generated MCQs [5,20].
LLM reasoning, while improved by CoT prompting, is often impaired by hallucinations and superficial recall [21,22]. Psychometric alignment is another problem: while some studies [7,14] report good calibration against expert-evaluated difficulty levels, others report subject- and skill-dependent differences that cast doubt on the psychometric validity of the generated tests [23].
Using LLMs to generate medical case-based MCQs raises several evaluation and privacy challenges. LLMs trained on sensitive data are vulnerable to unintentional memorization, with reports indicating that 68.5% of anonymized patient records can be reconstructed from clinical data alone [24]. LLMs are also subject to many attack vectors (such as membership inference, training-data extraction, and prompt injection), with jailbreak attacks showing success rates above 20% [25]. To counter these risks, comprehensive evaluation frameworks such as the Priv-IQ benchmark (which evaluates LLMs on eight core competencies, including privacy) use metrics such as the Intraclass Correlation Coefficient (ICC) and Mean Absolute Error (MAE). Limitations remain, however, including the trade-off between utility and privacy and the lack of standardized methods for evaluating privacy [24,25]. Current best practices for safe deployment therefore include conducting Privacy Impact Assessments (PIAs), applying privacy-enhancing techniques such as differential privacy and federated learning to ensure compliance with the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA), and adopting a hybrid model approach.
2.3. Prompt Engineering and Reasoning Strategies
Prompt engineering has become increasingly necessary for the success of LLMs, particularly with important applications in medical education [26]. Although the model architecture and training corpus provide the model with certain basic capabilities, the particulars of how the prompts are presented—comprising context, examples, and reasoning guidelines—have a considerable effect on the accuracy and coherence of the MCQs generated as well as the psychometric validity of their results.
Zero-shot prompting provides a baseline method in which the models are prompted to generate case-based MCQs, without examples [27]. The method relies on the model’s own knowledge and reasoning, leading to results with varying plausibility of the distractors, shallow reasoning, and occasional hallucinations. In medical cases, zero-shot prompts may produce relevant clinical stems, but the distractors may be either implausible or completely incorrect.
Few-shot prompting introduces exemplar MCQs within the prompt to anchor the model’s output to the expected structure, style, and reasoning depth [28]. Empirical evidence suggests that few-shot prompting improves question clarity and distractor plausibility, especially when examples are carefully selected to represent diverse cognitive levels. However, its effectiveness is highly sensitive to the number and quality of examples, with diminishing returns or even bias if examples are too narrow.
Chain-of-Thought (CoT) prompting explicitly encourages stepwise reasoning before the final output. In biomedical tasks, CoT has been shown to enhance logical flow and reasoning fidelity, reducing errors in diagnostic or multi-step clinical reasoning tasks [29,30]. Using CoT reasoning for MCQ creation improves the relationship between stems, correct answers, and distractors by requiring the model to articulate the clinical reasoning behind each item. However, longer rationales increase verbosity and computational cost, creating a trade-off between question quality and the time and effort required to generate it.
Self-refinement and iterative prompting represent the most recent evolution of prompt engineering strategies [31,32]. Self-refinement introduces an iterative “draft–critique–revise” loop intended to reduce contradictions and improve compliance with item-writing constraints. This approach significantly improves distractor plausibility, reduces hallucinations, and enhances psychometric calibration (e.g., increasing medium- and hard-level items). Platforms such as QUEST-AI and OpenMedLM demonstrate that refinement-based prompting enables even open-source models to approach the quality of frontier LLMs [19,23].
In medical education, prompt strategies are important in determining the validity, quality, and clinical relevance of MCQs generated by LLMs [33]. Recent studies indicate that AI-generated MCQs may be comparable to human-written MCQs when specific prompt types (examination-style prompts, clinical-persona prompts, structured-instruction prompts) are used [34,35]. However, how closely AI-generated MCQs approximate human-written ones varies with both the model and the prompting method. Zero-shot prompting appears scalable for generating many MCQs, while few-shot and structured prompting methods have been shown to produce greater clinical alignment and depth of reasoning. Prior reviews also suggest that prompts based on standardized exams (i.e., NBME- or USMLE-style item formats) improve surface validity; even so, some clinical vignettes remain weak, and higher-order cognitive items discriminate less well among examinees. These findings demonstrate both the potential and the limitations of prompt-based MCQ generation and, as explored in our study, underscore the need for standardized benchmarking of multiple prompt methods in medical education.
2.4. Gap Addressed by This Study
Across prior work, comparative evaluation is often limited by differences in model selection, prompting protocols, and outcome measures, making it difficult to isolate the effects of model family versus prompt strategy. Moreover, many evaluations emphasize linguistic quality or perceived usefulness rather than standardized measures of structural compliance and expert-rated educational quality. To address these limitations, we compare nine LLMs across four prompting paradigms using a uniform generation template and a shared evaluation pipeline that integrates structural parsability, descriptive text-based similarity measures, operational latency/cost, and expert review of the top-performing configuration.
3. Methods
3.1. Study Design
The purpose of this research is to comprehensively assess the use of LLMs for generating clinically relevant and accurate case-based MCQs in medical education. The design integrates automated generation, model comparison, and expert evaluation. The research consists of four distinct steps: (1) dataset preparation, (2) model-based MCQ generation with four different prompts, (3) automated evaluation, and (4) human validation. Figure 1 shows the study pipeline.
Figure 1.
Study objectives pipeline.
3.2. Dataset
We employed 100 records from MedMCQA, a publicly available, large-scale multiple-choice question answering corpus introduced by Pal et al. (2022) [36]. The dataset contains 194,457 AIIMS and NEET-PG entrance-exam questions spanning 21 medical subjects and almost 2.4k healthcare topics, with an average length of 12.8 tokens per item.
Each record provides:
- A stem (realistic clinical or basic-science prompt);
- Four or five answer options;
- The single correct choice;
- An explanatory rationale authored by subject-matter experts.
The breadth of specialties (e.g., pharmacology, surgery, obstetrics, and pathology) tests more than ten distinct reasoning skills identified in prior medical-QA taxonomies, making the corpus well-suited for evaluating general-purpose and domain-tuned LLMs.
3.3. Large Language Models (LLMs) Selection and Criteria
Specifically, we benchmarked nine LLMs that collectively represent the full range of current-generation models used in medical question authoring, covering proprietary, enterprise, open-source, and biomedical-specialized systems. Model choice was based on the purpose of the study and the timing of our experiments rather than on novelty: we selected models that were widely used, had large user bases (and hence stability), offered documented API access for replication, and spanned both general-purpose and biomedical-specialized systems, enabling fair, reproducible comparisons across model types and prompting methods.
- Proprietary frontier models: GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro define the current state of the art in general reasoning, contextual understanding, and long-context coherence (supporting up to 1 million tokens). These models are often cited by medical licensing boards for prototype item generation because they are both reliable and able to perform contextual reasoning tasks [21,37,38].
- Enterprise, retrieval-centric system: Command A (Cohere) illustrates integration of retrieval-augmented generation (RAG) workflows, providing grounded factual consistency (i.e., similarity of outputs across repeated runs with the same prompt) for case-based MCQ stems and distractors. It is essential when prompts must integrate textbook-anchored clinical facts to produce accurate, evidence-based distractors [39,40]. Its design supports grounded medical question construction, where factual precision is prioritized.
- Open-source, self-hostable checkpoints: LLaMA-3-70B, Mistral Large, and Falcon 7B offer transparent architectures, efficient parameter scaling, and local deployability for reproducible experimentation. They provide transparency, adaptability, and cost-effective deployment. They enable on-premise or GPU-cluster deployment, reducing generation costs by approximately 20× relative to GPT-4o while maintaining high linguistic fidelity under optimized prompting [41,42,43,44,45]. Open access also supports reproducibility and fine-grained control of temperature, reasoning depth, and question difficulty.
- Domain-specialized biomedical models: OpenBioLLM-70B and BioGPT-Large contribute clinical and biomedical knowledge through pre-training on PubMed and domain corpora, enabling evaluation of terminology capture and domain-reasoning fidelity [46,47]. They combine large-scale biomedical pre-training with strong reasoning ability, bridging the gap between general and domain-specific performance. BioGPT-Large, though smaller, serves as a lightweight reference model to examine how prompt engineering can offset limited scale and context capacity.
Collectively, these nine models provide a balanced benchmark across scale, domain specialization, and prompting strategies, establishing the foundation for evaluating zero-shot, few-shot, CoT, and self-refinement paradigms in case-based medical MCQ generation. Table 1 summarizes the nine LLMs evaluated in this study, along with their key architectural characteristics and features relevant to MCQ generation.
Table 1.
The nine selected LLMs and their features.
3.4. Prompt Engineering Framework
The four prompt paradigms we tested are baseline (zero-shot), few-shot, CoT, and self-refinement.
- Baseline captures the simplest “instruction-only” setup that many users start with [27].
- Few-shot follows the in-context learning recipe introduced by Brown et al. [28] for GPT-3, where a handful of exemplars steer the model without parameter updates.
- CoT adopts the reasoning-trace prompting of Wei et al. [29], asking the model to articulate intermediate clinical logic before giving the final MCQ.
- Self-refine implements the iterative “draft → critique → rewrite” loop proposed by Madaan et al. [31], letting the same model improve its own output at inference time.
We instantiated every paradigm with real vignettes from MedMCQA [36], for example: “A 38-year-old woman comes to the physician for a follow-up examination. Two years ago, she was diagnosed with multiple sclerosis, etc.” Each clinical case scenario was entered into the LLM as a case consisting of demographic information, the reason for the visit, relevant medical history, and important clinical findings. The model was asked to convert each case into a fully structured case-based MCQ, supported by examples of a clinically coherent stem, four possible answers with one correct option, and a justification for why that answer is correct. All prompts were required to produce output reflecting standard medical exam style. This method provided consistency across models and prompts while preserving the clinical context of each case. Table 2 outlines the prompting framework used to standardize MCQ generation across models. Each prompt type applies a different reasoning or refinement strategy: Zero-Shot creates direct schema-based outputs, Few-Shot uses example cases to anchor generation, CoT templates support step-by-step reasoning to make clinical coherence explicit, and Self-Refine uses a two-phase revision loop to improve clarity, formatting, and distractor accuracy. This design ensures a uniform input structure so that all nine LLMs can be evaluated fairly and consistently.
Table 2.
Prompt template types and their skeleton.
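To make the framework concrete, the four paradigms can be sketched as simple prompt builders. This is an illustrative sketch only: the instruction wording, exemplar handling, and critique text below are hypothetical stand-ins, not the study's actual templates (which follow the skeletons in Table 2).

```python
# Illustrative prompt builders for the four paradigms (hypothetical wording).

BASE_INSTRUCTION = (
    "Convert the clinical case below into one exam-style MCQ with a stem, "
    "four options (A-D), one correct answer, and a brief rationale.\n"
)

def zero_shot_prompt(case: str) -> str:
    # Instruction-only: no exemplars, relies on the model's own knowledge.
    return BASE_INSTRUCTION + f"Case: {case}"

def few_shot_prompt(case: str, exemplars: list[str]) -> str:
    # Prepend worked exemplar MCQs to anchor structure, style, and depth.
    shots = "\n\n".join(f"Example:\n{e}" for e in exemplars)
    return f"{shots}\n\n{BASE_INSTRUCTION}Case: {case}"

def cot_prompt(case: str) -> str:
    # Ask for intermediate clinical reasoning before the final item.
    return (BASE_INSTRUCTION
            + "First reason step by step about the diagnosis, then output the MCQ.\n"
            + f"Case: {case}")

def self_refine_prompts(case: str) -> tuple[str, str]:
    # Two-phase loop: a draft prompt, then a critique-and-revise prompt
    # whose {draft_mcq} placeholder is filled with the first-phase output.
    draft = zero_shot_prompt(case)
    critique = ("Critique the MCQ below for clarity, formatting, and "
                "distractor plausibility, then output a revised version.\n"
                "{draft_mcq}")
    return draft, critique
```

In a pipeline, the same case string would be passed through each builder so that only the prompting strategy varies between runs.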
4. Metrics and Evaluation
The models were evaluated on four dimensions: automatic text similarity, structural quality, efficiency, and template-specific metrics [48,49,50,51,52].
4.1. Automatic Text Similarity
We used three automatic text-similarity metrics, BLEU, ROUGE, and METEOR, to assess how closely the generated MCQs matched the reference items from the original MedMCQA dataset [49,52].
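To make these metrics concrete, the ROUGE-L component can be illustrated with a minimal pure-Python sketch. The study itself used the rouge_scorer package with stemming enabled; this simplified version tokenizes on whitespace and omits stemming.

```python
# Minimal ROUGE-L F-score: longest common subsequence over whitespace tokens.

def lcs_len(a: list[str], b: list[str]) -> int:
    # Classic O(len(a) * len(b)) dynamic-programming LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def rouge_l(reference: str, candidate: str) -> float:
    # F-score of LCS-based precision and recall; 0.0 when there is no overlap.
    r, c = reference.split(), candidate.split()
    lcs = lcs_len(r, c)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

An identical reference and candidate score 1.0; disjoint token sequences score 0.0.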
The BLEU score is computed as
$\mathrm{BLEU}(R, C) = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right)$,
where $R$ and $C$ stand for the tokenized reference and candidate texts, respectively, $p_n$ is the modified $n$-gram precision, $w_n = 1/4$ are uniform weights, and $\mathrm{BP}$ is the brevity penalty. This uses NLTK’s smoothing function (method 4) to avoid zero-precision penalties for short outputs.
The ROUGE-N score is computed as
$\mathrm{ROUGE}\text{-}N(R, C) = \frac{\sum_{g_n \in R} \mathrm{Count}_{\mathrm{match}}(g_n)}{\sum_{g_n \in R} \mathrm{Count}(g_n)}$,
where $g_n$ ranges over the $n$-grams of the reference. Specifically, we report ROUGE-1 (unigram overlap capturing lexical similarity), ROUGE-2 (bigram overlap capturing local fluency and phrasing), and ROUGE-L (longest common subsequence capturing overall structural alignment). These were implemented using the rouge_scorer with stemming enabled to reduce inflectional bias [53].
The METEOR score is computed as
$\mathrm{METEOR}(R, C) = F_{\mathrm{mean}} \cdot (1 - \mathrm{Penalty})$, with $F_{\mathrm{mean}} = \frac{10\,P\,R_u}{R_u + 9P}$,
which combines unigram precision $P$ and recall $R_u$, adjusted for synonym and stemming matches. All values were averaged over the 100 MCQs generated per model–template pair:
$\bar{S}_{m,t} = \frac{1}{100} \sum_{i=1}^{100} S^{(i)}_{m,t}$, where $S \in \{\mathrm{BLEU}, \mathrm{ROUGE}, \mathrm{METEOR}\}$.
4.2. Structural Quality
Structural quality was measured by the Parsability Rate [50], defined as the proportion of responses that successfully matched the required JSON or text schema (question + 4 options + answer):
$\mathrm{Parsability\ Rate}_{m,t} = \frac{N_{\mathrm{parsable}}}{N_{\mathrm{total}}} \times 100\%$.
A custom parsing function (parse_mcq_response) identified the question, options (A–D), and the correct answer. Outputs that did not adhere to proper structural and formatting rules were excluded from downstream text similarity metrics.
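The parsing step can be illustrated with a simplified, hypothetical re-implementation; the actual parse_mcq_response function's accepted formats and field names may differ.

```python
import re

def parse_mcq_response(text):
    # Sketch of the structural parser: extracts the question, options A-D,
    # and the answer letter; returns None when the output does not match
    # the required schema (question + 4 options + answer).
    q = re.search(r"Question:\s*(.+)", text)
    opts = re.findall(r"^([A-D])[.)]\s*(.+)$", text, flags=re.M)
    ans = re.search(r"Answer:\s*([A-D])", text)
    if not (q and ans and len(opts) == 4):
        return None
    return {"question": q.group(1).strip(),
            "options": dict(opts),
            "answer": ans.group(1)}
```

Outputs that fail this schema check would be counted against the Parsability Rate and excluded from the similarity metrics, as described above.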
4.3. Operational Metrics
Latency and cost were also evaluated for each model–template pair [51].
- Average Response Time (seconds): $\bar{T}_{m,t} = \frac{1}{N} \sum_{i=1}^{N} T^{(i)}_{m,t}$
- Cost Per Response (USD): $\mathrm{Cost}_{m,t} = \frac{\mathrm{tokens}_{\mathrm{in}}}{1000} \cdot c_{\mathrm{in}} + \frac{\mathrm{tokens}_{\mathrm{out}}}{1000} \cdot c_{\mathrm{out}}$, where $c_{\mathrm{in}}$ and $c_{\mathrm{out}}$ denote the provider’s per-1000-token input and output prices.
Token counts were estimated using the approximation 1 token ≈ 4 characters, following the usage conventions of OpenAI and HuggingFace [54].
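The estimation can be sketched as follows; the per-1000-token prices here are placeholder arguments, not the actual provider rates used in the study.

```python
def estimate_tokens(text: str) -> int:
    # Heuristic from the paper: 1 token is approximately 4 characters.
    return max(1, len(text) // 4)

def cost_per_response(prompt: str, completion: str,
                      usd_per_1k_in: float, usd_per_1k_out: float) -> float:
    # Cost = (input tokens / 1000) * input price + (output tokens / 1000) * output price.
    # The price arguments are placeholders to be filled with provider rates.
    return (estimate_tokens(prompt) / 1000 * usd_per_1k_in
            + estimate_tokens(completion) / 1000 * usd_per_1k_out)
```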
4.4. Composite Utility Metrics
To evaluate quality, cost, and efficiency simultaneously, following ref. [55] two normalized ratios were introduced:
$\mathrm{Quality\ per\ Dollar}_{m,t} = \frac{\mathrm{Quality}_{m,t}}{\max(\mathrm{Cost}_{m,t},\ 0.001)}$
$\mathrm{Quality\ per\ Second}_{m,t} = \frac{\mathrm{Quality}_{m,t}}{\max(\bar{T}_{m,t},\ 0.1)}$
In these equations, the subscripts m and t refer to the model and template type, respectively. Specifically, m indexes the nine large language models analyzed (e.g., GPT-4o, Claude 3 Opus, LLaMA-3-70B, OpenBioLLM-70B), and t denotes the prompting strategy used (Zero-Shot, Few-Shot, CoT, or Self-Refine). Thus, each metric quantifies the combined quality, cost, and efficiency of a particular model–template combination, allowing standardized comparison across systems and reasoning modes. The two metrics were scaled by capping denominators at small constants (0.001 USD, 0.1 s).
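The two capped ratios reduce to a few lines of code (a sketch following the capping constants stated above):

```python
def quality_per_dollar(quality: float, cost_usd: float) -> float:
    # Denominator capped at 0.001 USD to avoid division blow-up on near-free runs.
    return quality / max(cost_usd, 0.001)

def quality_per_second(quality: float, latency_s: float) -> float:
    # Denominator capped at 0.1 s for the same reason.
    return quality / max(latency_s, 0.1)
```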
4.5. Aggregate Success Indicators
Additional derived quantities [56,57] include:
Success Rate—percentage of non-error API responses:
$\mathrm{Success\ Rate}_{m,t} = \frac{N_{\mathrm{non\text{-}error}}}{N_{\mathrm{total}}} \times 100\%$
Composite “Value Score” (cost efficiency):
$\mathrm{Value\ Score}_{m,t} = \frac{\mathrm{ROUGE\text{-}L}_{m,t} \times \mathrm{Success\ Rate}_{m,t}}{\max(\mathrm{Cost}_{m,t},\ 0.001)}$
Composite “Time-Value Score” (speed efficiency):
$\mathrm{Time\text{-}Value\ Score}_{m,t} = \frac{\mathrm{ROUGE\text{-}L}_{m,t} \times \mathrm{Success\ Rate}_{m,t}}{\max(\bar{T}_{m,t},\ 0.1)}$
In addition to the primary quality metrics, several derived indicators were computed to represent reliability and overall efficiency. The Success Rate is the percentage of valid, non-error API responses across all model–template runs, indicating the reliability and reproducibility of the system. Cost and speed trade-offs were captured by two composite indices: the Value Score, which combines text quality (ROUGE-L) and reliability as a function of cost, and the Time-Value Score, which normalizes both quality and reliability by average response time. Collectively, these three derived metrics give a comprehensive representation of overall model efficiency, quantifying the interplay between correctness, stability (i.e., performance robustness under varying conditions), and resource utilization across the experimental conditions.
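These derived indicators can be sketched directly from the descriptions above (ROUGE-L scaled by reliability, normalized by cost or latency); the paper's exact scaling constants may differ, so treat this as an illustrative sketch.

```python
def success_rate(non_error: int, total: int) -> float:
    # Percentage of valid, non-error API responses.
    return 100.0 * non_error / total

def value_score(rouge_l: float, success_pct: float, cost_usd: float) -> float:
    # Text quality weighted by reliability, normalized by (capped) cost.
    return (rouge_l * success_pct / 100.0) / max(cost_usd, 0.001)

def time_value_score(rouge_l: float, success_pct: float, latency_s: float) -> float:
    # Same numerator, normalized by (capped) average response time.
    return (rouge_l * success_pct / 100.0) / max(latency_s, 0.1)
```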
4.6. Prompt Strategy Evaluation Matrix
To analyze the effectiveness of the various prompt engineering templates, a scoring framework was devised; the resulting matrix quantitatively measures the quality, structure, and contextual correctness of each template’s output. A hybrid scoring method on a 0–100-point scale is employed.
4.6.1. Scoring Framework Overview
The evaluation function (enhanced_template_score) integrates four major dimensions: Answer Correctness, Prompt Strategy Adherence, Format Quality, and Clinical Relevance. The highest possible score across all four dimensions is 100 points (i.e., the total score is normalized to 100). We base our method on existing methodologies for holistic clinician evaluation of LLMs [58], benchmarks for evaluating clinical reasoning [59,60], and common practice in the design of automated scoring systems [61]. Table 3 shows our detailed scoring prompt strategy.
Table 3.
Scoring prompt strategy.
4.6.2. Performance Data Generation
To measure model performance across templates, a procedure (designated template_performance_data) was constructed. Each result combined model capability and template type to obtain the following principal metrics (Table 4):
Table 4.
Performance data generation evaluation scores and description.
4.6.3. Analytical Purpose
The resulting Template Evaluation Matrix enables:
- Cross-comparison of prompt strategies within and across LLM families;
- Measurement of quality–consistency–efficiency trade-offs;
- Quantitative linking of prompt reasoning depth (e.g., CoT, Self-Refine) with response correctness and format robustness.
This matrix forms the methodological basis for analyzing the effectiveness of prompts in automated medical MCQ generation before aggregating descriptive results, such as mean scores across prompts and models.
4.7. Human Evaluation Metrics
In addition to the automated metrics, a human evaluation framework was used to qualitatively assess the educational and clinical value of the generated MCQs. The evaluation followed criteria used in medical education research for MCQ validation [4,5,62,63]. Each generated question was rated by experts in the field on a 10-point Likert scale for each dimension (0 = extremely poor, 10 = gold-standard quality). In particular, the evaluation emphasized content validity and congruence, clarity, clinical relevance, and cognitive level to ensure that the generated questions meet the expected educational standards.
A human evaluation of the single best model-prompt combination identified during the quantitative phase was completed to augment the automated metrics. Twelve of the 100 MCQs were excluded from consideration because they referenced figures or illustrations that were not available. For the remaining 88 MCQs, two senior clinical educators with 7 and 9 years of experience in high-stakes examinations, respectively, served as raters. Both raters routinely create and evaluate MCQs for summative exams and are well-versed in international standards for item creation. Given their extensive backgrounds, no additional training workshops were required; instead, the raters received written directions on the scoring rubrics and the scoring process. Table 5 summarizes the human evaluation dimensions and their corresponding scoring ranges.
Table 5.
Human evaluation dimensions, definition, and scoring range.
Each criterion contributed equally to the Overall Quality Index, so items rated highly across all dimensions received proportionally higher overall scores. This multi-dimensional human evaluation ensured that the model’s output was not only linguistically accurate but also pedagogically sound, clinically valid, and consistent with the intended educational outcomes.
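Under this equal weighting, the Overall Quality Index reduces to a simple mean of the rubric dimensions. A sketch (the dimension names here are illustrative labels for the five rubric domains):

```python
def overall_quality_index(ratings: dict[str, float]) -> float:
    # Equal weighting of the rubric dimensions (each rated 0-10),
    # yielding an overall index on the same 0-10 scale.
    return sum(ratings.values()) / len(ratings)
```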
5. Results
5.1. Quantitative Evaluation Summary
Table 6.
Evaluation scores across models, averaged over all prompts.
Table 7.
Evaluation scores for each model across different prompt types.
- Success rate and reliability:
All models achieved similar levels of API stability (>99% completion across runs), indicating highly reliable request handling and response generation (Table 6).
- Operational efficiency:
The median response times showed wide variation (Table 6). The fastest responses were observed for Mistral-Large (2.2 s) and Gemini-1.5-Pro (~2.5 s), followed by Command-A (2.8 s) and GPT-4o (3.0 s). Larger and instruction-tuned models demonstrated longer latencies, including Claude-3-Opus (4.0 s), Falcon-7B (4.4 s), BioGPT-Large (5.5 s), LLaMA-3-70B (8.5 s), and OpenBioLLM-70B (9.5 s). Cost analysis indicated a corresponding spread in cost per response, ranging from approximately USD 0.011 (Gemini-1.5-Pro) to USD 0.225 (Claude-3-Opus), with LLaMA-3-70B and OpenBioLLM-70B averaging around USD 0.073.
- Structural quality (parsability rate):
As stated in Table 4, LLaMA-3-70B achieved the highest parsability (approximately 99.8%), indicating near-perfect adherence to the predetermined JSON/MCQ schema. OpenBioLLM-70B achieved the second-highest structural completeness, at approximately 87.5%. Falcon-7B and BioGPT-Large reached close to 100% structural completeness, although their outputs showed weaker semantic validity. Mistral-Large and Gemini-1.5-Pro showed only moderate structural reliability (approximately 40–60%), although structured prompting templates (Few-Shot or CoT) markedly improved it.
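A minimal sketch of such a parsability check, assuming a simple JSON schema with `stem`, `options` (keys A–D), and `answer` fields (the exact schema used in the pipeline is not reproduced here):

```python
import json

REQUIRED_KEYS = {"stem", "options", "answer"}  # assumed schema fields

def is_parsable(raw_output: str) -> bool:
    """Return True when a model response parses as JSON and matches the
    assumed MCQ schema: a stem, exactly four options labelled A-D, and
    an answer key that names one of those options."""
    try:
        item = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not REQUIRED_KEYS <= set(item):
        return False
    options = item["options"]
    return (isinstance(options, dict)
            and set(options) == {"A", "B", "C", "D"}
            and item["answer"] in options)

def parsability_rate(outputs):
    """Percentage of responses satisfying the schema, as reported per model."""
    return 100.0 * sum(map(is_parsable, outputs)) / len(outputs)

good = '{"stem": "A 54-year-old...", "options": {"A": "x", "B": "y", "C": "z", "D": "w"}, "answer": "B"}'
bad = 'Sure! Here is your question: ...'
print(parsability_rate([good, bad]))  # 50.0
```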
- Automatic text-similarity metrics:
Results for the evaluation metrics BLEU, ROUGE-1/2/L, and METEOR (Table 6) showed that OpenBioLLM-70B and LLaMA-3-70B achieved the highest textual fidelity relative to the references. OpenBioLLM-70B scored highest (BLEU ≈ 0.50, ROUGE-L ≈ 0.696, METEOR ≈ 0.613), followed closely by LLaMA-3-70B (BLEU ≈ 0.48, ROUGE-L ≈ 0.632, METEOR ≈ 0.598).
Mid-tier models, namely GPT-4o and Claude-3-Opus, showed good overlap scores but did not match the top models, while Gemini-1.5-Pro and Mistral-Large delivered comparatively moderate quality, prioritizing speed and cost over linguistic accuracy. Command-A, Falcon-7B, and BioGPT-Large produced the lowest BLEU/ROUGE/METEOR scores; their outputs often mismatched the reference stems and distractors, indicating difficulty aligning with those components.
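The pipeline used standard BLEU, ROUGE, and METEOR implementations; as a self-contained illustration of the token-overlap principle behind these scores, a simplified ROUGE-1 F1 can be computed as:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram ROUGE-1 F1: harmonic mean of precision and recall over
    the multiset overlap of candidate and reference tokens. A simplified
    stand-in for the full BLEU/ROUGE/METEOR toolchain, for illustration."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

ref = "which of the following is the most likely diagnosis"
cand = "which of the following is the best initial diagnosis"
print(round(rouge1_f1(cand, ref), 3))  # 0.778
```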
- Cost-effectiveness and composite utilities:
When normalized for response latency and estimated cost (Table 6), the cost-adjusted composite indices (Quality-per-Dollar and Quality-per-Second) favored models that balance accuracy and efficiency. The leading configurations clustered around LLaMA-3-70B + CoT, Claude-3-Opus + Self-Refine, and OpenBioLLM-70B + CoT/Zero-Shot. Among these, LLaMA-3-70B + CoT demonstrated the best trade-off between generation quality, structural compliance, and time efficiency, making it the top performer on the cost-adjusted indices.
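These composite utilities are simple ratios of quality score to cost and to latency; a sketch using illustrative values from Table 6 (the study's exact normalization, e.g., the ×10³ scaling shown later in Figure 4b, is not reproduced here):

```python
def quality_per_dollar(quality: float, cost_usd: float) -> float:
    """Cost-adjusted utility: quality score per US dollar spent per call."""
    return quality / cost_usd

def quality_per_second(quality: float, latency_s: float) -> float:
    """Time-adjusted utility: quality score per second of response latency."""
    return quality / latency_s

# Illustrative per-call values for OpenBioLLM-70B + CoT (Table 6).
q, cost, latency = 90.4, 0.073, 9.5
print(round(quality_per_dollar(q, cost), 2))   # 1238.36
print(round(quality_per_second(q, latency), 2))  # 9.52
```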
5.2. Effect of Prompt Strategy
Prompt engineering strategies had a consistent and significant effect on the quality of the MCQs produced by all models evaluated. The distribution of the global performance metrics for the different prompting methods, the model–template interactions, and the cost-adjusted efficiencies are shown in Figure 1, Figure 2 and Figure 3. Performance improved clearly with the prompting paradigm used:
Figure 2.
Effect of prompt engineering strategies on MCQ generation quality.

Figure 3.
Interaction between model and prompt engineering strategy.
CoT > Self-Refine > Few-Shot > Zero-Shot
5.2.1. Overall Prompt Strategy Performance Analysis
- (a) Average Quality Score by Prompt Strategy Type
As illustrated in Figure 2a, a clear hierarchical trend was observed across the four prompting paradigms. CoT achieved the highest mean quality score (68.9/100), followed by Self-Refine (64.0), Few-Shot (62.5), and Zero-Shot (53.2).
The large gap in performance between the CoT and other strategies indicates the importance of explicit reasoning for improving medical MCQ generation, particularly where contextual inference and key-selection logic are needed. Self-Refine improved output clarity and formatting accuracy through iterative self-correction, while Zero-Shot produced a poorer result, indicating limitations on unguided generation for complex structured output.
- (b) Prompt Strategy Performance Distribution Across All Models
In Figure 2b, the variation in performance across models is illustrated with box-plot statistics showing interquartile ranges. The CoT and Self-Refine templates produced not only higher median results but also smaller interquartile ranges, indicating that output quality was stable and predictable across models of varying capacity. Zero-Shot performance showed the broadest spread and several low outliers, indicating unstable behavior and a higher incidence of generation errors (e.g., incomplete responses unrelated to the prompt or case, and incorrectly formatted outputs). Few-Shot templates showed moderate variance, indicating that a few in-context examples increased response reliability but did not stabilize generative behavior to the degree achieved by reasoning-based prompts.
- (c) Quality vs. Consistency Relationship
As illustrated in Figure 2c, there is a positive, linear correlation between overall quality and consistency scores. The high-capacity models (OpenBioLLM-70B and LLaMA-3-70B) delivered high-quality outputs and showed highly consistent behavior under CoT and Self-Refine prompting. The CoT (red) and Self-Refine (yellow) clusters occupied the upper-right quadrant, indicating both high quality and high consistency. Zero-Shot (green) samples clustered in the lower-left region, confirming that the lack of contextual scaffolding leads to more erratic outputs. Finally, Few-Shot (blue) points formed a middle band, confirming moderate gains in both stability and quality when a few examples were given.
- (d) Quality vs. Efficiency Comparison
The efficiency and quality metrics across templates are depicted in Figure 2d. The CoT prompts demonstrated the most balanced profile, producing high-quality output with reasonably high efficiency. Self-Refine prompting was slightly less efficient because of the additional interactions involved in its iterative reasoning, but it retained near-equal quality scores. The Few-Shot prompt achieved efficiency similar to Self-Refine but with lower accuracy, while the Zero-Shot prompt generated responses fastest yet scored lowest overall once the overhead of corrections and parsing was accounted for. This reinforces the idea that concise, well-structured reasoning traces, rather than minimal prompts or long, unfocused text, are key to cost- and time-effective generation.
Overall, these analyses show that reasoning-oriented prompting (particularly CoT and Self-Refine) yields better and more consistent performance in medical case-based MCQ generation in both linguistic and structural terms. Few-Shot offers a practical middle ground for mid-tier models, while Zero-Shot tends to produce lower-quality outputs.
5.2.2. Model–Prompt Interaction Analysis
- (a) Model–Template Performance Heatmap
In Figure 3a, the performance heatmap (scores 0–100) shows distinct interaction patterns between model families and prompting strategies. OpenBioLLM-70B with the CoT template scored highest overall (90.4), followed by LLaMA-3-70B (84.1, CoT) and Gemini-1.5-Pro (79.1, Self-Refine). Mid-range performers, such as GPT-4o and Claude-3-Opus, achieved moderate scores (≈74–77) across multiple templates, while smaller architectures such as Mistral-Large, Command-A, and Falcon-7B lagged substantially, especially under Zero-Shot (scores < 45). Across all models, CoT produced the highest average quality (68.9), reflecting its stronger support for structured responses and correct answer-key selection, whereas Zero-Shot produced the lowest scores owing to the absence of guiding cues and incomplete reasoning behind the answers given.
- (b) Best Prompt Strategy Template per Model
The scores reported in Figure 3b reflect the best-performing prompt for each model and show that each model reached its maximum with a reasoning- or refinement-oriented template. The highest-capacity models, OpenBioLLM-70B (90.4) and LLaMA-3-70B (84.1), performed best with CoT, indicating a distinct advantage of CoT for long-context architectures. Gemini-1.5-Pro (79.1) and BioGPT-Large (78.5) performed best with Self-Refine, which produced syntactically correct outputs through iterative self-correction. GPT-4o (76.8) and Command-A (63.3) achieved their best results with Few-Shot, where sample-based structures stabilized the output format. The smaller models, Mistral-Large (51.7) and Falcon-7B (33.8), gained only limited improvement, although CoT slightly improved their logical coherence relative to unguided prompting.
- (c) Prompt Strategy Template Effectiveness by Model Group
Figure 3c aggregates the average template performance over the four capability tiers (Basic, Mid-Range, Strong, and Top Performers). The stratified pattern indicates how the advantages of structured prompting scale. Among the Top Performers, CoT averaged 87.2, beating Self-Refine (77.8) and Few-Shot (73.8), while Zero-Shot did worst at 70.2. Within the Strong models, Self-Refine (72.3) provided the most consistent improvements, reinforcing the role of iterative self-correction in medium-capacity architectures. In the Mid-Range and Basic models, the performance gaps between templates were small (<5 points), indicating that their limited reasoning depth prevents them from fully exploiting structured prompting.
This implies that CoT prompting is likely the most effective way to scale large biomedical LLMs for producing high-quality MCQ items, working particularly well when the model is guided through explicit reasoning steps. Self-Refine offers a useful alternative for models with limited context windows, helping them reach deeper reasoning levels and thereby improving both structural and semantic quality. Few-Shot prompting remains a reasonable compromise when in-context examples are available, whereas Zero-Shot is consistently disadvantaged because it cannot anchor generation in the specifics of the task. This stepwise progression also parallels educational theory on cognitive-process questions, in which structured reasoning and reflective refinement resemble the cues that expert human item writers rely on.
5.2.3. Cost-Effectiveness and Time-Efficiency Analysis
- (a) Cost vs. Quality Relationships
As shown in Figure 4a, the scatter distribution of cost per call against quality score reveals no correlation trend: higher-quality outputs are not necessarily associated with higher costs. In this plot, each color represents a prompting strategy, and each point within a color corresponds to a different model using that strategy. CoT and Self-Refine templates clustered in the upper-left region of the plot, reflecting strong quality at moderate cost, while Zero-Shot and Few-Shot prompts occupied lower regions owing to weaker output quality and higher correction overhead. This pattern highlights the cost-efficiency of structured reasoning frameworks, where incremental processing time is compensated by superior text quality, interpretability, and reusability of the generated MCQs. On average, the most cost-effective prompts yielded over 2.5× higher quality-per-dollar ratios than unguided generations. Here, cost is measured in milli-dollars (1 milli-dollar = 0.001 USD).

Figure 4.
Trade-offs between quality, cost, and latency across prompt strategies and models.
- (b) Top 10 Most Cost-Effective Model–Prompt Strategy Template Combinations
Figure 4b ranks the top ten configurations by quality score per U.S. dollar (×103), consolidating the findings across both performance and cost metrics.
The leading combinations were dominated by reasoning-oriented templates:
- LLaMA-3-70B + CoT (504);
- Claude-3-Opus + Self-Refine (415);
- OpenBioLLM-70B + Zero-Shot (358).
Interestingly, OpenBioLLM-70B achieved a favorable balance between high reasoning quality and low API cost, while Claude-3-Opus demonstrated strong cost-adjusted returns under Self-Refine prompting. Mid-tier systems, such as Command-A (220, Few-Shot) and Gemini-1.5-Pro (195, Self-Refine), also performed competitively, suggesting that structured prompting mitigates cost–performance disparities among differently sized models.
- (c) Time Efficiency by Prompt Strategy
The Quality-Per-Second analysis (Figure 4c) quantifies temporal efficiency. CoT again ranked first (39.0), substantially exceeding Self-Refine (23.3), Few-Shot (22.4), and Zero-Shot (16.9). Despite CoT’s longer reasoning traces, its high-quality outputs yield the best overall efficiency ratio once normalized by completion time.
The results imply that structured reasoning reduces the need for regeneration or post-editing, thus improving end-to-end workflow efficiency. Self-Refine templates, although slower, produce stable text quality that works well for semi-automated MCQ workflows where slower review times are acceptable.
- (d) Prompt Strategy recommendations by use case
The recommendation table (Table 8) distills this section’s findings into practical guidance for educational and clinical AI applications. For high-accuracy tasks, OpenBioLLM-70B + CoT (90.4), LLaMA-3-70B + CoT (84.1), and BioGPT-Large + Self-Refine (78.5) proved most effective. For cost-sensitive or time-critical situations, Claude-3-Opus + Self-Refine (70.5) and OpenBioLLM-70B + Zero-Shot (64.0) provided the best balance. This highlights that the right prompt strategy depends on the available computational resources and the educational or operational goals of the MCQ generation system.
Table 8.
Prompt Strategy recommendation by model.
In conclusion, CoT prompts achieve the highest utility once normalized for cost and time, confirming their place as the strongest overall candidate for generating high-fidelity, reasoning-based educational resources. Self-Refine prompts represent a strong second-best alternative that works particularly well with mid-capacity biomedical models, where the balance between output clarity, structure, and cost appears optimal. Few-Shot is a viable lightweight option for very low-cost generation, while Zero-Shot remains inefficient for these highly specialized reasoning tasks because of its instability and subsequent editing requirements. Overall, triangulating quality, cost, and time supports the finding that structured, iterative prompting produces the best trade-off among the three in automated medical MCQ development, alongside gains in resource efficiency.
5.2.4. Comparative Insights and Educational Implications
In summary, the comparative analysis across prompt strategies shows that structured reasoning (CoT) and iterative self-correction (Self-Refine) substantially improved the quality, consistency, and clinical validity of automatically generated medical case-based MCQs. These techniques not only achieved improved textual similarity and structural accuracy but also produced more consistent question–answer congruence across models of differing capability. From an educational design perspective, this mirrors the reasoning processes of human experts in question formulation, where explicit diagnostic justification and reflective review lead to high pedagogical quality. In contrast, Few-Shot prompting yields moderate benefits in both efficiency and response stability, making it a useful alternative for mid-range models or time-constrained applications. Zero-Shot prompting, while computationally cheaper, consistently shows deficits in diagnostic accuracy, distractor relevance, and alignment with Bloom’s taxonomy, all of which indicate limited applicability for professional medical assessment. Taken together, these results indicate that reasoning-augmented prompting paradigms, especially CoT and Self-Refine, are the most effective routes to LLM outputs aligned with human-written educational standards. The performance advantage extends beyond verbal fidelity to content validity, cognitive congruence, and cost-effectiveness, and therefore offers an effective way to scale evidence-supported automation in medical question generation systems.
5.3. Human Evaluation Results
Alongside the automated text-based metrics, a human validation study was performed on the single best-performing model–prompt combination, OpenBioLLM-70B with CoT prompting. This model–template pair received the highest overall quality score in the quantitative evaluation phase and consistently outperformed all other combinations in linguistic fidelity, clinical reasoning quality, and template robustness. Restricting the human validation to this single configuration ensured that expert time was focused on the most pedagogically meaningful output and that the evaluation represented the best-case scenario for the quality of LLM-generated MCQs. Each MCQ was scored on a 0–10 scale across five widely used educational quality dimensions:
- (1) Appropriateness of the question;
- (2) Clarity and specificity;
- (3) Relevance to clinical context;
- (4) Quality of alternatives (distractors);
- (5) Cognitive level (Bloom’s taxonomy).
- Expert-Level Scoring Trends
A consistent rating difference was observed between the two evaluators, as shown in Table 9.
Table 9.
Human expert evaluation of MCQs generated using OpenBioLLM-70B with Chain-of-Thought prompting (n = 88).
- Expert 1 (Higher Rater—a senior academic with 9 years’ experience in high-stakes medical examination construction and vetting) provided generally more favorable assessments, with average scores across metrics ranging from 7.47 to 8.69, and a total composite score of approximately 39.57/50.
- Expert 2 (Stricter Rater—a senior consultant and chair of an exam committee in a national residency program, with 7 years’ experience in exam blueprinting, item writing, and quality assurance) displayed more conservative scoring behavior, with mean scores ranging from 6.11 to 6.48, and a total composite score of 31.52/50.
Though there was a difference in the overall rating scores, both raters demonstrated similar rank order across the five criteria, indicating strong agreement with respect to the generated items’ relative strengths and weaknesses.
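One way this rank-order agreement could be quantified (not reported in the study) is a Spearman rank correlation over the raters' per-criterion means; the rating values below are hypothetical:

```python
def rankdata(values):
    """Ranks with ties averaged (1 = smallest value)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical per-criterion mean scores for the two raters.
expert1 = [8.1, 7.9, 8.7, 7.5, 7.8]
expert2 = [6.3, 6.2, 6.5, 6.1, 6.2]
print(round(spearman_rho(expert1, expert2), 3))  # 0.975
```

A value near 1 indicates that, despite the level difference, the two raters order the criteria almost identically.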
- Metric-Specific Findings
For both raters, relevance to clinical context received the highest average combined rating (7.57), indicating that OpenBioLLM-70B + CoT preserved clinical meaning relevant to the diagnostic, investigative, and management decisions a clinician would make in practice. The Appropriateness, Clarity, and Cognitive level ratings (7.25–7.35) showed that the generated items were structurally coherent within the MCQ stem and demonstrated moderately strong reasoning alignment with Bloom’s taxonomy. Both raters assigned the lowest rating to “Quality of alternatives (distractors)” (6.79), indicating that the plausibility and discriminatory value of distractors remain the primary area for improvement, even under the best configuration. This concern aligns with a recognized issue in the literature on automated MCQ generation, where distractors must reflect precise medical nuance and require discipline-specific reasoning.
- Observations and Implications
Comparing their scores, Expert 2 rated every item lower than Expert 1, a consistent difference in stringency. Because both raters produced similar rank orders across the criteria, this pattern appears to reflect a difference in scoring style rather than in judgment of content.
Future studies could consider a score normalization approach (e.g., z-scores) to control for rater-specific scale differences before calculating aggregate scores. From a content development perspective, because both raters identified distractor quality as the weakest dimension, future generation pipelines should prioritize distractor plausibility and clinical validity, balanced against cognitive challenge, before items are presented for review.
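A minimal sketch of the suggested per-rater z-score normalization (population standard deviation; the item scores below are hypothetical, not the study's data):

```python
def zscore_by_rater(scores):
    """Standardize one rater's scores to mean 0, SD 1, so that a strict
    and a lenient rater become directly comparable before aggregation."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return [(s - mean) / sd for s in scores]

# Hypothetical item scores from a lenient and a strict rater.
lenient = [8.0, 7.5, 9.0, 6.5]
strict = [6.5, 5.5, 7.5, 5.0]
z1, z2 = zscore_by_rater(lenient), zscore_by_rater(strict)
pooled = [(a + b) / 2 for a, b in zip(z1, z2)]  # rater-adjusted item scores
print([round(z, 2) for z in pooled])  # [0.33, -0.46, 1.41, -1.28]
```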
Overall, the human validation confirmed that the highest-rated configuration (OpenBioLLM-70B + CoT) produces MCQs of reasonably high quality, with clinical relevance, structural clarity, and strong cognitive alignment, while again identifying distractor construction as the primary area requiring improvement. This focused evaluation thus establishes a reliable upper bound on the quality of items the current pipeline can produce.
6. Limitations
The findings of this study should be interpreted with caution because of several limitations. First, although nine LLMs and four prompting techniques were assessed, the expert validation step evaluated only the single highest-performing model–prompt combination (OpenBioLLM-70B + CoT). This restriction made efficient use of expert time by focusing on the most clinically relevant results, but it precluded direct comparison of human ratings across all model–prompt combinations. Second, the expert review was performed by two raters with different baseline scoring patterns, indicating systematic differences in their stringency. Summary averages were provided, but future analyses could also include per-rater normalization of the ratings and inter-rater reliability calculations. Third, the automated metrics used in this study (BLEU, ROUGE, and METEOR) provide only a limited proxy for clinical reasoning quality, as they primarily measure surface-level text similarity. In particular, because these metrics do not consider whether distractor choices are plausible, qualitatively different but valid answer choices could not be compared. The metrics enabled objective and reproducible comparisons across models but were inadequate for measuring educational validity, which was therefore assessed through human expert evaluation. Finally, the evaluations were based on case vignettes collected from a single institutional dataset and may not be representative of other large clinical training programs or specialty-specific test items.
7. Future Work
Future studies could extend human validation to multiple model–prompt combinations, using expert assessment to establish validity across the full performance range. Such studies should recruit a larger sample of clinical educators and use standardized psychometric measures (e.g., discrimination index, distractor efficiency, cognitive level mapping) to increase generalizability and allow more precise benchmarking. In addition, future studies will incorporate cross-evaluation, whereby multiple independent clinical educators assess MCQs generated across different model–prompt combinations, to reduce evaluator bias and strengthen robustness and generalizability. Studies should also integrate retrieval-augmented generation (RAG) over domain sources (e.g., UpToDate, PubMed) to improve evidence-based distractor generation. A promising direction is to investigate multi-stage pipelines in which one LLM generates the MCQ, another model critiques or refines it, and a final stage applies quality filters before expert review. Longitudinal studies could also examine how LLM-generated items perform when deployed in real student assessments. Finally, we plan to extend the proposed evaluation framework so that newly released LLMs, once sufficiently documented and reproducible, can be systematically compared with currently available models using the same methodology.
8. Conclusions
This study, a multi-axis evaluation of nine LLMs and four prompting strategies, establishes that prompting strategy and model selection affect the quality of generated assessment items, as measured by textual fidelity metrics, structural parsability, cost–latency profiling, and expert validation. CoT prompting produced the best results for coherence, alignment with clinical reasoning, and parsability. The medical model OpenBioLLM-70B with CoT had the highest performance of all evaluated models, with optimal levels of language quality, clinical relevance, and operational efficiency. The expert reviewers’ assessment of this CoT configuration established a strong evidence base for its educational validity; however, distractor quality remains the primary area requiring further improvement, and the variability between expert reviewers shows the need for an established rubric and calibration process for human raters. In conclusion, we demonstrate that state-of-the-art LLMs, when guided by clean, structured prompting, can reliably generate medical questions and provide a cost-effective, scalable route to producing them from credentialed sources. These findings lay the groundwork for an expandable, reproducible evaluation framework that gives institutions actionable recommendations for integrating LLM-assisted MCQ generation into medical education and assessment workflows.
Author Contributions
Writing—original draft preparation, S.A.S.; review and editing, A.A.A. and A.A. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The MedMCQA dataset is publicly available at https://github.com/MedMCQA/MedMCQA (accessed on 22 June 2025). Our generated MCQs, prompting templates, and evaluation scripts are available from the corresponding author upon reasonable request.
Acknowledgments
This study would not have been possible without the invaluable contribution of Maha Al-Jabri. Her expert evaluation of the generated MCQs provided critical clinical insight, which enhanced the educational validity and rigor of the human validation component of the study.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| MCQ | Multiple Choice Question |
| CB | Case-Based |
| LLM | Large Language Model |
References
- Lee, H.Y.; Yune, S.J.; Lee, S.Y.; Im, S.; Kam, B.S. The impact of repeated item development training on the prediction of medical faculty members’ item difficulty index. BMC Med. Educ. 2024, 24, 599. [Google Scholar] [CrossRef]
- Law, A.K.; So, J.; Lui, C.T.; Choi, Y.F.; Cheung, K.H.; Kei-ching Hung, K.; Graham, C.A. AI versus human-generated multiple-choice questions for medical education: A cohort study in a high-stakes examination. BMC Med. Educ. 2025, 25, 208. [Google Scholar] [CrossRef] [PubMed]
- Ch’en, P.Y.; Day, W.; Pekson, R.C.; Barrientos, J.; Burton, W.B.; Ludwig, A.B.; Jariwala, S.P.; Cassese, T. GPT-4 generated answer rationales to multiple choice assessment questions in undergraduate medical education. BMC Med. Educ. 2025, 25, 333. [Google Scholar] [CrossRef] [PubMed]
- Laupichler, M.C.; Rother, J.F.; Grunwald Kadow, I.C.; Ahmadi, S.; Raupach, T. Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions. Acad. Med. 2024, 99, 508–512. [Google Scholar] [CrossRef] [PubMed]
- Cheung, B.H.H.; Lau, G.K.K.; Wong, G.T.C.; Lee, E.Y.P.; Kulkarni, D.; Seow, C.S.; Wong, R.; Co, M.T.-H. ChatGPT versus human in generating medical graduate exam multiple choice questions—A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom). PLoS ONE 2023, 18, e0290691. [Google Scholar] [CrossRef]
- Schulhoff, S.; Ilie, M.; Balepur, N.; Kahadze, K.; Liu, A.; Si, C.; Li, Y.; Gupta, A.; Han, H.; Schulhoff, S.; et al. The Prompt Report: A Systematic Survey of Prompt Engineering Techniques. arXiv 2025. [Google Scholar] [CrossRef]
- Yao, Z.; Parashar, A.; Zhou, H.; Jang, W.S.; Ouyang, F.; Yang, Z.; Yu, H. MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback. arXiv 2024. [Google Scholar] [CrossRef]
- Liu, F.; AlDahoul, N.; Eady, G.; Zaki, Y.; Rahwan, T. Self-Reflection Makes Large Language Models Safer, Less Biased, and Ideologically Neutral. arXiv 2024. [Google Scholar] [CrossRef]
- Palmer, E.J.; Devitt, P.G. Assessment of higher order cognitive skills in undergraduate education: Modified essay or multiple choice questions? Research paper. BMC Med. Educ. 2007, 7, 49. [Google Scholar] [CrossRef]
- Thistlethwaite, J.E.; Davies, D.; Ekeocha, S.; Kidd, J.M.; MacDougall, C.; Matthews, P.; Purkis, J.; Clay, D. The effectiveness of case-based learning in health professional education. A BEME systematic review: BEME Guide No. 23. Med. Teach. 2012, 34, e421–e444. [Google Scholar] [CrossRef]
- Kurdi, G.R. Generation and Mining of Medical, Case-Based Multiple Choice Questions. Ph.D. Thesis, The University of Manchester, Manchester, UK, 2020. [Google Scholar]
- Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Hou, L.; Clark, K.; Pfohl, S.; Cole-Lewis, H.; Neal, D.; et al. Towards Expert-Level Medical Question Answering with Large Language Models. arXiv 2023. [Google Scholar] [CrossRef]
- Fleming, S.L.; Morse, K.; Kumar, A.; Chiang, C.-C.; Patel, B.; Brunskill, E.; Shah, N. Assessing the Potential of USMLE-Like Exam Questions Generated by GPT-4. medRxiv 2023. [Google Scholar] [CrossRef]
- Nori, H.; King, N.; McKinney, S.M.; Carignan, D.; Horvitz, E. Capabilities of GPT-4 on Medical Challenge Problems. arXiv 2023. [Google Scholar] [CrossRef]
- Garber, M.; Feng, H.; Ronzano, F.; LaFleur, J.; De Oliveira, R.; Rough, K.; Roth, K.; Nanavati, J.; Zine El Abidine, K.; Mack, C. Evaluation of Large Language Model Performance on the Biomedical Language Understanding and Reasoning Benchmark: Comparative Study. JMIR 2024. preprints. [Google Scholar] [CrossRef]
- Van Uhm, J.; Van Haelst, M.M.; Jansen, P.R. AI-Powered Test Question Generation in Medical Education: The DailyMed Approach. medRxiv 2024. [Google Scholar] [CrossRef]
- Jeong, M.; Sohn, J.; Sung, M.; Kang, J. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics 2024, 40, i119–i129. [Google Scholar] [CrossRef] [PubMed]
- Dorfner, F.J.; Dada, A.; Busch, F.; Makowski, M.R.; Han, T.; Truhn, D.; Kleesiek, J.; Sushil, M.; Lammert, J.; Adams, L.C.; et al. Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical Data. arXiv 2024. [Google Scholar] [CrossRef]
- Maharjan, J.; Garikipati, A.; Singh, N.P.; Cyrus, L.; Sharma, M.; Ciobanu, M.; Barnes, G.; Thapa, R.; Mao, Q.; Das, R. OpenMedLM: Prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models. Sci. Rep. 2024, 14, 14156. [Google Scholar] [CrossRef]
- Grévisse, C.; Pavlou, M.A.S.; Schneider, J.G. Docimological Quality Analysis of LLM-Generated Multiple Choice Questions in Computer Science and Medicine. SN Comput. Sci. 2024, 5, 636. [Google Scholar] [CrossRef]
- Artsi, Y.; Sorin, V.; Konen, E.; Glicksberg, B.S.; Nadkarni, G.; Klang, E. Large language models for generating medical examinations: Systematic review. BMC Med. Educ. 2024, 24, 354. [Google Scholar] [CrossRef]
- Zhu, Y.; Tang, W.; Yang, H.; Niu, J.; Dou, L.; Gu, Y.; Wu, Y.; Zhang, W.; Sun, Y.; Yang, X. The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams. arXiv 2025. [Google Scholar] [CrossRef]
- Bedi, S.; Fleming, S.L.; Chiang, C.-C.; Morse, K.; Kumar, A.; Patel, B.; Jindal, J.A.; Davenport, C.; Yamaguchi, C.; Shah, N.H. QUEST-AI: A System for Question Generation, Verification, and Refinement using AI for USMLE-Style Exams. In Biocomputing 2025; World Scientific: Kohala Coast, HI, USA, 2024; pp. 54–69. [Google Scholar] [CrossRef]
- Shahriar, S.; Dara, R.; Akalu, R. A comprehensive review of current trends, challenges, and opportunities in text data privacy. Comput. Secur. 2025, 151, 104358. [Google Scholar] [CrossRef]
- Shahriar, S.; Dara, R. Priv-IQ: A Benchmark and Comparative Evaluation of Large Multimodal Models on Privacy Competencies. AI 2025, 6, 29. [Google Scholar] [CrossRef]
- Elzayyat, M.; Mohammad, J.N.; Zaqout, S. Assessing LLM-generated vs. expert-created clinical anatomy MCQs: A student perception-based comparative study in medical education. Med. Educ. Online 2025, 30, 2554678. [Google Scholar] [CrossRef] [PubMed]
- Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. arXiv 2023, arXiv:2205.11916. [Google Scholar] [CrossRef]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020. [Google Scholar] [CrossRef]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2023. [Google Scholar] [CrossRef]
- Miao, J.; Thongprayoon, C.; Suppadungsuk, S.; Krisanapan, P.; Radhakrishnan, Y.; Cheungpasitporn, W. Chain of Thought Utilization in Large Language Models and Application in Nephrology. Medicina 2024, 60, 148. [Google Scholar] [CrossRef]
- Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. Self-Refine: Iterative Refinement with Self-Feedback. arXiv 2023. [Google Scholar] [CrossRef]
- Yue, M.; Yao, W.; Mi, H.; Yu, D.; Yao, Z.; Yu, D. DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search. arXiv 2025. [Google Scholar] [CrossRef]
- Kıyak, Y.S.; Emekli, E. ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: A literature review. Postgrad. Med. J. 2024, 100, 858–865. [Google Scholar] [CrossRef]
- Saad, M.; Almasri, W.; Hye, T.; Roni, M.; Mohiyeddini, C. Analysis of ChatGPT-3.5’s Potential in Generating NBME-Standard Pharmacology Questions: What Can Be Improved? Algorithms 2024, 17, 469. [Google Scholar] [CrossRef]
- Kıyak, Y.S.; Kononowicz, A.A. Using a Hybrid of AI and Template-Based Method in Automatic Item Generation to Create Multiple-Choice Questions in Medical Education: Hybrid AIG. JMIR Form. Res. 2025, 9, e65726. [Google Scholar] [CrossRef] [PubMed]
- Pal, A.; Umapathi, L.K.; Sankarasubbu, M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. arXiv 2022. [Google Scholar] [CrossRef]
- Kipp, M. From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance. Information 2024, 15, 543. [Google Scholar] [CrossRef]
- Sonoda, Y.; Kurokawa, R.; Nakamura, Y.; Kanzawa, J.; Kurokawa, M.; Ohizumi, Y.; Gonoi, W.; Abe, O. Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in “Diagnosis Please” cases. Jpn. J. Radiol. 2024, 42, 1231–1235. [Google Scholar] [CrossRef]
- Franzen, C. Cohere Targets Global Enterprises with New Highly Multilingual Command A Model Requiring Only 2 GPUs. VentureBeat. Available online: https://venturebeat.com/ai/cohere-targets-global-enterprises-with-new-highly-multilingual-command-a-model-requiring-only-2-gpus (accessed on 27 October 2025).
- Cohere. Command A: An Enterprise-Ready Large Language Model. Technical Report. Available online: https://cohere.com/research/papers/command-a-technical-report.pdf (accessed on 26 April 2025).
- Singh, P. Llama 3.3 70B Is Here! 25x Cheaper than GPT-4o. Analytics Vidhya. Available online: https://www.analyticsvidhya.com/blog/2024/12/meta-llama-3-3-70b/ (accessed on 27 October 2025).
- Oketch, K.; Lalor, J.P.; Yang, Y.; Abbasi, A. Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring. arXiv 2025. [Google Scholar] [CrossRef]
- Jahan, I.; Laskar, M.T.R.; Peng, C.; Huang, J. Evaluating the Effectiveness of Cost-Efficient Large Language Models in Benchmark Biomedical Tasks. arXiv 2025. [Google Scholar] [CrossRef]
- Zhang, G.; Jin, Q.; Zhou, Y.; Wang, S.; Idnay, B.; Luo, Y.; Park, E.; Nestor, J.G.; Spotnitz, M.E.; Soroush, A.; et al. Closing the gap between open source and commercial large language models for medical evidence summarization. Npj Digit. Med. 2024, 7, 239. [Google Scholar] [CrossRef]
- Pan, G.; Wang, H. A Cost-Benefit Analysis of On-Premise Large Language Model Deployment: Breaking Even with Commercial LLM Services. arXiv 2025. [Google Scholar] [CrossRef]
- Luo, R.; Sun, L.; Xia, Y.; Qin, T.; Zhang, S.; Poon, H.; Liu, T.-Y. BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining. Brief. Bioinform. 2022, 23, bbac409. [Google Scholar] [CrossRef]
- Dorfner, F.J.; Dada, A.; Busch, F.; Makowski, M.R.; Han, T.; Truhn, D.; Kleesiek, J.; Sushil, M.; Adams, L.C.; Bressem, K.K. Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks. J. Am. Med. Inform. Assoc. 2025, 32, 1015–1024. [Google Scholar] [CrossRef] [PubMed]
- OpenAI. Optimizing LLM Accuracy: Context, Prompts, Sampling. Available online: https://platform.openai.com/docs/guides/optimizing-llm-accuracy (accessed on 28 April 2025).
- Mansuy, R. Evaluating NLP Models: A Comprehensive Guide to ROUGE, BLEU, METEOR, and BERTScore Metrics. Plainenglish. Available online: https://plainenglish.io/blog/evaluating-nlp-models-a-comprehensive-guide-to-rouge-bleu-meteor-and-bertscore-metrics-d0f1b1 (accessed on 11 November 2025).
- Li, Z.; Guo, W.; Gao, Y.; Yang, D.; Kang, L. A Large Language Model-Based Approach for Data Lineage Parsing. Electronics 2025, 14, 1762. [Google Scholar] [CrossRef]
- Baran, K. Understanding the Cost Economics of GenAI Systems: A Comprehensive Guide. Medium. Available online: https://medium.com/%40AI-on-Databricks/understanding-the-cost-economics-of-genai-systems-a-comprehensive-guide-24e3d4f22e4f (accessed on 11 November 2025).
- Al Shuraiqi, S.; Aal Abdulsalam, A.; Masters, K.; Zidoum, H.; AlZaabi, A. Automatic Generation of Medical Case-Based Multiple-Choice Questions (MCQs): A Review of Methodologies, Applications, Evaluation, and Future Directions. Big Data Cogn. Comput. 2024, 8, 139. [Google Scholar] [CrossRef]
- Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, 25–26 July 2004. [Google Scholar]
- Tănase, A.-V.; Pelican, E. SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance. arXiv 2025. [Google Scholar] [CrossRef]
- Sun, W.; Wang, J.; Guo, Q.; Li, Z.; Wang, W.; Hai, R. CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines. arXiv 2025. [Google Scholar] [CrossRef]
- Zhao, T.; Wei, M.; Preston, J.S.; Poon, H. Pareto Optimal Learning for Estimating Large Language Model Errors. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; Long Papers; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; Volume 1, pp. 10513–10529. [Google Scholar]
- Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; Casas, D.d.L.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training Compute-Optimal Large Language Models. arXiv 2022. [Google Scholar] [CrossRef]
- Liang, P.; Bommasani, R.; Lee, T.; Tsipras, D.; Soylu, D.; Yasunaga, M.; Zhang, Y.; Narayanan, D.; Wu, Y.; Kumar, A.; et al. Holistic Evaluation of Language Models. arXiv 2023. [Google Scholar] [CrossRef]
- Jin, D.; Pan, E.; Oufattole, N.; Weng, W.-H.; Fang, H.; Szolovits, P. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. arXiv 2020. [Google Scholar] [CrossRef]
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef]
- Zheng, Z.; Fiore, A.M.; Westervelt, D.M.; Milly, G.P.; Goldsmith, J.; Karambelas, A.; Curci, G.; Randles, C.A.; Paiva, A.R.; Wang, C.; et al. Automated Machine Learning to Evaluate the Information Content of Tropospheric Trace Gas Columns for Fine Particle Estimates Over India: A Modeling Testbed. J. Adv. Model. Earth Syst. 2023, 15, e2022MS003099. [Google Scholar] [CrossRef]
- Adams, N.E. Bloom’s taxonomy of cognitive learning objectives. J. Med. Libr. Assoc. 2015, 103, 152–153. [Google Scholar] [CrossRef]
- Al Shuriaqi, S.; Aal Abdulsalam, A.; Masters, K. Generation of Medical Case-Based Multiple-Choice Questions. Int. Med. Educ. 2023, 3, 12–22. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.