Article

Identifying Features of LLM-Resistant Exam Questions: Insights from Artificial Intelligence (AI)–Student Performance Comparisons

Department of Organic Chemistry and Pharmacognosy, Faculty of Chemistry and Pharmacy, Sofia University “St. Kliment Ohridski”, 1164 Sofia, Bulgaria
*
Author to whom correspondence should be addressed.
Sci 2025, 7(4), 183; https://doi.org/10.3390/sci7040183
Submission received: 23 October 2025 / Revised: 3 December 2025 / Accepted: 10 December 2025 / Published: 12 December 2025

Abstract

Large language models (LLMs) are rapidly being explored as tools to support learning and assessment in health science education, yet their performance across discipline-specific evaluations remains underexamined. This study evaluated the accuracy of two prominent LLMs on university-level pharmacognosy examinations and compared their performance to that of pharmacy students. Authentic exam papers comprising a range of question formats and content categories were administered to ChatGPT and DeepSeek using a structured prompting approach. Student data were anonymized, LLM responses were graded using the same marking criteria applied to student cohorts, and a Monte Carlo simulation was conducted to determine whether observed performance differences were statistically meaningful. Facility Index (FI) values were calculated to contextualize item difficulty and identify where LLM performance aligned with or diverged from student outcomes. The models demonstrated variable accuracy across question types, with stronger performance in recall-based and definition-style items and comparatively weaker outputs for applied or interpretive questions. Simulated comparisons showed that LLM performance did not uniformly exceed or fall below that of students, indicating dimension-specific strengths and constraints. These findings suggest that LLM-resistant examination design is contingent on question structure and content, and that further research should refine the integration of LLMs into pharmacy education.

1. Introduction

The rapid integration of generative AI and computer vision continues to transform higher education, posing new opportunities and challenges for educators and students in various fields. Pharmacy education has not remained unaffected [1]. While such tools have proved beneficial in exam generation [2,3] and grading [4], their accessibility and potential for creative misuse [5], together with generally increasing acceptance [6] and higher student confidence [7] in AI, place emphasis on large language model-resistant designs in the context of the academic integrity of student assessments [8]. Several studies have already investigated the performance of LLMs in an examination setting [9], including in pharmacy exams [10] and assignments [11]. Research into constructing LLM-resistant exams suggests approaches such as leveraging real-world scenarios outside of the available training data, non-textual elements and deliberate distractors to counteract AI model weaknesses [12]; however, this can be difficult to achieve in empirical and applied sciences, especially when utilizing an evidence-based teaching approach, which relies on open access to published data [13]. ChatGPT models are reported as being preferred and frequently used by students [14]; however, the landscape of leveraging particular LLMs in academic pursuits is likely to shift continuously as new generative AI models emerge and existing models improve. DeepSeek is one such model, which has already demonstrated adequate performance in various fields, including gastroenterology [15]. Computer vision tools like Google Lens appear to be utilized primarily for the identification of botanical subjects [16,17]. Pl@ntNet and other more specialized computer vision tools show variance in performance and, in some cases, high accuracy for macroscopic images [18] and are a form of participatory botanical observation. Materials used in pharmacognostic study, however, are morphologically distinct from typically utilized training data.
Another important consideration is that pharmacy education requires the development of various competencies in diverse, interconnected, specialized domains [19]. These competencies are not perceived as being equally important by students [20], educators and practitioners [21]. Thus, it seems prudent to investigate how LLMs perform in answering questions associated with the assessment of the development of highly ranked competencies. Competency in pharmacognosy is ranked >60% by community pharmacists [21], as the initial scope of the science has expanded past botanically describing subjects and evolved to include a much more in-depth exploration of the quality, safety and efficacy of herbal substances, preparations and medicinal and borderline products [22]. These include but are not limited to knowledge of traditional medicine systems, evidence-based traditional and well-established use, clinically relevant product interactions, contemporary bioprospecting and drug discovery and substance quality control methods [23]. This posits pharmacognosy-related competencies as crucial for pharmacy practitioners and inherently interdisciplinary, but raises questions about the potential long-term risks of student academic misconduct during assessments. Furthermore, some natural resources, products and traditional practices are region specific [24]. This may influence the availability of the published open access data LLMs and computer vision algorithms are trained on. No less important is the propensity of generative AI to hallucinate or put forth biased or false responses [25], which can be further exacerbated by existing knowledge gaps [26]. So far, no studies have been carried out with the express objective of evaluating LLMs and CV performance in pharmacognosy examinations, nor have they compared their performance to that of students in terms of cognitive dimensions.
The utilization of online learning platforms like Moodle [27] further compounds the issue by providing the opportunity for digital assessment of students’ skills without the direct oversight of an examiner. Moodle provides functionality for examination design with integrated tools. These include, but are not limited to, different question types (such as multiple choice, true or false, cloze, essay, etc.), the ability to group questions into categories, randomize the questions selected for each student per category in a single exam and shuffle question order in an exam [28]. While some publications suggest that LLMs perform well with certain multiple-choice questions [29], others dispute their effectiveness, particularly for medical questions [30]. Similarly, the effects of question type on LLMs’ performance in pharmacognosy examinations have not been explored. In addition to exam design functionality, Moodle has built-in quiz report statistics. These are intended to help examiners determine how well the quiz is helping students and identify faulty or overly difficult or easy questions [31]. The power of metrics derived from student attempts, like the facility index (the mean score of students on the question or category) or other psychometrics such as the discrimination index (the correlation between the weighted scores on the question and those on the rest of the examination) [32], to predict LLMs’ performance has not yet been explored. Therefore, this benchmark study seeks to assess the performance of ChatGPT-4o, DeepSeek, Google Lens and Pl@ntNet on pharmacognosy examinations conducted on the Moodle e-learning platform of Sofia University “St. Kliment Ohridski” and compare it to that of students in the context of competencies, cognitive dimensions and question types, which are variable features of each examination, thus highlighting what leads to simultaneously good student and poor LLM performance, i.e., what makes an examination LLM-resistant.

2. Materials and Methods

Pharmacognosy in the Master’s Program in Pharmacy at Sofia University “St. Kliment Ohridski”, Faculty of Chemistry and Pharmacy is taught in two consecutive courses (Pharmacognosy 1 and 2). Each course includes three online examinations in Moodle, performed during the semester prior to the final exam. These examinations cover portions of the course material and theoretical concepts required for practicums, including reading, interpreting and applying experimental results. Study materials prepared by faculty members are not publicly available; however, the course content is based on the publicly available published scientific literature [33,34]. The Pharmacognosy 1 course introduces students to the foundations of this science and continues with the exploration of natural products from the perspective of their biologically active constituents and therapeutic applications. Students are taught how to harvest, identify and analyze plant materials, how to adhere to relevant regulatory standards and how to utilize these sources in an efficacious and safe manner. Further examples of the topics covered in the course are provided in Appendix A, Table A1.
Each examination in Pharmacognosy 1 quizzes students on a different number of topics and comprises a different total number of categories, structured in thematic sections. These categories contain variations of what is, in essence, the same question in terms of the cognitive domain and competency, but may differ in question type and general layout of text or images, or specific example data required to solve the question. For student attempts, only one question is drawn at random from each category during an examination. Thematic section order is fixed, but category order inside thematic sections is randomized for each student attempt. Individual students can attempt the examination only once. Navigation is consecutive, and sections appear on separate pages. Each question is scored using a point system. Points are awarded for each correct answer. The total achievable score in an examination from all questions ranges from 0.00 to 5.00 points. The points from the three examinations constitute 9.09% of the course grade. The complete evaluation scheme for Pharmacognosy 1 is available in Appendix A, Table A2.
Our first step was structuring the dataset by exporting recorded students’ results for each question, associated question type and question statistics per examination and then deleting all student personal data. This dataset included a record of all scores and psychometrics per question and per category in a table format, as generated by Moodle. Following this step, each question was labeled with the corresponding competency per the questionnaire developed by Atkinson, J. et al., 2016 [21] and the cognitive dimension per Bloom’s classification [35] it assesses. The examination structure is elaborated in Appendix A, Table A3, Table A4 and Table A5. E1 makes the most frequent use of the ability to combine different question types into one question, by utilizing the coding capabilities of the cloze question type: 80% of all its question types are cloze questions. The reverse is observable for E2, in which only 4.2% of questions are cloze questions. Of the remaining 95.8%, the majority (42.1%) are drag and drop questions. E3 utilizes a more even spread of question types: 66% cloze and 34% non-cloze questions. High-cognitive-dimension questions are most abundant in E3 (38.5%), followed by E1 (25.7%) and E2 (11.3%); however, on average, E1 utilizes the most cognitive dimensions per question (2.2), as opposed to E3 (1.9) and E2 (1.4). E3 covered the most competencies ranked above 60% by community pharmacists (87.0%), followed by E1 (63.2%) and E2 (54.0%). Because all questions assess competency in pharmacognosy, this competency was excluded from the tagged datasets. A cumulative comparison of the three examinations, which includes the total number of students attempting each examination and the total number of questions, is provided in Appendix A, Table A6.
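As an illustration, the tagging step amounts to joining the exported per-question records with a hand-built label map. The record fields, question identifiers and label values in the sketch below are hypothetical placeholders, not the actual Moodle export schema or the study's data:

```python
# Hypothetical per-question records after removal of personal data
# (field names only loosely mirror a Moodle statistics export).
results = [
    {"question_id": "Q1", "category": "C1", "facility_index": 72.5},
    {"question_id": "Q2", "category": "C1", "facility_index": 64.0},
    {"question_id": "Q3", "category": "C2", "facility_index": 41.3},
]

# Hand-assigned labels: Bloom cognitive dimension and assessed competency
# (illustrative values, not the study's actual tags).
labels = {
    "Q1": {"bloom": "remember", "competency": "quality control"},
    "Q2": {"bloom": "understand", "competency": "quality control"},
    "Q3": {"bloom": "analyze", "competency": "product interactions"},
}

# Join the export with the label map, question by question.
tagged = [{**row, **labels[row["question_id"]]} for row in results]
```

The same join could equally be expressed as a `pandas` merge on `question_id`; the plain-dictionary form is shown only to keep the sketch self-contained.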
The web interfaces for ChatGPT and DeepSeek were used for all tests. In addition, the web interface for Google Lens was used for questions relying on the identification of plant materials from images. For Pl@ntNet, the mobile app, as available on the iOS App Store, was used to identify the photographic materials included in the examinations. The models used were ChatGPT-4o, DeepSeek-V3 and Google Lens 1.17.240515009. The initial prompt used was “Solve the following questions”; afterwards, all questions were consecutively and manually prompted into the web interface in the form of screenshots. This approach was chosen instead of copy-and-pasting or typing out question text due to the time restrictions in place for student attempts, as well as the variable and complicated layout of the question graphical user interface (Appendix B, Figure A1). Question text was manually prompted only when models could not extract text from provided images or if relevant context was missing (i.e., for questions using drop-down menus, which are not expanded by default). Google Lens and Pl@ntNet searches were performed directly with the images embedded in the exam questions. All examination images in identification questions were queried, and successful botanical identifications were recorded (Appendix B, Figure A2). LLM responses were graded using the same criteria as for student responses (i.e., the same point system). This examination schema awards points for each correct response within a given question. If all responses selected are correct (e.g., all drop-down menu options are accurate), the maximum number of points is awarded.
To model the full range of possible outcomes for each examination based on the recorded responses, a Monte Carlo simulation approach was employed. The Monte Carlo method was selected due to the high combinatorial complexity of the examination format. This method allows for a probabilistic estimation of complex outcome distributions when full enumeration is computationally infeasible [36]. One random question per category was selected by using a uniform distribution, assuming equal likelihood for all variants within a question. The point values for each selected question were summed up to compute a total examination score for that run. This process was repeated 100,000 times for each examination to approximate the probability distribution of total scores. This sampling-based approach avoids the need to explicitly enumerate all possible test permutations, which would exceed 10 million combinations for some of our test structures, making it a useful tool to simulate random outcomes for high question variance examinations. No smoothing, transformation or trimming of the data was applied. All simulations were performed using Python 3.11, with the “pandas” and “random” libraries [37]. Python was chosen for its speed, reproducibility and ability to handle large-scale simulations within memory constraints. All data visualizations were performed with SCImago Graphica [38].
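A minimal sketch of this sampling procedure follows; the category structure and per-variant point values are invented for illustration and do not reflect the actual examination data:

```python
import random

# Hypothetical structure: each category maps to the point values that the
# graded responses earned on its question variants (illustrative numbers).
category_scores = {
    "C1": [0.00, 0.25, 0.25],
    "C2": [0.50, 0.50, 0.00, 0.25],
    "C3": [0.30, 0.30],
}

def simulate_exam_scores(categories, n_runs=100_000, seed=42):
    """Monte Carlo estimate of the total-score distribution.

    Each run draws one question variant per category uniformly at random
    (mirroring Moodle's one-question-per-category selection) and sums the
    recorded point values into a total examination score.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_runs):
        total = sum(rng.choice(scores) for scores in categories.values())
        totals.append(round(total, 2))
    return totals

totals = simulate_exam_scores(category_scores)
mean_score = sum(totals) / len(totals)
```

Sampling with a fixed seed keeps the simulated distribution reproducible across runs, which matters when comparing the resulting score histograms between models and examinations.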
To explore the relationship between the facility index and question outcomes, we used linear regression [39]. The discrimination index values per question category are also provided for further contextualization. The two variables were plotted separately for the LLMs, and the coefficient of determination was calculated using Excel (Microsoft Corporation). Linear regression was carried out once for the entire examination, using all available datapoints, and then individually per question category, separating datapoints into small multiples within the same examination. For the entire examination, the analysis was carried out, including all question categories in order to discern any general positive or negative relationship between the variables. In plotting the variables as small multiples, those categories for which it was immediately obvious that no correlation was observed (i.e., the LLM achieved a score of 0 or 100 per cent on all questions inside the category) were retained; however, question categories which contained less than three questions (i.e., no more than two datapoints could be plotted) were disregarded, as no meaningful result interpretation could be carried out.
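The per-examination fit can equally be reproduced outside Excel. The sketch below computes the coefficient of determination from first principles; the facility-index and score pairs are invented for illustration, not taken from the study's data:

```python
def r_squared(x, y):
    """Coefficient of determination (r^2) for a simple linear fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_yy = sum((b - mean_y) ** 2 for b in y)
    if ss_xx == 0 or ss_yy == 0:
        # Degenerate case: constant x or y (e.g. an LLM scoring 0 or 100
        # per cent on every question in a category).
        return 0.0
    return (ss_xy ** 2) / (ss_xx * ss_yy)

# Illustrative datapoints: facility index (mean student score per question)
# against an LLM's percentage score on the same question.
facility_index = [0.35, 0.50, 0.62, 0.71, 0.88]
llm_score_pct = [20.0, 40.0, 55.0, 60.0, 90.0]

r2 = r_squared(facility_index, llm_score_pct)
```

Running the same function once over all datapoints of an examination and then per question category reproduces the two levels of analysis described above.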

3. Results

3.1. Score Distribution of LLM Recorded Responses and Student Examination Score Comparison

The total mean scores per examination attempt for the tested LLMs are provided in Table 1 and their frequency distribution plots are visualized in Figure 1. ChatGPT-4o outperformed DeepSeek in all three examinations by an average of 12.27 per cent. The highest average score achieved was by ChatGPT on Examination 3 and the lowest was obtained by DeepSeek on Examination 1. With the exception of Examination 3, the total score frequency was more spread out for ChatGPT and differed by the widest margin for Examination 2. ChatGPT’s mean score was over 50 per cent in all examinations, as opposed to DeepSeek, which managed to cross this threshold only in Examination 3. Notable differences in the density of result frequencies are observed between the LLMs, with the most apparent being in DeepSeek’s performance in Examination 3, where two separate strings of result frequencies form. Weaker but similar effects are noted in DeepSeek’s performance on Examination 2 and ChatGPT’s on Examination 1. This effect is absent from DeepSeek’s performance in Examination 1. Another notable difference between examinations relates to the density of the average scores: the total distribution is much more even for both LLMs in Examination 1 than in the other examinations. This is especially apparent in the skew of the distribution plot for both models for Examination 2, relative to the mean score.
Compared to student performance, the models achieved lower average total scores, with the exception of ChatGPT’s performance on Examination 3, where its average total score was 3.12 per cent higher. On Examinations 1 and 2, student performance exceeded that of ChatGPT by an average of 14.9 per cent. The average student scores achieved were 21.1 per cent higher compared to those of DeepSeek; however, the difference in performance was significantly smaller for Examination 3: a higher student average score of only 4.28 per cent, compared to 29.09 per cent for Examination 1 and 30.01 per cent for Examination 2. Another notable difference was the much greater observed standard deviation for student attempts, though that could be explained by the orders-of-magnitude lower total number of student attempts compared to the number of simulations. This comparison is visualized in Figure 2.

3.2. Scores Attained per Examination Category, Relative to Student Performance

Exploring the differences between student and LLM performance in the context of examination categories confirms the already-established results while providing additional detailed context. In Examinations 1 and 2, where students’ average results were higher, they not only outperformed the LLMs in a greater number of categories but also by greater margins than in Examination 3. Examination 1 had the greatest number of categories where student performance was better than that of both LLMs: 10 out of 15 (66.7 per cent). This comparison between examinations, covering both point values and percentage scores achieved, is visualized in Figure 3.
In this context, the most impactful categories for Examination 1 were C10 and C9, as they contributed to attempts not only in a significant proportion of cases but also with greater point values. Questions in both categories are constructed using the same question type, but C10 relies on visual identification of images and low cognitive dimensions and examines less-valued professional competencies, in contrast to C9, which is a purely text-based question but relies on a high cognitive dimension and examines higher-valued professional competencies. For Examination 2, the most impactful categories were C2 and C11. The question type is once again the same for both categories. All questions rely on selecting the correct response from a predetermined set of options in drop-down menus, which are also interdependent. C2 is characterized by requiring a high cognitive dimension and examining highly valued competencies, in contrast to C11, where a low cognitive dimension is required and a less-valued professional competency is quizzed, but attempts are again reliant on analyzing graphical data, specifically microscopic illustrations of plant powders. For Examination 3, C16 contributes the most to student attempts. It utilizes a combination of fill-in-the-blank fields and drop-down menus, which are again interdependent, and relies on visual analysis of graphical data and high cognitive dimensions but explores a mixture of high- and low-valued competencies. DeepSeek performed poorly in all categories where visual analysis of graphical data was concerned.
For the LLMs, both homo- and heterogeneity of the highest contributing categories are observed, particularly in Examinations 2 and 3 for DeepSeek. For Examination 1, both models performed better than students in category C3: questions which require several multiple-choice answers and rely on textual analysis and low cognitive dimensions while examining highly valued competencies. In Examination 2, C9 and C13 contributed to better ChatGPT attempts and C16 to better DeepSeek attempts. The common denominators are the utilization of non-interdependent required answers in simple question types, the requirement for low cognitive dimensions in textual analysis and the examination of low-valued professional competencies. C16 stands out only in that it requires simple arithmetic calculations to achieve a high score. For Examination 3, C7, C10 and C14 seemed to contribute to both LLMs’ performances, with better scores from DeepSeek observed in C7 and C14, compared to ChatGPT. The lower average scores here can be explained by DeepSeek’s low performance in C2–C4 and C15. For C7, C10 and C14, low cognitive dimensions, textual analysis and the examination of low-valued or a mix of low- and high-valued competencies are again the common denominators, whereas in the categories where DeepSeek performed comparatively poorly, success depends on high cognitive dimensions, interdependent answers or the analysis of graphical data.

3.3. Exploring the Connection Between the Facility Index and LLM Scores

Only a very weak positive correlation was observed for Examination 1 between the facility index and ChatGPT scores as a result of the linear regression analysis, with a coefficient of determination (r2) of 0.2504, and for Examination 3, 0.2437. As such, no general observation of the potential predictive capabilities of the metric can be posited. Weaker values were obtained for all other examinations and LLM combinations. The results are visualized in Figure 4. Further context was extracted by exploring this relationship on a category-by-category basis. Much stronger correlations were observed. For Examination 1, the highest r2 values for ChatGPT and DeepSeek were calculated for C7—0.7229 and 0.9521, respectively. All calculated coefficients of determination are provided in Table 2 and category-specific correlations are depicted in Figure 5.

3.4. Google Lens and Pl@ntNet Capabilities in Identification of Herbal Substances Featured on Pharmacognosy Examinations

The majority of Examinations 1 and 2’s embedded photographic materials were not successfully identified. Google Lens suggested the correct botanical identity for only one macroscopic photographic material (2.15 per cent), the fruit of Tribulus terrestris L. from Examination 1, and three macroscopic photographic materials (9.09 per cent) from Examination 2. One of these was of the inflorescence of Achillea millefolium L. and two were of the seed of Aesculus hippocastanum L. Pl@ntNet identified four (8.60 per cent) macroscopic materials from Examination 1 correctly and the same three macroscopic photographic materials from Examination 2, performing slightly better by also successfully identifying leaves of Tribulus terrestris L. and Achillea millefolium L. No micromorphological photographic materials were correctly identified with either computer vision approach. When searching embedded pollen grain photographic materials, Google Lens suggested other pollen grains, but none of the results identified the specific pollen accurately. The pollen grains included in the examinations were of Carthamus tinctorius L. and Crocus sativus L. Equisetum L. species spores were particularly difficult to identify, as search results came back only with images of various algae. This was not the case for the photographic materials included in Examination 3. Macroscopic photographic materials from this examination featured various seeds from fatty-oil-producing plants, some of which are commonly used both as traditional medicines and as foods or spices or are known poisoning hazards. Of the 20 embedded photographic materials, 16 (80 per cent) were correctly identified. These seeds belonged to the species Brassica nigra (L.) W.D.J.Koch, Cannabis sativa L., Chenopodium quinoa Willd., Datura stramonium L., Glycine max (L.) Merr., Papaver somniferum L., Ricinus communis L., Salvia hispanica L. and Sesamum indicum L.
A single outlier, which was also correctly identified, was a macroscopic photographic material of the thallus of Cetraria islandica (L.) Ach. Pl@ntNet’s performance in Examination 3 was in stark contrast to that in Examination 1 and 2 and Google Lens, as it failed to identify any of the seed or lichen images.

4. Discussion and Conclusions

The purpose of this study was to examine how effectively LLMs perform in established pharmacognosy examinations in the context of distant digital evaluation and how their performance compares with that of undergraduate pharmacy students. Recent work has highlighted the expanding role of LLMs in medical and pharmaceutical education, particularly in contexts involving knowledge retrieval and formative assessment [3,4], but not in pharmacognosy online examinations. Additionally, including mixed item types and original score settings offers validity that standard synthetic question sets used in such published works usually lack. By using past examinations and corresponding grading criteria, the study offers a realistic insight into the capabilities and limitations of AI-generated responses in a specialized bioscience domain. The models performed relatively well on lower-order cognitive tasks, such as recalling definitions and identifying basic concepts: areas where prompt clarity and factual recall are most influential [29]. Conversely, questions requiring synthesis, contextualization, or the application of knowledge to novel scenarios exposed clearer limitations: a pattern noted in broader evaluations of AI in clinical and biological science assessments [30]. These patterns reflect the broader discourse on LLM use in medical and pharmaceutical education, where accuracy often hinges on the depth and specificity of domain knowledge.
The use of the facility index and Monte Carlo simulations, which are practical solutions to category-randomized examinations, allowed for the unique findings to be interpreted within a meaningful psychometric framework. Some items with higher facility index values corresponded more closely to successful LLM responses; however, this was not a sustained trend across question categories. Simulated comparisons indicated that the LLMs did not consistently outperform students, reinforcing the idea that AI competence is uneven across question items, cognitive dimensions and examined professional competencies. The study also underscores a practical challenge in educational deployment: while LLMs can reproduce fact-based answers with speed and clarity, they may not reliably generate the nuanced reasoning expected in applied pharmaceutical contexts. Complex, interconnected question types, such as combinations of short answers, drop-down menus, etc., posed significantly higher challenges for LLMs.
Another key consideration is the correct attribution of point values to questions. Inappropriate allocation of a high number of points to simple questions with non-interdependent answers, which rely on textual analysis and low cognitive dimensions and assess low-valued professional competencies, appears to be crucial for the higher scores attainable by LLMs. By extension, this could explain why LLM scores came closest to, and for ChatGPT exceeded, student scores on Examination 3, despite the fact that it has the highest percentage of highly valued professional competencies and the second highest number of questions requiring high cognitive dimensions.
Computer vision tools such as Google Lens and Pl@ntNet appear to still be lacking in accurate botanical identification of herbal substance origins, which is in stark contrast to some of the published literature [16,17,18]. This is especially relevant for micromorphological graphical materials, regardless of whether they are photomicrographic or illustrative work. In contrast, macroscopic images of common medicinal substances, foodstuffs and well-known poisonous materials are readily identifiable. This suggests that avoiding the inclusion of such herbal substances in examination questions could fortify LLM-resistant designs.
Several limitations warrant consideration. No additional higher-reasoning, frontier LLMs (e.g., GPT-o1/o3, Claude 3.5, Gemini 3) were included in this study, although some studies show that these achieve strong exam results in medical domains [40]. While such comparisons are important for establishing a baseline, this study was intentionally limited to the most popular and widely available LLMs. The findings derive from a single subject area and scope, within one institution and one curriculum, potentially limiting generalizability and external validity. We posit that performing the same investigation on a published standardized pedagogical work in pharmacognosy [41] could prove prudent in providing further insights into LLM-resistant examination design. The prompting strategy, though standardized, may also have influenced the quality of responses, particularly for complex tasks. Retrieval-augmented prompting was disregarded, as no single suitable corpus of pharmacognostic course literature in the form of files accessible to LLMs is available. While retrieval augmentation is considered a standard state-of-the-art approach in such studies, the goal was to test raw chat models responding only on the basis of available open access data. Additionally, the absence of multimodal questioning and the lack of follow-up prompts mirror traditional assessment conditions but may underestimate the full potential of interactive AI use. Specifically, it is prudent to point out that no guarantee can be provided as to the preference of students for screenshot input as opposed to typed-out text. This prompting design is inferred from the time-limited nature of online examinations, which leaves students with an average of 174 s per question to prompt, receive a generated response, potentially evaluate it and fill in the perceived correct responses, given the unique user interface question layout.
Potential future studies should also incorporate, at the very least, students’ self-reported prompting preferences. Despite these constraints, the results contribute valuable evidence about how LLM performance aligns with student achievement across different thematic areas of pharmacognostic knowledge.
Educators and curriculum designers may benefit from incorporating LLMs to support revision, question development or feedback generation, provided their use is aligned with the cognitive demands of specific tasks. Future research should examine broader curricula, integrate qualitative analysis of response reasoning and explore interactive or scaffolded prompting to assess the upper limits of LLM capability. As AI integration in pharmacy education accelerates, balanced evaluation will remain essential to ensure that innovation enhances, rather than compromises, learning outcomes.

Author Contributions

Conceptualization, A.S. and A.N.; methodology, A.S. and A.N.; software, A.S.; investigation, A.S.; resources, A.N.; data curation, A.S.; writing—original draft preparation, A.S.; writing—review and editing, A.N.; visualization, A.S.; supervision, A.N.; project administration, A.N.; funding acquisition, A.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors express gratitude to the Sofia University Marking Momentum for Innovation and Technological Transfer (SUMMIT) BG-RRP-2.004-0008 SUMMIT-3.3 for providing support for popularizing the results of this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial intelligence
CV: Computer vision
E1: Examination 1
E2: Examination 2
E3: Examination 3
FIB: Fill in the blanks
DD: Drag and drop
LLM: Large language model
MC: Multiple choice
TF: True or false

Appendix A

Table A1. Topics covered in the Pharmacognosy 1 course.
Thematic Area | Topics Covered | Key Concepts/Examples
Foundations of Pharmacognosy | Nature, objectives, tasks | Basic concepts and approaches
 | Historical development | Ancient sources and contemporary practices
 | Medicinal plants and herbal substances: concepts, classification, nomenclature | Taxonomy; herbal substances nomenclature
 | Interdisciplinary role of pharmacognosy across sciences | Connections to pharmacology, pharmacotherapy, technology of pharmaceutical forms and legislation
Discovery, Sources and Products of Natural Origin | Modern approaches to medicinal plant discovery | Ethnopharmacology, ethnobotany, phylogenetics, chemotaxonomy
 | Products of natural origin | Herbal substances; preparations (teas, oils, fats and extracts)
Plant Material Handling | Collection, processing, storage, cultivation | Good Agricultural and Collection Practices (GACP)
 | Wild plants and biodiversity as sources | Distribution of natural resources; bioaccumulation; cultivation
 | Pharmacognostic study design | Planning and key considerations
 | Types of preparations and extraction techniques | Specific requirements in overview of constituents
Standards and Regulations | European Pharmacopeia | Structure
 | Quality control methods | Classification of preparations; reference standards; marker compounds
 | Medicine agencies | Bulgarian Drug Agency; European Medicines Agency—Herbal Medicinal Products Committee; definitions; declaration of extracts
Analytical and Diagnostic Methods | Macroscopic, microscopic, pharmacognostic analyses | Morphological and anatomical features; staining and observation techniques
 | Physicochemical pharmacognostic analyses | Loss on drying; ash content; swelling index
Herbal Medicines | Definitions, classification, safety and efficacy | Traditional and well-established use; combination herbal medicinal products (species); use in sensitive populations; therapeutic indications, posology, contraindications, period of use
Biologically Active Compounds | Isolation and analysis of bioactive compounds | Methods and screening for activity
 | Primary and secondary metabolites, biosynthetic pathways | Shikimate, mevalonate, malonate, etc.
 | Chemical groups of natural compounds | Carbohydrates; lipids and lipoids; phenols; flavonoids
Table A2. Pharmacognosy 1, course evaluation scheme. The total number of points available for the course is 165. A minimum of 83 points is required for a passing grade.
Quiz, Examination or Exam | Points | Points [%]
Taxonomy and nomenclature quiz | 18 | 10.91
Examination 1 | 5 | 3.03
Examination 2 | 5 | 3.03
Examination 3 | 5 | 3.03
Practical exam | 33 | 20.00
Final exam | 99 | 60.00
Table A3. Thematic section and category structure of Pharmacognosy 1, Examination 1.
Thematic Section | Categories | Point Value | Questions | Question Type | Facility Index [%] | Discrimination Index [%] | Cognitive Dimension | Competencies
European pharmacopeia | C1. General notices | 0.2 | 1.1 | Cloze (MC; FIB) | 81 | 33 | Remember | 7.7 Ability to maintain current knowledge of relevant legislation and codes of pharmacy practice; 10.37 legislation and professional ethics
1.2 | 76
1.3 | 74
C2. Ph. Eur. structure | 0.4 | 2.1 | Cloze (MC) | 72 | 10 | Apply, analyze | 7.7 Ability to maintain current knowledge of relevant legislation and codes of pharmacy practice; 9.21 ability to communicate in English and/or locally relevant languages; 10.37 legislation and professional ethics; 11.39 current knowledge of good manufacturing practice (GMP) and of good laboratory practice (GLP)
2.2 | 74
2.3 | 56
2.4 | 74
C3. Monograph structure | 0.4 | 3.1 | Cloze (MC) | 64 | 11 | Remember, analyze | 7.7 Ability to maintain current knowledge of relevant legislation and codes of pharmacy practice; 10.37 legislation and professional ethics; 11.39 current knowledge of good manufacturing practice (GMP) and of good laboratory practice (GLP)
3.2 | 70
3.3 | 63
Pharmacognosy—basic concepts | C4. Basic concepts | 0.1 | 4.1 | Cloze (MC; FIB) | 79 | 59 | Remember, apply | 9.21 Ability to communicate in English and/or locally relevant languages
4.2 | 72
4.3 | 60
C5. Chemotaxonomy | 0.4 | 5.1 | DD | 86 | 28 | Remember, apply | 9.21 Ability to communicate in English and/or locally relevant languages; 10.24 plant and animal biology; 10.27 organic and medicinal/pharmaceutical chemistry
5.2 | 80
5.3 | 73
C6. Morphological plant parts | 0.5 | 6.1 | Cloze (MC) | 77 | 43 | Remember, analyze | 10.24 Plant and animal biology; 11.39 current knowledge of good manufacturing practice (GMP) and of good laboratory practice (GLP)
6.2 | 82
6.3 | 66
6.4 | 70
C7. Primary processing and GACP | 0.1 | 7.1 | MC | 91 | 46 | Remember, understand | 11.39 Current knowledge of good manufacturing practice (GMP) and of good laboratory practice (GLP)
7.2 | 91
7.3 | 92
7.4 | 71
HMPC|EMA—recommendations | C8. Case studies | 0.5 | 8.1 | Cloze (MC) | 58 | 25 | Remember, understand, apply, analyze | 9.21 Ability to communicate in English and/or locally relevant languages; 10.30 anatomy and physiology; medical terminology; 10.33 pharmacotherapy and pharmaco-epidemiology; 13.46 retrieval and interpretation of relevant information on the patient’s clinical background; 17.63 provision of informed support for patients in selection and use of non-prescription medicines for minor ailments (e.g., cough remedies …)
8.2 | 91 | Remember, understand, apply
C9. Therapeutic area, indication, patient population, posology | 0.5 | 9.1 | Cloze (MC) | 75 | 22 | Remember, apply | 10.33 Pharmacotherapy and pharmaco-epidemiology; 17.63 provision of informed support for patients in selection and use of non-prescription medicines for minor ailments (e.g., cough remedies…)
9.2 | 50
9.3 | 69
Macroscopic analysis | C10. Identification | 0.7 | 10.1 | Cloze (MC) | 62 | 25 | Remember, analyze | 9.21 Ability to communicate in English and/or locally relevant languages; 10.24 plant and animal biology
10.2 | 33
10.3 | 55
C11. Adulteration | 0.2 | 11.1 | MC | 81 | 48 | Remember | 9.21 Ability to communicate in English and/or locally relevant languages; 10.24 plant and animal biology
11.2 | 84
11.3 | 56
Microscopic analysis | C12. Methodology | 0.2 | 12.1 | Cloze (FIB) | 59 | 44 | Remember, understand | 7.6 Ability to design and conduct research using appropriate methodology; 7.7 ability to maintain current knowledge of relevant legislation and codes of pharmacy practice; 9.21 ability to communicate in English and/or locally relevant languages; 10.37 legislation and professional ethics
12.2 | 45
12.3 | 59
C13. Diagnostic characteristics | 0.1 | 13.1 | Cloze (MC) | 83 | 14 | Remember, apply, analyze | 10.24 Plant and animal biology
13.2 | 60
13.3 | 76
13.4 | 65
C14. Differential staining | 0.3 | 14.1 | Cloze (MC) | 72 | 29 | Remember, understand, apply, analyze | 10.24 Plant and animal biology; 10.27 organic and medicinal/pharmaceutical chemistry; 10.29 general and applied biochemistry (medicinal and clinical)
14.2 | 71
14.3 | 70
14.4 | 52
C15. Quality standards and adulteration | 0.4 | 15.1 | Cloze (FIB) | 46 | 25 | Remember, understand, apply, analyze, evaluate | 10.24 Plant and animal biology; 10.37 legislation and professional ethics
15.2 | 39
15.3 | 51
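The Facility Index (FI) and Discrimination Index (DI) columns above are standard item statistics: the FI is the percentage of attempts answering an item correctly, while the DI contrasts high- and low-scoring attempts. As a rough illustration only, the classical upper/lower-group variants can be sketched as below; the function names, the 27% group split and all data are our own assumptions, and Moodle's built-in statistics use related but more elaborate formulas.

```python
# Classical item-analysis sketch: Facility Index (share of correct answers)
# and an upper/lower-group Discrimination Index. Simplified illustration;
# all data below are invented.
from typing import List

def facility_index(item_scores: List[int]) -> float:
    """Percentage of attempts that answered the item correctly."""
    return 100.0 * sum(item_scores) / len(item_scores)

def discrimination_index(item_scores: List[int], exam_totals: List[int],
                         fraction: float = 0.27) -> float:
    """FI difference between the top and bottom scoring groups
    (each group holds `fraction` of all attempts)."""
    order = sorted(range(len(exam_totals)), key=lambda i: exam_totals[i])
    k = max(1, int(len(order) * fraction))
    bottom = [item_scores[i] for i in order[:k]]
    top = [item_scores[i] for i in order[-k:]]
    return facility_index(top) - facility_index(bottom)

# Invented example: four attempts on one item, with their exam totals.
fi = facility_index([1, 1, 0, 1])                                # 75.0
di = discrimination_index([1, 1, 0, 0], [10, 9, 2, 1], fraction=0.25)
```

In this toy example the item is answered correctly by the strongest attempt and missed by the weakest, so the upper/lower-group DI is maximal.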
Table A4. Thematic section and category structure of Pharmacognosy 1, Examination 2.
Thematic Section | Categories | Point Value | Questions | Question Type | Facility Index [%] | Discrimination Index [%] | Cognitive Dimension | Competencies
Traditional herbal medicinal products | C1. Justification of traditional use requirements | 0.2 | 1.1 | FIB | 88 | 30 | Remember | 7.7 Ability to maintain current knowledge of relevant legislation and codes of pharmacy practice; 10.37 legislation and professional ethics
1.2 | 93
C2. Therapeutic indications and use | 0.4 | 2.1 | DD | 76 | 42 | Remember, apply | 17.63 Provision of informed support for patients in selection and use of non-prescription medicines for minor ailments (e.g., cough remedies…)
2.2 | 65
2.3 | 73
2.4 | 68
Extraction processes | C3. Preliminary processing of plant materials | 0.1 | 3.1 | TF | 100 | 15 | Remember, understand | 7.6 Ability to design and conduct research using appropriate methodology
3.2 | 38
3.3 | 83
3.4 | 85
C4. Extraction types | 0.2 | 4.1 | DD | 59 | 10 | Remember, understand | 10.38 Current knowledge of design, synthesis, isolation, characterization and biological evaluation of active substances
4.2 | MC | 86
4.3 | 80
4.4 | FIB | 79
4.5 | 91
4.6 | DD | 78
4.7 | 78
C5. Solvents | 0.2 | 5.1 | DD | 98 | 27 | Remember, understand | 10.27 Organic and medicinal/pharmaceutical chemistry
5.2 | 68
5.3 | 48
5.4 | 54
5.5 | 75
Metabolic pathways | C6. Primary and secondary metabolites | 0.2 | 6.1 | FIB | 82 | 8 | Remember | 10.29 General and applied biochemistry (medicinal and clinical)
6.2 | 87
6.3 | 97
C7. Biosynthetic pathways | 0.1 | 7.1 | Cloze (MC) | 65 | 20 | Remember | 10.29 General and applied biochemistry (medicinal and clinical)
7.2 | 81
7.3 | 56
7.4 | 61
Macroscopic analysis | C8. Identification | 0.3 | 8.1 | DD | 64 | 13 | Remember, analyze | 9.21 Ability to communicate in English and/or locally relevant languages; 10.24 plant and animal biology
8.2 | 92
8.3 | 77
8.4 | 67
8.5 | 81
C9. Diagnostic characteristics | 0.3 | 9.1 | MC | 77 | 22 | Remember | 9.21 Ability to communicate in English and/or locally relevant languages; 10.24 plant and animal biology
9.2 | 75
9.3 | 63
9.4 | 97
9.5 | 72
9.6 | 82
Microscopic analysis | C10. Diagnostic characteristics | 0.3 | 10.1 | DD | 83 | 41 | Remember | 9.21 Ability to communicate in English and/or locally relevant languages; 10.24 plant and animal biology
10.2 | 83
10.3 | 63
10.4 | 97
10.5 | FIB | 75
10.6 | 63
10.7 | 84
10.8 | 81
10.9 | 88
C11. Micromorphology | 0.3 | 11.1 | DD | 100 | 47 | Remember | 10.24 Plant and animal biology
11.2 | 85
11.3 | 45
11.4 | 77
11.5 | 60
11.6 | 78
C12. Identification | 0.3 | 12.1 | MC | 78 | 6 | Remember, analyze | 10.24 Plant and animal biology
12.2 | 60
12.3 | FIB | 80
12.4 | MC | 55
12.5 | 67
C13. Microchemical reactions | 0.3 | 13.1 | MC | 75 | 15 | Remember | 10.28 Analytical chemistry
13.2 | 100
13.3 | TF | 71
13.4 | 43
13.5 | 50
13.6 | 17
13.7 | DD | 92
13.8 | 64
C14. Starches | 0.3 | 14.1 | MC | 50 | 17 | Remember, analyze | 10.24 Plant and animal biology; 10.37 legislation and professional ethics
14.2 | 85
14.3 | 77
14.4 | 25
14.5 | 93
Pharmacognostic analyses | C15. Loss on Drying and Ash—general concepts | 0.3 | 15.1 | DD | 93 | 9 | Remember, apply | 10.37 Legislation and professional ethics
15.2 | 93
15.3 | FIB | 80
15.4 | 93
15.5 | DD | 79
15.6 | 55
15.7 | 96
C16. Loss on Drying and calculation | 0.3 | 16.1 | MC | 74 | 6 | Remember, apply | 10.37 Legislation and professional ethics
16.2 | 67
16.3 | 84
16.4 | 79
C17. Ash—calculation | 0.3 | 17.1 | DD | 91 | 41 | Remember, apply | 10.37 Legislation and professional ethics
17.2 | 78
17.3 | MC | 80
17.4 | 89
C18. Swelling index | 0.3 | 18.1 | DD | 87 | 26 | Remember | 7.6 Ability to design and conduct research using appropriate methodology; 10.37 legislation and professional ethics
18.2 | 85
18.3 | 80
C19. Cotton wool—absorbency | 0.3 | 19.1 | DD | 84 | 14 | Remember | 7.6 Ability to design and conduct research using appropriate methodology; 10.37 legislation and professional ethics
19.2 | TF | 100
19.3 | 50
19.4 | 44
Table A5. Thematic section and category structure of Pharmacognosy 1, Examination 3.
Thematic Section | Categories | Point Value | Questions | Question Type | Facility Index [%] | Discrimination Index [%] | Cognitive Dimension | Competencies
General questions | C1. Natural identity of constituents | 0.1 | 1.1 | MC | 100 | 1 | Remember | 9.21 Ability to communicate in English and/or locally relevant languages
1.2 | 67
1.3 | 100
1.4 | 100
1.5 | 100
C2. General statements on constituents | 0.2 | 2.1 | MC | 67 | 0 | Remember, understand | 10.27 Organic and medicinal/pharmaceutical chemistry; 10.37 legislation and professional ethics
2.2 | 79
2.3 | 75
C3. Recognizing constituents, medicinal and borderline products | 0.5 | 3.1 | DD | 60 | 47 | Remember, analyze | 7.7 Ability to maintain current knowledge of relevant legislation and codes of pharmacy practice; 10.37 legislation and professional ethics
3.2 | 32
3.3 | 67
Carbohydrates and waxes | C4. Honey | 0.3 | 4.1 | Cloze (FIB; DD) | 84 | 12 | Remember | 9.21 Ability to communicate in English and/or locally relevant languages; 10.24 plant and animal biology; 10.27 organic and medicinal/pharmaceutical chemistry; 10.31 microbiology; 10.37 legislation and professional ethics
4.2 | 96
C5. Mannitol | 0.2 | 5.1 | MC | 79 | 0 | Remember, apply | 10.31 Microbiology; 10.32 pharmacology including pharmacokinetics; 10.33 pharmacotherapy and pharmaco-epidemiology; 10.37 legislation and professional ethics
5.2 | 79
C6. Waxes | 0.2 | 6.1 | Cloze (FIB; DD) | 100 | 21 | Remember, apply | 9.21 Ability to communicate in English and/or locally relevant languages; 10.34 pharmaceutical technology, including analyses of medicinal products
6.2 | 65
Fats and vegetable fatty oils | C7. Quality control of fatty oils | 0.3 | 7.1 | Cloze (FIB; DD) | 87 | 17 | Remember | 9.21 Ability to communicate in English and/or locally relevant languages; 10.34 pharmaceutical technology, including analyses of medicinal products; 10.37 legislation and professional ethics
7.2 | 67
C8. Extraction and analysis of fatty oils | 0.3 | 8.1 | Cloze (FIB; DD) | 45 | 26 | Remember, analyze | 7.4 Capacity to evaluate scientific data in line with current scientific and technological knowledge; 9.21 ability to communicate in English and/or locally relevant languages; 10.37 legislation and professional ethics
8.2 | 47
C9. Structural characteristics of fatty oils | 0.2 | 9.1 | Cloze (DD) | 100 | 26 | Remember, analyze | 10.27 Organic and medicinal/pharmaceutical chemistry
9.2 | 70
9.3 | 100
9.4 | 56
9.5 | 85
9.6 | 75
C10. Castor oil | 0.3 | 10.1 | Cloze (FIB; DD) | 61 | 15 | Remember | 9.21 Ability to communicate in English and/or locally relevant languages; 10.24 plant and animal biology; 10.27 organic and medicinal/pharmaceutical chemistry
10.2 | 72
Therapeutic indications | C11. HMPC|EMA recommendations | 0.35 | 11.1 | Cloze (MC) | 81 | 2 | Remember, apply | 9.21 Ability to communicate in English and/or locally relevant languages; 10.33 pharmacotherapy and pharmaco-epidemiology; 17.63 provision of informed support for patients in selection and use of non-prescription medicines for minor ailments (e.g., cough remedies…)
11.2 | 77
C12. Risk and adverse effects associated with use | 0.35 | 12.1 | Essay | 80 | 32 | Understand, analyze, evaluate, create | 7.2 Analysis: ability to apply logic to problem solving, evaluating pros and cons and following up on the solution found; 7.3 synthesis: capacity to gather and critically appraise relevant knowledge and to summarize the key points; 7.5 ability to interpret preclinical and clinical evidence-based medical science and apply the knowledge to pharmaceutical practice; 10.32 pharmacology, including pharmacokinetics; 10.33 pharmacotherapy and pharmaco-epidemiology; 13.46 retrieval and interpretation of relevant information on the patient’s clinical background; 17.63 provision of informed support for patients in the selection and use of non-prescription medicines for minor ailments (e.g., cough remedies…)
12.2 | 76
12.3 | 86
Phenols and flavonoids | C13. Micro-sublimation | 0.3 | 13.1 | Cloze (FIB; DD) | 75 | 13 | Remember, analyze | 10.33 Pharmacotherapy and pharmaco-epidemiology; 17.63 provision of informed support for patients in selection and use of non-prescription medicines for minor ailments (e.g., cough remedies…)
13.2 | 92
C14. Anthocyanidins | 0.2 | 14.1 | Cloze (DD) | 56 | 51 | Remember | 10.27 Organic and medicinal/pharmaceutical chemistry; 10.28 analytical chemistry
14.2 | 73
Macroscopic analysis | C15. Identification | 0.3 | 15.1 | DD | 92 | 38 | Remember, analyze | 9.21 Ability to communicate in English and/or locally relevant languages; 10.24 plant and animal biology
15.2 | 75
15.3 | 67
15.4 | 33
15.5 | 92
C16. Combination herbal product quality analysis and risk | 0.5 | 16.1 | Cloze (FIB; DD) | 40 | 68 | Analyze, evaluate, remember | 9.21 Ability to communicate in English and/or locally relevant languages; 10.24 plant and animal biology; 10.35 toxicology
16.2 | 50
Microscopic analysis | C17. Combination herbal product quality analysis | 0.4 | 17.1 | Cloze (FIB; DD) | 14 | 46 | Remember, analyze, evaluate | 9.21 Ability to communicate in English and/or locally relevant languages; 10.24 plant and animal biology; 10.34 pharmaceutical technology, including analyses of medicinal products
17.2 | 23
Table A6. Cumulative comparison of Pharmacognosy 1 course examinations, in terms of question types, cognitive dimensions, competencies, student performance and Moodle statistics.
Student Performance and Examination Statistics
Examination | Total Student Attempts | Examination Duration [min] | Total Questions | Questions per Student Attempt | Student Average Performance [%] | Standard Deviation [%]
1 | 53 | 45 | 49 | 15 | 66.29 | 12.41
2 | 52 | 50 | 95 | 19 | 76.21 | 11.41
3 | 53 | 55 | 47 | 17 | 65.88 | 11.64
Question types
Examination | TF | MC | DD | FIB | Essay | Cloze (MC) | Cloze (DD) | Cloze (MC; FIB) | Cloze (FIB) | Cloze (FIB; DD)
1 | 0 | 7 | 3 | 3 | 0 | 27 | 0 | 6 | 3 | 0
2 | 11 | 25 | 40 | 15 | 0 | 4 | 0 | 0 | 0 | 0
3 | 0 | 10 | 3 | 0 | 3 | 2 | 13 | 0 | 0 | 16
Cognitive dimensions (Remembering, Understanding and Applying assess low cognitive dimensions; Analyzing, Evaluating and Creating assess high cognitive dimensions)
Examination | Remembering | Understanding | Applying | Analyzing | Evaluating | Creating
1 | 43 | 14 | 24 | 25 | 3 | 0
2 | 87 | 16 | 15 | 15 | 0 | 0
3 | 44 | 6 | 6 | 25 | 7 | 3
Professional competencies (7.6, 10.24, 10.28, 11.38 and 11.39 ranked <60% by community pharmacists […]; the remaining competencies ranked >60% by community pharmacists […])
Examination | 7.6 | 10.24 | 10.28 | 11.38 | 11.39 | 7.2 | 7.3 | 7.4 | 7.5 | 7.7 | 9.21 | 10.27 | 10.29 | 10.30 | 10.31 | 10.32 | 10.33 | 10.34 | 10.35 | 10.37 | 13.46 | 17.63
1 | 3 | 24 | 0 | 0 | 15 | 0 | 0 | 0 | 0 | 10 | 21 | 7 | 4 | 2 | 0 | 0 | 5 | 0 | 0 | 16 | 2 | 5
2 | 11 | 31 | 8 | 7 | 0 | 0 | 0 | 0 | 0 | 2 | 20 | 5 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 29 | 0 | 4
3 | 0 | 13 | 2 | 0 | 0 | 3 | 3 | 2 | 3 | 3 | 28 | 9 | 0 | 0 | 2 | 2 | 9 | 6 | 2 | 20 | 3 | 7

Appendix B

Figure A1. Screen capture depicting the graphical user interface of examination questions: (a) E1, section “Pharmacognosy—basic concepts”, category “Morphological parts”, question 6.1; and (b) E1, section “HMPC|EMA—recommendations”, category “Therapeutic area, indication, patient population, posology”, question 9.1.
Figure A2. Screen capture depicting the utilization of the graphical user interface of Google Lens. The provided example is of a successful botanical identification of photographic materials by Google Lens included in pharmacognosy examinations.

References

  1. Mortlock, R.; Lucas, C. Generative artificial intelligence (Gen-AI) in pharmacy education: Utilization and implications for academic integrity: A scoping review. Explor. Res. Clin. Soc. Pharm. 2024, 15, 100481. [Google Scholar] [CrossRef]
  2. Burke, C.M. AI-Assisted Exam Variant Generation: A Human-in-the-Loop Framework for Automatic Item Creation. Educ. Sci. 2025, 15, 1029. [Google Scholar] [CrossRef]
  3. Nikolovski, V.; Trajanov, D.; Chorbev, I. Advancing AI in Higher Education: A Comparative Study of Large Language Model-Based Agents for Exam Question Generation, Improvement, and Evaluation. Algorithms 2025, 18, 144. [Google Scholar] [CrossRef]
  4. Wang, Q. DeepSeek Hits Hard: Helping to Revolutionize Higher Education in the Era of Artificial Intelligence. Int. J. High. Educ. 2025, 14, 26. [Google Scholar] [CrossRef]
  5. Stöhr, C.; Ou, A.W.; Malmström, H. Perceptions and usage of AI chatbots among students in higher education across genders, academic levels and fields of study. Comput. Educ. Artif. Intell. 2024, 7, 100259. [Google Scholar] [CrossRef]
  6. Huo, W.; Yuan, X.; Li, X.; Luo, W.; Xie, J.; Shi, B. Increasing acceptance of medical AI: The role of medical staff participation in AI development. Int. J. Med. Inform. 2023, 175, 105073. [Google Scholar] [CrossRef] [PubMed]
  7. Gustafson, K.A.; Berman, S.; Gavaza, P.; Mohamed, I.; Devraj, R.; Abdel Aziz, M.H.; Singh, D.; Southwood, R.; Ogunsanya, M.E.; Chu, A.; et al. Pharmacy faculty and students perceptions of artificial intelligence: A National Survey. Curr. Pharm. Teach. Learn. 2025, 17, 102344. [Google Scholar] [CrossRef] [PubMed]
  8. Yan, L.; Sha, L.; Zhao, L.; Li, Y.; Martinez-Maldonado, R.; Chen, G.; Li, X.; Jin, Y.; Gašević, D. Practical and Ethical Challenges of Large Language Models in Education: A Systematic Scoping Review. arXiv 2023, arXiv:2303.13379. [Google Scholar] [CrossRef]
  9. Franke, S.; Pott, C.; Rutinowski, J.; Pauly, M.; Reining, C.; Kirchheim, A. Can ChatGPT Solve Undergraduate Exams from Warehousing Studies? An Investigation. Computers 2025, 14, 52. [Google Scholar] [CrossRef]
  10. Ehlert, A.; Ehlert, B.; Cao, B.; Morbitzer, K. Large Language Models and the North American Pharmacist Licensure Examination (NAPLEX) Practice Questions. Am. J. Pharm. Educ. 2024, 88, 101294. [Google Scholar] [CrossRef] [PubMed]
  11. Do, V.; Donohoe, K.L.; Peddi, A.N.; Carr, E.; Kim, C.; Mele, V.; Patel, D.; Crawford, A.N. Artificial intelligence (AI) performance on pharmacy skills laboratory course assignments. Curr. Pharm. Teach. Learn. 2025, 17, 102367. [Google Scholar] [CrossRef]
  12. Larsen, S.K. Creating Large Language Model Resistant Exams: Guidelines and Strategies (Version 1). arXiv 2023, arXiv:2304.12203. [Google Scholar] [CrossRef]
  13. Logullo, P.; De Beyer, J.A.; Kirtley, S.; Schlüssel, M.M.; Collins, G.S. Open access journal publication in health and medical research and open science: Benefits, challenges and limitations. BMJ Evid. Based Med. 2024, 29, 223–228. [Google Scholar] [CrossRef]
  14. Anderson, H.D.; Kwon, S.; Linnebur, L.A.; Valdez, C.A.; Linnebur, S.A. Pharmacy student use of ChatGPT: A survey of students at a U.S. School of Pharmacy. Curr. Pharm. Teach. Learn. 2024, 16, 102156. [Google Scholar] [CrossRef]
  15. Ibrahim, A.F.; Danpanichkul, P.; Hayek, A.; Paul, E.; Farag, A.; Mansoor, M.; Thongprayoon, C.; Cheungpasitporn, W.; Othman, M.O. Artificial Intelligence in Gastroenterology Education: DeepSeek Passes the Gastroenterology Board Examination and Outperforms Legacy ChatGPT Models. Am. J. Gastroenterol. 2025, ahead of print. [Google Scholar] [CrossRef]
  16. Shapovalov, V.B.; Shapovalov, Y.B.; Bilyk, Z.I.; Megalinska, A.P.; Muzyka, I.O. The Google Lens analyzing quality: An analysis of the possibility to use in the educational process. Educ. Dimens. 2019, 1, 219–234. [Google Scholar] [CrossRef]
  17. Shapovalov, Y.; Bilyk, Z.; Atamas, A.; Shapovalov, V.; Uchitel, A. The Potential of Using Google Expeditions and Google Lens Tools under STEM-education in Ukraine. Educ. Dimens. 2019, 51, 90–101. [Google Scholar] [CrossRef]
  18. Bilyk, Z.; Shapovalov, Y.; Shapovalov, V.; Antonenko, P.; Zhadan, S.; Lytovchenko, D.; Megalinska, A. Features of Using Mobile Applications to Identify Plants and Google Lens During the Learning Process. In Proceedings of the 2nd Myroslav I. Zhaldak Symposium on Advances in Educational Technology—AET, Kyiv, Ukraine, 11–12 November 2023; pp. 688–705. [Google Scholar] [CrossRef]
  19. ElKhalifa, D.; Hussein, O.; Hamid, A.; Al-Ziftawi, N.; Al-Hashimi, I.; Ibrahim, M.I.M. Curriculum, competency development, and assessment methods of MSc and PhD pharmacy programs: A scoping review. BMC Med. Educ. 2024, 24, 989. [Google Scholar] [CrossRef] [PubMed]
  20. Aly, A.; Mraiche, F.; Maklad, E.; Ali, R.; El-Awaisi, A.; El Hajj, M.S.; Mukhalalati, B. Examining the perception of undergraduate pharmacy students towards their leadership competencies: A mixed-methods study. BMC Med. Educ. 2025, 25, 833. [Google Scholar] [CrossRef] [PubMed]
  21. Atkinson, J.; Sánchez Pozo, A.; Rekkas, D.; Volmer, D.; Hirvonen, J.; Bozic, B.; Skowron, A.; Mircioiu, C.; Sandulovici, R.; Marcincal, A.; et al. Hospital and Community Pharmacists’ Perceptions of Which Competences Are Important for Their Practice. Pharmacy 2016, 4, 21. [Google Scholar] [CrossRef]
  22. Alamgir, A.N.M. Origin, Definition, Scope and Area, Subject Matter, Importance, and History of Development of Pharmacognosy. In Therapeutic Use of Medicinal Plants and Their Extracts: Volume 1; Springer International Publishing: Berlin/Heidelberg, Germany, 2017; Volume 73, pp. 19–60. [Google Scholar] [CrossRef]
  23. Dhami, N. Trends in Pharmacognosy: A modern science of natural medicines. J. Herb. Med. 2013, 3, 123–131. [Google Scholar] [CrossRef]
  24. Shinde, V.; Dhalwal, K.; Mahadik, K.R. Some issues related to pharmacognosy. Pharmacogn. Rev. 2008, 2, 1–5. [Google Scholar]
  25. Cadena-Bautista, Á.; López-Ponce, F.F.; Ojeda-Trueba, S.L.; Sierra, G.; Bel-Enguix, G. Exploring the Behavior and Performance of Large Language Models: Can LLMs Infer Answers to Questions Involving Restricted Information? Information 2025, 16, 77. [Google Scholar] [CrossRef]
  26. Agrawal, G.; Kumarage, T.; Alghamdi, Z.; Liu, H. Can Knowledge Graphs Reduce Hallucinations in LLMs?: A Survey (Version 2). arXiv 2023, arXiv:2311.07914. [Google Scholar] [CrossRef]
  27. Blanco Abellan, M.; Ginovart Gisbert, M. On How Moodle Quizzes Can Contribute to the Formative e-Assessment of First-Year Engineering Students in Mathematics Courses. RUSC Univ. Knowl. Soc. J. 2012, 9, 166. [Google Scholar] [CrossRef]
  28. Hamady, S.; Mershad, K.; Jabakhanji, B. Multi-version interactive assessment through the integration of GeoGebra with Moodle. Front. Educ. 2024, 9, 1466128. [Google Scholar] [CrossRef]
  29. Viegas, C.; Gheyi, R.; Ribeiro, M. Assessing the Capability of LLMs in Solving POSCOMP Questions (Version 1). arXiv 2025, arXiv:2505.20338. [Google Scholar] [CrossRef]
  30. Singh, S.; Alyakin, A.; Alber, D.A.; Stryker, J.; Tong, A.P.S.; Sangwon, K.; Goff, N.; de la Paz, M.; Hernandez-Rovira, M.; Park, K.Y.; et al. It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education (Version 1). arXiv 2025, arXiv:2503.13508. [Google Scholar] [CrossRef]
  31. Quiz statistics report—MoodleDocs. 2024. Available online: https://docs.moodle.org/500/en/Quiz_statistics_report (accessed on 30 September 2025).
  32. Quiz Report Statistics—MoodleDocs. 2022. Available online: https://docs.moodle.org/dev/Quiz_report_statistics (accessed on 30 September 2025).
  33. Odoh, U.E.; Gurav, S.S.; Chukwuma, M.O. (Eds.) Pharmacognosy and Phytochemistry: Principles, Techniques, and Clinical Applications, 1st ed.; Wiley: Hoboken, NJ, USA, 2025. [Google Scholar] [CrossRef]
  34. Evans, W.C. Trease and Evans Pharmacognosy, 16th ed.; Saunders/Elsevier: Amsterdam, The Netherlands, 2009. [Google Scholar]
  35. Tofade, T.; Elsner, J.; Haines, S.T. Best Practice Strategies for Effective Use of Questions as a Teaching Tool. Am. J. Pharm. Educ. 2013, 77, 155. [Google Scholar] [CrossRef] [PubMed]
  36. Rubinstein, R.Y.; Kroese, D.P. Simulation and the Monte Carlo Method, 1st ed.; Wiley: Hoboken, NJ, USA, 2016. [Google Scholar] [CrossRef]
  37. McKinney, W. Data Structures for Statistical Computing in Python. Python Sci. Conf. 2010, 56–61. [Google Scholar] [CrossRef]
  38. Hassan-Montero, Y.; De-Moya-Anegón, F.; Guerrero-Bote, V.P. SCImago Graphica: A new tool for exploring and visually communicating data. El Prof. De La Inf. 2022, 31, e310502. [Google Scholar] [CrossRef]
  39. Roustaei, N. Application and interpretation of linear-regression analysis. Med. Hypothesis Discov. Innov. Ophthalmol. 2024, 13, 151–159. [Google Scholar] [CrossRef] [PubMed]
  40. Wang, W.; Zhou, Y.; Fu, J.; Hu, K. Evaluating the Performance of DeepSeek-R1 and DeepSeek-V3 Versus OpenAI Models in the Chinese National Medical Licensing Examination: Cross-Sectional Comparative Study. JMIR Med. Educ. 2025, 11, e73469. [Google Scholar] [CrossRef] [PubMed]
  41. Bouzabata, A. Pharmacognosy: 150 Corrected and Annotated Multiple-Choice Questions and Course Summaries. 2018. Available online: https://books.google.bg/books?id=78NODwAAQBAJ (accessed on 20 November 2025).
Figure 1. Visualization of Monte Carlo simulations of average total score frequency, as achieved by the tested LLMs: (a) Examination 1—ChatGPT, (b) Examination 1—DeepSeek, (c) Examination 2—ChatGPT, (d) Examination 2—DeepSeek, (e) Examination 3—ChatGPT and (f) Examination 3—DeepSeek. The vertical black line represents the average score, as calculated by the built-in functionality of SCImago Graphica.
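The score distributions in Figure 1 arise from Monte Carlo resampling of graded attempts [36]. A minimal sketch of that idea is given below; the per-question success probabilities, point values and run count are invented placeholders, and the study's actual implementation may differ.

```python
# Illustrative Monte Carlo resampling of exam totals, in the spirit of
# Figure 1. All numbers below (success probabilities, point values,
# run count) are invented placeholders, not the study's data.
import random

def simulate_totals(p_correct, points, n_runs=10000, seed=42):
    """Simulate n_runs exam attempts; each question is answered
    correctly with its given probability and awards its point value."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_runs):
        total = sum(pt for p, pt in zip(p_correct, points)
                    if rng.random() < p)
        totals.append(round(total, 2))
    return totals

p_correct = [0.9, 0.6, 0.4, 0.8]  # placeholder per-question success rates
points = [0.2, 0.4, 0.4, 0.1]     # placeholder point values
totals = simulate_totals(p_correct, points)
mean_total = sum(totals) / len(totals)
```

Binning the returned totals (e.g., in 0.25-point increments, as in Figure 2) yields a frequency distribution whose mean converges on the expected score as the number of runs grows.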
Figure 2. Visualization of Monte Carlo simulations of average total score frequency, as achieved by the tested LLMs in examinations compared to that of students: (a) Examination 1, (b) Examination 2 and (c) Examination 3. The distribution of scores is plotted in increments of 0.25 points.
Figure 3. Visualization of Monte Carlo simulations of average scores, as achieved by the tested LLMs in examination categories compared to that of students: (a) Examination 1, (b) Examination 2 and (c) Examination 3. Shaded areas represent better performance by students or LLMs, respectively.
Figure 4. Visualization of correlation of facility index with LLM scores per category, post-linear regression analysis, using all available datapoints: (a) Examination 1—ChatGPT r2 = 0.2504, (b) Examination 1—DeepSeek r2 = 0.1717, (c) Examination 2—ChatGPT r2 = 0.0059, (d) Examination 2—DeepSeek r2 = 0.0721, (e) Examination 3—ChatGPT r2 = 0.2437 and (f) Examination 3—DeepSeek r2 = 0.1251. Convex hulls were shaded in for better visualization of data distribution, relevant to the regression line.
Figure 5. Correlation of facility index with LLM scores per category after linear regression analysis, shown as small multiples: (a) Examination 1—ChatGPT, (b) Examination 1—DeepSeek, (c) Examination 2—ChatGPT, (d) Examination 2—DeepSeek, (e) Examination 3—ChatGPT and (f) Examination 3—DeepSeek. Convex hulls are shaded for better visualization of the data distribution relative to the regression lines. Disregarded categories are omitted from the visualization. Horizontal regression lines indicate categories with no observed correlation.
Table 1. Average scores from the Monte Carlo analysis of the results obtained by the LLMs in the three Pharmacognosy I examinations. Total average scores and their standard deviations are given as point and percentage values.
LLM           Examination     Average [points (%)]     Standard Deviation [points (%)]
ChatGPT 4o    Examination 1   2.51 (50.2%)             0.26 (5.2%)
ChatGPT 4o    Examination 2   3.13 (62.6%)             0.33 (6.6%)
ChatGPT 4o    Examination 3   3.45 (69.0%)             0.16 (3.2%)
DeepSeek R1   Examination 1   1.86 (37.2%)             0.16 (3.2%)
DeepSeek R1   Examination 2   2.31 (46.2%)             0.28 (5.6%)
DeepSeek R1   Examination 3   3.08 (61.6%)             0.23 (4.6%)
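The Monte Carlo averages in Table 1 can be thought of as resampling graded responses to build a distribution of plausible total exam scores. The following is a minimal, hypothetical sketch of that idea; the per-question score lists are illustrative placeholders, not study data, and the exact resampling scheme used in the study may differ.

```python
import random
import statistics

def monte_carlo_totals(question_scores, n_sim=10_000, seed=42):
    """Resample one observed score per question (with replacement) and sum
    them, yielding a distribution of simulated total exam scores."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sim):
        # draw one graded score for each question position
        total = sum(rng.choice(scores) for scores in question_scores)
        totals.append(total)
    return statistics.mean(totals), statistics.stdev(totals)

# Placeholder data: three questions, each graded over repeated model runs
llm_scores = [[0.75, 1.0, 0.75], [0.5, 0.5, 0.25], [1.0, 1.0, 0.75]]
avg, sd = monte_carlo_totals(llm_scores)
```

Reporting the mean and standard deviation of the simulated totals mirrors the point and percentage values given in Table 1.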
Table 2. Coefficients of determination (r2) from linear regression analysis of facility indexes against LLM examination scores, per category.
ChatGPT 4o
  Examination 1
    Disregarded categories: C8
    Categories with no correlation: C1, C4, C5, C11, C12, C14, C15
    r2 per category: C2 (0.6938), C3 (0.4391), C6 (0.3273), C7 (0.7229), C9 (0.5423), C10 (0.1089), C13 (0.5111)
  Examination 2
    Disregarded categories: C1
    Categories with no correlation: C2, C6, C7, C8, C11, C12, C13, C14, C16
    r2 per category: C3 (0.9185), C4 (0.8193), C5 (0.2936), C9 (0.4717), C10 (0.9670), C15 (0.3466), C17 (0.4700), C18 (0.9370), C19 (0.1046)
  Examination 3
    Disregarded categories: C4, C5, C6, C7, C8, C10, C11, C13, C14, C16, C17
    Categories with no correlation: C1, C12
    r2 per category: C2 (0.9494), C3 (0.1120), C9 (0.2161), C15 (0.1684)

DeepSeek R1
  Examination 1
    Disregarded categories: C8
    Categories with no correlation: C1, C4, C5, C6, C10, C13, C14, C15
    r2 per category: C2 (0.2521), C3 (0.1007), C7 (0.9521), C9 (0.5423), C11 (0.9912), C12 (0.2500)
  Examination 2
    Disregarded categories: C1
    Categories with no correlation: C2, C5, C6, C8, C9, C11, C12, C13, C17
    r2 per category: C3 (0.9185), C4 (0.3530), C7 (0.8781), C10 (0.2370), C14 (0.2365), C15 (0.4014), C16 (0.6792), C18 (0.4918), C19 (0.9075)
  Examination 3
    Disregarded categories: C4, C5, C6, C7, C8, C10, C11, C13, C14, C16, C17
    Categories with no correlation: C1, C3, C9, C12, C15
    r2 per category: C2 (0.4652)
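The r2 values in Table 2 follow from simple linear regression of LLM category scores on facility indexes. A self-contained sketch of that computation is given below; the facility indexes and scores are placeholder values for illustration, not study data.

```python
import statistics

def r_squared(x, y):
    """Coefficient of determination for a simple linear regression of y on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    # least-squares slope and intercept
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # r2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Placeholder data: facility indexes vs. LLM scores within one category
fi = [0.35, 0.50, 0.62, 0.71, 0.88]
score = [0.25, 0.50, 0.50, 0.75, 1.00]
r2 = r_squared(fi, score)
```

An r2 near 1 indicates that LLM scores track item difficulty closely within a category, while values near 0 (the "no correlation" entries in Table 2) indicate that LLM performance is largely independent of how students found the items.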

Share and Cite

MDPI and ACS Style

Stoyanov, A.; Nedelcheva, A. Identifying Features of LLM-Resistant Exam Questions: Insights from Artificial Intelligence (AI)–Student Performance Comparisons. Sci 2025, 7, 183. https://doi.org/10.3390/sci7040183