Article

Artificial Intelligence Performance in Introductory Biology: Passing Grades but Poor Performance at High Cognitive Complexity

Department of Biology, Hamilton College, Clinton, NY 13323, USA
* Author to whom correspondence should be addressed.
Educ. Sci. 2025, 15(10), 1400; https://doi.org/10.3390/educsci15101400
Submission received: 24 July 2025 / Revised: 7 October 2025 / Accepted: 13 October 2025 / Published: 18 October 2025
(This article belongs to the Topic Generative Artificial Intelligence in Higher Education)

Abstract

The emergence of Artificial Intelligence (AI) has impacted the world of higher education, and institutions are faced with challenges in integrating AI into curricula. Within the field of biology education, there has been little to no research on the capability of AI to explain collegiate-level biological concepts. In this study, we evaluated the ability of ChatGPT-4, ChatGPT-3.5, Google’s Bard, and Microsoft’s Bing to perform on introductory-level college assessments. All four AIs were able to pass the biology course, with their degree of success depending largely on whether image-based assessments were included. With image-based questions, Bing and Bard received a D− and a D, respectively, and GPT-3.5 and GPT-4 both received a C−, compared to the average student grade of a B. Without image-based questions in the assessments, however, AI scores were a full letter grade higher. Additionally, AI performance was analyzed in relation to the cognitive complexity of each question, classified according to Bloom’s Taxonomy of learning. Performance by all four AIs dropped significantly on increasingly complex questions, while student performance remained consistent. Overall, this study evaluated the ability of different AIs to perform on collegiate-level biology assessments. By understanding their capabilities at different levels of complexity, educators will be better able to adapt assessments to AI ability, particularly through the use of image- and sequence-based questions, and to integrate AI into higher education curricula.

1. Introduction

Artificial intelligence (AI) is a technology that has recently exploded into our collective awareness with the release of publicly available services, such as OpenAI’s ChatGPT, Google’s Bard, and Microsoft’s Bing. These generative AI tools differ from traditional AI in their ability to use large datasets and deep learning techniques to summarize, compile, and generate novel content (Bandi et al., 2023). These “chatbots” have an easy-to-use interface that can produce human-like answers when posed with prompt questions. Generative AI technologies will fundamentally change the way we interact with each other and with the broader world. A report from Goldman Sachs predicts that over two-thirds of current occupational tasks will be partially automated by AI, resulting in a 7% increase in global GDP (Hatzius, 2023). Despite the potential benefits of AI technology, there are many concerns about the associated negative impacts. Institutions of higher education are raising alarms about academic integrity and unethical usage of these technologies in the classroom (Khlaif et al., 2023; Salloum, 2024). To address these concerns, educators are modifying assessment formats to limit AI interference (Schön et al., 2023), using AI-detection software (Walters, 2023), and creating new assessments that embrace AI usage (Michel-Villarreal et al., 2023; Nikolopoulou, 2024). A combination of these strategies will be required to both help students achieve critical competencies and embrace the realities of an AI-based future. However, before educators invest in different pedagogical approaches and curricular restructuring, it is critical to understand the capabilities of generative AI tools and their performance across assessment types. While our study focused on biology, the results inform collegiate-level curricula more broadly, as the assessments use formats common to many disciplines, including problem sets, exams, and papers.
Biology is a discipline that sits at the nexus of many fields; its concepts and skills are critical for students interested in medicine, agriculture, climate science, public health, and many other areas. Artificial intelligence has already significantly impacted biology research, with AI algorithms like AlphaFold capable of predicting protein structure (Jumper et al., 2021), and other tools aiding vaccine development and drug design (Thomas et al., 2022), cancer detection and diagnosis (Perez-Lopez et al., 2024), and genomic analysis (Lin & Ngiam, 2023). Artificial intelligence has impacted biology education as well, with the development of AI-based tools to assist both students and educators (Selvam, 2024). Student-facing biology education tools include AI interfaces for blind and visually impaired students (Mukhiddinov & Kim, 2021) and AI-enabled biology E-books to support study skills (Koć-Januchta et al., 2020). Learning outcomes for students using AI-enabled E-books were, however, comparable to those of students using traditional E-books (Koć-Januchta et al., 2020). AI machine learning models have also been used to predictively model undergraduate success in an introductory biology course based on student responses within the first few weeks of the course (Bertolini et al., 2021); the goal of that study was to identify key predictive indicators to support early intervention and increased student success. Other AI tools for biology educators include a machine-learning-based tool, EvoGrader, which accelerates assessment of student comprehension of natural selection (Moharreri et al., 2014), and AI-generated multiple-choice biology questions (Nasution, 2023). Nasution used OpenAI’s ChatGPT to write multiple-choice questions covering a range of biology concepts and found that 95% of the questions were accurate. Undergraduate biology students assessed these AI-generated questions; 79% of students found the questions to be relevant, and 72–73% reported that the questions were acceptable in terms of clarity and accuracy.
In order to embrace the benefits of AI in biology education and avoid the pitfalls of inappropriate usage, we must first understand generative AI’s current ability to perform on biology assessments. To date, no study has investigated generative AI performance in collegiate-level biology coursework. Here, we assessed the ability of four different AIs, Google’s Bard, Microsoft’s Bing, and OpenAI’s ChatGPT in both its free (GPT-3.5) and Premium Plus (GPT-4) versions, to perform in an introductory biology course. At Hamilton College, the introductory course, titled ‘BIO-100: Explorations in Biology’, uses a thematic approach to teach the five core concepts outlined in the National Science Foundation and American Association for the Advancement of Science’s call to action “Vision and Change in Undergraduate Biology” (NSF AAAS, 2009). The five core concepts include (1) evolution, (2) information flow, (3) transformation of energy and matter, (4) structure and function, and (5) biological interactions. The course enrolls approximately 200 students per year across multiple sections, each focused on a particular theme. Assessments in the course include three exams, a final exam, problem sets, and a final project, in addition to the laboratory component of the course. We assessed the four AIs’ ability to perform in a BIO-100 thematic section focused on genetic engineering (BIO-100F: Explorations in Biology: Genetic Engineering). The AIs were assessed based on their performance on exams, problem sets, and a final paper, but were not assessed on the lab work and oral presentation components included in this course. We found that both versions of ChatGPT (GPT-3.5 and 4) outperformed student averages. GPT-3.5 scored an overall average of 94.9%, GPT-4 scored an overall average of 92.3%, and students scored an overall average of 86.4%. Overall, AI performance was lower when answering higher-level questions that probe deeper understanding, but all four AIs were able to pass the introductory biology course. As reflected in the final scores, there were some areas where the AIs statistically outperformed students, and these capabilities will only increase over time as AI improves. It is critical that educators are aware of AI capabilities and consider curricular changes that both embrace the benefits of AI and continue to challenge students to learn deeply.

2. Materials and Methods

2.1. Data Collection

Four different generative artificial intelligences were used in this study, which was initiated in June 2023: Google’s Bard; Microsoft’s Bing; OpenAI’s free version, GPT-3.5; and the subscription-based premium version, GPT-4. Bard, Bing, and the free version of ChatGPT were accessed via the internet; for ChatGPT Premium Plus (GPT-4), an account was created with a paid subscription. Questions from BIO-100F Explorations in Biology: Genetic Engineering, from the fall 2022 semester, were input into all four generative AIs as prompts. Total prompts included five problem sets (38 questions in total), three exams (34 questions in total), the final exam (21 questions in total), and the directions for a final paper focused on a topic in genetic engineering. The exam and problem set questions were used as the sole prompts, without additional information. Each question was submitted a single time, with no follow-up prompts or feedback loops, to mimic the student experience of a single attempt at an assessment question. Prompts were submitted beginning in June 2023 and concluded by 20 July 2023. Outputs from the four generative AIs were compiled into documents for each assessment.

2.2. Grading Outputs

The compiled outputs from the generative AIs were uploaded to Gradescope, an online grading platform used in BIO-100 to facilitate team grading with standardized rubrics. Each AI was evaluated as an individual participant, with its exams and problem sets loaded into the BIO-100 Gradescope roster, and grading of these assessments was performed by the same instructional staff who evaluated the BIO-100 fall 2022 student work. This approach provided a comparative evaluation between the undergraduate students and the generative AIs. The same rubrics and point allocations were used for the AIs as for the fall 2022 students, with a few exceptions. At the time of the study, only Bing was capable of accepting image inputs; thus, the other three AIs automatically received zero points for questions that required image comprehension, which required the creation of a new scoring option. To facilitate more meaningful analysis, we also ran the statistical analyses (described below) on assessments both with and without these image-based questions.

2.3. Bloom’s Taxonomy

Bloom’s Taxonomy is a hierarchy that classifies learning objectives based on cognitive complexity, with basic recall at the lowest level and creation at the highest level. Each exam and problem set question was classified according to the revised levels of Bloom’s Taxonomy, as described by Anderson and Krathwohl (2001), which include Level 1 (Remember), Level 2 (Understand), Level 3 (Apply), Level 4 (Analyze), Level 5 (Evaluate), and Level 6 (Create). Examples of questions at various levels include the following:
Level 2, Understand: “Why is complementary base pairing critical for replication?” This question was categorized as Level 2 because students must both remember (Level 1) and understand (Level 2) complementary bases and what role they play in the process of replication.
Level 5, Evaluate: “Keeping ethical standards in mind, which model organism should you choose to study the role of four chambers in heart function? Which model organism should you choose to study contracting heart tissue?” This question was categorized as Level 5 because it requires evaluation using several different concepts. With questions classified according to Bloom’s Taxonomy level, both student and AI performance could be analyzed in relation to the depth of understanding.

2.4. Statistical Analysis

To determine whether differences in scores were statistically significant, Student’s t-tests were run between different comparison groups; a two-tailed distribution and unequal variance were assumed in all calculations. To compare AI and student performance, t-tests were performed between each AI and the class-average student scores on exams (both including and excluding questions with images) and on problem sets (both including and excluding questions with images). Variance in scores was compared using two-tailed F-tests. To analyze performance in relation to Bloom’s Taxonomy, scores for each question within a level were pooled to calculate averages and standard deviations. Comparisons across entities (AIs and students) and Bloom levels were performed using two-way ANOVA with post hoc Tukey’s HSD tests.
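For readers who wish to see how such comparisons can be set up, the following is a minimal sketch in Python; it is not the authors’ actual analysis pipeline, and the scores, column names, and group sizes are illustrative placeholders only.

```python
# Minimal sketch (not the authors' pipeline): Welch's t-test, an F-test on variances,
# and a two-way ANOVA with Tukey's HSD on per-question scores. All values below are
# hypothetical placeholders rather than the study's data.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical per-question percentage scores for one AI and the class average.
ai_scores = np.array([100, 85, 0, 70, 95, 60, 0, 88], dtype=float)
student_scores = np.array([90, 84, 78, 92, 88, 81, 86, 89], dtype=float)

# Two-tailed t-test assuming unequal variance (Welch's correction).
t_stat, t_p = stats.ttest_ind(ai_scores, student_scores, equal_var=False)

# Two-tailed F-test comparing score variances (ratio of sample variances).
f_stat = np.var(ai_scores, ddof=1) / np.var(student_scores, ddof=1)
dfn, dfd = len(ai_scores) - 1, len(student_scores) - 1
f_p = 2 * min(stats.f.sf(f_stat, dfn, dfd), stats.f.cdf(f_stat, dfn, dfd))
print(f"Welch t-test: t={t_stat:.2f}, p={t_p:.3f}; F-test: F={f_stat:.2f}, p={f_p:.3f}")

# Two-way ANOVA (entity x Bloom level) with a Tukey HSD post hoc test,
# using randomly generated placeholder scores (five questions per cell).
rng = np.random.default_rng(0)
rows = [
    {"entity": entity, "bloom_level": level, "score": rng.uniform(30, 100)}
    for entity in ["GPT-3.5", "Students"]
    for level in [1, 2, 3, 4]
    for _ in range(5)
]
df = pd.DataFrame(rows)
model = smf.ols("score ~ C(entity) * C(bloom_level)", data=df).fit()
print(anova_lm(model, typ=2))
groups = df["entity"] + "_L" + df["bloom_level"].astype(str)
print(pairwise_tukeyhsd(endog=df["score"], groups=groups, alpha=0.05))
```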

3. Results

Generative artificial intelligences with easy-to-use interfaces are changing the way people access and understand information. Their ability to synthesize information has raised concerns in higher education about inappropriate usage during exams and assignments. Before redesigning curricula, it is first necessary to understand AI’s ability to perform on collegiate-level material. We investigated the ability of four AIs (GPT-3.5, GPT-4, Bard, and Bing) to perform in an introductory biology course at Hamilton College (BIO-100: Explorations in Biology) and compared the results with student performance in the same course. The AIs were given a single attempt to answer the same questions found on problem sets and exams; their answers were compiled and graded using the same rubrics and methods used with students in the course. We compared performance by assessment type (exam vs. problem set), both between the AIs and between the AIs and the students. We also investigated performance in relation to cognitive complexity by using Bloom’s Taxonomy to categorize questions based on the depth of understanding required. Lastly, we calculated the overall final grades for the four AIs and compared them to student performance. The threshold for passing was set at 60% (D−); scores at or above 60% were considered passing, while scores below 60% were considered failing.
The content of BIO-100 is focused on the five core concepts: evolution, information flow, transformation of energy and matter, structure and function, and biological interactions. In addition to learning core content, students also develop skills related to data interpretation, including analysis of graphs, charts, phylogenetic trees, and DNA sequences. At the time of this study, only Bing could accept images into its prompt window, while GPT-3.5, GPT-4, and Bard could not assess images. Given this limitation, questions that involved image-based analysis received a score of zero, as these AIs could not accurately answer the question. These zero scores substantially reduced overall scores on problem sets and exams, and thus, we performed a comparative analysis in which image-based questions were removed. In addition to images, prompts containing DNA sequences posed a similar challenge: the AIs frequently failed to recognize these strings of letters as DNA sequences. As a result, the AIs could not perform tasks such as identifying gene open reading frames (sequences that code for protein), translating DNA sequences into amino acid sequences, or designing primers. Overall, excluding questions involving images and DNA sequences removed ~37% of the questions in the problem sets and ~36% of the questions in the exams. All comparative analyses described below were performed both on the full data set and on a data set that excluded the image- and DNA-based questions.
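For context on what these sequence-based tasks entail, the sketch below illustrates one of them, locating an open reading frame and translating it into an amino acid sequence, using Biopython; the example sequence and this implementation are purely illustrative assumptions and are not drawn from the course assessments.

```python
# Illustrative only: the kind of sequence task (ORF finding and translation) that the
# AIs struggled with. Uses Biopython's standard codon table; the sequence is a made-up
# example, not one taken from the BIO-100F assessments.
from Bio.Seq import Seq

dna = Seq("CCATGGCTGAAAGCGGTTAACTT")  # hypothetical coding-strand fragment

start = dna.find("ATG")                    # position of the first start codon
orf = dna[start:]
orf = orf[: len(orf) - (len(orf) % 3)]     # trim to a whole number of codons
protein = orf.translate(to_stop=True)      # translate until the first stop codon

print(f"ORF starts at position {start}; protein: {protein}")
```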

3.1. Performance by Assessment Type

The major forms of assessment in BIO-100F are problem sets and exams, including both midterm exams and a final exam. Problem sets in BIO-100F are intended to help students review and apply material in preparation for exams, which assess both understanding of core content and application of interpretative skills. Both assessment types consist of open-response questions; the course does not use multiple-choice, fill-in-the-blank, or true/false questions. Problem sets are open-book, open-note assessments that students have approximately one week to complete; as a result, they tend to contain more challenging questions, but students have more resources available to answer them. Exams, both midterms and the final, do not allow the use of resources and are completed within a defined time block; their questions are similar in open-response style to problem set questions but tend to focus on material the students have previously practiced. Given the different formats of these two assessments, we separated our analysis by assessment type in addition to looking at overall final grades, in which all scores are compiled into a final letter grade. To compare the performance of students and the generative AIs, t-tests were performed between each AI and the students, as well as between the AIs; significance was assumed at p-values < 0.05.
The generative AIs collectively scored poorly on both problem sets and exams when questions using images and DNA sequences were included (Figure 1). All AIs scored statistically worse on both problem sets and exams than undergraduate students, who scored an average of 86.6% on problem sets and 85.1% on exams (Figure 1A). Bing scored an average of 62.1% on problem sets and 56.2% on exams (lower than student averages, p-values < 0.01). Bard scored an average of 60.7% on problem sets and 66.8% on exams (lower than student averages, p-values < 0.01). GPT-3.5 scored an average of 63.7% on problem sets and 68.0% on exams (lower than student averages, p-values < 0.01). GPT-4 scored an average of 64.2% on problem sets and 65.4% on exams (lower than student averages, p-values < 0.01). There was no statistical difference between the assessment types: all entities (AIs and students) scored similarly on problem sets and exams, as expected given that the questions share the same open-response style. Average scores and standard deviations for each AI on problem sets and exams are plotted in Figure 1A. Between the AIs, there was no statistical difference in scores on either problem sets or exams. The low scores and poor performance could be attributed to the AIs’ inability to assess image- and DNA sequence-based questions.
All AIs had significantly greater variance in their scores compared to students (p-values < 0.01, F-test). This greater variance was likely due in part to the zero scores received on image- and DNA-based questions; thus, to compare AI performance to student data more accurately, we re-analyzed the scores using only questions that did not involve images or DNA sequences. It is important to note that excluding these questions reduced our sample size for problem set questions (from n = 38 to n = 24) and exam questions (from n = 55 to n = 35). With the exclusion of these questions, performance significantly increased for three of the four AIs (p-values < 0.05) (Table 1). Bard’s performance improved on both assessment types (60.7% to 71.6% on problem sets, 66.8% to 73.3% on exams), but these increases were not statistically significant. The other three AIs’ performance did increase statistically. GPT-3.5’s average problem set scores increased (63.8% to 90.8%, p-value < 0.01), as did its average exam scores (68.0% to 93.1%, p-value < 0.01). GPT-4’s scores significantly improved (64.2% to 84.6% on problem sets, p-value = 0.03; 65.4% to 91.6% on exams, p-value < 0.01). Surprisingly, Bing’s performance also increased on problem sets (62.1% to 84.4%, p-value < 0.01) and exams (56.3% to 77.0%, p-value < 0.01) despite being able to accept images in its prompts. This increase suggests that while Bing can accept images, it cannot adequately use and interpret them. Student average scores did not significantly change on problem sets (86.6% to 87.9%) or exams (85.1% to 85.4%) when these questions were excluded from the analysis, suggesting that students performed consistently across both question types. All average scores described above (on assessments with image-based questions included and excluded) and the associated p-values are displayed in Table 1, demonstrating the increased performance when images are excluded.
Comparison of AI and student scores excluding image- and sequence-based questions revealed different performance outcomes (Figure 1B, Table 2). Bing’s increased scores on problem sets (84.4%) and exams (77.0%) are not significantly different from student scores, an improvement given that they were significantly lower than student scores in the previous analysis (Figure 1B). GPT-4 showed a similar trend, with increased problem set scores (84.6%) and exam scores (91.6%) that are now statistically equivalent to student scores. Bard’s performance improved on problem sets (71.6%) and exams (73.3%), but its scores were still statistically lower than student scores on problem sets (p-value < 0.01, marked in red in Table 2) and exams (p-value = 0.01, marked in red in Table 2). Bard also performed statistically lower than GPT-3.5 on problem sets (p-value = 0.02) and lower than all other AIs on exams (p-values < 0.01, denoted with a single asterisk in Figure 1B). Interestingly, with the exclusion of these questions, GPT-3.5 scored statistically higher on exams than students (93.1% vs. 85.4%, p-value = 0.03, marked in green in Table 2) and higher than all other AIs except GPT-4 (p-values < 0.01, denoted with a double asterisk in Figure 1B). GPT-3.5’s problem set scores also increased (90.8%), but they were not statistically higher than student scores. Average scores and standard deviations on problem sets and exams for the image-excluded data set are plotted in Figure 1B. A comparison of AI scores to student scores for the image-excluded data set, with the associated significance values described above, is displayed in Table 2, demonstrating that Bard scored worse than students on both assessment types, while GPT-3.5 scored higher on exams.

3.2. Performance by Bloom’s Taxonomy Level

To evaluate the depth of the AIs’ understanding, we analyzed performance using Bloom’s Taxonomy of learning. Bloom’s Taxonomy classifies learning objectives into a hierarchy of cognitive complexity (Anderson & Krathwohl, 2001) (Figure 2A). Level 1 (remember) utilizes skills such as recall and definitions and is classified as the lowest level of cognitive complexity. Level 2 (understand) utilizes the ability to explain. Levels 3 (apply), 4 (analyze), and 5 (evaluate) are classified as mid-tier cognitive complexity educational goals and encompass skills such as implementation, testing/experimentation, and argumentation. Level 6 (create) is classified as the highest level of cognitive complexity and requires students to produce new work. Additional descriptions of the six levels of Bloom’s Taxonomy can be found in Figure 2A. We assigned a Bloom level to each question in our data set to facilitate analysis of AI performance by cognitive complexity. A sample of representative questions can be seen in Appendix A (Table A1), along with their assigned Bloom’s Taxonomy level and an AI output. Although Bloom’s Taxonomy extends to level 6, the majority of the BIO-100 questions fall within levels 1–4, with two or fewer questions at levels 5 and 6; analysis thus focused on levels 1–4 (Figure 2A,B). BIO-100 does include a final project involving a research paper, which qualifies as level 6, the creation of original work. We discuss AI performance on this written assignment in the section below.
The performance of the AIs was impacted by the cognitive complexity of the question, with a significant decline in performance at increasing levels of complexity. Student performance remained relatively constant (no statistical change) across all Bloom levels, while AI performance dropped with increasing complexity (Figure 2). As shown in Figure 2B,C, as well as Table 3, AI scores were lower with increasing Bloom levels. ANOVA tests confirmed (p-value < 0.01) that there is a statistically significant difference in performance between Bloom levels. ANOVA analysis also confirmed statistical differences between the AIs and students, and this was true for the data sets that both included (Figure 2B) and excluded (Figure 2C) questions with images and DNA sequences. Interestingly, student performance did not decrease with increasing Bloom level; student scores did not significantly change across any Bloom level comparison, both including (Figure 2B) and excluding (Figure 2C) image-based questions. AI performance, however, did change across Bloom levels. The AIs performed best at Level 1 (remember) and Level 2 (understand), but dropped significantly at Levels 3 (apply) and 4 (analyze) (p-value < 0.01). The reduction in performance by Bloom level was striking when image and DNA sequence questions were included in the data set. Bing and Bard dropped to a failing level (below 60%) at Bloom level 3 (apply); GPT-3.5 and GPT-4 scored slightly above a failing score at this level. All four AIs dropped below 50% at level 4 (analyze) (Table 3). Bing’s performance at level 4 (21.6%) was statistically lower than its performance at both level 1 (70.2%) and level 2 (83.4%) (p-value < 0.01) (Figure 2B). Bard’s performance at level 4 (45.4%) was statistically lower than its performance at level 2 (82.6%) (p-value < 0.01) (Figure 2B). GPT-3.5’s reduction in performance from level 2 (96.1%) to level 3 (68.4%) was significant (p-value = 0.01), and its further drop at level 4 (13.2%) was also significant (p-value < 0.01) (Figure 2B). GPT-4’s reduction in performance from level 2 (94.4%) to level 3 (60.3%) was significant (p-value < 0.01), and its further drop at level 4 (22.6%) was also significant (p-value < 0.01) (Figure 2B).
This reduction could be attributed to the inability to assess images and DNA sequences, given that the skills of application and analysis often require the use of image-based or graphical information. However, even when these image-based questions were removed, some reductions in scores were still statistically significant (Figure 2C). Bard dropped significantly from 82.0% at level 2 to 43.3% at level 4 (p-value = 0.03) (Figure 2C). GPT-3.5 scored 32.0% at level 4, a significant drop compared to its scores at levels 1, 2, and 3 (p-values < 0.01) (Figure 2C). Bing and GPT-4 had lower average scores at each successive level (Table 3, Figure 2C), but these reductions were not statistically significant. As in the previous analysis of the full data set, student performance was unchanged across Bloom levels when image-based questions were excluded (Figure 2C).

3.3. Final Paper

In addition to problem sets and exams, students in BIO-100F complete a final project focused on a topic related to genetic engineering. Students select the topic, such as CRISPR-based gene therapy, and investigate how this topic intersects with the five core concepts of the class: (1) evolution, (2) information flow, (3) transformation of energy and matter, (4) structure and function, and (5) biological interactions. Students work in teams to investigate their topic and give a final oral presentation to the class. In addition to the presentation, they individually write a paper on their topic to demonstrate their understanding. While the AIs were unable to give an oral presentation, we prompted them to produce the final paper with the same instructions provided to students: to select a genetic engineering topic and to discuss its relationship to the five core concepts, as well as its social implications. The paper must include citations from primary literature articles and secondary sources, and should be approximately 3–5 pages in length (approximately 1000 words). This assessment falls within Bloom level 6, as it requires the creation of novel work. When prompted with the assignment directions, all four AIs were able to select a relevant topic (two selected genetically engineered mosquitoes, two selected genetically modified crops) and produce accurate information in each of the five core concept areas. However, all of the AIs produced a paper that was insufficient in length despite the original prompt and required additional prompting to produce more extended text. Three of the AIs (Bing, Bard, and GPT-3.5) had insufficient citations, i.e., they either lacked accurate references or did not include references, while GPT-4 included accurate and sufficient citations for its content. The largest distinguishing feature between the different AIs was the depth of the content. Bing scored 50% on the paper and had the shallowest response, lacking specifics on genetic modification, protein function, and organismal impact. Bard and GPT-3.5 were similarly shallow, scoring 65% and 70%, respectively; they also lacked specifics on gene, protein, and organismal information. GPT-4 produced the best paper of the four AIs, with a score of 85%. This paper was on the same level as those of the students, who scored an average of 90% on the assignment. GPT-4 was able to describe the specifics of the gene modification and its impact from the molecular to the ecosystem level, and to include information on trait evolution, energy transformation, and social impact.
The paper produced by GPT-4 was difficult to distinguish from similar work submitted by students and sufficiently met the expectations of the assignment. GPT-4’s ability to replicate expert-level explanations has been documented in other comparison studies, including higher performance than GPT-3.5 and other AIs on PhD-level medical questions (Khosravi et al., 2024), clinical decision-making (Lahat et al., 2024), emergency medicine examinations (Liu et al., 2024), and collegiate-level coding (Yeadon et al., 2024). GPT-4’s outsized performance is likely due to its larger parameter count. GPT-4 is based on 1.76 trillion parameters, a ten-fold increase over GPT-3.5’s 175 billion parameters, which provides the model with enhanced creativity, reasoning, accuracy, and contextual awareness (Annepaka & Partha, 2025). This increased capacity likely allows GPT-4 to produce writing close to student level, unlike GPT-3.5, Bard, and Bing, whose papers were clearly distinguishable from students’ work.

3.4. Final Grades and Rankings

We calculated the final grades of the four AIs with and without image-based questions and compared them with student performance. The original grading breakdown for BIO-100F included a lab component, which comprised 30% of the overall grade, as well as a final project (an oral presentation and a paper) comprising 10% of the overall grade. Because the AIs could not complete the lab component, we distributed the weight of the lab equally across the problem set and exam percentages, and adjusted the students’ scores accordingly. Additionally, the final project component, worth 10% of the final grade, was altered to count only the paper.
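To illustrate the reweighting arithmetic, a minimal sketch is shown below; because the full syllabus breakdown beyond the 30% lab and 10% final project is not reproduced here, the remaining weights and the component scores are placeholders rather than the actual BIO-100F values.

```python
# Hypothetical reweighting sketch: drop the lab (30%) and split its weight equally
# between problem sets and exams; the final project counts only the paper.
# All weights other than lab (0.30) and final project (0.10), and all component
# scores, are placeholders, not the actual BIO-100F syllabus values.
weights = {"problem_sets": 0.25, "exams": 0.35, "lab": 0.30, "final_paper": 0.10}

lab_weight = weights.pop("lab")
weights["problem_sets"] += lab_weight / 2
weights["exams"] += lab_weight / 2
assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights still sum to 100%

# Example final-grade calculation with placeholder component averages (in %).
component_scores = {"problem_sets": 85.0, "exams": 88.0, "final_paper": 75.0}
final_grade = sum(weights[k] * component_scores[k] for k in weights)
print(f"Adjusted weights: {weights}")
print(f"Final grade: {final_grade:.1f}%")
```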
With image-based questions included, Bing received a letter grade of D− (60.1%), Bard received a letter grade of D (67.9%), GPT-3.5 received a letter grade of C− (72.6%), and GPT-4 received a letter grade of C− (72.3%). Students received a higher final grade, with a class average of a B (86.4%). With the exclusion of image-based questions, Bing received a letter grade of C (76.0%), Bard received a letter grade of C (73.0%), GPT-3.5 received a letter grade of A− (92.4%), and GPT-4 received a letter grade of A− (92.1%) (Table 4). The exclusion of image-based questions raised the AIs’ final grades by anywhere from one to more than two letter grades. It is interesting to note that the final paper, worth 10% of the overall grade, caused GPT-3.5’s and Bing’s scores to drop by half a letter grade. Without the paper weighted into the overall final grade, Bing would have received a C+ (79.1%), and GPT-3.5 would have received an A (94.9%).
While the final grade calculations reveal the overall grade each AI received in BIO-100F, the exam scores without image-based questions provided the most statistical power to rank the entities (Table 5). We found that GPT-3.5 was the highest performer, with 93.1% on exams, which was statistically higher than students, Bard, and Bing (p-values < 0.01), while not significantly different from GPT-4. GPT-4 was second with 91.6%, scoring statistically higher than Bard and Bing and not statistically different from GPT-3.5 or students. Students ranked third with 85.4%, which was statistically higher than Bard, lower than GPT-3.5, and not statistically different from GPT-4 or Bing. Bing ranked fourth with 77.0%, scoring statistically lower than GPT-3.5 and GPT-4 and not statistically different from students or Bard. Bard was the lowest performer, with an average exam score of 73.3%, which was statistically lower than GPT-3.5, GPT-4, and students, but not statistically different from Bing.

4. Discussion

To understand generative AI’s capacity in higher education, we assessed the performance of GPT-4, GPT-3.5, Bing, and Bard in a collegiate introductory biology course. After prompting the AIs with the exams, problem sets, and final paper from Hamilton College’s course BIO-100F: Explorations in Biology, the same instructional staff who evaluated the BIO-100 student work graded the outputs based on pre-established rubrics. Overall, the AIs performed poorly when required to analyze image- and DNA sequence-based questions, and also performed poorly at higher levels of cognitive complexity as defined by Bloom’s Taxonomy. When image- and DNA sequence-based questions were removed from the assessments, performance improved, with GPT-3.5 and GPT-4 receiving scores higher than students. However, AI performance was still significantly lower than that of students at the highest tested Bloom level (level 4), even with images and DNA sequences excluded. While this study is limited to a single course at a single institution, it provides evidence of AI’s ability to perform, albeit poorly, at a collegiate level. While this study is also limited to the discipline of biology, it provides context and a roadmap for educators in a range of disciplines, encouraging the use of graphical, image-based analysis, as well as higher Bloom level assessments, to capture student learning.
The calculation of final grades resulted in Bing receiving a D−, Bard receiving a D, GPT-3.5 receiving a C−, and GPT-4 receiving a C− (Table 4). This performance was significantly impacted by the AIs’ inability to interpret images and DNA sequences, and thus a secondary analysis was performed without these image-based questions, which revealed statistically significant improvement for all AIs except Bard (Table 1). With the exclusion of image-based questions, the final grade calculation resulted in Bing receiving a C, Bard receiving a C, GPT-3.5 receiving an A−, and GPT-4 receiving an A− (Table 4). When compared to the average student in BIO-100F, all AIs scored statistically worse on both problem sets and exams when images and DNA sequence-based questions were included (Figure 1A). However, this result changed when assessing performance without image-based questions: the performance of many of the AIs was statistically indistinguishable from that of students (Figure 1). Exams without images provided the best data set for comparative ranking, and based on these scores, GPT-3.5 performed the highest of the group, followed by GPT-4, students, Bing, and lastly Bard (Table 5). Further analysis used Bloom’s Taxonomy to better understand AI performance in relation to cognitive complexity. AI scores were lower with increasing Bloom levels (Figure 2B,C), revealing reduced AI capabilities at tasks involving application, analysis, and evaluation.
One of the most impactful results was the statistically significant outperformance of GPT-3.5 (93.1%) compared to students (85.4%) on exam averages with images and DNA sequences removed (p-value = 0.03) (Table 2). This significant difference in scores highlights the justifiable concern about inappropriate AI usage by students. However, this analysis also informs possible forms of assessment that will highlight student learning. AI performance was only better than student averages when images and DNA sequence-based questions were excluded; assessments that include this content will be the most informative of student performance. The removed questions constituted over 30% of the assessment content, and thus their removal could over-inflate AI performance. Their removal does, however, allow for greater generalizability, as DNA sequences are specific to biology, and image-based graphs and data analysis tend to be more heavily represented in the sciences. By removing these questions, we are able to assess purely text-based questions for a more accurate representation of AI ability. Furthermore, GPT-3.5 performed poorly at Bloom level 4 (32%), while students performed far better (86%) (p-value < 0.01) (Table 3, Figure 2). These results illustrate the importance of evaluating higher cognitive levels, such as analysis and evaluation. GPT-3.5 was the only AI to statistically outperform students in any category. Bard, by comparison, scored lower than student averages on both problem sets (71.6% vs. 87.9%, p-value < 0.01) and exams (73.3% vs. 85.4%, p-value = 0.01) (Table 2). Bard’s low performance could be attributed to its underlying language model, Google’s LaMDA (Language Model for Dialogue Applications), which, at the time of the investigation, carried a caution about its experimental nature, reminding users of its potential to fabricate information (Rudolph, 2023). Further, Bard has shown extremely high rates of fabricating references compared to GPT-4 and GPT-3.5; however, all LLMs have been documented to fabricate at alarming rates (Chelli et al., 2024).
When assessing AI performance in relation to cognitive complexity, both GPT models performed exceedingly well at the lower Bloom levels. Using the data set that excluded images and DNA sequences, GPT-3.5 and GPT-4 scored 100% and 98.4%, respectively, at Bloom level 1 (remember) (Table 3). Similarly, at Bloom level 2 (understand), GPT-3.5 scored 99.3% and GPT-4 scored 97.5% (Table 3). GPT-3.5 maintained high performance through Bloom level 3 (apply) (92.9%, Table 3) before dropping significantly at Bloom level 4 (analyze) (32%, Table 3, Figure 2C). Interestingly, while GPT-4’s average scores decrease across Bloom levels, there is no statistically significant reduction in performance, similar to students’ more consistent performance across the levels (Figure 2C). GPT-4 differs from GPT-3.5 at Bloom level 4, performing statistically better (70.0% vs. 32.0%, Table 3). This trend of strong GPT-3.5 performance at lower levels and GPT-4 outperformance at higher levels is consistent with other studies: on medical radiology exams, GPT-4 scored higher on “higher order” questions, while GPT-3.5 scored higher on “lower order” questions (Bhayana et al., 2023). It is important to note that the exclusion of image- and DNA sequence-based questions limited the question sample size at higher cognitive levels. When image- and DNA sequence-based questions were not removed from the analysis, GPT-3.5 and GPT-4 performed statistically worse at each Bloom level comparison from level 2 to level 4 (Figure 2B). On questions involving DNA sequences, AI outputs often included “hallucinations”, a term used to describe AI’s willingness to confidently fabricate content (Sun et al., 2024). When asked to perform specific sequence-based tasks, such as finding a gene, identifying the template or coding strand, or designing primers, the AIs often output correct general instructions but failed to reach a correct answer. Other studies have documented hallucinations in scientific writing; in a study using GPTs to write literature reviews, 55% of GPT-3.5 citations and 18% of GPT-4 citations were fabricated (Walters & Wilder, 2023). Overall, the AIs seem to perform poorly in the middle levels of Bloom’s Taxonomy, while excelling at both the lower levels and the top level of creation (level 6) (Figure 2A).
It is speculated that AI performance will continue to grow and improve, both in its ability to provide accurate information and in its ability to perform higher-level functions like data analysis and interpretation. Some studies have shown that GPT-4 can pass the uniform bar examination (Katz et al., 2024); however, other studies have disputed these claims (Martínez, 2025). Another study has shown that GPT-4 significantly increased MCAT performance relative to GPT-3.5 (Kipp, 2024). In a study similar to ours, GPT-4 was shown to meet or exceed student performance on PhD-level biomedical science exams (Stribling et al., 2024). With access to AI tools that can pass these post-graduate exams and, as we demonstrate here, receive excellent scores on collegiate-level biology assessments, higher education instructors must be prepared to adapt their curricula and methods of evaluation. Based on our results (Figure 2), assessments should focus on the higher levels of cognitive complexity, such as Bloom’s levels 3–5, which involve application of knowledge, analysis, and evaluation. Additionally, asking students to analyze and interpret images and graphs, and to manipulate DNA sequences (in the specific case of the biological sciences), will distinguish their knowledge from AI-supported work. While we found that the AIs performed poorly at Bloom levels 3–5, it should be noted that they are capable of Bloom level 6, i.e., the creation of original work. GPT-4 produced a sufficient paper on a topic in genetic engineering that was difficult to distinguish from a first-year college student’s work. Possible solutions for educators include asking for greater specificity, as many of the AIs struggled with scaling information, i.e., connecting broad biological concepts to a specific gene, its protein, and the protein’s associated function in the context of the cell.
Looking to the future, it will be more productive for educators to actively embrace AI in the classroom rather than pursue anti-usage policies. Recent studies have shown that university students are more likely to use ChatGPT when faced with high workloads, stress, and time pressure (Abbas et al., 2024). By integrating AI into the classroom, educators can utilize its potential benefits, such as personalized learning support and tutoring (Baillifard et al., 2025), real-time feedback (Xu et al., 2023), and tools for creating new learning materials (Sajja et al., 2024). One potential application at the collegiate level is the use of AI as a reading companion for primary literature. Initial studies have investigated the effective usage of AI-based reading assistants, such as ExplainPaper and SciSpace, by first-year college students (Watkins, 2025). That investigation revealed that AI reading assistants are beneficial for students with reading comprehension difficulties and for non-native speakers (Watkins, 2025). Other AI tools are currently being explored for higher education English language learning (Zhai & Wibowo, 2023; Pan et al., 2024) and for writing assistance that provides feedback (Nazari et al., 2021). However, there is currently little development of AI-supported reading tools for primary literature. It is often challenging for undergraduate STEM students to begin reading primary literature, as it is written for an expert-level audience and assumes extensive jargon and prior knowledge (Kozeracki et al., 2006; Hoskins et al., 2011). Barriers to AI tool usage have also been investigated: students may avoid using AI in educational settings due to a lack of familiarity or a preference for traditional methods, though few students expressed negative opinions about the value and utility of AI (Hanshaw & Sullivan, 2025). Over-reliance on AI can also hinder student motivation for skill acquisition, critical evaluation, and self-reflection (Melisa et al., 2025).
In this study, we investigated the ability of four AIs to perform on collegiate-level coursework, specifically measuring their performance on introductory biology assessments. The four AIs, GPT-4, GPT-3.5, Bing, and Bard, were all capable of receiving a passing grade (60% or above) in BIO-100 (Table 4), albeit with very poor performance. The low scores were linked to an inability to interpret images (including graphical representations of data) and DNA sequences. When assessment questions containing this content were removed from the analysis, AI performance improved significantly, with some AIs (GPT-4 and GPT-3.5) receiving higher final grades than students. These results are concerning for reasons of academic integrity and the inappropriate usage of these tools to aid student performance. However, college educators should consider the limitations of these AIs at higher levels of cognitive complexity: when challenged with higher Bloom levels (apply and analyze), their performance was significantly weaker. Together, these results suggest that college educators should continue to employ assessments that challenge their students to analyze and apply knowledge, particularly assessments that involve images, graphs, and, in the case of biology, DNA sequences. Based on these findings, our biology curriculum is shifting towards final presentations, poster sessions, and other oral formats for students to showcase their ability to analyze and interpret biological concepts and data. Educators in fields beyond biology can similarly leverage these approaches by configuring assessments to focus on higher-order cognitive tasks, such as analysis, application, evaluation, and creation; such assessments would be even more effective if they included image-based content. Another approach is to base assessments on novel scenarios and applications that fall outside of AI training data. In our biology curriculum, we are utilizing questions that probe student understanding with data that do not fit any known system or example. The future of higher education will undoubtedly be impacted by continuously improving AIs, and embracing their potential while mitigating negative impacts will be critical for student learning.

Author Contributions

M.E.R., M.N. and N.J.N. contributed to the acquisition, analysis, and interpretation of data for the study. M.E.R., M.N. and N.J.N. drafted and revised the manuscript. N.J.N. obtained funding for the study and designed the experimentation. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Hamilton College through the Summer Science Research Fellowship funded by the Stephen Harper Kirner Chair, the Library, and Information Technology Services Microgrant for the Exploration of Artificial Intelligence. The APC was funded by the Dean of the Faculty’s Office at Hamilton College.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Sample question inputs categorized by Bloom’s Taxonomy Level and sample responses from generative artificial intelligences with the score received.
Bloom’s Taxonomy Level | Sample Input/Prompt | Sample Output | Entity and Score Received
Level 1
Prompt: “Your body produces a liter of mucus per day secreted from a variety of cell types. One of the major components of mucus is mucin, a protein with many chemical modifications. Explain the processes of going from MUC1 (the gene that codes for mucin) to the secretion of this protein outside of the cell. Answer this question based on what we have learned in class, NOT looking up mucin production from external sources!”
Output: “Sure, I can help you with that.
The MUC1 gene is transcribed into mRNA, which is then translated into a protein called mucin. Mucin is a large complex protein that is made up of many different amino acids. It is also heavily glycosylated, which means that it has many sugar molecules attached to it.
The mucin protein is then processed in the Golgi apparatus. The Golgi apparatus is a series of membrane-bound compartments that are responsible for processing and exporting proteins. In the Golgi apparatus the mucin protein is cleaved into smaller pieces and the sugars are attached to the protein.
The processed mucin protein is then processed into vesicles. These vesicles are then transported into the plasma membrane, where they fuse with the membrane and release the mucin protein outside of the cell.
The secretion of mucin is a complex process that involves many different steps. The MUC1 gene, mRNA, protein synthesis, Golgi apparatus, vesicles, and plasma membrane all play a role in the secretion of mucin.
Here is a diagram of the process of mucin secretion:
DNA → mRNA → Protein synthesis → Golgi apparatus → Vesicles → Plasma membrane → Mucin secretion
The following are some of the key steps in the process of mucin secretion:
  • Transcription: The MUC1 gene is transcribed into mRNA by RNA polymerase II.
  • Translation: The mRNA is translated into a protein called mucin by ribosomes.
  • Processing: The mucin protein is processed in the Golgi apparatus.
  • Packaging: The processed mucin is packaged into vesicles.
  • Transport: The vesicles are transported to the plasma membrane.
  • Fission: The vesicles fuse with the plasma membrane and release the mucin protein outside of the cell.
The secretion of mucin is a tightly regulated process. The amount of mucin that is secreted is controlled by a number of factors, including the type of cell, the environment, and the presence of hormones.
Mucin is a very important protein. It plays a role in a variety of functions, including:
  • Protecting the lining of the airways, digestive tract, and other organs.
  • Lubricating the surfaces of the body.
  • Trapping bacteria and other foreign particles.
  • Helping to clear mucus from the body.
Mucin is a fascinating protein that plays a vital role in many different bodily functions. The process of mucin secretion is a complex and tightly regulated process.”
Bard, 95%
Level 2
Prompt: “Why is complementary base pairing critical for replication?”
Output: “Complementary base pairing is critical for DNA replication because it allows the two strands of DNA to be separated and used as templates to build two new strands with complementary bases. The hydrogen bonds between the bases hold the two strands of DNA together and serve as a way for DNA to replicate. The pairing of complementary bases in DNA means that the information contained within each strand is redundant.
I hope this helps. Let me know if you have any other questions.”
Bing, 100%
Level 3
Prompt: “Given that codons consist of three nucleotides, how many amino acids could the genetic code support?”
Output: “The genetic code is the set of rules that determines how the nucleotide sequence of a gene is translated into the corresponding sequence of amino acids in a protein. Since codons consist of three nucleotides, there are 64 possible combinations (4^3) of nucleotides to form codons.”
GPT-3.5, 80%
Level 4
Prompt: “Assuming the promoter and terminator lay just outside of the sequence above and also assuming there are no introns (no splicing occurs), what is the sequence of the RNA?”
Output: “To determine the sequence of RNA transcribed from the given DNA sequence “CTACGCTACAGTTTAAGCCAGGATAGCAGACGGCATATAAATTACGG,” we need to identify the complementary RNA bases. RNA is synthesized using complementary base pairing with the DNA template strand, where thymine (T) is replaced by uracil (U) in RNA.
The template strand is:
CTACGCTACAGTTTAAGCCAGGATAGCAGACGGCATATAAATTACGG
Using the rules of complementary base pairing and replacing T with U, the corresponding RNA sequence is:
GAUGCGAUGUCAAAACGGUCCUAUCGUCUGCCGUAUAUAAUUACGCC
Therefore, the sequence of the RNA transcribed from the given DNA sequence is “GAUGCGAUGUCAAAACGGUCCUAUCGUCUGCCGUAUAUAAUUACGCC.””
GPT-4.0, 70%
Level 5
Prompt: “Keeping ethical standards in mind, which model organism should you choose to study the role of four chambers in heart function? Which model organism should you choose to study contracting heart tissue?”
Output: “Ethical standards for animal research are important to ensure the humane care and welfare of the animals used, as well as the scientific validity and social responsibility of the research. Some of the general principles that guide ethical animal research are replacement, reduction, refinement, and respect.
To study the role of four chambers in heart function, you should choose a model organism that has a similar heart structure to humans, but also has a lower ethical cost. One possible option is Chicken, which has a four-chambered heart with two atria and two ventricles, but is considered less sentient and more replaceable than mammals.
To study contracting heart tissue, you should choose a model organism that has a simple heart structure that can be easily manipulated and observed, but also has a lower ethical cost. One possible option is Tunicate, which has a single layer of contracting cells that forms a primitive heart, but is considered less sentient and more replaceable than vertebrates.”
Bing, 0%
Level 6
Prompt: “GTACGTCATGCCGTACTGTCCATCTGGCTGCCTGAT” “Design primers (6 base pairs in length) to amplify this gene. Report your primers in 5′ to 3′ orientation. How large would your PCR product be?”
Output: “To amplify the gene from the start codon to the stop codon, you will need to design 6 bp forward and reverse primers to amplify the gene. Here are some guidelines for designing your PCR primers:
  • Aim for the GC content to be between 40 and 60% with the 3′ of a primer ending in G or C to promote binding. This is known as GC Clamp.
  • A good length for PCR primers is generally around 18–30 bases.
  • Try to make the melting temperature (Tm) of the primers between 65 °C and 75 °C, and within 5 °C of each other.
Based on these guidelines, you can design your forward and reverse primers as follows:
Forward primer: GTCATG
Reverse primer: AGCTCA
The product size will be ~27 bp.
I hope this helps!”
Bing, 25%

References

  1. Abbas, M., Jam, F. A., & Khan, T. I. (2024). Is it harmful or helpful? Examining the causes and consequences of generative AI usage among university students. International Journal of Educational Technology in Higher Education, 21(1), 10–22. [Google Scholar] [CrossRef]
  2. Anderson, L. W., & Krathwohl, D. R. (2001). A taxonomy for learning, teaching and assessing: A revision of Bloom’s Taxonomy of educational objectives: Complete edition. Longman. ISBN 978-0801319037. [Google Scholar]
  3. Annepaka, Y., & Partha, P. (2025). Large language models: A survey of their development, capabilities, and applications. Knowledge and Information Systems, 67, 2967–3022. [Google Scholar] [CrossRef]
  4. Baillifard, A., Gabella, M., Lavenex, P. B., & Martarelli, C. S. (2025). Effective learning with a personal AI tutor: A case study. Education and Information Technologies, 30(1), 297–312. [Google Scholar] [CrossRef]
  5. Bandi, A., Adapa, P. V. S. R., & Kuchi, Y. E. V. P. K. (2023). The power of generative AI: A review of requirements, models, input–output formats, evaluation metrics, and challenges. Future Internet, 15(8), 260. [Google Scholar] [CrossRef]
  6. Bertolini, R., Finch, S. J., & Nehm, R. H. (2021). Testing the impact of novel assessment sources and machine learning methods on predictive outcome modeling in undergraduate biology. Journal of Science Education and Technology, 30(2), 193–209. [Google Scholar] [CrossRef]
  7. Bhayana, R., Bleakney, R. R., & Krishna, S. (2023). GPT-4 in radiology: Improvements in advanced reasoning. Radiology, 307(5), 230987. [Google Scholar] [CrossRef]
  8. Chelli, M., Descamps, J., Lavoué, V., Trojani, C., Azar, M., Deckert, M., Raynier, J., Clowez, G., Boileau, P., & Ruetsch-Chelli, C. (2024). Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews: Comparative analysis. Journal of Medical Internet Research, 26, e53164. [Google Scholar] [CrossRef]
  9. Hanshaw, G., & Sullivan, C. (2025). Exploring barriers to AI course assistant adoption: A mixed-methods study on student non-utilization. Discover Artificial Intelligence, 5, 178. [Google Scholar] [CrossRef]
  10. Hatzius, J. (2023). The potentially large effects of artificial intelligence on economic growth (Briggs/Kodnani). Goldman Sachs, 1, 268–296. [Google Scholar]
  11. Hoskins, S. G., Lopatto, D., & Stevens, L. M. (2011). The C.R.E.A.T.E. Approach to primary literature shifts undergraduates’ self-assessed ability to read and analyze journal articles, attitudes about science, and epistemological beliefs. CBE—Life Sciences Education, 10(4), 368. [Google Scholar] [CrossRef]
  12. Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., … Back, T., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589. [Google Scholar] [CrossRef]
  13. Katz, D. M., Bommarito, M. J., Gao, S., & Arredondo, P. (2024). GPT-4 passes the bar exam. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 382(2270), 20230254. [Google Scholar] [CrossRef] [PubMed]
  14. Khlaif, Z. N., Mousa, A., Hattab, M. K., Itmazi, J., Hassan, A. A., Sanmugam, M., & Ayyoub, A. (2023). The potential and concerns of using AI in scientific research: ChatGPT performance evaluation. JMIR Medical Education, 9, e47049. [Google Scholar] [CrossRef]
  15. Khosravi, T., Rahimzadeh, A., Motallebi, F., Vaghefi, F., Mohammad Al Sudani, Z., & Oladnabi, M. (2024). The performance of GPT-3.5 and GPT-4 on genetics tests at PhD-level: GPT-4 as a promising tool for genomie medicine and education. Journal of Clinical and Basic Research, 8(4), 22–26. Available online: https://jcbr.goums.ac.ir/article-1-476-en.html (accessed on 7 October 2025).
  16. Kipp, M. (2024). From GPT-3.5 to GPT-4.o: A leap in AI’s medical exam performance. Information, 15(9), 543. [Google Scholar] [CrossRef]
  17. Koć-Januchta, M. M., Schönborn, K. J., Tibell, L. A. E., Chaudhri, V. K., & Heller, H. C. (2020). Engaging with biology by asking questions: Investigating students’ interaction and learning with an artificial intelligence-enriched textbook. Journal of Educational Computing Research, 58(6), 1190–1224. [Google Scholar] [CrossRef]
  18. Kozeracki, C. A., Carey, M. F., Colicelli, J., & Levis-Fitzgerald, M. (2006). An intensive primary-literature–based teaching program directly benefits undergraduate science majors and facilitates their transition to doctoral programs. CBE—Life Sciences Education, 5(4), 340–347. [Google Scholar] [CrossRef]
  19. Lahat, A., Sharif, K., Zoabi, N., Patt, Y. S., Sharif, Y., Fisher, L., Shani, U., Arow, M., Levin, R., & Klang, E. (2024). Assessing generative pretrained transformers (GPT) in clinical decision-making: Comparative analysis of GPT-3.5 and GPT-4. Journal of Medical Internet Research, 26, e54571. [Google Scholar] [CrossRef] [PubMed]
  20. Lin, J., & Ngiam, K. Y. (2023). How data science and AI-based technologies impact genomics. Singapore Medical Journal, 64(1), 59–66. [Google Scholar] [CrossRef] [PubMed]
  21. Liu, M., Okuhara, Y., Chang, X., Shirabe, R., Nishiie, Y., Okada, H., & Kiuchi, T. (2024). Performance of ChatGPT across different versions in meical licensing examinations worldwide: Systematic review and meta-analysis. Journal of Medical Internet Research, 26, e60807. [Google Scholar] [CrossRef] [PubMed]
  22. Martínez, E. (2025). Re-evaluating GPT-4’s bar exam performance. Artificial Intelligence and Law, 33, 581–604. [Google Scholar] [CrossRef]
  23. Melisa, R., Ashadi, A., Triastuti, A., Hidayati, S., Salido, A., Ero, P. E. L., Marlini, C., Zefrin, Z., & Al Fuad, Z. (2025). Critical thinking in the age of AI: A systematic review of AI’s effects on higher education. Educational Process International Journal, 14, e2025031. [Google Scholar] [CrossRef]
  24. Michel-Villarreal, R., Vilalta-Perdomo, E., Salinas-Navarro, D. E., Thierry-Aguilera, R., & Gerardou, F. S. (2023). Challenges and opportunities of generative AI for higher education as explained by ChatGPT. Education Sciences, 13(9), 856. [Google Scholar] [CrossRef]
  25. Moharreri, K., Ha, M., & Nehm, R. H. (2014). EvoGrader: An online formative assessment tool for automatically evaluating written evolutionary explanations. Evolution: Education and Outreach, 7(1), 15. [Google Scholar] [CrossRef]
  26. Mukhiddinov, M., & Kim, S.-Y. (2021). A systematic literature review on the automatic creation of tactile graphics for the blind and visually impaired. Processes, 9(10), 1726. [Google Scholar] [CrossRef]
  27. Nasution, N. E. A. (2023). Using artificial intelligence to create biology multiple choice questions for higher education. Agricultural and Environmental Education, 2(1), em002. [Google Scholar] [CrossRef]
  28. Nazari, N., Shabbir, M. S., & Setiawan, R. (2021). Application of artificial intelligence powered digital writing assistant in higher education: Randomized controlled trial. Heliyon, 7(5), e07014. [Google Scholar] [CrossRef]
  29. Nikolopoulou, K. (2024). Generative artificial intelligence in higher education: Exploring ways of harnessing pedagogical practices with the assistance of ChatGPT. International Journal of Changes in Education, 1(2), 103–111. [Google Scholar] [CrossRef]
  30. National Science Foudnation and the American Association for the Advancement of Science. (2009). Vision and change in undergraduate biology education: A call to action. Available online: https://www.aaas.org/sites/default/files/content_files/VC_report.pdf (accessed on 7 October 2025).
  31. Pan, M., Guo, K., & Lai, C. (2024). Using artificial intelligence chatbots to support English-as-a-foreign language students’ self-regulated reading. RELC Journal, 1–13. [Google Scholar] [CrossRef]
  32. Perez-Lopez, R., Laleh, N. G., Mahmood, F., & Kather, J. N. (2024). A guide to artificial intelligence for cancer researchers. Nature Reviews Cancer, 24(6), 427–441. [Google Scholar] [CrossRef] [PubMed]
  33. Rudolph, J. (2023). War of the chatbots: Bard, bing chat, ChatGPT, ernie and beyond. The new AI gold rush and its impact on higher education. Journal of Applied Learning and Teaching, 6(1), 364–389. [Google Scholar] [CrossRef]
  34. Sajja, R., Sermet, Y., Cikmaz, M., Cwiertny, D., & Demir, I. (2024). Artificial intelligence-enabled intelligent assistant for personalized and adaptive learning in higher education. Information, 15(10), 596. [Google Scholar] [CrossRef]
  35. Salloum, S. A. (2024). AI perils in education: Exploring ethical concerns. In Artificial intelligence in education: The power and dangers of ChatGPT in the classroom (Vol. 144, pp. 669–675). Springer Nature. [Google Scholar] [CrossRef]
  36. Schön, E.-M., Neumann, M., Hofmann-Stölting, C., Baeza-Yates, R., & Rauschenberger, M. (2023). How are AI assistants changing higher education? Frontiers in Computer Science, 5, 1208550. [Google Scholar] [CrossRef]
  37. Selvam, A. A. A. (2024). Exploring the impact of artificial intelligence on transforming physics, chemistry, and biology education. Journal of Science with Impact, 2. [Google Scholar] [CrossRef]
  38. Stribling, D., Xia, Y., Amer, M. K., Graim, K. S., Mulligan, C. J., & Renne, R. (2024). The model student: GPT-4 performance on graduate biomedical science exams. Scientific Reports, 14(1), 1278. [Google Scholar] [CrossRef] [PubMed]
  39. Sun, Y., Sheng, D., Zhou, Z., & Wu, Y. (2024). AI hallucination: Towards a comprehensive classification of distorted information in artificial intelligence-generated content. Humanities and Social Sciences Communications, 11(1), 1–14. [Google Scholar] [CrossRef]
  40. Thomas, S., Abraham, A., Baldwin, J., Piplani, S., & Petrovsky, N. (2022). Artificial intelligence in vaccine and drug design. In N. J. Clifton (Ed.), Vaccine design: Methods and protocols, volume 1. Vaccines for human diseases (pp. 131–146). Springer US. [Google Scholar] [CrossRef]
  41. Walters, W. H. (2023). The effectiveness of software designed to detect AI-generated writing: A comparison of 16 AI text detectors. Open Information Science, 7(1), e001568-268. [Google Scholar] [CrossRef]
  42. Walters, W. H., & Wilder, E. I. (2023). Fabrication and errors in the bibliographic citations generated by ChatGPT. Scientific Reports, 13(1), 14045. [Google Scholar] [CrossRef]
  43. Watkins, M. (2025). AI-powered reading assistants: A tool for equity in first-year writing. In Rethinking writing eduction in the age of generative AI (1st ed.). Routledge. [Google Scholar] [CrossRef]
  44. Xu, W., Meng, J., Raja, S. K. S., Priya, M. P., & Devi, M. K. (2023). Artificial intelligence in constructing personalized and accurate feedback systems for students. International Journal of Modeling, Simulation, and Scientific Computing, 14(01), 2341001. [Google Scholar] [CrossRef]
  45. Yeadon, W., Peach, A., & Testrow, C. (2024). A comparison of human, GPT-3.5, and GPT-4 performance in a university-level coding course. Scientific Reports, 14, 23285. [Google Scholar] [CrossRef]
  46. Zhai, C., & Wibowo, S. (2023). A systematic review on artificial intelligence dialogue systems for enhancing English as foreign language students’ interactional competence in the university. Computers and Education: Artificial Intelligence, 4, 100134. [Google Scholar] [CrossRef]
Figure 1. Average scores plotted by assessment type. Bing is plotted in blue, Bard in red, ChatGPT-3.5 in yellow, ChatGPT-4 in green, and students in orange. Error bars represent standard deviations, and asterisks denote statistical significance (p < 0.05). (A) Average scores by each entity on problem sets and exams, using the full data set that includes questions with images and DNA sequences. An asterisk denotes that student scores were significantly higher than all AI scores on both assessment types. (B) Average scores by each entity on problem sets and exams, using the data set that excludes questions with images and DNA sequences. A single asterisk (*) on problem sets denotes that Bard scored lower than ChatGPT-3.5 and students. A single asterisk (*) on exams denotes that Bard scored lower than all other entities. A double asterisk (**) on exams denotes that ChatGPT-3.5 scored higher than all other entities except ChatGPT-4.
Figure 2. AI and student performance by Bloom’s Taxonomy. (A) Bloom’s Taxonomy is a hierarchical categorization of learning goals, with each level increasing in cognitive complexity, from level 1 (recall) as the lowest to level 6 (create) as the highest. The figure is modified from Vanderbilt University’s Center for Teaching, reproduced with permission under the Creative Commons Attribution-NonCommercial license (CC BY-NC 4.0). (B) Average scores by each entity, categorized by Bloom level, using the full data set that includes image-based questions. (C) Average scores by each entity, categorized by Bloom level, using the data set that excludes image-based questions. Levels 1–4 are shaded in grayscale, from level 1 (black) to level 4 (white). Color-coding matches Figure 1: Bing is in blue, Bard in red, ChatGPT-3.5 in yellow, ChatGPT-4 in green, and students in orange. Asterisks denote statistical significance (p < 0.05).
Table 1. Comparison of performance with and without image-based questions. Average scores are listed by entity (AI or students) and assessment type (problem set or exam). Scores from the full data set (image-based questions included) are compared with scores from the partial data set (image-based questions excluded) for statistical significance, and p-values are listed for each comparison. Comparisons in which the scores were significantly different are marked in green.
Assessment | Entity | Average Score (Images Included) | Average Score (Images Excluded) | p-Value
P-set | Students | 86.6% | 87.9% | not significant
P-set | GPT-4 | 64.2% | 84.6% | 0.03
P-set | GPT-3.5 | 63.7% | 90.8% | <0.01
P-set | Bing | 62.1% | 84.4% | <0.01
P-set | Bard | 60.7% | 71.6% | not significant
Exam | Students | 85.1% | 85.4% | not significant
Exam | GPT-4 | 65.4% | 91.6% | <0.01
Exam | GPT-3.5 | 68.0% | 93.1% | <0.01
Exam | Bing | 56.3% | 77.0% | <0.01
Exam | Bard | 66.8% | 73.3% | not significant
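For context on how pairwise comparisons like those summarized in Table 1 are typically computed, the brief Python sketch below runs a two-sample comparison with SciPy. It is a hypothetical illustration only: the score lists are placeholder values rather than data from this study, and Welch’s t-test is shown as one common choice; the authors’ actual statistical procedure is described in their Methods section.

```python
# Hypothetical illustration of the kind of pairwise comparison summarized in Table 1
# (scores with vs. without image-based questions). The score lists are placeholders,
# NOT data from the study, and Welch's t-test is shown only as one common choice of
# test; the paper's Methods section specifies the procedure actually used.
from scipy import stats

scores_images_included = [0.64, 0.55, 0.70, 0.62, 0.58, 0.66]   # placeholder per-question scores
scores_images_excluded = [0.86, 0.90, 0.81, 0.88, 0.84, 0.79]   # placeholder per-question scores

t_stat, p_value = stats.ttest_ind(scores_images_included,
                                  scores_images_excluded,
                                  equal_var=False)               # Welch's (unequal-variance) t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")                    # compare against the 0.05 threshold
```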
Table 2. Comparison of AI and student performance using an image-excluded data set. AI scores on both assessment types (problem sets and exams) were compared to student scores, marked in blue. p-values are listed for each comparison. AI scores that are statistically lower than students are marked in red; scores that are statistically higher than students are marked in green.
Assessment | Entity | Average Score | p-Value
P-set | Students | 87.9% | (reference)
P-set | GPT-3.5 | 90.8% | not significant
P-set | GPT-4 | 84.6% | not significant
P-set | Bing | 84.4% | not significant
P-set | Bard | 71.6% | <0.01
Exam | Students | 85.4% | (reference)
Exam | GPT-3.5 | 93.1% | 0.03
Exam | GPT-4 | 91.6% | not significant
Exam | Bing | 77.0% | not significant
Exam | Bard | 73.3% | 0.01
Table 3. AI and student performance by Bloom’s Taxonomy. Average scores for all entities are listed by Bloom level and by whether image-based questions were included in or excluded from the data set.
Bloom Level | Images | Bing | Bard | GPT-3.5 | GPT-4 | Students
Level 1 | Included | 70.2% | 69.0% | 92.0% | 90.8% | 85.0%
Level 2 | Included | 83.8% | 82.6% | 96.1% | 94.4% | 86.7%
Level 3 | Included | 58.4% | 59.4% | 68.4% | 60.3% | 84.3%
Level 4 | Included | 21.6% | 45.4% | 13.2% | 22.6% | 88.5%
Level 1 | Excluded | 77.7% | 75.0% | 100.0% | 98.4% | 87.0%
Level 2 | Excluded | 86.6% | 82.0% | 99.3% | 97.5% | 86.5%
Level 3 | Excluded | 72.9% | 64.6% | 92.9% | 73.2% | 84.2%
Level 4 | Excluded | 58.0% | 43.3% | 32.0% | 70.0% | 86.2%
Table 4. Final grades for AIs and students in BIO-100F. Final grades are listed as a percentage and a letter grade for each entity, both with and without image-based questions in the data set.
Entity | Images Included | Grade | Images Excluded | Grade
Bing | 60.1% | D− | 76.0% | C
Bard | 67.9% | D | 73.0% | C
GPT-3.5 | 72.6% | C− | 92.4% | A−
GPT-4 | 72.3% | C− | 92.1% | A−
Students | 86.4% | B | 86.7% | B
Table 5. Ranking of AIs and students based on exam scores from a data set excluding image-based questions. Average exam scores are listed along with a statistical comparison of each entity against all other entities. Green denotes a final grade within the A range: dark green marks the highest score (A) and light green the second-place ranking (A−). Yellow denotes a final grade of B and the third-place ranking. Red denotes a final grade within the C range: light red marks the fourth-place ranking (C) and dark red the lowest ranking (C−). The threshold for statistical significance is p < 0.05.
Rank | Entity | Average Score | Statistical Significance Between Entities
1 | GPT-3.5 | 93.1% | Higher than 3 entities (students, Bard, Bing); same as 1 entity (GPT-4)
2 | GPT-4 | 91.6% | Higher than 2 entities (Bard, Bing); same as 2 entities (GPT-3.5, students)
3 | Students | 85.4% | Higher than 1 entity (Bard); same as 2 entities (GPT-4, Bing); lower than 1 entity (GPT-3.5)
4 | Bing | 77.0% | Same as 2 entities (students, Bard); lower than 2 entities (GPT-3.5, GPT-4)
5 | Bard | 73.3% | Same as 1 entity (Bing); lower than 3 entities (GPT-3.5, GPT-4, students)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
