Artificial Intelligence Performance in Introductory Biology: Passing Grades but Poor Performance at High Cognitive Complexity
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The title is appropriate and descriptive; however, I recommend simplifying it by removing the phrase “collegiate-level” for clarity, for example, “AI Performance in Introductory Biology.”
The abstract effectively outlines the study’s aim, methods, and key findings. To enhance its practical relevance, I suggest adding one sentence that highlights why the findings matter for educators or curriculum developers, particularly in shaping assessment strategies in AI-integrated learning environments.
The introduction could benefit from one brief sentence connecting this study to broader educational technology literature, not just biology. This would strengthen its interdisciplinary value.
Line 30: “and create new content”, I suggest revising to “and generate novel content” for academic precision.
Line 33: “over two-thirds of current occupations will be partially automated by AI.” I recommend specifying “occupational tasks” rather than “occupations,” which is broader.
Line 39: “altering forms of assessments to avoid AI usage”, revise to “modifying assessment formats to limit AI interference.”
Line 45: “how they perform on various assessments in different disciplines”, I believe it could be more direct as “their performance across assessment types in biology education.”
The Materials and Methods section is clearly structured, with appropriate use of standardized grading rubrics, alignment to Bloom’s Taxonomy, and valid statistical analyses. However, several areas would benefit from clarification and improved terminology.
Lines 107–108: The statement that “no iterative prompting was performed” is important but could be expanded. Clarify that only a single prompt was used per question and no feedback loops were allowed; this ensures methodological rigor.
Line 115: “Each AI was treated as a unique individual”; I suggest “Each AI was evaluated as an individual participant” to avoid anthropomorphism.
Line 132: I recommend replacing the “Level 1, Level 2…” phrasing with “Bloom Level 1 (Remember), Bloom Level 2 (Understand)…” for precision.
The Results section is clearly structured and effectively presents findings across multiple dimensions, including assessment type, cognitive complexity, and final grades. The use of statistical analysis is appropriate, and the tables and figures are well-aligned with the narrative. To improve clarity, the authors should briefly contextualize AI scores by indicating whether they represent passing or failing performance. Some descriptions of AI score improvements are repetitive and could be summarized more succinctly. The section on Bloom’s Taxonomy would benefit from a concise statement summarizing the overall trend of declining AI performance with increasing cognitive complexity. Additionally, all table references should be accompanied by brief interpretations rather than being presented as standalone data.
The Discussion appropriately interprets the findings in relation to prior literature and clearly addresses the implications for educational practice, particularly assessment design in AI-integrated classrooms. The authors effectively link their results to Bloom’s Taxonomy and provide practical recommendations for educators. To enhance clarity and scholarly rigor, a few areas could be strengthened. First, the section would benefit from a more focused synthesis of the main findings at the outset, rather than dispersing them throughout. Additionally, while the limitations of image and sequence-based tasks are acknowledged, the discussion could better address the study’s broader limitations, such as generalizability beyond biology or single-institution sampling. The forward-looking suggestions (e.g., AI as a reading companion) are thoughtful but could be better supported with emerging empirical evidence. A brief cautionary note on overreliance on AI-generated content would also add balance.
The conclusion effectively reiterates the central findings, that while generative AI tools can perform well on lower-order cognitive tasks, they struggle with higher-order assessments and tasks requiring image or sequence analysis. The call for educators to redesign assessments to focus on cognitive complexity is both appropriate and timely. However, the conclusion would benefit from a clearer, more concise summary of the study’s practical implications. Currently, key takeaways are somewhat embedded within broader reflections. The authors may consider ending with a sharper statement emphasizing how the findings can inform AI-integrated curriculum and assessment reform.
The references are generally current, relevant, and diverse. Thank you.
Comments on the Quality of English Language
The manuscript is readable but would benefit from minor language editing to improve clarity, consistency, and academic tone.
Author Response
We would like to thank Reviewer 1 for their careful reading of our manuscript and their insightful feedback to improve the work. We agree with the Reviewer on all of their points, and have made changes based on all comments and suggestions. Please see below for the specific edits. Thank you!
Reviewer 1 comment: The title is appropriate and descriptive; however, I recommend simplifying it by removing the phrase “collegiate-level” for clarity, for example, “AI Performance in Introductory Biology.”
Response: We thank the reviewer for their feedback on our manuscript. We have altered the title to remove “collegiate-level”.
Reviewer 1 comment: The abstract effectively outlines the study’s aim, methods, and key findings. To enhance its practical relevance, I suggest adding one sentence that highlights why the findings matter for educators or curriculum developers, particularly in shaping assessment strategies in AI-integrated learning environments.
Response: We have altered the last sentence of the abstract to speak more directly to educators and to highlight how our findings can be used in curriculum development, stating: “By understanding their capabilities at different levels of complexity, educators will be better able to adapt assessments based on AI ability, particularly through the utilization of image and sequence-based questions, and integrate AI into higher education curriculum.”
Reviewer 1 comment: The introduction could benefit from one brief sentence connecting this study to broader educational technology literature, not just biology. This would strengthen its interdisciplinary value.
Response: We have edited the manuscript, specifically altering the last sentences of the first paragraph of the introduction (lines 52–57) to state: “However, before educators invest in different pedagogical approaches and curricular restructuring, it is critical to understand the capabilities of generative AI tools and their performance across assessment types. While our study focused on biology, the results inform collegiate-level curriculum more broadly, as the assessments use common formats, including problem sets, exams, and papers, found across many disciplines.”
Reviewer 1 comment: Line 33: “over two-thirds of current occupations will be partially automated by AI.” I recommend specifying “occupational tasks” rather than “occupations,” which is broader.
Response: We have edited the manuscript and used the reviewer’s suggested language.
Reviewer 1 comment: Line 39: “altering forms of assessments to avoid AI usage”, revise to “modifying assessment formats to limit AI interference.”
Response: We have edited the manuscript and used the reviewer’s suggested language.
Reviewer 1 comment: Line 45: “how they perform on various assessments in different disciplines”, I believe it could be more direct as “their performance across assessment types in biology education.”
Response: We have edited the manuscript and used the reviewer’s suggested language.
Reviewer 1 comment: The Materials and Methods section is clearly structured, with appropriate use of standardized grading rubrics, alignment to Bloom’s Taxonomy, and valid statistical analyses. However, several areas would benefit from clarification and improved terminology. Lines 107–108: The statement that “no iterative prompting was performed” is important but could be expanded. Clarify that only a single prompt was used per question and no feedback loops were allowed; this ensures methodological rigor.
Response: We have clarified the Methods section, specifically at line 107-108 that the reviewer asked about. We have changed the language to state: “The AIs were prompted with each question from the assessments a single time to best mimic the student experience of a single attempt to answer an assessment question. Only a single prompt was used per question and no feedback loops were allowed.”
Reviewer 1 comment: Line 115: “Each AI was treated as a unique individual”; I suggest “Each AI was evaluated as an individual participant” to avoid anthropomorphism.
Response: We have edited the manuscript and used the reviewer’s suggested language.
Reviewer 1 comment: Line 132: I recommend replacing the “Level 1, Level 2…” phrasing with “Bloom Level 1 (Remember), Bloom Level 2 (Understand)…” for precision.
Response: We have edited the manuscript and used the reviewer’s suggested language.
Reviewer 1 comment: The Results section is clearly structured and effectively presents findings across multiple dimensions, including assessment type, cognitive complexity, and final grades. The use of statistical analysis is appropriate, and the tables and figures are well-aligned with the narrative. To improve clarity, the authors should briefly contextualize AI scores by indicating whether they represent passing or failing performance.
Response: This is an excellent point by the reviewer, as it is a major conclusion of the paper, and we now realize that we were not clear about which scores represent passing vs. failing performance. At the end of the first paragraph in the Results section (line 187), we now state “The threshold for passing was set at 60% (D-); scores at or above 60% are considered passing, while scores below 60% are considered a failing performance.”
Reviewer 1 comment: Some descriptions of AI score improvements are repetitive and could be summarized more succinctly. The section on Bloom’s Taxonomy would benefit from a concise statement summarizing the overall trend of declining AI performance with increasing cognitive complexity.
Response: We thank the reviewer for their recommendation. We have added a concise introductory sentence to the paragraph that summarizes performance by Bloom Level, stating:
“The performance of AIs was impacted by the cognitive complexity of the question, with a significant decline in performance at increasing levels of complexity. Student performance remained relatively constant (no statistically significant change) across all Bloom Levels, while AI performance dropped with increasing complexity (Figure 2).”
We have also streamlined the section describing the AI score improvements, cutting much of the text and representing the scores and p-values in parentheses rather than repetitive sentence structures (Lines 262-276).
Reviewer 1 comment: Additionally, all table references should be accompanied by brief interpretations rather than being presented as standalone data.
Response: We have altered all of the sentences that reference the tables and included interpretations so they are not presented as standalone data. For Table 1, we now state:
“All average scores described above (scores on assessments with images included and excluded) and the associated p-values are displayed in Table 1, demonstrating the increased performance when images are excluded.”
For Table 2, we now state: “Comparison of AI scores to student scores with the image-excluded data set, and the associated significance values described above, are displayed in Table 2, demonstrating that Bard scored worse than students on both assessment types while GPT-3.5 scored higher on exams.”
Reviewer 1 comment: The Discussion appropriately interprets the findings in relation to prior literature and clearly addresses the implications for educational practice, particularly assessment design in AI-integrated classrooms. The authors effectively link their results to Bloom’s Taxonomy and provide practical recommendations for educators. To enhance clarity and scholarly rigor, a few areas could be strengthened. First, the section would benefit from a more focused synthesis of the main findings at the outset, rather than dispersing them throughout.
Response: We thank the reviewer for the recommendation. We have altered the beginning of the Discussion section such that the first paragraph now provides a synthesis of the major overall findings. The section now states:
"Overall, the AIs performed poorly when required to analyze image and DNA-sequence based questions, as well as performing poorly at higher levels of cognitive complexity as defined by Bloom’s Taxonomy. When image and DNA-sequence questions were removed from assessments, performance improved with GPT-3.5 and 4 receiving scores higher than students. However, AI performance was still significantly lower than students at the highest tested Bloom Level (levels 4) with images and DNA-sequence excluded."
Reviewer 1 comment: Additionally, while the limitations of image and sequence-based tasks are acknowledged, the discussion could better address the study’s broader limitations, such as generalizability beyond biology or single-institution sampling.
Response: This is an excellent suggestion from the reviewer, and we have added the following language to the first paragraph of the Discussion to acknowledge these limitations:
“While this study is limited to a single course at a single institution, it provides evidence that AIs are able to perform, albeit poorly, at a collegiate level. While this study is also limited to the discipline of biology, it provides context and a roadmap to educators in a range of disciplines, encouraging the usage of graphical, image-based analysis as well as higher Bloom Level assessments to capture student learning.”
Reviewer 1 comment: The forward-looking suggestions (e.g., AI as a reading companion) are thoughtful but could be better supported with emerging empirical evidence. A brief cautionary note on overreliance on AI-generated content would also add balance.
Response: We thank the reviewer for this suggestion, as we did not include the studies to support these statements. We have changed the language in lines 586–602 to include empirical evidence, more specific examples, and a cautionary note on AI over-usage, citing Melisa et al. (2025) regarding concerns about student motivation for evaluation and reflection.
“One potential application at the collegiate level is usage of AI as a reading companion for primary literature. Initial studies have investigated the effective usage of AI-based reading assistants such as ExplainPaper and SciSpace for first-year college students (Watkins 2025). The investigation revealed that AI reading assistants are beneficial for students with reading comprehension difficulties and for non-native speakers (Watkins 2025). Other AI tools are currently being explored for higher education English language learning (Zhai & Wibowo, 2023; Pan, Guo & Lai, 2024) and as writing assistance tools that provide feedback (Nazari, Shabbir & Setiawan, 2021). However, there is currently little development of AI-supported reading tools for primary literature. It is often challenging for undergraduate STEM students to begin reading primary literature, as it is written for an expert-level audience with extensive jargon and assumed knowledge (Kozeracki et al. 2006; Hoskins, Lopatto & Stevens, 2011). Other studies have investigated barriers to AI-tool usage, finding that students avoid AI in educational settings due to a lack of familiarity or a preference for traditional methods. Few students, however, expressed negative opinions about the value and utility of AI (Hanshaw and Sullivan 2025). Over-reliance on AI can hinder student motivation for skill acquisition, critical evaluation, and self-reflection (Melisa et al. 2025).”
Reviewer 1 comment: The conclusion effectively reiterates the central findings, that while generative AI tools can perform well on lower-order cognitive tasks, they struggle with higher-order assessments and tasks requiring image or sequence analysis. The call for educators to redesign assessments to focus on cognitive complexity is both appropriate and timely. However, the conclusion would benefit from a clearer, more concise summary of the study’s practical implications. Currently, key takeaways are somewhat embedded within broader reflections. The authors may consider ending with a sharper statement emphasizing how the findings can inform AI-integrated curriculum and assessment reform.
Response:
We thank the reviewer for this important point; we agree that the manuscript ended on the topic of AI usage rather than recapping the major conclusions of the study and informing AI-integrated curriculum. We have rewritten the final paragraph of the Discussion to now state:
"In this study, we investigated the ability of four AIs to perform on collegiate-level curriculum, specifically measuring their ability with introductory biology assessments. The four AIs, GPT-4, GPT-3.5, Bing and Bard were all capable of receiving a passing grade (60% or above) (Table 4) in BIO100, albeit with very poor performance. The low scores were linked to an inability to interpret images, including graphical representation of data and DNA sequence. When assessment questions containing this content was removed from analysis, AI performance improved significantly with some AIs (GPT-4 and GPT-3.5) receiving higher final grades than students. These results are concerning for reasons of academic integrity and the inappropriate usage of these tools to aid student performance. However, college educators should consider the limitations of these AIs to perform at higher levels of cognitive complexity. When challenged with higher Bloom Levels (apply and analyze), performance was significantly weaker. These results together suggest that college educators should continue to reinforce assessments challenging their students to analyze and apply knowledge, particularly through assessment that involve images, graphs, and in the case of biology, DNA sequence. Through these findings, our biology curriculum is shifting towards final presentations, poster sessions, and other oral formats for students to showcase their ability to analyze and interpret biological concepts and data. Educators in other fields beyond biology can similarly leverage these approaches by configuring assessments to focus on higher-order cognitive tasks such as analysis, application, evaluation and creation. This approach would be more effective for educators if these assessments included image-based content. Another approach includes basing assessments on novel scenarios and applications that fall outside of AI training data. In our biology curriculum, we are utilizing questions that probe student understanding with data that does not fit any known system or example. The future of higher education will be undoubtedly impacted by continuously improving AIs, and embracing their potential while mitigating negative impacts will be critical for success in student learning."
Reviewer 2 Report
Comments and Suggestions for Authors
Suggestions
- Critical Reflection on Limitations – Consider discussing more explicitly how excluding ~36% of questions (pp. 4–5) may artificially elevate AI scores. This is especially relevant when recommending that educators focus on higher-order tasks and image-based questions.
- Implications for Assessment Design – The discussion notes that GPT-3.5 and GPT-4 excel at lower Bloom levels but struggle with analysis (Level 4) and image-based interpretation (p. 10, Table 3). Strengthen recommendations for future educators by outlining specific strategies (e.g., integrating visual data interpretation, iterative assessments).
- Depth of Analysis for Final Paper Results – Expand on why GPT-4’s paper (scoring 85%) was near-student level, while Bing and Bard performed poorly (pp. 11–12). This could provide insight into differences in model architecture or output depth.
- Generalizability Beyond Biology – Briefly comment on whether these results may transfer to other STEM disciplines or remain biology-specific, as this will broaden the article’s relevance.
Comments
This is a well-designed and clearly written paper that makes a novel contribution to understanding generative AI performance in biology education. The introduction is well-contextualized, and the methods and results are robust, with appropriate statistical analysis and clear visualizations (Figures 1–2, Tables 1–5). To strengthen the paper, please (1) discuss more explicitly how excluding ~36–37% of image/DNA-based questions may elevate AI scores and affect generalizability; (2) strengthen recommendations for assessment design by suggesting concrete strategies for integrating visual data interpretation and higher-order cognitive tasks (Table 3 shows GPT-3.5/GPT-4 weakness at Bloom Level 4); (3) expand analysis of final paper results to explain why GPT-4 produced near-student-level work while Bing and Bard performed poorly (pp. 11–12); and (4) comment briefly on whether results may extend to other STEM disciplines to broaden relevance.
I look forward to reviewing your paper again.
Author Response
We would like to thank Reviewer 2 for their careful reading of our manuscript and their insightful feedback to improve the work. The Reviewer asked for expansion and explanation on several points and we have made these changes and additions. Please see below for the specific edits. Thank you!
Reviewer 2 comment:
This is a well-designed and clearly written paper that makes a novel contribution to understanding generative AI performance in biology education. The introduction is well-contextualized, and the methods and results are robust, with appropriate statistical analysis and clear visualizations (Figures 1–2, Tables 1–5). To strengthen the paper, please
(1) discuss more explicitly how excluding ~36–37% of image/DNA-based questions may elevate AI scores and affect generalizability;
Response:
We thank the reviewer for this important suggestion. We agree that it is important to note that the removal of a large percentage of assessment questions could over-inflate AI scores. The removal of these questions may also improve generalizability to other fields. For example, the usage of DNA sequence is highly specialized to biology, and image-based graphical representations of data often tend to be more represented in the sciences. Removing these questions can provide more generalizable information on AI ability to answer text-based questions. We have added the following language to the Discussion section of the manuscript in lines 521–526:
"The removal of these questions constituted over 30% of assessment content, and thus could over-inflate AI performance. Their removal allows for greater generalizability as DNA sequence is specific to biology, and image-based graphs and data analysis tend to be more represented in the sciences. By removing these questions, we are able to better assess purely text-based questions for a more accurate representation of AI ability. "
Reviewer 2 comment: (2) strengthen recommendations for assessment design by suggesting concrete strategies for integrating visual data interpretation and higher-order cognitive tasks (Table 3 shows GPT-3.5/GPT-4 weakness at Bloom Level 4);
Response: We thank the reviewer for their suggestions. We have written a new final paragraph in the Discussion section to better recap the findings of the investigation and to provide concrete strategies for college educators:
"In this study, we investigated the ability of four AIs to perform on collegiate-level curriculum, specifically measuring their ability with introductory biology assessments. The four AIs, GPT-4, GPT-3.5, Bing and Bard were all capable of receiving a passing grade (60% or above) (Table 4) in BIO100, albeit with very poor performance. The low scores were linked to an inability to interpret images, including graphical representation of data and DNA sequence. When assessment questions containing this content was removed from analysis, AI performance improved significantly with some AIs (GPT-4 and GPT-3.5) receiving higher final grades than students. These results are concerning for reasons of academic integrity and the inappropriate usage of these tools to aid student performance. However, college educators should consider the limitations of these AIs to perform at higher levels of cognitive complexity. When challenged with higher Bloom Levels (apply and analyze), performance was significantly weaker. These results together suggest that college educators should continue to reinforce assessments challenging their students to analyze and apply knowledge, particularly through assessment that involve images, graphs, and in the case of biology, DNA sequence. Through these findings, our biology curriculum is shifting towards final presentations, poster sessions, and other oral formats for students to showcase their ability to analyze and interpret biological concepts and data. Educators in other fields beyond biology can similarly leverage these approaches by configuring assessments to focus on higher-order cognitive tasks such as analysis, application, evaluation and creation. This approach would be more effective for educators if these assessments included image-based content. Another approach includes basing assessments on novel scenarios and applications that fall outside of AI training data. In our biology curriculum, we are utilizing questions that probe student understanding with data that does not fit any known system or example. The future of higher education will be undoubtedly impacted by continuously improving AIs, and embracing their potential while mitigating negative impacts will be critical for success in student learning."
Reviewer 2 comment: (3) expand analysis of final paper results to explain why GPT-4 produced near-student-level work while Bing and Bard performed poorly (pp. 11–12);
Response: We thank the reviewer for this suggestion. We needed to better explore why this is the case. We were able to find other empirical evidence for the increased performance of GPT-4 over other models and have cited these studies and their contexts, as well as included a description of GPT-4’s increased parameter count over GPT-3.5 (1.76 trillion parameters vs. 175 billion parameters), which provides the model with enhanced creativity, reasoning, accuracy, and contextual awareness (Annepaka and Pakray 2025). The new section is located at lines 425–437 and states:
"The paper produced by GPT-4 was difficult to distinguish from similar work submitted by students and sufficiently met the expectations of the assignment. GPT-4’s ability to replicate expert-level explanations has been documented in other comparison studies, including higher performance than GPT-3.5 and other AIs on PhD-level medical questions (Khosravi et al. 2024), clinical decision-making (Lahat et al. 2024), emergency medicine examination (Liu et al. 2024) and collegiate-level coding (Yeadon et al. 2024). GPT-4’s outsized performance is likely due to its increased parameters. GPT-4 is based on 1.76 trillion parameters, a ten-fold increase from GPT-3.5’s 175 billion parameters, which provides the model with enhanced creativity, reasoning, accuracy, and contextual awareness (Annepaka and Pakray 2025). This increased capacity likely allows GPT-4 to produce writing near student level, unlike GPT-3.5, Bard and Bing which were clearly distinguishable from student work."
Reviewer 2 comment: (4) comment briefly on whether results may extend to other STEM disciplines to broaden relevance.
Response: We thank the reviewer for their suggestion, and agree that we needed to broaden our scope to other fields and include concrete examples for educators in all fields. We have added the following paragraph to the end of the Discussion to sharpen the emphasis on how the findings can inform AI-integrated curriculum and assessment reform:
"In this study, we investigated the ability of four AIs to perform on collegiate-level curriculum, specifically measuring their ability on introductory biology assessments. The four AIs, GPT-4, GPT-3.5, Bing, and Bard, were all capable of receiving a passing grade (60% or above; Table 4) in BIO100, albeit with very poor performance. The low scores were linked to an inability to interpret images, including graphical representations of data and DNA sequence. When assessment questions containing this content were removed from the analysis, AI performance improved significantly, with some AIs (GPT-4 and GPT-3.5) receiving higher final grades than students. These results are concerning for reasons of academic integrity and the inappropriate usage of these tools to aid student performance. However, college educators should consider the limited ability of these AIs to perform at higher levels of cognitive complexity. When challenged with higher Bloom Levels (apply and analyze), performance was significantly weaker. Together, these results suggest that college educators should continue to reinforce assessments challenging their students to analyze and apply knowledge, particularly through assessments that involve images, graphs, and, in the case of biology, DNA sequence. Based on these findings, our biology curriculum is shifting towards final presentations, poster sessions, and other oral formats for students to showcase their ability to analyze and interpret biological concepts and data. Educators in fields beyond biology can similarly leverage these approaches by configuring assessments to focus on higher-order cognitive tasks such as analysis, application, evaluation, and creation. This approach would be more effective for educators in all fields if these assessments included image-based content. Another approach is basing assessments on novel scenarios and applications that fall outside of AI training data. In our biology curriculum, we are utilizing questions that probe student understanding with data that does not fit any known system or example. The future of higher education will undoubtedly be impacted by continuously improving AIs, and embracing their potential while mitigating negative impacts will be critical for success in student learning."
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I find that the manuscript has been substantially improved in clarity, structure, and depth, effectively addressing previous comments, and I consider it acceptable for publication in its present form.

