Article

Large Language Model and Traditional Machine Learning Scoring of Evolutionary Explanations: Benefits and Drawbacks

1 Department of Applied Mathematics and Statistics, College of Engineering, Stony Brook University, Stony Brook, NY 11794, USA
2 Department of Ecology and Evolution, College of Arts and Sciences, Stony Brook University, Stony Brook, NY 11794, USA
* Author to whom correspondence should be addressed.
Educ. Sci. 2025, 15(6), 676; https://doi.org/10.3390/educsci15060676
Submission received: 30 January 2025 / Revised: 9 May 2025 / Accepted: 23 May 2025 / Published: 29 May 2025

Abstract

Few studies have compared Large Language Models (LLMs) to traditional Machine Learning (ML)-based automated scoring methods in terms of accuracy, ethics, and economics. Using a corpus of 1000 expert-scored and interview-validated scientific explanations derived from the ACORNS instrument, this study employed three LLMs and the ML-based scoring engine, EvoGrader. We measured scoring reliability (percentage agreement, kappa, precision, recall, F1), processing time, and explored contextual factors like ethics and cost. Results showed that with very basic prompt engineering, ChatGPT-4o achieved the highest performance across LLMs. Proprietary LLMs outperformed open-weight LLMs for most concepts. GPT-4o achieved robust but less accurate scoring than EvoGrader (~500 additional scoring errors). Concerns about data ownership, reliability, and replicability over time were notable LLM limitations. EvoGrader offered superior accuracy, reliability, and replicability, but its development required a large, high-quality, human-scored corpus, domain expertise, and restricted assessment items. These findings highlight the range of considerations that should inform choices between LLM and ML scoring in science education. Despite impressive LLM advances, ML approaches may remain valuable in some contexts, particularly those prioritizing precision, reliability, replicability, privacy, and controlled implementation.

1. Introduction

In the past few years, proprietary and open-weight Generative Artificial Intelligence (GenAI) and Large Language Models (LLMs; e.g., ChatGPT, Gemini, Claude, Gemma, and OLMoE) have become embedded within the fabric and functioning of the modern world. LLMs display many stunning capabilities, including human language emulation and complex problem solving. Their continuously improving capabilities (e.g., performance exceeding most humans on science tasks; Stribling et al., 2024) have spurred substantial discussion, debate, and research in science education (e.g., Wu et al., 2024; Zhai, 2023; Zhai & Nehm, 2023). Our study focuses on a less futuristic but practical application of LLMs in science education: their use in automated scoring of students’ written scientific explanations.
Over the past 15 years, research on automated scoring of student writing products in science (e.g., essays, short answers) has focused primarily on the use of traditional machine learning (ML) tools (e.g., Ha et al., 2011; Haudek et al., 2011; Gerard & Linn, 2022; Magliano & Graesser, 2012; Nehm et al., 2012a; Riordan et al., 2020). Using this approach, “tagged” or human-scored text corpora (e.g., scored student essays) are employed as inputs or “training data” to help computers “learn” to identify text features and build computational models that mimic human scoring. If the models meet certain quality control benchmarks (e.g., high inter-rater agreements with human scores), the models are subsequently deployed on fresh text corpora and tested for comparable outcomes. Numerous assessment tools applicable to science education contexts have been developed using ML approaches (for reviews, see Zhai et al., 2020, 2021).
LLMs stand out among AI tools due to their massive scale, versatile capabilities, and advanced pre-training methods. Leveraging transformer architectures, these models incorporate multilevel neural networks trained on billions of tokens from vast datasets, including both public and proprietary sources. This extensive pre-training enables LLMs to discern intricate patterns, structures, and relationships within language, surpassing the capabilities of traditional, task-specific machine learning models. Unlike traditional models, LLMs can generate text that closely mimics human writing. The broad applicability of LLMs marks a significant advancement, potentially rendering numerous task-specific models obsolete.
Although science educators have recently focused their attention on the transformative potential of LLMs, it is important to investigate and identify possible contexts in which traditional machine learning (ML) models (or simpler approaches) may retain their utility in science education. Little work has explored conceptually or empirically the potential benefits and drawbacks of these two general approaches for science assessment scoring. All digital tools have strengths and limitations that must be considered in their application, and the strengths of LLMs should not eclipse their limitations. As noted by Nyaaba et al. (2024, p. 8), “GPT models lack domain-specific focus and are not fine-tuned for specialized tasks or contexts, making them more prone to biases or hallucinations”. In essay grading, for example, LLMs have generated hallucinations in the interpretation of student answers and augmented student response text during grading (Kundu & Barbosa, 2024). Recent work has also highlighted how the distribution of data in LLM training corpora poses fundamental performance constraints (McCoy et al., 2024), which can create significant limitations in analyses of youth and young adult discourse (cf. Liu et al., 2023).
Prior work has shown that advanced ML approaches (e.g., Hybrid Neural Networks, encoder models) can achieve robust scoring accuracy (e.g., Latif & Zhai, 2024), and Moore et al. (2023) found that traditional rule-based approaches to automated scoring can surpass LLM performance. Work like this suggests that ML applications may continue to be useful in assessment contexts, particularly those characterized by: precisely defined tasks with narrow, domain-specific objectives; relatively small data sets; standard hardware and software availability; limited access to expensive Application Programming Interface (API) calls and high-end computational resources; situations where privacy and ethical considerations require tightly controlled data usage; and cases in which accuracy is paramount.
Although a growing number of empirical studies have tested the accuracy of LLM scoring relative to human scoring (e.g., Zhai et al., 2023; Cohen et al., 2024; Pack et al., 2024), much less work has quantified whether LLMs outperform existing interview-validated automated scoring methods developed using ML in the domain of evolutionary biology (e.g., Beggrow et al., 2014) or has explored the methodological, economic, and ethical benefits and drawbacks of using LLMs in comparison to ML approaches (Oli et al., 2023). Our study fills this gap by asking: (1) How well do proprietary and open-weight LLMs perform scoring students’ scientific explanations of evolutionary change derived from the ACORNS instrument? (2) How do the LLM results compare to older ML-based automated scoring tools? (3) Do LLMs and ML approaches offer comparable methodological advantages for science educators? and (4) What ethical considerations should be made when choosing between LLMs and ML tools?

2. Materials and Methods

2.1. Data Corpus and Human Scoring

The data corpus consisted of 1000 text-based student explanations of evolutionary change generated in response to the previously published and widely used “Assessment of COntextual Reasoning about Natural Selection” measurement instrument (ACORNS; Nehm et al., 2012b). The ACORNS was developed to uncover high school and university students’ thinking about evolutionary change across a variety of biological phenomena through the scientific practice of explanation. The responses in our corpus were drawn from prior studies of undergraduate students’ formative assessments (mean age 19, range 18–20); studies have shown that high school student responses do not differ appreciably from undergraduate responses (Nehm et al., 2012b). Student responses to ACORNS prompts may be gathered in many different educational settings and contexts (e.g., homework, in-class activities, quizzes, exams) but must be in the form of text-based answers. Response patterns are used as indicators of preparation for future learning in evolutionary biology.
Multiple studies have gathered several forms of validity evidence for the ACORNS (e.g., content validity, substantive validity, discriminant and convergent validity, and generalization validity) in support of inferences drawn from instrument scores derived from both human and ML-based computer scoring (Opfer et al., 2012; Beggrow et al., 2014; Moharreri et al., 2014; Nehm, 2024). The instrument has also been used in student samples internationally (e.g., Chile, China, Germany, Indonesia, Korea, USA).
A published analytic (cf. Sripathi et al., 2024) scoring rubric has been used for more than a decade by expert human raters to score responses to the ACORNS instrument (see Beggrow et al. (2014) and Nehm et al. (2010) for details). This rubric was also used to develop the ML-based EvoGrader scoring system. In both contexts, the rubric is designed to produce binary scores (present, absent) for nine key concepts central to evaluating student explanations of evolutionary change. These nine concepts included six scientific ideas central to causal and mechanistic explanations of evolutionary change (variation, heritability, differential survival/reproduction, limited resources, competition, and non-adaptive factors). The rubric also contains guidelines for identifying three non-normative reasoning elements colloquially referred to as “misconceptions” (adaptation as acclimation, use/disuse inheritance, and inappropriate need-based causation or teleology). The rubric includes key linguistic terms, phrases, and general ways of reasoning about a concept that are not encompassed by a single word (e.g., gradual acclimation of all individuals in a population as opposed to changes in the distribution of individuals with different heritable traits). This rubric formed the basis for LLM prompt engineering (discussed below).
Human-derived scores for all nine concepts demonstrated strong inter-rater reliability evidence (Cohen’s Kappa > 0.81 for all concepts) and all disagreements were resolved via deliberation to achieve human consensus scores. These human-generated scores were compared to scores produced using ML and LLMs as discussed below.

2.2. ML-Based Scoring of Scientific Explanations

Students’ explanations were scored for the nine concepts mentioned above using the online, freely available machine-learning-based EvoGrader system (www.evograder.org). Although all EvoGrader scoring models use simple “bag of words” text parsing (Harris, 1954) and binary classifiers from Sequential Minimal Optimization (SMO; Platt, 1999), each concept model was optimized by unique feature extraction combinations (e.g., for variation: unigrams, stemming, removing stopwords, removing misclassified data; see Moharreri et al. (2014, p. 4)). Overall, training and testing utilized 10,270 text-based, human-scored evolutionary explanations and produced classifiers capable of accurately predicting the presence or absence of the nine evolutionary concepts at levels matching or exceeding human expert inter-rater reliabilities. Additional details on the ML infrastructure of EvoGrader may be found in Moharreri et al. (2014), and examples of ACORNS questions, scoring rubrics, and a scored data corpus can be found at www.evograder.org.
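For readers unfamiliar with this style of model, the sketch below illustrates the general bag-of-words-plus-SVM approach using scikit-learn. It is not EvoGrader's actual implementation (which used SMO-based classifiers with concept-specific feature extraction; see Moharreri et al., 2014), and the toy explanations and labels are invented for illustration.

```python
# Minimal sketch of a bag-of-words + linear SVM binary classifier for one concept
# (e.g., "variation"). This is NOT EvoGrader's actual code; the texts and labels
# below are invented placeholders, and real training used >10,000 scored responses.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical training data: 1 = concept present, 0 = concept absent
explanations = [
    "some plants had a mutation that produced the trait",
    "individuals varied in their traits and some survived better",
    "the plants needed the trait so they developed it",
    "the population slowly got used to the new environment",
    "random genetic variation arose and was passed to offspring",
    "all the plants adapted because they wanted to survive",
]
variation_labels = [1, 1, 0, 0, 1, 0]

model = Pipeline([
    # "Bag of words": unigram counts, lowercased, with English stopwords removed
    ("bow", CountVectorizer(lowercase=True, stop_words="english")),
    # Linear-kernel SVM; scikit-learn's SVC uses an SMO-style solver (cf. Platt, 1999)
    ("svm", SVC(kernel="linear")),
])
model.fit(explanations, variation_labels)

# Predict presence/absence of the concept in a new response
print(model.predict(["mutations created variation among the offspring"]))
```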

2.3. Large Language Model Selection and Prompt Engineering

As with the human and ML scoring, each student explanation was also evaluated by multiple LLMs. The selection of the three LLMs was based on the MMLU performance benchmark (Hendrycks et al., 2020) and led to the use of (i) proprietary ChatGPT-4o (OpenAI, 2024), (ii) open-weight Llama-3.1 (Dubey et al., 2024), and (iii) open-weight Gemma-2 (Gemma Team, 2024). Zero-shot prompt engineering encompassed three components: persona specifications, ACORNS rubric text, and format instructions (detailed in Figure 1).
Persona specifications in LLM prompts have been shown to improve scoring success (e.g., interpretation of explanatory text in student answers; OpenAI, 2025). In our case, the prompt included “You are a helpful teacher assistant to help grade a test.” The published ACORNS scoring rubrics have been successfully used by human raters for more than a decade (Nehm et al., 2010), and text relating to the detection of each of the nine concepts (e.g., variation) was extracted from these documents and added to each LLM prompt. Note that a concept encompasses a wide variety of ideas (see Nehm et al. (2010) for examples). To ensure consistency in comparing LLMs, we maintained this prompt design across all experiments.
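The exact prompts are provided in the Supplementary Materials; the snippet below is only an illustrative reconstruction of the three zero-shot components described above (persona, rubric excerpt, format instructions), with placeholder rubric text.

```python
# Illustrative reconstruction of the zero-shot prompt structure (persona + rubric
# excerpt + format instructions). The rubric text is a placeholder; the actual
# prompts appear in the Supplementary Materials.
PERSONA = "You are a helpful teacher assistant to help grade a test."

def build_prompt(concept_name: str, rubric_excerpt: str, student_response: str) -> str:
    return (
        f"{PERSONA}\n\n"
        f"Concept to score: {concept_name}\n"
        f"Rubric: {rubric_excerpt}\n\n"
        "Read the student's explanation and decide whether the concept is present.\n"
        f"Student explanation: {student_response}\n\n"
        "Respond with exactly one word: TRUE if the concept is present, FALSE if it is absent."
    )

# Example usage with placeholder rubric text
prompt = build_prompt(
    "Variation",
    "Credit responses that mention differences among individuals arising from "
    "mutation or recombination (placeholder excerpt).",
    "Some plants had a mutation that made pulegone and passed it on.",
)
print(prompt)
```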

2.4. Programming and Data Extraction

Four software tools were used to execute LLM scoring. The first was LangChain (https://www.langchain.com/, accessed on 22 May 2025), a framework for working with LLMs that chains interoperable components and provides abstractions for repeatedly prompting a model, parsing output, and specifying operation types. The second was the OpenAI ‘Chat Completions’ API, which was used to obtain responses from ChatGPT-4o (https://platform.openai.com/docs/guides/chat-completions, accessed on 22 May 2025). The third was Azure AI Studio (https://azure.microsoft.com/en-us/products/ai-studio, accessed on 22 May 2025), a unified platform for developing and deploying generative AI apps; it was used to access and deploy the Llama 3.1 models (8b, 70b, and 405b) via its API. The fourth was Ollama (https://ollama.com/, accessed on 22 May 2025), which was used to locally deploy the relatively small open-weight models (Llama3 8b and Gemma2 9b).
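As an illustration of how such scoring requests can be issued, the hedged sketch below shows one way to query ChatGPT-4o via the OpenAI Chat Completions API and a locally hosted open-weight model via Ollama. The model identifiers, environment assumptions, and function names are illustrative, not the study's exact code.

```python
# Hedged sketch of issuing scoring requests; not the study's exact code.
# Assumes an OPENAI_API_KEY environment variable for the OpenAI client and a local
# Ollama server with the named model already pulled; model names are assumptions.
from openai import OpenAI
import ollama

client = OpenAI()

def score_with_gpt4o(prompt: str) -> str:
    """Request a TRUE/FALSE score from ChatGPT-4o via the Chat Completions API."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce output variability for scoring
    )
    return response.choices[0].message.content.strip()

def score_with_local_model(prompt: str, model_name: str = "gemma2:9b") -> str:
    """Request a score from a locally deployed open-weight model via Ollama."""
    reply = ollama.chat(model=model_name, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"].strip()
```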

2.5. Statistical Comparisons of LLM and ML Scoring Accuracy Relative to Human Raters

Multiple evaluation methods were used to quantify scoring performance for the nine concepts: Cohen’s kappa, accuracy, precision, recall, and F1. Kappa accounts for chance agreement between raters and is a widely used measure of binary inter-rater reliability in education research (Moharreri et al., 2014). Accuracy is the percentage of instances in which computer scores and human scores matched. Macro-average precision, recall, and F1 scores were also calculated, offering a balanced evaluation of scoring success; a macro average is the unweighted arithmetic mean of a metric computed separately for each class. Precision indicates how often positive predictions are correct. Recall shows how well the model captures actual positive cases. The F1 score combines both into a single performance measure. This approach is useful in datasets with class imbalances and allows for a fair comparison of model performance across all categories. The scikit-learn Python package (v1.0.2 or later, https://scikit-learn.org/stable/, accessed on 22 May 2025) was used to calculate summary statistics (Cohen’s kappa, accuracy, precision, recall and F1 scores1) for model comparisons.
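For concreteness, the snippet below shows how these statistics can be computed with scikit-learn for one concept; the human and machine label vectors are invented placeholders.

```python
# Computing agreement statistics for one concept with scikit-learn.
# The human and machine label vectors below are invented placeholders.
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             precision_recall_fscore_support)

human_scores   = [1, 0, 0, 1, 1, 1, 0, 0, 0, 1]  # expert consensus (1 = present)
machine_scores = [1, 0, 1, 1, 1, 1, 0, 0, 0, 1]  # model predictions

kappa = cohen_kappa_score(human_scores, machine_scores)
accuracy = accuracy_score(human_scores, machine_scores)
precision, recall, f1, _ = precision_recall_fscore_support(
    human_scores, machine_scores, average="macro", zero_division=0
)
print(f"kappa={kappa:.3f} accuracy={accuracy:.3f} "
      f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```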

2.6. Model Run Time

Given that run time is a consideration for many researchers and practitioners, we used the ‘tqdm’ Python package (v4.66.0 or later, https://github.com/tqdm/tqdm, accessed on 22 May 2025) to monitor progress in real time and estimate the run time remaining in each scoring run. This enabled us to track how much time each test took to complete (see data in Table 1).
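A minimal example of this kind of progress tracking is shown below; the response list and the scoring call inside the loop are placeholders.

```python
# Minimal run-time monitoring with tqdm; the responses and scoring call are placeholders.
import time
from tqdm import tqdm

responses = [f"student response {i}" for i in range(1000)]  # placeholder corpus

start = time.time()
scores = []
for text in tqdm(responses, desc="Scoring 'variation'"):
    scores.append(len(text) % 2)  # stand-in for an LLM or ML scoring call
elapsed = time.time() - start
print(f"Total run time: {elapsed:.1f} s")
```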

3. Results

Our study revealed several findings regarding the accuracy of LLMs and ML methods in auto-scoring scientific explanations of evolutionary change. Three examples of student explanations and scoring results are provided in Table 1. First, most LLMs were able to follow grading criteria, output instructions, and complete the zero-shot tasks. ChatGPT-4o produced the most accurate results among the LLMs and generated the most consistent scores across the nine variables; accuracy scores ranged from 83% to 99.8%, and F1 scores ranged from 67.6% to 97.6% (Figure 1 and Figure 2, Table 2). ChatGPT-4o’s performance was notably superior to the other LLMs, demonstrating “near perfect” (cf. Landis & Koch, 1977) kappa results in several cases (Figure 3). These findings suggest that ChatGPT-4o has the strongest potential for serving as an alternative to the ML-based EvoGrader; however, further improvements in prompt engineering are clearly possible (discussed below).
The other LLMs performed much worse than GPT-4o (Figure 2). The open-weight models Llama 3-8, 3.1-8b, and Gemma2-9b generally struggled with assigning the correct target label when presented with complex prompts combining grading criteria, output instructions, and numerous examples. However, Llama 3-8b demonstrated the highest average accuracy at 84.7% for zero-shot learning tasks among the open-weight models, outperforming even larger and newer models in the Llama 3.1 family: 8b (72.5%), 70b (78.3%), and 405b (80.5%). Meanwhile, Gemma2-9b excelled in specific concepts, such as Variation (92.4% accuracy, the best among open-weight models) and Heritability (94.5% accuracy, the best among open-weight models; Figure 1). However, Gemma2-9b showed significant weaknesses for detecting concepts like Adapt = acclimate (30.7% accuracy) and Teleology/Need (48.5% accuracy), illustrating the variability in its performance across normative and misconception-based knowledge elements. The LLMs performed best when detecting normative scientific concepts (Figure 2, Table 1).
In addition to poorer performance, several drawbacks of LLMs became evident during the analyses. The models required significant computational resources, such as the Azure and OpenAI API cloud services. Generating score outputs demanded considerable computational time, highlighting inefficiencies compared to the ML-based EvoGrader. Total runtime for processing 1000 samples across nine concepts varied across the LLMs. ChatGPT-4o required between 9 and 14 min per concept. Open-weight models, hosted on Microsoft Azure, exhibited significantly longer completion times (Table 2). For instance, Llama 3-8b averaged 50 min, while the largest model, Llama3.1-405b, required an average of 1 h and 20 min, with some tasks exceeding 2 h. Note that every debugging attempt or prompt modification aimed at improving scoring accuracy incurred this processing time, making some prompt engineering options impractical.
Very simple zero-shot prompt engineering in ChatGPT-4o generated scoring accuracies comparable to some of the best LLM-related scoring results documented in the literature (e.g., Cohen et al., 2024). However, GPT-4o still produced 543 more grading mistakes than EvoGrader (703 vs. 160), or an average of 0.0781 vs. 0.0178 errors per concept score (nine concepts × 1000 students yield 9000 total scores). This suggests that further prompt engineering will be needed to improve LLM performance (discussed below).
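The per-score error rates follow directly from these totals, as the short check below illustrates.

```python
# Error-rate arithmetic for the 9 concepts x 1000 responses = 9000 scores.
total_scores = 9 * 1000
gpt4o_errors, evograder_errors = 703, 160
print(gpt4o_errors - evograder_errors)      # 543 additional errors
print(gpt4o_errors / total_scores)          # ~0.0781 errors per concept score
print(evograder_errors / total_scores)      # ~0.0178 errors per concept score
```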

4. Discussion

How well do proprietary and open-weight LLMs perform scoring students’ scientific explanations of evolutionary change derived from the ACORNS instrument, and how do the LLM results compare to older ML-based scoring tools? Given the impressive performance of LLMs on graduate-level tasks (e.g., Stribling et al., 2024), we anticipated that very basic zero-shot prompt engineering using LLMs would quickly and easily match EvoGrader’s performance on detecting disciplinary core ideas and misconceptions in students’ text-based evolutionary explanations (NRC, 2012). Surprisingly, after months of prompting, we found that overall (and for all nine concepts) this was never the case (Figure 2; Table 2). Only GPT-4o showed promise as an alternative to EvoGrader.
Despite employing very basic prompt engineering techniques, LLM accuracy in our study was very similar to prior LLM studies using much more sophisticated prompt engineering techniques. Cohen et al. (2024), for example, found that LLMs had variable success scoring middle school student explanations, with a few items producing values similar to GPT-4o in our study and others similar to Gemma and Llama (Figure 1). Although it is likely that more sophisticated prompt engineering will improve LLM scoring of evolutionary explanations, this should be tested empirically in future studies given that prompt engineering is not always able to improve performance meaningfully.
Prior studies have found that traditional auto-scoring approaches can outperform LLM scoring with more advanced prompt engineering. In a study of automated detection of item writing problems, for example, Moore et al. (2023) found that a traditional rule-based approach to automated scoring was far superior to GPT-4. In our study, several factors may account for why traditional ML consistently outperformed LLMs: the training corpora, linguistic features of the knowledge domain, and prompt engineering methodologies.

4.1. Training Corpora

Although many factors are likely to contribute to the challenges of automated analysis of evolutionary language, training data well matched to testing data improves scoring success (Ha et al., 2011; Huebner et al., 2021). Recent studies of LLM performance have shown that the distribution of training data features significantly predicts model performance; what a system learns from will determine what it learns (Huebner et al., 2021; McCoy et al., 2024; Wei et al., 2021). The LLMs used in our study are unlikely to have been trained on comparable corpora as the ML-based system. The >10,000 student-generated evolutionary explanations used to develop EvoGrader comprise a unique corpus that may lack equivalencies in the trillions of documents used in LLM training and tuning.
Science educators are accustomed to working with unique student response data in terms of population (e.g., children), discourse (e.g., scientific practices), and domain (e.g., focused concepts within disciplines) that may be comparatively sparse in LLM training data. Educators’ interests in detecting non-normative reasoning patterns (“misconceptions”) are also unlikely to have been considered in LLM training corpus design. Recent work in science education showing that fine-tuning LLMs with student response corpora has improved outcomes supports this line of reasoning. Liu et al. (2023), for example, reported that augmenting transformer models with additional student work (what they referred to as SciEdBERT) improved model performance.

4.2. Linguistic Features of the Knowledge Domain

Another factor contributing to the differences in scoring success between traditional ML and LLM scoring may relate to the domain of evolutionary biology and its associated linguistic complexities. Unlike many other science domains, evolution is discussed in news articles, internet postings, and blogs using language that is often scientifically inaccurate, confusing, ambiguous, or misleading (Rector et al., 2013). Scientific discourse about evolution is also fraught with interpretive complications (Shiroda et al., 2023); it contains language imbued with lexical ambiguity (e.g., ‘adapt’; Rector et al., 2013). Some widely used disciplinary terms remain undefined in professional journals (e.g., “evolutionary pressures”; see also Nehm et al., 2010), leading to widespread variation in meaning across professional and everyday contexts. Scientists themselves have debated for half a century whether it is appropriate to include teleological or goal-directed language in biology textbooks and scientific articles given its misleading nature (e.g., Simpson, 1959). These issues create unique challenges for determining when teleological terms are scientifically appropriate in student language (Rector et al., 2013). Like need-based language, the conflation of genetic concepts central to the evolutionary idea of variation (e.g., gene and allele) remains common in journal articles, professional discourse, and student language, which likewise makes detection of normative concepts in student explanations challenging for LLMs. Compounding these problems, scientific discourse and jargon patterns change as students gain disciplinary competence (Wang et al., 2024). All of these factors are likely contributors to the difficulty LLMs have in detecting student misconceptions (Figure 2; Table 2). Further work is needed to explore linkages between these scoring challenges and LLM training data distributions (cf. McCoy et al., 2024).
Do LLMs and ML approaches offer comparable methodological advantages for science educators? ML and LLM approaches to science assessment have different context-dependent advantages and disadvantages (Yan et al., 2024; see Table 3). Perhaps the most salient issues in automated assessment evaluation relate to the production of accurate scores and the rationale for setting benchmarks for what “accurate” means. In this study, ML scoring matched extremely high expert scoring agreements in all cases (cf. Moharreri et al., 2014), whereas the LLM results (specifically GPT-4o) did so in only three out of nine cases. Other LLM concept models performed below ML but were fairly robust (F1 > 0.80). Although we can clearly demonstrate that ML produced the best results (Figure 1), it is more difficult to determine whether the LLM results are “good enough”, even though they match published research in the field (e.g., Cohen et al., 2024).
In a synthesis of empirical work on automated scoring, Zhai et al. (2020) documented a median Cohen’s kappa of 0.72 (computer-human agreement) across studies. They suggested that such error rates demonstrate, in a technical sense, the success and utility of computer scoring. In our study, GPT-4o met or exceeded these median values and also met agreement benchmarks often considered near perfect (cf. Landis & Koch, 1977) by many science education researchers (Zhai et al., 2020). In practical terms, however, GPT-4o produced 543 more grading errors than ML. To a classroom teacher or parent, the benefit of saving time by using automated scoring may not be worth the cost of hundreds of additional scoring mistakes on student work. These technical issues intersect with ethical concerns that have been largely missing from automated scoring “success” studies and are discussed in more detail below.
Despite limitations, the broad benefits of using LLMs and ML in science education are clear (Zhu et al., 2020; Li, 2025). Numerous reform documents (AAAS, 2011; NRC, 2012) emphasize the importance of providing K-16 students with many opportunities to use their knowledge to engage in authentic scientific practices (e.g., explaining natural phenomena like biological evolution). Having students practice generating evolutionary explanations (and other scientific products) requires writing (Nehm et al., 2012a; NRC, 2012). University biology classes are often large (500 or more students), making it unlikely that sufficient resources will be available to analyze evolutionary explanations for disciplinary core ideas at a scale actionable for guiding student thinking and revision (Myers & Wilson, 2023; 4500 analytic scores would be needed per learning opportunity). Similar challenges exist for K-12 teachers, albeit on a smaller scale. Opportunities for revision based on feedback have been shown to improve performance on science practices (Zhu et al., 2020; Li, 2025). Ethical challenges arise from the tension between providing high-quality learning experiences at scale and tolerating scoring errors. These issues are expanded upon below.
What ethical considerations should be made when choosing between LLMs and ML tools? Many ethical issues are entangled with the use of AI-generated evaluations of student work (Yan et al., 2024). The scoring of evolutionary explanations most closely aligns with the ethical principles of beneficence and privacy. In general, beneficence involves a priori minimization of possible negative outcomes or harms that could result from an intended action. Much of the science education literature on automated scoring has focused on technical arguments relating to scoring accuracy and, in so doing, implicitly engaged with issues of beneficence. For example, arguments justifying GPT-level error magnitudes (Figure 2, Table 2) as acceptable have included: quantitative comparisons with expert and/or teacher agreement magnitudes; teacher scoring error frequencies; score use in “low stakes” or “formative” contexts; student disengagement with feedback results; and the benefits of some feedback rather than none. Arguments about the benefits of automated scoring have also emphasized the value of freeing up time for often overworked teachers so that they may focus on more high-impact tasks (Zhai et al., 2020).
One major limitation of these technical arguments is that they have mostly occurred in academic journals and have been made by researchers developing and evaluating scoring tools. The ethical dimensions of these arguments (e.g., score benchmarks) have not involved stakeholders (e.g., parents, school boards) in hypothetical or real-world contexts. Beneficence considerations require stakeholder perspectives, yet science education researchers developing and evaluating automated scoring have drawn conclusions about scoring “success” largely on their own. The differences between ML and LLM scoring accuracy documented in our study may seem trivial to some scholars focused on technical performance, but this sidesteps the necessary ethical considerations of beneficence (e.g., whether parents view the differences as meaningful). Much more inclusive beneficence considerations regarding scoring error benchmarks are needed given the expanding role of LLMs in assessment, widely varying operational definitions of scoring “success”, and a focus on technical rather than ethical considerations.
Although explicit and principled integration of stakeholder considerations into evaluations of AI scoring success benchmarks is lacking in science education research (including for EvoGrader), prior work has explored teacher and parent attitudes and perceptions of AI in general (Han et al., 2024). One broad conclusion from this work is that different stakeholder groups have varying degrees of context-dependent AI acceptance and trust that impact AI deployment intentions and perspectives (e.g., Karran et al., 2024, p. 14; Kizilcec, 2024). Parents, for example, were found to be more accepting of AI in education if they had confidence in AI acting responsibly and fairly (Karran et al., 2024). Findings like these highlight the complex interplay among variables (e.g., AI privacy, transparency, explainability, agency) in shaping stakeholder views and demonstrate that careful measurement of multiple constructs beyond scoring accuracy is essential. These findings have relevance to educators’ LLM and ML scoring choices because privacy, explainability, and context dependency (e.g., AI autonomy vs. ‘human in the loop’) vary by automated scoring technology. For example, lower scoring thresholds may be more acceptable to stakeholders in cases of higher AI privacy, consistency, and explainability; higher scoring thresholds may be less acceptable in cases of lower AI privacy, consistency, and explainability. To our knowledge, a limitation of all prior work on automated scoring in science education is that it has not measured a complete suite of contributing variables identified in the literature or quantified their interactions across different educational contexts among relevant stakeholders (e.g., Karran et al., 2024).
The LLMs in our study introduced privacy drawbacks not applicable to the ML tool (Table 4). Unlike many uses of AI, our corpus minimized privacy risk by consisting of anonymous explanations stripped of any identifiers. But privacy concerns remained; student writing products became the property of the proprietary LLM providers. Such work may be used in future profit-generating activities by corporations. Providing student text to LLMs may also complicate ongoing efforts to differentiate human- vs. computer-generated text in online educational settings. Finally, the corporations that developed LLMs are being sued for copyright infringement by many content creators whose proprietary materials were used in model training without permission (Brittain, 2025). Paying for LLM access could be viewed as supporting unethical or illegal activity. It is concerning that these ethical issues have not received attention in papers testing the “success” of LLM scoring.
In addition to ethical issues, science educators need to consider how their time is spent when developing LLM and ML scoring. The rubrics used to produce the training data for our study were not developed exclusively for ML applications, and much of the scored corpus was available prior to model-building efforts. However, if these materials were absent, then the economic costs of ML work would have been substantially greater. Given that many universities and school systems have large science classes that include graded materials, there are numerous opportunities for efficient ML model building by leveraging existing data. ML and LLM work differ in where effort is expended: months of prompt engineering versus months of analyzing and coding student response data. Developing high-quality scoring rubrics is challenging (Sripathi et al., 2024; Wang et al., 2024) but often leads to important insights into student thinking (Nehm, 2024). This issue has also received little attention in the automated scoring literature.
Short-term scoring reliability and long-term replicability are also issues to consider when evaluating the advantages and disadvantages of ML and LLMs. Although assessments may be used for different purposes (Dann, 2014), continuous improvement of student learning outcomes is often a component of university accreditation programs and teacher evaluation systems (e.g., NCATE). Educators often seek to employ measures that can be compared over time or in response to alternative educational materials. Assessment results must be replicable and stable over time so that robust comparisons can be made for evaluation and research purposes. LLM results may not be replicable in the short or long term (e.g., years) because of ongoing LLM evolution and the probabilistic nature of model outputs. Studies of LLM intra-rater reliability (scoring using the same prompt over brief timespans) have documented significant differences in performance (Pack et al., 2024, p. 4), although this challenge can be partially mitigated by approaches such as fixed random seeds. Direct comparisons of scores using LLMs across samples and over time (e.g., multiple years) may be limited. The deterministic nature of ML models can be advantageous in this regard, as EvoGrader has been used to study long-term classroom learning patterns for over a decade (see, e.g., Sbeglia & Nehm, 2024).
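As one hedged illustration, the OpenAI Chat Completions API exposes parameters intended to reduce run-to-run variability; the sketch below shows how a fixed seed and zero temperature might be requested. The API documents such reproducibility as best-effort rather than guaranteed, and this is not the configuration used in the present study.

```python
# Hedged sketch: requesting more reproducible LLM scoring runs.
# The seed parameter is treated as best-effort by the API, so outputs may still
# drift across model versions; this is an illustration, not the study's setup.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Score this response: ... (placeholder)"}],
    temperature=0,   # minimize sampling randomness
    seed=12345,      # request reproducible sampling (best-effort)
)
print(response.system_fingerprint)  # changes when the backend model/config changes
print(response.choices[0].message.content)
```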

5. Study Limitations

Ongoing LLM model, corpus, and training improvements, along with new prompt engineering strategies, will likely increase LLM scoring success to levels sought by teachers and parents. It is important to emphasize, however, that the LLM performance we report (Figure 2) is in line with the vast majority of published work on text-based student explanations (e.g., Cohen et al., 2024). Nevertheless, there remain several limitations and many opportunities for further LLM improvement.
First, our study was designed to compare different LLM models by keeping the prompt and other settings constant, varying only the LLM. However, our results did not align with published benchmarks like MMLU, suggesting that for our specific task we may need to evaluate models independently. With the continuous emergence of newer and more powerful models, a key limitation is that we have not yet explored the full range of available LLMs or attempted to tailor prompts to each model.
Second, although the GPT-4o results matched top performance in published work, only very basic prompt engineering was used. Few-shot and chain-of-thought (COT) prompting approaches, along with fine-tuning techniques, may enhance LLM performance (Lee et al., 2024; Stahl et al., 2024). We can only claim that our very basic LLM prompts did not match the EvoGrader ML results, not that LLMs in general will not be able to reach ML levels.
Third, rubric formats play a role in scoring success, and recent work has suggested that holistic rubrics enhance LLM scoring success relative to analytic rubrics with multiple criteria (as was the case in our study) (Pack et al., 2024). Moving away from the analytical published rubrics of Nehm et al. (2010) and developing holistic criteria may improve LLM scoring success to match or exceed the ML results.
Fourth, breaking down the experiment into multiple steps is another strategy for improvement. For example, one representative prompt may instruct the model to identify individual terms and text features (such as mutation, variation, recombination, and specific genotypes) based on predefined criteria, assign binary scores to each, sum them, and return a Boolean output based on the total. This process could be decomposed into sequential subtasks—each checking for one category of keywords and returning intermediate Boolean results—before applying a final conditional check. This workflow would improve run time and likely improve scoring success.
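A minimal sketch of this decomposition is shown below. The keyword lists are illustrative, and the per-category check is written as a simple keyword function standing in for what could instead be a separate LLM subtask call (e.g., one Chat Completions request per category).

```python
# Illustrative decomposition of one concept score into per-category subtasks.
# Each subtask returns an intermediate Boolean; a final conditional combines them.
# The keyword lists are invented examples, and check_category() is a stand-in for
# what could instead be a separate LLM call per category.
CATEGORIES = {
    "mutation_terms": ["mutation", "mutated", "mutate"],
    "variation_terms": ["variation", "vary", "differences among"],
    "recombination_terms": ["recombination", "crossing over"],
}

def check_category(response_text: str, keywords: list[str]) -> bool:
    """Subtask: does the response mention any keyword in this category?"""
    text = response_text.lower()
    return any(keyword in text for keyword in keywords)

def score_variation(response_text: str) -> bool:
    """Final conditional check: concept scored present if any subtask fires."""
    intermediate = {name: check_category(response_text, kws)
                    for name, kws in CATEGORIES.items()}
    return any(intermediate.values())

print(score_variation("Some plants may have had mutated genetic material."))  # True
```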
Fifth, although many instructors have adopted EvoGrader, we do not know which combination of factors contributed to this choice (or, for that matter, why others chose not to adopt it). Issues of privacy, accuracy, and replicability that are characteristic of the LLMs do not apply to EvoGrader, but, nevertheless, more careful work with stakeholders is essential for establishing more rigorous and holistic benchmarks for scoring success.

6. Conclusions

This study emphasizes that no single technology may be universally superior for all automated scoring contexts in science education, and the novelty of LLMs should not preclude empirical tests of alternative approaches (particularly traditional ML). For now, ML-based tools remain valuable for tasks prioritizing precision, reliability, replicability, and controlled implementation, while LLMs hold promise for broader, more dynamic applications that can tolerate potential inaccuracies. The unique nature of student discourse, the ambiguities inherent in the language of particular science domains (e.g., evolution), the distribution of student language in LLM training corpora, and the complexities of prompt engineering expertise pose ongoing challenges for LLMs. Moreover, studies of LLM and ML scoring success must move beyond a focus on technical accuracy alone and incorporate ethical considerations (e.g., beneficence considerations require teachers and parents to be involved in decisions about the use of scores). As both ML and LLM technologies evolve, ongoing evaluations of context, cost-benefit, and ethics will be essential to maximize their potential in automated assessment in science education.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/educsci15060676/s1, prompt codes.

Author Contributions

Conceptualization, Y.P. and R.H.N.; methodology, Y.P. and R.H.N.; formal analysis, Y.P.; investigation, Y.P.; resources, R.H.N.; data curation, Y.P.; writing—original draft preparation, Y.P. and R.H.N.; writing—review and editing, Y.P. and R.H.N.; visualization, Y.P. and R.H.N.; supervision, R.H.N.; funding acquisition, R.H.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Howard Hughes Medical Institute Inclusive Excellence grant, and the APC was funded by the National Science Foundation (DUE 2318346). The views expressed in this article are those of the authors and not HHMI or NSF.

Institutional Review Board Statement

Not applicable; the analyses used de-identified data collected prior to this study.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available from the authors.

Acknowledgments

The authors acknowledge HHMI funding and the very significant improvements to our manuscript made by two anonymous reviewers. These comments provided useful literature, ideas for future work, and also helped improve the quality of our contribution.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of the data, in the writing of the manuscript, or in the decision to publish the results. AI was not used in the writing of this manuscript.

Abbreviations

The following abbreviations are used in this manuscript:
API: Application Programming Interface
LLM: Large Language Model
ML: Machine Learning
NCATE: National Council for Accreditation of Teacher Education
COT: Chain of Thought

Note

1. See Supplementary Materials for equations.

References

  1. American Association for the Advancement of Science. (2011). Vision and change in undergraduate biology education: A call to action. American Association for the Advancement of Science. [Google Scholar]
  2. Beggrow, E. P., Ha, M., Nehm, R. H., Pearl, D., & Boone, W. J. (2014). Assessing scientific practices using machine-learning methods: How closely do they match clinical interview performance? Journal of Science Education and Technology, 23(1), 160–182. [Google Scholar] [CrossRef]
  3. Brittain, B. (2025, February 20). OpenAI must face part of Intercept lawsuit over AI training. Reuters. Available online: www.reuters.com (accessed on 22 May 2025).
  4. Cohen, C., Hutchins, N., Le, T., & Biswas, G. (2024). A Chain-of-thought prompting approach with LLMs for evaluating students’ formative assessment responses in science. arXiv, arXiv:2403.14565v1. [Google Scholar]
  5. Dann, R. (2014). Assessment as learning: Blurring the boundaries of assessment and learning for theory, policy and practice. Assessment in Education: Principles, Policy & Practice, 21(2), 149–166. [Google Scholar] [CrossRef]
  6. Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., … Ganapathy, R. (2024). The llama 3 herd of models. arXiv, arXiv:2407.21783. [Google Scholar]
  7. Gemma Team. (2024). Gemma 2: Improving open language models at a practical size. arXiv, arXiv:2408.00118. [Google Scholar]
  8. Gerard, L., & Linn, M. C. (2022). Computer-based guidance to support students’ revision of their science explanations. Computers & Education, 176, 104351. [Google Scholar] [CrossRef]
  9. Ha, M., Nehm, R. H., Urban-Lurain, M., & Merrill, J. E. (2011). Applying computerized scoring models of written biological explanations across courses and colleges: Prospects and limitations. CBE-Life Sciences Education, 10(4), 379–393. [Google Scholar] [CrossRef]
  10. Han, A., Zhou, X., Cai, Z., Han, S., Ko, R., Corrigan, S., & Peppler, K. A. (2024, May 11–16). Teachers, parents, and students’ perspectives on integrating generative AI into elementary literacy education [Paper presentation]. CHI Conference on Human Factors in Computing Systems (CHI’24) (, 17p), Honolulu, HI, USA. [Google Scholar] [CrossRef]
  11. Harris, Z. (1954). Distributional structure. Word, 10, 146–162. [Google Scholar] [CrossRef]
  12. Haudek, K., Kaplan, J., Knight, J., Long, T., Merrill, J., Munn, A., Smith, M., & Urban-Lurain, M. (2011). Harnessing technology to improve formative assessment of student conceptions in STEM: Forging a national network. CBE-Life Sciences Education, 10, 149–155. [Google Scholar] [CrossRef]
  13. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2020). Measuring massive multitask language understanding. arXiv, arXiv:2009.03300. [Google Scholar]
  14. Huebner, P. A., Sulem, E., Cynthia, F., & Roth, D. (2021, November 10–11). BabyBERTa: Learning more grammar with small-scale child-directed language [Paper presentation]. 25th Conference on Computational Natural Language Learning (pp. 624–646), Online. [Google Scholar]
  15. Karran, A. J., Charland, P., Martineau, J. T., de Arana, A., Lesage, A. M., Senecal, S., & Leger, P. M. (2024). Multi-stakeholder perspective on responsible artificial intelligence and acceptability in education. arXiv, arXiv:2402.15027. [Google Scholar]
  16. Kizilcec, R. F. (2024). To advance AI use in education, focus on understanding educators. International Journal of Artificial Intelligence in Education, 34, 12–19. [Google Scholar] [CrossRef] [PubMed]
  17. Kundu, A., & Barbosa, D. (2024). Are large language models good essay graders? arXiv. [Google Scholar] [CrossRef]
  18. Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 1159–1174. [Google Scholar] [CrossRef]
  19. Latif, E., & Zhai, X. (2024). Automatic scoring of students’ science writing using hybrid neural network. Proceedings of Machine Learning Research, 257, 97–106. [Google Scholar]
  20. Lee, G.-G., Latif, E., Wu, X., Liu, N., & Zhai, X. (2024). Applying large language models and chain-of-thought for automatic scoring. Computers and Education: Artificial Intelligence, 6, 100213. [Google Scholar] [CrossRef]
  21. Li, W. (2025). Applying natural language processing adaptive dialogs to promote knowledge integration during instruction. Education Sciences, 15(2), 207. [Google Scholar] [CrossRef]
  22. Liu, Z., He, X., Liu, L., Liu, T., & Zhai, X. (2023). Context matters: A strategy to pre-train language model for science education. arXiv, arXiv:2301.12031. [Google Scholar]
  23. Magliano, J. P., & Graesser, A. C. (2012). Computer-based assessment of student-constructed responses. Behavior Research Methods, 44(3), 608–621. [Google Scholar] [CrossRef]
  24. McCoy, R. T., Yao, S., Friedman, D., Hardy, M. D., & Griffiths, T. L. (2024). Embers of autoregression show how large language models are shaped by the problem they are trained to solve. Proceedings of the National Academy of Sciences of the United States of America, 121(41), e2322420121. [Google Scholar] [CrossRef]
  25. Moharreri, K., Ha, M., & Nehm, R. H. (2014). EvoGrader: An online formative assessment tool for automatically evaluating written evolutionary explanations. Evolution: Education and Outreach, 7, 15. [Google Scholar] [CrossRef]
  26. Moore, S., Nguyen, H. A., Chen, T., & Stamper, J. (2023). Assessing the quality of multiple-choice questions using GPT-4 and rule-based methods. In European conference on technology enhanced learning (pp. 229–245). Springer. [Google Scholar]
  27. Myers, M. C., & Wilson, J. (2023). Evaluating the construct validity of an automated writing evaluation system with a randomization algorithm. International Journal of Artificial Intelligence in Education, 33(3), 609–634. [Google Scholar] [CrossRef]
  28. National Research Council (NRC). (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. National Academies Press. [Google Scholar]
  29. Nehm, R. H. (2024, November 21). AI in biology education assessment: How automation can drive educational transformation. In X. Zhai, & J. Krajcik (Eds.), Uses of artificial intelligence in STEM education (online ed.). Oxford Academic. [Google Scholar] [CrossRef]
  30. Nehm, R. H., Beggrow, E. P., Opfer, J. E., & Ha, M. (2012b). Reasoning about natural selection: Diagnosing contextual competency using the ACORNS instrument. The American Biology Teacher, 74(2), 92–98. [Google Scholar] [CrossRef]
  31. Nehm, R. H., Ha, M., & Mayfield, E. (2012a). Transforming biology assessment with machine learning: Automated scoring of written evolutionary explanations. Journal of Science Education and Technology, 21(1), 183–196. [Google Scholar] [CrossRef]
  32. Nehm, R. H., Ha, M., Rector, M., Opfer, J. E., Perrin, L., Ridgway, J., & Mollohan, K. (2010). Scoring guide for the open response instrument (ORI) and evolutionary gain and loss test (ACORNS) (Unpublished Technical Report of National Science Foundation REESE Project 0909999). National Science Foundation. [Google Scholar]
  33. Nyaaba, M., Zhai, X., & Faison, M. Z. (2024). Generative AI for culturally responsive science assessment: A conceptual framework. Education Sciences, 14(12), 1325. [Google Scholar] [CrossRef]
  34. Oli, P., Banjade, R., Chapagain, J., & Rus, V. (2023). Automated Assessment of Students’ Code Comprehension using LLMs. arXiv, arXiv:2401.05399. [Google Scholar]
  35. OpenAI. (2024). Hello gpt-4o. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 22 May 2025).
  36. OpenAI. (2025). Tactic: Ask the model to adopt a persona. OpenAI Platform. Available online: https://platform.openai.com/docs/guides/prompt-engineering#tactic-ask-the-model-to-adopt-a-persona (accessed on 31 March 2025).
  37. Opfer, J. E., Nehm, R. H., & Ha, M. (2012). Cognitive foundations for science assessment design: Knowing what students know about evolution. Journal of Research in Science Teaching, 49(6), 744–777. [Google Scholar] [CrossRef]
  38. Pack, A., Barrett, A., & Escalante, J. (2024). Large language models and automated essay scoring of English language learner writing: Insights into reliability and validity. Computers and Education: Artificial Intelligence, 6, 100234. [Google Scholar] [CrossRef]
  39. Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). MIT Press. [Google Scholar]
  40. Rector, M. A., Nehm, R. H., & Pearl, D. (2013). Learning the language of evolution: Lexical ambiguity and word meaning in student explanations. Research in Science Education, 43, 1107–1133. [Google Scholar] [CrossRef]
  41. Riordan, B., Bichler, S., Bradford, A., Chen, J. K., Wiley, K., Gerard, L., & Linn, M. C. (2020, July 10). An empirical investigation of neural methods for content scoring of science explanations. Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 135–144), Seattle, WA, USA. [Google Scholar]
  42. Sbeglia, G. C., & Nehm, R. H. (2024). Building conceptual and methodological bridges between SSE’s diversity, equity, and inclusion statement and educational actions in evolutionary biology. Evolution, 78(5), 809–820. [Google Scholar] [CrossRef]
  43. Shiroda, M., Doherty, J., Scott, E., & Haudek, K. (2023). Covariational reasoning and item context affect language in undergraduate mass balance written explanations. Advances in Physiology Education, 47, 762–775. [Google Scholar] [CrossRef] [PubMed]
  44. Simpson, G. G. (1959). On Eschewing teleology. Science, 129(3349), 672–675. [Google Scholar] [CrossRef] [PubMed]
  45. Sripathi, K. N., Moscarella, R. A., Steele, M., Yoho, R., You, H., Prevost, L. B., Urban-Lurain, M., Merrill, J., & Haudek, K. C. (2024). Machine learning mixed methods text analysis: An illustration from automated scoring models of student writing in biology education. Journal of Mixed Methods Research, 18(1), 48–70. [Google Scholar] [CrossRef]
  46. Stahl, M., Biermann, L., Nehring, A., & Wachsmuth, H. (2024, June 20). Exploring LLM prompting strategies for joint essay scoring and feedback generation [Paper presentation]. 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024) (pp. 283–298), Mexico City, Mexico. [Google Scholar]
  47. Stribling, D., Xia, Y., Amer, M. K., Graim, K. S., & Mulligan, C. J. (2024). The model student: GPT-4 performance on graduate biomedical science exams. Scientific Reports, 14, 5670. [Google Scholar] [CrossRef]
  48. Wang, H., Haudek, K. C., Manzanares, A. D., Romulo, C. L., & Royse, E. A. (2024). Extending a pretrained language model (BERT) using an ontological perspective to classify students’ scientific expertise level from written responses. Research Square. [Google Scholar] [CrossRef]
  49. Wei, J., Garrette, D., Linzen, T., & Pavlick, E. (2021, November 7–11). Frequency effects on syntactic rule learning in transformers. 2021 Conference on Empirical Methods in Natural Language Processing (pp. 932–948), Online and Punta Cana, Dominican Republic. [Google Scholar]
  50. Wu, X., Saraf, P. P., Lee, G. G., Latif, E., Liu, N., & Zhai, X. (2024). Unveiling scoring processes: Dissecting the differences between LLMs and Human graders in automatic scoring. arXiv, arXiv:2407.18328. [Google Scholar] [CrossRef]
  51. Yan, L., Sha, L., Zhao, L., Li, Y., Martinez-Maldonado, R., Chen, G., Li, X., Jin, Y., & Gašević, D. (2024). Practical and ethical challenges of large language models in education: A systematic scoping review. British Journal of Educational Technology, 55, 90–112. [Google Scholar] [CrossRef]
  52. Zhai, X. (2023). Chatgpt and AI: The game changer for education. In X. Zhai (Ed.), ChatGPT: Reforming education on five aspects (pp. 16–17). Shanghai Education. [Google Scholar]
  53. Zhai, X., & Nehm, R. H. (2023). AI and formative assessment: The train has left the station. Journal of Research in Science Teaching, 60(6), 1390–1398. [Google Scholar] [CrossRef]
  54. Zhai, X., Neumann, K., & Krajcik, J. (2023). AI for tackling STEM education challenges. Frontiers in Education, 8, 1183030. [Google Scholar] [CrossRef]
  55. Zhai, X., Shi, L., & Nehm, R. H. (2021). A meta-analysis of machine learning-based science assessments: Factors impacting machine-human score agreements. Journal of Science Education and Technology, 30(3), 361–379. [Google Scholar] [CrossRef]
  56. Zhai, X., Yin, Y., Pellegrino, J. W., Haudek, K. C., & Shi, L. (2020). Applying machine learning in science assessment: A systematic review. Studies in Science Education, 56(1), 111–151. [Google Scholar] [CrossRef]
  57. Zhu, M., Liu, O. L., & Lee, H.-S. (2020). The effect of automated feedback on revision behavior and learning gains in formative assessment of scientific argument writing. Computers & Education, 143, 103668. [Google Scholar]
Figure 1. Overview of the research design. The blocks with a grey background represent elements from previous research, the blocks with a green background represent open-source models and methods, the block with a red background represents a proprietary model, and the blocks with a blue background are specifically tailored to this study.
Figure 2. Box plots illustrate nine key concepts’ scoring accuracy (F1 values) for the LLMs (GPT, Llama, Gemma) and traditional ML-based EvoGrader. The boxes represent the interquartile range (IQR), covering the middle 50% of the values. The whiskers extend to the minimum and maximum values within the range, while individual dots (circles) indicate outliers beyond the whiskers. The line inside the box denotes the median value. See Table 2 and Figure 3 for additional agreement statistics across models for particular concepts.
Figure 2. Box plots illustrate nine key concepts’ scoring accuracy (F1 values) for the LLMs (GPT, Llama, Gemma) and traditional ML-based EvoGrader. The boxes represent the interquartile range (IQR), covering the middle 90% of the values. The whiskers extend to the minimum and maximum values within the range, while individual dots indicate outliers beyond the whiskers. The line inside the box denotes the median value. See Table 1 and Figure 2 for additional agreement statistics across models for particular concepts. The circle refers to outliers.
Figure 3. Scoring success (F1 values) for the nine concepts evaluated in students’ evolutionary explanations (n = 1000) across three different LLMs and one ML-based tool (EvoGrader). White boxes indicate the best-performing model. ML = Machine Learning. See the text for additional details on the concepts and Table 2 for additional agreement statistics across models.
Table 1. Example of a student explanation (italics) and its associated scoring across the ML and LLM systems. True = concept scored as present; false = concept scored as absent. Additional examples are provided in the Supplementary Materials.
“To explain the presence of pulegone in later generations of a plant that had not previously contained pulegone in ancestral plants, biologists could consider the changes in the environment that may have taken place over the time period that led to the development of pulegone in plants. Possible environmental influences may have prompted a need for adaptation by the labiatae that would require pulegone in order to thrive. Biologists could use human development and disruption, climate changes, predation, and other such influences to explain the current presence of pulegone in plants. Biologists could explain a slow process of evolution that took place over an extended period of time leading up to the addition of pulegone. Biologists could also explain that genetic variation may have resulted in pulegone. Over time, some of the plants may have had mutated genetic material that caused for the [sic] presence of pulegone. If these plants were then able to survive better in the conditions of their environment rather than those of plants without the mutation, this would yield favorable genetic variation that would become more prevalent in surviving plants. With time, the plants with the mutated genes that cause pulegone to occur would out-live the ancestral plants lacking the gene and through generations of reproduction, the new labiatae all contain this material.”
Student Explanation Scoring | KC1 (Presence/Causes of Variation) | KC2 (Heritability) | KC3 (Competition) | KC5 (Limited Resources) | KC6 (Differential Survival/Reproduction) | NI1 (Inappropriate Teleology) | NI2 (Use-Disuse) | NI3 (Adaptation as Acclimation) | NAR (Non-Adaptive Causation)
True Label | T | F | F | T | T | T | F | F | F
EvoGrader | TRUE | FALSE | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE
ChatGPT-4o | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | FALSE
Llama 3.1-405b | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | FALSE
Llama 3.1-70b | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | FALSE
Llama 3.1-8b | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE
Llama 3-8b | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE
Gemma 2-9b | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE
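As a minimal illustration of how rows like those in Table 1 translate into agreement data, the sketch below encodes each model’s per-concept scores as booleans (in the concept order used above) and tallies mismatches against the expert labels. The variable names and the restriction to two models are illustrative only and are not part of the study’s actual pipeline.

```python
# Hypothetical illustration using the Table 1 values
# (concept order: KC1, KC2, KC3, KC5, KC6, NI1, NI2, NI3, NAR).
CONCEPTS = ["KC1", "KC2", "KC3", "KC5", "KC6", "NI1", "NI2", "NI3", "NAR"]

true_label = [True, False, False, True, True, True, False, False, False]
model_scores = {
    "EvoGrader":  [True, False, False, True, True, True, False, False, False],
    "ChatGPT-4o": [True, True,  True,  True, True, True, False, True,  False],
}

# Count each model's per-concept mismatches against the expert labels.
for model, scores in model_scores.items():
    errors = [c for c, s, t in zip(CONCEPTS, scores, true_label) if s != t]
    print(f"{model}: {len(errors)} mismatched concept(s): {errors}")
```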
Table 2. Scoring accuracy across nine key concepts in student explanations for the LLMs and ML (EvoGrader) models. Kappa, accuracy, precision, recall, and F1 were used to measure scoring accuracy. Bolded and underlined values indicate the highest scoring accuracy using F1.
Model | Time (h:min) | Concept | Kappa | Accuracy | Precision | Recall | F1
Gemma2-9b | 0:44 | Adapt = acclimate | 0.048 | 0.307 | 0.552 | 0.597 | 0.298
Llama3-8b | 0:51 | Adapt = acclimate | 0.129 | 0.526 | 0.571 | 0.692 | 0.462
Llama3.1-8b | 0:45 | Adapt = acclimate | 0.106 | 0.553 | 0.553 | 0.646 | 0.468
Llama3.1-405b | 2:17 | Adapt = acclimate | 0.155 | 0.571 | 0.579 | 0.718 | 0.495
Llama3.1-70b | 0:46 | Adapt = acclimate | 0.171 | 0.586 | 0.585 | 0.735 | 0.507
ChatGPT-4o | 0:12 | Adapt = acclimate | 0.367 | 0.830 | 0.647 | 0.780 | 0.676
EvoGrader | 0:04 | Adapt = acclimate | 0.846 | 0.972 | 0.923 | 0.923 | 0.923
Llama3.1-8b | 0:50 | KC1 Variation | 0.481 | 0.730 | 0.755 | 0.771 | 0.729
Llama3.1-70b | 0:20 | KC1 Variation | 0.569 | 0.777 | 0.800 | 0.820 | 0.776
Llama3.1-405b | 1:11 | KC1 Variation | 0.615 | 0.804 | 0.815 | 0.840 | 0.802
Llama3-8b | 0:55 | KC1 Variation | 0.810 | 0.916 | 0.932 | 0.889 | 0.905
Gemma2-9b | 0:40 | KC1 Variation | 0.836 | 0.924 | 0.917 | 0.919 | 0.918
ChatGPT-4o | 0:12 | KC1 Variation | 0.850 | 0.933 | 0.925 | 0.926 | 0.925
EvoGrader | 0:04 | KC1 Variation | 0.939 | 0.972 | 0.975 | 0.965 | 0.969
Llama3.1-8b | 0:53 | KC2 Heritability | 0.526 | 0.818 | 0.730 | 0.860 | 0.755
Llama3.1-405b | 0:48 | KC2 Heritability | 0.652 | 0.876 | 0.786 | 0.914 | 0.823
Llama3.1-70b | 0:14 | KC2 Heritability | 0.700 | 0.897 | 0.810 | 0.929 | 0.848
Llama3-8b | 1:09 | KC2 Heritability | 0.766 | 0.936 | 0.895 | 0.872 | 0.883
Gemma2-9b | 0:40 | KC2 Heritability | 0.818 | 0.945 | 0.885 | 0.939 | 0.909
ChatGPT-4o | 0:11 | KC2 Heritability | 0.792 | 0.950 | 0.915 | 0.880 | 0.896
EvoGrader | 0:04 | KC2 Heritability | 0.941 | 0.984 | 0.991 | 0.953 | 0.970
Llama3.1-8b | 0:58 | KC3 Competition | 0.013 | 0.277 | 0.513 | 0.632 | 0.233
Gemma2-9b | 0:41 | KC3 Competition | 0.140 | 0.814 | 0.546 | 0.905 | 0.533
Llama3.1-70b | 0:15 | KC3 Competition | 0.257 | 0.903 | 0.582 | 0.951 | 0.615
Llama3-8b | 0:44 | KC3 Competition | 0.409 | 0.949 | 0.636 | 0.974 | 0.700
Llama3.1-405b | 0:18 | KC3 Competition | 0.635 | 0.979 | 0.738 | 0.989 | 0.817
ChatGPT-4o | 0:11 | KC3 Competition | 0.951 | 0.998 | 0.955 | 0.999 | 0.976
EvoGrader | 0:04 | KC3 Competition | 1.000 | 1.000 | 1.000 | 1.000 | 1.000
Llama3.1-70b | 0:16 | KC5 Limited resources | 0.279 | 0.596 | 0.663 | 0.738 | 0.581
Gemma2-9b | 0:40 | KC5 Limited resources | 0.289 | 0.610 | 0.663 | 0.742 | 0.592
Llama3.1-405b | 1:23 | KC5 Limited resources | 0.374 | 0.680 | 0.691 | 0.791 | 0.654
Llama3-8b | 0:45 | KC5 Limited resources | 0.541 | 0.829 | 0.749 | 0.807 | 0.769
Llama3.1-8b | 0:38 | KC5 Limited resources | 0.646 | 0.893 | 0.859 | 0.797 | 0.822
ChatGPT-4o | 0:11 | KC5 Limited resources | 0.832 | 0.940 | 0.892 | 0.949 | 0.916
EvoGrader | 0:04 | KC5 Limited resources | 0.963 | 0.988 | 0.989 | 0.975 | 0.982
Llama3.1-405b | 1:51 | KC6 Differential survival | 0.484 | 0.738 | 0.807 | 0.746 | 0.726
Llama3.1-70b | 0:20 | KC6 Differential survival | 0.502 | 0.747 | 0.817 | 0.755 | 0.736
Llama3.1-8b | 1:28 | KC6 Differential survival | 0.570 | 0.783 | 0.811 | 0.788 | 0.780
Gemma2-9b | 0:41 | KC6 Differential survival | 0.600 | 0.798 | 0.828 | 0.803 | 0.795
Llama3-8b | 0:40 | KC6 Differential survival | 0.666 | 0.833 | 0.834 | 0.834 | 0.833
ChatGPT-4o | 0:09 | KC6 Differential survival | 0.701 | 0.850 | 0.862 | 0.852 | 0.849
EvoGrader | 0:04 | KC6 Differential survival | 0.910 | 0.955 | 0.955 | 0.955 | 0.955
Llama3.1-8b | 0:32 | Non-Adaptive reasoning | 0.188 | 0.896 | 0.564 | 0.721 | 0.584
Llama3.1-70b | 0:16 | Non-Adaptive reasoning | 0.376 | 0.923 | 0.628 | 0.926 | 0.681
Gemma2-9b | 0:50 | Non-Adaptive reasoning | 0.377 | 0.932 | 0.631 | 0.878 | 0.684
Llama3-8b | 0:52 | Non-Adaptive reasoning | 0.373 | 0.957 | 0.656 | 0.735 | 0.686
Llama3.1-405b | 2:36 | Non-Adaptive reasoning | 0.527 | 0.960 | 0.698 | 0.910 | 0.762
ChatGPT-4o | 0:13 | Non-Adaptive reasoning | 0.707 | 0.981 | 0.806 | 0.921 | 0.853
EvoGrader | 0:04 | Non-Adaptive reasoning | 0.982 | 0.999 | 0.983 | 1.000 | 0.991
Gemma2-9b | 0:42 | Need (teleology) | 0.180 | 0.485 | 0.642 | 0.666 | 0.483
Llama3.1-8b | 0:35 | Need (teleology) | 0.370 | 0.699 | 0.678 | 0.760 | 0.664
Llama3-8b | 0:38 | Need (teleology) | 0.357 | 0.712 | 0.666 | 0.735 | 0.665
Llama3.1-70b | 0:15 | Need (teleology) | 0.471 | 0.754 | 0.723 | 0.821 | 0.721
Llama3.1-405b | 1:35 | Need (teleology) | 0.580 | 0.824 | 0.765 | 0.856 | 0.785
ChatGPT-4o | 0:14 | Need (teleology) | 0.678 | 0.875 | 0.820 | 0.867 | 0.839
EvoGrader | 0:04 | Need (teleology) | 0.904 | 0.968 | 0.962 | 0.943 | 0.952
Gemma2-9b | 0:42 | Use/disuse inheritance | 0.177 | 0.763 | 0.565 | 0.840 | 0.546
Llama3.1-405b | 0:31 | Use/disuse inheritance | 0.226 | 0.817 | 0.579 | 0.856 | 0.586
Llama3.1-70b | 0:14 | Use/disuse inheritance | 0.283 | 0.861 | 0.598 | 0.866 | 0.626
Llama3.1-8b | 0:31 | Use/disuse inheritance | 0.274 | 0.876 | 0.596 | 0.813 | 0.625
ChatGPT-4o | 0:12 | Use/disuse inheritance | 0.687 | 0.963 | 0.796 | 0.913 | 0.843
Llama3-8b | 0:40 | Use/disuse inheritance | 0.427 | 0.965 | 0.779 | 0.674 | 0.713
EvoGrader | 0:04 | Use/disuse inheritance | 1.000 | 1.000 | 1.000 | 1.000 | 1.000
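The agreement statistics in Table 2 are standard classification metrics computed per concept by comparing machine scores with human (expert) scores. The short sketch below shows how such values could be reproduced from binary present/absent labels using scikit-learn; the label vectors are toy data, and because the paper does not state which averaging scheme was used for precision, recall, and F1, the defaults shown here are an assumption.

```python
# Minimal sketch of the agreement statistics reported in Table 2 for one concept,
# assuming scikit-learn and binary present/absent (1/0) labels.
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             precision_score, recall_score, f1_score)

# Hypothetical expert (human) labels and machine scores for a single concept.
human   = [1, 0, 0, 1, 1, 1, 0, 0, 1, 0]
machine = [1, 0, 1, 1, 1, 1, 0, 0, 0, 0]

print("Kappa:    ", cohen_kappa_score(human, machine))
print("Accuracy: ", accuracy_score(human, machine))
# Note: averaging choices (e.g., average="macro") would change the values below.
print("Precision:", precision_score(human, machine))
print("Recall:   ", recall_score(human, machine))
print("F1:       ", f1_score(human, machine))
```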
Table 3. Implications of this study: contextual factors to consider when choosing between machine learning and large language model scoring approaches.
Topic | Machine Learning | Large Language Models
Scoring accuracy | Excellent; comparable to or exceeding expert human inter-rater agreement. Useful for higher-stakes assessment contexts or cases where high accuracy is required (e.g., high-quality feedback). | Very good in the case of GPT-4o, although it does not match expert human raters. May be appropriate for lower-stakes assessment contexts or intermediate-quality feedback.
Scoring replication over time and across samples | Results are replicable (i.e., deterministic) and stable over time, facilitating robust comparisons across samples and over time (e.g., many years) for evaluation and research purposes. | Results may not be replicable because of LLM evolution and the probabilistic nature of model outputs (unless a fixed random seed is used; see the sketch after this table). Direct comparisons across samples and over time may be limited.
Hardware, software, and technology expertise | Desktop computers can easily perform rapid model building and testing; there is no need for high-end computational resources (e.g., APIs) or personnel with specialized computer science (CS) knowledge. | Large datasets (e.g., thousands of student responses) require significant computational resources or substantial processing time. Multiple applications and complex workflows are needed to perform scoring (see Section 2). Specialized CS knowledge is required.
Economics and time | Human scoring for ML training and model building is costly and time-consuming. Model deployment does not incur costs for large-scale scoring and has no recurring costs, given that it can be executed on inexpensive hardware. | Prompt engineering can be costly in terms of technology and personnel, and it may never yield scores at the needed levels of accuracy. Using LLMs requires recurring payments. LLM run time is substantially longer (see the run-time data in Table 2).
Ethical considerations | Clear consent procedures can be used to gather data, with the option of declining to release information. Data need not be stored. | Consent is complex or implied when using “free” proprietary versions of LLMs. Information submitted to proprietary LLMs may be stored and used by corporations in profit-generating efforts. Ownership of digital property, including copyrighted material, may be disputed. Using LLMs may also implicate users in copyright-infringement disputes raised by content creators.
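On the replicability point in the table above, one common mitigation (not described as part of this study’s workflow) is to pin the decoding parameters when calling a proprietary LLM API. The sketch below assumes the OpenAI Python client; the model name, prompt, and seed value are illustrative, and even a fixed seed provides only best-effort determinism that may not hold across model updates.

```python
# Hedged sketch: reducing run-to-run variability in LLM-based scoring,
# assuming the OpenAI Python client. Prompt text and seed are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,  # limit sampling randomness
    seed=42,        # best-effort determinism; not guaranteed across model versions
    messages=[
        {"role": "system",
         "content": "Score the explanation for each key concept as TRUE or FALSE."},
        {"role": "user", "content": "<student explanation text>"},
    ],
)
print(response.choices[0].message.content)
```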
Table 4. Overview of different cloud services along with their respective privacy policies.
Service | Models | Privacy Policy
Azure | Llama 3, Llama 3.1, and Gemma (LLM) | “With Azure, you are the owner of the data that you provide for storing and hosting…We do not share your data with advertiser-supported services, nor do we mine them for any purposes like marketing research or advertising…We process your data only with your agreement, and when we have your agreement, we use your data to provide only the services you have chosen.” (https://azure.microsoft.com/en-us/explore/trusted-cloud/privacy, accessed on 22 May 2025)
OpenAI API | ChatGPT-4o (LLM) | “We collect personal data that you provide in the input to our services…including your prompts and other content you upload, such as files, images, and audio.” “How we use personal data: to improve and develop our services and conduct research, for example, to develop new product features.” (https://openai.com/policies/row-privacy-policy/, accessed on 22 May 2025)
AWS | EvoGrader (ML) | Files uploaded to EvoGrader are not retained, and no personal information is collected by the EvoGrader system. (www.evograder.org, accessed on 22 May 2025)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
