Article

Evaluating LLMs for Automated Scoring in Formative Assessments

by Pedro C. Mendonça 1,2,*, Filipe Quintal 1,2 and Fábio Mendonça 1,2
1 Faculty of Exact Sciences and Engineering, Penteada University Campus, University of Madeira, 9000-082 Funchal, Portugal
2 Interactive Technologies Institute (ITI/LARSyS) and ARDITI, 9020-105 Funchal, Portugal
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(5), 2787; https://doi.org/10.3390/app15052787
Submission received: 6 February 2025 / Revised: 25 February 2025 / Accepted: 3 March 2025 / Published: 5 March 2025
(This article belongs to the Special Issue Applied Artificial Intelligence and Data Science)

Abstract

The increasing complexity and scale of modern education have revealed the shortcomings of traditional grading methods in providing consistent and scalable assessments. Advancements in artificial intelligence have positioned Large Language Models (LLMs) as robust solutions for automating grading tasks. This study systematically compared the grading performance of an open-source LLM (LLaMA 3.2) and a premium LLM (OpenAI GPT-4o) against human evaluators across diverse question types in the context of a computer programming subject. Using detailed rubrics, the study assessed the alignment between LLM-generated and human-assigned grades. Results revealed that while both LLMs align closely with human grading, equivalence testing demonstrated that the premium LLM achieves statistically and practically similar grading patterns, particularly for code-based questions, suggesting its potential as a reliable tool for educational assessments. These findings underscore the ability of LLMs to enhance grading consistency, reduce educator workload, and address scalability challenges in programming-focused assessments.

1. Introduction

Grading systems in education, particularly those relying on manual processes and standardized frameworks, face persistent challenges that limit their effectiveness. These systems often struggle to scale efficiently, ensure consistent evaluations, and provide feedback tailored to the diverse needs of individual students [1,2]. The process becomes increasingly resource-intensive and time-consuming for instructors tasked with scoring open-ended, text-based assignments as class sizes grow. Additionally, traditional grading methods are prone to inconsistencies and biases, with evaluations frequently influenced by factors unrelated to the quality of students’ work, such as grader fatigue or subjective interpretations of assessment criteria [3].
The emergence of Large Language Models (LLMs) presents a transformative opportunity to address these issues. These advanced natural language processing tools can evaluate text-based responses at scale, substantially reducing the manual workload for educators while improving scoring consistency and objectivity [1]. By automating and refining key aspects of the scoring process, LLMs offer the potential to overcome the inefficiencies and inequities of traditional approaches, paving the way for a more efficient, fair, and scalable assessment system [2,3].
Recent research has increasingly explored these possibilities, with studies examining how Artificial Intelligence (AI) powered scoring systems are reshaping educational practices. AI-driven scoring technologies have demonstrated substantial growth, offering scalable solutions to address the complexities of evaluating student work while maintaining fairness and accuracy [1,4]. Fagbohun et al. [1] highlight the ability of LLMs to manage complex scoring tasks, such as essay evaluations, by analyzing content and structure. These systems streamline assessments, reduce instructor workloads, and ensure consistent and unbiased evaluations [1,4].
Compared to manual scoring methods, the scalability and efficiency of LLM-based systems are especially evident in larger classrooms and online environments [1,4]. Alqahtani et al. [4] emphasize that AI-driven systems provide considerable time savings while maintaining quality and reliability in scoring processes. Similarly, Bond et al. [2] underline how these systems deliver precise assessments tailored to individual student needs, addressing disparities often present in traditional scoring. Such innovations allow for greater scalability, making them particularly effective in larger classrooms or online environments [2,4]. These complementary findings collectively highlight LLMs’ role in reducing variability across assessments, thus increasing equity in student evaluations [2,4].
A recent study by Henkel et al. [5] focused on Automatic Short Answer Grading (ASAG) tasks, finding that GPT-4, with minimal prompt engineering, performed at a level comparable to expert human raters. This result suggests that generative LLMs like GPT-4 could be integrated into real-world educational systems to grade formative assessment tasks reliably and efficiently [5]. Additionally, GPT-4’s flexibility in adapting to various scoring contexts with few-shot prompting and minimal adjustments underscores its advantage over fine-tuned systems [6]. This highlights a key benefit of generative models in educational applications, where their ability to align with the gold standard of human ratings demonstrates potential for scalable, automated assessment systems [5,7].
Despite these advancements, several studies stress the importance of human oversight [6,7,8]. LLMs can effectively automate grading, but educators play a crucial role in addressing non-standard questions, ensuring fairness, and upholding ethical practices. Kooli and Yusuf [7] emphasize the need for AI systems to complement rather than replace instructors, maintaining a balanced approach that leverages the strengths of both human judgment and machine efficiency [6,7,8].
Despite their potential, LLMs face critical challenges regarding reliability, fairness, and ethical considerations [1,4]. An important concern is their tendency for hallucinations, where models generate inaccurate or non-existent information [1,4,9,10]. Additionally, data privacy and security remain pressing issues, especially in educational settings, where sensitive information requires robust protection to maintain trust [1,4,11].
Algorithmic bias in LLMs is another concern, as it can result in inequitable grading for minority groups or non-standard language users, undermining fairness [4]. Misuse by students, such as employing LLMs to generate answers, further raises ethical questions and challenges the integrity of the learning process [4]. Moreover, the rapid development of new models complicates research comparability and generalizability, highlighting the need for continuous evaluation across diverse contexts [6].
To address the limitations of LLMs, particularly hallucinations and outdated information, Retrieval-Augmented Generation (RAG) offers a practical solution by integrating external knowledge sources. This approach allows LLMs to retrieve up-to-date and contextually relevant information, substantially improving the accuracy and reliability of their responses [12,13]. RAG systems use semantic similarity to retrieve relevant data from external databases, such as vector databases like ChromaDB, ensuring the generated outputs align closely with the query context [12,14]. The implementation of RAG in this study was supported by LangChain, which provided the tools to efficiently manage the retrieval process and integrate relevant information into the LLM’s outputs [15]. However, the success of this system depends on the quality and timeliness of the external data, requiring careful management to maximize its effectiveness [12].
While RAG mitigates data limitations, effective prompt design is necessary for maximizing LLM performance. Prompts directly influence response quality, with few-shot learning emerging as a suitable technique to enable models to adapt to new tasks using minimal data [8,16]. By incorporating examples such as questions, rubrics, and reference answers, few-shot learning can improve accuracy in automated scoring while reducing the need for frequent retraining [6,17]. However, overfitting remains a risk: in this context, the model can become overly reliant on the provided examples, limiting generalizability. This challenge, together with mitigation strategies such as maintaining balanced datasets and using sufficiently large validation sets, has been addressed in recent work [18].
Achieving optimal performance depends on selecting the right LLM, with premium and open-source models offering distinct trade-offs [19]. Premium LLMs typically deliver higher performance but have considerable licensing costs and potential data privacy risks due to remote data processing [11,19]. On the other hand, open-source models provide greater flexibility and control, often running locally to address privacy concerns. However, they require substantial computational resources, making deployment at scale technically demanding. The choice ultimately involves balancing performance, cost, privacy, and infrastructure requirements [19].
While advancements have been made in employing LLMs for educational purposes, substantial gaps in research remain, particularly in the context of automated grading. Limited studies have explored the feasibility of using LLMs to accurately grade diverse types of student assessments, presenting a critical opportunity for further investigation into their reliability and scalability [7]. Furthermore, while some research highlights the need for comparisons between open-source and premium LLMs, studies directly addressing their grading performance remain scarce, leaving significant gaps in understanding the trade-offs between these models.
Addressing these challenges and building on the opportunities highlighted by prior research, this study aims to evaluate the ability of LLMs, both open-source and premium, to perform scoring tasks with reliability and accuracy comparable to human evaluators in the context of a computer programming subject, defining two Research Questions (RQ):
  • RQ1: Are LLM-generated grades as reliable as human evaluations? H0: There is no significant difference between the two.
  • RQ2: Can open-source LLMs achieve scoring accuracy comparable to premium models? H0: There is no significant scoring difference between the models.
The findings demonstrate that while both LLMs align closely with human evaluators, the premium LLM achieves statistically and practically equivalent scoring performance, particularly for code-based assessments. The premium LLM also demonstrated the highest consistency across all question types, outperforming human evaluators and the open-source model. By addressing these hypotheses, this study evaluates the potential of LLMs to effectively assist human evaluators in grading tasks.

2. Materials and Methods

2.1. Development of the Platform

The introEduAI platform automates grading processes while evaluating the reliability of LLMs compared to human evaluators. By using LangChain and RAG, the platform delivers precise, personalized, and actionable assessments for diverse question types. Hosted on a Virtual Private Server (VPS), its architecture ensures scalability, cost-efficiency, and seamless interactions among students, educators, and backend systems. For visual demonstrations of the platform’s interface and functionalities, see Appendix A (Figure A1, Figure A2, Figure A3 and Figure A4).
The platform employs Flask with Jinja2 templates and a Representational State Transfer Application Programming Interface (REST API) backend, adhering to the Model-View-Controller paradigm for a maintainable and flexible design. Secure access is managed through JavaScript Object Notation (JSON) Web Tokens (JWT), providing role-appropriate functionality for students, teachers, and administrators while ensuring robust data protection. The platform integrates both premium and open-source LLMs through APIs, eliminating the need for local infrastructure. These models were selected as state-of-the-art representatives of their respective categories: OpenAI GPT-4o, widely regarded as a leading premium LLM due to its advanced capabilities and strong performance across a range of applications [18], and LLaMA 3.2, which has been shown to outperform other open-source models on several public benchmarks [20]. This focused selection allows for a meaningful comparison between premium and open-source approaches while maintaining the study’s scope.
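As an illustration of this backend design, the following minimal sketch shows how a JWT-protected REST endpoint could be wired with Flask; it assumes the flask and flask_jwt_extended packages, and the route paths, credentials, and payload fields are hypothetical rather than taken from the platform’s code.

from flask import Flask, jsonify, request
from flask_jwt_extended import JWTManager, create_access_token, get_jwt_identity, jwt_required

app = Flask(__name__)
app.config["JWT_SECRET_KEY"] = "replace-with-a-secret"  # kept outside source control in practice
jwt = JWTManager(app)

@app.route("/api/login", methods=["POST"])
def login():
    # Hypothetical credential check; the real platform validates users against its database.
    payload = request.get_json()
    if payload.get("username") == "teacher" and payload.get("password") == "demo":
        return jsonify(access_token=create_access_token(identity=payload["username"]))
    return jsonify(error="invalid credentials"), 401

@app.route("/api/assessments/<int:assessment_id>/submissions", methods=["POST"])
@jwt_required()  # only authenticated students or teachers may submit answers
def submit_answer(assessment_id):
    answer = request.get_json().get("answer", "")
    # A real backend would persist the submission and queue it for LLM evaluation.
    return jsonify(assessment=assessment_id, submitted_by=get_jwt_identity(), characters=len(answer)), 201

if __name__ == "__main__":
    app.run(debug=True)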
The workflow dynamically adapts based on the question type. More specifically, for text-based questions, LangChain retrieves relevant information through RAG and combines it with the query to provide contextually enriched and accurate evaluations. For code-based questions, the LLM’s internal expertise is used to assess submissions. This was implemented through LangChain by embedding information retrieved from external data sources (in this case, ChromaDB) into LLM prompts, mitigating issues related to outdated knowledge [12,16]. The employed RAG solution dynamically combines retrieval and generative processes, grounding responses in factual knowledge while enhancing the reliability and scalability of grading systems [12,14,15]. Furthermore, the platform employs ChromaDB to manage high-dimensional embeddings created using OpenAI’s “text-embedding-ada-002” model. This vector database was selected because it has been shown to be suitable for semantic similarity searches, aligning user queries with relevant data in the RAG workflow.
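A simplified sketch of this retrieval step is shown below; it uses the chromadb and openai client libraries directly instead of the LangChain wrappers employed in the platform, and the collection name, indexed documents, and question are illustrative only.

import chromadb
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="course_materials")  # illustrative name

def embed(texts):
    # Same embedding model as used by the platform: text-embedding-ada-002.
    response = openai_client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [item.embedding for item in response.data]

# Index class materials (in the platform, the attachments uploaded by the teacher).
documents = [
    "Strings in Python are immutable sequences of characters.",
    "A while loop repeats a block of code while a condition remains true.",
]
collection.add(ids=["doc-1", "doc-2"], documents=documents, embeddings=embed(documents))

# Retrieve the most relevant material for a text-based question and ground the grading prompt in it.
question = "Determine whether strings are mutable sequences of characters."
results = collection.query(query_embeddings=embed([question]), n_results=1)
context = results["documents"][0][0]
grading_prompt = f"Course material:\n{context}\n\nUsing only this material, grade the student's answer to: {question}"
print(grading_prompt)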
The platform utilized few-shot prompting, in which a small set of input-output examples is embedded within the prompt to guide the behavior of the LLMs. This approach enables the models to generalize effectively to new tasks, reducing the need for extensive retraining [6,21]. In this implementation, prompts included key elements such as question text, scoring rubrics, example solutions, evaluation criteria, and sample student responses to ensure outputs aligned with scoring standards. An example of a prompt template used in the study is provided in Figure 1.
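The following short sketch illustrates how such a few-shot prompt can be assembled in Python; the field names and graded examples are hypothetical stand-ins for the actual template shown in Figure 1.

FEW_SHOT_TEMPLATE = """You are grading a short-answer question in an introductory programming course.

Question: {question}
Scoring rubric: {rubric}
Reference solution: {solution}

Examples of graded answers:
{examples}

Student answer: {answer}
Return a score between 0 and {max_points} and a one-sentence justification."""

graded_examples = (
    "Answer: 'False, strings cannot be changed after creation.' -> Score: 8/8\n"
    "Answer: 'True, strings can be modified in place.' -> Score: 0/8"
)

prompt = FEW_SHOT_TEMPLATE.format(
    question="Determine the truth of the statement (justify if false): Strings are mutable sequences of characters.",
    rubric="Indication of True or False (3 points); Justification of the Answer (5 points)",
    solution="False. Strings in Python are immutable.",
    examples=graded_examples,
    answer="False, because strings cannot be altered once created.",
    max_points=8,
)
print(prompt)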

2.2. Experimental Design

The study was conducted within the ‘Introduction to Programming’ curricular unit at a Portuguese university, involving 23 first-year and second-year students, mostly young adults, predominantly male (22 male and 1 female). The assignment included nine questions written in Portuguese: two programming code-based and seven short-answer questions. Among the short-answer questions, five were in a True or False format, requiring justifications only for the false statements. Details of the short-answer questions, including their solutions and a partial overview of the evaluation criteria, are summarized in Appendix B, Table A1, while the programming code-based questions are detailed in Appendix B, Table A2. To clarify, a complete example of the evaluation criteria for the question with ID 12 is presented in Appendix B, Table A3, demonstrating how scores were assigned based on specific criteria. This setup ensured a balanced assessment of the LLMs’ performance on both objective code-based tasks and subjective short-answer responses.
In total, 207 responses (including blanks) were evaluated. Each response was graded by three teachers and two LLMs, with each model conducting three independent evaluations. Detailed rubrics ensured consistency, and responses were anonymized and graded in randomized order. Scores were averaged across evaluations to ensure fair comparisons of scoring methods.
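The aggregation described above reduces to normalizing each score to a 0-100 scale and averaging over the repeated evaluations; a small sketch with hypothetical raw scores for a single response follows.

# Hypothetical raw scores for one response to a question worth 25 points:
# three teachers and three independent runs of each LLM.
MAX_POINTS = 25
raw_scores = {
    "human": [20, 18, 21],
    "open_source_llm": [19, 19, 20],
    "premium_llm": [22, 21, 22],
}

normalized_means = {
    evaluator: sum(100 * score / MAX_POINTS for score in scores) / len(scores)
    for evaluator, scores in raw_scores.items()
}
print(normalized_means)  # {'human': 78.67, 'open_source_llm': 77.33, 'premium_llm': 86.67} (rounded)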
The study utilized two distinct types of prompts to address specific evaluation tasks: scoring code submissions and evaluating short text-based responses. To ensure fairness and consistency in assessing scoring accuracy, these prompts were structured identically for both LLMs.
In this study, the prompts incorporated key elements such as the question text, scoring rubrics, example solutions, evaluation criteria, and student responses to align with established scoring standards. This was performed because prior research indicates that including detailed rubric information within prompts enhances evaluation precision and F1 scores by providing clear guidance [16].
The study received ethical approval from the relevant Ethics Commission. All participants provided informed consent, ensuring transparency and voluntary involvement. To protect privacy, responses were anonymized before evaluation, and data were securely stored in encrypted databases with access controlled via JWT.

2.3. Evaluation and Analysis

The evaluation involved a statistical analysis of scoring performance. This was assessed by comparing the final average grades for each student’s response across human evaluators, the open-source LLM, and the premium LLM. Normalized scores ensured consistency across responses with differing maximum grades. Scoring criteria were defined by detailed rubrics, as indicated in Appendix B, Table A3, which illustrates the assessment framework for Question ID 12. Comparisons included human versus LLM grades and direct comparisons between the two LLMs, focusing on alignment and reliability.
Statistical methods included Spearman’s correlation to measure relationships between scoring patterns and Mann–Whitney U tests, chosen due to the non-normal distribution of the data, to assess differences in grade distributions [22]. Equivalence testing was employed to evaluate practical similarity when the null hypothesis could not be rejected. Following Lakens [23], the Two One-Sided Tests (TOST) approach was used to detect meaningful effects within predefined equivalence bounds and to provide statistical confirmation of the absence of meaningful differences [24]. We acknowledge that fully non-parametric methods for equivalence do exist (e.g., rank-based approaches) [22]. However, the parametric TOST enabled a more straightforward interpretation of ‘practically negligible’ differences around the mean, making it simpler to define and evaluate our equivalence bounds.
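Concretely, in the standard TOST formulation described by Lakens [23], with mean paired difference $\bar{d}$, standard deviation $s_d$, sample size $n$, and equivalence margin $\Delta$, the two one-sided test statistics are

t_{\mathrm{lower}} = \frac{\bar{d} + \Delta}{s_d / \sqrt{n}}, \qquad t_{\mathrm{upper}} = \frac{\bar{d} - \Delta}{s_d / \sqrt{n}},

and equivalence is concluded only when both one-sided null hypotheses, $H_{0,1}\colon \mu_d \le -\Delta$ and $H_{0,2}\colon \mu_d \ge \Delta$, are rejected at the chosen significance level.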
Therefore, in this study, ‘practical equivalence’ is defined as differences between scoring distributions falling within a predefined threshold of −5 to +5 on a 0–100 scale. This range was chosen as a pragmatic Smallest Effect Size of Interest (SESOI), reflecting differences small enough to be practically irrelevant while accounting for normal variability in educational contexts. Specifically, an absolute deviation of five percentage points is typically negligible in standard grading practice, as it rarely alters student outcomes or perceptions of fairness. Setting these equivalence bounds ensures reliable consistency between LLM-generated grades and human evaluations [24]. Effect sizes were calculated to quantify the magnitude of observed differences or similarities [25].
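For reference, the full pipeline (normality check, rank-based tests, correlation, and TOST with the ±5 margin) can be reproduced with scipy; the sketch below uses placeholder arrays rather than the study data and implements TOST as two one-sided one-sample t-tests on the paired differences, which is one standard realization of the procedure, not necessarily the authors’ exact script.

import numpy as np
from scipy import stats

# Placeholder normalized grades (0-100) for the same 207 responses from two evaluators.
rng = np.random.default_rng(0)
human = rng.uniform(0, 100, 207)
llm = np.clip(human + rng.normal(0, 10, 207), 0, 100)

# Normality check motivating the non-parametric tests.
print("Shapiro-Wilk p-values:", stats.shapiro(human).pvalue, stats.shapiro(llm).pvalue)

# Rank-based comparison of scoring patterns and monotonic association.
u_stat, u_p = stats.mannwhitneyu(human, llm, alternative="two-sided")
rho, rho_p = stats.spearmanr(human, llm)
print(f"Mann-Whitney U = {u_stat:.1f} (p = {u_p:.3f}); Spearman rho = {rho:.3f} (p = {rho_p:.3g})")

# TOST equivalence test with a +/-5 point margin (the SESOI) on the paired differences.
delta = 5.0
diff = llm - human
t_low, p_low = stats.ttest_1samp(diff, -delta, alternative="greater")  # H0: mean(diff) <= -5
t_up, p_up = stats.ttest_1samp(diff, delta, alternative="less")        # H0: mean(diff) >= +5
print(f"TOST t = ({t_low:.2f}, {t_up:.2f}), p = ({p_low:.4f}, {p_up:.4f}); equivalent if both p < 0.05")
print("Cohen's d of the differences:", diff.mean() / diff.std(ddof=1))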

3. Results

This section presents the findings of the study, focusing on the performance and reliability of LLMs in comparison to human evaluators in formative assessments. As shown in Table 1, the mean evaluation scores by question identifier (ID) highlight the differences and similarities in scoring patterns across all evaluators. These results are further illustrated in Figure 2, which visually depicts the distribution and variability of scores for each question, providing a clearer comparison of the scoring behaviors among the three evaluators.

3.1. Scoring Performance

The analysis began with the grades assigned by human evaluators and the open-source LLM. The Shapiro–Wilk test indicated significant deviations from normality for both distributions (p < 0.001). Consequently, non-parametric methods were used in the analysis. Descriptive statistics revealed that the median grade for human evaluations was 54.17, with an Interquartile Range (IQR) of 86.11. The open-source LLM exhibited a median grade of 62.50, with an IQR of 100.00. Both distributions displayed bimodal tendencies, with peaks at 0 and 100.
A Mann–Whitney U test assessed differences in scoring patterns. The results showed no statistically significant difference between human and open-source LLM grades (U = 20,911.50, p = 0.666). The effect size, calculated as r ≈ 0.021, was very small. The mean rank values were 209.98 for human evaluations and 205.02 for the open-source LLM. Table 1 presents the normalized grades for the questions, evaluated by each grader, with the following measures: minimum score (Min), maximum score (Max), the mean score from three evaluations (Mean), and standard deviation (SD), reflecting variability in scores. As shown in the table, the scores assigned by the open-source LLM and human evaluators are generally aligned across most short-text question IDs. However, differences are observed in the scoring of code questions.
Next, the grades assigned by human evaluators were compared with those of the premium LLM. The Shapiro–Wilk test confirmed significant deviations from normality for both distributions (p < 0.001), necessitating the use of non-parametric methods for analysis.
Descriptive statistics showed that the median grade for human evaluations was 54.17, whereas the premium LLM assigned a median of 68.18. The IQR was 86.11 for human grades and 82.16 for the premium LLM. Both distributions exhibited bimodal tendencies, with noticeable peaks at scores of 0 and 100.
A Mann–Whitney U test was conducted to evaluate differences in scoring patterns between the two evaluators. The results indicated no statistically significant difference (U = 22,444.50, p = 0.392). The effect size, calculated as r ≈ 0.042, suggested minimal practical differences. The mean rank values were 202.57 for human evaluations and 212.43 for the premium LLM.
As shown in Table 1, the normalized grades for the questions reflect the variability in scores across evaluators. Figure 2 complements this by visually illustrating the distribution of grades, offering further insight into the variability and statistical properties observed in the table. For True/False questions ID 4, 9, and 11, which did not require justification for ‘True’ responses, the normalized scores were analyzed separately, and these questions were excluded from the boxplot analysis. For question ID 4, all evaluators assigned 100 to 19 responses and 0 to four responses. Similarly, for question ID 9, all evaluators assigned 100 to 21 responses and 0 to two responses. However, question ID 11 exhibited greater variability among evaluators, as shown in Figure 3.
Notably, the alignment between the premium LLM and human evaluators was stronger for code-based questions, as shown in Table 1. However, the premium LLM generally assigned higher scores for text-based questions.
A Spearman correlation analysis examined the relationships between scoring patterns across all evaluators. The analysis revealed the following correlation coefficients: 0.962 between human evaluators and the open-source LLM, 0.963 between human evaluators and the premium LLM, and 0.953 between the open-source LLM and the premium LLM.

3.2. Question Type Analysis

To explore differences in scoring patterns across question types, a Kruskal–Wallis test was performed. This non-parametric method is suitable for comparing more than two independent groups, particularly given the non-normal distribution of grades.
For text-based questions, the results showed no statistically significant differences between evaluators (H = 1.044, degrees of freedom = 2, p = 0.593). The mean ranks for text-based questions were Human (236.34), Open-source LLM (239.01), and Premium LLM (250.65).
For code-based questions, the Kruskal–Wallis test also revealed no statistically significant differences (H = 3.536, degrees of freedom = 2, p = 0.171). The mean ranks for code-based questions were Human (74.93), Open-source LLM (60.60), and Premium LLM (72.97).
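A minimal example of this comparison with scipy is given below, using placeholder score lists rather than the study data.

from scipy import stats

# Placeholder normalized scores for the same responses from each evaluator group.
human = [40.0, 55.0, 100.0, 0.0, 62.5, 80.0]
open_source_llm = [45.0, 50.0, 100.0, 0.0, 70.0, 85.0]
premium_llm = [55.0, 60.0, 100.0, 0.0, 75.0, 90.0]

h_stat, p_value = stats.kruskal(human, open_source_llm, premium_llm)
print(f"Kruskal-Wallis H = {h_stat:.3f}, p = {p_value:.3f}")  # p > 0.05: no evidence of a difference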
As shown in Table 1 and Figure 2, evaluation scores for short-answer questions exhibit variability among evaluators, with the premium LLM generally assigning higher scores compared to human and open-source evaluations. For code-based questions, evaluation scores show a general alignment among evaluators, as illustrated in Figure 4. The premium LLM and human evaluations are closely aligned, while the open-source LLM consistently assigned noticeably lower scores for these questions.

3.3. Similarity Between LLM and Human Evaluator

To assess the practical similarity between human and LLM-generated grades, equivalence testing was conducted using thresholds of −5 and 5 to establish acceptable ranges of difference. Since we could not reject the null hypothesis (H0: There is no difference between LLM-generated grades and human evaluations), equivalence testing provides an additional perspective on whether the observed differences fall within an acceptable range of practical similarity.
For the open-source LLM, the test against the −5 equivalence threshold yielded a mean difference between grades of 3.64 (standard deviation = 10.85), with a t-statistic of 4.83 (degrees of freedom = 206) and a two-sided p-value of <0.001. The 95% confidence interval ranged from 2.15 to 5.13, and the effect size, calculated as Cohen’s d, was 0.336. For the 5 equivalence threshold, the mean difference was −6.36 (standard deviation = 10.85), with a t-statistic of −8.43 (degrees of freedom = 206) and a two-sided p-value < 0.001. The 95% confidence interval ranged from −7.85 to −4.87, and Cohen’s d was −0.586.
Regarding the premium LLM, equivalence testing showed that for the −5 equivalence threshold, the mean difference between grades was 8.67 (standard deviation = 11.57), with a t-statistic of 10.78 (degrees of freedom = 206) and a two-sided p-value of <0.001. The 95% confidence interval ranged from 7.08 to 10.25. Cohen’s d was 0.749. For the 5 equivalence threshold, the mean difference was −1.33 (standard deviation = 11.57), with a t-statistic of −1.66 (degrees of freedom = 206) and a two-sided p-value of 0.099. The 95% confidence interval ranged from −2.92 to 0.25, and Cohen’s d was −0.115.
Following the equivalence testing, scoring trends across short-answer and code-based questions were analyzed, with the variability in scores for each evaluator summarized in Table 2, which presents the standard deviation and variance (Var). Across both question types, the premium LLM consistently exhibited the lowest variability in scores, followed by the open-source LLM with moderate variability. Human evaluators demonstrated the highest variability for both short-answer and code-based questions.

3.4. Comparative Analysis of LLM Performance

The grades assigned by both the open-source and premium LLMs were analyzed. Tests for normality, including the Kolmogorov–Smirnov and Shapiro–Wilk tests, yielded p-values less than 0.001 for both models, indicating significant deviations from normality. Consequently, non-parametric methods were employed for the subsequent analysis.
Descriptive statistics showed that the median grade for the open-source LLM was 62.50, with an IQR of 100.00. The premium LLM exhibited a slightly higher median of 68.18, with an IQR of 82.16. Both models displayed wide interquartile ranges, reflecting a broad spread of grades. Histograms revealed bimodal distributions for both models, with notable peaks at the extremes of 0 and 100.
A Mann–Whitney U test was conducted to evaluate the differences in scoring patterns between the two models. The results showed no statistically significant difference between the grades assigned by the open-source and premium LLMs (U = 22,764.5, p = 0.257). The effect size, calculated as r ≈ 0.0557, was very small. The mean rank values were 201.03 for the open-source LLM and 213.97 for the premium LLM.
A Spearman correlation was conducted to assess the relationship between the grades assigned by the two models. The results showed a very strong positive correlation (ρ = 0.953, N = 207, p < 0.001).
Equivalence testing evaluated the practical significance of differences between the models. For the −5 equivalence threshold, the mean difference was −0.03 (standard deviation = 14.40), with a t-statistic of −0.027 and a two-sided p-value of 0.978. The 95% confidence interval ranged from −2.00 to 1.95. Cohen’s d was −0.002. For the 5 equivalence threshold, the mean difference was −10.03 (standard deviation = 14.40), with a t-statistic of −10.02 and a two-sided p-value of <0.001. The 95% confidence interval ranged from −12.00 to −8.05, and Cohen’s d was −0.696.

4. Discussion

The design of the study contributed to a bimodal distribution of scores, with peaks at 0 and 100, in contrast to studies using public datasets with normally distributed grades. Despite these differences, the results demonstrate the adaptability of LLMs in effectively handling heterogeneous question types.
Addressing RQ1, the examined premium LLM closely aligned with human evaluators in a programming subject, achieving practical equivalence within defined thresholds (−5 to 5). A strong Spearman correlation coefficient of 0.963 further demonstrated the premium LLM’s reliability in replicating human scoring patterns. In contrast, the examined open-source LLM deviated beyond equivalence thresholds, highlighting the need for refinement to achieve similar reliability. These findings establish premium LLMs as reliable tools for educational assessments, particularly in contexts demanding high scoring consistency and objectivity. While these findings align with prior research by Henkel et al. [5,26], Cohn et al. [18], and Latif and Zhai [25], which demonstrated the reliability of premium LLMs across diverse educational tasks, this study provides additional insights into their application. Specifically, the comparative analysis between open-source and premium LLMs in evaluating programming-related, short-text responses represents a novel contribution not extensively explored in prior studies. Unlike research focused on essay scoring or free-form answers, this work highlights the challenges and opportunities in grading structured, code-based assessments. Furthermore, the findings of Mansour et al. [27], who identified difficulties in applying LLMs to broader assessment tasks like essay scoring, suggest that the reliability observed here may not extend to disciplines requiring subjective or interpretive evaluation, such as humanities or arts. Additionally, the observed differences in scoring patterns, such as the premium LLM’s higher median scores and greater consistency, underscore its suitability for objective, technical evaluations while revealing gaps in open-source models’ performance.
Distinct scoring trends and consistency were observed across question types. For code-based questions, the premium model closely matched human evaluators, demonstrating its potential for scoring programming assignments with precision. In contrast, the open-source model scored substantially lower in this study, reflecting a need for further optimization in this area. These results highlight the examined premium model’s suitability for structured, objective assessments and align with prior research on the reliability of generative pre-trained transformer models for educational tasks [5,16]. For short-answer questions, the premium LLM showed leniency in mid-range scoring, while human and open-source evaluations were closely aligned. Variability at score extremes underscored challenges in interpreting subjective responses, contrasting with the consistency observed in code-based evaluations.
Nevertheless, these conclusions might not generalize to disciplines requiring subjective or interpretative grading, such as humanities or arts, where nuanced evaluation poses unique challenges. Further studies are necessary to confirm the findings in such contexts.
Another notable finding pertains to scoring consistency. The premium LLM demonstrated the highest consistency across both short-answer and code-based questions, with minimal variation between question types. Human graders showed greater variability, likely reflecting their ability to capture nuanced responses that automated systems may overlook.
The premium LLM also assigned higher median grades (68.18) compared to human evaluators (54.17). This aligns with Chang and Ginter’s research [28] highlighting that premium LLMs tend to assign higher grades in scoring contexts. However, Grévisse [29] observed lower grades from GPT-4 in specific domains, emphasizing the importance of contextual factors when interpreting LLM performance.
Addressing RQ2, both LLMs demonstrated overall consistency in scoring patterns, with a strong positive correlation between their outputs. However, the open-source LLM exhibited greater deviations and failed to meet equivalence thresholds compared to the premium model, which aligned more reliably with human grades. While Song et al. [30] highlighted the cost-effectiveness and flexibility of open-source LLMs, their current limitations reduce their applicability in formative assessments. Future advancements in fine-tuning and optimization may help narrow the gap between these models.

5. Conclusions

This study explored the potential of LLMs to automate scoring in formative assessments, emphasizing their reliability and consistency compared to human evaluators. By developing the introEduAI platform, we demonstrated the feasibility of using LLMs for scalable and precise scoring workflows.
The evaluation of the open-source LLaMA 3.2 and premium OpenAI GPT-4o revealed distinct strengths and limitations. The premium LLM closely aligned with human scoring patterns, showing high correlations and minimal variability, particularly in coding responses. In contrast, open-source models, while cost-effective and privacy-friendly, exhibited greater variability, underscoring the need for further optimization to match premium model performance. However, we must acknowledge the rapid evolution of new language models, particularly open-source alternatives like DeepSeek [31], especially when using reasoning methodologies, which show promising potential to match or exceed the performance of premium commercial models in the near future.
However, this study focuses on computer programming assessments characterized by objective and structured responses. While these tasks are suitable for testing LLM precision and consistency, the findings may not extend to disciplines where scoring involves interpretative judgment and handling ambiguity.
Furthermore, the scope of question types used in this study (primarily two code-based questions and seven short-answer text responses) limits our understanding of LLM adaptability. While these formats provided structured insights into LLM performance, they do not fully capture the diversity of tasks commonly encountered in educational assessments.
A further limitation of this investigation is the restricted sample size (23 university students). Although the study generated 207 data points, it is important to acknowledge that these do not represent independent participants but rather repeated measures from the same 23 individuals across multiple experimental items.
Therefore, future research should expand the application of generative LLMs to include a broader range of subjects and educational levels, particularly those involving interpretative or subjective elements, and explore a broader range of complex question types to gain deeper insights into how LLMs handle intricate reasoning and extended text-based tasks.

Author Contributions

Conceptualization, P.C.M., F.Q. and F.M.; methodology, P.C.M., F.Q. and F.M.; software, P.C.M.; validation, P.C.M., F.Q. and F.M.; formal analysis, P.C.M., F.Q. and F.M.; investigation, P.C.M., F.Q. and F.M.; resources, P.C.M.; data curation, P.C.M.; writing—original draft preparation, P.C.M.; writing—review and editing, F.Q. and F.M.; visualization, P.C.M.; supervision, F.Q. and F.M.; project administration, F.Q.; funding acquisition, F.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Interactive Technologies Institute (ITI/LARSyS), funded by Fundação para a Ciência e a Tecnologia (FCT) projects 10.54499/LA/P/0083/2020, 10.54499/UIDP/50009/2020, and 10.54499/UIDB/50009/2020.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of University of Madeira (protocol code 150/CEUMN2024, approved on 17 October 2024).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The datasets presented in this article are not readily available because the information collected during the case study was authorized by an informed consent signed by each subject. Data sharing was not contemplated in the consent.

Acknowledgments

During the preparation of this work, the authors used ChatGPT-4o in order to refine the structure of the text and enhance its academic clarity while strictly preserving the original content without AI-generated writing. After using this tool, the authors reviewed and edited the content as needed and took full responsibility for the content of the published article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Figure A1. Interface for creating a formative assessment task. Uploaded attachments are integrated into the RAG system, enabling evaluations to reference materials specifically used in class.
Figure A2. Interface for adding and managing questions in a formative assessment.
Figure A3. Examples of responses for a specific question. These examples are provided to the LLM using the few-shot prompting technique to guide its evaluation process.
Figure A4. Teacher evaluation interface for scoring specific questions.

Appendix B

Table A1. Summary of Short-Answer Questions with Scoring Rubrics. Translated from Portuguese.

ID 2
Question: Python has two main types of loops. Briefly explain how each of them works, highlighting their main differences.
Solution example: Python has two main types of loops: for and while. The for loop is used to iterate over elements of a sequence (such as lists, strings, or numeric ranges) and automatically traverses all elements. The while loop is used when we want to repeat a block of code while a condition is true, which means it depends on a specific logical condition. The main difference is that the for loop is more suitable for a known number of iterations, while the while loop is useful when we do not know in advance how many times the loop will execute.
Criteria: Identification of Loop Types (Total: 6 points); Explanation of the Functioning of the for Loop (Total: 7 points); Explanation of the Functioning of the while Loop (Total: 7 points); Clarity and Coherence in the Comparison between for and while (Total: 2 points)

ID 3
Question: Determine the truth of the statement (justify if false): Strings are mutable sequences of characters.
Solution example: False. Strings in Python are immutable, meaning that we cannot alter their individual characters after creation.
Criteria: Indication of True or False (3 points); Justification of the Answer (Total: 5 points)

ID 4
Question: Determine the truth of the statement (justify if false): To access the last element of a list, we use the index −1.
Solution example: True
Criteria: Indication of True or False (8 points)

ID 9
Question: Determine the truth of the statement (justify if false): This operation is permitted in Python: true = True
Solution example: True
Criteria: Indication of True or False (8 points)

ID 10
Question: Determine the truth of the statement (justify if false): Tuples are an immutable data structure where new elements can be added.
Solution example: False, because tuples are immutable, which means we cannot add new elements after their creation.
Criteria: Indication of True or False (3 points); Justification for the False Answer: Explanation of Immutability (2 points), Indication of the Impossibility to Add New Elements (3 points)

ID 11
Question: Determine the truth of the statement (justify if false): Comments are extremely important in programming, as they help document the code.
Solution example: True
Criteria: Indication of True or False (8 points)

ID 12
Question: You intend to store the contacts (telephone, email, address) of the students in the TPSI course. Indicate and justify which type(s) of data structures you would use to perform this operation.
Solution example: I would use a list of composite data structures, such as dictionaries, where each dictionary would represent a contact. Within each dictionary, I would use the keys telephone, email, and address to store each student’s information, as this allows me to easily associate each piece of information with its respective key.
Criteria: Identification of the Main Structure (5 points); Explanation of Using Dictionaries as a Complementary Structure (12 points); Justification of the Combination of List + Dictionary (8 points)
Table A2. Summary of Code Questions with Scoring Rubrics. Translated from Portuguese.

ID 13
Question: Implement a function called insertValidNIF that repeatedly reads a user’s NIF and validates it according to the criteria below:
(a) The function should not have arguments;
(b) The function should prompt the user to enter the NIF until the input is valid. It can be read as a string or as an integer, but it must correspond to an integer value with exactly 9 digits;
(c) If the NIF is read as an integer, a try…except should be added;
(d) If the NIF is invalid, the function should display the message: “Invalid NIF. Please enter a valid NIF.” and prompt for a new value;
(e) The function should return the NIF when it is correctly entered.
Solution example:
def insertValidNIF():
    while True:
        try:
            nif = int(input("Insert your NIF: "))
            if len(str(nif)) != 9:
                raise ValueError
            break
        except ValueError:
            print("Invalid NIF. Please insert a valid NIF.")
    return nif
Criteria: Definition of the insertValidNIF Function (3 points); Use of a while Loop for Repeated Validation (7 points); Reading the NIF Value (3 points); Verification and Handling of Errors for Different Types (6 points); Length Verification (6 points); Error Message for Invalid NIF (5 points); Return of the Valid NIF (3 points); Clarity and Coherence in the Comparison between for and while (Total: 2 points)

ID 14
Question: Create a Python program that performs the following operations:
(a) Initialize an empty list to store user contacts;
(b) Each contact should be a structure containing the following information (key-value pairs): id: a unique integer number that identifies the contact; nome: a string representing the full name of the contact; telefone: an integer number corresponding to the telephone contact; email: the email associated with the contact (e.g., joao@example.com); localidade: the location where the contact resides.
(c) Create a function called adicionarContacto() that allows filling in the data for each contact. This function should request the mentioned data from the user and return them as an individual contact element;
(d) Create a function called listarContactos() that prints the data of all contacts stored in the list, in the following format (example):
1—João Alberto Silva Pacheco/Funchal
Telefone: 291223322 Email: joao_aspacheco@sapo.pt
(e) Finally, in the main program, use the adicionarContacto() function to add at least one contact to the list and then call the listarContactos() function to display the added contacts.
Solution example:
def adicionarContacto():
    id = int(input("ID: "))
    nome = input("Nome: ")
    telefone = int(input("Telefone: "))
    email = input("Email: ")
    localidade = input("Localidade: ")
    return {"id": id, "nome": nome, "telefone": telefone, "email": email, "localidade": localidade}

def listarContactos():
    print("Lista de Contactos:")
    for contacto in listaContactos:
        print(f"ID: {contacto['id']} - Nome: {contacto['nome']} / {contacto['localidade']}\nTelefone: {contacto['telefone']}, Email: {contacto['email']}")

listaContactos = []
listaContactos.append(adicionarContacto())
listarContactos()
Criteria: Initialization of the Empty List (5 points); Creation of the adicionarContacto() Function: Definition and Structure of the Function (5 points), Reading and Assignment of the 5 Required Values (15 points in total), Storage in Dictionary Format with Specified Keys (15 points in total); Creation of the listarContactos() Function: Definition of the Function (6 points), Traversing the List and Printing Contacts (24 points in total); Calling the Functions in the Main Program: Calling and Filling the List with adicionarContacto() (5 points), Calling listarContactos() to Print (5 points)
Table A3. Scoring rubric for Question ID 12, showing criteria and score ranges.

Criterion: Identification of Main Structure
- Mentioning lists as main structure for contact information: 5 points
- Mentioning inappropriate structure (dictionaries, tuples, sets): 0 points

Criterion: Explanation of Dictionaries as Complementary Structure
Sub-criterion: Explanation of Key-Value Pairs
- Explaining dictionaries store contact details as key-value pairs: 6 points
- Partial/ambiguous explanation: 3 to 5 points
- Incorrect/missing explanation: 0 points
Sub-criterion: Clarity in Attribute Association
- Justifying dictionaries enable clear data association with specific keys: 6 points
- Partial/ambiguous explanation: 3 to 5 points
- Incorrect/missing explanation: 0 points

Criterion: Justification of Combining List + Dictionary
Sub-criterion: Flexibility and Organization of Contacts
- Explaining list of dictionaries allows dynamic contact addition: 4 points
- Partial explanation: 2 to 3 points
- Incorrect/missing explanation: 0 points
Sub-criterion: Ease of Search and Manipulation
- Indicating that this combination allows efficient search, access, and modification of information for each student, using keys to retrieve specific fields and indices to access each student in the list: 4 points
- Partial explanation: 2 to 3 points
- Incorrect/missing explanation: 0 points

References

  1. Fagbohun, O.; Iduwe, N.P.; Abdullahi, M.; Ifaturoti, A.; Nwanna, O.M. Beyond Traditional Assessment: Exploring the Impact of Large Language Models on Grading Practices. J. Artif. Intell. Mach. Learn. Data Sci. 2024, 2, 1–8. [Google Scholar] [CrossRef] [PubMed]
  2. Bond, M.; Khosravi, H.; De Laat, M.; Bergdahl, N.; Negrea, V.; Oxley, E.; Pham, P.; Chong, S.W.; Siemens, G. A Meta Systematic Review of Artificial Intelligence in Higher Education: A Call for Increased Ethics, Collaboration, and Rigour. Int. J. Educ. Technol. High. Educ. 2024, 21, 4. [Google Scholar] [CrossRef]
  3. Automatic Assessment of Text-Based Responses in Post-Secondary Education: A Systematic Review. Comput. Educ. Artif. Intell. 2024, 6, 100206. [CrossRef]
  4. The Emergent Role of Artificial Intelligence, Natural Learning Processing, and Large Language Models in Higher Education and Research. Res. Social. Adm. Pharm. 2023, 19, 1236–1242. [CrossRef]
  5. Henkel, O.; Hills, L.; Roberts, B.; McGrane, J. Can LLMs Grade Open Response Reading Comprehension Questions? An Empirical Study Using the ROARs Dataset. Int. J. Artif. Intell. Educ. 2024. [Google Scholar] [CrossRef]
  6. Liu, M.; M’Hiri, F. Beyond Traditional Teaching: Large Language Models as Simulated Teaching Assistants in Computer Science. In Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1, Portland, OR, USA, 20–23 March 2024; pp. 743–749. [Google Scholar]
  7. Kooli, C.; Yusuf, N. Transforming Educational Assessment: Insights Into the Use of ChatGPT and Large Language Models in Grading. Int. J. Hum.–Comput. Interact. 2025, 41, 1–12. [Google Scholar] [CrossRef]
  8. Kosar, T.; Ostojić, D.; Liu, Y.D.; Mernik, M. Computer Science Education in ChatGPT Era: Experiences from an Experiment in a Programming Course for Novice Programmers. Mathematics 2024, 12, 629. [Google Scholar] [CrossRef]
  9. Cooper, G. Examining Science Education in ChatGPT: An Exploratory Study of Generative Artificial Intelligence. J. Sci. Educ. Technol. 2023, 32, 444–452. [Google Scholar] [CrossRef]
  10. Zhai, X. ChatGPT User Experience: Implications for Education 2022. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4312418 (accessed on 2 March 2025).
  11. Dorfner, F.J.; Jürgensen, L.; Donle, L.; Mohamad, F.A.; Bodenmann, T.R.; Cleveland, M.C.; Busch, F.; Adams, L.C.; Sato, J.; Schultz, T.; et al. Is Open-Source There Yet? A Comparative Study on Commercial and Open-Source LLMs in Their Ability to Label Chest X-Ray Reports. arXiv 2024, arXiv:2402.12298. [Google Scholar]
  12. Miao, J.; Thongprayoon, C.; Suppadungsuk, S.; Garcia Valencia, O.A.; Cheungpasitporn, W. Integrating Retrieval-Augmented Generation with Large Language Models in Nephrology: Advancing Practical Applications. Medicina 2024, 60, 445. [Google Scholar] [CrossRef]
  13. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2023. [Google Scholar] [CrossRef]
  14. Fung, S.C.E.; Wong, M.F.; Tan, C.W. Automatic Feedback Generation on K-12 Students’ Data Science Education by Prompting Cloud-Based Large Language Models. In Proceedings of the Eleventh ACM Conference on Learning @ Scale, Atlanta, GA, USA, 18–20 July 2024; pp. 255–258. [Google Scholar]
  15. Posedaru, B.-S.; Pantelimon, F.-V.; Dulgheru, M.-N.; Georgescu, T.-M. Artificial Intelligence Text Processing Using Retrieval-Augmented Generation: Applications in Business and Education Fields. Proc. Int. Conf. Bus. Excell. 2024, 18, 209–222. [Google Scholar] [CrossRef]
  16. Carpenter, D.; Min, W.; Lee, S.; Ozogul, G.; Zheng, X.; Lester, J. Assessing Student Explanations with Large Language Models Using Fine-Tuning and Few-Shot Learning. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024); Kochmar, E., Bexte, M., Burstein, J., Horbach, A., Laarmann-Quante, R., Tack, A., Yaneva, V., Yuan, Z., Eds.; Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 403–413. [Google Scholar]
  17. Generative Language Models with Retrieval Augmented Generation for Automated Short Answer Scoring. Available online: https://www.researchgate.net/publication/382944163_Generative_Language_Models_with_Retrieval_Augmented_Generation_for_Automated_Short_Answer_Scoring (accessed on 19 January 2025).
  18. Cohn, C.; Hutchins, N.; Le, T.; Biswas, G. A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students’ Formative Assessment Responses in Science. Proc. AAAI Conf. Artif. Intell. 2024, 38, 23182–23190. [Google Scholar] [CrossRef]
  19. Wong, E. Comparative Analysis of Open Source and Proprietary Large Language Models: Performance and Accessibility. Adv. Comput. Sci. 2024, 7, 1–7. [Google Scholar]
  20. Wang, Y.; Wang, M.; Manzoor, M.A.; Liu, F.; Georgiev, G.N.; Das, R.J.; Nakov, P. Factuality of Large Language Models: A Survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 19519–19529. [Google Scholar]
  21. Jury, B.; Lorusso, A.; Leinonen, J.; Denny, P.; Luxton-Reilly, A. Evaluating LLM-Generated Worked Examples in an Introductory Programming Course. In Proceedings of the 26th Australasian Computing Education Conference, Sydney, NSW, Australia, 29 January–2 February 2024; pp. 77–86. [Google Scholar]
  22. Fay, M.P.; Proschan, M.A. Wilcoxon-Mann-Whitney or t-Test? On Assumptions for Hypothesis Tests and Multiple Interpretations of Decision Rules. Stat. Surv. 2010, 4, 1–39. [Google Scholar] [CrossRef]
  23. Lakens, D. Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. Social. Psychol. Personal. Sci. 2017, 8, 355–362. [Google Scholar] [CrossRef] [PubMed]
  24. Lakens, D.; McLatchie, N.; Isager, P.M.; Scheel, A.M.; Dienes, Z. Improving Inferences About Null Effects with Bayes Factors and Equivalence Tests. J. Gerontol. Ser. B 2020, 75, 45–57. [Google Scholar] [CrossRef]
  25. Fine-Tuning ChatGPT for Automatic Scoring. Comput. Educ. Artif. Intell. 2024, 6, 100210. [CrossRef]
  26. Henkel, O.; Hills, L.; Boxer, A.; Roberts, B.; Levonian, Z. Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability To Mark Short Answer Questions in K-12 Education. In Proceedings of the Eleventh ACM Conference on Learning @ Scale, Atlanta, GA, USA, 18–20 July 2024; pp. 300–304. [Google Scholar]
  27. Mansour, W.A.; Albatarni, S.; Eltanbouly, S.; Elsayed, T. Can Large Language Models Automatically Score Proficiency of Written Essays? In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 2777–2786. [Google Scholar]
  28. Chang, L.-H.; Ginter, F. Automatic Short Answer Grading for Finnish with ChatGPT. Proc. AAAI Conf. Artif. Intell. 2024, 38, 23173–23181. [Google Scholar] [CrossRef]
  29. Grévisse, C. LLM-Based Automatic Short Answer Grading in Undergraduate Medical Education. BMC Med. Educ. 2024, 24, 1060. [Google Scholar] [CrossRef]
  30. Song, Y.; Zhu, Q.; Wang, H.; Zheng, Q. Automated Essay Scoring and Revising Based on Open-Source Large Language Models. IEEE Trans. Learn. Technol. 2024, 17, 1920–1930. [Google Scholar] [CrossRef]
  31. DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025. [Google Scholar] [CrossRef]
Figure 1. Few-Shot Prompt Template for RAG-Enhanced Short-Answer Assessment.
Figure 2. Boxplots showing distribution of normalized evaluation scores for each question by evaluator type.
Figure 3. Scores for Question ID 11 by human and LLM evaluators, showing the distribution of responses graded as 0 and 100.
Figure 4. Analysis of scoring agreement between human and AI evaluators (premium and open source) across code assessment questions, shown in three panels: (a) Bland–Altman plot showing score differences between human vs. premium and human vs. open-source evaluators across mean scores; (b) Raincloud plot displaying score distributions for question ID 13 across human, premium, and open-source evaluators; (c) Raincloud plot showing score distributions for question ID 14 across the same evaluator groups.
Table 1. Descriptive Statistics of Normalized Evaluation Scores by Question ID for Human and LLM Evaluators.

Question ID | Type | Human (Min / Max / Mean / SD) | Open Source LLM (Min / Max / Mean / SD) | Premium LLM (Min / Max / Mean / SD)
2 | Short-text | 0.00 / 86.36 / 40.97 / 23.41 | 0.00 / 90.91 / 40.91 / 24.97 | 0.00 / 100.00 / 55.01 / 26.55
3 | Short-text | 0.00 / 100.00 / 37.14 / 42.95 | 0.00 / 100.00 / 36.41 / 42.34 | 0.00 / 100.00 / 34.60 / 40.31
4 | Short-text | 0.00 / 100.00 / 82.25 / 37.78 | 0.00 / 100.00 / 82.61 / 37.90 | 0.00 / 100.00 / 82.61 / 37.90
9 | Short-text | 0.00 / 100.00 / 91.30 / 28.18 | 0.00 / 100.00 / 91.30 / 28.18 | 0.00 / 100.00 / 91.30 / 28.18
10 | Short-text | 0.00 / 100.00 / 60.69 / 31.59 | 0.00 / 100.00 / 64.13 / 36.36 | 0.00 / 100.00 / 68.48 / 35.59
11 | Short-text | 0.00 / 100.00 / 92.75 / 23.99 | 0.00 / 100.00 / 95.65 / 20.39 | 0.00 / 100.00 / 91.30 / 28.18
12 | Short-text | 0.00 / 62.67 / 19.59 / 15.49 | 0.00 / 73.33 / 14.15 / 17.39 | 0.00 / 84.00 / 34.72 / 26.09
13 | Code | 0.00 / 90.91 / 34.83 / 30.19 | 0.00 / 100.00 / 29.38 / 35.06 | 0.00 / 100.00 / 34.91 / 33.49
14 | Code | 0.00 / 97.50 / 30.51 / 30.70 | 0.00 / 100.00 / 23.26 / 33.49 | 0.00 / 100.00 / 30.11 / 33.06
Table 2. Variability in scoring consistency by question type and evaluator.

Question Type | Evaluator | SD | Var
Short-Answer | Teacher | 4.438 | 74.843
Short-Answer | Open Source LLM | 1.197 | 12.670
Short-Answer | Premium LLM | 0.873 | 6.540
Code | Teacher | 6.350 | 83.566
Code | Open Source LLM | 1.020 | 4.207
Code | Premium LLM | 0.710 | 3.874
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
