The evaluation outcomes derived from the proposed framework are organized according to the approaches applied for short answer assessment. First, the “Results of String-Based Similarity” section outlines the findings based on conventional string comparison techniques. Next, the “Results of Semantic Similarity” section assesses the effectiveness of semantic-based methods. The “Results of the Hybrid Approach (String-Based and Semantic Similarity)” section provides an analysis of the combined use of both approaches. This is followed by the “Results of Large Language Models” section, which highlights the performance of LLMs in this domain. Lastly, the “Results of Fine-Tuning Transformer Models” section examines how fine-tuning contributes to improved accuracy in automated short-answer grading.
4.4. Results of the Hybrid Approach (String-Based and Semantic Similarity)
Table 7 presents the results obtained after applying multiple string-based similarity algorithms with different preprocessing techniques. The computed similarity scores were collected in an Excel file alongside the actual scores and then input into Weka for classification. Only classifiers outperforming the best individual similarity algorithm were considered. The highest Pearson correlation of 0.7133 and a QWK of 0.6396 were achieved on the original (unprocessed) dataset with the Random Forest classifier. Additionally, the Random Forest model exhibited a low RMSE of 1.083 and an almost negligible mean difference (MD = −0.0021), indicating that it not only captures relative ranking effectively but also provides predictions with minimal bias. Other classifiers showed slightly higher RMSE and minor biases, reinforcing the advantage of Random Forest in combining multiple string-based similarity metrics to improve both accuracy and consistency in ASAG.
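As an illustrative sketch of this pipeline's first stage, the snippet below builds one feature row from several string-based similarity scores for a single response pair. The two metrics are standard-library stand-ins; the study used a larger set of string algorithms, and classification was done in Weka rather than in Python.

```python
from difflib import SequenceMatcher


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two answers."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def string_features(student: str, reference: str) -> dict:
    # One row of the feature file: each column is one string-based
    # similarity score. Rows for all responses, together with the
    # human-assigned scores, are what the classifier is trained on.
    return {
        "seq_ratio": SequenceMatcher(
            None, student.lower(), reference.lower()
        ).ratio(),
        "jaccard": jaccard(student, reference),
    }


row = string_features("a stack is LIFO", "a stack is a LIFO structure")
```

Each feature row, paired with its human-assigned score, forms one training instance for the classifier.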
The outcomes of various semantic similarity algorithms used for embedding the students’ responses and the model answers are shown in Table 8. Following the embedding process, cosine similarity was employed to evaluate the closeness between the responses. The resulting similarity scores, together with the actual scores, were consolidated into a single file and analyzed using Weka for classification purposes. Only classifiers that surpassed the performance of the best individual similarity algorithm were documented. The Random Forest classifier achieved the highest Pearson correlation of 0.6569, while the KStar classifier yielded the highest QWK of 0.5864.
To provide a more comprehensive assessment, the RMSE and MD were also calculated. The Random Forest classifier exhibited a low RMSE of 1.168 and an MD of −0.012, demonstrating that its predictions are not only accurate in magnitude but also practically unbiased. Other classifiers showed slightly higher RMSE values and minor biases, emphasizing the effectiveness of Random Forest in integrating multiple semantic similarity features to generate reliable and balanced short-answer predictions.
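The embedding-then-cosine step can be sketched as follows. Here a simple word-count vector stands in for the sentence-embedding models actually used, so the resulting numbers are illustrative only:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in for a real sentence-embedding model: a sparse
    # bag-of-words count vector keyed by token.
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(c * v.get(tok, 0) for tok, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


score = cosine(
    embed("binary search halves the interval"),
    embed("binary search repeatedly halves the search interval"),
)
```

In the actual framework, one such cosine score per embedding model becomes one feature column for the classifier.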
Note that the string-based and semantic-based methods were evaluated independently. Specifically, all string-based similarity algorithms were aggregated to produce a single consolidated result representing that category, while all semantic similarity algorithms were similarly combined within their respective category. Importantly, no weighting or integration was applied between the string-based and semantic-based methods; each category was analyzed entirely on its own.
It was observed that, in certain experiments, the string-based methods achieved slightly higher performance scores compared to the semantic-based methods. This difference can be attributed to the domain-specific characteristics of the BeSTraP dataset, where exact string matches often suffice to capture the essential concepts and key terms within student responses. In contrast, semantic-based approaches, which rely on embeddings and meaning representations, may be more sensitive to variations in wording, phrasing, or sentence structure, potentially resulting in slightly lower agreement with the human-assigned scores in this context. Overall, these results indicate that hybrid models effectively combine multiple metrics within each category to improve predictive accuracy and reduce bias, although string-based methods slightly outperform semantic-based methods in this dataset.
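For reference, the four metrics reported throughout this section can be computed as below. This is a minimal sketch that assumes integer ratings for the QWK computation and defines MD as the mean of (predicted − actual):

```python
import math


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))


def mean_difference(pred, actual):
    # Negative MD means the model scores below the human raters on average.
    return sum(p - a for p, a in zip(pred, actual)) / len(pred)


def qwk(actual, pred, max_rating):
    # Quadratic Weighted Kappa over integer ratings 0..max_rating.
    n, N = max_rating + 1, len(actual)
    O = [[0] * n for _ in range(n)]  # observed rating matrix
    for a, p in zip(actual, pred):
        O[a][p] += 1
    hist_a = [sum(row) for row in O]
    hist_p = [sum(O[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement weight
            num += w * O[i][j]
            den += w * hist_a[i] * hist_p[j] / N  # expected by chance
    return 1.0 - num / den if den else 1.0
```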
4.6. Results of Large Language Models
An LLM was utilized to automatically evaluate students’ short-answer responses by applying advanced NLP methods. The methodology incorporated three core strategies: Zero-Shot Learning, Prompt Engineering, and Few-Shot Learning. Initially, a Zero-Shot Learning approach was adopted, enabling the model to evaluate student responses without prior task-specific training. Instead of being fine-tuned on labeled grading data, the model relied on a structured prompt-based evaluation framework to guide its scoring decisions. The specific prompt used for this task is presented in Figure 6. For this experiment, the unsloth/llama-3-8b-Instruct-bnb-4bit model was employed to assign numerical scores to student responses, producing both whole (integer) and fractional values. The primary objective of the evaluation was to determine the efficacy of Zero-Shot Learning in automated grading by analyzing the alignment between the scores provided by human evaluators and those generated by the model.
To evaluate the effectiveness of the proposed framework, the dataset was used in its original form, without any preprocessing or normalization steps. A zero-shot learning strategy was applied using the unsloth/llama-3-8b-Instruct-bnb-4bit model, achieving a Pearson correlation of 0.6166, a QWK of 0.5072, an RMSE of 1.2749, and an MD of 0.3748. These results demonstrate the model’s ability to generalize to grading tasks directly from raw student responses, underscoring its potential even in the absence of task-specific preprocessing or fine-tuning.
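The general shape of such a zero-shot grading prompt can be sketched as follows. The wording here is illustrative only; the exact prompt used in the study is the one shown in Figure 6.

```python
def zero_shot_prompt(question: str, reference: str, student_answer: str) -> str:
    # Structured prompt: the model sees the question, a reference
    # answer, and the student answer, and is asked to return only
    # a numeric score.
    return (
        "You are grading a short-answer exam question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Student answer: {student_answer}\n"
        "Return only a numeric score between 0 and 10."
    )


prompt = zero_shot_prompt(
    "What does a DNS resolver do?",
    "It translates domain names into IP addresses.",
    "It maps names to IPs.",
)
```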
While Zero-Shot Learning provided a baseline for automated grading, it lacked structured grading criteria, which introduced ambiguity in the evaluation process. To address this limitation, Prompt Engineering was employed to refine the model’s performance. Prompt engineering involves designing structured and precise input instructions to optimize the accuracy and consistency of LLMs in generating relevant outputs.
Unlike the Zero-Shot Learning approach, which relied on a general prompt, this method incorporated a more structured and detailed prompt to enhance the interpretability of the model’s scoring process. In this stage, the same model was used, but the refined prompt explicitly defined grading criteria, offering a clear framework for assessing student responses. The structured grading criteria, as shown in Figure 7, aimed to minimize ambiguity and improve alignment with human evaluation.
The implementation of prompt engineering led to a substantial improvement in performance. The Pearson correlation coefficient increased from 0.6166 (achieved with the basic Zero-Shot Learning prompt) to 0.6247, the QWK rose from 0.5072 to 0.5292, the RMSE decreased from 1.2749 to 1.2186, and the MD decreased from 0.3748 to 0.1406, demonstrating a stronger agreement between model-generated scores and human evaluations. This result underscores the effectiveness of structured prompts in improving LLM-based automated grading by establishing well-defined evaluation criteria and reducing interpretative inconsistencies.
Although refining the prompt improved grading accuracy, the model still lacked real-world grading examples to guide its evaluation process. To further enhance performance, Few-Shot Learning was introduced, incorporating a small set of labeled examples to provide additional context. Unlike Zero-Shot Learning, which relied solely on a well-structured prompt, and Prompt Engineering, which refined instructions for greater clarity, Few-Shot Learning presented explicit reference cases that helped the model recognize grading patterns and achieve better alignment with human evaluations.
A more structured and detailed prompt was introduced, featuring clearer grading instructions and carefully selected examples to align the model’s scoring behavior with human grading standards. As shown in Figure 8, this refined prompt achieved the highest observed Pearson correlation of 0.6750 and a QWK of 0.5614. The RMSE was 1.2626, and the Mean Difference (MD) was 0.1790. These results underscore the effectiveness of Few-Shot Learning in automated grading, demonstrating that combining explicit examples with structured prompts and optimized inference parameters substantially enhances the model’s grading accuracy and reliability. The full prompt is provided in Figure A1.
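A few-shot variant differs from the zero-shot prompt mainly in prepending a handful of human-graded examples. The sketch below is illustrative only; the actual prompt is reproduced in Figure 8 and Figure A1.

```python
def few_shot_prompt(question, reference, examples, student_answer):
    # `examples` holds (answer, human_score) pairs that act as graded
    # reference cases for the model to imitate.
    shots = "\n".join(
        f"Example answer: {ans}\nAssigned score: {score}"
        for ans, score in examples
    )
    return (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n\n"
        f"Graded examples:\n{shots}\n\n"
        f"Student answer: {student_answer}\n"
        "Return only a numeric score between 0 and 10."
    )


prompt = few_shot_prompt(
    "Define recursion.",
    "A function that calls itself on a smaller input.",
    [("A function calling itself.", 8), ("A kind of loop.", 3)],
    "When a function invokes itself until a base case.",
)
```

The graded examples give the model concrete anchor points on the scoring scale, which is what distinguishes this strategy from the purely instructional prompts above.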
While Few-Shot Learning with LLaMA-3 yielded the best correlation and QWK so far, the evaluation was extended to a more advanced model, Gemini, to assess whether its multi-modal processing capabilities could further improve grading performance. Gemini integrates NLP with reinforcement learning, enabling it to extract deeper contextual understanding and enhance grading consistency.
This experiment employed the same Few-Shot Learning prompt (as illustrated in Figure 8) to ensure a fair comparison. A key finding from this study is the substantial performance improvement achieved by the Gemini model. While the LLaMA-3 model previously obtained a Pearson correlation of 0.6750 and a QWK of 0.5614, Gemini significantly outperformed it, achieving a higher Pearson correlation of 0.7955 and a QWK of 0.7464. Furthermore, Gemini achieved a lower RMSE of 1.1439 compared to LLaMA-3, indicating reduced prediction error. However, the Mean Difference (MD) was −0.4623, revealing a noticeable negative bias, meaning that Gemini tended to assign slightly lower scores than human evaluators on average.
This improvement demonstrates Gemini’s ability to align with human grading by effectively interpreting context and applying evaluation criteria, as illustrated in Table 10. These results highlight the impact of both model architecture and learning strategy on automated assessment performance. While prompt engineering and few-shot learning enhance interpretability, leveraging advanced models like Gemini further refines grading consistency, reinforcing the role of multi-modal learning in achieving human-aligned evaluation outcomes.
In this study, feedback was both generated and evaluated using large language models (LLMs). The feedback was produced using Gemini-1.5-Flash, following a structured prompt designed to generate clear, well-organized, and high-quality responses. Once the reference answers were created, the generated feedback was assessed using unsloth/llama-3-8b-Instruct-bnb-4bit, based on a predefined evaluation prompt. During this process, the model compared student responses with the reference answers and provided textual feedback explaining the reasoning behind each assigned score. The evaluation focused on alignment, accuracy, and completeness, ensuring that the feedback identified both strengths and areas for improvement. Scoring was performed on a 0–10 scale, yielding an average score of 7.8494 across all evaluations.
The prompts used in this process played a crucial role in maintaining consistency. As shown in Figure 9, the reference-answer generation prompt ensured that model-produced responses were clear, comprehensive, and well-structured. Meanwhile, Figure 10 illustrates the evaluation prompt, which guided the model in assessing the quality of student responses and providing constructive feedback. This structured approach ensured that the system effectively delivered meaningful insights, enabling students to understand the quality of their answers and identify areas for improvement. The full prompts are provided in Figure A2 and Figure A3.
To further examine the consistency of the evaluation process, an additional assessment was conducted using the same structured prompts for both feedback generation and evaluation. In this phase, feedback was generated using unsloth/llama-3-8b-Instruct-bnb-4bit and subsequently evaluated using Gemini-1.5-Flash, adhering to the predefined assessment criteria and resulting in an average score of 7.2457. Finally, to enhance the reliability of the evaluation results, both sets of generated feedback, produced by Gemini-1.5-Flash and unsloth/llama-3-8b-Instruct-bnb-4bit, were further evaluated using DeepSeek, following the same predefined assessment criteria. The DeepSeek-based evaluation assigned a score $S_i$ to each feedback instance, and the overall performance for each set was summarized using the arithmetic mean:

$\bar{S} = \frac{1}{n} \sum_{i=1}^{n} S_i,$

where $n$ is the number of feedback instances and $S_i$ is the score assigned to the $i$-th instance. This evaluation yielded average scores of 7.8771 for the Gemini-generated feedback and 6.9170 for the LLaMA-generated feedback, reinforcing the robustness and consistency of the proposed evaluation framework across different LLMs.
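This cross-model evaluation loop reduces to scoring each feedback instance with a judge model and averaging. The sketch below uses a trivial stand-in judge in place of the LLM judges actually used (such as DeepSeek), so the scoring rule is purely illustrative:

```python
def evaluate_feedback_set(feedback_set, judge):
    # Apply the judge to every instance, then summarize with the
    # arithmetic mean: S_bar = (1/n) * sum(S_i).
    scores = [judge(feedback) for feedback in feedback_set]
    return sum(scores) / len(scores)


def stub_judge(feedback: str) -> float:
    # Hypothetical stand-in judge: rewards feedback that gives a reason.
    return 8.0 if "because" in feedback else 6.0


avg = evaluate_feedback_set(
    ["Correct because the base case is handled.", "Partially correct."],
    stub_judge,
)  # -> 7.0
```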
This secondary evaluation confirmed the stability and reliability of the overall assessment framework, demonstrating that LLMs can consistently generate structured, insightful, and actionable feedback across diverse model architectures.
To further validate the quality and acceptability of the automatically generated feedback, a subset of feedback instances was presented to two experienced instructors for qualitative assessment. Both instructors indicated that the feedback was informative, relevant, and aligned with pedagogical expectations. Subsequently, a sample of feedback corresponding to different questions was evaluated by 20 undergraduate students, who reported a clear preference for the LLM-generated feedback over standard or generic comments. These observations provide additional evidence that the proposed framework produces meaningful, comprehensible, and pedagogically useful feedback, supporting the quantitative findings reported above.
In addition, this feedback will be incorporated into the dataset, providing a publicly available resource for other researchers to evaluate or benchmark automated short-answer grading systems. This ensures that the generated feedback not only serves as an evaluation tool but also contributes to reproducibility and further research in the field.