The evaluation outcomes derived from the proposed framework are organized according to the approaches applied for short answer assessment. First, the “Results of String-Based Similarity” section outlines the findings based on conventional string comparison techniques. Next, the “Results of Semantic Similarity” section assesses the effectiveness of semantic-based methods. The “Results of the Hybrid Approach (String-Based and Semantic Similarity)” section provides an analysis of the combined use of both approaches. This is followed by the “Results of Large Language Models” section, which highlights the performance of LLMs in this domain. Lastly, the “Results of Fine-Tuning Transformer Models” section examines how fine-tuning contributes to improved accuracy in automated short-answer grading.
4.4. Results of the Hybrid Approach (String-Based and Semantic Similarity)
Table 7 presents the results obtained after applying multiple string-based similarity algorithms with different preprocessing techniques. The computed similarity scores were collected in an Excel file alongside the actual scores and then input into Weka for classification. Only classifiers outperforming the best individual similarity algorithm were considered. The highest Pearson correlation of 0.7133 and a QWK of 0.6396 were achieved on the original (unprocessed) dataset with the Random Forest classifier. Additionally, the Random Forest model exhibited a low RMSE of 1.083 and an almost negligible mean difference (MD = −0.0021), indicating that it not only captures relative ranking effectively but also provides predictions with minimal bias. Other classifiers showed slightly higher RMSE and minor biases, reinforcing the advantage of Random Forest in combining multiple string-based similarity metrics to improve both accuracy and consistency in ASAG.
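As an illustrative sketch of this pipeline's first stage, the snippet below builds one feature row from several string-based similarity scores for a single response pair. The two metrics are standard-library stand-ins; the study used a larger set of string algorithms, and classification was done in Weka rather than in Python.

```python
from difflib import SequenceMatcher


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two answers."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def string_features(student: str, reference: str) -> dict:
    # One row of the feature file: each column is one string-based
    # similarity score. Rows for all responses, together with the
    # human-assigned scores, are what the classifier is trained on.
    return {
        "seq_ratio": SequenceMatcher(
            None, student.lower(), reference.lower()
        ).ratio(),
        "jaccard": jaccard(student, reference),
    }


row = string_features("a stack is LIFO", "a stack is a LIFO structure")
```

Each feature row, paired with its human-assigned score, forms one training instance for the classifier.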
The outcomes of various semantic similarity algorithms used for embedding the students’ responses and the model answers are shown in Table 8. Following the embedding process, cosine similarity was employed to evaluate the closeness between the responses. The resulting similarity scores, together with the actual scores, were consolidated into a single file and analyzed using Weka for classification purposes. Only classifiers that surpassed the performance of the best individual similarity algorithm were documented. The Random Forest classifier achieved the highest Pearson correlation of 0.6569, while the KStar classifier yielded the highest QWK of 0.5864.
To provide a more comprehensive assessment, the RMSE and MD were also calculated. The Random Forest classifier exhibited a low RMSE of 1.168 and an MD of −0.012, demonstrating that its predictions are not only accurate in magnitude but also practically unbiased. Other classifiers showed slightly higher RMSE values and minor biases, emphasizing the effectiveness of Random Forest in integrating multiple semantic similarity features to generate reliable and balanced short-answer predictions.
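The embedding-then-cosine step can be sketched as follows. Here a simple word-count vector stands in for the sentence-embedding models actually used, so the resulting numbers are illustrative only:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in for a real sentence-embedding model: a sparse
    # bag-of-words count vector keyed by token.
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(c * v.get(tok, 0) for tok, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


score = cosine(
    embed("binary search halves the interval"),
    embed("binary search repeatedly halves the search interval"),
)
```

In the actual framework, one such cosine score per embedding model becomes one feature column for the classifier.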
Note that the string-based and semantic-based methods were evaluated independently. Specifically, all string-based similarity algorithms were aggregated to produce a single consolidated result representing that category, while all semantic similarity algorithms were similarly combined within their respective category. Importantly, no weighting or integration was applied between the string-based and semantic-based methods; each category was analyzed entirely on its own.
It was observed that, in certain experiments, the string-based methods achieved slightly higher performance scores compared to the semantic-based methods. This difference can be attributed to the domain-specific characteristics of the BeSTraP dataset, where exact string matches often suffice to capture the essential concepts and key terms within student responses. In contrast, semantic-based approaches, which rely on embeddings and meaning representations, may be more sensitive to variations in wording, phrasing, or sentence structure, potentially resulting in slightly lower agreement with the human-assigned scores in this context. Overall, these results indicate that hybrid models effectively combine multiple metrics within each category to improve predictive accuracy and reduce bias, although string-based methods slightly outperform semantic-based methods in this dataset.
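For reference, the four metrics reported throughout this section can be computed as below. This is a minimal sketch that assumes integer ratings for the QWK computation and defines MD as the mean of (predicted − actual):

```python
import math


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))


def mean_difference(pred, actual):
    # Negative MD means the model scores below the human raters on average.
    return sum(p - a for p, a in zip(pred, actual)) / len(pred)


def qwk(actual, pred, max_rating):
    # Quadratic Weighted Kappa over integer ratings 0..max_rating.
    n, N = max_rating + 1, len(actual)
    O = [[0] * n for _ in range(n)]  # observed rating matrix
    for a, p in zip(actual, pred):
        O[a][p] += 1
    hist_a = [sum(row) for row in O]
    hist_p = [sum(O[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement weight
            num += w * O[i][j]
            den += w * hist_a[i] * hist_p[j] / N  # expected by chance
    return 1.0 - num / den if den else 1.0
```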
4.6. Results of Large Language Models
An LLM was utilized to automatically evaluate students’ short-answer responses by applying advanced NLP methods. The methodology incorporated three core strategies: Zero-Shot Learning, Prompt Engineering, and Few-Shot Learning. Initially, a Zero-Shot Learning approach was adopted, enabling the model to evaluate student responses without prior task-specific training. Instead of being fine-tuned on labeled grading data, the model relied on a structured prompt-based evaluation framework to guide its scoring decisions. The specific prompt used for this task is presented in Figure 6. For this experiment, the unsloth/llama-3-8b-Instruct-bnb-4bit model was employed to assign numerical scores to student responses, producing both whole (integer) and fractional values. The primary objective of the evaluation was to determine the efficacy of Zero-Shot Learning in automated grading by analyzing the alignment between the scores provided by human evaluators and those generated by the model.
To evaluate the effectiveness of the proposed framework, the dataset was used in its original form, without any preprocessing or normalization steps. A zero-shot learning strategy was applied using the unsloth/llama-3-8b-Instruct-bnb-4bit model, achieving a Pearson correlation of 0.6166, a QWK of 0.5072, an RMSE of 1.2749, and an MD of 0.3748. These results demonstrate the model’s ability to generalize to grading tasks directly from raw student responses, underscoring its potential even in the absence of task-specific preprocessing or fine-tuning.
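The general shape of such a zero-shot grading prompt can be sketched as follows. The wording here is illustrative only; the exact prompt used in the study is the one shown in Figure 6.

```python
def zero_shot_prompt(question: str, reference: str, student_answer: str) -> str:
    # Structured prompt: the model sees the question, a reference
    # answer, and the student answer, and is asked to return only
    # a numeric score.
    return (
        "You are grading a short-answer exam question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Student answer: {student_answer}\n"
        "Return only a numeric score between 0 and 10."
    )


prompt = zero_shot_prompt(
    "What does a DNS resolver do?",
    "It translates domain names into IP addresses.",
    "It maps names to IPs.",
)
```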
While Zero-Shot Learning provided a baseline for automated grading, it lacked structured grading criteria, which introduced ambiguity in the evaluation process. To address this limitation, Prompt Engineering was employed to refine the model’s performance. Prompt engineering involves designing structured and precise input instructions to optimize the accuracy and consistency of LLMs in generating relevant outputs.
Unlike the Zero-Shot Learning approach, which relied on a general prompt, this method incorporated a more structured and detailed prompt to enhance the interpretability of the model’s scoring process. In this stage, the same model was used, but the refined prompt explicitly defined grading criteria, offering a clear framework for assessing student responses. The structured grading criteria, as shown in Figure 7, aimed to minimize ambiguity and improve alignment with human evaluation.
The implementation of prompt engineering led to a substantial improvement in performance. The Pearson correlation coefficient increased from 0.6166 (achieved with the basic Zero-Shot Learning prompt) to 0.6247, the QWK rose from 0.5072 to 0.5292, the RMSE decreased from 1.2749 to 1.2186, and the MD decreased from 0.3748 to 0.1406, demonstrating a stronger agreement between model-generated scores and human evaluations. This result underscores the effectiveness of structured prompts in improving LLM-based automated grading by establishing well-defined evaluation criteria and reducing interpretative inconsistencies.
Although refining the prompt improved grading accuracy, the model still lacked real-world grading examples to guide its evaluation process. To further enhance performance, Few-Shot Learning was introduced, incorporating a small set of labeled examples to provide additional context. Unlike Zero-Shot Learning, which relied solely on a well-structured prompt, and Prompt Engineering, which refined instructions for greater clarity, Few-Shot Learning presented explicit reference cases that helped the model recognize grading patterns and achieve better alignment with human evaluations.
A more structured and detailed prompt was introduced, featuring clearer grading instructions and carefully selected examples to align the model’s scoring behavior with human grading standards. As shown in Figure 8, this refined prompt achieved the highest observed Pearson correlation of 0.6750 and a QWK of 0.5614. The RMSE was 1.2626, and the Mean Difference (MD) was 0.1790. These results underscore the effectiveness of Few-Shot Learning in automated grading, demonstrating that combining explicit examples with structured prompts and optimized inference parameters substantially enhances the model’s grading accuracy and reliability. The full prompt is provided in Figure A1.
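A few-shot variant differs from the zero-shot prompt mainly in prepending a handful of human-graded examples. The sketch below is illustrative only; the actual prompt is reproduced in Figure 8 and Figure A1.

```python
def few_shot_prompt(question, reference, examples, student_answer):
    # `examples` holds (answer, human_score) pairs that act as graded
    # reference cases for the model to imitate.
    shots = "\n".join(
        f"Example answer: {ans}\nAssigned score: {score}"
        for ans, score in examples
    )
    return (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n\n"
        f"Graded examples:\n{shots}\n\n"
        f"Student answer: {student_answer}\n"
        "Return only a numeric score between 0 and 10."
    )


prompt = few_shot_prompt(
    "Define recursion.",
    "A function that calls itself on a smaller input.",
    [("A function calling itself.", 8), ("A kind of loop.", 3)],
    "When a function invokes itself until a base case.",
)
```

The graded examples give the model concrete anchor points on the scoring scale, which is what distinguishes this strategy from the purely instructional prompts above.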
While Few-Shot Learning with LLaMA-3 yielded the best correlation and QWK so far, the evaluation was extended to a more advanced model, Gemini, to assess whether its multi-modal processing capabilities could further improve grading performance. Gemini integrates NLP with reinforcement learning, enabling it to extract deeper contextual understanding and enhance grading consistency.
This experiment employed the same Few-Shot Learning prompt (as illustrated in Figure 8) to ensure a fair comparison. A key finding from this study is the substantial performance improvement achieved by the Gemini model. While the LLaMA-3 model previously obtained a Pearson correlation of 0.6750 and a QWK of 0.5614, Gemini significantly outperformed it, achieving a higher Pearson correlation of 0.7955 and a QWK of 0.7464. Furthermore, Gemini achieved a lower RMSE of 1.1439 compared to LLaMA-3, indicating reduced prediction error. However, the Mean Difference (MD) was −0.4623, revealing a noticeable negative bias, meaning that Gemini tended to assign slightly lower scores than human evaluators on average.
This improvement demonstrates Gemini’s ability to align with human grading by effectively interpreting context and applying evaluation criteria, as illustrated in Table 10. These results highlight the impact of both model architecture and learning strategy on automated assessment performance. While prompt engineering and few-shot learning enhance interpretability, leveraging advanced models like Gemini further refines grading consistency, reinforcing the role of multi-modal learning in achieving human-aligned evaluation outcomes.
In this study, feedback was both generated and evaluated using large language models (LLMs). The feedback was produced using Gemini-1.5-Flash, following a structured prompt designed to generate clear, well-organized, and high-quality responses. Once the reference answers were created, the generated feedback was assessed using unsloth/llama-3-8b-Instruct-bnb-4bit, based on a predefined evaluation prompt. During this process, the model compared student responses with the reference answers and provided textual feedback explaining the reasoning behind each assigned score. The evaluation focused on alignment, accuracy, and completeness, ensuring that the feedback identified both strengths and areas for improvement. Scoring was performed on a 0–10 scale, yielding an average score of 7.8494 across all evaluations.
The prompts used in this process played a crucial role in maintaining consistency. As shown in Figure 9, the reference-answer generation prompt ensured that model-produced responses were clear, comprehensive, and well-structured. Meanwhile, Figure 10 illustrates the evaluation prompt, which guided the model in assessing the quality of student responses and providing constructive feedback. This structured approach ensured that the system effectively delivered meaningful insights, enabling students to understand the quality of their answers and identify areas for improvement. The full prompts are provided in Figure A2 and Figure A3.
To further examine the consistency of the evaluation process, an additional assessment was conducted using the same structured prompts for both feedback generation and evaluation. In this phase, feedback was generated using unsloth/llama-3-8b-Instruct-bnb-4bit and subsequently evaluated using Gemini-1.5-Flash, adhering to the predefined assessment criteria and resulting in an average score of 7.2457. Finally, to enhance the reliability of the evaluation results, both sets of generated feedback, produced by Gemini-1.5-Flash and unsloth/llama-3-8b-Instruct-bnb-4bit, were further evaluated using DeepSeek, following the same predefined assessment criteria. The DeepSeek-based evaluation assigned a score $S_i$ to each feedback instance, and the overall performance for each set was summarized using the arithmetic mean:

$\bar{S} = \frac{1}{n} \sum_{i=1}^{n} S_i,$

where $n$ is the number of feedback instances and $S_i$ is the score assigned to the $i$-th instance. This evaluation yielded average scores of 7.8771 for the Gemini-generated feedback and 6.9170 for the LLaMA-generated feedback, reinforcing the robustness and consistency of the proposed evaluation framework across different LLMs.
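This cross-model evaluation loop reduces to scoring each feedback instance with a judge model and averaging. The sketch below uses a trivial stand-in judge in place of the LLM judges actually used (such as DeepSeek), so the scoring rule is purely illustrative:

```python
def evaluate_feedback_set(feedback_set, judge):
    # Apply the judge to every instance, then summarize with the
    # arithmetic mean: S_bar = (1/n) * sum(S_i).
    scores = [judge(feedback) for feedback in feedback_set]
    return sum(scores) / len(scores)


def stub_judge(feedback: str) -> float:
    # Hypothetical stand-in judge: rewards feedback that gives a reason.
    return 8.0 if "because" in feedback else 6.0


avg = evaluate_feedback_set(
    ["Correct because the base case is handled.", "Partially correct."],
    stub_judge,
)  # -> 7.0
```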
This secondary evaluation confirmed the stability and reliability of the overall assessment framework, demonstrating that LLMs can consistently generate structured, insightful, and actionable feedback across diverse model architectures.
To further validate the quality and acceptability of the automatically generated feedback, a subset of feedback instances was presented to two experienced instructors for qualitative assessment. Both instructors indicated that the feedback was informative, relevant, and aligned with pedagogical expectations. Subsequently, a sample of feedback corresponding to different questions was evaluated by 20 undergraduate students, who reported a clear preference for the LLM-generated feedback over standard or generic comments. These observations provide additional evidence that the proposed framework produces meaningful, comprehensible, and pedagogically useful feedback, supporting the quantitative findings reported above.
In addition, this feedback will be incorporated into the dataset, providing a publicly available resource for other researchers to evaluate or benchmark automated short-answer grading systems. This ensures that the generated feedback not only serves as an evaluation tool but also contributes to reproducibility and further research in the field.