1. Introduction
Assessment plays a significant role in measuring the learning abilities of a student [1]. Academic examinations can be performed using many question types, including multiple-choice and free-response questions [2]. This study focuses on an automated free-response question grading system. A free-response question can be defined as a question whose answer allows the student to be more expressive; the answer may be short or long, spanning from a single phrase to a full page. These answers are typically given in natural language and demonstrate the knowledge a student has gained from understanding the question and the subject [3]. Human assessment is predominantly used for free-response question tasks. A considerable challenge arises as the teacher-to-student ratio increases [4]: the manual assessment process becomes more complicated and time-consuming, since the same task must be repeated numerous times. This repetition may trigger the so-called “human factor”, particularly the assignment of unequal grades to identical answers from different students.
The educational system is shifting towards web-supported electronic learning (e-learning), in which computer-based exams and automatic evaluation play a significant role. E-learning is a rapidly developing area that goes beyond simple rule-based methods, because a single question can receive many different responses and explanations from students. In automated free-response question grading, for every question, student answers are compared to reference answers and a score is assigned using machine-learning techniques [5]. Because assessment in the educational system is critical, it requires a highly accurate model: even a slight scoring error can have a large impact on the students being assessed.
Many state-of-the-art deep-learning methods for automatic evaluation have been proposed with good scoring accuracy. Automatic evaluation is a crucial application in the education domain that draws on Natural Language Processing (NLP) and machine-learning techniques. Transformer models are among the leading approaches, achieving state-of-the-art results for automated free-response question grading based on semantic textual similarity, but these approaches have predominantly focused on intra-sentence attention, which examines relationships within a single sentence or document.
However, these approaches often fall short of capturing the semantic relationships between different sentences. In this study, a novel transformer-based model is proposed that uses inter-sentence attention mechanisms to guide the model towards critical inter-sentence information, such as synonyms, hyponyms, metonyms, and antonyms. This enhanced focus aims to improve the model’s accuracy in identifying semantic equivalences and differences between sentence pairs.
2. Related Works
The application of machine learning in educational settings has expanded to include the analysis of student behaviors and interactions [6]. Learning analytics play a crucial role in this domain by modeling student–staff engagement, offering insights to enhance educational practices [7]. The field of Automated Short Answer Grading (ASAG) has gained significant attention as an alternative to traditional manual grading, which is often time-consuming and prone to inconsistencies [8]. With advancements in NLP and deep learning, various automated approaches have been proposed, ranging from rule-based and statistical models to deep-learning architectures such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Transformers [4,9,10].
Traditional ASAG systems primarily relied on lexical matching techniques and semantic similarity measures. Early studies utilized cosine similarity, Jaccard similarity, and latent semantic analysis (LSA) to compare student responses with reference answers [4]. However, these approaches struggled with synonymy, paraphrasing, and deeper linguistic structures, leading to the development of more sophisticated techniques such as machine-learning and deep-learning models [10].
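For illustration, a minimal sketch of these lexical baselines (TF-IDF cosine similarity, Jaccard overlap, and an LSA projection) is shown below; the example answers and settings are placeholders rather than those of the cited studies.

```python
# Illustrative sketch of classical lexical baselines for comparing a student answer
# with a reference answer; not the exact systems used in the cited studies.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

reference = "A stack is a last-in first-out data structure."
student = "A stack stores items so the last element added is removed first."

# TF-IDF cosine similarity between reference and student answer
vec = TfidfVectorizer().fit([reference, student])
tfidf = vec.transform([reference, student])
cos = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# Jaccard similarity over word sets
ref_tokens, stu_tokens = set(reference.lower().split()), set(student.lower().split())
jaccard = len(ref_tokens & stu_tokens) / len(ref_tokens | stu_tokens)

# LSA: in practice fit on a larger corpus; here we only project into a latent space
lsa = TruncatedSVD(n_components=1).fit_transform(tfidf)
print(f"cosine={cos:.3f}, jaccard={jaccard:.3f}, lsa_shape={lsa.shape}")
```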
Recent advancements in NLP and deep-learning architectures have shown promising results in tasks such as machine translation, text summarization, and text similarity, particularly in applications such as Automatic Essay Scoring (AES) and ASAG. Transfer-learning models have transformed ASAG performance, with models such as BERT, SBERT, RoBERTa, and XLNet showing state-of-the-art results in semantic similarity and grading [11,12]. These models employ self-attention mechanisms, allowing them to capture contextual relationships between words and sentences, leading to more accurate assessment of student responses [2].
Transformer models, particularly Bidirectional Encoder Representations from Transformers (BERT) [13] and its variants, have proven highly effective in ASAG. BERT’s self-attention mechanism allows it to process entire sequences bidirectionally, making it well suited for sentence-pair regression tasks such as semantic textual similarity (STS) and ASAG [14]. Several studies have explored fine-tuning pre-trained transformers for ASAG tasks.
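As a concrete illustration of sentence-pair regression with a BERT cross-encoder, the sketch below fine-tunes on a single (reference, student) pair; the checkpoint, data, and hyperparameters are illustrative assumptions, not the setup of any cited study.

```python
# Minimal sketch of fine-tuning BERT as a sentence-pair regressor (cross-encoder),
# as commonly done for STS/ASAG; data and hyperparameters are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)  # num_labels=1 -> regression (MSE loss)

pairs = [("Reference answer text", "Student answer text")]
scores = torch.tensor([4.5])  # gold grade, e.g. on a 0-5 scale

enc = tokenizer([p[0] for p in pairs], [p[1] for p in pairs],
                padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
out = model(**enc, labels=scores)   # loss is MSE when num_labels == 1
out.loss.backward()
optimizer.step()
print("predicted score:", out.logits.squeeze(-1).item())
```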
Zhu et al. [11] proposed a four-stage framework for ASAG utilizing a pre-trained BERT model. In the first stage, BERT was used to encode both student responses and reference answers. Next, a Bi-directional Long Short-Term Memory (Bi-LSTM) network was applied to enhance semantic understanding from BERT’s outputs. In the third stage, a Semantic Fusion Layer combined these outputs with fine-grained token representations to enrich contextual meaning. Finally, in the prediction stage, a max-pooling technique was employed to generate the final grading scores. The study, conducted on the Mohler and SemEval datasets, demonstrated an accuracy of 76.5% for grading unseen answers, 69.2% for unseen domains, and 66.0% for unseen questions. Additionally, on the Mohler dataset, the model achieved a Root Mean Square Error (RMSE) of 0.248 and a Pearson Correlation Coefficient (R) of 0.89, indicating strong grading performance and reliability.
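The sketch below shows one way such a BERT → Bi-LSTM → fusion → max-pooling pipeline could be wired in PyTorch; it is a simplified approximation for illustration, not Zhu et al.'s implementation, and the layer sizes and fusion step are assumptions.

```python
# Rough sketch of a BERT -> Bi-LSTM -> fusion -> max-pooling grading pipeline in the
# spirit of the four-stage framework; layer sizes and the fusion step are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBiLstmGrader(nn.Module):
    def __init__(self, name="bert-base-uncased", hidden=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                              batch_first=True, bidirectional=True)
        # simplified "semantic fusion": concatenate BERT token states with Bi-LSTM states
        self.fuse = nn.Linear(self.bert.config.hidden_size + 2 * hidden, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, **enc):
        tokens = self.bert(**enc).last_hidden_state        # (B, T, 768)
        lstm_out, _ = self.bilstm(tokens)                   # (B, T, 2*hidden)
        fused = torch.tanh(self.fuse(torch.cat([tokens, lstm_out], dim=-1)))
        pooled, _ = fused.max(dim=1)                        # max-pooling over tokens
        return self.score(pooled).squeeze(-1)               # predicted grade

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("reference answer", "student answer", return_tensors="pt")
print(BertBiLstmGrader()(**enc))
```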
Sung et al. [15] focused on enhancing ASAG by developing contextual representations using BERT. The goal was to improve the efficiency of pre-trained BERT models by incorporating domain-specific resources. To achieve this, the work utilized textbooks from disciplines such as the physiology of behavior, American government, human development, and abnormal psychology to fine-tune BERT for the ASAG task. The empirical study demonstrated that task-specific fine-tuning significantly improved BERT’s performance, leading to more accurate and reliable grading results.
Lei and Meng [16] introduced a Bi-GRU Siamese architecture built on a pre-trained ALBERT model for improved text-similarity assessment. In this approach, input expressions were first converted into word vectors using ALBERT and then processed by a Gated Recurrent Unit (GRU) network. To enhance semantic understanding, the researchers incorporated an attention layer after the Bi-GRU network. Finally, the model’s output was normalized using a softmax function, transforming predictions into a probability distribution for better accuracy. The experimental results showed that the proposed model outperformed traditional approaches, achieving higher accuracy in text-similarity tasks.
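A simplified Siamese sketch in this spirit is given below, with ALBERT embeddings, a Bi-GRU, a small attention layer, and a softmax head; the dimensions, pooling, and classification head are assumptions for illustration and do not reproduce Lei and Meng's exact architecture.

```python
# Simplified Siamese sketch: ALBERT embeddings fed to a Bi-GRU with attention pooling,
# then a softmax over similarity classes; sizes and heads are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class AlbertBiGruSiamese(nn.Module):
    def __init__(self, name="albert-base-v2", hidden=128, n_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.bigru = nn.GRU(self.encoder.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)          # attention over time steps
        self.out = nn.Linear(4 * hidden, n_classes)   # concatenation of both branches

    def encode(self, enc):
        states, _ = self.bigru(self.encoder(**enc).last_hidden_state)
        weights = torch.softmax(self.attn(states), dim=1)   # (B, T, 1)
        return (weights * states).sum(dim=1)                # attention-pooled vector

    def forward(self, enc_a, enc_b):
        a, b = self.encode(enc_a), self.encode(enc_b)
        return torch.softmax(self.out(torch.cat([a, b], dim=-1)), dim=-1)

tok = AutoTokenizer.from_pretrained("albert-base-v2")
enc_a = tok("a stack is last-in first-out", return_tensors="pt")
enc_b = tok("the last pushed item is popped first", return_tensors="pt")
print(AlbertBiGruSiamese()(enc_a, enc_b))   # probability distribution over classes
```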
Condor et al. [17] conducted a study comparing the effectiveness of Sentence-BERT (SBERT) with traditional techniques such as Word2Vec and Bag-of-Words. The findings revealed that SBERT-based models significantly outperformed those developed using older methods, demonstrating superior performance in capturing semantic meaning and improving automated grading accuracy.
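The SBERT-style grading idea can be illustrated with a bi-encoder that embeds both answers and compares them by cosine similarity; the checkpoint name, answers, and rubric rescaling below are placeholders.

```python
# Minimal sketch of SBERT-style grading by embedding similarity; the checkpoint name,
# answers, and rescaling are placeholders, not those used in the cited study.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
reference = "Photosynthesis converts light energy into chemical energy."
student = "Plants turn sunlight into chemical energy they can store."

emb = model.encode([reference, student], convert_to_tensor=True)
similarity = util.cos_sim(emb[0], emb[1]).item()     # cosine similarity in [-1, 1]
predicted_grade = similarity * 5                     # e.g. rescale to a 0-5 rubric
print(f"similarity={similarity:.3f}, grade~{predicted_grade:.2f}")
```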
Sayeed and Gupta [18] proposed a Siamese architecture for ASAG that evaluates descriptive answers by comparing student responses with reference answers. This method leverages RoBERTa bi-encoder-based Transformer models, designed to balance computational efficiency and grading accuracy. The model was trained on the SemEval-2013 two-way dataset and demonstrated either superior or equivalent performance compared to benchmark models, highlighting its effectiveness in ASAG tasks while remaining computationally feasible.
Bonthu et al. [19] proposed another method that uses sentence transformers such as SBERT (Sentence-BERT), which modifies BERT to optimize semantic similarity tasks. The model fine-tunes sentence embeddings and applies augmentation techniques such as random deletion, synonym replacement, and back translation to improve ASAG performance. The study demonstrated that combining text augmentation with a fine-tuned SBERT led to a 4.91% accuracy improvement.
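Two of the augmentation operations mentioned above (random deletion and synonym replacement) are sketched below on a toy example; the synonym table is a hypothetical stand-in for a lexical resource such as WordNet, and back translation, which normally relies on a translation model, is omitted.

```python
# Toy illustration of two text-augmentation operations; the synonym table is a
# hypothetical stand-in for a resource such as WordNet, and back translation is omitted.
import random

SYNONYMS = {"big": ["large", "huge"], "fast": ["quick", "rapid"]}  # illustrative only

def random_deletion(sentence: str, p: float = 0.1) -> str:
    """Drop each token with probability p (keep at least one token)."""
    tokens = sentence.split()
    kept = [t for t in tokens if random.random() > p]
    return " ".join(kept or [random.choice(tokens)])

def synonym_replacement(sentence: str, n: int = 1) -> str:
    """Replace up to n tokens that have an entry in the synonym table."""
    tokens = sentence.split()
    candidates = [i for i, t in enumerate(tokens) if t.lower() in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        tokens[i] = random.choice(SYNONYMS[tokens[i].lower()])
    return " ".join(tokens)

random.seed(0)
answer = "a big cache makes lookups fast"
print(random_deletion(answer), "|", synonym_replacement(answer))
```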
Wijanto et al. [3] present a novel approach to enhancing ASAG systems by integrating balanced datasets with advanced language models, specifically Sentence Transformers. The authors address the critical challenges of grading open-ended responses, emphasizing the importance of dataset balance to improve evaluation accuracy. Through comprehensive experimentation, the researchers demonstrated that their method significantly improves grading-performance metrics such as Pearson Correlation and RMSE while also maintaining computational efficiency. The findings indicate that combining simpler models with strategic data augmentation can achieve results comparable to more complex approaches, making ASAG systems more practical for educational settings. The study highlights future research directions, including exploring additional data augmentation techniques and addressing ethical considerations in model deployment. Overall, this work contributes valuable insights to the ASAG field, supporting the potential for broader implementation in educational assessments.
Kaya et al. [20] present a novel hybrid approach for ASAG utilizing Bidirectional Encoder Representations from Transformers (BERT) combined with a customized multi-head attention mechanism and parallel Convolutional Neural Network (CNN) layers. The model addresses the challenges of grading short answers in distance education, demonstrating improved accuracy and a meaningful understanding of student responses. The proposed system outperforms existing models evaluated on well-known datasets, showcasing its effectiveness in providing the reliable and efficient assessments essential for modern educational environments.
Badry et al. [21] introduce an automatic Arabic grading system for short-answer questions that uses NLP techniques to assess student responses, including text preprocessing, feature extraction, and semantic similarity analysis between student and model responses. The system is trained on a dataset collected by the authors, which differentiates it from studies using Kaggle data. The authors use machine-learning algorithms for grading and validate the system through experimental evaluation, achieving high accuracy. The study advances Arabic-language automated assessment and opens the way for future deep-learning improvements.
Existing ASAG methods have primarily relied on single-sentence or document representations, limiting their ability to compare longer, more complex responses. Most Transformer-based grading models focus on self-attention within individual sentences (intra-sentence attention) but fail to establish relationships between multiple sentences within a student’s response. To address this gap, Inter-Sentence Attention (iAttention), a novel mechanism that extends attention beyond single-sentence representations, was proposed. Unlike conventional models that process each sentence independently, iAttention captures dependencies between multiple sentences or documents, improving grading accuracy in free-response evaluations. This technique enhances contextual understanding and better aligns student responses with reference answers by considering relationships across the entire response rather than isolated sentences. The iAttention model incorporates hierarchical inter-sentence attention layers, ensuring that sentence-level coherence and interdependencies are captured before producing a final grading prediction. By integrating this mechanism, iAttention allows the model to understand discourse structure, improving grading accuracy over traditional BERT-based approaches.
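As a rough conceptual illustration of attention across sentences rather than within one, the sketch below lets each sentence embedding of a student response attend over the sentence embeddings of a reference answer and pools the result into a score; this is only an approximation of the idea, not the iAttention architecture itself, and the encoder, head, and texts are placeholders.

```python
# Conceptual sketch of inter-sentence attention: each sentence embedding of the
# student response attends over the sentence embeddings of the reference answer.
# This is an illustrative approximation, not the paper's iAttention layers.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder sentence encoder

student_sents = ["A stack is a data structure.", "The last item added is removed first."]
reference_sents = ["A stack follows last-in, first-out order."]

s = torch.tensor(encoder.encode(student_sents)).unsqueeze(0)    # (1, Ns, d)
r = torch.tensor(encoder.encode(reference_sents)).unsqueeze(0)  # (1, Nr, d)

cross_attn = nn.MultiheadAttention(embed_dim=s.size(-1), num_heads=4, batch_first=True)
aligned, weights = cross_attn(query=s, key=r, value=r)  # student sentences attend to reference

# Pool the aligned representations into a single (untrained) similarity/grade score
score_head = nn.Linear(s.size(-1), 1)
grade = torch.sigmoid(score_head(aligned.mean(dim=1))).item()
print("attention weights:", weights.shape, "untrained grade:", round(grade, 3))
```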
5. Results and Discussion
5.1. Benchmark Results
This section presents and discusses the experimental results obtained across five benchmark datasets: the STS benchmark, SciEntsBank, SemEval-2013 Beetle, Mohler, and U-datasets. The evaluation metrics vary based on task type and include Pearson Correlation (PC), Spearman Correlation (SC), Accuracy (Acc), Macro-F1 (M-F1), Weighted-F1 (W-F1), and Root Mean Square Error (RMSE). Five experiments were carried out on all datasets with the same hyperparameter settings, and their average was reported.
In this variant, the threshold parameter ε specifies the minimum semantic similarity score required to consider a pair of textual inputs semantically close. To determine an appropriate value, preliminary experiments were conducted, systematically varying ε within the range [0.1, 0.9]. The experimental results consistently indicated that ε = 0.4 achieved the most favorable performance across the development datasets, offering a balanced trade-off between strictness and tolerance. Based on these findings, ε was fixed at 0.4 for all experiments involving this variant, and this setting was used consistently throughout the subsequent evaluation and result reporting.
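The threshold search described above amounts to a simple sweep over a development set; the sketch below illustrates only the mechanics, using synthetic development data shaped purely for demonstration, and selects the ε that maximizes the development Pearson Correlation.

```python
# Sketch of the epsilon sweep: try thresholds in [0.1, 0.9] on a development split and
# keep the one with the best Pearson Correlation. The dev data here are synthetic
# placeholders (constructed for demo only); real runs would use the variant's predictions.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
gold = rng.uniform(0, 5, size=200)                      # hypothetical dev gold grades

def evaluate_dev(epsilon: float) -> float:
    """Placeholder evaluation: synthetic predictions whose quality varies with epsilon."""
    preds = gold + rng.normal(0, 1.5 * abs(epsilon - 0.4) + 0.2, size=gold.shape)
    return pearsonr(preds, gold)[0]

best_eps, best_pc = None, -1.0
for eps in np.round(np.arange(0.1, 1.0, 0.1), 1):       # 0.1, 0.2, ..., 0.9
    pc = evaluate_dev(float(eps))
    if pc > best_pc:
        best_eps, best_pc = float(eps), pc
print(f"selected epsilon={best_eps} (dev Pearson={best_pc:.3f})")
```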
5.1.1. STS Benchmark Results
Table 2 reports performance on seven STS datasets using both BERT-based and RoBERTa-based models. The proposed iAttention-enhanced variants substantially outperformed the existing baseline models. Within the BERT-based models, -BERT and -BERT (ε = 0.4) yielded the highest Pearson and Spearman Correlations across multiple datasets. For instance, -BERT achieved a Pearson Correlation of 92.58 and a Spearman Correlation of 87.23 on STS12, while -BERT (ε = 0.4) attained 94.20 and 89.40, respectively, on STS-B, outperforming strong baselines like dictBERT, BERT-sim, and DisBERT.
The RoBERTa-based iAttention models continued this trend, achieving new state-of-the-art results. -RoBERTa (ε = 0.4) reached the highest performance across nearly all datasets, with top scores such as PC = 94.20 for STS-B and PC = 91.23 for SICK. These results validate the ability of iAttention models to capture semantic alignments more effectively than conventional transformer architectures, offering improved generalization and robustness for semantic similarity tasks.
5.1.2. SciEntsBank Dataset Results
Performance on the SciEntsBank dataset is presented in Table 3, evaluated under both two-way and three-way classification settings across the three test types (UA, UQ, UD). The iAttention-based models demonstrated superior performance in all cases. In the two-way setting, -BERT achieved the highest Accuracy of 0.828 and a Macro-F1 of 0.823 for UA, outperforming all classical and transformer-based baselines. In the more challenging three-way classification setting, -BERT retained high performance, with an Accuracy of 0.690, a Macro-F1 of 0.662, and a Weighted-F1 of 0.664 on the UD subset, surpassing existing methods including TF+SF, XLNet, and RoBERTa-lrg-vl.
These findings indicate the robustness of the proposed inter-sentence attention mechanisms, particularly in handling varied and ambiguous student responses. The marginal performance drop under the three-way classification compared to the two-way setting, as seen in iAttention models, further highlights their stability and semantic sensitivity.
5.1.3. SemEval-2013 Beetle and Mohler Dataset Results
The results presented here were compared with Tf-Idf [9], Lesk [9,10,12,19], Mohler et al. [29], TF+SF [without question] [30], TF+SF [with question] [30], BERT Regressor + Similarity Score [31], XLNet [32], CoMeT [33], ETS [34], Roberta-large-vl [18], SoftCardinality [35], and UKP-BIU [20].
Table 4 presents results on the SemEval-2013 Beetle dataset that reinforce the efficacy of the proposed models. Under the two-way classification setup, -BERT (ε = 0.4) achieved the best performance, with Macro-F1 scores of 0.872 (UA) and 0.783 (UQ), outperforming [20] and other transformer-based models. Under the three-way classification, the iAttention models retained their advantage, with -BERT (ε = 0.4) again outperforming all other methods and achieving a Macro-F1 of 0.666 and a Weighted-F1 of 0.657 on UQ, an area where traditional models like ETS and UKP-BIU show marked declines.
Table 5 presents results on the Mohler dataset, evaluated using Pearson Correlation and RMSE. The proposed -BERT achieved the highest correlation score (PC = 0.840) and the lowest RMSE (0.650), significantly outperforming the other iAttention variants, including -BERT and -BERT. It also surpassed all baseline methods, demonstrating the model’s capability to provide accurate numeric predictions aligned with human grading. These results confirm that the iAttention Transformer effectively captures meaningful semantic relationships essential for short-answer scoring.
5.1.4. U-Datasets Results
This section presents the experimental results on the U-datasets, which include the Student Grade Dataset MIS221 (SGDM221), Student Grade Dataset MIS415 (SGDM415), and their combined version, Combined SGDM (CSGDM). The results represent the average performance over five independent experiments conducted under identical hyperparameter settings, as described in Table 1. The only modification was the maximum sequence length, which was set to 512 tokens for BERT and RoBERTa.
Table 6 provides a comparative evaluation of the different models, demonstrating the effectiveness of iAttention-enhanced Longformer models for automated grading. The results indicate that Longformer-based models significantly outperform traditional transformer models, such as BERT, SBERT, and RoBERTa. For SGDM221, BERT achieved a Pearson Correlation (PC) of 51.06 and a Spearman Correlation (SC) of 52.78, while RoBERTa showed a slight improvement with 51.89 PC and 53.06 SC. SBERT, however, outperformed both, achieving 60.67 PC and 61.65 SC, indicating a stronger ability to capture sentence-level semantic relationships. Despite these improvements, Longformer exhibited a substantial performance boost, reaching 70.89 PC and 73.90 SC on SGDM221. Further enhancements were observed with the iAttention-based Longformer models, which introduce attention mechanisms to improve contextual understanding. The -Longformer model achieved 75.29 PC and 78.50 SC on SGDM221, while the -Longformer further improved the results to 81.78 PC and 82.45 SC, demonstrating the effectiveness of word-level attention in refining predictions. The best overall performance was obtained using -Longformer (ε = 0.4), which achieved 81.00 PC and 81.60 SC for SGDM221. This model also performed exceptionally well for SGDM415, reaching 84.08 PC and 85.69 SC, and it maintained its superior performance for CSGDM, where it achieved 83.56 PC and 84.79 SC. These results reinforce the effectiveness of the iAttention mechanisms, which improve the model’s ability to focus on key information and understand semantic relationships in student responses. By capturing inter-sentence dependencies, iAttention-enhanced Longformer models deliver more accurate and reliable grade predictions that closely align with human assessments. The superiority of Longformer-based models, particularly those incorporating iAttention, demonstrates their potential for advancing automated grading systems. The results for the SGDM221, SGDM415, and CSGDM datasets confirm the robustness of the proposed approach.
5.2. Statistical Significance Analysis
To assess the reliability of the proposed models, statistical significance tests were conducted to compare their performance against competitive baseline models. The paired t-test was applied to determine whether the improvements reported in Section 5.1 were statistically meaningful. This test evaluates whether the mean difference between two paired samples is statistically significant, providing insight into the effectiveness of the proposed approach. Among the three proposed iAttention variants, the -variant was selected as the primary model for comparison due to its consistent performance across datasets.
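The paired t-test can be applied to matched metric values, for example the per-dataset or per-run scores of the proposed model and a baseline; the numbers below are placeholders for illustration.

```python
# Paired t-test over matched metric values (e.g. Pearson Correlation per dataset/run)
# for the proposed model versus a baseline; the numbers below are placeholders.
from scipy.stats import ttest_rel

proposed = [92.6, 90.1, 91.4, 94.2, 91.2, 89.8, 90.5]   # e.g. PC on seven STS tasks
baseline = [88.3, 87.0, 88.9, 90.4, 88.1, 86.5, 87.2]

t_stat, p_value = ttest_rel(proposed, baseline)
alpha = 0.05
print(f"t={t_stat:.3f}, p={p_value:.6f}, "
      f"{'significant' if p_value < alpha else 'not significant'} at alpha={alpha}")
```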
In Table 7, -RoBERTa is compared against a wide range of STS baselines across all seven tasks. The results indicate that improvements in the Pearson Correlation (PC) and Spearman Correlation (SC) are statistically significant in nearly all comparisons. For instance, p-values against strong baselines such as BERT (PC = 0.008179, SC = 0.000339) and SemBERT (PC = 0.000348, SC = 0.000223) fall well below the conventional 0.05 threshold, confirming the robustness of the observed performance gains. Even against other competitive variants like -RoBERTa, significance is retained in most cases, reinforcing the advantage of using iAttention with confidence weighting.
Table 8 presents the t-test outcomes for the SciEntsBank dataset, comparing -BERT (ε = 0.4) with prior baselines in terms of Accuracy, Macro-F1, and Weighted-F1. Statistically significant differences were observed for all major models, including CoMET, ETS, and XLNet, with p-values consistently below 0.05. For example, the difference in Accuracy compared to CoMET is significant at p = 0.005173, and Macro-F1 compared to ETS yields p = 0.008321. Comparisons with other iAttention variants such as -BERT and -BERT also show significance in Macro- and Weighted-F1, indicating the impact of the iAttention model.
In Table 9, the proposed -BERT (ε = 0.4) model is compared against various baselines on the SemEval-2013 Beetle dataset. The model consistently achieves statistically significant improvements in Macro-F1 across most baselines, including CELI (p = 0.005025), CNGL (p = 0.013998), and LIMSILES (p = 0.001275). Although the improvements in Weighted-F1 are not always statistically significant (for instance, p = 0.362140 against CoMET), the results remain competitive and highlight the strength of the proposed model, particularly in terms of class-balanced metrics. Comparisons with other iAttention variants such as -BERT and -BERT show that -BERT (ε = 0.4) performs significantly better in terms of Macro-F1, confirming the value of iAttention.
Table 10 further substantiates these findings using the U-datasets (SGDM221, SGDM415, and CSGDM). The proposed -Longformer (ε = 0.4) outperforms all baselines, with statistically significant differences in Pearson and Spearman Correlations.
For example, it achieves p = 0.000882 (PC) and p = 0.000029 (SC) against BERT, and similarly significant improvements against SBERT, RoBERTa, and Longformer, with all p-values under the 0.05 threshold. Even when compared with other strong attention variants such as –Longformer, the proposed model maintains a significant edge (PC: p = 0.011591). These results affirm that the performance gains reported throughout the experiments are not only consistent but also statistically robust, validating the practical effectiveness of inter-sentence attention mechanisms in automated grading and semantic similarity tasks.
5.3. Performance Analysis of Models
The results presented in the box plots and bar charts provide a comprehensive evaluation of various models across all experimental tasks, including STS, SciEntsBank, SemEval-2013 Beetle, Mohler Dataset, and U-Datasets. The primary focus is on the effectiveness of iAttention-enhanced models compared to traditional BERT, RoBERTa, and Longformer-based approaches.
The performance across the STS tasks, as shown in Figure 6 (PC scores) and Figure 7 (SC scores), reveals that iAttention-based models consistently achieve higher Pearson Correlation (PC) and Spearman Correlation (SC) scores compared to baseline transformer models. -RoBERTa (ε = 0.4) and -RoBERTa demonstrate superior semantic understanding, outperforming models such as BERT, unsup-SimCSE, and RoBERTa, which exhibit higher variability. The presence of lower quartiles and outliers in models like DisBERT and unsup-SimCSE indicates inconsistency in capturing textual relationships across datasets.
On the SciEntsBank dataset, the Accuracy, Macro-F1, and Weighted-F1 score distributions further validate the superiority of the iAttention models. Figure 8 (Accuracy scores), Figure 9 (Macro-F1 scores), and Figure 10 (Weighted-F1 scores) illustrate that traditional models such as CoMeT, ETS, and SOFTCAR show considerable variance in grading consistency, while -BERT (ε = 0.4) and -BERT exhibit stability with consistently higher Accuracy and F1 scores. The ability to generalize effectively across various student responses demonstrates the robustness of the iAttention-based approaches.
The results from the SemEval-2013 Beetle dataset reinforce the trend observed in the SciEntsBank evaluation. Figure 11 (Weighted-F1 scores) and Figure 12 (Macro-F1 scores) indicate that traditional feature-based grading models, including CELI and CoMeT, struggle with grading consistency. Meanwhile, UKP-BIU and SoftCardinality show slight improvements, but they do not match the performance of the iAttention models. The lower variance and higher median values of -BERT and -RoBERTa confirm their effectiveness in automated short-answer grading.
The performance on the Mohler dataset, as depicted in Figure 13, shows that iAttention-based models achieve the highest Pearson Correlation scores while maintaining the lowest Root Mean Square Error (RMSE). Traditional methods such as TF-IDF, Lesk, and Mohler et al. exhibit lower correlation scores, indicating weaker semantic representation in grading. While BERT-based regressors improve upon these baselines, -BERT (ε = 0.4) delivers the best performance, highlighting the impact of hierarchical attention mechanisms.
The evaluation on the U-Datasets across SGDM221, SGDM415, and CSGDM confirms the advantages of iAttention models in handling complex textual data. Figure 14 presents a stacked bar chart comparing PC and SC scores across the different models. Longformer-based models outperform BERT, SBERT, and RoBERTa, particularly when analyzing longer textual responses. The inclusion of hierarchical and word-level attention mechanisms in iAttention models further enhances performance, with -Longformer (ε = 0.4) achieving the highest correlation scores across all datasets. The results demonstrate that transformer-based grading approaches with enhanced attention mechanisms offer significant improvements compared to traditional models.
Overall, iAttention-enhanced models consistently outperform standard transformer models across multiple benchmarks. The integration of hierarchical attention significantly improves grading Accuracy, particularly in long-form textual responses. STS, SciEntsBank, and SemEval-2013 Beetle evaluations confirm the superiority of iAttention-based approaches in capturing semantic relationships and improving automated grading performance. The results further indicate that Longformer-based models, particularly -Longformer (ε = 0.4), provide the best performance in student grading tasks, demonstrating their capability to handle complex answer structures.
The findings confirm that iAttention-based models are highly effective in automated grading, achieving higher correlation scores, reduced RMSE, and improved F1 scores across multiple benchmark datasets. These advancements reinforce the potential of attention-based architectures in enhancing automated grading systems, ensuring fair, consistent, and scalable assessment processes.
5.4. Comparison of Results with Expert Scores
Although the iAttention-sentence Transformers were trained on the U-Dataset, their outputs must still be evaluated to determine their usefulness in a real-world grading process. The evaluation involves selecting two models and determining to what degree the scores they generate satisfy human intuition. This section compares the results of the iAttention-sentence Transformers with experts’ opinions using Pearson Correlation (PC), Spearman Correlation (SC), and absolute score differences (ASD). To achieve this, six questions were given to three students; the resulting answers were fed into the selected iAttention Transformers and were also given to human experts for grading. An example can be found in Appendix A.
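The three agreement measures used in this comparison can be computed as sketched below; the score vectors are placeholders, and the absolute score difference is interpreted here as the mean absolute difference between paired scores, which is an assumption about its exact definition.

```python
# Agreement between model scores and one expert's scores: Pearson, Spearman, and
# absolute score difference (taken here as mean |model - expert|, an assumption).
# The score vectors are placeholders, not the values from Table 11.
import numpy as np
from scipy.stats import pearsonr, spearmanr

model_scores  = np.array([4.5, 3.0, 5.0, 2.5, 4.0, 3.5])   # six questions
expert_scores = np.array([4.0, 3.5, 5.0, 2.0, 4.5, 3.0])

pc, _ = pearsonr(model_scores, expert_scores)
sc, _ = spearmanr(model_scores, expert_scores)
asd = np.mean(np.abs(model_scores - expert_scores))
print(f"PC={pc:.3f}, SC={sc:.3f}, ASD={asd:.3f}")
```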
Table 11 presents the scores generated by the iAttention-sentence Transformers and the scores of the human experts. Table 12 presents a comparative result in which the models’ performance is compared to the human expert grading. The model performance is moderately good in terms of Pearson Correlation with the human experts (0.493–0.725), aligning most closely with Expert 3’s score at 0.73, and the Spearman Correlation shows better performance, ranging from 0.577 to 0.774. The absolute score differences suggest that aligns well with Expert 3, for which it records the lowest value (0.894). Furthermore, consistently has higher correlations with the human expert scores when compared to the model, especially with Expert 3’s scores (0.817 for Pearson Correlation and 0.793 for Spearman). This indicates that there is strong agreement between and Expert 3. also has lower absolute score differences overall, with 0.828 for Expert 3, which shows that and human Expert 3 closely match. Considering the mean and median of the human expert scores, the models show a moderately strong relationship. also outperforms the model, having a PC of 0.709 for the experts’ mean score and 0.713 for the median score, compared to the model’s scores of 0.618 and 0.633.
The models were also compared to the original scores from the dataset, which were not part of the training and testing process. Both models show strong correlations with the original scores, with the slightly outperforming . has a Pearson Correlation of 0.835 and a Spearman Correlation of 0.871, with an absolute score difference of 0.823. The expert scores also correlate well with the original scores, with Expert 1 having the highest Pearson Correlation (0.881) and Spearman Correlation (0.841) as well as an absolute score difference of 0.861; compared to the other human experts, Expert 1’s scores are the most aligned with the original scores. Generally, the model performs better across all comparisons, including both expert scores and original scores, which shows that it captures more meaningful aspects of the scoring process and leads to better alignment with human experts.
5.5. Computational Complexity Analysis
The efficiency of automated grading models is a crucial factor in real-world applications, where scalability and computational feasibility play significant roles. Table 13 presents a comparative analysis of the computational complexity of the different iAttention-sentence variants, evaluating their training time per epoch, memory usage, and model size in terms of parameters. The baseline model for these evaluations was Longformer, ensuring a consistent benchmark for comparison. The results indicate that is the most computationally intensive model, requiring the longest training time (579 sec per epoch) and the highest memory consumption (10,579.04 MB). This increase is attributed to the hierarchical attention mechanism, which introduces additional computations to enhance the model’s ability to capture inter-sentence relationships. On the other hand, is the most efficient variant, with the shortest training time (331.7 sec per epoch) and lower memory consumption (10,461.05 MB). The reduction in computational cost can be explained by its reliance on TF-IDF embeddings, which provide a lightweight textual representation without requiring extensive deep-learning operations. The -model, which employs word-level attention, falls between the two in terms of complexity, requiring 567 sec per epoch and utilizing 10,476.04 MB of memory. The additional computations needed to refine word-level dependencies contribute to the increased training time and memory consumption compared to iAttention-TF-IDF. Overall, while offers the best grading accuracy, it comes at the cost of a higher computational overhead. The trade-off between efficiency and accuracy must be carefully considered, depending on the specific requirements of an automated grading system, particularly in resource-constrained environments.
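The quantities reported in Table 13 (training time per epoch, peak memory, and parameter count) can be measured with standard PyTorch utilities, roughly as sketched below; the model, data loader, loss, and optimizer are placeholders for the corresponding iAttention variant and its training setup.

```python
# Sketch of measuring per-epoch training time, peak GPU memory, and parameter count
# with standard PyTorch utilities; `model`, `loader`, `criterion`, and `optimizer`
# are placeholders for the iAttention variant and its training setup.
import time
import torch

def profile_epoch(model, loader, criterion, optimizer, device="cuda"):
    model.to(device).train()
    if device == "cuda":
        torch.cuda.reset_peak_memory_stats(device)
    start = time.perf_counter()
    for batch, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch.to(device)), targets.to(device))
        loss.backward()
        optimizer.step()
    seconds = time.perf_counter() - start                       # training time per epoch
    peak_mb = torch.cuda.max_memory_allocated(device) / 2**20 if device == "cuda" else 0.0
    n_params = sum(p.numel() for p in model.parameters())       # model size
    return seconds, peak_mb, n_params

# Example (with a real model/loader): sec, mb, params = profile_epoch(model, loader,
#                                                                     torch.nn.MSELoss(), optimizer)
```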
7. Future Direction
While the iAttention-sentence Transformer models have shown strong performance in grading free-response answers, further research is required to enhance their robustness, scalability, and practical deployment. The most pressing future direction is the integration of multilingual support, as the current experiments are limited to English. Planned work includes experimenting with multilingual transformers such as XLM-R and mBERT, along with cross-lingual sentence embeddings, to assess generalization across languages and educational contexts. A pilot setup is being considered that uses translated student responses in combination with fine-tuning on small domain-specific multilingual datasets. Fairness and bias mitigation represent another critical challenge. Future efforts will include bias auditing across demographic subgroups and the application of fairness-aware training techniques, such as sample reweighting or adversarial debiasing, particularly in datasets where scoring disparities are observed. Interpretability is also essential for user trust. Attention-weight visualization and saliency mapping will be explored to trace how the model aligns response segments with reference answers. To support deployment at scale, computational efficiency must be improved. Proposed experiments include pruning iAttention layers, applying quantization methods, and testing lighter-weight encoders for inference on low-resource devices. Additionally, integrating OCR pipelines will extend model usability to handwritten student responses, a necessity in many classroom settings.
Finally, to ensure generalizability, future work will benchmark the model on broader datasets, including STEM-focused questions and code-based assessments. These directions aim to build more inclusive, explainable, and deployable grading systems, especially in under-resourced or linguistically diverse environments.