1. Introduction
Automated essay scoring (AES) systems are instrumental in contemporary education, providing efficient grading solutions and timely feedback to students. These systems help reduce the workload on educators while maintaining consistent and objective evaluation standards [
1]. However, the development of reliable AES systems is severely hampered by the scarcity of extensive, expert-scored essay datasets [
2], a challenge that is particularly pronounced for languages with limited computational resources.
Arabic, spoken by over 400 million people and an official language in 26 countries, starkly illustrates this issue. Despite its global prominence, Arabic lags behind in language technology advancements, largely due to the insufficient availability of high-quality, scored essay datasets [
3]. The evaluation of writing quality is a complex task that requires assessment across multiple dimensions, including Relevance, Organization, Vocabulary, Style, Development, Mechanics, and Structure. This need for multidimensional evaluation has been highlighted in foundational work on writing assessment [
4], a task that current datasets struggle to support adequately.
The absence of extensive, multidimensional Arabic essay datasets limits the diagnostic capabilities of AES systems, often confining them to offering broad, holistic scores rather than detailed, actionable feedback. This gap underscores the need for innovative methodologies that can expand and enhance the quality of Arabic AES resources without the substantial costs associated with traditional manual annotation processes. This is especially pertinent given the critiques of traditional automated systems that lack the ability to provide meaningful, nuanced feedback [
5].
In addressing these challenges, this research pioneers the use of LLMs to augment the scoring process for Arabic essays. We introduce a human–AI collaborative framework designed to overcome the shortage of expert-scored data by leveraging LLMs to generate multidimensional essay evaluations across seven key writing traits. This scalable solution not only expands the quantity of available training data but also improves its quality by providing a more comprehensive and insightful assessment framework. By leveraging LLMs, which have demonstrated a strong ability for few-shot learning and human-like text generation and evaluation [
6], we aim to develop a scalable solution that aligns with recent efforts to use LLMs for data annotation and feedback generation [
7].
The contributions of the proposed work are designed to advance the field of Arabic AES and serve as a model for other low-resource languages. We introduce a systematic methodology for optimizing LLM-based scoring, specifically tailored for long-form Arabic essays. This research provides empirical insights into the effectiveness of different LLMs in assessing various writing dimensions, enriching the landscape of Arabic NLP resources. Furthermore, we substantially expand the pool of available Arabic AES datasets with multidimensional annotations, focusing on essays of meaningful length suitable for academic evaluation. Finally, we present a replicable framework that can be adapted to address annotation challenges in other languages facing similar resource limitations.
This study seeks to answer the following research questions: (1) What is the most effective methodology for refining LLM-based scoring of Arabic essays through prompt engineering and revision processes? (2) How does the performance of individual LLMs and ensemble methods compare across different writing dimensions? (3) Can our framework be used to generate high-quality, multidimensional annotations at scale to expand Arabic AES datasets? (4) Is the proposed scoring framework generalizable when applied to new and varied datasets?
The remainder of this paper is organized as follows:
Section 2 covers a review of the literature on AES with an emphasis on Arabic language challenges.
Section 3 describes the datasets used in this study.
Section 4 details the proposed methodology for LLM-based scoring and dataset expansion.
Section 5 presents experimental results, followed by the Discussion in
Section 6 and Conclusions in
Section 7.
2. Literature Review
This section examines the landscape of AES with particular focus on Arabic language challenges. We first present an overview of AES and the unique complexities of Arabic language processing, followed by an analysis of benchmark datasets in both English and Arabic. We then review methodological approaches from traditional machine learning to recent advances using LLMs, discuss evaluation metrics for AES systems, and conclude by identifying critical research gaps that the proposed work aims to address.
2.1. Overview of Automated Essay Scoring
AES has undergone a significant evolution in recent decades. The field has matured considerably in English, with diverse system types benefiting from rich resources. However, Arabic AES development consistently lags behind its English counterparts due to three fundamental challenges unique to the Arabic language.
First, the morphological complexity of Arabic creates substantial processing challenges. Arabic utilizes a root-and-pattern system where a single root (e.g., k-t-b) can generate numerous derived forms, complicating feature extraction [
3]. Its non-concatenative morphology alters internal vowel patterns (e.g.,
kataba “he wrote” vs.
yaktubu “he writes”). Additionally, affixation ambiguity occurs where a single prefix like “y” can simultaneously mark future tense and the third person.
Second, diglossic variation creates inconsistencies between Modern Standard Arabic (MSA) used in formal writing and regional dialects. There is significant lexical divergence where 58% of common vocabulary differs between MSA and dialectal variants such as Egyptian Arabic [
8]. Syntactic differences exist as well, with dialects often abandoning MSA’s verb–subject–object (VSO) word order. Orthographic variability is particularly evident in informal contexts where diacritics are omitted and non-standard spellings are common.
Third, data scarcity significantly impedes Arabic AES development. The combined size of all public Arabic essay datasets totals approximately 4500 essays, compared to over 130,000 available for English [
9]. There is limited prompt diversity, with most Arabic datasets covering fewer than five distinct writing topics. Arabic datasets also suffer from insufficient average essay length (typically 50–100 words) compared to the 350+ word average in English datasets. Finally, there is a lack of comprehensive trait-specific annotations necessary for multidimensional assessment.
These challenges create a significant barrier to Arabic AES advancement, necessitating innovative approaches that can effectively address both the linguistic complexities and data limitations simultaneously.
2.2. English AES Foundations and Benchmark Datasets
As shown in
Table 1, English AES benefits from extensive benchmark datasets with large sample sizes and comprehensive trait coverage. These datasets provide robust resources for training and evaluating English AES systems, creating a stark contrast with the limited Arabic resources described in subsequent sections.
The English AES landscape is dominated by three major benchmark datasets. The Hewlett Foundation dataset is the original ASAP collection: essays written by students in grades 7 to 10, each evaluated by two raters. Six of its eight prompts carry only overall scores; only two include scores for individual essay attributes such as Content, Organization, and Style. ASAP++ serves as the gold standard, containing 13,000 essays across eight diverse prompts with comprehensive scoring rubrics that facilitate consistent evaluation; its main contribution is adding attribute-level scores (Content, Organization, Style, etc.) for the remaining Hewlett essays. TOEFL11 offers a specialized focus on non-native English writing with strict holistic scoring protocols, making it valuable for second-language assessment research. Together, these resources give English AES researchers training and evaluation capabilities that dramatically exceed what is currently available for Arabic.
Having established the extensive resources available for English AES development, we now examine the comparatively limited landscape of Arabic essay datasets.
Table 2 presents a comparison of key Arabic AES resources, highlighting significant disparities when contrasted with English datasets, particularly regarding size, essay length, and comprehensive trait coverage.
As summarized in
Table 3, the quantitative comparison between representative Arabic and English datasets further highlights the resource disparity.
Dataset-Specific Analysis
The landscape of Arabic AES datasets presents varied approaches to assessment, each with distinct characteristics and contributions to the field.
The QCAW/QAES dataset introduces a significant innovation as the first trait-specific Arabic dataset with multi-layer annotations covering Organization, Vocabulary, Style, Development, Mechanics, and Structure (scored 1–5) along with Relevance (scored 1–2). This granular approach achieves high inter-rater reliability with detailed analytic scoring rubrics. However, QAES remains severely limited by its small size (195 essays) and minimal topic diversity (only 2 prompts).
The ZAEBUC dataset offers a bilingual Arabic–English approach with CEFR proficiency labels (A1–C2) and includes valuable writer metadata such as L1 background and education level. This structure makes ZAEBUC particularly valuable for cross-lingual transfer learning studies and as a benchmark for holistic scoring models. Despite these innovations, ZAEBUC remains constrained by its limited scale (214 essays) and narrow topical scope (covering only 3 topics).
The QALB (2015) Corpus focuses primarily on grammatical error correction, categorizing 14 detailed error types ranging from orthographic to morphological issues. Its strength lies in robust error detection capabilities, achieving 79.36% F1 score [
20] with CAMeLBERT [
21] across its 80,000 annotated sentences. However, QALB falls short for holistic essay evaluation due to the absence of topic annotations and its sentence-level focus, which limits discourse-level analysis. The corpus exists in two versions (2014 and 2015); we used the 2015 version.
AR-AES represents the largest available Arabic essay collection at 2046 essays and demonstrates AraBERT’s [
22] effectiveness, with a 79.49% exact match with human raters and a QWK of 0.88 in rubric-based scoring. However, this dataset suffers from a critical limitation in average essay length (just 58 words), which restricts its utility for more complex writing assessment, and lacks CEFR alignment that would facilitate international benchmarking.
ARAScore combines holistic scoring with four trait dimensions (Content, Organization, Grammar, Style) across 1500 essays covering eight diverse topics. While this represents one of the larger collections, ARAScore’s utility is diminished by very short essay length (averaging 40 words), limited public availability, and lack of dialectal variation that would better represent real-world Arabic writing.
Rounding out the landscape, AR-ASAG specializes in short-answer grading for academic content, covering 48 questions balanced across STEM (62%) and humanities (38%) through 2133 student–teacher answer pairs. The dataset features detailed scoring guidelines on a 0–3 scale with high human agreement (Pearson r = 0.838), making it valuable for focused assessment tasks, though its short-answer format differs from traditional essay evaluation.
Collectively, these datasets illustrate both the progress and persistent challenges in Arabic AES development. While they provide valuable resources for specific assessment tasks, none match the comprehensive coverage, scale, or quality of benchmark English datasets, highlighting the critical gap this research aims to address.
With an understanding of the available Arabic datasets and their limitations, we now examine the methodological approaches that researchers have employed to develop AES systems for Arabic, from traditional machine learning to more recent deep learning advances.
2.3. Techniques of Automatic Essay Scoring
The evolution of Arabic AES methodologies has progressed through three distinct generations: traditional machine learning systems, deep learning architectures, and LLM applications. This progression reflects both technological advancements and a growing understanding of Arabic’s unique linguistic challenges, including its complex morphology and diglossic nature.
2.3.1. Traditional Machine Learning Approaches
The first generation of Arabic AES systems relied on traditional machine learning techniques with manually engineered linguistic features. AAEE [
23] demonstrated that feature-based approaches utilizing Latent Semantic Analysis (LSA) for semantics and AraComLex [
24] for spelling error detection achieved reasonable accuracy, with a correlation to manual evaluation of 0.756. Ouahrani and Bennouar (2020) [
19] explored an unsupervised corpus-based approach with the COALS algorithm for semantic analysis in their AR-ASAG dataset. Hybrid methodologies were also employed; for instance, in the QALB shared task, participating systems combined machine learning modules with rules and morphological corrections, or statistical machine translation with language models and rules. The AAEE system [
23] established an important benchmark through its comprehensive feature engineering pipeline, though these early approaches were constrained by their dependency on manual feature extraction and limited capacity to model Arabic’s complex morphological structures.
2.3.2. Deep Learning Advancements
The second generation witnessed transformative changes through deep learning architectures. Initial breakthroughs came from Long Short-Term Memory (LSTM) networks, with optimization techniques like RMSProp [
25]. Neural networks, including recurrent and convolutional networks, proved effective for automatic text scoring [
25,
26]. Bidirectional architectures (e.g., GRU, LSTM variants) were particularly effective at capturing Arabic’s contextual relationships [
25], while hybrid CNN-RNN models combined local feature extraction with sequential processing [
26]. The paradigm shift occurred with Arabic-optimized transformer models; AraBERT [
22] and ARBERT [
27] demonstrated superior semantic understanding. AraBERT achieved a Quadratic Weighted Kappa (QWK) of 0.88 on the AR-AES dataset [
7,
17]. Moreover, a parameter-efficient approach [
28] based on the AraBART model integrates strategies such as Parameter-Efficient Fine-Tuning, Model Soup, Multi-Round Inference, and Edit Merging to efficiently address grammatical error correction and essay scoring. These advancements were facilitated by specialized datasets like QCAW and QAES [
12,
13,
14].
2.3.3. Large Language Model Applications
The current generation explores LLM applications, revealing both promise and persistent challenges. Evaluations of GPT-4 and ACEGPT demonstrate that few-shot approaches (QWK = 0.67–0.75) still lag behind fine-tuned BERT models (QWK = 0.88) [
7]. Innovative hybrid approaches are emerging, such as a framework [
29] that leverages rule-based techniques alongside LLMs to generate synthetic data with a controlled distribution of error types, thereby enhancing performance in underrepresented grammatical errors. This performance gap stems from three key Arabic-specific challenges: (1) tokenization inefficiencies causing morpheme splitting errors [
7], (2) the need for more few-shot examples (though studies used 1–3 examples per class due to context window limitations) [
7], and (3) optimal performance requiring mixed-language prompts [
7]. Further solutions include [
30]’s use of GPT-4 for synthetic data generation (3040 CEFR-aligned essays with controlled error injection). This third generation continues to evolve, with hybrid approaches combining LLM strengths with Arabic-specific optimizations showing particular promise for overcoming current limitations.
A summary of the comparative performance and key characteristics of AES approaches discussed can be found in
Table 4.
2.4. Comparative Studies and Metrics
Arabic AES systems are evaluated primarily with the Quadratic Weighted Kappa (QWK), the standard metric for measuring agreement between automated and human scores [
31]. QWK values range from −1 to 1, with specific ranges corresponding to different levels of agreement quality, as outlined in
Table 5.
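For readers implementing this metric, the following minimal sketch shows how QWK can be computed with scikit-learn's quadratically weighted Cohen's kappa; the score vectors are synthetic examples, not values from the datasets discussed here.

```python
# Quadratic Weighted Kappa between human and automated scores (synthetic data).
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 4, 2, 5, 4, 3, 1, 4]   # hypothetical rater scores on a 1-5 scale
model_scores = [3, 4, 3, 5, 3, 3, 2, 4]   # hypothetical automated scores

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")   # 1 = perfect agreement; 0 or below = no better than chance
```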
Performance benchmarks for Arabic AES systems show significant variation based on model architecture and training approach. Recent studies demonstrate that fine-tuned Arabic-specific models such as AraBERT can achieve QWK scores up to 0.88 [
7,
17]. Human inter-annotator agreement for Arabic AES has been observed with a QWK of 0.69 in some studies [
17]. This represents “substantial agreement” according to standard interpretations of the metric. In contrast, prompting-based applications of LLMs such as GPT-4 perform considerably worse (QWK of 0.67–0.75), largely due to tokenization inefficiencies and insufficient adaptation to Arabic linguistic structures [
7].
The evaluation landscape for Arabic AES differs substantially from English language assessment in terms of available resources. While English AES benefits from extensive datasets such as ASAP [
9], which contains about 13,000 essays spanning 8 distinct prompts, Arabic AES development must contend with smaller, more fragmented corpora. This resource disparity significantly impacts both the development and evaluation of Arabic scoring systems, making direct cross-linguistic performance comparisons challenging and potentially misleading without accounting for these fundamental differences in resource availability.
Having examined both the available datasets and methodological approaches for Arabic AES, several critical research gaps become apparent that limit further advancement in this field.
3. Arabic Essay Scoring Datasets
This section describes the datasets used in this research and outlines the process used to expand the available annotated resources for the development of Arabic AES. We begin by explaining dataset selection criteria, then detail each dataset’s characteristics and how we enhanced them through the proposed annotation framework.
3.1. Dataset Selection and Categorization
We performed a comprehensive assessment of available Arabic writing resources, categorizing them into three distinct groups based on their characteristics and suitability for essay scoring:
Long-form essays (QCAW and QAES): Texts exceeding 400 words that demonstrate complex writing skills.
Short answers (AR-AES, AR-ASAG, and ARAScore): Brief responses averaging 20–60 words that lack sufficient complexity for comprehensive trait assessment.
Long-form unannotated essays (ZAEBUC and QALB): Texts without scoring annotations but with potential for extension.
This analysis revealed that QAES represents the only available collection with comprehensive trait-specific scoring. We deliberately excluded short answer datasets despite their larger size (over 5000 combined essays), as their limited length makes them inadequate for evaluating the complex writing skills that characterize full essay compositions.
For expanding the annotated corpus, we identified ZAEBUC and QALB as promising candidates based on three criteria: (1) average length exceeding 100 words, (2) essay-like structure suitable for multi-trait assessment, and (3) representing diverse writing populations to improve model generalizability.
Table 6 summarizes the key characteristics of the datasets used in this research; see
Appendix B for samples from each dataset.
3.2. QCAW/QAES
We selected QAES as the primary development dataset for building and validating the proposed prompting strategies and model architectures. The detailed rubric published in [
14] (see
Appendix B) provided the foundation for the proposed annotation framework, allowing us to maintain consistency with established assessment practices while extending them to new collections.
For annotations, QAES contains rater 1’s score (r1), rater 2’s score (r2), rater 3’s score (r3), and the final score (fn). The final score for each trait is calculated primarily as the rounded integer mean of the assessments provided by the two main annotators (r1 and r2). If there is a significant difference (11 points or more out of a total of 32 points) in the overall holistic score between the two main annotators, a third annotator reviews the essay and provides a score. In such cases, the score determined by this third annotator becomes the final score for each trait, overriding the average of the initial two raters.
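The adjudication rule described above can be summarized in a short sketch; the function below is an illustrative reconstruction under the stated 11-of-32 threshold, and the argument names are ours rather than the dataset's column names.

```python
# Illustrative reconstruction of the QAES final-score rule (not the official code).
def final_trait_score(r1_trait, r2_trait, r3_trait,
                      r1_holistic, r2_holistic, threshold=11):
    """Final score for one trait of one essay."""
    # A holistic disagreement of 11+ points (out of 32) triggers third-rater adjudication.
    if abs(r1_holistic - r2_holistic) >= threshold and r3_trait is not None:
        return r3_trait
    # Otherwise: rounded integer mean of the two main raters.
    # (How .5 ties are rounded is not specified in the source; Python's round()
    # uses banker's rounding, which may differ from the annotators' convention.)
    return round((r1_trait + r2_trait) / 2)

print(final_trait_score(4, 3, None, 26, 24))   # -> 4 (no adjudication needed)
```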
3.3. QALB
A significant challenge we encountered with QALB was the absence of topic annotations, which are essential for evaluating relevance and development aspects of writing. To address this limitation, we employed two topic generation models to generate appropriate topics for each essay. This approach involved the following:
Extracting key themes from each essay using both models;
Formatting the outputs as concise topic statements;
Conducting a manual review of the generated topics by selecting the most accurate topic when models disagreed.
3.4. ZAEBUC
Unlike QALB, ZAEBUC already contains explicit topic annotations, eliminating the need for topic generation. We leveraged these existing topics while applying the validated scoring framework to generate trait-specific annotations for the essays.
The combined annotation of QALB and ZAEBUC allowed us to expand the corpus of trait-annotated Arabic essays from the original 195 in QAES to 1021 essays (195 + 622 + 204), representing a large increase in resource availability for Arabic AES research. This expanded collection now covers different writing topics and includes essays from diverse writer populations, enhancing its utility for developing robust scoring models.
4. Methodology
This section details the systematic approach used to develop a robust Arabic AES framework. As presented in
Figure 1, the overall architecture highlights the system’s input components and the output scoring traits. The overarching workflow progresses from initial data acquisition to the final annotation of new corpora, providing a clear roadmap for the entire process, as illustrated in
Figure 2. Specifically, the methodology encompasses several key stages: (i) surveying and selecting appropriate datasets and LLMs, (ii) iteratively creating and enhancing prompts for AES, (iii) developing ensemble annotation models using LLMs, (iv) selecting the best-performing annotation technique by comparing direct LLM outputs and ensemble models against expert ratings, and (v) applying and evaluating these techniques to new Arabic corpora. The ultimate goal is to expand Arabic AES resources through a reproducible and cost-effective process, with each stage further elaborated upon in the subsequent subsections.
4.1. Survey and Selection of Essay Scoring Datasets and LLMs
This section details the initial steps of collecting and preparing datasets, along with the selection of appropriate LLMs for the annotation framework. As shown in the “Surveying and Selection of Essay Scoring Datasets and LLMs” phase of
Figure 2, these foundational activities are crucial for setting up the subsequent scoring process. This phase encompasses the selection of LLMs and preparation of the datasets.
When selecting LLMs for the proposed annotation framework, we prioritized both performance and efficiency. After initial experimentation with multiple options, we focused on two complementary models: Gemini 1.5 Flash and Deepseek v3. We excluded Gemini 2 from the final implementation as its results were nearly identical to Gemini 1.5, offering no significant performance advantage. Similarly, while GPT-4o produced results comparable to Gemini, its higher operational costs made it less suitable for this large-scale annotation task. This pragmatic selection allowed us to optimize resource efficiency while maintaining high-quality annotations.
Addressing the absence of topic annotations in the QALB dataset was crucial for the proposed AES framework, particularly for evaluating essay relevance and development. To overcome this limitation, we implemented a robust topic extraction and validation process utilizing two advanced language models: BGE-M3 [
32] and Gemini 1.5. This methodology involved several key steps.
Automated Topic Extraction: We employed BGE-M3 and Gemini 1.5 to automatically extract key themes from each essay in the QALB dataset. These models were chosen for their ability to process and understand Arabic text effectively.
Topic Refinement: The outputs from both models were formatted into concise topic statements. This step involved consolidating the models’ outputs to derive a single, coherent topic statement for each essay.
Manual Review and Validation: To ensure the accuracy and reliability of the generated topics, we conducted a thorough manual review process. During this phase, discrepancies between the models’ outputs were resolved by selecting the most accurate topic statements, thereby enhancing the quality of ZaQQ, the expanded corpus produced in this work.
This methodology not only addressed the dataset’s limitations but also laid a strong foundation for subsequent trait scoring by ensuring that each essay was appropriately contextualized within its thematic framework.
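A minimal sketch of the consolidation step is given below; it assumes the two candidate topics have already been produced (one per model) and models the manual review as a prompt to a human reviewer, so the inputs and reviewer interface are illustrative assumptions.

```python
# Sketch of topic consolidation with manual adjudication on disagreement.
def consolidate_topic(topic_bge_m3: str, topic_gemini: str, ask_reviewer=input) -> str:
    """Return a single topic statement for one essay."""
    if topic_bge_m3.strip().lower() == topic_gemini.strip().lower():
        return topic_bge_m3.strip()      # both models agree: accept automatically
    # Models disagree: a human reviewer keeps the more accurate statement.
    choice = ask_reviewer(
        f"Pick the better topic:\n(a) {topic_bge_m3}\n(b) {topic_gemini}\n> "
    )
    return topic_bge_m3 if choice.strip().lower() == "a" else topic_gemini

# Example call with hypothetical pre-extracted topics:
# consolidate_topic("The role of technology in education",
#                   "How technology is changing classroom learning")
```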
4.2. Iterative Prompt Creation and Enhancement Using QCAW/QAES
This subsection details the “Iterative prompt creation and enhancement using QCAW/QAES” phase highlighted in
Figure 2, which involves the iterative development of prompts and their application to generate automated scores. This phase is comprehensively detailed in
Figure 3.
This prompt engineering approach for the QCAW dataset involved systematic experimentation across multiple dimensions to identify the best-performing configurations for each writing trait. Rather than applying a one-size-fits-all strategy, we recognized that different writing traits might require distinct prompting approaches to achieve the best results. The iterative prompt development process is a core component of this phase, as illustrated in
Figure 3. This iterative refinement focused on specific essay traits: Relevance, Organization, Vocabulary, Style, Development, Mechanics, and Structure.
We first explored the impact of prompt language, comparing full Arabic, full English, and mixed-language instructions. The resulting insight, that mixed-language prompts (English instructions with Arabic content) performed best (see Section 5.1), shaped all subsequent prompt development, including rigorous bilingual testing in both Arabic and English. The code snippets illustrating the initial prompt construction for both Arabic and English instructions are provided in
Appendix C.
Next, we examined the difference between zero-shot prompting (providing only the scoring rubric) and few-shot approaches (including example essays with human-assigned scores). We further refined the proposed approach by implementing a revision mechanism that allowed LLMs to reconsider their first assessments. This two-step process, analogous to a human reviewer reconsidering their first judgment, provided measurable improvements for certain traits. Few-shot optimization was a key activity in this phase, along with configuration validation on QAES.
To systematically optimize the prompt structure and the number of few-shot examples, we utilized DSPy [
33], which helped identify the configurations that yielded the best results across different models and traits. This programmatic approach to prompt engineering enabled us to iterate efficiently while maintaining methodological rigor. An example of the revision part of the system message and input/output fields used in the prompt engineering framework DSPy is provided in
Appendix C.2. After prompt optimization, automated scores are generated by executing the prompts on multiple LLMs (Deepseekv3, Gemini 1.5) and applying prompt variants (first and revised assessments).
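A condensed sketch of how such a DSPy setup might look is shown below, assuming DSPy ≥ 2.5 with a Gemini endpoint reachable through LiteLLM; the signature fields, model string, and exact-match metric are illustrative assumptions rather than the exact configuration used in this work.

```python
# Minimal DSPy sketch for one trait (Vocabulary); not the paper's actual program.
import dspy
from dspy.teleprompt import BootstrapFewShot

dspy.configure(lm=dspy.LM("gemini/gemini-1.5-flash"))   # assumed model identifier

class ScoreVocabulary(dspy.Signature):
    """Score the Vocabulary trait of an Arabic essay on a 1-5 rubric scale."""
    essay = dspy.InputField(desc="full Arabic essay text")
    topic = dspy.InputField(desc="essay prompt / topic statement")
    score = dspy.OutputField(desc="integer from 1 (weak) to 5 (excellent)")

scorer = dspy.ChainOfThought(ScoreVocabulary)

# A few human-scored essays serve as the candidate pool for few-shot demonstrations.
trainset = [
    dspy.Example(essay="...", topic="...", score="4").with_inputs("essay", "topic"),
    # ... more QAES-style examples ...
]

def exact_match(example, pred, trace=None):
    return int(example.score) == int(pred.score)

optimizer = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=7)
compiled_scorer = optimizer.compile(scorer, trainset=trainset)
```

The cap of seven demonstrations mirrors the Vocabulary example count reported later in Table 8; a revised-assessment variant could be obtained by feeding the first score back through a second signature, in line with the revision mechanism described above.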
4.3. Creation of Ensemble Annotation Models (Using LLMs)
This section focuses on the “Creation of ensemble annotation models (using LLMs)” phase from
Figure 2. Building on the insights from prompt engineering and LLM evaluations, we developed a comprehensive ensemble modeling approach that combined multiple scoring results to enhance overall performance. This approach recognized that different models showed strengths in assessing different aspects of writing quality.
Figure 4 illustrates the entire workflow. For each essay and each trait, we generated four distinct LLM-based scores: first and revised assessments from both Gemini and Deepseek. From these primary scores, we derived a rich set of statistical features, including basic measures (mean, median, range) and distributional characteristics (the frequency of each score point).
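The feature construction can be illustrated with a small sketch; the feature names below are our own shorthand, assuming the four scores per essay and trait described above.

```python
# Sketch of ensemble features derived from four LLM scores for one essay/trait
# (Gemini first/revised, Deepseek first/revised).
import numpy as np

def ensemble_features(scores, score_range=(1, 5)):
    """scores: list of four LLM scores, e.g. [4, 4, 3, 4]."""
    s = np.asarray(scores)
    features = {
        "mean": s.mean(),
        "median": float(np.median(s)),
        "range": int(s.max() - s.min()),
    }
    # Distributional features: frequency of every possible score point.
    for point in range(score_range[0], score_range[1] + 1):
        features[f"freq_{point}"] = int((s == point).sum())
    return features

print(ensemble_features([4, 4, 3, 4]))
```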
Selection of Classifiers
This model selection pipeline tested eight distinct classifier architectures: RandomForest [
34], GradientBoosting [
35], ExtraTrees [
36], CatBoost [
37], LogisticRegression [
38], SVC [
39], GaussianNB [
40], and MLPClassifier [
41]. This diverse selection was deliberate, aiming to leverage the unique strengths of various algorithmic paradigms to robustly combine the LLM-generated scores. The process involved comprehensive classification model training using these specified classifiers.
Ensemble Methods (RandomForest, GradientBoosting, ExtraTrees, CatBoost): These algorithms were chosen for their inherent high accuracy and robustness against overfitting. By combining predictions from multiple decision trees, they excel at integrating diverse information, such as the varied scores from different LLMs and prompts. This capability allows them to capture complex, non-linear relationships within the data. CatBoost, in particular, offers strong general performance and handles various feature types effectively.
Linear Models (LogisticRegression, SVC): These models were included for their interpretability and as strong performance baselines. If a relatively straightforward linear relationship exists between the input LLM scores and the true essay score, these models can effectively identify it. They are also computationally efficient, making them practical for various applications. SVC is particularly effective when clear margins of separation between score categories might exist.
Probabilistic Model (GaussianNB): This classifier was chosen for its simplicity and computational speed. While it operates under the assumption of feature independence, it can still deliver surprisingly strong performance, especially when a quick probabilistic baseline is needed. It also directly provides class probabilities, which can be valuable for understanding prediction confidence.
Neural Network (MLPClassifier): The MLPClassifier was included to capture complex, non-linear relationships that might be present in the data. Given the nuanced nature of essay scoring and natural language processing, a multi-layer perceptron can learn intricate patterns and interactions between the four input scores that might be missed by simpler models, thus offering powerful predictive capabilities.
For each classifier type, we performed systematic hyperparameter tuning using GridSearchCV [
42] with stratified cross-validation to ensure robust performance across different data splits. Following this, a comprehensive evaluation was performed on QAES, and the best-performing model was selected to be the main model used for other datasets.
4.4. Selection of Best-Performing Annotation Technique (LLMs vs. Ensemble LLMs) Using QCAW/QAES
This section details the “Selection of best-performing Annotation technique (LLMs vs. Ensemble LLMs) using QCAW/QAES” phase, which involves evaluating and selecting the best-performing approach for annotation. This process encompasses applying prompt variants and training multiple classification models using LLM outputs, followed by a comprehensive evaluation to select the best-performing model for each writing trait. This corresponds to “Phase 3: Model Evaluation and Selection” in
Figure 5.
We conducted a rigorous comparative analysis between LLM-generated annotations for QCAW and the existing human annotations in the QAES dataset. Quadratic Weighted Kappa (QWK) served as the primary evaluation metric, measuring agreement between automated and human scores across all seven writing traits.
For each trait, we compared LLM-generated scores against human benchmarks from QAES, i.e., the final consolidated scores (which represented the adjudicated consensus) and the closest individual rater scores. This dual comparison helped us to understand how well the proposed automated approaches aligned with both consensus judgment and individual human assessment patterns.
The evaluation process involved applying prompt variants and training multiple classification models using the LLM outputs. All models (direct LLMs and the various classification models) were then comprehensively evaluated. The best-performing model for each trait was selected based on these evaluations.
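The dual comparison can be expressed compactly; in the sketch below, the score arrays are synthetic and the "closest rater" is simply the human rating nearest to each generated score.

```python
# Agreement vs. consolidated final scores and vs. the closest individual rater.
import numpy as np
from sklearn.metrics import cohen_kappa_score

model = np.array([4, 3, 5, 2])                      # hypothetical model scores
final = np.array([4, 4, 4, 3])                      # adjudicated final scores
raters = np.array([[4, 5], [3, 4], [4, 4], [3, 3]])  # r1 and r2 per essay

# For each essay, keep the human rating closest to the model's score.
closest = raters[np.arange(len(model)),
                 np.abs(raters - model[:, None]).argmin(axis=1)]

print(cohen_kappa_score(final, model, weights="quadratic"))    # vs. final scores
print(cohen_kappa_score(closest, model, weights="quadratic"))  # vs. closest rater
```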
4.5. Application of the Best Automatic Annotation Techniques on QALB and ZAEBUC
After validating the proposed approach on the QAES dataset, we applied it to automatically annotate both the QALB and ZAEBUC corpora. For each essay in these collections, we generated a comprehensive assessment profile consisting of five scores for each trait:
First assessment from Gemini;
Revised assessment from Gemini;
First assessment from Deepseek;
Revised assessment from Deepseek;
Ensemble classifier prediction based on the four LLM scores.
This multi-model approach provided a nuanced view of each essay’s quality across the seven writing dimensions, capturing different aspects of writing proficiency through complementary assessment perspectives.
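The resulting per-essay profile can be pictured as a small nested structure; the trait and key names below are our own, and the `ensemble_predict` callable stands in for the trained classifier from the previous stage.

```python
# Illustrative per-essay assessment profile (not the released annotation schema).
TRAITS = ["Relevance", "Organization", "Vocabulary", "Style",
          "Development", "Mechanics", "Structure"]

def assessment_profile(llm_scores, ensemble_predict):
    """llm_scores: {trait: (gemini_first, gemini_revised, deepseek_first, deepseek_revised)}."""
    profile = {}
    for trait in TRAITS:
        g1, g2, d1, d2 = llm_scores[trait]
        profile[trait] = {
            "gemini_first": g1, "gemini_revised": g2,
            "deepseek_first": d1, "deepseek_revised": d2,
            "ensemble": ensemble_predict(trait, [g1, g2, d1, d2]),
        }
    return profile

# Example with a placeholder ensemble (rounded mean of the four LLM scores):
demo = assessment_profile({t: (4, 4, 3, 4) for t in TRAITS},
                          lambda trait, s: round(sum(s) / len(s)))
print(demo["Vocabulary"])
```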
4.6. Evaluation of QALB and ZAEBUC Automatic Annotation
To ensure the quality and validity of the automated annotations, we implemented a systematic manual verification process across both expanded corpora.
Table 7 shows the number of manually annotated samples for each trait across both datasets.
The variation in sample sizes across traits reflects two key factors in the verification strategy. First, the proposed stratified sampling approach aimed to obtain adequate representation across all score points (1–5 for most traits, 1–2 for Relevance). Traits with more balanced score distributions required fewer verification samples, while traits with skewed distributions needed more extensive verification to ensure adequate coverage of rare score categories.
Second, the number of samples needed was influenced by the inherent complexity of each trait, as evidenced by the varying numbers of examples required for best-performing few-shot learning performance (
Table 8). For instance, Style required only 4 examples in prompts but needed 161 verification samples in ZAEBUC due to its subjective nature and score distribution, while Relevance required 8 examples for few-shot learning but only 32 verification samples due to its binary scoring nature and more straightforward assessment criteria.
5. Experiments and Results
This section details the experimental procedures and presents the findings from the comprehensive evaluation of our AES framework. Our experimental design is structured to directly address the key research questions presented in
Section 1, as follows:
RQ1: Methodology for LLM-based Scoring: We detail the results of our prompt engineering efforts, the validation of topic extraction, and our quality assurance processes, all of which were designed to refine our systematic LLM-based scoring methodology.
RQ2: Assessing LLM Effectiveness: We present a detailed analysis of model performance across various writing traits and a comparison of individual LLM and ensemble approaches to provide empirical insights into the effectiveness of different models for multidimensional assessment.
RQ3: Dataset Expansion: We outline how the application of our framework allowed us to substantially expand the pool of available Arabic AES datasets with multidimensional annotations.
RQ4: Framework Generalizability: We present the results from external validation on the QALB and ZAEBUC datasets to demonstrate the replicability and generalizability of our framework to new datasets and different annotation challenges.
We now outline the results of our prompt engineering efforts, the validation of topic extraction, strategies for addressing class imbalance and generalization, and the robust quality assurance through manual verification. Furthermore, we provide a detailed analysis of model performance across various writing traits, compare individual LLM and ensemble approaches, and assess scoring precision using Mean Absolute Error. Finally, we present the results from external validation on the QALB and ZAEBUC datasets, demonstrating the framework’s generalizability.
5.1. Prompt Engineering Results
Our experiments in prompt engineering consistently demonstrated that a mixed-language prompting strategy (English instructions combined with Arabic content) outperformed prompts given solely in Arabic or English, supporting previous findings about the importance of linguistic alignment and bilingual strategies in LLM task performance [
7].
Few-shot prompting consistently yielded superior results, but the best-performing number of examples varied significantly by trait. As shown in
Table 8, Style required only four examples to reach peak performance, while Vocabulary and Relevance needed seven and eight examples, respectively. This variation likely reflects the different complexities involved in assessing each writing dimension.
The implementation of a revision mechanism, allowing LLMs to reconsider their first assessments, provided measurable improvements for certain traits, particularly Vocabulary, where revision improved Gemini’s performance by 70%.
5.2. Topic Extraction Validation Results
Manual validation of the generated topics revealed a substantial performance gap between the two models: Gemini correctly identified topics for 75.4% of essays, while BGE-M3 achieved only 39% accuracy. This verification step ensured that all essays had accurately labeled topics before proceeding to the trait-scoring phase. Furthermore, this approach expanded the pool of trait-scored Arabic essays by 622 additional samples, thereby advancing the capabilities of Arabic AES systems.
5.3. Addressing Challenges: Class Imbalance and Generalization
Two particular challenges required special attention: class imbalance and generalization. To address score distribution skew, we implemented custom weighting for rare score classes (particularly scores of 1 and 5) and applied targeted data augmentation for minority classes. To prevent overfitting, we carefully monitored the gap between training and testing performance, prioritizing models that maintained consistent performance across both contexts. The final model selection for each trait balanced three key factors: cross-validation performance, test set performance, and generalization ability.
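A minimal sketch of the class-weighting step is shown below; the synthetic label vector and the use of scikit-learn's balanced weighting are assumptions standing in for the custom weights actually tuned in this work.

```python
# Weighting rare score classes (e.g., 1 and 5) for a skewed trait distribution.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([3, 3, 4, 2, 3, 4, 4, 3, 5, 1, 3, 4])   # synthetic, skewed scores
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))   # rare scores receive larger weights
print(class_weight)

# The mapping can be passed to classifiers that support it, e.g.
# RandomForestClassifier(class_weight=class_weight); minority-class essays can
# additionally be duplicated (simple oversampling) before training.
```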
5.4. Automatic Annotation Models
We evaluated the proposed annotation framework using Quadratic Weighted Kappa (QWK) to assess agreement levels and Mean Absolute Error (MAE) on a 10% test set from QAES to measure scoring precision, following established practices in AES evaluation [
31]. We compared generated results against both the final consolidated scores and individual rater assessments from the QAES dataset.
Ensemble Performance Summary: The proposed systematic model development process identified different best-performing classifiers for each writing trait.
Table 9 presents the results of the best-performing models for each trait based on the test set QWK.
Key Finding: Relevance exhibited the strongest test set performance (
QWK = 0.587629), followed by Mechanics (
QWK = 0.528986). Style showed the lowest agreement (QWK = 0.258065). Based on the Kappa interpretation in
Table 5, most traits fall within the “Moderate agreement” range (0.41–0.60), while Style shows “Fair agreement” (0.21–0.40).
Different model architectures emerged as the best-performing for different traits: GaussianNB for Style, Organization, and Relevance; SVC for Development and Vocabulary; RandomForest for Mechanics; and MLP for Structure. The variation in best model architectures across traits, combined with the inherent complexity differences between writing dimensions and the varying capabilities of different LLMs, contributes to the observed performance differences across evaluation metrics. As we will see in the following sections, Gemini excels at certain traits like Vocabulary and Development, while Deepseek demonstrates superior performance for others such as Mechanics, highlighting that no single approach dominates across all assessment dimensions.
The QAES test set consisted of approximately 19–20 essays (10% of 195 total), which may contribute to the observed variability in best-performing classifiers.
5.5. Comparative Analysis of LLMs and Ensemble LLMs
We compared the performance of individual LLMs (both before and after revision) and the proposed ensemble classifier against the final consolidated scores in the QAES dataset. The ensemble classifier takes score results from individual LLMs as input and produces a single final score.
Table 10 presents the QWK values for each approach.
Key Findings:
Ensemble vs. Individual LLMs: For Style and Structure, the proposed ensemble classifier achieved the highest agreement with human scores. Individual LLM approaches generally outperformed the ensemble for other traits.
Gemini’s Strengths: Gemini outperformed in many traits. Gemini with revision excelled at Development (QWK = 0.494048), Vocabulary (QWK = 0.545455), and Relevance (QWK = 0.700000), while unrevised Gemini performed best for Organization (QWK = 0.480198).
Deepseek’s Strengths: Both Deepseek variants achieved the highest performance for Mechanics (QWK = 0.533898).
This pattern reinforces that different LLMs possess complementary strengths; Gemini appears more adept at capturing semantic and developmental aspects of writing, while Deepseek shows particular strength in mechanical accuracy assessment.
Impact of Revision: The revision process showed mixed effects. For Gemini, revision improved performance on Development, Vocabulary, and Relevance, but slightly reduced agreement for Organization. Deepseek showed less consistent benefits from revision.
Comparison to Individual Raters: Table 11 presents performance against the individual QAES rater score closest to each generated score. Agreement levels were generally higher when evaluated against individual rater judgments than against the consolidated final scores. For example,
Gemini with revision achieved particularly strong results for Vocabulary (QWK = 0.810127) and Relevance (QWK = 0.700000). This pattern suggests that LLMs tend to align more closely with individual human rater patterns than with consensus judgments.
Summary of Best-Performing Models: To provide a comprehensive overview of the best-performing approach for each trait,
Table 12 summarizes the best-performing model choice across all evaluated methods. This table highlights that a single model or approach is not universally optimal for all traits.
5.6. Mean Absolute Error Analysis
Key Findings from the MAE Analysis: When evaluating essay scoring, LLMs generally outperformed the ensemble classifier in Mean Absolute Error (MAE) for most traits.
Table 13 shows these results.
Comparison to Individual Raters (MAE): For comparison,
Table 14 shows the MAE values against the closest individual rating scores. Gemini with revision achieved the lowest MAE for Vocabulary (
0.30), highlighting its precision on this trait. Relevance consistently had the lowest error rates across all models and comparisons.
Overall, Gemini showed strong performance on Style, Development, and Vocabulary, while Deepseek excelled in Mechanics; the ensemble model surpassed the individual LLMs only for Structure.
5.7. Manual Verification for Automated Annotations of QALB and ZAEBUC
To validate the proposed approach beyond the original development dataset, we compared the automatically generated scores against manually verified annotations for subsets of both ZAEBUC and QALB.
Table 15 presents the QWK values for each trait across these external validation samples.
Key Findings: External validation of the automated annotation framework showed promising agreement with human assessments.
ZAEBUC Dataset: Performance was consistently high across all traits (QWK > 0.65), with Organization achieving a remarkable QWK of 0.919751.
QALB Dataset: Validation also demonstrated strong performance for Style (QWK = 0.858792) and Development (QWK = 0.903897), but agreement for Organization (QWK = 0.406863) and Structure (QWK = 0.456587) was more moderate.
These results, which varied with sample size and dataset characteristics such as topic consistency, suggest the framework generalizes well to diverse Arabic writing, especially for traits like Development (QWK > 0.84 on both datasets), where automated scoring can be highly reliable.
6. Discussion
This research provides a comprehensive analysis of the proposed framework, directly addressing the key research questions outlined in the Introduction. Our findings offer important insights into the effectiveness of using LLMs for Arabic AES and the potential of our methodology to expand existing resources.
6.1. Answering the Research Questions
RQ1: Optimizing the LLM-based Scoring Methodology. Our experimental results demonstrate that the proposed systematic framework successfully refines LLM-based scoring. The prompt engineering experiments revealed that a few-shot, mixed-language approach, particularly with trait-specific example counts, is crucial for optimizing LLM performance. The implementation of a revision mechanism, while not universally beneficial, significantly improved Gemini’s scoring for complex traits like Vocabulary. This confirms that our methodology provides a robust and optimizable process for leveraging LLMs for automated scoring.
RQ2: Assessing LLM Effectiveness Across Writing Dimensions. The comparative analysis of LLMs and the ensemble classifier in
Section 5.5 clearly shows that different models possess distinct strengths. Gemini-Revision excelled in assessing subjective, nuanced traits such as Vocabulary (QWK =
0.810) and Relevance (QWK =
0.700) when compared against the closest human ratings. In contrast, Deepseek performed best for Mechanics (
QWK = 0.571), a more rule-based trait. The ensemble classifier, while not outperforming the best single LLM for all traits, achieved top performance for Mechanics (QWK =
0.590) and was crucial for challenging traits like Style and Structure. This confirms that LLMs have varying capabilities across different writing dimensions and that a single model is not optimal for all.
RQ3: Expanding Arabic AES Datasets. A core objective of our research was to overcome the scarcity of large, multidimensional Arabic essay datasets. Our framework expanded the QAES seed of 195 essays with 826 additional multidimensionally annotated samples (622 from QALB and 204 from ZAEBUC). The high agreement levels observed during our manual verification on subsets of the ZAEBUC and QALB datasets (e.g., Development QWK > 0.84 on both datasets) suggest that this methodology can be reliably used to generate new, high-quality, trait-specific data at a large scale. This directly addresses the circular problem of needing large datasets to build systems that can generate large datasets, providing a meaningful contribution to the Arabic NLP research infrastructure.
RQ4: Demonstrating Framework Generalizability. The external validation on the ZAEBUC and QALB datasets, as detailed in
Section 5.7, confirmed the generalizability of our framework. Despite the differences in dataset characteristics, the automated scores showed high agreement with human ratings for several traits. While performance varied between datasets, particularly for Organization and Structure, the results show that our framework can be adapted to new domains and content types, serving as a replicable model for other low-resource languages.
6.2. LLM Performance Analysis
The LLMs demonstrated distinct strengths across different writing traits, with Gemini-Revision showing superior performance in assessing Vocabulary (QWK = 0.810), Development (QWK = 0.596), and Relevance (QWK = 0.700) when compared against the closest human ratings for QAES. Unrevised Gemini excelled in Organization assessment (QWK = 0.550), while both Deepseek models performed best for Mechanics evaluation (QWK = 0.571). When examining performance against QAES final consolidated scores, the agreement levels were notably lower, suggesting that LLMs align better with individual human raters than with adjudicated scores that resolve disagreements between multiple assessors.
The proposed ensemble classifier model approach achieved the highest performance for Mechanics (QWK = 0.590) and showed strong results for Style (QWK = 0.443) and Development (QWK = 0.571) when compared to closest human ratings. However, this approach did not universally outperform the best single LLM for traits where Gemini excelled, particularly for Vocabulary, where Gemini-Revision achieved an impressive QWK of 0.810, which is substantially higher than the classifier’s 0.663. A visual comparison of the QWK scores in
Figure 6 clearly illustrates these performance variations, with the Gemini-Revision model demonstrating superior agreement on traits like Vocabulary and the ensemble classifier excelling in others such as Mechanics. This indicates that while the ensemble classifier can effectively combine signals for some traits, for others, the inherent linguistic understanding and contextual reasoning of a single, well-tuned LLM can surpass the performance of a combined traditional model. This is especially true for traits that require more nuanced semantic or stylistic judgment, which LLMs are better equipped to handle due to their vast pre-training.
6.3. Prompt Engineering Insights
The prompt engineering experiments yielded several valuable findings.
Language Sensitivity: Mixed-language prompts (English instructions with Arabic content) consistently improved LLM comprehension and performance for Arabic language content, aligning with findings by [
7].
Few-shot vs. zero-shot: Few-shot prompting significantly improved performance across all traits compared to zero-shot approaches, and the best number of examples varied by trait (from four for Style to eight for Relevance). Studies also indicated that the quantity of few-shot examples was often constrained by the model’s context window limitations, with typical usage ranging from one to three examples per class [
7].
Revision prompts: The effectiveness of revision prompts varied by trait and model. While revision improved Gemini’s performance on Vocabulary from a QWK of 0.623 to 0.810 (a 30% increase), it had minimal impact on Deepseek’s Mechanics assessment, suggesting that the value of iterative refinement is not uniform.
6.4. Manual Verification Results
Manual verification of automatically generated scores on the ZAEBUC and QALB datasets showed encouraging agreement with human reviewers. Traits like Development (QWK = 0.844 for ZAEBUC, 0.904 for QALB) and Style (QWK = 0.809 for ZAEBUC, 0.859 for QALB) frequently reached “Almost perfect agreement” levels (0.81–1.00) as per
Table 5. However, performance varied notably between datasets, with ZAEBUC showing higher agreement (e.g., Organization QWK = 0.920 vs. 0.407 for QALB), suggesting that domain and content differences impact scoring consistency. While these external validations are positive, the results should be interpreted with caution due to small, selectively sampled subsets. This implies that actual agreement levels across complete datasets are likely more moderate, consistent with findings from comprehensive QAES evaluations (as shown in
Table 9 and
Table 10).
6.5. Limitations and Challenges
Despite achieving substantial agreement levels, the proposed approach and the resulting ZaQQ dataset face several limitations:
Seed Dataset Constraints: The small size of the original QAES dataset (195 essays) limited the ability to optimize models more precisely and may have contributed to overfitting in some cases. This also means that the test set, comprising approximately 19–20 essays, is very small, which can lead to high variance in performance metrics and limit the generalizability of the results.
Topic and Length Limitations: Although the ZaQQ dataset is a valuable expansion, its utility is constrained by the characteristics of its source corpora. The topic diversity is narrow, originating from datasets with very few prompts (two in QAES and three in ZAEBUC). Furthermore, the average essay length of the added datasets is considerably shorter than that of the seed dataset and English benchmarks, which may limit the framework’s performance on more complex, long-form academic writing.
Trait Complexity Variations: Certain traits, such as Style, proved more challenging for automated assessment due to their subjective nature and complex criteria. The model’s Style assessment, for example, showed low agreement with the final human scores (QWK = 0.258), because style is inherently subjective and shaped by cultural context, making it difficult to standardize. Structure (QWK = 0.364) likewise requires a holistic view of the entire text; LLMs may fail to grasp long-range connections between ideas or the logical progression of an argument, particularly in lengthy essays.
To overcome these limitations, future work could focus on three key areas: designing more specific rubrics with concrete examples, expanding the variety of training examples, or building specialized models focused solely on evaluating these complex dimensions.
Arabic-Specific Challenges: LLMs still struggle with certain aspects of Arabic morphological complexity and dialectal variations, which is particularly evident in the mixed performance across different assessment dimensions.
Nevertheless, the proposed approach successfully demonstrates how existing limited resources can be leveraged to address the circular problem in Arabic AES: needing large datasets to build systems that can generate large datasets. By expanding the available scored essay corpus, we provide a meaningful contribution to the Arabic NLP research infrastructure.
The QWK scores achieved across traits indicate moderate to substantial agreement levels that, while not matching typical human–human agreement (often in the 0.70–0.85 range, with specific Arabic studies reporting around 0.690 [
17]), provide a solid foundation for further refinement. These results are particularly encouraging given the challenging nature of Arabic language processing and the limited training data available.
7. Conclusions
This research introduces an LLM-based framework to address data scarcity in Arabic AES by expanding annotated datasets. The framework utilizes carefully designed prompts to enable LLMs to generate trait-specific scores that show meaningful agreement with human ratings, proving effective even for longer, more complex essays compared to shorter texts used in prior Arabic AES research.
A significant achievement of this work is the expansion of annotated Arabic essay datasets from 195 to over 1000 essays, with comprehensive trait assessments across seven writing quality dimensions, marking a substantial advancement for Arabic NLP in educational applications. While these results are encouraging, the preliminary nature of these findings should be acknowledged, as the initial QAES dataset is small and the external validation was conducted on limited, manually verified subsets. Thus, the generalizability of these high-agreement scores to entire, unannotated corpora may be more moderate.
Rigorous experimentation demonstrated varying performance strengths of different LLMs across writing traits, with some configurations achieving high agreement with human raters, such as a Quadratic Weighted Kappa (QWK) of 0.810 for Vocabulary. These findings suggest that trait-specific LLM selection can improve overall assessment performance.
In addition to expanding resources, this work provides a systematic and reproducible methodology for optimizing LLM-based scoring, which is adaptable to other low-resource languages facing annotation scarcity. For instance, the prompt engineering strategies that accounted for Arabic’s morphological richness and diglossia (the co-existence of formal and colloquial language varieties) could be a model for languages like Urdu or Persian, which share similar linguistic challenges. The approach of using LLMs for data augmentation is particularly relevant for languages where manual annotation is costly and time-consuming. This framework offers a blueprint for how a small seed dataset can be bootstrapped to create a valuable resource, thus lowering the barrier for research in other languages with limited educational data.
Future research could focus on refining prompt engineering to improve agreement for traits like Style (QWK = 0.443 against the closest human rating), exploring cross-lingual transfer learning from English AES resources, and validating the practical utility of automatically scored datasets for training end-to-end AES systems. Developing specialized Arabic language models may also address challenges in grammatical assessment.
Author Contributions
Conceptualization, Y.E., E.N., M.T., S.F. and A.K.; methodology, Y.E., E.N., M.T. and A.K.; software, Y.E.; validation, E.N., M.T., S.F. and A.K.; formal analysis, Y.E., E.N., M.T. and A.K.; investigation, Y.E., E.N., M.T. and A.K.; data curation, Y.E., E.N., M.T. and A.K.; writing—original draft preparation, Y.E.; writing—review and editing, E.N., M.T., A.K. and S.F.; visualization, Y.E., E.N., M.T. and A.K.; supervision, E.N., M.T., S.F. and A.K.; funding acquisition, E.N. All authors have read and agreed to the published version of the manuscript.
Funding
This work was funded by the Deanship of Scientific Research, Islamic University of Madinah, Madinah, Saudi Arabia.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
ZaQQ | A newly annotated dataset that compiles three source datasets: two unannotated (ZAEBUC and QALB) and one annotated (QAES). |
LLM | Large Language Model |
QWK | Quadratic Weighted Kappa |
NLP | Natural Language Processing |
AES | Automated Essay Scoring |
MAE | Mean Absolute Error |
CV | Cross-Validation |
SVC | Support Vector Classification |
MLP | Multi-Layer Perceptron |
CEFR | Common European Framework of Reference |
MSA | Modern Standard Arabic |
DSPy | Programming framework for LLM prompt optimization |
STEM | Science, Technology, Engineering, and Mathematics |
VSO | Verb–subject–object word order |
Appendix A. Dataset Sample
Figure A1 shows a representative essay from the QCAW dataset that demonstrates the typical structure and length of essays in this collection, and
Table A1 shows its scores in QAES.
Figure A1. Sample essay from the QCAW dataset.
Table A1. Sample scores from QAES.

Topic Round: Communication

| | Organization | Vocabulary | Style | Development | Mechanics | Structure | Relevance |
|---|---|---|---|---|---|---|---|
| Rater 1 | 4 | 4 | 4 | 4 | 3 | 4 | 2 |
| Rater 2 | 4 | 4 | 4 | 4 | 3 | 4 | 2 |
| Rater 3 | – | – | – | – | – | – | – |
| Final | 4 | 4 | 4 | 4 | 3 | 4 | 2 |
Figure A2 presents a typical essay from the QALB corpus, illustrating the shorter length compared to QCAW essays while maintaining sufficient complexity for trait-based assessment.
Figure A2. Sample essay from the QALB dataset.
Figure A3 illustrates a representative essay from the ZAEBUC collection with “Tolerance” as the topic, showing the intermediate length and structure of this dataset.
Figure A3. Sample essay from the ZAEBUC dataset.
Appendix B. Arabic Scoring Rubric
Figure A4 shows the detailed Arabic scoring rubric used to assess essays across the seven writing dimensions. English rubric details are presented in
Figure A5.
Figure A4. Arabic scoring rubric for the seven writing traits evaluated in this study.
Figure A5. English scoring rubric for the seven writing traits evaluated in this study.
Appendix C. Prompt Engineering
Appendix C.1. Prompt Construction Code Snippets
Figure A6 and Figure A7 illustrate the Python 3.10 code snippets used to construct the initial and revision prompts, demonstrating the variations for both Arabic and English instructions.
Figure A6. Code snippet for initial prompt creation, showing Arabic and English instruction variations.
Figure A7. Code snippet for revision prompt creation, showing Arabic and English instruction variations.
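The exact prompt-construction code is given in Figures A6 and A7; purely as an illustration of the Arabic/English instruction switch they describe, a simplified sketch might look like the following (function and variable names are hypothetical and not taken from the released code).

```python
# Hypothetical sketch of assembling the initial scoring instruction in either
# Arabic or English, mirroring the variation shown in Figures A6 and A7.
def build_initial_instruction(trait: str, rubric: str, language: str = "ar") -> str:
    if language == "ar":
        header = f"قيّم المقال التالي من حيث {trait} وفق مقياس من 1 إلى 5."
    else:
        header = f"Rate the following essay on {trait} using a scale from 1 to 5."
    # The trait-specific rubric text is appended after the instruction header.
    return f"{header}\n\nRubric:\n{rubric}"

# Example usage with the English instruction variant.
print(build_initial_instruction("Vocabulary", "5 = precise, varied word choice ...", language="en"))
```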
Appendix C.2. Prompt Example
Figure A8 illustrates an example of the system message used in the revision step and of the input/output fields defined in the prompt engineering framework DSPy. This structured approach ensures consistent interaction with the LLMs for essay scoring.
Figure A8. Example of a system message and input/output fields in the prompt engineering framework DSPy.
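The exact system message and fields are shown in Figure A8. As a rough sketch of how such input/output fields can be declared in DSPy (field names and descriptions here are illustrative assumptions, not the study’s exact signature):

```python
import dspy

class ScoreTrait(dspy.Signature):
    """Assess one writing trait of an Arabic essay and return an integer score from 1 to 5."""

    essay: str = dspy.InputField(desc="Full Arabic essay text")
    rubric: str = dspy.InputField(desc="Trait-specific scoring rubric")
    score: int = dspy.OutputField(desc="Integer trait score between 1 and 5")

# A Predict module turns the signature into a concrete prompt for whichever
# LLM has been configured via dspy.configure(lm=...).
score_trait = dspy.Predict(ScoreTrait)
```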
Appendix D. Implementation and Reproducibility Details
Appendix D.1. Data Augmentation Strategy
To address the issue of class imbalance, particularly for rare scores (e.g., 1 and 5), we implemented a manual over-sampling strategy. For any score class in the training set with fewer than 10 samples, we randomly duplicated existing samples from that class until it reached a minimum of 10 samples. This was achieved by using numpy.random.choice with a fixed random seed for consistency, ensuring that our cross-validation splits remained robust.
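A minimal sketch of this duplication step is shown below; the threshold and seed follow the description above, while the DataFrame and column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def oversample_rare_scores(train_df: pd.DataFrame, label_col: str = "score",
                           min_count: int = 10, seed: int = 42) -> pd.DataFrame:
    """Duplicate samples of under-represented score classes until each class
    in the training split has at least `min_count` rows."""
    np.random.seed(seed)  # fixed seed for reproducible duplication
    parts = [train_df]
    for score_value, count in train_df[label_col].value_counts().items():
        if count < min_count:
            candidate_idx = train_df.index[train_df[label_col] == score_value]
            extra_idx = np.random.choice(candidate_idx, size=min_count - count, replace=True)
            parts.append(train_df.loc[extra_idx])
    return pd.concat(parts).reset_index(drop=True)
```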
Appendix D.2. Feature Engineering for Ensemble Models
The input for our ensemble classifiers was not the raw text but a set of features derived from the four LLM-generated scores (Gemini-Initial, Gemini-Revision, Deepseek-Initial, Deepseek-Revision). For each trait, the following features were engineered (a minimal sketch follows this list):
Raw Scores: the four LLM scores themselves.
Statistical Features: the mean, median, and range (max − min) of the four LLM scores.
Score Count Features: five features counting the frequency of each possible score (1 through 5). For example, STY_count_4 is the count of how many of the four LLMs assigned a score of 4 for the Style trait.
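The sketch below illustrates this feature construction; the naming of the count features follows the STY_count_4 example above, while the ordering of the four input scores is an assumption for illustration.

```python
import numpy as np

def build_trait_features(llm_scores, trait_prefix="STY"):
    """Build the per-trait feature vector from the four LLM scores:
    the raw scores, their mean/median/range, and per-score counts."""
    # Assumed order: Gemini-Initial, Gemini-Revision, Deepseek-Initial, Deepseek-Revision.
    scores = np.asarray(llm_scores)
    features = {f"{trait_prefix}_score_{i + 1}": int(s) for i, s in enumerate(scores)}
    features[f"{trait_prefix}_mean"] = float(scores.mean())
    features[f"{trait_prefix}_median"] = float(np.median(scores))
    features[f"{trait_prefix}_range"] = int(scores.max() - scores.min())
    # Count how many of the four models assigned each possible score (1-5).
    for s in range(1, 6):
        features[f"{trait_prefix}_count_{s}"] = int((scores == s).sum())
    return features

# Example: Style scores of 4, 4, 3, 5 from the four LLM configurations.
print(build_trait_features([4, 4, 3, 5], trait_prefix="STY"))
```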
Appendix D.3. Classifier Hyperparameter Tuning
We used GridSearchCV with 5-fold stratified cross-validation to find the optimal hyperparameters for a suite of classifiers. The models and their corresponding search spaces are detailed in Table A2 below, followed by an illustrative code sketch.
Table A2. Hyperparameter search space for ensemble classifiers.

| Classifier | Parameter | Search Space |
|---|---|---|
| RandomForest | n_estimators | [100, 200] |
| | max_depth | [5, 10, None] |
| | min_samples_split | [5, 10] |
| | min_samples_leaf | [3, 5] |
| | class_weight | ['balanced_subsample'] |
| GradientBoosting | n_estimators | [50, 100] |
| | learning_rate | [0.01, 0.1] |
| | max_depth | [3, 5] |
| ExtraTrees | n_estimators | [100, 200] |
| | max_depth | [10, None] |
| | class_weight | ['balanced_subsample'] |
| LogisticRegression | C | [0.1, 1.0, 10.0] |
| | solver | ['liblinear'] |
| | class_weight | ['balanced'] |
| SVC | C | [0.1, 1.0, 10.0] |
| | kernel | ['rbf', 'linear'] |
| | class_weight | ['balanced'] |
| GaussianNB | var_smoothing | [1 × , 1 × , 1 × ] |
| MLPClassifier | hidden_layer_sizes | [(50,), (100,), (50, 50)] |
| | alpha | [0.0001, 0.001] |
| CatBoost | iterations | [100] |
| | learning_rate | [0.1] |
| | depth | [5] |
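For illustration, the search described above could be run as in the sketch below for one of the classifiers in Table A2; the QWK scorer is an assumption consistent with the metric reported in Table 9, and `X_train`/`y_train` stand for the engineered features from Appendix D.2 and the corresponding trait labels.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Search space for RandomForest, as listed in Table A2.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [5, 10, None],
    "min_samples_split": [5, 10],
    "min_samples_leaf": [3, 5],
    "class_weight": ["balanced_subsample"],
}

# Quadratic weighted kappa as the model-selection criterion (assumed).
qwk_scorer = make_scorer(cohen_kappa_score, weights="quadratic")

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring=qwk_scorer,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
)

# search.fit(X_train, y_train)   # X_train: engineered LLM-score features, y_train: trait labels
# print(search.best_params_, search.best_score_)
```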
References
- Attali, Y.; Burstein, J. Automated Essay Scoring for Classroom Assessment. J. Technol. Learn. Assess. 2006, 4, 1–49. [Google Scholar]
- Shermis, M.D.; Burstein, J.C. Automated Essay Scoring: A Cross-Disciplinary Perspective; Lawrence Erlbaum Associates Publishers: Mahwah, NJ, USA, 2003. [Google Scholar]
- Habash, N.Y. Introduction to Arabic Natural Language Processing, 1st ed.; Morgan & Claypool Publishers: San Rafael, CA, USA, 2010; pp. 45–48. [Google Scholar]
- Jonsson, A.; Svingby, G. The use of scoring rubrics: Reliability, validity and educational consequences. Educ. Res. Rev. 2007, 2, 130–144. [Google Scholar] [CrossRef]
- Perelman, L. When “the state of the art” is counting words. Assess. Writ. 2014, 21, 104–111. [Google Scholar] [CrossRef]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Ghazawi, R.; Simpson, E. How well can LLMs Grade Essays in Arabic? arXiv 2025, arXiv:2501.16516. [Google Scholar] [CrossRef]
- Zaidan, O.; Callison-Burch, C. The Arabic Online Commentary Dataset. In Proceedings of the ACL, Portland, OR, USA, 19–24 June 2011; pp. 1–4. [Google Scholar]
- Mathias, S.; Bhattacharyya, P. ASAP++: Enriching the ASAP Automated Essay Grading Dataset with Essay Attribute Scores. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018; European Language Resources Association (ELRA): Paris, France, 2018. [Google Scholar]
- Hamner, B.; Morgan, J.; Lynnvandev; Shermis, M.; Vander Ark, T. The Hewlett Foundation: Automated Essay Scoring. Kaggle. 2012. Available online: https://kaggle.com/competitions/asap-aes (accessed on 17 September 2024).
- Blanchard, D.; Tetreault, J.; Higgins, D.; Cahill, A.; Chodorow, M. TOEFL11: A Corpus of Non-Native English; Report No. RR-13-25; Educational Testing Service: Princeton, NJ, USA, 2013. [Google Scholar]
- Ahmed, A.M.; Myhill, D.; Abdollahzadeh, E.; McCallum, L.; Zaghouani, W.; Rezk, L.; Jrad, A.; Zhang, X. Qatari Corpus of Argumentative Writing LDC2022T04; Web Download; Linguistic Data Consortium: Philadelphia, PA, USA, 2022. [Google Scholar]
- Zaghouani, W.; Ahmed, A.; Zhang, X.; Rezk, L. QCAW 1.0: Building a Qatari Corpus of Student Argumentative Writing. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 13382–13394. [Google Scholar]
- Bashendy, M.; Albatarni, S.; Eltanbouly, S.; Zahran, E.; Elhuseyin, H.; Elsayed, T.; Massoud, W.; Bouamor, H. QAES: First Publicly-Available Trait-Specific Annotations for Automated Scoring of Arabic Essays. In Proceedings of the ARABICNLP, Bangkok, Thailand, 16 August 2024. [Google Scholar]
- Habash, N.; Palfreyman, D.M. ZAEBUC: An Annotated Arabic-English Bilingual Writer Corpus. In Proceedings of the International Conference on Language Resources and Evaluation, Marseille, France, 20–25 June 2022. [Google Scholar]
- Mohit, B.; Rozovskaya, A.; Habash, N.; Zaghouani, W.; Obeid, O. The First QALB Shared Task on Automatic Text Correction for Arabic. In Proceedings of the First Workshop on Arabic Natural Language Processing, Doha, Qatar, 25 October 2014; pp. 39–47. [Google Scholar]
- Ghazawi, R.; Simpson, E. Automated essay scoring in Arabic: A dataset and analysis of a BERT-based system. arXiv 2024, arXiv:2407.11212. [Google Scholar] [CrossRef]
- Alfarah, Z.; Habash, N.; Saddiki, H. ARAScore: Holistic and Analytic Scoring for Arabic Essays. In Proceedings of the WANLP, Kyiv, Ukraine, 19 April 2021; pp. 78–92. [Google Scholar]
- Ouahrani, L.; Bennouar, D. AR-ASAG: An Arabic Dataset for Automatic Short Answer Grading. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 2634–2643. [Google Scholar]
- Alhafni, B.; Inoue, G.; Khairallah, C.; Habash, N. Advancements in Arabic Grammatical Error Detection and Correction: An Empirical Investigation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 6430–6448. [Google Scholar]
- Inoue, G.; Alhafni, B.; Baimukan, N.; Bouamor, H.; Habash, N. The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19 April 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021. [Google Scholar]
- Antoun, W.; Baly, F.; Hajj, H. AraBERT: Transformer-based Model for Arabic Language Understanding. In Proceedings of the LREC 2020 Workshop Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 9–15. [Google Scholar]
- Azmi, A.M.; Al-Jouie, M.F.; Hussain, M. AAEE–Automated evaluation of students’ essays in Arabic language. Inf. Process. Manag. 2019, 56, 1736–1752. [Google Scholar] [CrossRef]
- Attia, M.; Pecina, P.; Toral, A.; van Genabith, J. A Corpus-Based Finite-State Morphological Toolkit for Contemporary Arabic. J. Log. Comput. 2013, 24, 455–472. [Google Scholar] [CrossRef]
- Taghipour, K.; Ng, H.T. A Neural Approach to Automated Essay Scoring. In Proceedings of the EMNLP, Austin, TX, USA, 1–5 November 2016; pp. 1882–1891. [Google Scholar]
- Alikaniotis, D.; Yannakoudakis, H.; Rei, M. Automatic Text Scoring Using Neural Networks. Comput. Linguist. 2019, 45, 1–34. [Google Scholar]
- Abdul-Mageed, M.; Elmadany, A.; Nagoudi, E.M. ARBERT: Effective Arabic Tokenization. Comput. Linguist. 2022, 48, 1–55. [Google Scholar]
- Mahmoud, S.; Nabil, E.; Torki, M. Automatic Scoring of Arabic Essays: A Parameter-Efficient Approach for Grammatical Assessment. IEEE Access 2024, 12, 142555–142568. [Google Scholar] [CrossRef]
- Abdelrehim, M.; Torki, M.; El-Makky, N. Hybrid LLM and Rule-Based Synthetic Data Generation for Arabic Grammatical Error Correction. In Proceedings of the 2nd IEEE International Conference on Machine Intelligence and Smart Innovation (ICMISI 2025), Alexandria, Egypt, 10–12 May 2025. [Google Scholar]
- Qwaider, C.; Alhafni, B.; Chirkunov, K.; Habash, N.; Briscoe, T. Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection. arXiv 2025, arXiv:2503.17739. [Google Scholar] [CrossRef]
- Doewes, A.; Kurdhi, N.A.; Saxena, A. Evaluating Quadratic Weighted Kappa as the Standard Performance Metric for Automated Essay Scoring. In Proceedings of the 16th International Conference on Educational Data Mining, Bengaluru, India, 11–14 July 2023; pp. 103–113. [Google Scholar]
- Laurer, M.; van Atteveldt, W.; Casas, A.; Welbers, K. Building Efficient Universal Classifiers with Natural Language Inference. arXiv 2023, arXiv:2312.17543. [Google Scholar]
- Khattab, O.; Singhvi, A.; Maheshwari, P.; Zhang, Z.; Santhanam, K.; Vardhamanan, S.; Haq, S.; Sharma, A.; Joshi, T.T.; Moazam, H.; et al. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv 2023, arXiv:2310.03714. [Google Scholar] [CrossRef]
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
- Geurts, P.; Ernst, D.; Wehenkel, L. Extremely Randomized Trees. Mach. Learn. 2006, 63, 3–42. [Google Scholar] [CrossRef]
- Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased Boosting with Categorical Features. Adv. Neural Inf. Process. Syst. 2018, 31, 6638–6648. [Google Scholar]
- Hosmer, D.W.; Lemeshow, S. Applied Logistic Regression, 2nd ed.; Wiley: New York, NY, USA, 2000. [Google Scholar]
- Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
- Zhang, H. The Optimality of Naive Bayes. In Proceedings of the 17th International FLAIRS Conference, Miami Beach, FL, USA, 12–14 May 2004; Volume 1, pp. 562–567. [Google Scholar]
- Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Representations by Back-propagating Errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. Available online: https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html (accessed on 1 June 2025).
Figure 1. Architecture of the proposed Arabic AES system, illustrating input components and output scoring traits.
Figure 2. High-level overview of the proposed AES framework workflow.
Figure 3. Workflow for Phase 1: Iterative prompt development. This iterative process focuses on refining prompts through language testing and few-shot optimization, aiming to achieve the best configuration for each writing trait, as detailed in the key activities.
Figure 4. LLM essay evaluation workflow for each trait, illustrating the input of LLM-generated scores into various classifiers for evaluation.
Figure 5. Detailed workflow: Phase 3 (Model Evaluation and Selection).
Figure 6. A comparison of QWK scores for different models across seven writing traits.
Table 1. Widely used English AES datasets.

| Dataset | Reference | Size | Traits | Topics | Avg Length |
|---|---|---|---|---|---|
| Hewlett | [10] | 13,000 | Holistic | 8 | 350 words |
| ASAP++ | [9] | 13,000 | Traits | 8 | 350 words |
| TOEFL11 | [11] | 1100 | Holistic | 8 | 320 words |
Table 2. Arabic AES dataset comparison.

| Dataset | Reference | Size | Original Annotations | Topics | Avg Length |
|---|---|---|---|---|---|
| QCAW/QAES | [12,13,14] | 195 | 7 traits | 2 | 489 words |
| ZAEBUC | [15] | 214 | CEFR levels | 3 | 156 words |
| QALB (2015) | [16] | 622 | Error types | N/A | 145 words |
| AR-AES | [17] | 2046 | Holistic | 12 | 58 words |
| ARAScore | [18] | 1500 | Holistic + traits | 8 | 40 words |
| AR-ASAG | [19] | 2133 | Holistic | 48 | 25 words |
Table 3. Arabic vs. English AES: critical comparison.

| Metric | Arabic (AR-AES) | English (ASAP) | Ratio |
|---|---|---|---|
| Total essays | 2046 | 13,000 | 1:6.4 |
| Average length | 50 words | 350 words | 1:7 |
| Human agreement QWK | 0.85 | 0.92 | 1:1.08 |
| Topics covered | 12 | 8 | 1.5:1 |
Table 4. Comparative performance and key characteristics of AES approaches.

| Ref. | Year | Approach/Model | Key Characteristics | Dataset | Performance |
|---|---|---|---|---|---|
| **Traditional Machine Learning** | | | | | |
| [16] | 2014 | Hybrid (Rule-based + Statistical) | This hybrid system combines statistical machine translation, language modeling, and morphological analysis. It focused on grammatical error correction rather than holistic scoring. | QALB-2014 | F1-score = 52.8% |
| [23] | 2019 | Hybrid (Rule-based + Statistical) | Relies on manual feature engineering using LSA for semantics and AraComLex for spelling. It establishes a benchmark with a custom dataset. | Custom (350 essays) | Correlation = 0.756 |
| [19] | 2020 | Semantic Textual Similarity | An unsupervised corpus-based approach using the Correlated Occurrence Analogue to Lexical Semantics (COALS) algorithm for semantic analysis. | AR-ASAG | Pearson corr. = 0.551 |
| **Deep Learning Advancements** | | | | | |
| [25] | 2016 | LSTM Networks | Pioneering neural approach optimized with Root Mean Square Propagation (RMSProp), effective for automatic text scoring. This work focuses on English. | English ASAP dataset | Not for Arabic AES |
| [26] | 2019 | Hybrid CNN-RNN Models | Combines CNNs for local feature extraction with RNNs for sequential processing. This work also focuses on English. | English ASAP dataset | Not for Arabic AES |
| [22] | 2020 | AraBERT | An Arabic-optimized transformer model that demonstrates superior semantic understanding. | AR-AES | QWK = 0.88 |
| [28] | 2024 | AraBART | Integrates strategies like Parameter-Efficient Fine-Tuning and Model Soup for grammatical error correction. | QALB and ZAEBUC | F1-score ≈ 77.5% |
| **Large Language Model Applications** | | | | | |
| [7] | 2025 | GPT-4 and ACEGPT | A few-shot evaluation of LLMs that notes prompt sensitivity and tokenization inefficiencies. It shows that LLMs still lag behind fine-tuned BERT models. | AR-AES | QWK = 0.67–0.75 |
| [29] | 2025 | Hybrid (Rule-based + LLM) | A method of leveraging rule-based techniques with LLMs to generate synthetic data with controlled error types to improve performance on rare grammatical errors. | N/A (Data Gen.) | N/A (Data Gen.) |
| [30] | 2025 | GPT-4 | Used as a data generation method to create 3040 CEFR-aligned synthetic essays with controlled error injection. | N/A (Data Gen.) | N/A (Data Gen.) |
Table 5. Interpretation of Kappa values.

| Kappa Range | Interpretation |
|---|---|
| <0 | Less than chance agreement |
| 0.01–0.20 | Slight agreement |
| 0.21–0.40 | Fair agreement |
| 0.41–0.60 | Moderate agreement |
| 0.61–0.80 | Substantial agreement |
| 0.81–1.00 | Almost perfect agreement |
Table 6. Summary of selected datasets.

| Dataset | Essays | Avg. Length (Words) | Traits | Topics |
|---|---|---|---|---|
| QAES/QCAW | 195 | 489 | 7 traits | 2 |
| QALB | 622 | 145 | – | – |
| ZAEBUC | 216 | 156 | – | 3 |
Table 7. Number of manually annotated/reviewed samples.

| Trait | ZAEBUC | QALB |
|---|---|---|
| Style | 161 | 54 |
| Development | 39 | 37 |
| Mechanics | 52 | 76 |
| Vocabulary | 127 | 88 |
| Organization | 36 | 61 |
| Structure | 41 | 45 |
| Relevance | 32 | 17 |
Table 8. Number of samples required in the prompt to achieve best results using few-shot learning.

| Trait | Number of Examples |
|---|---|
| Style | 4 |
| Development | 5 |
| Mechanics | 5 |
| Vocabulary | 7 |
| Organization | 5 |
| Structure | 6 |
| Relevance | 8 |
Table 9. Results of the best ensemble classification model for each trait, based on the QWK of the test set from QAES.

| Trait | Meta-Model | CV Mean | CV Std | Test Set QWK |
|---|---|---|---|---|
| Style | GaussianNB | 0.497293 | 0.019947 | 0.258065 |
| Development | SVC | 0.510809 | 0.053345 | 0.483696 |
| Mechanics | RandomForest | 0.534455 | 0.009175 | 0.528986 |
| Vocabulary | SVC | 0.335084 | 0.030820 | 0.413043 |
| Organization | GaussianNB | 0.443547 | 0.009126 | 0.394737 |
| Structure | MLP | 0.491089 | 0.018955 | 0.363636 |
| Relevance | GaussianNB | 0.403734 | 0.000000 | 0.587629 |
Table 10. QWK for LLMs before and after revision and using a classifier on the test set compared to the QAES final result (the rounded integer mean of the assessments provided by the two main annotators).

| Trait | Deepseek First | Deepseek Revised | Gemini First | Gemini Revised | Ensemble QWK | Ensemble Meta-Model |
|---|---|---|---|---|---|---|
| Style | 0.160000 | 0.142857 | 0.166667 | 0.176471 | 0.258065 | GaussianNB |
| Development | 0.347826 | 0.204545 | 0.322034 | 0.494048 | 0.483696 | SVC |
| Mechanics | 0.533898 | 0.533898 | 0.482759 | 0.285714 | 0.528986 | RF |
| Vocabulary | 0.471698 | 0.390244 | 0.409836 | 0.545455 | 0.413043 | SVC |
| Organization | 0.387755 | 0.356061 | 0.480198 | 0.314815 | 0.394737 | GaussianNB |
| Structure | 0.190476 | 0.166667 | 0.272727 | 0.250000 | 0.363636 | MLP |
| Relevance | 0.500000 | 0.680851 | 0.659091 | 0.700000 | 0.587629 | GaussianNB |
Table 11. QWK for LLMs before and after revision and using the classifier compared to the QAES closest result.

| Trait | Deepseek First | Deepseek Revised | Gemini First | Gemini Revised | Ensemble QWK | Ensemble Meta-Model |
|---|---|---|---|---|---|---|
| Style | 0.428571 | 0.388489 | 0.351852 | 0.389535 | 0.442724 | GaussianNB |
| Development | 0.416667 | 0.322034 | 0.595960 | 0.481132 | 0.571429 | SVC |
| Mechanics | 0.570815 | 0.570815 | 0.500000 | 0.423077 | 0.590444 | RF |
| Vocabulary | 0.653846 | 0.490741 | 0.622642 | 0.810127 | 0.662698 | SVC |
| Organization | 0.481132 | 0.473068 | 0.550265 | 0.420000 | 0.462500 | GaussianNB |
| Structure | 0.487179 | 0.458333 | 0.362745 | 0.434783 | 0.405941 | MLP |
| Relevance | 0.500000 | 0.680851 | 0.659091 | 0.700000 | 0.587629 | GaussianNB |
Table 12. Summary of best-performing models on the QAES test set for each writing trait across all approaches.

| Trait | Best-Performing Model | QWK Score |
|---|---|---|
| Style | Ensemble Classifier (GaussianNB) | 0.258065 |
| Development | Gemini_Revised | 0.494048 |
| Mechanics | Deepseek/Deepseek_Revised | 0.533898 |
| Vocabulary | Gemini_Revised | 0.545455 |
| Organization | Gemini | 0.480198 |
| Structure | Ensemble Classifier (MLP) | 0.363636 |
| Relevance | Gemini_Revised | 0.700000 |
Table 13. MAE for LLMs before and after revision and using the classifier compared to the QAES final result (the rounded integer mean of the assessments provided by the two main annotators).

| Trait | Deepseek First | Deepseek Revised | Gemini First | Gemini Revised | Ensemble MAE | Ensemble Meta-Model |
|---|---|---|---|---|---|---|
| Style | 0.75 | 0.80 | 0.70 | 1.00 | 0.85 | GaussianNB |
| Development | 0.60 | 1.20 | 0.60 | 0.65 | 0.65 | SVC |
| Mechanics | 0.45 | 0.45 | 0.50 | 0.70 | 0.55 | RF |
| Vocabulary | 0.60 | 0.85 | 0.60 | 0.65 | 0.95 | SVC |
| Organization | 1.30 | 1.35 | 0.75 | 1.05 | 1.20 | GaussianNB |
| Structure | 0.65 | 0.80 | 0.60 | 0.70 | 0.35 | MLP |
| Relevance | 0.25 | 0.15 | 0.15 | 0.15 | 0.20 | GaussianNB |
Table 14. MAE for LLMs before and after revision and using the classifier compared to the QAES closest result.

| Trait | Deepseek First | Deepseek Revised | Gemini First | Gemini Revised | Ensemble MAE | Ensemble Meta-Model |
|---|---|---|---|---|---|---|
| Style | 0.60 | 0.65 | 0.60 | 0.85 | 0.70 | GaussianNB |
| Development | 0.50 | 1.00 | 0.45 | 0.50 | 0.55 | SVC |
| Mechanics | 0.40 | 0.40 | 0.40 | 0.50 | 0.50 | RF |
| Vocabulary | 0.35 | 0.70 | 0.40 | 0.30 | 0.65 | SVC |
| Organization | 1.10 | 1.15 | 0.65 | 0.85 | 1.05 | GaussianNB |
| Structure | 0.40 | 0.55 | 0.45 | 0.45 | 0.30 | MLP |
| Relevance | 0.25 | 0.15 | 0.15 | 0.15 | 0.20 | GaussianNB |
Table 15. Model performance on the manually annotated subsets of the unannotated datasets ZAEBUC and QALB, together with the size of each subset.

| Model | Trait | ZAEBUC QWK | ZAEBUC Size | QALB QWK | QALB Size |
|---|---|---|---|---|---|
| GaussianNB | Style | 0.808967 | 161 | 0.858792 | 54 |
| SVC | Development | 0.844037 | 39 | 0.903897 | 37 |
| RandomForest | Mechanics | 0.656624 | 52 | 0.554364 | 76 |
| SVC | Vocabulary | 0.668337 | 127 | 0.596497 | 88 |
| GaussianNB | Organization | 0.919751 | 36 | 0.406863 | 61 |
| MLP | Structure | 0.774250 | 41 | 0.456587 | 45 |
| GaussianNB | Relevance | 0.844221 | 32 | 0.625000 | 17 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).