Article

AI-Based Automated Scoring Layer Using Large Language Models and Semantic Analysis

by
Anastasia Vangelova
and
Veska Gancheva
*
Department of Programming and Computer Technologies, Faculty of Computer Systems and Technologies, Technical University of Sofia, 1756 Sofia, Bulgaria
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(7), 3537; https://doi.org/10.3390/app16073537
Submission received: 17 March 2026 / Revised: 28 March 2026 / Accepted: 2 April 2026 / Published: 4 April 2026
(This article belongs to the Special Issue Application of Semantic Web Technologies for E-Learning)

Featured Application

This study presents an AI-based scoring layer for automated assessment of open-ended student responses. The proposed framework combines large language models, Retrieval-Augmented Generation (RAG), and analytical rubrics in order to support criterion-based, context-grounded evaluation in e-learning environments. It can be integrated into platforms such as Moodle to assist instructors in grading, improve consistency, reduce scoring time, and support faster and more structured feedback for learners.

Abstract

Automated scoring of open-ended questions is an important research direction in educational technology and artificial intelligence, as manual grading is time-consuming and often subject to inter-rater variation. This paper proposes an AI-based framework for automated scoring that combines large language models (LLMs), Retrieval-Augmented Generation (RAG), analytical rubrics, and structured machine-readable output within a Moodle-supported e-learning environment. The framework is designed to support context-grounded and criterion-based evaluation by combining the student response, retrieved instructional context, and rubric-defined scoring criteria within a controlled assessment workflow. The proposed approach aims to improve the consistency, traceability, and practical applicability of automated scoring for open-ended responses. To examine its performance, an experimental study was conducted in a real university setting involving a five-task open-ended examination. AI-generated scores were compared with independent human scores using agreement, reliability, correlation, and error metrics. The results indicate a strong level of agreement between automated and expert scoring within the tested setting, together with relatively low average deviation. These findings suggest that the proposed framework has practical potential for supporting automated assessment in digital learning environments, while also highlighting the importance of careful interpretation within the scope of the experimental design.

1. Introduction

The rapid expansion of digital, online, and blended learning has increased the need for scalable and consistent assessment of student work. This challenge is especially evident in open-ended tasks, where learners are expected to explain, justify, analyze, or construct answers in free-text form. Unlike closed-form questions, open-ended responses require interpretation of reasoning, conceptual understanding, and argumentation, which makes manual grading time-consuming, difficult to scale, and often sensitive to inter-rater variation [1,2].
Automated scoring of open-ended responses has evolved through several methodological stages, including rule-based systems, statistical and feature-engineered models, classical machine learning approaches, deep neural architectures, and, more recently, transformer-based and large language model (LLM) systems [3,4]. Earlier approaches offered transparency and controllability, but often struggled with semantic variability, argumentation, and context-sensitive interpretation [4,5]. Newer LLM-based approaches have substantially improved the ability to process natural language, identify semantic relations, and evaluate more complex textual responses [1,3]. At the same time, their use in educational assessment raises concerns related to hallucinations, insufficient grounding in course-specific knowledge, limited transparency, and uneven alignment with pedagogical criteria [6,7,8].
Recent work increasingly explores the use of LLMs in automated essay scoring (AES) and automated short-answer grading (ASAG), often reporting promising levels of agreement with human raters [9,10,11]. At the same time, the literature remains fragmented: some studies focus primarily on scoring accuracy, while others emphasize formative feedback, semantic retrieval, rubric-based prompting, or partial learning management system (LMS) integration [9,10,11]. Fully reproducible end-to-end architectures that combine the pedagogical, institutional, and engineering layers of assessment remain relatively uncommon.
Several research gaps can be identified in this area. First, many published systems operate as isolated “input-to-score” models and do not address the full lifecycle of a real assessment event, from submission and contextual retrieval to structured storage, traceability, and return of results into the learning management system. Second, LMS integration is often limited to data access or result display, rather than event-driven execution embedded in an institutional workflow. Third, although Retrieval-Augmented Generation (RAG) is increasingly used to reduce hallucinations, the retrieved context is often treated as a technical add-on rather than as a formally constrained evidence layer within the assessment logic. Fourth, many systems still produce outputs as free text or single numeric scores, which limits automatic verification, criterion-level analysis, and reproducibility across repeated runs. Finally, a substantial part of the literature remains centered on English-language datasets, while transfer to low-resource or bilingual educational settings is still insufficiently explored [2,12,13].
In response to these gaps, this paper proposes an AI-based automated scoring layer for open-ended responses that integrates four core principles: (1) context-bounded evaluation through RAG-based retrieval of course-specific learning materials; (2) criterion-level scoring through analytical rubrics; (3) structured machine-readable output enabling traceability and verification; and (4) workflow-oriented integration into a Moodle-supported e-learning environment. Rather than treating the language model as an unconstrained virtual grader, the proposed framework is designed as an evidence-constrained evaluation process in which scoring decisions are limited to the student response, the retrieved instructional context, and the active rubric configuration.
The novelty of the present study lies not in the isolated use of LLMs, RAG, or rubrics as individual components, but in their integration into a reproducible and operational end-to-end assessment architecture. More specifically, the paper extends prior work by combining context-grounded LLM analysis, rubric-constrained criterion selection, structured JSON-based result representation, and event-driven LMS-oriented orchestration within a single scoring workflow. In this way, the study positions automated assessment not only as a predictive modeling task, but also as a controlled pedagogical and institutional process.
The main contributions of this paper are as follows:
  • It proposes an integrated framework for automated scoring of open-ended responses based on LLM analysis, RAG-based contextual grounding, and analytical rubric-guided evaluation.
  • It introduces an evidence-constrained scoring logic in which the model operates on a restricted set of assessment inputs rather than on unconstrained background knowledge.
  • It implements a workflow-oriented architecture for Moodle-based e-learning, enabling event-driven triggering, structured output generation, and automated write-back of results.
  • It uses machine-readable JSON output as a mechanism for traceability, consistency checking, and criterion-level analysis.
  • It presents an experimental validation in a real university setting through comparison between AI-generated and independent human scores using agreement, reliability, and error metrics.
  • It discusses practical limitations, fairness, transparency, and the scope conditions under which such a system can support educational assessment.
The remainder of the paper is organized as follows. Section 2 reviews related work on automated scoring of open-ended responses and outlines the methodological evolution toward LLM-based and context-grounded approaches. Section 3 presents the proposed methodology and system architecture. Section 4 reports the experimental design and evaluation results. Section 5 concludes the paper and discusses limitations and directions for future work.

2. Related Work

Automated scoring of open-ended questions has developed at the intersection of educational science and artificial intelligence. Unlike multiple-choice tests, where the assessment procedure is formally predefined, open-ended responses require interpretation of reasoning, logical coherence, conceptual depth, and cognitive complexity [14,15,16]. This interpretive dimension makes such tasks pedagogically valuable, but also difficult to automate. Manual grading of open-ended responses is time-consuming and often affected by inter-rater variation, particularly in large-scale courses where the volume of student work makes timely and detailed feedback difficult [9,13]. In this context, automation is not merely a technological convenience, but a response to a structural problem related to the scalability and consistency of assessment.
The first generation of automated scoring systems emerged in the second half of the twentieth century with systems such as Project Essay Grade (PEG), followed by e-rater®, IntelliMetric™, and Intelligent Essay Assessor (IEA) [17,18]. These systems functioned as symbolic or rule-based expert models in which the evaluation logic was defined through manually specified features and explicit rules [10,19,20]. Architecturally, they relied on hand-crafted indicators such as text length, grammatical correctness, lexical complexity, structural organization, and the presence of keywords, to which weights were assigned in accordance with expert scoring patterns [17,18]. In systems such as PEG, scores were calculated using regression formulas calibrated on corpora of human-rated texts [17,18]. More advanced implementations incorporated templates, regular expressions, and structural matching against predefined response models [21], while ontology-based approaches extended this logic through formalized domain knowledge and explicit conceptual relations, sometimes combined with latent semantic analysis (LSA) [21]. Despite these refinements, the underlying principle remained the same: response quality was inferred from predefined and controllable textual characteristics.
The strengths of this generation were transparency and reproducibility. Because the rules were explicitly defined, the logic of the evaluation could be traced and explained [22]. After initial calibration, such systems could process large volumes of text with relatively low variability in output [22,23]. However, their limitations became evident in the evaluation of free and conceptually complex responses. The variability of valid formulations made exhaustive rule definition practically impossible [21]. Many systems tended toward superficial evaluation, in which formal features disproportionately influenced the final score [22]. In addition, the manual construction of features and ontologies required substantial expert effort and was difficult to transfer across disciplines and languages [8,22,24].
The second generation marked a transition from explicitly encoded rules to statistical models trained to predict expert-assigned scores on the basis of extracted linguistic features [17,18]. In these systems, the response was represented as a vector of lexical, syntactic, semantic, and surface-level indicators, and the relationship between these indicators and the score was modeled statistically, often through regression-based approaches [3,5,25]. During this period, commercial automated essay scoring (AES) systems also became more widely adopted [18,25,26]. Architecturally, these systems continued to rely on feature extraction: manually defined text properties were treated as indirect indicators of response quality and combined into predictive scoring models [27,28,29]. Later, this feature set was expanded with richer natural language processing (NLP) based indicators such as lexical diversity, syntactic complexity, discourse organization, and cohesion, while the basic logic of numerical representation and prediction remained unchanged [3,4,25].
The third generation introduced a clearer machine learning formulation of the scoring task. Assessment was modeled as a supervised regression or classification problem in which the system was trained on corpora of student responses already scored by human experts [22,27,30]. The focus shifted from simple statistical relationships between individual indicators and scores to the learning of more complex patterns that better approximate human scoring behavior [3,25]. Compared with the previous generation, the emphasis moved from linear combinations of indicators to nonlinear modeling in high-dimensional feature spaces, supported by a broader range of predictive algorithms [3,4,25]. Architecturally, these systems remained predictive rather than generative: the input consisted of a numerical vector derived from predefined linguistic features, and the output was a probable score or achievement category [17,31]. Their typical workflow included preprocessing and vectorization, model training on an expert-rated corpus, and score prediction for new responses, sometimes combined with rubric-based schemes and partial credit assignment [17,18,31]. Such models were applied both in AES and in Automated Short Answer Grading (ASAG), including in standardized assessment contexts [11,32].
Empirical results are often presented as one of the strengths of these approaches. Under well-defined task conditions and with suitable training data, such models can achieve high correlation and Quadratic Weighted Kappa (QWK) values, and some studies report levels of agreement approaching those observed between human raters [14,17,31]. In addition, once trained, these systems produce scores rapidly and consistently, reducing variability associated with fatigue and subjectivity and enabling large-scale processing [3,5,33]. For some model types, relative interpretability is also preserved through analyses such as feature importance in regression models or decision trees, which supports calibration and pedagogical interpretation [4,34].
The fourth generation of automated scoring systems introduced a qualitative shift through deep learning architectures. Whereas the second and third generations relied on manually engineered features, deep learning models increasingly learned internal representations directly from raw text [3,8,25]. Instead of specifying in advance which textual properties were important, neural architectures generated multilayer vector representations (embeddings) that captured semantic, syntactic, and contextual relationships [3,35,36]. As a result, the scoring logic changed from explicit feature measurement to the learning of internal representations that approximate expert scoring behavior. Architecturally, this generation is associated with convolutional neural networks (CNNs) and recurrent neural networks (RNNs), including variants such as LSTM, BiLSTM, and GRU, as well as hybrid combinations of these architectures [3,37]. CNN-based models capture local text patterns without requiring predefined rules, while recurrent models are better suited to sequence modeling and argument flow [17]. In a number of systems, attention mechanisms were added to allow the model to focus on more informative textual segments during score generation [17].
The fifth generation of automated scoring systems is dominated by transformer-based architectures and large language models (LLMs), which use self-attention to build global contextual representations of text [2,36]. Unlike earlier generations, which depended on predefined features or sequential neural representations, LLMs can support more complex semantic and logical interpretation of responses. This shift is not only architectural; it also changes the role of the model from a tool for measuring similarity or predicting a numerical score into a system capable of interpreting a task, comparing a response against criteria, and producing a reasoned assessment.
Despite the substantial progress represented by transformer-based and LLM-driven approaches, several limitations remain insufficiently resolved in the current literature. Standalone LLM-based grading systems often lack explicit grounding in course-specific instructional content, which may weaken transparency, increase the risk of hallucinated judgments, and reduce alignment with pedagogical expectations [6,8]. In addition, many existing studies focus primarily on score prediction or agreement with human raters [19,38], without addressing broader operational requirements of educational assessment, such as rubric-constrained evaluation [10,11], machine-readable traceability [39], and integration into real LMS workflows [13,40]. These limitations motivate the need for frameworks that treat automated assessment not only as a predictive task, but also as a controlled pedagogical and institutional process. In this context, the present study differs from prior work by combining context-grounded LLM analysis, analytical rubric-guided scoring, structured JSON-based output, and workflow-oriented Moodle integration within a single end-to-end architecture for automated assessment.

3. Materials and Methods

3.1. Methodological Scheme of Automated Evaluation of Open-Ended Questions

The proposed methodology for automated evaluation of open-ended questions is designed as a controlled and reproducible assessment process in a Moodle-supported e-learning environment. It integrates semantic retrieval of relevant instructional context, analytical rubrics, a large language model (LLM) for criterion-based response analysis, and score normalization. Rather than treating the language model as an unconstrained grader, the framework follows an evidence-constrained evaluation logic in which scoring decisions are limited to the student response, the retrieved course-specific context, and the active rubric configuration.
This methodological scheme describes how the system transforms a free-text student response into a structured and pedagogically grounded assessment (Figure 1). The process is formulated as a complete assessment workflow rather than as isolated score prediction.
At a conceptual level, the framework combines four principles: semantic retrieval of relevant instructional evidence, rubric-guided criterion-level scoring, structured output for traceability and verification, and LMS-oriented workflow integration. The methodological framework can be summarized as a sequence of six stages:
  1. Question and answer identification. The system retrieves the student response together with the relevant task information from the e-learning platform.
  2. Semantic retrieval of contextual evidence. Using embedding-based retrieval, the system identifies the instructional content that should serve as the evidential basis for the assessment.
  3. Construction of the model input. The retrieved context, the student response, the active rubric, and the system instructions are combined into a structured input for the language model.
  4. Criterion-based LLM evaluation. The response is analyzed against the rubric criteria, and the model generates a structured criterion-level assessment.
  5. Score normalization and interpretation. The criterion-level results are aggregated and normalized into a pedagogically interpretable score.
  6. Feedback generation and return of results. The system produces brief criterion-linked feedback and prepares the results for return to the learning platform.
This scheme forms the core of the methodology because it ensures that all learners are assessed through the same controlled sequence of operations.
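Under the assumption that each stage is a callable step, the six-stage sequence can be sketched as a minimal Python pipeline. Every helper below is a trivial stand-in used to show the control flow, not the authors' implementation; the rubric maximum and return values are illustrative.

```python
RUBRIC_MAX = 6  # illustrative maximum rubric score (assumption)

def fetch_submission(submission_id):            # Stage 1: question/answer identification (stub)
    return "Explain concept X.", "Concept X is ..."

def retrieve_context(question):                 # Stage 2: semantic retrieval (stub)
    return ["lecture passage discussing concept X"]

def build_model_input(question, answer, context, rubric_max):  # Stage 3: structured input
    return {"question": question, "answer": answer,
            "context": context, "rubric_max": rubric_max}

def llm_evaluate(model_input):                  # Stage 4: criterion-based evaluation (stubbed model call)
    return {"totalScore": 4, "feedback": "Partially correct; expand the argument."}

def normalize(score, rubric_max):               # Stage 5: score normalization
    return score / rubric_max

def evaluate_submission(submission_id):
    question, answer = fetch_submission(submission_id)
    context = retrieve_context(question)
    result = llm_evaluate(build_model_input(question, answer, context, RUBRIC_MAX))
    norm = normalize(result["totalScore"], RUBRIC_MAX)
    return norm, result["feedback"]             # Stage 6: results returned to the LMS
```

The point of the skeleton is the fixed ordering: every submission passes through the same six operations, which is what makes the workflow controlled and reproducible.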

3.2. Definition of Input Objects

For the automated evaluation process to operate reliably, the proposed framework uses four main categories of input data. Each of them plays a distinct role in ensuring the accuracy, consistency, and pedagogical validity of the assessment process.

3.2.1. Student Answer

The student answer is the primary object of evaluation. It is formulated in natural language and may contain definitions, explanations, arguments, examples, or analytical reasoning depending on the task. The role of the system is not merely to detect lexical overlap or surface correctness, but to interpret the response in relation to the pedagogical expectations of the task and the active assessment criteria.

3.2.2. Learning Materials (Lectures, Tasks, Examples)

The learning materials form the contextual knowledge base used during evaluation. They may include lecture notes, presentations, worked examples, task descriptions, algorithm explanations, and other instructional resources relevant to the course topic. Through embedding-based representation, these materials are transformed into a searchable semantic space that enables retrieval of passages most relevant to the assessed question. As a result, the evaluation is grounded in course-specific instructional content rather than general linguistic plausibility alone.

3.2.3. Analytical Rubric

The rubric provides the formal pedagogical structure of the assessment. It defines which aspects of the student response are evaluated and which performance levels are available for each criterion. In the general case, both the number and content of the criteria, as well as the number of rubric levels, are configurable by the teacher according to the objectives and cognitive complexity of the task.
In the proposed framework, Bloom’s Taxonomy serves as a conceptual basis for rubric design. Depending on the assessment goal, the rubric may focus on lower-order cognitive operations, such as recall and understanding, or on higher-order processes, such as analysis, evaluation, and creation. However, the methodology does not require all Bloom levels to be represented in every rubric.
In the experimental setting used in this study, the rubric criteria were selected to cover different cognitive dimensions, and each criterion was associated with discrete performance levels represented by numerical values. This structure supports transparency and reproducibility by ensuring that the language model operates within a clearly defined scoring framework.

3.2.4. AI Model Instructions

The final input component consists of the instructions provided to the language model. These instructions define the role of the model, limit the scope of its inferences, and specify the required structure of the output. In the present framework, they include:
  • A description of the task,
  • An instruction to rely only on the provided contextual evidence,
  • A requirement to return the result in a structured format,
  • An explicit reminder to follow the rubric criteria strictly.
This component supports controlled LLM behavior and reduces the likelihood of hallucinated or rubric-inconsistent interpretations.

3.3. Main Stages of the Process

The automated evaluation process is implemented as a sequence of clearly defined stages that integrate technological and pedagogical components into a single workflow.

3.3.1. Retrieval of Submission Data from Moodle

The process begins with the automatic retrieval of assessment data from Moodle through a dedicated plugin/webhook mechanism. This stage provides:
  • The task or question text,
  • The student’s answer,
  • The user identifier,
  • Contextual metadata related to the assignment, such as course and task information.
The automated nature of this step ensures that the evaluation process begins immediately after submission and that all responses enter the workflow in a standardized format.

3.3.2. Semantic Retrieval of Relevant Context

Once the submission data have been retrieved, the system activates the Retrieval-Augmented Generation (RAG) component in order to identify the most relevant instructional context for the given response. At this stage, the course materials are represented in a semantic vector space through embeddings, which enables retrieval based on semantic similarity rather than keyword overlap alone.
The system then selects the text segments that are most relevant to the specific task being assessed. This step constrains the evaluation to course-specific material, improves pedagogical validity, and supports task-sensitive contextualization [2,9,41]. In the current implementation, the output of this step is a small set of retrieved text passages that form the contextual evidence base for the subsequent model-driven evaluation.
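The retrieval step described above can be illustrated with a minimal cosine-similarity search over precomputed embeddings. The toy two-dimensional vectors below stand in for real embedding vectors; in practice the embeddings would come from an embedding model and a vector store.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, passages, k=2):
    """Return the k passage texts most similar to the query.

    passages: list of (text, embedding) pairs with precomputed embeddings.
    """
    ranked = sorted(passages, key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy example: the query points in the same direction as passage "a".
passages = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.7, 0.7])]
evidence = top_k([1.0, 0.0], passages, k=2)  # -> ["a", "c"]
```

The selected passages then form the contextual evidence base passed to the model in the next stage.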

3.4. Preparing the Structured Input for the Model

In the third stage, the system constructs a structured input for the language model by combining the student response, the retrieved contextual evidence, the active rubric, and the system-level instructions. This step is methodologically critical because it determines how the evidential basis and scoring criteria are presented to the model before evaluation.
The input typically includes:
  • The wording of the task or question,
  • The student’s answer,
  • The retrieved relevant passages from the learning materials,
  • The analytical rubric,
  • The system instructions that define the model’s role and output constraints.
The purpose of this design is to ensure that the model evaluates the response under controlled conditions rather than through unconstrained free-text interpretation. In particular, the instructions direct the model to:
  • Follow the rubric criteria strictly,
  • Compare the student response against the retrieved context,
  • Assign scores only on the basis of the provided evidence,
  • Return the result in a predefined structured format.
This configuration improves transparency and reproducibility by ensuring that the same input structure can be applied consistently across different students and tasks.
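The assembly of the five input components can be sketched as a simple template function. The bracketed section labels are illustrative assumptions, not the prompt format used in the study; what matters is that the same fixed structure is applied to every response.

```python
def build_model_input(question, answer, context_passages, rubric_text, instructions):
    """Combine the five input components into one structured prompt."""
    context = "\n".join(f"- {p}" for p in context_passages)
    return (
        f"[TASK]\n{question}\n"
        f"[STUDENT ANSWER]\n{answer}\n"
        f"[RETRIEVED CONTEXT]\n{context}\n"
        f"[RUBRIC]\n{rubric_text}\n"
        f"[INSTRUCTIONS]\n{instructions}"
    )

prompt = build_model_input(
    "Define concept X.",
    "Concept X is ...",
    ["passage 1", "passage 2"],
    "Criterion 1: accuracy (levels 1-3) ...",
    "Score only from the provided evidence; follow the rubric strictly; return JSON.",
)
```

Because the template is fixed, two students answering the same task are guaranteed to be evaluated against an identically structured input.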

3.5. Rubric-Guided Criterion-Level Assessment

After the structured input has been prepared, the language model performs the evaluation through a criterion-based rubric logic. Instead of generating a single holistic judgment, it evaluates the student response separately against each active rubric criterion and assigns the corresponding performance level.
This stage includes:
  • Analysis of the accuracy and completeness of the response,
  • Identification of the cognitive operations demonstrated in the answer,
  • Comparison of these operations with the rubric criteria,
  • Assignment of a score or performance level for each criterion,
  • Generation of brief criterion-linked feedback.
The direct connection between the assessment process and Bloom’s Taxonomy becomes visible at this stage. Depending on the task and rubric design, the evaluation may reflect lower-order processes such as recall and understanding, as well as higher-order operations such as application, analysis, evaluation, or creation.
A key strength of this approach is its multidimensionality. Instead of reducing the answer to a single score, the system evaluates multiple aspects of performance separately. This makes the assessment more transparent and more suitable for criterion-level feedback.
The evaluation result is returned in a structured machine-readable format. In the present implementation, this takes the form of JSON-based output containing criterion-level selections and short remarks. A simplified example is shown below:
{
  "picks": [
    {"criterionid": 1, "levelid": 3, "remark": "Accurate and well-justified answer."},
    {"criterionid": 2, "levelid": 1, "remark": "Argumentation is insufficient."}
  ],
  "totalScore": 4
}
This structured representation reflects the dynamic nature of the rubric configuration in Moodle, where the number of criteria and levels is determined by the instructor. At the same time, it supports traceability and verification, because each criterion-level decision can be stored, inspected, and compared across different evaluations.
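Because the output is machine-readable, criterion-level decisions can be checked automatically. The sketch below parses an output of the form shown above and verifies that the reported total matches the points implied by the selected levels; the points-per-level mapping is a toy assumption, not the study's actual rubric configuration.

```python
import json

# Toy rubric: criterionid -> {levelid: points}. Illustrative assumption only.
rubric_points = {1: {1: 1, 2: 2, 3: 3}, 2: {1: 1, 2: 2, 3: 3}}

raw = """{
  "picks": [
    {"criterionid": 1, "levelid": 3, "remark": "Accurate and well-justified answer."},
    {"criterionid": 2, "levelid": 1, "remark": "Argumentation is insufficient."}
  ],
  "totalScore": 4
}"""

result = json.loads(raw)
# Recompute the total from the criterion-level picks (3 + 1 = 4).
implied = sum(rubric_points[p["criterionid"]][p["levelid"]] for p in result["picks"])
consistent = (implied == result["totalScore"])
```

This kind of consistency check is one concrete benefit of structured output: a mismatch between `totalScore` and the picks can be detected before the grade is stored.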

3.6. Score Normalization

To ensure comparability across tasks and rubric configurations, the final score is normalized relative to the maximum score allowed by the active rubric. The normalized score is calculated as the ratio between the obtained score and the maximum possible rubric score:
NormScore = Score / Max(R)
where Score is the total number of points assigned across all rubric criteria, and Max(R) is the maximum score defined by the active rubric.
This normalization makes it possible to compare results across tasks with different numbers of criteria or different point allocations. It also facilitates conversion to other grading scales, such as percentage-based, institutional, or national grading systems.
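The normalization formula translates directly into code; the percentage conversion below is one example of the mapping to other grading scales mentioned above.

```python
def normalize_score(score, rubric_max):
    """NormScore = Score / Max(R)."""
    if rubric_max <= 0:
        raise ValueError("rubric maximum must be positive")
    return score / rubric_max

def to_percent(norm_score):
    """Convert a normalized score to a percentage, rounded to one decimal."""
    return round(norm_score * 100, 1)

# Example: 4 points obtained on a rubric with a maximum of 6.
norm = normalize_score(4, 6)      # ~0.667
percent = to_percent(norm)        # 66.7
```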

3.7. Returning the Assessment to Moodle

The final stage of the process is the automatic return of the evaluation result to Moodle. This includes:
  • Recording the numerical score,
  • Recording the criterion-level assessment,
  • Adding brief feedback,
  • Marking the submission as graded.
This step is performed automatically and in real time, reducing the need for manual intervention and enabling timely feedback within the learning platform.
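The write-back step can be sketched against Moodle's Web Services REST endpoint (`webservice/rest/server.php`). The function `mod_assign_save_grade` is one web-service function Moodle exposes for assignment grading, but the exact function, token, and base URL depend on the concrete deployment and activity type; this sketch only assembles the request and does not perform the HTTP call.

```python
def build_grade_request(base_url, token, assignment_id, user_id, grade, feedback):
    """Assemble the endpoint URL and POST parameters for a grade write-back."""
    params = {
        "wstoken": token,
        "wsfunction": "mod_assign_save_grade",
        "moodlewsrestformat": "json",
        "assignmentid": assignment_id,
        "userid": user_id,
        "grade": grade,
        "attemptnumber": -1,        # grade the latest attempt
        "addattempt": 0,
        "workflowstate": "graded",  # marks the submission as graded
        "applytoall": 0,
        "plugindata[assignfeedbackcomments_editor][text]": feedback,
        "plugindata[assignfeedbackcomments_editor][format]": 1,
    }
    return f"{base_url}/webservice/rest/server.php", params

url, payload = build_grade_request(
    "https://moodle.example.edu", "TOKEN", 42, 7, 66.7, "Partially correct answer."
)
```

In the proposed architecture this request would be issued by an orchestration node rather than by hand, but the parameter structure is the same.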

3.8. AI Layer for Automated Assessment

The developed system includes a dedicated AI layer responsible for the automated evaluation of student responses. This layer integrates three core elements: a semantic repository of instructional content, a Retrieval-Augmented Generation (RAG) mechanism for contextual grounding, and a large language model for rubric-guided assessment. Together, these components form the intelligent scoring layer of the system, transforming LMS-derived input data into structured and traceable assessment results.
A key architectural characteristic of this layer is that it operates externally to Moodle. This allows the intelligent assessment logic to be implemented without modifying the internal logic of the LMS itself, while preserving modularity and the possibility of future model replacement or extension.
Within the overall workflow, the AI layer receives structured assessment data from Moodle, enriches them with retrieved contextual evidence from the semantic database, performs criterion-level evaluation, and produces a structured output containing scores and feedback. These results are then returned to Moodle through the orchestration layer. The overall organization of this layer is shown in Figure 2.

3.8.1. AI Agent Architecture

The AI agent is implemented within the n8n automation environment and serves as the main intelligent component of the scoring process. It coordinates the interaction between the learning platform, the semantic retrieval layer, and the language model (Figure 3).
The input to the agent is provided in a structured format and includes:
  • The text of the student response,
  • The rubric definition retrieved from Moodle,
  • The contextual fragments extracted through the RAG mechanism,
  • Where applicable, information about attached files for additional visual analysis.
Based on these inputs, the agent performs content analysis and generates a structured output containing criterion-level scores and short feedback. The result is returned in JSON format and is subsequently transformed into the format required for storage through the Moodle Web Services API.
Architecturally, the agent operates as part of an orchestrated workflow in which additional n8n nodes participate. These include:
  • Nodes for contextual data aggregation from the RAG layer,
  • Nodes for configuring and invoking the language model,
  • Nodes for formatting and transmitting the results to Moodle.
This organization ensures a clear separation between data retrieval, intelligent evaluation, and result persistence.
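The structured input and machine-readable output described above can be sketched as follows. This is a minimal illustration only: every field name (`student_response`, `rubric`, `retrieved_context`, `attachments`, `scores`, `feedback`) is a hypothetical placeholder, since the article does not publish the exact schema used by the n8n workflow.

```python
import json

# Hedged sketch of the structured agent input and output described in
# the text. All field names are illustrative assumptions, not the
# actual schema of the implemented system.
agent_input = {
    "student_response": "A use case diagram shows actors and their goals ...",
    "rubric": [
        {"criterion": "Correctness", "levels": [0, 3, 5]},
        {"criterion": "Completeness", "levels": [0, 3, 5]},
    ],
    "retrieved_context": ["Lecture 4: UML use case modeling ..."],
    "attachments": [],  # optional files for multimodal analysis
}

# The agent is expected to return machine-readable JSON such as:
agent_output_json = json.dumps({
    "scores": {"Correctness": 5, "Completedness" if False else "Completeness": 3},
    "feedback": "Actors identified correctly; one relationship is missing.",
})

result = json.loads(agent_output_json)
total = sum(result["scores"].values())
print(total)  # criterion points summed before any scale conversion
```

A payload of this shape can be transformed by downstream n8n nodes into the format expected by the Moodle Web Services API without the language model ever touching the persistence logic.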

3.8.2. Rubric-Based Assessment

A central element of the AI layer is the use of analytical rubrics as the formal pedagogical model of assessment. The rubrics are created in Moodle and retrieved through the Web Services API, after which they are transformed into a structured format suitable for use by the AI agent.
Each rubric contains a set of criteria, and each criterion includes predefined performance levels associated with specific point values and descriptions. Instead of generating a free-form judgment, the AI agent selects one of the available levels for each criterion. In this way, the assessment remains constrained by a formal pedagogical structure rather than relying on unconstrained model interpretation.
This rubric-guided logic offers several important advantages. It:
  • Supports consistency across evaluations,
  • Reduces the influence of subjective or weakly grounded judgments,
  • Enables criterion-level traceability,
  • Makes it possible to generate structured and pedagogically meaningful feedback.
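The constraint that the agent must select one of the predefined levels per criterion can also be enforced deterministically after generation, for example by snapping any out-of-band value to the nearest allowed level. The helper below is a sketch under that assumption, not the authors' implementation; the (0, 3, 5) scale matches the experimental setup described later in the paper.

```python
# Illustrative post-processing guard: any unconstrained numeric
# judgment is mapped onto the closest predefined rubric level, so the
# final score always respects the formal pedagogical structure.
ALLOWED_LEVELS = (0, 3, 5)  # three-level scale used in the experiment

def constrain_to_level(raw_score: float, levels=ALLOWED_LEVELS) -> int:
    """Map an unconstrained numeric judgment to the nearest rubric level."""
    return min(levels, key=lambda lv: abs(lv - raw_score))

print(constrain_to_level(4.2))  # -> 5
print(constrain_to_level(1.0))  # -> 0
```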

3.9. Integration of Bloom’s Taxonomy

To support a more fine-grained interpretation of student performance, the proposed framework incorporates Bloom’s Taxonomy as an additional pedagogical layer in the assessment process. In the experimental configuration used in this study, rubric criteria are associated with cognitive dimensions related to Bloom’s framework, such as remembering, understanding, applying, analyzing, evaluating, and creating. This mapping is used not as an independent scoring mechanism, but as an interpretive layer that helps characterize the cognitive orientation of the assessed response.
During the evaluation process, the AI agent analyzes the student answer in relation to the active rubric and the retrieved instructional context. Based on the criterion-level assessment, the system generates an additional cognitive profile indicating which types of cognitive performance are more strongly represented in the response. This information complements the score-based evaluation by providing a more pedagogically informative view of student performance.
The Bloom-related outputs are used for reporting and subsequent analysis rather than as a standalone grading criterion. In this way, Bloom’s Taxonomy extends the framework beyond score generation and supports richer interpretation of the assessment results. An example of a generated result with criterion-level scoring and Bloom-related cognitive classification is shown in Figure 4.
The resulting data can be stored for later analysis, including progress monitoring, criterion-level comparison, and statistical interpretation across learners or tasks.
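One possible realization of such a cognitive profile is a simple criterion-to-dimension mapping aggregated over the criterion-level points. The mapping and criterion names below are purely illustrative; the study's actual association between rubric criteria and Bloom dimensions is defined by the instructors and is not reproduced here.

```python
# Hypothetical mapping of rubric criteria to Bloom dimensions,
# for illustration only.
CRITERION_TO_BLOOM = {
    "Terminology": "remembering",
    "Explanation": "understanding",
    "UML model": "applying",
    "Comparison": "analyzing",
}

def bloom_profile(criterion_scores: dict, mapping=CRITERION_TO_BLOOM) -> dict:
    """Aggregate criterion-level points by Bloom dimension for reporting."""
    profile = {}
    for criterion, points in criterion_scores.items():
        dim = mapping.get(criterion)
        if dim is not None:
            profile[dim] = profile.get(dim, 0) + points
    return profile

scores = {"Terminology": 5, "Explanation": 3, "UML model": 5}
print(bloom_profile(scores))
```

Because the profile is derived from, rather than added to, the criterion scores, it stays consistent with the rubric-based evaluation while remaining a reporting artifact rather than a grading criterion.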

3.10. Mechanisms for Increasing the Reliability of Scoring

To improve the consistency and verifiability of the generated assessments, the AI layer incorporates additional mechanisms that support structured reasoning and deterministic numerical checking. These mechanisms do not replace the language model, but complement it by reducing specific sources of error in the scoring workflow.
The first mechanism, referred to in the implementation as the Think tool, supports structured intermediate reasoning before the final criterion-level decisions are produced. It is used to encourage stepwise analysis of the student response in relation to the active rubric criteria and to reduce logical inconsistencies across criterion-level selections.
The second mechanism, the Calculator tool, is used for deterministic verification of numerical operations. This is especially relevant in cases involving score aggregation, normalization, or task-specific quantitative elements. By separating arithmetic processing from the language model’s generative component, the system reduces the risk of numerical inconsistencies in the final result.
Taken together, these mechanisms function as supporting reliability controls within the scoring architecture.
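The role of the Calculator tool can be illustrated by a deterministic check that recomputes the aggregated score outside the language model. The function below is a sketch of that idea under assumed names and a 100-point target scale, not the tool's actual code.

```python
def verify_total(criterion_scores, reported_total, max_points, scale=100.0):
    """Deterministically recompute the aggregated score and compare it
    with the total reported by the language model."""
    expected = sum(criterion_scores.values()) / max_points * scale
    return abs(expected - reported_total) < 1e-9, expected

# A mismatch between the model-reported total and the deterministic
# recomputation is flagged rather than silently accepted.
ok, expected = verify_total({"C1": 5, "C2": 3, "C3": 5},
                            reported_total=70.0, max_points=20)
print(ok, expected)  # -> False 65.0
```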

3.11. Integration with OpenAI Models

The intelligent processing in the proposed system is implemented through the use of OpenAI models specialized for different functions within the workflow. The main model used for criterion-level grading is GPT-4.1-mini, which analyzes student responses, compares them with the active rubric and retrieved context, and generates structured assessment outputs and feedback.
For semantic retrieval, the system uses text-embedding-3-small to generate vector representations of learning materials and query-related inputs. These vector representations support similarity-based search in the RAG layer and enable retrieval of contextually relevant instructional passages for the assessment process.
In cases where the task includes visual artifacts, the system can additionally use multimodal GPT capabilities for image-based analysis. This extends the architecture beyond text-only evaluation and allows certain diagrammatic or graphical student outputs to be included in the automated assessment workflow.
Overall, the use of multiple OpenAI models allows different forms of intelligent processing to be integrated within a single architecture while preserving a clear functional separation between semantic retrieval, scoring, and optional multimodal analysis.
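Assuming the embedding vectors have already been produced (e.g., with text-embedding-3-small), the similarity-based search in the RAG layer reduces to ranking stored passage vectors against the query vector. The sketch below uses toy three-dimensional vectors and invented passage titles in place of real embedding output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k(query_vec, passages, k=2):
    """Rank stored passages by similarity to the query embedding."""
    ranked = sorted(passages, key=lambda p: cosine(query_vec, p["vec"]),
                    reverse=True)
    return [p["text"] for p in ranked[:k]]

# Toy 3-dimensional vectors stand in for real embedding output.
passages = [
    {"text": "UML use case basics",  "vec": [0.9, 0.1, 0.0]},
    {"text": "Course logistics",     "vec": [0.0, 0.2, 0.9]},
    {"text": "Actor identification", "vec": [0.8, 0.3, 0.1]},
]
print(top_k([1.0, 0.2, 0.0], passages, k=2))
```

In the deployed system, a vector database typically performs this ranking at scale; the point here is only that retrieval quality depends on the embeddings, which motivates the corpus-dependence limitation discussed in Section 4.4.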

4. Results and Discussion

4.1. Experimental Study

This section presents the experimental evaluation of the proposed system for fully automated assessment of open-ended questions based on large language models (LLMs), Retrieval-Augmented Generation (RAG), and analytical rubrics. The purpose of the experimental study is to examine the practical applicability of the proposed methodology in a real university e-learning environment and to evaluate its agreement with independent human scoring under controlled assessment conditions.
The experimental design was developed on the basis of the literature review, the proposed methodological framework, and the research objectives of the study. In line with current research on automated essay scoring (AES) and automated short-answer grading (ASAG), the evaluation focuses on agreement with human scoring, error-based indicators, and the interpretability of the generated assessment outputs [14,15,17].
The experiment was conducted in two parallel university courses in the discipline Systems Engineering. The two courses were equivalent in content, structure, tasks, and assessment criteria, but differed in the language of instruction and examination: one course was delivered in Bulgarian and the other in English. This setting made it possible to examine the behavior of the proposed system under language variation while preserving the same pedagogical and assessment model.
The scoring process was carried out fully automatically by the proposed AI system, without teacher intervention in the generation of the machine-assigned scores. For research purposes, the same student responses were also assessed independently by a human expert using the same analytical rubric. The expert scores were used only as a reference standard for scientific comparison and did not constitute part of the automated scoring workflow itself.
Within the scope of this article, the experimental analysis is centered on the following main research question:
RQ1: To what extent do the automated scores agree with the independent human scores?
Accordingly, the following working hypothesis was defined:
H1. The proposed automated scoring system will demonstrate a high degree of agreement with independent human scoring under the tested assessment conditions.
The hypothesis is examined through a set of quantitative indicators commonly used in the evaluation of automated scoring systems, including agreement, reliability, correlation, and error metrics. In this way, the experiment is intended to provide an empirical validation of the methodology within a real educational context rather than to claim universal generalizability beyond the tested setting.
The experimental study was conducted in a specialized Moodle-based e-learning environment developed and configured for the purposes of this research. The platform was designed to provide controlled conditions for automated assessment, event logging, and integration with the proposed AI architecture. Within this environment, student submissions were processed automatically by the system, and the resulting AI-generated scores were compared with independently assigned human scores.
The main characteristics of the experimental setting were as follows:
  • Discipline: Systems Engineering;
  • Educational level: bachelor’s degree;
  • Task types: structured text-based, analytical, argumentative, and UML-related modeling tasks requiring application, analysis, and interpretation of knowledge;
  • Language setting: one Bulgarian-language course and one English-language course;
  • Assessment conditions: identical tasks, rubrics, and scoring criteria across the two courses;
  • Exam format: remedial examination conducted after the end of the semester.
This design combines practical realism with methodological control. On the one hand, the experiment was conducted in an authentic university assessment context. On the other hand, the use of equivalent tasks, identical rubrics, and a controlled platform environment makes it possible to interpret the resulting comparisons between AI and human scoring in a more reliable and transparent way.

4.2. Data Collection and Preparation

For the purposes of the experimental analysis, data were collected from a total of 32 students, including 13 participants in the Bulgarian-language course and 19 participants in the English-language course. Each student completed an examination consisting of five open-ended tasks, which were evaluated using analytical rubrics with predefined criteria.
The analysis was conducted at two levels of aggregation:
  • Student level (N = 32), where each student represents one examined case;
  • Task level (32 × 5 = 160 observations), where each task is treated as a separate evaluated instance.
For each student submission, the following data were recorded:
  • AI-generated assessment produced by the automated system;
  • Independent human assessment assigned by the lecturer;
  • Scores on a 100-point scale;
  • Equivalent grades on the Bulgarian six-point grading scale;
  • Rounded and unrounded score values;
  • Task-level results;
  • Criterion-level rubric scores;
  • Course language (Bulgarian or English).
The examination covered five tasks corresponding to different response formats and different types of cognitive operations. The maximum total score was 100 points. All tasks were assessed using analytical rubrics with predefined criteria and performance levels. In the experimental setup, each criterion was evaluated on a three-level scale (0, 3, and 5 points), which allowed structured and transparent assessment of different components of student performance. The same rubrics were used both by the automated system and by the human expert, ensuring equivalence of the assessment framework.
After collection, the data were exported from the platform and organized in tabular form for statistical analysis. Before analysis, the dataset was checked for completeness, consistency, and correctness. Missing or incomplete submissions were marked and treated according to predefined rules. All records were anonymized through unique identifiers that did not allow direct identification of the participants. This procedure ensured compliance with ethical and institutional requirements for data protection.
The resulting dataset served as the basis for the calculation of agreement, reliability, correlation, and error metrics presented in the following sections.

4.3. Evaluation and Validation

The evaluation of the proposed automated scoring system is based on a combination of statistical indicators that support a multidimensional analysis of its performance in relation to independent human scoring. The selected metrics are intended to capture several complementary aspects of system behavior, including agreement, reliability, correlation, numerical error, and time efficiency.
In research on automated essay scoring (AES) and automated short-answer grading (ASAG), machine-generated scores are typically examined in relation to expert human scores, which serve as the reference standard in educational measurement [10,11,13]. For this reason, the validation strategy used in the present study does not rely on a single performance indicator. Instead, it combines agreement and reliability metrics with error-based and correlational measures in order to provide a broader picture of the strengths and limitations of the proposed system.
More specifically, the evaluation framework includes:
  • Agreement and reliability metrics, used to estimate the extent to which AI-generated scores align with independent human scores;
  • Error metrics, used to quantify the magnitude of deviations between automated and expert scoring;
  • Correlation measures, used to examine whether the automated system preserves the general ordering and score structure of the expert judgments;
  • Time-related indicators, used to assess the practical efficiency of the automated workflow.
Within the scope of the present article, the evaluation is centered primarily on RQ1, namely the extent to which automated scores agree with independent human scores. Accordingly, the main validation logic is based on comparing AI-generated and expert-assigned scores using agreement, reliability, and error metrics. Additional analyses related to cognitive profiling, language variation, and feedback interpretation are treated as complementary perspectives rather than as standalone proof of general system validity.
This multi-metric strategy is consistent with contemporary approaches to the validation of automated educational assessment systems, where no single indicator is sufficient to characterize overall system performance [13,19,20]. By combining several types of metrics, the study aims to evaluate not only whether the system produces similar scores to a human assessor, but also whether these similarities remain stable, interpretable, and practically meaningful within the tested educational setting.

4.3.1. Consistency and Reliability Metrics

Agreement and reliability metrics are used to assess the extent to which AI-generated scores align with independent human scores and whether this alignment is sufficiently stable for meaningful educational interpretation. In the context of automated educational assessment, such metrics are commonly used to estimate inter-rater agreement between machine-generated and expert-assigned scores [14,17,42].
In the present study, agreement between the human expert and the automated system is examined using two main indicators: Quadratic Weighted Kappa (QWK) and the Intraclass Correlation Coefficient (ICC).
Quadratic Weighted Kappa (QWK).
QWK is one of the most widely used metrics in research on automated essay scoring and rubric-based educational assessment [19,32,43]. Unlike the standard Kappa coefficient, QWK takes into account the magnitude of disagreement between two raters by penalizing larger discrepancies more heavily than smaller ones. For example, a difference between scores such as 1 and 5 is penalized more strongly than a difference between 4 and 5. Because most educational rating scales are ordinal, QWK is particularly appropriate for rubric-based scoring.
In a number of studies, QWK values above 0.70 have been interpreted as indicating an acceptable level of agreement in high-stakes or decision-relevant assessment settings [36,42]. In the present study, QWK is used as a primary indicator of how closely the automated system reproduces the human expert’s criterion-based scoring logic.
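For reference, QWK can be computed directly from its standard definition: one minus the ratio of the weighted observed disagreement to the weighted disagreement expected by chance, with quadratic weights. The pure-Python sketch below is illustrative and is not the analysis script used in the study.

```python
def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """QWK between two raters on an ordinal scale; larger disagreements
    are penalized quadratically."""
    n = max_rating - min_rating + 1
    obs = [[0.0] * n for _ in range(n)]   # observed rating pairs
    hist_a = [0.0] * n                    # marginal histogram, rater a
    hist_b = [0.0] * n                    # marginal histogram, rater b
    for x, y in zip(a, b):
        i, j = x - min_rating, y - min_rating
        obs[i][j] += 1
        hist_a[i] += 1
        hist_b[j] += 1
    total = len(a)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2          # quadratic weight
            expected = hist_a[i] * hist_b[j] / total  # chance agreement
            num += w * obs[i][j]
            den += w * expected
    return 1.0 - num / den

human = [5, 4, 3, 5, 2, 4, 3, 5]
ai    = [5, 4, 3, 4, 2, 4, 3, 5]
print(round(quadratic_weighted_kappa(human, ai, 2, 5), 3))  # -> 0.939
```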
Intraclass Correlation Coefficient (ICC).
The Intraclass Correlation Coefficient is used to assess the reliability and consistency of scores assigned by different raters or systems [14]. In the current experiment, ICC is used to estimate the degree of agreement between AI-generated and expert-generated scores at the level of absolute reliability.
Common interpretation thresholds describe ICC values below 0.50 as poor, values between 0.50 and 0.75 as moderate, values between 0.75 and 0.90 as good, and values above 0.90 as excellent [44]. Prior studies have also reported that advanced language models can achieve relatively high ICC values under controlled scoring conditions [26]. In this study, ICC complements QWK by providing an additional perspective on the stability and strength of agreement between human and automated scoring.
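The absolute-agreement ICC for two raters, ICC(A,1), follows from the standard two-way ANOVA decomposition into subject, rater, and residual mean squares. The sketch below assumes two raters and complete data; in practice a statistics package would also supply confidence intervals and significance tests.

```python
def icc_a1(rater_a, rater_b):
    """Two-way ICC for absolute agreement between two raters, ICC(A,1),
    computed from the standard ANOVA mean squares."""
    n, k = len(rater_a), 2
    data = list(zip(rater_a, rater_b))
    grand = sum(rater_a + rater_b) / (n * k)
    row_means = [(a + b) / k for a, b in data]          # per-subject means
    col_means = [sum(rater_a) / n, sum(rater_b) / n]    # per-rater means
    ss_total = sum((v - grand) ** 2 for row in data for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ms_r = ss_rows / (n - 1)                             # subjects
    ms_c = ss_cols / (k - 1)                             # raters
    ms_e = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

human = [5, 4, 3, 5, 2]
ai    = [5, 4, 3, 4, 2]
print(round(icc_a1(human, ai), 3))  # -> 0.933
```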
Together, QWK and ICC provide a twofold view of scoring quality: QWK captures agreement on an ordinal scale with disagreement weighting, while ICC reflects the overall reliability of the scoring relationship between the automated system and the human expert.

4.3.2. Error Metrics

While agreement and reliability metrics describe the extent to which AI-generated scores resemble expert scores in relative or ordinal terms, error metrics quantify the direct numerical deviation between the two. In the present study, scoring accuracy is assessed using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
Mean Absolute Error (MAE).
MAE measures the average absolute difference between automated and expert scores [13,45]. Because it is expressed in the same units as the scoring scale, MAE is easy to interpret and is often used as an indicator of the typical magnitude of scoring deviation. In educational assessment research, lower MAE values generally indicate closer reproduction of expert judgment, especially when automated scores are compared against a human reference standard [13,17].
Root Mean Squared Error (RMSE).
RMSE measures the square root of the mean squared differences between automated and expert scores. Unlike MAE, RMSE gives greater weight to larger discrepancies because the errors are squared before averaging [46,47]. For this reason, RMSE is especially useful for identifying the presence of more substantial individual deviations that may not be fully visible through MAE alone.
The combined use of MAE and RMSE makes it possible to assess both the typical magnitude of scoring error and the extent to which larger disagreements occur. In this study, these two measures are used as complementary indicators of how closely the automated system reproduces expert scoring in numerical terms.
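These error measures have direct textbook definitions; the sketch below also includes bias (the mean signed difference, reported alongside MAE and RMSE in the results) and is provided for reference rather than as the study's analysis code.

```python
import math

def mae(expert, ai):
    """Mean absolute deviation, in the units of the grading scale."""
    return sum(abs(a - e) for e, a in zip(expert, ai)) / len(expert)

def rmse(expert, ai):
    """Root mean squared error; squaring weights larger discrepancies more."""
    return math.sqrt(sum((a - e) ** 2 for e, a in zip(expert, ai)) / len(expert))

def bias(expert, ai):
    """Mean signed difference; positive values indicate AI over-scoring."""
    return sum(a - e for e, a in zip(expert, ai)) / len(expert)

expert = [4.0, 5.0, 3.0, 4.5]
ai     = [4.0, 4.5, 3.5, 4.5]
print(mae(expert, ai), round(rmse(expert, ai), 3), bias(expert, ai))
```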

4.3.3. AI–Expert Agreement (RQ1)

To address RQ1, agreement between the automated system and the independent human expert was examined using QWK, ICC, Pearson correlation, MAE, RMSE, and bias. Because no single indicator is sufficient to characterize overall scoring performance, these metrics are interpreted jointly.
Following commonly used interpretation ranges for Quadratic Weighted Kappa, values between 0.60 and 0.80 may be interpreted as substantial agreement, while values above 0.80 are often treated as near-perfect agreement [48]. On this basis, the overall result of QWK = 0.806 suggests a high level of agreement between AI-generated and expert-assigned scores across the full dataset (N = 160 task-level observations).
The group-level QWK results are presented in Table 1. The Bulgarian-language group shows the highest level of agreement (QWK = 0.851), while the English-language group remains in the substantial-agreement range (QWK = 0.765). These results indicate that the system maintains strong alignment with expert scoring in both language settings, although the English-language condition exhibits somewhat greater variability.
To complement the QWK analysis, the Intraclass Correlation Coefficient (ICC) was calculated using a model of absolute agreement between two independent raters. The resulting value (ICC = 0.868, 95% CI [0.75, 0.93], p < 0.001) indicates a high degree of agreement between the expert and automated scores within the tested setting. The confidence interval further supports the robustness of this result.
A compact overview of the main indicators of agreement, correlation, error, and reliability is given in Table 2.
The detailed group-level indicators are presented in Table 3. On the raw six-point scale, the overall mean absolute error is MAE = 0.453, with lower error in the Bulgarian group (0.387) and slightly higher error in the English group (0.499). On the 100-point scale, the average deviation is approximately 2.01 points, which indicates that the automated scores remain close to the expert-assigned scores in practical terms.
The correlation analysis shows a strong linear relationship between expert and AI-generated scores (Pearson r = 0.836 overall; 0.882 for Bulgarian; 0.796 for English), indicating that higher expert scores are generally associated with higher automated scores. Bias values remain close to zero in all groups, suggesting the absence of strong systematic over-scoring or under-scoring. A slight tendency toward stricter scoring in the Bulgarian group and slightly more lenient scoring in the English group can be observed, but these deviations remain small.
To further assess the presence of larger individual discrepancies, RMSE was calculated on the unrounded six-point scores. The overall value (RMSE = 0.81) indicates that the root mean squared deviation remains below one grade unit. Group-level analysis again shows somewhat lower deviation in the Bulgarian group (RMSE = 0.738) than in the English group (RMSE = 0.856), which is consistent with the broader pattern observed in the agreement and MAE results.
Task-level RMSE values reveal variation depending on task characteristics. The lowest RMSE is observed for Task 5 (RMSE = 0.298), indicating very small deviations between AI and expert scoring for this task. Task 3 also shows relatively low error (RMSE = 0.762), while Tasks 1 and 4 show moderate variability (RMSE = 0.887 and 0.874, respectively). The highest RMSE is recorded for Task 2 (RMSE = 1.029), suggesting that this task produced the greatest divergence between automated and expert judgment. This pattern indicates that the system performs more consistently on more structured tasks, while freer or more interpretive tasks remain more challenging.
Figure 5 presents the relationship between the unrounded expert scores and the corresponding AI-generated scores. Most observations are concentrated near the line of perfect agreement, visually supporting the strong correlation and relatively low average error reported in the statistical analysis.
Figure 6 presents the distribution of score differences (AI_raw − Expert_raw). The concentration of values around zero and the absence of a strong directional skew are consistent with the low bias values reported above. Extreme deviations appear relatively infrequent and are concentrated in a limited number of cases.
Overall, the combined results from QWK, ICC, Pearson correlation, MAE, RMSE, and bias indicate that the proposed system demonstrates a strong level of agreement with independent human scoring within the tested educational setting. At the same time, the observed variation across language groups and task types suggests that these findings should be interpreted within the scope of the present experiment rather than as evidence of universal generalizability.

4.4. Limitations, Fairness, and Transparency

The findings of the present study should be interpreted in light of several limitations.
First, the experiment was conducted within a relatively narrow educational context, involving two university courses in Systems Engineering. Although the assessment tasks covered multiple response formats and different cognitive dimensions, the domain-specific nature of the course limits the extent to which the results can be generalized to other disciplines, institutions, or educational settings.
Second, the sample size was relatively small (32 students), which constrains the statistical generalizability of the findings. While the task-level dataset included 160 evaluated observations, the study should still be interpreted as a small-scale real-course validation rather than as large-scale evidence of universal system performance.
Third, the human reference standard in the present experiment was based on one independent expert rater. Although this makes it possible to compare automated and human scores, it does not allow analysis of agreement among multiple human raters. As a result, the present study cannot determine how the system compares with the variability that might arise across several experts.
Fourth, the assessment process relied on predefined analytical rubrics and on a fixed pedagogical interpretation of cognitive dimensions related to Bloom’s Taxonomy. This improves structure and transparency, but it also means that the resulting performance depends in part on the quality and appropriateness of the rubric design. Different rubric structures or alternative pedagogical interpretations could lead to somewhat different scoring outcomes.
Fifth, the experiment did not include a systematic re-run analysis under identical conditions. Such analysis would be useful for examining the repeatability and temporal stability of the automated scoring process more explicitly.
An additional limitation concerns the dependence of the framework on the quality of the retrieved contextual evidence. Since the evaluation is grounded in RAG-based retrieval, the adequacy of the final score may be influenced by the relevance, completeness, and selection quality of the retrieved learning materials. This means that the observed performance should not be interpreted as independent of the underlying instructional corpus.
From the perspective of fairness and transparency, the proposed framework is designed to support more controlled and traceable assessment by combining shared rubric criteria, structured prompts, retrieved course-specific context, and machine-readable output. Similar concerns regarding consistency and bias in LLM-based educational assessment have been discussed in recent work [9,13]. However, fairness and neutrality should not be treated as automatically guaranteed properties of the system. They remain dependent on several factors, including rubric quality, prompt design, retrieved context, language-specific variation, and the broader instructional setting.
For this reason, the present results support cautious optimism rather than unrestricted claims of general applicability. Future research should extend the experimental design to larger participant groups, multiple subject domains, and multi-rater settings. Additional studies should also examine repeated-run stability, broader language conditions, and the behavior of the system under different rubric configurations and assessment formats.

5. Conclusions

This article presented an AI-based framework for the automated scoring of open-ended questions in a Moodle-supported e-learning environment. The proposed approach combines large language models, Retrieval-Augmented Generation (RAG), analytical rubrics, and structured machine-readable outputs in order to support criterion-based and context-grounded assessment.
The study showed that LLM-based automated scoring can be made more controlled and pedagogically aligned when it is constrained by course-specific context, rubric-defined criteria, and a structured scoring workflow. In this sense, the contribution of the proposed framework lies not in the isolated use of LLMs, RAG, or rubrics, but in their integration into a reproducible end-to-end assessment architecture designed for practical educational use.
Within the tested setting, the experimental results indicated a strong level of agreement between AI-generated and expert-assigned scores. The obtained values for QWK, ICC, Pearson correlation, MAE, RMSE, and bias suggest that the system was able to reproduce the general structure of expert scoring with a substantial degree of consistency under the conditions of the present study. At the same time, variation across language groups and task types showed that performance was not uniform across all conditions and should therefore be interpreted within the scope of the experiment rather than as evidence of universal generalizability.
A further contribution of the framework is its emphasis on traceability and methodological control. The use of rubric-guided scoring, retrieved contextual evidence, structured prompts, and JSON-based outputs supports greater transparency of the assessment process and facilitates later inspection, comparison, and analysis. These features make the system suitable not only for score generation, but also for more detailed criterion-level interpretation of student performance.
The findings also indicate that the proposed workflow has practical value in digital learning environments. By automating substantial parts of the scoring process and returning structured results to the LMS, the system can reduce manual workload and support faster feedback cycles. However, the present results also suggest that fully automated assessment should be applied with caution, especially in cases involving more interpretive responses, broader disciplinary variation, or high-stakes educational decisions.
For these reasons, the most appropriate direction for future development is not the complete replacement of human judgment, but the refinement of hybrid assessment models in which automated AI-based scoring supports, rather than fully displaces, expert pedagogical evaluation. Future work should therefore examine larger datasets, multiple subject domains, repeated-run stability, and multi-rater comparison in order to better assess the robustness and transferability of the proposed methodology.

Author Contributions

All authors were involved in the full process of producing this paper, including conceptualization, methodology, modeling, validation, visualization, and preparing the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was accomplished with the financial support of the European Regional Development Fund within the Operational Programme “Bulgarian national recovery and resilience plan”, procedure for direct provision of grants “Establishing of a network of research higher education institutions in Bulgaria”, under Project BG-RRP-2.004-0005 “Improving the research capacity anD quality to achieve intErnAtional recognition and reSilience of TU-Sofia (IDEAS)”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pecuchova, J.; Benko, Ľ.; Drlik, M. Automated Grading of Open-Ended Questions in Higher Education Using GenAI Models. Int. J. Artif. Intell. Educ. 2025, 35, 3813–3846.
  2. Jauhiainen, J.; Guerra, A.G. Evaluating Students’ Open-Ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large. Adv. Artif. Intell. Mach. Learn. 2024, 4, 3097–3113.
  3. Tang, X.; Chen, H.; Lin, D.; Li, K. Harnessing LLMs for Multi-Dimensional Writing Assessment: Reliability and Alignment with Human Judgments. Heliyon 2024, 10, e34262.
  4. Yeung, S.A. Comparative Study of Rule-Based, Machine Learning and Large Language Model Approaches in Automated Writing Evaluation (AWE). In Proceedings of the 15th International Learning Analytics and Knowledge Conference (LAK’25), Dublin, Ireland, 3–7 March 2025; pp. 984–991.
  5. Lan, G.; Li, Y.; Yang, J.; He, X. Investigating a customized generative AI chatbot for automated essay scoring in a disciplinary writing task. Assess. Writ. 2025, 66, 100959.
  6. Grévisse, C. LLM-based automatic short answer grading in undergraduate medical education. BMC Med. Educ. 2024, 24, 1060.
  7. Latif, E.; Zhai, X. Fine-tuning ChatGPT for automatic scoring. Comput. Educ. Artif. Intell. 2024, 6, 100210.
  8. Xu, J.; Liu, J.; Lin, M.; Lin, J.; Yu, S.; Zhao, L.; Shen, J. EPCTS: Enhanced Prompt-Aware Cross-Prompt Essay Trait Scoring. Neurocomputing 2025, 621, 129283.
  9. Mendonça, P.C.; Quintal, F.; Mendonça, F. Evaluating LLMs for Automated Scoring in Formative Assessments. Appl. Sci. 2025, 15, 2787.
  10. Qiu, H.; White, B.; Ding, A.; Costa, R.; Hachem, A.; Ding, W.; Chen, P. SteLLA: A Structured Grading System Using LLMs with RAG. arXiv 2025, arXiv:2501.09092.
  11. Chu, S.; Kim, J.; Wong, B.; Yi, M. Rationale Behind Essay Scores: Enhancing S-LLM’s Multi-Trait Essay Scoring with Rationale Generated by LLMs. arXiv 2025, arXiv:2410.14202.
  12. Seßler, K.; Fürstenberg, M.; Bühler, B.; Kasneci, E. Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring. In Proceedings of the 15th International Learning Analytics and Knowledge Conference, Dublin, Ireland, 3–7 March 2025; pp. 462–472.
  13. Papachristou, I.; Dimitroulakos, G.; Vassilakis, C. Automated Test Generation and Marking Using LLMs. Electronics 2025, 14, 2835.
  14. Emirtekin, E. Large Language Model-Powered Automated Assessment: A Systematic Review. Appl. Sci. 2025, 15, 5683.
  15. Gao, R.; Merzdorf, H.E.; Anwar, S.; Hipwell, M.C.; Srinivasa, A.R. Automatic assessment of text-based responses in post-secondary education: A systematic review. Comput. Educ. Artif. Intell. 2024, 6, 100206.
  16. Zlatkin-Troitschanskaia, O.; Fischer, J.; Braun, H.I.; Shavelson, R.J. Advantages and challenges of performance assessment of student learning in higher education. In International Encyclopedia of Education, 4th ed.; Elsevier: Amsterdam, The Netherlands, 2023; pp. 312–330.
  17. Sun, J.; Song, T.; Peng, W.; Song, J. A Survey of Automated Essay Scoring: Challenges, Advances, and Future. Neurocomputing 2025, 650, 130916.
  18. Dikli, S. An Overview of Automated Scoring of Essays. J. Technol. Learn. Assess. 2006, 5. Available online: https://ejournals.bc.edu/index.php/jtla/article/view/1640/1489 (accessed on 3 March 2025).
  19. Fateen, M.; Wang, B.; Mine, T. Beyond Scores: A Modular RAG-Based System for Automatic Short Answer Scoring with Feedback. IEEE Access 2024, 12, 185371–185385.
  20. Zhuang, M.; Long, S.; Martin, F.; Castellanos-Reyes, D. The affordances of Artificial Intelligence (AI) and ethical considerations across the instruction cycle: A systematic review of AI in online higher education. Internet High. Educ. 2025, 67, 101039.
  21. Sychev, O.; Anikin, A.; Prokudin, A. Automatic Grading and Hinting in Open-Ended Text Questions. Cogn. Syst. Res. 2020, 59, 264–272.
  22. Aydın, B.; Kışla, T.; Elmas, N.T.; Bulut, O. Automated Scoring in the Era of Artificial Intelligence: An Empirical Study with Turkish Essays. System 2025, 133, 103784.
  23. Stephen, T.C.; Gierl, M.C.; King, S. Automated Essay Scoring (AES) of Constructed Responses in Nursing Examinations: An Evaluation. Nurse Educ. Pract. 2021, 54, 103085.
  24. Jung, J.Y.; Tyack, L.; von Davier, M. Towards the implementation of automated scoring in international large-scale assessments: Scalability and quality control. Comput. Educ. Artif. Intell. 2025, 8, 100375.
  25. Mizumoto, A.; Eguchi, M. Exploring the Potential of Using an AI Language Model for Automated Essay Scoring. Res. Methods Appl. Linguist. 2023, 2, 100050.
  26. Pack, A.; Barrett, A.; Escalante, J. Large Language Models and Automated Essay Scoring of English Language Learner Writing: Insights into Validity and Reliability. Comput. Educ. Artif. Intell. 2024, 6, 100234.
  27. Birla, N.; Jain, M.K.; Panwar, A. Automated Assessment of Subjective Assignments: A Hybrid Approach. Expert Syst. Appl. 2022, 203, 117315.
  28. Li, X.; Chen, M.; Nie, J.-Y. SEDNN: Shared and enhanced deep neural network model for cross-prompt automated essay scoring. Knowl.-Based Syst. 2020, 210, 106491.
  29. Wang, Q. A Multifaceted Architecture to Automate Essay Scoring for Assessing English Article Writing: Integrating Semantic, Thematic, and Linguistic Representations. Comput. Electr. Eng. 2024, 118, 109308.
  30. Bonthu, S.; Rama Sree, S.; Krishna Prasad, M.H.M. Improving the performance of automatic short answer grading using transfer learning and augmentation. Eng. Appl. Artif. Intell. 2023, 123, 106292.
  31. Tan, L.Y.; Hu, S.; Yeo, D.J.; Cheong, K.H. A Comprehensive Review on Automated Grading Systems in STEM Using AI Techniques. Math 2025, 13, 2828.
  32. Meyer, J.; Jansen, T.; Schiller, R.; Liebenow, L.W.; Steinbach, M.; Horbach, A.; Fleckenstein, J. Using LLMs to Bring Evidence-Based Feedback into the Classroom: AI-Generated Feedback Increases Secondary Students’ Text Revision, Motivation, and Positive Emotions. Comput. Educ. Artif. Intell. 2024, 6, 100199.
  33. Quah, B.; Zheng, L.; Sng, T.J.H.; Yong, C.W.; Islam, I. Reliability of ChatGPT in automated essay scoring for dental undergraduate examinations. BMC Med. Educ. 2024, 24, 962.
  34. Zhao, X. A Hybrid Deep Learning and Fuzzy Logic Framework for Feature-Based Evaluation of English Language Learners. Sci. Rep. 2025, 15, 33657.
  35. He, X.; Xiao, X.; Fang, J.; Li, Y.; Li, Y.; Zhou, R. Exercise-Aware higher-order Thinking skills Assessment via fine-tuned large language model. Knowl.-Based Syst. 2025, 324, 113808.
  36. Firoozi, T.; Bulut, O.; Gierl, M. Language models in automated essay scoring: Insights for the Turkish language. Int. J. Assess. Tools Educ. 2023, 10, 149–163.
  37. Johnsi, R.; Kumar, G.B. Enhancing automated essay scoring by leveraging LSTM networks with hyper-parameter tuned word embeddings and fine-tuned LLMs. Eng. Res. Express 2025, 7, 025272.
  38. Córdova-Esparza, D.-M. AI-Powered Educational Agents: Opportunities, Innovations, and Ethical Challenges. Information 2025, 16, 469.
  39. Tyndall, E.; Gayheart, C.; Some, A.; Genz, J.; Wagner, T.; Langhals, B. Impact of retrieval augmented generation and large language model complexity on undergraduate exams created and taken by AI agents. Data Policy 2025, 7, e57.
  40. Kinder, A.; Briese, F.J.; Jacobs, M.; Dern, N.; Glodny, N.; Jacobs, S.; Leßmann, S. Effects of adaptive feedback generated by a large language model: A case study in teacher education. Comput. Educ. Artif. Intell. 2025, 8, 100349.
  41. Villegas-Ch, W.; Gutierrez, R.; García-Ortiz, J.; Guevara, V. Explainable educational assistant integrated in Moodle: Automated semantic assessment and adaptive tutoring based on NLP and XAI. Discov. Artif. Intell. 2025, 5, 191.
  42. Oğuz, E. Can Generative AI Figure Out Figurative Language? The Influence of Idioms on Essay Scoring by ChatGPT, Gemini, and Deepseek. Assess. Writ. 2025, 66, 100981.
  43. Morris, W.; Crossley, S.; Holmes, L.; Ou, C.; Dascalu, M.; McNamara, D. Formative Feedback on Student-Authored Summaries in Intelligent Textbooks Using Large Language Models. Int. J. Artif. Intell. Educ. 2025, 35, 1022–1043.
  44. Koo, T.K.; Li, M.Y. A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. J. Chiropr. Med. 2016, 15, 155–163.
  45. Cisneros-González, J.; Gordo-Herrera, N.; Barcia-Santos, I.; Sánchez-Soriano, J. JorGPT: Instructor-Aided Grading of Programming Assignments with Large Language Models (LLMs). Future Internet 2025, 17, 265.
  46. Ferreira Mello, R.; Pereira Junior, C.; Rodrigues, L.; Pereira, F.D.; Cabral, L.; Costa, N.; Ramalho, G.; Gasevic, D. Automatic Short Answer Grading in the LLM Era: Does GPT-4 with Prompt Engineering beat Traditional Models? In Proceedings of the 15th International Learning Analytics and Knowledge Conference, Dublin, Ireland, 3–7 March 2025; pp. 93–103.
  47. Cipriano, E.; Ferrato, A.; Limongelli, C.; Schicchi, D.; Taibi, D. Leveraging Large Language Models to Assist Teachers in Code Grading. In Artificial Intelligence in Education; Cristea, A.I., Walker, E., Lu, Y., Santos, O.C., Isotani, S., Eds.; Springer Nature: Cham, Switzerland, 2025; Volume 15880, pp. 204–217.
  48. Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174.
Figure 1. Methodological scheme of the automated evaluation process of open-ended questions.
Figure 2. Cognitive architecture of the AI layer for automated scoring.
Figure 3. AI agent with an integrated model and tools for logical and arithmetic verification. * gpt-4.1-mini-2025-04-14.
Figure 4. Example of automated assessment output with criterion-level scoring and Bloom-related cognitive classification.
Figure 5. Relationship between expert and automated ratings (unrounded values, scale 2–6).
Figure 6. Distribution of errors between AI and expert ratings (unrounded values).
Table 1. QWK values for AI-expert agreement by language group.

Group        QWK      Interpretation
All          0.806    Almost perfect
Bulgarian    0.851    Almost perfect
English      0.765    Substantial
Table 2. Summary of the main overall indicators of agreement, correlation, error, and reliability.

Metric       Result   Meaning
QWK          0.806    High agreement
Pearson r    0.836    Strong linear relationship
MAE_raw      0.453    Low average deviation
ICC          0.868    High reliability
Table 3. Error, correlation and bias indicators by group.

Group        N      QWK      MAE_Raw   Pearson_Raw   MAE_Points   Bias_Raw
All          160    0.806    0.453     0.836         2.013        −0.011
Bulgarian    65     0.851    0.387     0.882         1.831        −0.071
English      95     0.765    0.499     0.796         2.137        0.030
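The indicators in Tables 1–3 are standard paired-score agreement statistics and can be recomputed from lists of AI and expert grades. The sketch below is illustrative Python, not the authors' implementation; the function names and the rounding of scores to the integer 2–6 scale before computing kappa are our own assumptions (quadratic-weighted kappa requires discrete categories, while MAE and bias use the unrounded values).

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, min_rating=2, max_rating=6):
    """Cohen's kappa with quadratic weights for integer grades
    on the [min_rating, max_rating] scale (here the 2-6 grade scale)."""
    k = max_rating - min_rating + 1
    n = len(rater_a)
    # observed co-occurrence matrix of the two raters' grades
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1
    hist_a, hist_b = Counter(rater_a), Counter(rater_b)
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2        # quadratic penalty
            expected = hist_a[i + min_rating] * hist_b[j + min_rating] / n
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den                              # 1 = perfect agreement

def mae(expert, ai):
    """Mean absolute deviation between unrounded AI and expert scores."""
    return sum(abs(x - y) for x, y in zip(ai, expert)) / len(expert)

def bias(expert, ai):
    """Signed mean difference; positive means the AI scores higher."""
    return sum(x - y for x, y in zip(ai, expert)) / len(expert)
```

Pearson r on the unrounded scores can be obtained with `statistics.correlation` (Python 3.10+), and ICC with a dedicated statistics package.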
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
