Data Descriptor

A Structured Dataset for Automated Grading: From Raw Data to Processed Dataset

by Ibidapo Dare Dada 1,2,*, Adio T. Akinwale 2 and Ti-Jesu Tunde-Adeleke 1
1 Department of Computer and Information Science, Covenant University, P.M.B. 1023, Ota 112104, Ogun State, Nigeria
2 Department of Computer Science, Federal University of Agriculture, P.M.B. 2240, Abeokuta 111101, Ogun State, Nigeria
* Author to whom correspondence should be addressed.
Data 2025, 10(6), 87; https://doi.org/10.3390/data10060087
Submission received: 29 January 2025 / Revised: 26 March 2025 / Accepted: 26 March 2025 / Published: 6 June 2025

Abstract

The increasing volume of student assessments, particularly open-ended responses, presents a significant challenge for educators in ensuring grading accuracy, consistency, and efficiency. This paper presents a structured dataset designed for the development and evaluation of automated grading systems in higher education. The primary objective is to create a high-quality dataset that facilitates the development and evaluation of natural language processing (NLP) models for automated grading. The dataset comprises student responses to open-ended questions from the Management Information Systems (MIS221) and Project Management (MIS415) courses at Covenant University, collected during the 2022/2023 academic session. The responses were originally handwritten, scanned, and transcribed into Word documents. Each response is paired with corresponding scores assigned by human graders, following a detailed marking guide. To assess the dataset’s potential for automated grading applications, several machine learning and transformer-based models were tested, including TF-IDF with Linear Regression, TF-IDF with Cosine Similarity, BERT, SBERT, RoBERTa, and Longformer. The experimental results demonstrate that transformer-based models outperform traditional methods, with Longformer achieving the highest Spearman’s Correlation of 0.77 and the lowest Mean Squared Error (MSE) of 0.04, indicating a strong alignment between model predictions and human grading. The findings highlight the effectiveness of deep learning models in capturing the semantic and contextual meaning of both student responses and marking guides, making it possible to develop more scalable and reliable automated grading solutions. This dataset offers valuable insights into student performance and serves as a foundational resource for integrating educational technology into automated assessment systems. Future work will focus on enhancing grading consistency and expanding the dataset for broader academic applications.
Dataset License: None

1. Introduction

Academic assessments often include a variety of question types, such as multiple-choice and free-response questions, which are critical for evaluating students’ learning abilities and comprehension skills [1,2]. Existing studies have demonstrated the potential of learning analytics in understanding student interaction and engagement with teachers [3]. For instance, a learning analytic approach has been employed to model student-staff interactions based on students’ perceptions of engagement, highlighting the importance of data-driven insights in improving educational practices [4]. Similarly, automated grading systems can enhance the assessment process by providing instant and precise feedback to students, thereby improving learning outcomes and engagement.
In contemporary educational systems, the increasing volume of student assessments, particularly those involving open-ended responses, poses significant challenges for educators in ensuring grading accuracy, efficiency, and consistency [1,5]. Automated grading systems have emerged as a promising solution to these challenges, leveraging datasets of student responses and corresponding scores to train machine learning models capable of evaluating answers at scale. By integrating learning analytics with AI-driven assessment tools, educational institutions can optimize grading processes while simultaneously gaining deeper insights into student engagement and learning behaviors [4].
Automated Essay Scoring (AES) was first conceptualized in 1966 by E.B. Page, who introduced computer-aided grading systems [6]. AES employs natural language processing (NLP) technologies to evaluate and score student essays at scale, significantly reducing educators’ workload [7]. However, grading essays accurately and consistently, especially without bias, remains a challenge for educators in all fields of education. The integration of Artificial Intelligence (AI) into education has transformed traditional practices over the past six decades. AI’s primary role in education is to streamline processes and enhance learning experiences [8]. As [8] suggests, AI literacy is now critical for navigating academic, professional, and societal landscapes.
Existing research on automated grading systems has predominantly utilized datasets such as the SemEval-2013 Beetle, SciEntsBank, and Mohler datasets [9,10,11], which have been instrumental in developing machine and deep learning models for grading tasks. These datasets provide valuable benchmarks for evaluating the performance of automated grading systems; however, they are often limited in scope and lack the diversity required to generalize across various academic disciplines and question formats.
This paper introduces a novel structured dataset of student responses collected from the Management Information Systems (MIS221) and Project Management (MIS415) courses at Covenant University during the 2022/2023 academic session. The dataset comprises two primary components: (1) raw student responses and their respective scores, and (2) marking guidelines that serve as a standardized benchmark for assessing student answers. By providing a comprehensive dataset that pairs student answers with detailed scoring criteria, this work aims to advance the field of educational technology by offering a foundation for developing automated grading systems capable of delivering consistent and objective evaluations.
Automated grading has gained significant traction, particularly in higher education, where timely and accurate feedback is essential for improving learning outcomes. Traditional grading methods are often subjective and susceptible to biases and inconsistencies [2], particularly in the case of open-ended questions, where there are multiple valid interpretations. The dataset presented in this study addresses these challenges by enabling the development of machine learning models that can replicate human grading standards with greater reliability and scalability. To maintain the smooth flow of the main text, examples of unprocessed student responses, the corresponding question paper, and the marking guidelines can be found in Appendix A (Figure A1, Figure A2 and Figure A3).
The primary objective of this study is to develop a structured dataset that supports the advancement of automated grading systems for free-response questions. This dataset includes student responses, human-assigned scores, and detailed marking guides, ensuring a standardized evaluation framework. Through extensive experimentation, this research assesses the effectiveness of different models in replicating human grading patterns, providing insights into the potential of deep learning for scalable and reliable educational assessment solutions.
This paper details the collection, preprocessing, and analysis of the dataset, highlighting its potential applications in automated grading systems. Additionally, we discuss the inherent challenges of grading open-ended responses and how this dataset can mitigate these challenges by improving grading consistency and reducing the time required for manual assessment. Through this contribution, we aim to enhance the development of AI-driven solutions for educational assessment, ultimately fostering a more efficient and equitable grading process.

2. Related Works

One of the state-of-the-art techniques that has revolutionized natural language processing (NLP) is BERT (Bidirectional Encoder Representations from Transformers) [12], which was introduced by Google in 2018. BERT interprets the meaning of a word by considering its surrounding context, both preceding and following the word. Additionally, its Masked Language Modeling (MLM) pre-training allows BERT to deeply learn the structure of language and the relationships between words, enabling it to handle unseen expressions with greater semantic understanding. These features make BERT particularly well suited for short-answer grading tasks, where precise semantic interpretation is essential. However, because BERT can only handle inputs of up to 512 tokens, longer responses must be truncated or processed with long-context models.
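To make this input-length constraint concrete, the short sketch below checks whether a transcribed answer fits within BERT’s 512-token window; it assumes the Hugging Face transformers library and an illustrative answer string, neither of which is prescribed by the studies discussed here.

from transformers import AutoTokenizer

# Illustrative check of BERT's 512-token input limit (model name is an assumption).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
student_answer = "A management information system integrates people, processes, and technology to support decision making."
token_ids = tokenizer.encode(student_answer, add_special_tokens=True)

if len(token_ids) > tokenizer.model_max_length:  # 512 for bert-base-uncased
    # Longer answers must be truncated or routed to a long-context model such as Longformer.
    token_ids = token_ids[:tokenizer.model_max_length]
print(len(token_ids), "tokens")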
Building on the foundational capabilities of BERT, various studies have explored its potential in Automated Short-Answer Grading (ASAG). For instance, Sung et al. [13] fine-tuned BERT using domain-specific resources, including textbooks on topics such as psychology and the American government. Their results demonstrated that task-specific fine-tuning significantly improved the model’s performance in ASAG, highlighting the value of tailoring BERT to specific subject domains. Condor et al. [14] compared Sentence-BERT (SBERT), a variant of BERT optimized for sentence-level tasks, to traditional approaches like Word2Vec and Bag-of-Words. Their findings revealed that SBERT-based models outperformed traditional methods, reinforcing the advantages of contextualized embeddings for grading short answers. Alikaniotis et al. [15] proposed a novel deep neural network model for essay scoring, utilizing Kaggle’s ASAP dataset. The model incorporated Score-Specific Word Embeddings (SSWE) and Long Short-Term Memory (LSTM) networks to achieve human-like scoring without relying on explicit grammar rules or domain knowledge. Lei et al. [16] introduced a Bi-GRU architecture with a Siamese structure using pre-trained ALBERT. They transformed input expressions into word vectors via ALBERT and processed them through a Gated Recurrent Unit (GRU) network. An attention layer was added to enhance semantic interpretation, and the results were normalized using a softmax function. Their model outperformed traditional approaches, showcasing the potential of ALBERT in ASAG. Zhu et al. [17] proposed a four-stage ASAG framework leveraging BERT. They encoded student responses and reference texts with BERT and then used a Bi-directional Long Short-Term Memory (Bi-LSTM) network to enhance semantic understanding. The outputs were further refined with a Semantic Fusion Layer and processed using a max-pooling technique in the final prediction stage. Evaluations using the Mohler and SemEval datasets demonstrated impressive results, achieving an accuracy of 76.5% for unseen answers, 69.2% for unseen domains, and 66.0% for unseen questions. On the Mohler dataset, the model achieved a Root Mean Square Error (RMSE) of 0.248 and a Pearson Correlation Coefficient (R) of 0.89. Sayeed et al. [18] proposed a Siamese architecture leveraging RoBERTa-based bi-encoder transformer models for ASAG. By focusing on both student and reference responses, the model effectively evaluated descriptive answers. Trained on the SemEval-2013 2-way dataset, their approach demonstrated performance either equivalent or superior to other benchmark models, while being tailored for computational efficiency.
Several studies have addressed ASAG in Arabic. Ouahrani and Bennouar [19] introduced AR-ASAG, an Arabic dataset designed for automatic short-answer grading evaluation, and proposed the COALS (Correlated Occurrence Analogue to Lexical Semantics) algorithm for grading. Their experiments on the AR-ASAG dataset, which contains 2133 (Model Answer, Student Answer) pairs distributed in several formats (.txt, .xml, Moodle.xml, and .db), showed promising results for the Arabic language. Badry et al. [20] proposed an Automatic Arabic Short-Answer Grading model based on Latent Semantic Analysis (LSA), one of the most widely used corpus-based similarity techniques. The model was applied to AR-ASAG, a publicly available Arabic dataset with limited resources, and achieved an F1-score of 82.82% and an RMSE value of 0.798, surpassing the performance of previous related works. Salam et al. [21] applied deep learning and machine learning techniques to measure the semantic resemblance between student answers and the model answer for a particular question and to assign grades automatically. Nael et al. [22] proposed AraScore, conducting empirical studies with a baseline model, RNN, LSTM, and Bi-LSTM architectures, as well as two transformer-based language models, BERT and ELECTRA; their best system achieved a Quadratic Weighted Kappa (QWK) score of 0.78 using ELECTRA.
These studies collectively highlight the adaptability of transformer-based models, such as BERT, SBERT, and RoBERTa, in ASAG. They underscore the importance of leveraging contextual embeddings, domain-specific fine-tuning, and hybrid architectures to improve the grading of short and descriptive answers. These advancements pave the way for more accurate, efficient, and scalable automated grading systems.

Existing Datasets

The effectiveness of automated grading models relies not only on the development of advanced machine learning techniques, but also on the quality and structure of the datasets used for training and evaluation. Datasets must exhibit key properties, such as construct validity, inter-rater reliability, and generalizability, to ensure robust and fair grading systems [11]. A dataset is broadly defined as a collection of related observations, organized and formatted for a specific purpose [23]. This encompasses images, texts, or documents curated to support a particular task. Several established datasets have been instrumental in advancing automated grading systems:
  • The Beetle Dataset: Designed for training and testing models on two-way and three-way classification tasks. It includes 47 unique questions and 3941 student responses in the training set. The test set is split into two parts: “Unseen Answers” (47 questions, 439 responses) and “Unseen Questions” (9 new questions, 819 responses) [10].
  • The SciEntsBank Dataset: Supports multiple classification tasks (two-way, three-way, and five-way). The training set contains 135 questions and 4969 responses. The test set includes three segments: “Unseen Answers” (135 questions, 540 responses), “Unseen Domains” (46 questions, 4562 responses), and “Unseen Questions” (15 questions, 733 responses) [11].
  • The Mohler Dataset: Comprises 79 questions and 2273 student answers graded by two educators on a 0–5 scale. The dataset provides individual scores, as well as their averages, facilitating the analysis of inter-rater reliability [11].
  • The AR-ASAG Dataset [24]: Created to support short-answer grading in Arabic, it is the first publicly available dataset of its kind. It contains questions on cybercrime and computer science, with responses collected from three categories of master’s students who were native Arabic speakers.
This study contributes to this growing field by introducing a structured dataset designed for training and evaluating machine learning models in automated assessment tasks. The dataset follows the principles of construct validity by providing clear grading rubrics, ensures inter-rater reliability across multiple graders, and supports generalizability by covering diverse student responses.

3. Methods

This section outlines the steps involved in collecting, processing, and structuring the dataset for automated grading systems. The workflow of the methodology is illustrated in Figure 1, which provides a visual representation of the data pipeline, from collection to analysis. The process begins with data collection, followed by the grading process, where student responses are evaluated using predefined marking guidelines. The graded responses are then processed and structured into a standardized format. The dataset undergoes analysis and findings, focusing on key aspects such as student performance trends, token length analysis, and automated grading potential. The final step involves experimental evaluation to assess the effectiveness of various machine learning models for automated grading.

3.1. Data Collection

The dataset was collected from students who enrolled in the MIS221 (Introduction to Management Information Systems) and MIS415 (Project Management) courses during the 2022/2023 academic session at Covenant University, Nigeria. As part of their final assessment, students were required to respond to open-ended questions designed to test their knowledge. The students provided handwritten responses, which were collected and scanned using a high-resolution scanner (HP ScanJet Pro 2500 f1 Scanner). Data entry personnel carefully reviewed the scanned images and each student’s answer was manually transcribed into Microsoft Word documents. This step was essential to ensure that all responses were accurately captured before being processed into a structured dataset.

3.2. Grading Process

Each student’s answer was graded by experienced educators (graders) using a predefined marking guide, followed by a vetting process where a second reviewer (vetter) validated the assigned scores. This multi-step grading process adhered to standardized evaluation criteria to ensure consistency across students and questions. Scores were manually assigned based on the comprehensiveness, accuracy, and relevance of the responses. These scores were then recorded alongside the corresponding answers. The grading process relied on the Question Marking Guide, which served as the benchmark for evaluating student performance. This guide provided detailed instructions for each question, specifying the key elements required in a high-quality response. Additionally, the guide included the maximum achievable score (referred to as the GUIDE SCORE) for each question, ensuring that all graders adhered to a uniform and objective assessment process.

3.3. Data Structure and Formatting

The dataset was organized into two primary sets, each containing specific information necessary for training and evaluating automated grading systems.
  • Set 1: Student Responses and Scores
This set contains the transcribed responses from students, along with their corresponding scores. Each row represents an individual student’s answer to a specific question, providing data points for both the raw text (the student’s answer) and the associated score assigned by human graders.
  • Set 2: Question Marking Guide
This sheet includes the questions asked during the assessment and the corresponding marking guide. It provides a clear framework for evaluating student responses, detailing what constitutes a complete and correct answer for each question. The data were processed to remove any personally identifiable information (PII), ensuring that the dataset is fully anonymized and compliant with privacy guidelines. This step was crucial in enabling the safe use of the dataset for research purposes, including training machine learning models for automated grading.

3.4. Anonymization and Data Privacy

Given the sensitive nature of student responses, the dataset was fully anonymized before processing. All personally identifiable information, such as student names or identification numbers, was removed. The dataset only retains the text of student answers and the associated scores, ensuring compliance with data privacy regulations. This anonymization process allows researchers to use the dataset without compromising the privacy of the students involved.

4. Data Description

The dataset presented in this paper comprises responses from students enrolled in the MIS221 and MIS415 courses at Covenant University during the 2022/2023 academic session. The dataset includes student answers to open-ended questions and the corresponding scores assigned by human graders. It also includes a detailed marking guide for each question, which serves as the basis for the scoring. The dataset is structured across two primary sets, each playing a crucial role in creating the dataset for developing automated grading systems.
A. Student Responses and Scores (Set 1):
This set contains the raw student responses along with the scores assigned by human graders. The responses were originally handwritten by students during assessments and later scanned using a scanner. The scanned documents were transcribed into Word files by data entry personnel, ensuring that each student’s response was captured accurately. The following columns are included in this sheet:
UNIVERSITY: The institution where the data were collected.
COURSE: The course code.
SESSION: The academic session during which the data were collected, e.g., 2022/2023.
SEMESTER: The semester during which the course was taught.
COURSE TITLE: The title of the course.
QUESTION NO: The number of the question to which the student is responding.
STUDENT ANSWER: The actual answer provided by the student, scanned and then typed into the dataset.
STUDENT SCORE: The numerical score assigned by the human grader based on the student’s response and the corresponding marking guide.
This set provides the raw text data necessary for developing and testing automated grading algorithms. Each student’s response is linked to the corresponding question number, allowing for analysis of both individual and aggregated responses. The STUDENT SCORE serves as the ground truth for training machine learning models that predict the quality of student responses.
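For illustration, the following Python sketch loads Set 1 with pandas and checks for the columns listed above; the file name student_responses.xlsx is hypothetical.

import pandas as pd

# Hypothetical file containing Set 1 (student responses and scores).
responses = pd.read_excel("student_responses.xlsx")

expected_cols = ["UNIVERSITY", "COURSE", "SESSION", "SEMESTER", "COURSE TITLE",
                 "QUESTION NO", "STUDENT ANSWER", "STUDENT SCORE"]
missing = set(expected_cols) - set(responses.columns)
assert not missing, f"Missing columns: {missing}"

# Quick look at the ground-truth scores used to train grading models.
print(responses["STUDENT SCORE"].describe())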
B. Question Marking Guide (Set 2)
This set contains the detailed marking guidelines used to assess student responses. Each question is accompanied by a marking guide that explains what a high-quality or correct answer should include. The GUIDE SCORE column represents the maximum possible score that can be awarded for each question. The following columns are included in this sheet:
UNIVERSITY: Covenant University.
COURSE: The course code.
SESSION: The academic session.
SEMESTER: The semester in which the exam took place.
COURSE TITLE: The title of the course.
QUESTION NO: The question number for which the marking guide is provided.
QUESTION: The question itself, as presented to the students during the exam.
MARK GUIDE: A detailed explanation of the points that should be covered in a student’s response to achieve a high score. This includes key concepts, relevant arguments, and any other criteria necessary for a correct answer.
GUIDE SCORE: The maximum score that can be awarded for the question, based on the marking guide.
The Question Marking Guide set serves as a benchmark for evaluating the consistency and fairness of both human and machine grading. By providing a structured evaluation guide, it ensures that each response is assessed against a standardized set of criteria. This set is invaluable for training and testing automated grading models, as it defines the expected output for each question, making it easier to quantify the accuracy of predicted scores.
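As an illustration of how the two sets can be combined, the sketch below pairs each response with its marking guide and computes a normalized score (STUDENT SCORE divided by GUIDE SCORE), the prediction target later used in Section 5.3; the file names are hypothetical.

import pandas as pd

responses = pd.read_excel("student_responses.xlsx")  # Set 1 (hypothetical file name)
guides = pd.read_excel("marking_guide.xlsx")         # Set 2 (hypothetical file name)

# Pair each student answer with the marking guide for the same course and question.
paired = responses.merge(
    guides[["COURSE", "QUESTION NO", "QUESTION", "MARK GUIDE", "GUIDE SCORE"]],
    on=["COURSE", "QUESTION NO"],
    how="left",
)

# Normalized score in [0, 1], as used by the grading algorithms in Section 5.3.
paired["NORMALIZED SCORE"] = paired["STUDENT SCORE"] / paired["GUIDE SCORE"]
print(paired[["QUESTION NO", "STUDENT SCORE", "GUIDE SCORE", "NORMALIZED SCORE"]].head())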

Data Summary

Total Responses: The dataset includes about 3000 student responses, covering various types of open-ended questions. These responses reflect a diverse range of student abilities and knowledge, making the dataset ideal for training automated grading models that must handle a wide variety of answer styles and quality levels.
Question Diversity: The questions in the dataset range from theoretical to applied, requiring students to demonstrate both their understanding of key concepts and their ability to apply this knowledge in practical scenarios.
Scoring Distribution: The scores assigned by human graders provide insight into the distribution of student performance, offering a valuable resource for analyzing how different students respond to various types of questions.
The dataset is designed to be versatile, supporting a wide range of research in automated grading, text analysis, and educational assessment. By including both raw responses and a detailed marking guide, the dataset provides a comprehensive resource for developing models that aim to improve the accuracy and efficiency of grading open-ended responses.

5. Analysis and Findings

The analysis of the dataset involves multiple stages, each aimed at uncovering insights from the raw student responses and grading patterns. By leveraging this structured dataset, several key aspects of student performance, grading consistency, and automated grading model potential were examined. Below, we present the results and findings derived from various analyses, including performance trends, score distributions, and preliminary automated grading model outcomes.

5.1. Student Performance Trends

One of the primary goals of analyzing the dataset was to investigate how students performed across different questions and how these performances varied based on question type (theoretical vs. applied). The main goal of clustering is to identify items that naturally group together [25]. Theoretical questions typically ask students to explain concepts, definitions, principles, or theories, while applied questions require students to use their knowledge to solve a practical problem, analyze case studies, or perform calculations. To systematically differentiate between these two question types, we employed an unsupervised clustering approach using K-Means and TF-IDF vectorization. The classification process follows the algorithm (Algorithm 1) described below:
Algorithm 1: Unsupervised Clustering for Question Classification
Input: Dataset D with questions
Output: Dataset D' with theoretical/applied categories
1. Load dataset D into a DataFrame df
2. Initialize stopword list
3. For each question q in df['question']:
                      q_lower = convert q to lowercase
                      q_cleaned = remove punctuation from q_lower
                      tokens = tokenize q_cleaned
                      q_preprocessed = remove stopwords from tokens
4. Store preprocessed questions in df['cleaned_question']
5. Initialize TF-IDF Vectorizer with max_features = 1000
6. Fit and transform df['cleaned_question'] into matrix X (TF-IDF features)
7. Initialize K-Means with num_clusters = 2
8. Fit K-Means on matrix X to get cluster assignments C
9. For each cluster c:
           a. Extract top terms for the cluster centroid
           b. Manually label clusters as 'theoretical' or 'applied' based on top terms
10. Assign cluster labels (theoretical/applied) to df['category']
11. Return df
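A possible implementation of Algorithm 1 with scikit-learn and NLTK is sketched below; the input file name is hypothetical, and the final cluster-to-label mapping is a manual decision made after inspecting the printed top terms, as in step 9.

import string
import pandas as pd
from nltk.corpus import stopwords            # requires nltk.download("stopwords")
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def preprocess(question, stop_words):
    # Lowercase, strip punctuation, tokenize on whitespace, and drop stopwords (steps 3-4).
    q = str(question).lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(tok for tok in q.split() if tok not in stop_words)

df = pd.read_excel("marking_guide.xlsx")      # hypothetical file with a QUESTION column
stop_words = set(stopwords.words("english"))
df["cleaned_question"] = df["QUESTION"].apply(lambda q: preprocess(q, stop_words))

vectorizer = TfidfVectorizer(max_features=1000)            # step 5
X = vectorizer.fit_transform(df["cleaned_question"])       # step 6
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)  # step 7
df["cluster"] = kmeans.fit_predict(X)                      # step 8

# Step 9: inspect the top terms per centroid before assigning labels manually.
terms = vectorizer.get_feature_names_out()
for c, centroid in enumerate(kmeans.cluster_centers_):
    print(f"Cluster {c}:", [terms[i] for i in centroid.argsort()[::-1][:10]])

# Step 10: hypothetical mapping chosen after inspecting the top terms above.
df["category"] = df["cluster"].map({0: "theoretical", 1: "applied"})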

Comparative Analysis of Theoretical and Applied Question Performance

To analyze student performance trends across theoretical and applied questions, a comparative analysis was conducted by first segmenting the dataset into two categories: theoretical and applied. Questions were classified using an unsupervised clustering approach based on K-Means applied to TF-IDF feature representations. The classification results were manually validated by examining the top terms in each cluster to ensure an accurate distinction between theoretical and applied questions. Once categorized, the performance of students in each category was evaluated using key statistical metrics, including average scores, standard deviation, and maximum scores. The standard deviation was particularly useful in assessing the variability in student performance within each category, with higher values indicating greater dispersion in the scores. The results revealed that applied questions exhibited greater variability in scores, suggesting that students found it more challenging to apply their theoretical knowledge to practical scenarios. Conversely, theoretical questions had slightly higher average scores, indicating that students performed better when recalling conceptual knowledge. These findings are summarized in Table 1.
The results indicate that students demonstrated stronger performance on theoretical questions, which primarily test knowledge recall and conceptual understanding. In contrast, applied questions proved to be more difficult, with higher variability in performance. This suggests that while students may grasp theoretical concepts, applying them to real-world scenarios remains a challenge. The observed variability highlights the potential need for additional instructional support and more practical exercises to help students strengthen their ability to apply theoretical knowledge effectively. Addressing these challenges through targeted instructional interventions can better prepare students for assessments that require analytical thinking and problem-solving.
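The per-category statistics summarized in Table 1 can be reproduced with a simple pandas aggregation, sketched below on illustrative data; in practice the category column comes from Algorithm 1 and the scores from Set 1.

import pandas as pd

# Illustrative rows; the real data hold one row per graded response.
scores = pd.DataFrame({
    "category": ["theoretical", "applied", "applied", "theoretical"],
    "STUDENT SCORE": [2.0, 1.0, 3.5, 1.5],
})

summary = (
    scores.groupby("category")["STUDENT SCORE"]
    .agg(average_score="mean", standard_deviation="std", max_score="max")
    .round(2)
)
print(summary)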

5.2. Token Length Analysis

Token lengths vary markedly across the datasets, as shown in Table 2. The longest token counts for the MIS415 Student Answer (527) and MIS415 Mark Guide (515) datasets are considerably higher than those for the MIS221 Student Answer (275) and MIS221 Mark Guide (302) datasets. In the MIS415 Student Answer and MIS221 Student Answer datasets, the shortest token count is 1, indicating the presence of minimal or single-word responses. In contrast, the shortest token count in the MIS221 Mark Guide is 10, reflecting a higher minimum level of structure and content complexity.
The token length analysis highlights significant differences in the complexity and structure of the datasets. These findings underline the importance of tailoring preprocessing and model design to handle both concise and elaborate content effectively.
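For illustration, the sketch below computes the longest and shortest token counts for one of the answer files; it uses simple whitespace tokenization, which is an assumption since the exact tokenizer is not specified, and the file name is hypothetical.

import pandas as pd

responses = pd.read_excel("student_responses.xlsx")  # hypothetical file name
token_counts = responses["STUDENT ANSWER"].astype(str).str.split().str.len()

print("Longest token count:", token_counts.max())
print("Shortest token count:", token_counts.min())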

5.3. Automated Grading Potential

Preliminary experiments were conducted to explore the feasibility of using the dataset to train an automated grading system. The goal was to leverage machine learning models that can predict student scores based on their written responses. This approach aims to assist human graders by automating the grading process for open-ended responses, thereby improving grading efficiency and consistency.
Several natural language processing (NLP) models were tested to assess their ability to predict student scores. The models ranged from traditional text representation techniques, such as TF-IDF (Term Frequency–Inverse Document Frequency), to more advanced language models, including BERT [12], SBERT [14], RoBERTa [26], and Longformer [27]. In particular, we compared short-context models (BERT [12], SBERT [14], and RoBERTa [26]) against a long-context model (Longformer [27]).
Three preliminary algorithms (Algorithms 2–4) were implemented for automated grading, each utilizing a different text representation approach. The first approach, TF-IDF with Linear Regression, represents student responses numerically and applies Linear Regression to predict scores. The second approach, TF-IDF with Cosine Similarity, represents student responses and marking guides as TF-IDF vectors and then measures similarity using Cosine Similarity. The third approach, BERT with Cosine Similarity, uses pre-trained BERT embeddings to capture semantic meaning and applies Cosine Similarity to measure response relevance. Additionally, transformer-based models, such as BERT, SBERT, RoBERTa, and Longformer, were evaluated for their effectiveness in grading student responses accurately.
Algorithm 2: TF-IDF with Linear Regression
Input: Dataset D with student answers, marking guides, student scores, and guide scores
Output: Predicted normalized scores for student responses
1. For each student answer q and marking guide g in D:
           Replace missing values with empty strings
           Convert q and g to lowercase and remove punctuation
       Store preprocessed text in q’ and g’
2. For each q’ and g’:
           Combine q’ and g’ into a single input T
       Store T for each student response
3. For each response:
           Calculate the normalized score S_norm = Student Score/Guide Score
       Store S_norm for each response
4. Apply TF-IDF vectorization to T to generate matrix X
5. Split X and S_norm into training and test sets:
           (X_train, X_test, y_train, y_test)
6. Train Linear Regression model M on (X_train, y_train)
7. Use model M to predict normalized scores on X_test:
           y_pred = M.predict(X_test)
8. Evaluate the model:
           Calculate MSE between y_test and y_pred
           Calculate Spearman’s Rank Correlation between y_test and y_pred
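A minimal scikit-learn sketch of Algorithm 2 is shown below; the paired input file and its column names are hypothetical, and the 80/20 split mirrors the setup described in Section 6.1.

import string
import pandas as pd
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def clean(text):
    # Step 1: lowercase and remove punctuation.
    return str(text).lower().translate(str.maketrans("", "", string.punctuation))

df = pd.read_excel("paired_dataset.xlsx")               # hypothetical paired file
df[["STUDENT ANSWER", "MARK GUIDE"]] = df[["STUDENT ANSWER", "MARK GUIDE"]].fillna("")
combined = df["STUDENT ANSWER"].map(clean) + " " + df["MARK GUIDE"].map(clean)  # step 2
y = df["STUDENT SCORE"] / df["GUIDE SCORE"]             # step 3: normalized scores

X = TfidfVectorizer().fit_transform(combined)           # step 4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # step 5

model = LinearRegression().fit(X_train, y_train)        # step 6
y_pred = model.predict(X_test)                          # step 7

rho, _ = spearmanr(y_test, y_pred)                      # step 8: evaluation
print("MSE:", mean_squared_error(y_test, y_pred))
print("Spearman's Rank Correlation:", rho)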
Algorithm 3: TF-IDF with Cosine Similarity
Input: Dataset D with student answers, marking guides, student scores, and guide scores
Output: Predicted normalized scores for student responses
1. For each student answer q and marking guide g in D:
           Replace missing values with empty strings
           Convert q and g to lowercase and remove punctuation
       Store preprocessed text in q’ and g’
2. For each response:
           Calculate the normalized score S_norm = Student Score/Guide Score
       Store S_norm for each response
3. Apply TF-IDF vectorization to q’ and g’ to generate TF-IDF vectors V_q and V_g
4. For each pair (V_q, V_g):
           Compute cosine similarity cos(theta) between V_q and V_g
       Store cosine similarity for each response
5. Split cosine similarity and S_norm into training and test sets:
           (X_train, X_test, y_train, y_test)
6. Train Linear Regression model M on (X_train, y_train)
7. Use model M to predict normalized scores on X_test:
           y_pred = M.predict(X_test)
8. Evaluate the model:
           Calculate MSE between y_test and y_pred
           Calculate Spearman’s Rank Correlation between y_test and y_pred
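The corresponding sketch for Algorithm 3 replaces the combined text with a single cosine-similarity feature between each answer and its marking guide; file and column names are again hypothetical.

import string
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

def clean(text):
    return str(text).lower().translate(str.maketrans("", "", string.punctuation))

df = pd.read_excel("paired_dataset.xlsx")                      # hypothetical paired file
answers = df["STUDENT ANSWER"].fillna("").map(clean)
guides = df["MARK GUIDE"].fillna("").map(clean)
y = df["STUDENT SCORE"] / df["GUIDE SCORE"]                    # step 2: normalized scores

# Step 3: fit one vocabulary over answers and guides so the vectors share the same space.
vec = TfidfVectorizer().fit(pd.concat([answers, guides]))
V_q, V_g = vec.transform(answers), vec.transform(guides)

# Step 4: row-wise cosine similarity between each answer and its own marking guide.
sims = np.array([cosine_similarity(V_q[i], V_g[i])[0, 0] for i in range(V_q.shape[0])])

X_train, X_test, y_train, y_test = train_test_split(
    sims.reshape(-1, 1), y, test_size=0.2, random_state=42)    # step 5
model = LinearRegression().fit(X_train, y_train)               # step 6
y_pred = model.predict(X_test)                                 # step 7

rho, _ = spearmanr(y_test, y_pred)                             # step 8
print("MSE:", mean_squared_error(y_test, y_pred))
print("Spearman's Rank Correlation:", rho)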
Algorithm 4: BERT with Cosine Similarity (Predict_Grading_With_BERT)
Input: Dataset D with student answers, marking guides, student scores, and guide scores
Output: Predicted normalized scores for student responses
1. For each student answer q and marking guide g in D:
           Replace missing values with empty strings
           Convert q and g to lowercase and remove punctuation
       Store preprocessed text in q’ and g’
2. For each response:
           Calculate the normalized score S_norm = Student Score/Guide Score
       Store S_norm for each response
3. Load the pre-trained BERT tokenizer and BERT model
4. For each q’ and g’:
           Tokenize q’ and g’ using the BERT tokenizer
           Extract BERT embeddings E_q and E_g from the [CLS] token of q’ and g’
5. For each pair (E_q, E_g):
           Compute cosine similarity cos(theta) between E_q and E_g
       Store cosine similarity for each response
6. Split cosine similarity and S_norm into training and test sets:
           (X_train, X_test, y_train, y_test)
7. Train Linear Regression model M on (X_train, y_train)
8. Use model M to predict normalized scores on X_test:
           y_pred = M.predict(X_test)
9. Evaluate the model:
           Calculate MSE between y_test and y_pred
           Calculate Spearman’s Rank Correlation between y_test and y_pred
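A sketch of Algorithm 4 using the Hugging Face transformers library is given below: it extracts [CLS] embeddings for each answer and its marking guide and feeds their cosine similarity to a linear model. The checkpoint bert-base-uncased and the file name are assumptions, not the exact configuration used in the experiments.

import numpy as np
import pandas as pd
import torch
from transformers import AutoModel, AutoTokenizer
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # step 3
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

@torch.no_grad()
def cls_embedding(text):
    # Step 4: tokenize (truncated to 512 tokens) and take the [CLS] vector.
    inputs = tokenizer(str(text), return_tensors="pt", truncation=True, max_length=512)
    return bert(**inputs).last_hidden_state[0, 0, :].numpy()

df = pd.read_excel("paired_dataset.xlsx")                       # hypothetical paired file
df[["STUDENT ANSWER", "MARK GUIDE"]] = df[["STUDENT ANSWER", "MARK GUIDE"]].fillna("")
y = df["STUDENT SCORE"] / df["GUIDE SCORE"]                     # step 2

sims = []                                                       # step 5
for answer, guide in zip(df["STUDENT ANSWER"], df["MARK GUIDE"]):
    e_q, e_g = cls_embedding(answer), cls_embedding(guide)
    sims.append(float(np.dot(e_q, e_g) / (np.linalg.norm(e_q) * np.linalg.norm(e_g))))

X_train, X_test, y_train, y_test = train_test_split(
    np.array(sims).reshape(-1, 1), y, test_size=0.2, random_state=42)  # step 6
reg = LinearRegression().fit(X_train, y_train)                  # step 7
y_pred = reg.predict(X_test)                                    # step 8

rho, _ = spearmanr(y_test, y_pred)                              # step 9
print("MSE:", mean_squared_error(y_test, y_pred))
print("Spearman's Rank Correlation:", rho)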

6. Experiments and Results

6.1. Experimental Description

The experiments were designed to evaluate the ability of different models to predict student scores based on their written responses. The dataset was split into 80% training and 20% testing, ensuring that models were tested on responses that had not been seen before. To maintain consistency across different grading scales, student scores were normalized by dividing each score by the maximum possible score for the corresponding question. Several models were tested, including TF-IDF with Linear Regression, TF-IDF with Cosine Similarity, and BERT with Cosine Similarity, along with transformer-based models such as SBERT, RoBERTa, and Longformer. Each model followed the structured training process defined by the algorithms in Section 5.3. All experiments were conducted on a high-performance machine equipped with an NVIDIA A10 Tensor Core GPU, 32 vCPUs, and 128 GB of RAM, using PyTorch 2.6.0 as the primary deep learning framework. To prevent overfitting, early stopping was applied, i.e., training was halted once model performance no longer improved. Training was performed using AdamW optimization, with a learning rate of 2 × 10⁻⁴ and a batch size of 32, running for 10 epochs. The hyperparameters used for training are summarized in Table 3.
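The following PyTorch sketch illustrates the training configuration in Table 3 (AdamW, learning rate 2 × 10⁻⁴, batch size 32, 10 epochs, early stopping). The regression head and the omitted data-loading loop are placeholders rather than the exact architecture used in these experiments.

import torch
from torch.optim import AdamW
from transformers import AutoModel

class GradingRegressor(torch.nn.Module):
    # Assumed architecture: a transformer encoder with a linear head predicting a normalized score.
    def __init__(self, encoder_name="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0, :]).squeeze(-1)  # score read from the [CLS] position

model = GradingRegressor()
optimizer = AdamW(model.parameters(), lr=2e-4)   # optimizer and learning rate from Table 3
loss_fn = torch.nn.MSELoss()
batch_size, num_epochs, patience = 32, 10, 2     # patience value is an assumption

best_val_loss, bad_epochs = float("inf"), 0
for epoch in range(num_epochs):
    # Training loop omitted: iterate over mini-batches of size 32, compute
    # loss_fn(model(input_ids, attention_mask), targets), backpropagate, and step the optimizer.
    val_loss = 0.0  # placeholder; in practice computed on a held-out validation split
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # early stopping once performance no longer improves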
Model performance was assessed using three key metrics: Spearman’s Correlation, which measures the agreement between model-predicted rankings and human-assigned scores; Pearson’s Correlation, which evaluates the linear relationship between predicted and actual scores; and Mean Squared Error (MSE), which quantifies the error in score predictions.
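These metrics can be computed as follows, given arrays of predicted and human-assigned normalized scores (the values shown are illustrative only).

import numpy as np
from scipy.stats import spearmanr, pearsonr
from sklearn.metrics import mean_squared_error

y_true = np.array([0.8, 0.5, 1.0, 0.2, 0.6])    # human-assigned normalized scores (illustrative)
y_pred = np.array([0.7, 0.55, 0.9, 0.3, 0.65])  # model predictions (illustrative)

rho, _ = spearmanr(y_true, y_pred)
r, _ = pearsonr(y_true, y_pred)
print("Spearman's Correlation:", rho)
print("Pearson's Correlation:", r)
print("MSE:", mean_squared_error(y_true, y_pred))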

6.2. Results

Table 4 presents the performance comparison of several models: TF-IDF with Linear Regression, TF-IDF with Cosine Similarity, and BERT with Cosine Similarity. The evaluation metrics include Spearman’s Correlation (SC), Pearson’s Correlation (PC), and Mean Squared Error (MSE), which collectively assess the alignment between predicted scores and human grader assessments. Additionally, advanced transformer-based models (BERT, SBERT, RoBERTa, and Longformer) are included for comparison. The results are summarized in Table 4 and visually depicted in Figure 2.
The performance comparison of models reveals significant differences in their ability to align with human-assigned scores and accurately predict student responses. The TF-IDF with Linear Regression model exhibited the weakest performance, achieving a Spearman’s Correlation of 0.22 and the highest Mean Squared Error (MSE) of 2.50. This poor alignment between predicted and actual scores highlights the limitations of TF-IDF, which relies solely on word frequency without considering semantic context. Additionally, linear regression, being a simplistic model, fails to capture the non-linear relationships inherent in complex textual data, such as open-ended student responses, further contributing to its underperformance.
The use of TF-IDF with Cosine Similarity demonstrated notable improvements, achieving a Spearman’s Correlation of 0.45 and a significantly reduced MSE of 0.20. Cosine Similarity enhances the comparison between student responses and marking guides by measuring their vector space alignment. However, despite this improvement, TF-IDF remains limited in its ability to capture the deeper contextual and semantic relationships required for grading nuanced responses, limiting its overall effectiveness.
The BERT with Cosine Similarity model outperformed the TF-IDF-based approaches, achieving a Spearman’s Correlation of 0.52 and an MSE of 0.12. BERT’s superior performance stems from its ability to encode contextual and semantic relationships between words, enabling it to understand the nuances within student responses. Unlike TF-IDF, which focuses on word frequency, BERT generates embeddings that reflect the meaning of words within their surrounding context. When paired with Cosine Similarity, BERT effectively measures the semantic alignment between student responses and marking guides, leading to more accurate score predictions and closer alignment with human grading.
The advanced transformer-based models, including SBERT, RoBERTa, and Longformer, demonstrated further performance improvements. Among these, Longformer achieved the best results, with a Spearman’s Correlation of 0.77, a Pearson’s Correlation of 0.78, and the lowest MSE of 0.04. These models incorporate various optimizations, such as sentence-level representations in SBERT, extensive pretraining on larger datasets in RoBERTa, and the efficient handling of long sequences in Longformer. These advancements enable these models to achieve superior semantic understanding and contextualization, making them particularly effective for automated short-answer grading tasks.
In conclusion, these results underscore the clear advantages of transformer-based models over traditional approaches like TF-IDF. While TF-IDF can serve as a baseline for grading, its inability to capture the contextual and semantic nuances of student responses limits its effectiveness. In contrast, BERT and its advanced variants leverage contextual embeddings to deliver significantly better performance, with Longformer achieving the most accurate and consistent results. These findings highlight the transformative potential of deep learning models in automated grading systems, paving the way for scalable and reliable assessment solutions.

7. Limitations and Challenges

Despite careful review, there is the potential for minor transcription errors when converting handwritten responses to typed text. These errors may slightly impact the quality of the dataset but are expected to be minimal. Another limitation is the dataset size and domain specificity. The dataset is derived from responses collected in the MIS221 and MIS415 courses at Covenant University, which may limit the models’ usability across other subjects, disciplines, or learning contexts. More diversified datasets across several academic disciplines would be needed to develop a more generalized automated grading system. Also, although the grading process was based on a predefined guide, there is an inherent level of subjectivity in evaluating open-ended responses. This subjectivity may lead to slight variations in the scores assigned by different graders, making it important to account for this variability when training models.

8. Conclusions

The development of this dataset addresses the increasing demand for automated grading systems in higher education by providing a structured and versatile resource for training and testing machine learning models. The dataset encompasses a diverse range of student responses to both theoretical and applied questions, which were carefully collected, transcribed, and scored using standardized marking guides. The analysis indicates that students tend to perform better on theoretical questions, which are generally more objective and easier to grade consistently. In contrast, applied questions pose greater challenges due to their subjective nature, resulting in higher variability in student performance and grading consistency. The inclusion of both raw student responses and detailed grading benchmarks ensures that the dataset is suitable not only for training machine learning models for automated grading, but also for broader applications such as text analysis and educational research. Preliminary experiments with NLP models, including some transformer models, demonstrate the potential for automated grading systems to align closely with human graders, particularly for theoretical questions.
To enhance the applicability of this dataset and improve automated grading systems, future work should focus on developing more sophisticated frameworks for grading subjective responses. This includes refining model architectures, incorporating additional contextual and domain-specific features, and exploring advanced deep learning techniques, such as other transformer-based models. Enhancements in these areas will help reduce subjectivity and variability in grading, thereby increasing the reliability and scalability of automated assessment tools.
This dataset serves as a valuable resource for researchers and educators aiming to advance automated grading approaches, offering insights into grading consistency and paving the way for more equitable and efficient evaluation systems in education. By addressing the challenges highlighted in this study, the dataset has the potential to contribute significantly to the development of next-generation educational technologies.

Author Contributions

I.D.D. and A.T.A.—conceptualization, methodology, resources. I.D.D. and T.-J.T.-A.—data curation, labelling the dataset, analysis, and writing of the original draft. A.T.A.—supervision. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by Covenant University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Figure A1. Sample of raw student responses.
Figure A2. Sample of raw question paper.
Figure A3. Sample of raw marking guide.

References

  1. Süzen, N.; Gorban, A.N.; Levesley, J.; Mirkes, E.M. Automatic short answer grading and feedback using text mining methods. In Procedia Computer Science; Elsevier B.V.: Amsterdam, The Netherlands, 2020; pp. 726–743. [Google Scholar]
  2. Janda, H.K.; Pawar, A.; Du, S.; Mago, V. Syntactic, semantic and sentiment analysis: The joint effect on automated essay evaluation. IEEE Access 2019, 7, 108486–108503. [Google Scholar] [CrossRef]
  3. Oladipupo, O.O.; Olugbara, O.O. Evaluation of data analytics based clustering algorithms for knowledge mining in a student engagement data. Intell. Data Anal. 2019, 23, 1055–1071. [Google Scholar]
  4. Oladipupo, O.; Samuel, S. A Learning Analytic Approach to Modelling Student-Staff Interaction From Students’ Perception of Engagement Practices. IEEE Access. 2024, 12, 10315–10333. [Google Scholar] [CrossRef]
  5. Ahmed, A.; Joorabchi, A.; Hayes, M.J. On deep learning approaches to automated assessment: Strategies for short answer grading. CSEDU 2022, 2, 85–94. [Google Scholar]
  6. Lagakis, P.; Demetriadis, S. Automated essay scoring: A review of the field. In Proceedings of the 2021 International Conference on Computer, Information and Telecommunication Systems (CITS), Istanbul, Turkey, 11–13 November 2021; pp. 1–6. [Google Scholar]
  7. Wu, Y.; Henriksson, A.; Nouri, J.; Duneld, M.; Li, X. Beyond Benchmarks: Spotting Key Topical Sentences While Improving Automated Essay Scoring Performance with Topic-Aware BERT. Electronics 2023, 12, 150. [Google Scholar] [CrossRef]
  8. Tzirides, A.O.O.; Zapata, G.; Kastania, N.P.; Saini, A.K.; Castro, V.; Ismael, S.A.; You, Y.-L.; Santos, T.A.D.; Searsmith, D.; O’Brien, C.; et al. Combining human and artificial intelligence for enhanced AI literacy in higher education. Comput. Educ. Open 2024, 6, 100184. [Google Scholar]
  9. Garg, J.; Papreja, J.; Apurva, K.; Jain, G. Domain-Specific Hybrid BERT based System for Automatic Short Answer Grading. In Proceedings of the 2022 2nd International Conference on Intelligent Technologies (CONIT), Hubli, India, 24–26 June 2022; pp. 1–6. [Google Scholar]
  10. Dzikovska, M.O.; Nielsen, R.D.; Brew, C.; Leacock, C.; Giampiccolo, D.; Bentivogli, L.; Clark, P.; Dagan, I.; Dang, H.T. SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, GA, USA, 14–15 June 2013; Association for Computational Linguistics: Atlanta, GA, USA, 2013; pp. 263–274. [Google Scholar]
  11. Mohler, M.; Bunescu, R.; Mihalcea, R. Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 752–762. [Google Scholar]
  12. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  13. Sung, C.; Saha, S.; Ma, T.; Reddy, V.; Arora, R. Pre-training BERT on domain resources for short answer grading. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Toronto, ON, Canada, 2019; pp. 6071–6075. [Google Scholar]
  14. Condor, A.; Litster, M.; Pardos, Z. Automatic Short Answer Grading with SBERT on out-of-Sample Questions. In Proceedings of the 14th International Conference on Educational Data Mining (EDM 2021), Paris, France, 29 June–2 July 2021; International Educational Data Mining Society: Worcester, MA, USA, 2021. [Google Scholar]
  15. Alikaniotis, D.; Yannakoudakis, H.; Rei, M. Automatic Text Scoring Using Neural Networks. arXiv 2016, arXiv:1606.04289. [Google Scholar]
  16. Lei, W.; Meng, Z. Text similarity calculation method of Siamese network based on ALBERT. In Proceedings of the 2022 International Conference on Machine Learning and Knowledge Engineering (MLKE), Guilin, China, 25–27 February 2022; pp. 251–255. [Google Scholar]
  17. Zhu, X.; Wu, H.; Zhang, L. Automatic Short-Answer Grading via BERT-Based Deep Neural Networks. IEEE Trans. Learn. Technol. 2022, 15, 364–375. [Google Scholar]
  18. Sayeed, M.A.; Gupta, D. Automate Descriptive Answer Grading using Reference based Models. In Proceedings of the 2022 OITS International Conference on Information Technology (OCIT), Bhubaneswar, India, 14–16 December 2022; pp. 262–267. [Google Scholar]
  19. Ouahrani, L.; Bennouar, D. AR-ASAG An ARabic Dataset for Automatic Short Answer Grading Evaluation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; European Language Resources Association: Paris, France, 2020; pp. 11–16. [Google Scholar] [CrossRef]
  20. Badry, R.M.; Ali, M.; Rslan, E.; Kaseb, M.R. Automatic Arabic Grading System for Short Answer Questions. IEEE Access 2023, 11, 39457–39465. [Google Scholar] [CrossRef]
  21. Salam, M.A.; El-Fatah, M.A.; Hassan, N.F. Automatic grading for Arabic short answer questions using optimized deep learning model. PLoS ONE 2022, 17, e0272269. [Google Scholar]
  22. Nael, O.; ELmanyalawy, Y.; Sharaf, N. AraScore: A deep learning-based system for Arabic short answer scoring. Array 2022, 13, 100109. [Google Scholar]
  23. Chapman, A.; Simperl, E.; Koesten, L.; Konstantinidis, G.; Ibáñez, L.D.; Kacprzak, E.; Groth, P. Dataset search: A survey. VLDB J. 2020, 29, 251–272. [Google Scholar]
  24. Gomaa, W.H.; Fahmy, A.A. Automatic scoring for answers to Arabic test questions. Comput. Speech Lang. 2014, 28, 833–857. [Google Scholar] [CrossRef]
  25. Oyelade, J.; Isewon, I.; Oladipupo, O.; Emebo, O.; Aromolaran, O.; Uwoghiren, E.; Olaniyan, D.; Olawole, O. Data Clustering: Algorithms and Its Applications. In Proceedings of the 19th International Conference on Computational Science and Its Applications (ICCSA 2019), St. Petersburg, Russia, 1–4 July 2019; pp. 71–81. [Google Scholar]
  26. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  27. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar]
Figure 1. Workflow of data collection, processing, and analysis for automated grading systems.
Figure 2. Comparison of models based on average SC and MSE scores.
Table 1. Comparative analysis of theoretical and applied question performance.

Category      Average Score   Standard Deviation   Max Score
Applied       1.33            1.57                 9.0
Theoretical   1.47            1.51                 5.0
Table 2. Token lengths across the datasets.

Dataset                 Longest Token   Shortest Token
MIS415 Student Answer   527             1
MIS415 Mark Guide       515             1
MIS221 Student Answer   275             1
MIS221 Mark Guide       302             10
Table 3. Hyperparameters used for training the models.

Hyperparameter                         Value
Learning rate                          2 × 10⁻⁴
Epochs                                 10
Word embedding dimension               768
Max sequence length for BERT/RoBERTa   128
Max sequence length for Longformer     1024
Optimizer                              AdamW
Batch size                             32
Table 4. Preliminary experiment.

Models             Spearman Correlation (SC)   Pearson Correlation (PC)   Mean Squared Error (MSE)
TF-IDF_linear      0.22                        0.19                       2.50
TF-IDF_cosineSim   0.45                        0.40                       0.20
BERT_cosineSim     0.52                        0.53                       0.12
BERT [7]           0.55                        0.57                       0.12
SBERT [6]          0.63                        0.67                       0.10
RoBERTa [14]       0.68                        0.69                       0.09
Longformer [4]     0.77                        0.78                       0.04
