A Structured Dataset for Automated Grading: From Raw Data to Processed Dataset
Abstract
1. Introduction
2. Related Works
Existing Datasets
- The Beetle Dataset: Designed for training and testing models on two-way and three-way classification tasks. It includes 47 unique questions and 3941 student responses in the training set. The test set is split into two parts: “Unseen Answers” (47 questions, 439 responses) and “Unseen Questions” (9 new questions, 819 responses) [10].
- The SciEntsBank Dataset: Supports multiple classification tasks (two-way, three-way, and five-way). The training set contains 135 questions and 4969 responses. The test set includes three segments: “Unseen Answers” (135 questions, 540 responses), “Unseen Domains” (46 questions, 4562 responses), and “Unseen Questions” (15 questions, 733 responses) [11].
- The Mohler Dataset: Comprises 79 questions and 2273 student answers graded by two educators on a 0–5 scale. The dataset provides individual scores, as well as their averages, facilitating the analysis of inter-rater reliability [11].
- The AR-ASAG Dataset: The first publicly available dataset for automatic short-answer grading in Arabic [24]. It contains questions on cybercrime and computer science, with responses collected from three categories of master’s students, all native Arabic speakers.
3. Methods
3.1. Data Collection
3.2. Grading Process
3.3. Data Structure and Formatting
- Set 1: Student Responses and Scores
- Set 2: Question Marking Guide
3.4. Anonymization and Data Privacy
4. Data Description
- A. Student Responses and Scores (Set 1)
Field | Description
---|---
UNIVERSITY | The institution where the data were collected.
COURSE | The course code.
SESSION | The academic session during which the data were collected, e.g., 2022/2023.
SEMESTER | The semester during which the course was taught.
COURSE TITLE | The title of the course.
QUESTION NO | The number of the question to which the student is responding.
STUDENT ANSWER | The actual answer provided by the student, scanned and then typed into the dataset.
STUDENT SCORE | The numerical score assigned by the human grader based on the student’s response and the corresponding marking guide.
- B. Question Marking Guide (Set 2)
Field | Description
---|---
UNIVERSITY | Covenant University.
COURSE | The course code.
SESSION | The academic session.
SEMESTER | The semester in which the exam took place.
COURSE TITLE | The title of the course.
QUESTION NO | The question number for which the marking guide is provided.
QUESTION | The question itself, as presented to the students during the exam.
MARK GUIDE | A detailed explanation of the points that should be covered in a student’s response to achieve a high score. This includes key concepts, relevant arguments, and any other criteria necessary for a correct answer.
GUIDE SCORE | The maximum score that can be awarded for the question, based on the marking guide.
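As a minimal sketch of how the two sets could be loaded and linked with pandas, the snippet below joins Set 1 and Set 2 on their shared identifying fields and derives the normalized score used later for modelling. The file names (set1_responses.csv, set2_marking_guide.csv) are hypothetical placeholders, not part of the published dataset description; only the column names follow the field tables above.

```python
import pandas as pd

# Hypothetical file names; substitute the actual files from the data repository.
responses = pd.read_csv("set1_responses.csv")    # Set 1: student responses and scores
guides = pd.read_csv("set2_marking_guide.csv")   # Set 2: questions and marking guides

# Shared identifying fields from the descriptions above.
key_cols = ["UNIVERSITY", "COURSE", "SESSION", "SEMESTER", "COURSE TITLE", "QUESTION NO"]

# Attach each student answer to its question and marking guide.
merged = responses.merge(guides, on=key_cols, how="left")

# Normalized score: student score divided by the maximum guide score.
merged["NORMALIZED SCORE"] = merged["STUDENT SCORE"] / merged["GUIDE SCORE"]
print(merged[["QUESTION NO", "STUDENT SCORE", "GUIDE SCORE", "NORMALIZED SCORE"]].head())
```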
Data Summary
5. Analysis and Findings
5.1. Student Performance Trends
Algorithm 1: Unsupervised Clustering for Question Classification

Input: Dataset D with questions
Output: Dataset D' with theoretical/applied categories
1. Load dataset D into a DataFrame df
2. Initialize stopword list
3. For each question q in df['question']:
   q_lower = convert q to lowercase
   q_cleaned = remove punctuation from q_lower
   tokens = tokenize q_cleaned
   q_preprocessed = remove stopwords from tokens
4. Store preprocessed questions in df['cleaned_question']
5. Initialize TF-IDF Vectorizer with max_features = 1000
6. Fit and transform df['cleaned_question'] into matrix X (TF-IDF features)
7. Initialize K-Means with num_clusters = 2
8. Fit K-Means on matrix X to get cluster assignments C
9. For each cluster c:
   a. Extract top terms for each cluster centroid
   b. Manually label clusters as 'theoretical' or 'applied' based on top terms
10. Assign cluster labels (theoretical/applied) to df['category']
11. Return df
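A minimal Python sketch of the pipeline in Algorithm 1, using NLTK and scikit-learn. The file name is a placeholder, and the final cluster-to-label mapping is illustrative only; as step 9 requires, it has to be set by hand after inspecting the top centroid terms.

```python
import string
import nltk
import pandas as pd
from nltk.corpus import stopwords
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    """Lowercase, strip punctuation, and drop stopwords (step 3 of Algorithm 1)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)

# Hypothetical file name for Set 2 (questions and marking guides).
df = pd.read_csv("set2_marking_guide.csv")
df["cleaned_question"] = df["QUESTION"].fillna("").apply(preprocess)

# Steps 5-8: TF-IDF features, then two-cluster K-Means.
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(df["cleaned_question"])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Step 9: print the top centroid terms so the clusters can be labeled manually.
terms = vectorizer.get_feature_names_out()
for c, centroid in enumerate(kmeans.cluster_centers_):
    print(f"cluster {c}:", [terms[i] for i in centroid.argsort()[::-1][:10]])

# Step 10: illustrative mapping; set it after inspecting the printed terms.
label_map = {0: "theoretical", 1: "applied"}
df["category"] = [label_map[c] for c in kmeans.labels_]
```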
Comparative Analysis of Theoretical and Applied Question Performance
5.2. Token Length Analysis
5.3. Automated Grading Potential
Algorithm 2: TF-IDF with Linear Regression

Input: Dataset D with student answers, marking guides, student scores, and guide scores
Output: Predicted normalized scores for student responses
1. For each student answer q and marking guide g in D:
   Replace missing values with empty strings
   Convert q and g to lowercase and remove punctuation
   Store preprocessed text in q' and g'
2. For each q' and g':
   Combine q' and g' into a single input T
   Store T for each student response
3. For each response:
   Calculate the normalized score S_norm = Student Score / Guide Score
   Store S_norm for each response
4. Apply TF-IDF vectorization to T to generate matrix X
5. Split X and S_norm into training and test sets: (X_train, X_test, y_train, y_test)
6. Train Linear Regression model M on (X_train, y_train)
7. Use model M to predict normalized scores on X_test: y_pred = M.predict(X_test)
8. Evaluate the model:
   Calculate MSE between y_test and y_pred
   Calculate Spearman's Rank Correlation between y_test and y_pred
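A minimal scikit-learn sketch of Algorithm 2. The file names and join keys are the same hypothetical ones used in the earlier loading sketch, and the 80/20 train/test split is an assumption rather than a setting stated in the algorithm.

```python
import pandas as pd
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def clean(series: pd.Series) -> pd.Series:
    """Step 1: fill missing values, lowercase, and strip punctuation."""
    return series.fillna("").str.lower().str.replace(r"[^\w\s]", "", regex=True)

# Rebuild the Set 1 / Set 2 join from the earlier loading sketch (hypothetical file names).
merged = pd.read_csv("set1_responses.csv").merge(
    pd.read_csv("set2_marking_guide.csv"),
    on=["COURSE", "SESSION", "SEMESTER", "QUESTION NO"], how="left")

combined = clean(merged["STUDENT ANSWER"]) + " " + clean(merged["MARK GUIDE"])  # step 2
y = merged["STUDENT SCORE"] / merged["GUIDE SCORE"]                             # step 3

X = TfidfVectorizer().fit_transform(combined)                                   # step 4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)                                # step 6
y_pred = model.predict(X_test)                                                  # step 7

print("MSE:", mean_squared_error(y_test, y_pred))                               # step 8
print("Spearman:", spearmanr(y_test, y_pred).correlation)
```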
Algorithm 3: TF-IDF with Cosine Similarity

Input: Dataset D with student answers, marking guides, student scores, and guide scores
Output: Predicted normalized scores for student responses
1. For each student answer q and marking guide g in D:
   Replace missing values with empty strings
   Convert q and g to lowercase and remove punctuation
   Store preprocessed text in q' and g'
2. For each response:
   Calculate the normalized score S_norm = Student Score / Guide Score
   Store S_norm for each response
3. Apply TF-IDF vectorization to q' and g' to generate TF-IDF vectors V_q and V_g
4. For each pair (V_q, V_g):
   Compute cosine similarity cos(theta) between V_q and V_g
   Store cosine similarity for each response
5. Split cosine similarity and S_norm into training and test sets: (X_train, X_test, y_train, y_test)
6. Train Linear Regression model M on (X_train, y_train)
7. Use model M to predict normalized scores on X_test: y_pred = M.predict(X_test)
8. Evaluate the model:
   Calculate MSE between y_test and y_pred
   Calculate Spearman's Rank Correlation between y_test and y_pred
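A sketch of Algorithm 3 under the same assumptions (hypothetical file names, 80/20 split): the single regression feature is the cosine similarity between each answer vector and its marking-guide vector, computed over a shared TF-IDF vocabulary.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

# Rebuild the Set 1 / Set 2 join from the earlier loading sketch (hypothetical file names).
merged = pd.read_csv("set1_responses.csv").merge(
    pd.read_csv("set2_marking_guide.csv"),
    on=["COURSE", "SESSION", "SEMESTER", "QUESTION NO"], how="left")

answers = merged["STUDENT ANSWER"].fillna("").str.lower().str.replace(r"[^\w\s]", "", regex=True)
guides = merged["MARK GUIDE"].fillna("").str.lower().str.replace(r"[^\w\s]", "", regex=True)
y = merged["STUDENT SCORE"] / merged["GUIDE SCORE"]          # step 2: normalized scores

# Step 3: fit one vocabulary so answer and guide vectors are comparable.
vec = TfidfVectorizer().fit(pd.concat([answers, guides]))
V_q, V_g = vec.transform(answers), vec.transform(guides)

# Step 4: one cosine-similarity feature per (answer, guide) pair.
X = np.array([cosine_similarity(V_q[i], V_g[i])[0, 0]
              for i in range(V_q.shape[0])]).reshape(-1, 1)

# Steps 5-8: split, fit a single-feature linear regressor, evaluate.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("Spearman:", spearmanr(y_test, y_pred).correlation)
```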
Algorithm 4: BERT with Cosine Similarity (Predict_Grading_With_BERT)

Input: Dataset D with student answers, marking guides, student scores, and guide scores
Output: Predicted normalized scores for student responses
1. For each student answer q and marking guide g in D:
   Replace missing values with empty strings
   Convert q and g to lowercase and remove punctuation
   Store preprocessed text in q' and g'
2. For each response:
   Calculate the normalized score S_norm = Student Score / Guide Score
   Store S_norm for each response
3. Load the pre-trained BERT tokenizer and BERT model
4. For each q' and g':
   Tokenize q' and g' using the BERT tokenizer
   Extract BERT embeddings E_q and E_g from the [CLS] token of q' and g'
5. For each pair (E_q, E_g):
   Compute cosine similarity cos(theta) between E_q and E_g
   Store cosine similarity for each response
6. Split cosine similarity and S_norm into training and test sets: (X_train, X_test, y_train, y_test)
7. Train Linear Regression model M on (X_train, y_train)
8. Use model M to predict normalized scores on X_test: y_pred = M.predict(X_test)
9. Evaluate the model:
   Calculate MSE between y_test and y_pred
   Calculate Spearman's Rank Correlation between y_test and y_pred
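A hedged sketch of Algorithm 4 with Hugging Face transformers. The bert-base-uncased checkpoint, the 128-token truncation length, the hypothetical file names, and the 80/20 split are assumptions, not settings stated in the algorithm.

```python
import numpy as np
import pandas as pd
import torch
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # step 3
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def cls_embedding(text: str) -> np.ndarray:
    """Step 4: [CLS] embedding of a text from the last hidden layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0, :].squeeze(0).numpy()

# Rebuild the Set 1 / Set 2 join from the earlier loading sketch (hypothetical file names).
merged = pd.read_csv("set1_responses.csv").merge(
    pd.read_csv("set2_marking_guide.csv"),
    on=["COURSE", "SESSION", "SEMESTER", "QUESTION NO"], how="left")
answers = merged["STUDENT ANSWER"].fillna("").str.lower()
guides = merged["MARK GUIDE"].fillna("").str.lower()
y = merged["STUDENT SCORE"] / merged["GUIDE SCORE"]              # step 2

E_q = np.vstack([cls_embedding(t) for t in answers])             # step 4
E_g = np.vstack([cls_embedding(t) for t in guides])

# Step 5: one cosine-similarity feature per (answer, guide) pair.
X = np.array([cosine_similarity(E_q[i:i + 1], E_g[i:i + 1])[0, 0]
              for i in range(len(E_q))]).reshape(-1, 1)

# Steps 6-9: split, fit the linear regressor on the similarity feature, evaluate.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("Spearman:", spearmanr(y_test, y_pred).correlation)
```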
6. Experiments and Results
6.1. Experimental Description
6.2. Results
7. Limitations and Challenges
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
References
- Süzen, N.; Gorban, A.N.; Levesley, J.; Mirkes, E.M. Automatic short answer grading and feedback using text mining methods. In Procedia Computer Science; Elsevier B.V.: Amsterdam, The Netherlands, 2020; pp. 726–743.
- Janda, H.K.; Pawar, A.; Du, S.; Mago, V. Syntactic, semantic and sentiment analysis: The joint effect on automated essay evaluation. IEEE Access 2019, 7, 108486–108503.
- Oladipupo, O.O.; Olugbara, O.O. Evaluation of data analytics based clustering algorithms for knowledge mining in a student engagement data. Intell. Data Anal. 2019, 23, 1055–1071.
- Oladipupo, O.; Samuel, S. A Learning Analytic Approach to Modelling Student-Staff Interaction From Students’ Perception of Engagement Practices. IEEE Access 2024, 12, 10315–10333.
- Ahmed, A.; Joorabchi, A.; Hayes, M.J. On deep learning approaches to automated assessment: Strategies for short answer grading. CSEDU 2022, 2, 85–94.
- Lagakis, P.; Demetriadis, S. Automated essay scoring: A review of the field. In Proceedings of the 2021 International Conference on Computer, Information and Telecommunication Systems (CITS), Istanbul, Turkey, 11–13 November 2021; pp. 1–6.
- Wu, Y.; Henriksson, A.; Nouri, J.; Duneld, M.; Li, X. Beyond Benchmarks: Spotting Key Topical Sentences While Improving Automated Essay Scoring Performance with Topic-Aware BERT. Electronics 2023, 12, 150.
- Tzirides, A.O.O.; Zapata, G.; Kastania, N.P.; Saini, A.K.; Castro, V.; Ismael, S.A.; You, Y.-L.; Santos, T.A.D.; Searsmith, D.; O’Brien, C.; et al. Combining human and artificial intelligence for enhanced AI literacy in higher education. Comput. Educ. Open 2024, 6, 100184.
- Garg, J.; Papreja, J.; Apurva, K.; Jain, G. Domain-Specific Hybrid BERT based System for Automatic Short Answer Grading. In Proceedings of the 2022 2nd International Conference on Intelligent Technologies (CONIT), Hubli, India, 24–26 June 2022; pp. 1–6.
- Dzikovska, M.O.; Nielsen, R.D.; Brew, C.; Leacock, C.; Giampiccolo, D.; Bentivogli, L.; Clark, P.; Dagan, I.; Dang, H.T. SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, GA, USA, 14–15 June 2013; Association for Computational Linguistics: Atlanta, GA, USA, 2013; pp. 263–274.
- Mohler, M.; Bunescu, R.; Mihalcea, R. Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 752–762.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
- Sung, C.; Saha, S.; Ma, T.; Reddy, V.; Arora, R. Pre-training BERT on domain resources for short answer grading. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Toronto, ON, Canada, 2019; pp. 6071–6075.
- Condor, A.; Litster, M.; Pardos, Z. Automatic Short Answer Grading with SBERT on out-of-Sample Questions. In Proceedings of the 14th International Conference on Educational Data Mining (EDM 2021), Paris, France, 29 June–2 July 2021; International Educational Data Mining Society: Worcester, MA, USA, 2021.
- Alikaniotis, D.; Yannakoudakis, H.; Rei, M. Automatic Text Scoring Using Neural Networks. arXiv 2016, arXiv:1606.04289.
- Lei, W.; Meng, Z. Text similarity calculation method of Siamese network based on ALBERT. In Proceedings of the 2022 International Conference on Machine Learning and Knowledge Engineering (MLKE), Guilin, China, 25–27 February 2022; pp. 251–255.
- Zhu, X.; Wu, H.; Zhang, L. Automatic Short-Answer Grading via BERT-Based Deep Neural Networks. IEEE Trans. Learn. Technol. 2022, 15, 364–375.
- Sayeed, M.A.; Gupta, D. Automate Descriptive Answer Grading using Reference based Models. In Proceedings of the 2022 OITS International Conference on Information Technology (OCIT), Bhubaneswar, India, 14–16 December 2022; pp. 262–267.
- Ouahrani, L.; Bennouar, D. AR-ASAG An ARabic Dataset for Automatic Short Answer Grading Evaluation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; European Language Resources Association: Paris, France, 2020; pp. 11–16.
- Badry, R.M.; Ali, M.; Rslan, E.; Kaseb, M.R. Automatic Arabic Grading System for Short Answer Questions. IEEE Access 2023, 11, 39457–39465.
- Salam, M.A.; El-Fatah, M.A.; Hassan, N.F. Automatic grading for Arabic short answer questions using optimized deep learning model. PLoS ONE 2022, 17, e0272269.
- Nael, O.; ELmanyalawy, Y.; Sharaf, N. AraScore: A deep learning-based system for Arabic short answer scoring. Array 2022, 13, 100109.
- Chapman, A.; Simperl, E.; Koesten, L.; Konstantinidis, G.; Ibáñez, L.D.; Kacprzak, E.; Groth, P. Dataset search: A survey. VLDB J. 2020, 29, 251–272.
- Gomaa, W.H.; Fahmy, A.A. Automatic scoring for answers to Arabic test questions. Comput. Speech Lang. 2014, 28, 833–857.
- Oyelade, J.; Isewon, I.; Oladipupo, O.; Emebo, O.; Aromolaran, O.; Uwoghiren, E.; Olaniyan, D.; Olawole, O. Data Clustering: Algorithms and Its Applications. In Proceedings of the 19th International Conference on Computational Science and Its Applications (ICCSA 2019), St. Petersburg, Russia, 1–4 July 2019; pp. 71–81.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
- Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150.
Category | Average Score | Standard Deviation | Max Score |
---|---|---|---|
Applied | 1.33 | 1.57 | 9.0 |
Theoretical | 1.47 | 1.51 | 5.0 |
Dataset | Maximum Token Length | Minimum Token Length |
---|---|---|
MIS415 Student Answer | 527 | 1 |
MIS415 Mark Guide | 515 | 1 |
MIS221 Student Answer | 275 | 1 |
MIS221 Mark Guide | 302 | 10 |
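As an illustration only, per-course token lengths like those in the table could be derived with simple whitespace tokenization; the file names are the hypothetical ones from the earlier loading sketch, and the authors' tokenizer may differ, so the exact numbers are not expected to match.

```python
import pandas as pd

# Hypothetical file names, as in the earlier loading sketch.
responses = pd.read_csv("set1_responses.csv")
guides = pd.read_csv("set2_marking_guide.csv")

# Whitespace token counts per text; a different tokenizer would give different values.
responses["tokens"] = responses["STUDENT ANSWER"].fillna("").str.split().str.len()
guides["tokens"] = guides["MARK GUIDE"].fillna("").str.split().str.len()

print(responses.groupby("COURSE")["tokens"].agg(["max", "min"]))  # per-course answer lengths
print(guides.groupby("COURSE")["tokens"].agg(["max", "min"]))     # per-course guide lengths
```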
Hyperparameter | Value |
---|---|
Learning rate | |
Epoch | 10 |
Word embedding dimension | 768 |
Max sequence length for BERT/RoBERTa | 128 |
Max sequence length for Longformer | 1024 |
Optimizer | AdamW |
Batch size | 32 |
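A hedged configuration sketch showing how the listed hyperparameters could be wired into a Hugging Face fine-tuning setup. The checkpoint name and the learning-rate value (blank in the table above) are placeholders rather than the authors' settings; RoBERTa and Longformer would be configured analogously with their own checkpoints and maximum sequence lengths.

```python
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-uncased"   # assumed checkpoint; swap for RoBERTa/Longformer as needed
MAX_LEN = 128                      # 1024 for Longformer, per the table
EPOCHS = 10
BATCH_SIZE = 32
LEARNING_RATE = 2e-5               # placeholder: the value is not shown in the table

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=1 with a regression objective predicts the normalized score directly
# from the 768-dimensional pooled representation.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1, problem_type="regression")
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

# Answer and marking guide are encoded as a sentence pair, truncated to MAX_LEN.
batch = tokenizer("student answer text", "marking guide text",
                  truncation=True, max_length=MAX_LEN,
                  padding="max_length", return_tensors="pt")
```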