Cognitive Computing with Large Language Models for Student Assessment Feedback
Abstract
1. Introduction
2. Background Research
3. Materials and Methods
3.1. Dataset
3.2. Models
3.3. Experimental Design
3.3.1. Preliminary Model Evaluation Phase
- Chunk size: 1500 tokens;
- Overlap margin: 200 tokens;
- Batch processing: 3 chunks per iteration (a minimal sketch of this chunking configuration is given at the end of this subsection).
- Systematic partitioning of report content into discrete sections;
- Development of section-specific prompts;
- Independent processing of each section for targeted feedback generation.
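The chunking parameters listed above can be illustrated with a minimal Python sketch. The whitespace tokeniser and the helper names (split_into_chunks, batch, report.txt) are assumptions for illustration only, not the implementation used in this study.

```python
# Illustrative sketch only: token-based chunking with the parameters listed above.
from typing import List

CHUNK_SIZE = 1500  # tokens per chunk
OVERLAP = 200      # tokens shared between consecutive chunks
BATCH_SIZE = 3     # chunks processed per iteration


def split_into_chunks(tokens: List[str],
                      chunk_size: int = CHUNK_SIZE,
                      overlap: int = OVERLAP) -> List[List[str]]:
    """Split a token sequence into overlapping chunks."""
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        start += step
    return chunks


def batch(chunks: List[List[str]], size: int = BATCH_SIZE):
    """Yield successive batches of chunks for iterative processing."""
    for i in range(0, len(chunks), size):
        yield chunks[i:i + size]


# Example usage with a whitespace tokeniser as a stand-in for a model tokeniser;
# "report.txt" is a hypothetical input file.
tokens = open("report.txt", encoding="utf-8").read().split()
for chunk_batch in batch(split_into_chunks(tokens)):
    pass  # each batch of up to three chunks would be passed to the model here
```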
3.3.2. Primary Evaluation Phase
3.4. Report Processing and Content Extraction
3.4.1. PDF Text Extraction Methodology
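As a hedged sketch of one common approach, page-level text extraction can be performed with the open-source pypdf library; this is an illustrative assumption, not necessarily the exact toolchain used in the study, and the file name is hypothetical.

```python
# Minimal, illustrative page-level PDF text extraction using pypdf.
from pypdf import PdfReader


def extract_report_text(pdf_path: str) -> str:
    """Concatenate the extracted text of every page in a student report PDF."""
    reader = PdfReader(pdf_path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return "\n".join(pages)


raw_text = extract_report_text("student_report.pdf")  # hypothetical file name
```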
3.4.2. Content Extraction Architecture
3.4.3. Semantic Segmentation, Content Organisation, and Validation
- Verification of section completeness;
- Validation of hierarchical relationships between sections and subsections;
- Content attribution to appropriate assessment components;
- Preservation of formatting elements crucial for assessment (an illustrative sketch of these checks follows this list).
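The checks listed above can be expressed as a simple validation routine. The section names mirror the assessment components used in this study, while the data structure and function below are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative validation of extracted report sections.
EXPECTED_SECTIONS = ["Project Plan", "Program Code", "Project Outcome", "Conclusion"]


def validate_sections(sections: dict) -> list:
    """Return human-readable issues for an extracted report.

    `sections` maps a top-level section title to a dict with keys
    'text' (str) and 'subsections' (list of subsection titles).
    """
    issues = []
    # Verification of section completeness
    for name in EXPECTED_SECTIONS:
        if not sections.get(name, {}).get("text", "").strip():
            issues.append(f"Missing or empty section: {name}")
    # Validation of hierarchical relationships and content attribution
    for name, body in sections.items():
        if name not in EXPECTED_SECTIONS:
            issues.append(f"Content not attributed to an expected assessment component: {name}")
        for sub in body.get("subsections", []):
            if not sub.strip():
                issues.append(f"Unnamed subsection under '{name}'")
    return issues
```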
3.5. Performance Metrics and Analysis
3.5.1. Evaluation Metrics
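The precision, recall, and F1 scores reported in Section 4.3 are assumed to follow the standard definitions below; micro-averaging pools true positives, false positives, and false negatives across classes before applying the formulas, whereas macro-averaging takes the unweighted mean of the per-class scores.

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```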
3.5.2. Qualitative Analysis
4. Results and Discussion
4.1. Computing Infrastructure
4.2. Preliminary Model Evaluation Results
4.2.1. Model Performance Comparison
4.2.2. Comparative Prompt Sensitivity
Output Processing Requirements and Methodological Adaptations
- Development of explicitly structured prompts with clear boundary conditions;
- Implementation of context-specific disambiguation rules;
- Integration of post-processing pipelines for output standardisation (illustrated in the sketch below).
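A minimal sketch of these adaptations is given below: a prompt template with explicit boundary conditions and a post-processing step that standardises label variants in the generated feedback. The template wording, labels, and function names are assumptions for illustration and not the prompts used in the study.

```python
# Illustrative prompt template and output-standardisation step.
import re

PROMPT_TEMPLATE = (
    "You are assessing the '{section}' section of a student data-analysis report.\n"
    "Respond ONLY with the two labelled parts below and nothing else:\n"
    "Strength: <one or two sentences>\n"
    "Area for Improvement: <one or two sentences>\n\n"
    "Section text:\n{section_text}"
)


def standardise_labels(raw_output: str) -> str:
    """Normalise label variants such as 'Strengths' or 'Areas for Improvements'."""
    text = re.sub(r"(?im)^\s*strengths?\s*:", "Strength:", raw_output)
    text = re.sub(r"(?im)^\s*areas?\s+for\s+improvements?\s*:", "Area for Improvement:", text)
    return text.strip()


prompt = PROMPT_TEMPLATE.format(section="Program Code", section_text="...")
```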
4.3. Quantitative Evaluation of Primary Models
Models’ Context Window Size Limitations
4.4. Qualitative Feedback Analysis
4.4.1. Evaluation of Feedback Coherence and Consistency
CodeLlama-7B Output:
1. Description of Data:
Strength: The dataset description is clear and complete, providing all the necessary information about the dataset, including the source, data types, and size. The comments on accuracy, reliability, and data quality are also provided.
Area for Improvements:
* The dataset description could be more specific about the variables included in the dataset and their definitions.
* The comments on data accuracy and reliability could be more detailed and provide examples to support the claims.
Mistral-7B Output:
1. Data Description
- Strength: The dataset description is clear and comprehensive, providing a detailed explanation of the dataset’s source, contents, and purpose. The author also explains the quality and accuracy of the data, which is helpful for understanding the dataset’s reliability.
- Area for Improvements: The author could provide more specific details about the data collection process, such as sampling methods or data cleaning techniques used. Additionally, the author could clarify the meaning of some variables, like “XXX,” to make the description more accessible to readers.
4.4.2. Evaluating Pedagogical Effectiveness in Automated Feedback Generation: Style, Structure, and Completeness
Mistral-7B
1. Program Code
✓ Strength: The code imports the necessary libraries and initializes the dataframe. It also provides a function to filter data by interval, which is useful for cleaning and pre-processing data.
▲ Area for Improvements: The code does not handle exceptions or errors effectively. For example, if the CSV file is not found or is not in the correct format, the code will crash without any error message. Additionally, the code does not provide any documentation for the functions it defines, making it difficult for others to understand their purpose and usage.
2. Code Structure:
✓ Strength: The code is organized by objective, with separate sections for loading data, cleaning data, and analyzing data. This makes it easy to follow the logic of the code.
▲ Area for Improvements: The functions for cleaning and filtering data are defined within the main code block, rather than being separated into their own modules. This makes the code less modular and harder to reuse in other projects.
CodeLlama-7B
Criteria 1: Program Code
✓ Strong: The code is well-structured and easy to read. The use of user-define functions and classes makes the code more modular and reusable.
▱ Area for Improvement: The code could benefit from more error handling and proper exception handling mechanism to ensure robustness and reliability.
Criteria 2: Code Structure
✓ Strong: The notebook is well-organized and easy to follow. The use of functions and classes helps to keep the code modular and maintainable.
▱ Area for Improvement: The code could benefit from more comments and documentation to make it more self-explanatory and easier to understand.
Another example from the Project Outcome section is as follows:
Mistral-7B Output
2. Visualization of Results
Strength: The report includes visualizations for each objective and uses a variety of appropriate visualization types. The visualizations are informative and easy to understand, with clear titles, labels, and legends.
Area for Improvements: The report could improve by ensuring all writing on figures are legible, and that figures are appropriately sized and scaled. Additionally, providing a brief description of each figure within the report would improve the overall flow and readability of the report.
CodeLlama-7B Output
Criteria 2: Visualization
✓ Strong: The visualization of the results is clear and informative, with appropriate visualizations for each objective and a clear and concisely written explanation of the visualization.
▱ Area for Improvement: The visualizations could be more varied and include more visualizations for different objectives, and the visualization could be more interactive and dynamic.
5. Conclusions
6. Limitations and Future Work
7. Ethical Considerations
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
LLMs | Large Language Models
References
| Report | Project Plan | Program Code | Project Outcome | Conclusion | Total Words |
|---|---|---|---|---|---|
| Report 1 | 535 | 927 | 468 | 325 | 2255 |
| Report 2 | 799 | 2086 | 521 | 393 | 3799 |
| Report 3 | 1105 | 2650 | 1006 | 268 | 5029 |
| Report 4 | 966 | 2135 | 1054 | 650 | 4805 |
| Report 5 | 994 | 9078 | 1185 | 246 | 11,503 |
| Report 6 | 1388 | 6298 | 1071 | 969 | 9726 |
| Report 7 | 503 | 3378 | 545 | 376 | 4802 |
| Report 8 | 959 | 3884 | 286 | 254 | 5383 |
| Report 9 | 721 | 1082 | 561 | 252 | 2616 |
| Report 10 | 1583 | 542 | 1475 | 473 | 4073 |
| Report 11 | 956 | 3847 | 1359 | 665 | 6827 |
| Report 12 | 2659 | 2622 | 2421 | 524 | 8226 |
| Report 13 | 675 | 7486 | 1012 | 3448 | 12,621 |
| Report 14 | 941 | 1112 | 553 | 584 | 3190 |
| Report 15 | 978 | 2230 | 582 | 188 | 3978 |
| Total | 15,762 | 49,357 | 14,099 | 9615 | 88,833 |
| Section | Micro Precision | Micro Recall | Micro F1-Score | Macro Precision | Macro Recall | Macro F1-Score |
|---|---|---|---|---|---|---|
| Project Plan | 0.84 | 0.87 | 0.85 | 0.78 | 0.87 | 0.85 |
| Program Code | 0.97 | 0.89 | 0.93 | 0.84 | 0.84 | 0.92 |
| Project Outcome | 0.87 | 0.89 | 0.88 | 0.82 | 0.89 | 0.88 |
| Conclusion | 1.00 | 0.98 | 0.99 | 0.93 | 0.98 | 0.99 |
| Section | Micro Precision | Micro Recall | Micro F1-Score | Macro Precision | Macro Recall | Macro F1-Score |
|---|---|---|---|---|---|---|
| Project Plan | 0.83 | 0.72 | 0.77 | 0.83 | 0.71 | 0.80 |
| Program Code | 0.97 | 0.90 | 0.93 | 0.78 | 0.72 | 0.81 |
| Project Outcome | 0.90 | 0.84 | 0.87 | 0.90 | 0.84 | 0.92 |
| Conclusion | 1.00 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 |