From Pilots to Practices: A Scoping Review of GenAI-Enabled Personalization in Computer Science Education
Abstract
1. Introduction
- Policy context.
- What we mean by personalization.
- A working definition: exploration-first.
- Help-design defaults that preserve productive struggle: Explanation-first hints (pseudocode, tracing, and fault localization), solution withholding by default, and graduated hint ladders gated by short reflection prompts before escalation (sketched below).
- Artifact grounding: Tutors and feedback are conditioned on the learner’s current code, failing tests, and assignment specification; assessment is grounded in explicit rubrics, exemplars, unit tests, and mutation checks.
- Human-in-the-loop audits of any generated tests, items, and grades, with logs retained for pedagogy and moderation (not “detector” policing).
- Pilot → measure → scale: Activate the tool for one section or assignment, examine process and outcome metrics, and expand scope when the combined quantitative and qualitative evidence supports doing so.
- Enablement governance: Vetted or enterprise instances, data minimization, and prompt and version change logs; short allow-lists in syllabi plus process evidence (what was asked, hint levels used, and test history) instead of AI detectors.
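To make the first of these defaults concrete, the sketch below shows one way a graduated hint ladder with solution withholding and a reflection gate could be implemented in Python. The level names, the ten-word reflection threshold, and the `HintSession` class are illustrative assumptions, not a design taken from any single reviewed study.

```python
# A minimal sketch of a graduated hint ladder with solution withholding.
from dataclasses import dataclass, field

# Ordered hint levels; a full solution is deliberately not on the ladder.
HINT_LEVELS = ["conceptual", "strategy", "fault localization", "pseudocode"]

@dataclass
class HintSession:
    level: int = 0                                  # highest level unlocked so far
    reflections: list = field(default_factory=list)

    def escalate(self, reflection: str) -> str:
        """Unlock the next hint level only after a short written reflection."""
        if len(reflection.split()) < 10:  # illustrative gate, not a validated cutoff
            return "Describe what you tried and what you expected before escalating."
        self.reflections.append(reflection)
        if self.level < len(HINT_LEVELS) - 1:
            self.level += 1
        return f"Unlocked hint level: {HINT_LEVELS[self.level]}"

session = HintSession()
print(session.escalate("it broke"))  # gated: reflection too short
print(session.escalate("My loop exits early on empty input; I expected 0 but the test reports None."))
```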
- Why a scoping review now?
- Objective and gap.
- Contributions.
- A structured map of how GenAI is used to personalize CS learning, emphasizing mechanisms (explanation-first hints, ladders, and rubric- and test-grounding) over brands;
- A synthesis of effectiveness signals (time-to-help, error remediation, feedback quality, and grading reliability) and the conditions under which they appear;
- A consolidation of risks (integrity, privacy, bias and equity, and over-reliance) with actionable mitigation;
- Design principles and workflow patterns for exploration-first personalization;
- Research questions.
- RQ1.
- RQ2.
- RQ3.
- RQ4.
- RQ5.
- RQ6.
2. Background and Related Work
2.1. From ITSs to LLMs
2.2. Clarifying Terms
2.3. Affordances Across CS Subdomains
2.4. Pre-GenAI Baselines
2.5. GenAI-Enabled Patterns in CS Education
3. Methods: Scoping Approach
- Registration.
3.1. Eligibility and Selection
- Rationale for purposive sampling.
- (a) Mechanism transparency: Studies that clearly described personalization mechanisms (e.g., hint ladders, explanation-first scaffolding, course-aligned generation, or test- or rubric-grounding) were included. Papers that invoked “ChatGPT support” without detailing intervention logic were excluded.
- (b) Interpretable process or learning outcomes: Studies reporting measurable learning, debugging, process, or behavioral outcomes were included. Papers reporting only post hoc satisfaction surveys or generic perceptions without task-linked metrics were excluded because they could not inform mechanism–outcome relationships.
- (c) Sufficient intervention detail: Studies that described prompts, constraints, workflows, model grounding, or tutor policies were included. Excluded papers typically lacked enough detail to map how personalization was implemented (e.g., no description of scaffolding, no explanation of input grounding, or insufficient reporting of tasks).
- Why 27 full-text studies were excluded.
- Personalization not actually implemented: The system provided static advice or open-ended chat interaction with no evidence of adaptation.
- Insufficient mechanism description: The intervention lacked detail on how hints were generated, how tasks were adapted, or how the model was conditioned.
- Outcomes limited to satisfaction surveys: No behavioral, process, or learning-related data were reported, preventing mechanism mapping.
- Redundant or superseded work: Conference abstracts or short papers from the same research groups that were later expanded into more detailed publications were excluded; only the fuller publications were retained.
- Negative or null results with no mechanistic insight: Some studies reported poor or null outcomes but provided too little detail to attribute failure to design, prompting, scaffolding, or grounding decisions.
- Implications for bias.
3.2. Charting and Synthesis
4. Results
4.1. Corpus Characteristics
4.2. Application Areas and Mechanisms
4.3. Measures and Constructs
4.4. Descriptive Outcome Signals
5. Comparative Analysis: Design Patterns and Outcomes
5.1. Design Pattern Effectiveness
5.2. Condition Analysis
- Artifact grounding: Tutoring and feedback anchored in students’ current code, failing tests, and assignment specifications (sketched below).
- Quality assurance loops: Human review of generated tests, items, hints, or grades before or alongside student exposure.
- Graduated scaffolding: Multi-level hint structures or feedback ladders requiring reflection or effort before escalation.
- AI literacy integration: Explicit instruction on effective help-seeking, limitations of tools, and expectations around academic integrity.
- Unconstrained access to solutions early in the interaction;
- Grading prompts without explicit rubrics or exemplar calibration;
- Limited or no instructor review of generated content;
- Weak integration with existing course infrastructure (autograders, LMS, and version control).
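To illustrate the artifact-grounding condition above, the following is a minimal sketch of how a tutor prompt might be assembled from course artifacts rather than from chat history alone. The `build_tutor_prompt` helper, the file layout, and the policy wording are hypothetical.

```python
# Minimal sketch of artifact grounding: the tutor prompt is assembled from the
# learner's current code, the failing tests, and the assignment specification.
# File names, section labels, and the policy wording are illustrative assumptions.
from pathlib import Path

SYSTEM_POLICY = (
    "You are a CS1 tutor. Explain, trace, and localize faults. "
    "Do not output a complete solution; stop at pseudocode."
)

def build_tutor_prompt(code_file: str, spec_file: str, failing_tests: list[str]) -> str:
    """Condition the tutor on course artifacts instead of chat history alone."""
    code = Path(code_file).read_text()
    spec = Path(spec_file).read_text()
    return "\n\n".join([
        SYSTEM_POLICY,
        "ASSIGNMENT SPECIFICATION:\n" + spec,
        "STUDENT'S CURRENT CODE:\n" + code,
        "FAILING TESTS:\n" + "\n".join(failing_tests),
        "Identify the most likely fault and give one conceptual hint, not a fix.",
    ])

# Example call (paths are hypothetical):
# prompt = build_tutor_prompt("hw3.py", "hw3_spec.md", ["test_empty_list FAILED"])
```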
5.3. Mechanism–Outcome Mapping
6. Discussion
6.1. RQ1: Design Mechanisms
6.2. RQ2: Effectiveness Conditions
6.3. RQ3: Risks and Mitigation
6.4. RQ4: Workflows That Align with Durable Learning
6.5. RQ5: Institutional Practice
6.6. RQ6: Evidence Gaps
6.7. Theoretical Grounding of Findings
- Desirable difficulties and productive struggle.
- Worked examples and fading.
- Assessment for learning.
- Cognitive load management.
7. Implementation Roadmap for Departments
- Critical decision points.
- Resource requirements.
- Year 1 (pilot): 0.25 FTE coordinator; 20–30 h faculty training; tool licensing; 10–15 h/week pilot faculty time.
- Years 2–3 (scale): 0.5 FTE coordinator; ongoing training (5–10 h/faculty); audit processes (5–10 h/semester per tool); vendor management.
- Ongoing: Policy review; assessment validation; longitudinal studies (potentially grant-supported).
8. Limitations
- Temporal and selection bias.
- Publication and outcome bias.
- Quality appraisal and study design.
- Heterogeneity in measurement.
- Limited longitudinal data.
- Equity analysis gaps.
9. Future Work and Research Priorities
9.1. Critical Research Needs
- Longitudinal studies of learning and skill development.
- Comparative effectiveness trials of guardrail designs.
- Equity-focused research.
- Standardized benchmarks and shared datasets.
- Open-source tool development.
9.2. Practice Innovations
- Process-based assessment portfolios.
- Multi-institution collaboratives.
- Student co-design and AI literacy curricula.
9.3. Policy and Governance Research
- Vendor vetting and contract negotiation.
- Labor and instructor impact.
- Long-term institutional case studies.
10. Conclusions
- Actionable takeaways.
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Deployment Checklist for Instructors
Appendix A.1. Before Deployment
- ☐ Policy approved by department and communicated to students;
- ☐ Vendor FERPA/GDPR compliance audit completed;
- ☐ Faculty training conducted (tool features, pedagogical strategies, and risk mitigation);
- ☐ Baseline data collection planned (control sections or pre-deployment metrics);
- ☐ Human-in-the-loop audit workflow defined (who reviews what, when, and how);
- ☐ Syllabus updated with AI-use statement (allowed tools, permitted uses, and citation requirements);
- ☐ Student AI literacy session scheduled (effective help-seeking, tool limitations, and academic integrity).
Appendix A.2. During Pilot (Weekly/Bi-Weekly)
- ☐ Interaction logs reviewed for answer-seeking patterns and over-reliance signals (sketched below);
- ☐ Student feedback collected (weeks 3, 8, 15: utility, clarity, and concerns);
- ☐ Quality spot-checks (sample generated hints, grades, or tests; verify accuracy and alignment);
- ☐ Equity monitoring (compare usage and outcomes by subgroup where feasible; investigate disparities);
- ☐ Incident log maintained (errors, hallucinations, inappropriate outputs, and student complaints).
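As one way to operationalize the weekly log review in this checklist, the sketch below flags students whose requests skew toward answer-seeking. The keyword patterns and the 0.5 flagging threshold are illustrative assumptions, not validated cutoffs.

```python
# Sketch of the weekly log review: flag students whose requests skew toward
# answer-seeking rather than strategy-seeking.
import re
from collections import defaultdict

ANSWER_SEEKING = re.compile(
    r"\b(give me the (answer|solution|code)|write (it|the code) for me|full solution)\b",
    re.IGNORECASE,
)

def flag_over_reliance(log_rows, threshold=0.5):
    """log_rows: iterable of (student_id, prompt_text) pairs from interaction logs."""
    counts = defaultdict(lambda: [0, 0])  # student -> [answer_seeking, total]
    for student, prompt in log_rows:
        counts[student][1] += 1
        if ANSWER_SEEKING.search(prompt):
            counts[student][0] += 1
    return {s: a / t for s, (a, t) in counts.items() if a / t >= threshold}

rows = [("s1", "Why does my loop skip the last item?"),
        ("s2", "Give me the solution for question 3"),
        ("s2", "Write the code for me please")]
print(flag_over_reliance(rows))  # {'s2': 1.0}
```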
Appendix A.3. Evaluation at End of Pilot
- ☐ Key process metrics (e.g., time-to-help, hint usage, and grading agreement) summarized and interpreted;
- ☐ Evidence of meaningful error remediation and/or improved feedback quality relative to baseline;
- ☐ Grading and test-generation workflows checked for reliability and alignment with rubrics and specifications;
- ☐ No major equity concerns identified in stratified analyses (where data permits);
- ☐ Faculty and student feedback indicates that benefits outweigh burdens or risks.
Appendix A.4. Decision Point
- ☐ If evidence is broadly positive, consider scaling to additional sections or courses with ongoing monitoring;
- ☐ If evidence is mixed, diagnose causes (tool design, instructor preparation, and task alignment), refine, and re-pilot;
- ☐ If major risks are identified, pause or discontinue use pending remediation; document lessons learned.
Appendix A.5. Ongoing (Post-Scale)
- ☐ Periodic review of usage, outcome, and equity metrics;
- ☐ Annual policy review (update for new use cases, model changes, and regulatory shifts);
- ☐ Vendor re-evaluation (privacy practices, pricing, feature roadmap, and lock-in risks);
- ☐ Longitudinal follow-up where feasible (e.g., retention, transfer, and downstream course performance);
- ☐ Community contribution (share practices, prompts, and lessons via conferences or repositories).
References
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- OpenAI. Hello GPT-4o. 2024. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 10 December 2025).
- Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku (Model Card). 2024. Available online: https://www.anthropic.com/claude-3-model-card (accessed on 10 December 2025).
- DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar] [CrossRef]
- Wang, S.; Xu, T.; Li, H.; Zhang, C.; Liang, J.; Tang, J.; Yu, P.S.; Wen, Q. Large Language Models for Education: A Survey and Outlook. arXiv 2024, arXiv:2403.18105. [Google Scholar] [CrossRef]
- Xu, H.; Gan, W.; Qi, Z.; Wu, J.; Yu, P.S. Large Language Models for Education: A Survey. arXiv 2024, arXiv:2405.13001. [Google Scholar] [CrossRef]
- U.S. Department of Education, Office of Educational Technology. Artificial Intelligence and the Future of Teaching and Learning: Insights and Recommendations; U.S. Department of Education: Washington, DC, USA, 2023. [Google Scholar]
- EDUCAUSE. 2024 EDUCAUSE AI Landscape Study; EDUCAUSE: Boulder, CO, USA, 2024; Available online: https://library.educause.edu/resources/2024/2/2024-educause-ai-landscape-study (accessed on 10 December 2025).
- UNESCO. Guidance for Generative AI in Education and Research; UNESCO: Paris, France, 2023; Available online: https://www.unesco.org/en/articles/guidance-generative-ai-education-and-research (accessed on 10 December 2025).
- OECD. OECD Digital Education Outlook 2023: Towards an Effective Digital Education Ecosystem; Technical Report; OECD Publishing: Paris, France, 2023. [Google Scholar]
- Reuters. Top French University Bans Use of ChatGPT to Prevent Plagiarism; Reuters: London, UK, 2023; Available online: https://www.reuters.com/technology/top-french-university-bans-use-chatgpt-prevent-plagiarism-2023-01-27/ (accessed on 5 September 2025).
- California State University. CSU Prepares Students, Faculty and Staff for an AI-Driven Future; California State University: Long Beach, CA, USA, 2025; Available online: https://www.calstate.edu/csu-system/news/Pages/CSU-Prepares-Students-Employees-for-AI-Driven-Future.aspx (accessed on 5 September 2025).
- California State University, Northridge. ChatGPT Edu for Students & Faculty. 2025. Available online: https://www.csun.edu/it/software-services/chatgpt (accessed on 5 September 2025).
- California State University, San Bernardino. CSUSB ChatGPT Edu. 2025. Available online: https://www.csusb.edu/faculty-center-for-excellence/instructional-design-and-academic-technologies-idat/chatgpt (accessed on 5 September 2025).
- Kelly, R. California State University Launches Systemwide ChatGPT Edu Deployment. Campus Technology, 2 May 2025. [Google Scholar]
- Kazemitabaar, M.; Ye, R.; Wang, X.; Henley, A.Z.; Denny, P.; Craig, M.; Grossman, T. CodeAid: Evaluating a Classroom Deployment of an LLM-based Programming Assistant that Avoids Revealing Solutions. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024. [Google Scholar]
- Yang, S.; Zhao, H.; Xu, Y.; Brennan, K.; Schneider, B. Debugging with an AI Tutor: Investigating Novice Help-Seeking Behaviors and Perceived Learning. In Proceedings of the 2024 ACM Conference on International Computing Education Research (ICER), Melbourne, VIC, Australia, 13–15 August 2024. [Google Scholar] [CrossRef]
- Jury, B.; Lorusso, A.; Leinonen, J.; Denny, P.; Luxton-Reilly, A. Evaluating LLM-generated Worked Examples in an Introductory Programming Course. In Proceedings of the ACE 2024: Australian Computing Education Conference, Sydney, NSW, Australia, 29 January–2 February 2024. [Google Scholar] [CrossRef]
- del Carpio Gutierrez, A.; Denny, P.; Luxton-Reilly, A. Automating Personalized Parsons Problems with Customized Contexts and Concepts. In Proceedings of the 2024 ACM Conference on Innovation and Technology in Computer Science Education (ITiCSE 2024); ACM: New York, NY, USA, 2024; pp. 688–694. [Google Scholar] [CrossRef]
- Logacheva, E.; Hellas, A.; Prather, J.; Sarsa, S.; Leinonen, J. Evaluating Contextually Personalized Programming Exercises Created with Generative AI. In Proceedings of the 2024 ACM Conference on International Computing Education Research (ICER), Melbourne, VIC, Australia, 13–15 August 2024. [Google Scholar] [CrossRef]
- Meyer, J.; Jansen, T.; Schiller, R.; Liebenow, L.W.; Steinbach, M.; Horbach, A.; Fleckenstein, J. Using LLMs to Bring Evidence-Based Feedback into the Classroom: AI-generated Feedback Increases Secondary Students’ Text Revision, Motivation, and Positive Emotions. Comput. Educ. Artif. Intell. 2024, 6, 100199. [Google Scholar] [CrossRef]
- Alkafaween, U.; Albluwi, I.; Denny, P. Automating Autograding: Large Language Models as Test Suite Generators for Introductory Programming. arXiv 2024, arXiv:2411.09261. [Google Scholar] [CrossRef]
- Xie, W.; Niu, J.; Xue, C.J.; Guan, N. Grade Like a Human: Rethinking Automated Assessment with Large Language Models. arXiv 2024, arXiv:2405.19694. [Google Scholar] [CrossRef]
- Phung, T.; Pădurean, V.A.; Singh, A.; Brooks, C.; Cambronero, J.; Gulwani, S.; Singla, A.; Soares, G. Automating Human Tutor-Style Programming Feedback: Leveraging GPT-4 Tutor Model for Hint Generation and GPT-3.5 Student Model for Hint Validation. In Proceedings of the 14th Learning Analytics and Knowledge Conference (LAK 2024), Kyoto, Japan, 18–22 March 2024; ACM: New York, NY, USA, 2024; pp. 12–23. [Google Scholar]
- Tyton Partners. Time for Class 2024: Unlocking Access to Effective Digital Teaching and Learning; Report; Tyton Partners: Boston, MA, USA, 2024. [Google Scholar]
- The Harvard Crimson. CS50 Will Integrate Artificial Intelligence into Course Instruction; The Harvard Crimson: Cambridge, MA, USA, 2023. [Google Scholar]
- Liu, R.; Zenke, C.; Liu, C.; Holmes, A.; Thornton, P.; Malan, D.J. Teaching CS50 with AI: Leveraging Generative Artificial Intelligence for Scaffolding and Feedback. In Proceedings of the 55th ACM Technical Symposium on Computer Science Education (SIGCSE), Portland, OR, USA, 20–23 March 2024. [Google Scholar] [CrossRef]
- Stanford Teaching Commons. Artificial Intelligence Teaching Guide; Stanford Teaching Commons: Stanford, CA, USA, 2024. [Google Scholar]
- MIT Teaching + Learning Lab. Generative AI & Your Course. Available online: https://tll.mit.edu/teaching-resources/course-design/gen-ai-your-course/ (accessed on 10 December 2025).
- Penn Center for Excellence in Teaching, Learning, & Innovation. Generative AI & Your Teaching. Available online: https://cetli.upenn.edu/resources/generative-ai/ (accessed on 10 December 2025).
- University of Pennsylvania, Center for Excellence in Teaching, Learning, & Innovation (CETLI). Penn AI Guidance and Policies. Available online: https://ai.upenn.edu/guidance (accessed on 10 December 2025).
- Duke Learning Innovation & Lifetime Education. Generative AI and Teaching at Duke: Guidance for Instructors; Duke University: Durham, NC, USA, 2025. [Google Scholar]
- Future of Privacy Forum. Vetting Generative AI Tools for Use in Schools. 2024. Available online: https://fpf.org/ (accessed on 5 September 2024).
- Gabbay, H.; Cohen, A. Combining LLM-Generated and Test-Based Feedback in a MOOC for Programming. In Proceedings of the 11th ACM Conference on Learning @ Scale (L@S 2024), Atlanta, GA, USA, 18–20 July 2024; ACM: New York, NY, USA, 2024; pp. 177–187. [Google Scholar] [CrossRef]
- MIT RAISE. Securing Student Data in the Age of Generative AI; Report; MIT RAISE: Cambridge, MA, USA, 2024. [Google Scholar]
- Carbonell, J.R. AI in CAI: An Artificial-Intelligence Approach to Computer-Assisted Instruction. IEEE Trans. Man–Mach. Syst. 1970, 11, 190–202. [Google Scholar] [CrossRef]
- Sleeman, D.; Brown, J.S. (Eds.) Intelligent Tutoring Systems; Academic Press: New York, NY, USA, 1982. [Google Scholar]
- Woolf, B.P. Building Intelligent Interactive Tutors: Student-Centered Strategies for Revolutionizing E-Learning; Morgan Kaufmann: San Francisco, CA, USA, 2009. [Google Scholar]
- Nwana, H.S. Intelligent Tutoring Systems: An Overview. Artif. Intell. Rev. 1990, 4, 251–277. [Google Scholar] [CrossRef]
- Corbett, A.T.; Anderson, J.R. Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge. User Model. User-Adapt. Interact. 1994, 4, 253–278. [Google Scholar] [CrossRef]
- Piech, C.; Bassen, J.; Huang, J.; Ganguli, S.; Sahami, M.; Guibas, L.J.; Sohl-Dickstein, J. Deep Knowledge Tracing. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015; Curran Associates, Inc.: Red Hook, NY, USA, 2015. [Google Scholar]
- Pelánek, R. Bayesian Knowledge Tracing, Logistic Models, and Beyond: An Overview of Learner Modeling Techniques. User Model. User-Adapt. Interact. 2017, 27, 313–350. [Google Scholar] [CrossRef]
- Kasneci, E.; Seßler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E. ChatGPT for Good? On Opportunities and Challenges of Large Language Models for Education. Learn. Individ. Differ. 2023, 103, 102274. [Google Scholar] [CrossRef]
- Lohr, D.; Berges, M.; Chugh, A.; Striewe, M. Adaptive Learning Systems in Programming Education. In Proceedings of the GI Software Engineering/Informatics Education (DELFI 2024), Linz, Austria, 26 February–1 March 2024; Lecture Notes in Informatics (LNI), Gesellschaft für Informatik: Bonn, Germany, 2024. [Google Scholar]
- Ishaq, K.; Alvi, A.; Ikram ul Haq, M.; Rosdi, F.; Nazeer Choudhry, A.; Anjum, A.; Ali Khan, F. Level up your coding: A systematic review of personalized, cognitive, and gamified learning in programming education. PeerJ Comput. Sci. 2024, 10, e2310. [Google Scholar] [CrossRef]
- Marwan, S.; Gao, G.; Fisk, S.R.; Price, T.W.; Barnes, T. Adaptive Immediate Feedback Can Improve Novice Programming Engagement and Intention to Persist in Computer Science. In Proceedings of the 2020 International Computing Education Research Conference (ICER ’20), Virtual Event, 10–12 August 2020; ACM: New York, NY, USA; pp. 1–10. [Google Scholar] [CrossRef]
- Cavalcanti, A.P.; Barbosa, A.; Carvalho, R.; Freitas, F.; Tsai, Y.-S.; Gašević, D.; Mello, R.F. Automatic Feedback in Online Learning Environments: A Systematic Review. Comput. Educ. Artif. Intell. 2021, 2, 100027. [Google Scholar] [CrossRef]
- Lucas, H.C.; Upperman, J.S.; Robinson, J.R. A systematic review of large language models and their impact in medical education. Med. Educ. 2024, 58, 1276–1285. [Google Scholar] [CrossRef]
- Zawacki-Richter, O.; Marín, V.I.; Bond, M.; Gouverneur, F. Systematic Review of Research on Artificial Intelligence Applications in Higher Education—Where Are the Educators? Int. J. Educ. Technol. High. Educ. 2019, 16, 39. [Google Scholar] [CrossRef]
- Bassner, P.; Frankford, E.; Krusche, S. Iris: An AI-Driven Virtual Tutor for Computer Science Education. In Proceedings of the ITiCSE 2024, Milan, Italy, 8–10 July 2024; ACM: New York, NY, USA, 2024; pp. 534–540. [Google Scholar] [CrossRef]
- Yang, Y.; Liu, J.; Zamfirescu-Pereira, J.D.; DeNero, J. Scalable Small-Group CS Tutoring System with AI. arXiv 2024, arXiv:2407.17007. [Google Scholar] [CrossRef]
- Kestin, G.; Miller, K.; Klales, A.; Milbourne, T.; Ponti, G. AI tutoring outperforms in-class active learning. Sci. Rep. 2025, early access. [Google Scholar] [CrossRef]
- Hou, X.; Wu, Z.; Wang, X.; Ericson, B.J. CodeTailor: LLM-Powered Personalized Parsons Puzzles for Engaging Support While Learning Programming. In Proceedings of the Eleventh ACM Conference on Learning @ Scale (L@S ’24), Atlanta, GA, USA, 18–20 July 2024; pp. 1–12. [Google Scholar] [CrossRef]
- Heickal, H.; Lan, A.S. Generating Feedback-Ladders for Logical Errors in Programming Assignments Using GPT-4. In Proceedings of the Educational Data Mining 2024 (Posters), Atlanta, GA, USA, 11–13 July 2024. [Google Scholar]
- Zhu, E.; Teja, S.; Coombes, C.; Patterson, D. FEED-BOT: Formative Design Feedback on Programming Assignments. In Proceedings of the ITiCSE 2025, Nijmegen, The Netherlands, 27 June–2 July 2025; ACM: New York, NY, USA, 2025; Volume 1. [Google Scholar] [CrossRef]
- Doughty, J.; Wan, Z.; Bompelli, A.; Qayum, J.; Wang, T.; Zhang, J.; Zheng, Y.; Doyle, A.; Sridhar, P.; Agarwal, A.; et al. A Comparative Study of AI-Generated (GPT-4) and Human-crafted MCQs in Programming Education. In Proceedings of the 55th ACM Technical Symposium on Computer Science Education (SIGCSE Companion), Portland, OR, USA, 20–23 March 2024; ACM: New York, NY, USA, 2024; pp. 114–123. [Google Scholar] [CrossRef]
- Savelka, J.; Agarwal, A.; Bogart, C.; Sakr, M. From GPT-3 to GPT-4: On the Evolving Efficacy of Large Language Models to Answer Multiple-Choice Questions for Programming Classes in Higher Education. arXiv 2023, arXiv:2311.09518. [Google Scholar]
- Isley, C.; Gilbert, J.; Kassos, E.; Kocher, M.; Nie, A.; Brunskill, E.; Domingue, B.; Hofman, J.; Legewie, J.; Svoronos, T.; et al. Assessing the Quality of AI-Generated Exams: A Large-Scale Field Study. arXiv 2025, arXiv:2508.08314. [Google Scholar]
- Impey, C.; Wenger, M.; Garuda, N.; Golchin, S.; Stamer, S. Using Large Language Models for Automated Grading of Essays and Feedback Generation. arXiv 2024, arXiv:2412.18719. [Google Scholar]
- Yousef, M.; Mohamed, K.; Medhat, W.; Mohamed, E.H.; Khoriba, G.; Arafa, T. BeGrading: Large Language Models for Enhanced Feedback in Programming Education. Neural Comput. Appl. 2024, 37, 1027–1040. [Google Scholar] [CrossRef]
- Gaggioli, A.; Casaburi, G.; Ercolani, L.; Collova’, F.; Torre, P.; Davide, F. Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education. arXiv 2025, arXiv:2508.02442. [Google Scholar] [CrossRef]
- Lin, H.Y.; Thongtanunam, P.; Treude, C.; Charoenwet, W.P. Improving Automated Code Reviews: Learning from Experience. In Proceedings of the 21st IEEE/ACM International Conference on Mining Software Repositories (MSR 2024), Lisbon, Portugal, 14–20 April 2024. [Google Scholar]
- Almeida, Y.; Gomes, A.A.R.; Dantas, E.; Muniz, F.; de Farias Santos, K.; Perkusich, M.; Almeida, H.; Perkusich, A. AICodeReview: Advancing Code Quality with AI-Enhanced Reviews. SoftwareX 2024, 26, 101677. [Google Scholar] [CrossRef]
- Shah, A.; Erickson, S.; Waldvogel, T.; Brown, K.M. The CS1 Reviewer App: Choose Your Own Adventure or Learn by Repetition? In Proceedings of the ACM ITiCSE, Virtual, 26 June–1 July 2021. [Google Scholar]
- Cihan, U.; Haratian, V.; İçöz, A.; Gül, M.K.; Devran, Ö.; Bayendur, E.F.; Uçar, B.M.; Tüzün, E. Automated Code Review in Practice: Experience from Deploying and Improving an LLM-based PR Agent at Scale. arXiv 2024, arXiv:2412.18531. [Google Scholar]
- Cihan, U.; İçöz, A.; Haratian, V.; Tüzün, E. Evaluating Large Language Models for Code Review. arXiv 2025, arXiv:2505.20206. [Google Scholar] [CrossRef]
- Arksey, H.; O’Malley, L. Scoping studies: Towards a methodological framework. Int. J. Soc. Res. Methodol. 2005, 8, 19–32. [Google Scholar] [CrossRef]
- Levac, D.; Colquhoun, H.; O’Brien, K.K. Scoping studies: Advancing the methodology. Implement. Sci. 2010, 5, 69. [Google Scholar] [CrossRef]
- Peters, M.D.J.; Marnie, C.; Tricco, A.C.; Pollock, D.; Munn, Z.; Alexander, L.; McInerney, P.; Godfrey, C.M.; Khalil, H. Updated methodological guidance for the conduct of scoping reviews. JBI Evid. Synth. 2020, 18, 2119–2126. [Google Scholar] [CrossRef]
- Tricco, A.C.; Lillie, E.; Zarin, W.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; Moher, D.; Peters, M.D.J.; Horsley, T.; Weeks, L.; et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann. Intern. Med. 2018, 169, 467–473. [Google Scholar] [CrossRef]
- Zamfirescu-Pereira, J.; Qi, L.; Hartmann, B.; DeNero, J.; Norouzi, N. 61A Bot Report: AI Assistants in CS1 Save Students Homework Time and Reduce Demands on Staff. (Now What?). arXiv 2024, arXiv:2406.05600v3. [Google Scholar]
- Burstein, J.; Chodorow, M.; Leacock, C. Automated Essay Evaluation: The Criterion Online Writing Service. AI Mag. 2004, 25, 27–36. [Google Scholar]
- Florida Department of Education. 2013 Audit III Report: Scoring of the FCAT 2.0 Writing Assessment. 2013. Available online: https://www.fldoe.org/core/fileparse.php/3/urlt/2013burosreportfcatwritingassessment.pdf (accessed on 10 December 2025).
- Pizzorno, J.A.; Berger, E.D. CoverUp: Coverage-Guided LLM-Based Test Generation. arXiv 2024, arXiv:2403.16218. [Google Scholar]
- Broide, L.; Stern, R. EvoGPT: Enhancing Test Suite Robustness via LLM-Based Generation and Genetic Optimization. arXiv 2025, arXiv:2505.12424. [Google Scholar]
- Yang, B.; Tian, H.; Pian, W.; Yu, H.; Wang, H.; Klein, J.; Bissyandé, T.F.; Jin, S. CREF: An LLM-Based Conversational Software Repair Framework. In Proceedings of the ISSTA 2024, Vienna, Austria, 16–20 September 2024. [Google Scholar]
- Venugopalan, D.; Yan, Z.; Borchers, C.; Lin, J.; Aleven, V. Combining Large Language Models with Tutoring System Intelligence: A Case Study in Caregiver Homework Support. In Proceedings of the LAK 2025, Dublin, Ireland, 3–7 March 2025; ACM: New York, NY, USA, 2025. [Google Scholar]
- Nielsen, J. Response Times: The 3 Important Limits. 1993. Updated by Nielsen Norman Group. Available online: https://www.nngroup.com/articles/response-times-3-important-limits/ (accessed on 10 December 2025).
- Akoglu, L.; de Mel, G. Analysis of Question Response Time in StackOverflow. In Proceedings of the ASONAM 2014, Beijing, China, 17–20 August 2014; pp. 215–222. [Google Scholar]
- Piazza. Fall Usage Data Far Exceeds Expectations. 2011. Available online: https://piazza.com/about/press/20120106.html (accessed on 10 December 2025).
- LuPLab, UC Davis. Piazza Statistics: Response Time vs Class Size. 2021. Available online: https://luplab.cs.ucdavis.edu/2021/03/16/piazza-statistics.html (accessed on 10 December 2025).
- Washington, T., II; Bardolph, M.; Hadjipieris, P.; Ghanbari, S.; Hargis, J. Today’s Discussion Boards: The Good, the Bad, and the Ugly. Online J. New Horizons Educ. 2019, 9, 222–230. [Google Scholar]
- Prather, J.; Reeves, B.N.; Leinonen, J.; MacNeil, S.; Randrianasolo, A.S.; Becker, B.A.; Kimmel, B.; Wright, J.; Briggs, B. The Widening Gap: The Benefits and Harms of Generative AI for Novice Programmers. In Proceedings of the 2024 ACM Conference on International Computing Education Research (ICER 2024), Melbourne, VIC, Australia, 13–15 August 2024; Association for Computing Machinery: New York, NY, USA, 2024. [Google Scholar] [CrossRef]
- Zviel-Girshin, R.; Terk-Baruch, M.; Shvartzman, E.; Shonfeld, M. Generative AI in Novice Programming Education: Opportunities and Challenges. Educ. Sci. 2024, 14, 1089. [Google Scholar] [CrossRef]
- Pew Research Center. A Quarter of U.S. Teachers Say AI Tools Do More Harm Than Good in K–12 Education. Survey report; Pew Research Center, 2024. Available online: https://www.pewresearch.org/ (accessed on 10 February 2025).
- Price, T.W.; Dong, Y.; Roy, R.; Barnes, T. The Effect of Hint Quality on Help-Seeking Behavior. In Proceedings of the 18th International Conference on Artificial Intelligence in Education (AIED 2017), Wuhan, China, 28 June–1 July 2017; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2017; pp. 312–323. [Google Scholar]
- Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef] [PubMed]
- Roll, I.; Aleven, V.; McLaren, B.M.; Koedinger, K.R. The Help Tutor: Does Metacognitive Feedback Improve Students’ Help-Seeking Actions, Skills and Learning? In Proceedings of the 7th International Conference on Intelligent Tutoring Systems (ITS 2006), Jhongli, Taiwan, 26–30 June 2006; Lecture Notes in Computer Science, Vol. 4053. Springer: Berlin/Heidelberg, Germany, 2006; pp. 360–369. [Google Scholar]
- Rahe, C.; Maalej, W. How Do Programming Students Use Generative AI? In Proceedings of the ACM Joint Meeting on Foundations of Software Engineering (FSE 2025), Trondheim, Norway, 23–27 June 2025; Association for Computing Machinery: New York, NY, USA, 2025. [Google Scholar] [CrossRef]
- Harvard University. CS50 Will Use Artificial Intelligence to Help Students Learn; University news announcement; Harvard University: Cambridge, MA, USA, 2023. [Google Scholar]
- Quality Assurance Agency for Higher Education (QAA). Reconsidering Assessment for the ChatGPT Era; Guidance report; QAA: Gloucester, UK, 2023. [Google Scholar]
- Fogg, B.J. Persuasive Technology: Using Computers to Change What We Think and Do; Morgan Kaufmann: San Francisco, CA, USA, 2003. [Google Scholar]
- Bandura, A. Social Learning Theory; Prentice-Hall: Englewood Cliffs, NJ, USA, 1977. [Google Scholar]
- Deci, E.L.; Ryan, R.M. The “What” and “Why” of Goal Pursuits: Human Needs and the Self-Determination of Behavior. Psychol. Inq. 2000, 11, 227–268. [Google Scholar] [CrossRef]
- Jisc National Centre for AI. Embracing Generative AI in Assessments: A Guided Flowchart; Guidance document; Jisc: London, UK, 2024. [Google Scholar]
- Future of Privacy Forum. Generative AI in Higher Education: Considerations for Privacy and Data Governance; Policy report; Future of Privacy Forum: Washington, DC, USA, 2024. [Google Scholar]
- OECD. The Potential Impact of Artificial Intelligence on Equity and Inclusion in Education; Technical report; OECD Publishing: Paris, France, 2024. [Google Scholar]
- OpenAI. New AI Classifier for Indicating AI-Written Text (Notice of Discontinuation); Announcement; OpenAI: San Francisco, CA, USA, 2023. [Google Scholar]
- Liang, W.; Yuksekgonul, M.; Mao, Y.; Wu, E.; Zou, J. GPT detectors are biased against non-native English writers. Patterns 2023, 4, 100779. [Google Scholar] [CrossRef]
- Baker, R.S.; Hawn, A. Algorithmic Bias in Education. Int. J. Artif. Intell. Educ. 2022, 32, 901–902. [Google Scholar] [CrossRef]
- OECD. Algorithmic Bias: The State of the Situation and Policy Recommendations. In OECD Digital Education Outlook 2023; OECD Publishing: Paris, France, 2023. [Google Scholar]
- Bjork, E.L.; Bjork, R.A. Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning. In Psychology and the Real World: Essays Illustrating Fundamental Contributions to Society; Gernsbacher, M.A., Pew, R.W., Hough, L.M., Pomerantz, J.R., Eds.; Worth Publishers: New York, NY, USA, 2011; pp. 56–64. [Google Scholar]
- Sweller, J. Cognitive load during problem solving: Effects on learning. Cogn. Sci. 1988, 12, 257–285. [Google Scholar] [CrossRef]
- EDUCAUSE. 2024 EDUCAUSE Action Plan: AI Policies and Guidelines; Report; EDUCAUSE: Louisville, CO, USA, 2024. [Google Scholar]
- Stanford Center for Teaching and Learning. Teaching with AI: Guidelines, Policies, and Recommendations; Instructional guidance; Stanford University: Stanford, CA, USA, 2024. [Google Scholar]
- Duke Learning Innovation. Generative AI Guidance for Instructors; Instructional guidance; Duke University: Durham, NC, USA, 2024. [Google Scholar]
- World Economic Forum. Shaping the Future of Learning: The Role of AI in Education; Policy report; World Economic Forum: Cologny, Switzerland, 2024. [Google Scholar]
- Sciences Po. Sciences Po Bans the Use of ChatGPT Without Transparent Referencing. Institutional Announcement; Sciences Po, 2023. Available online: https://newsroom.sciencespo.fr/ (accessed on 10 December 2024).
- University of Hong Kong. HKU Temporarily Bans Students from Using ChatGPT; University announcement; University of Hong Kong: Hong Kong, China, 2023. [Google Scholar]
- University of Hong Kong. HKU Drops Ban and Provides Generative AI Tools Campus-Wide; University announcement; University of Hong Kong: Hong Kong, China, 2023. [Google Scholar]
- Arizona State University. ASU–OpenAI Partnership (ChatGPT for Education/Enterprise) Announcement; Press release; Arizona State University: Tempe, AZ, USA, 2024. [Google Scholar]
- Vygotsky, L.S. Mind in Society: The Development of Higher Psychological Processes; Cole, M., John-Steiner, V., Scribner, S., Souberman, E., Eds.; Harvard University Press: Cambridge, MA, USA, 1978. [Google Scholar]
- Sweller, J.; Cooper, G.A. The use of worked examples as a substitute for problem solving in learning algebra. Cogn. Instr. 1985, 2, 59–89. [Google Scholar] [CrossRef]
- Renkl, A.; Atkinson, R.K.; Maier, U.H.; Staley, R. From example study to problem solving: Smooth transitions help learning. J. Exp. Educ. 2002, 70, 293–315. [Google Scholar] [CrossRef]
- Black, P.; Wiliam, D. Assessment and classroom learning. Assess. Educ. Princ. Policy Pract. 1998, 5, 7–74. [Google Scholar] [CrossRef]
- Sadler, D.R. Formative assessment and the design of instructional systems. Instr. Sci. 1989, 18, 119–144. [Google Scholar] [CrossRef]



| Application Type | Sources | Personalization Mechanism | Setting and Population | Main Takeaway |
|---|---|---|---|---|
| Solution-withholding programming assistant | [16] | Context-aware explanations, pseudocode, and line-level annotations that avoid full solutions | Large CS course | Guardrails (no solutions) sustain productive struggle and perceived learning. |
| Debugging tutor for novices | [17] | Conversational hints grounded in student code and errors | Intro CS | Designs should nudge learners toward strategy-seeking over answer-seeking. |
| Virtual tutor integrated with LMS and IDE | [50] | Tutor role prompts plus context (specifications, code, tests); calibrated assistance | CS course platform | Immediate, personalized support at scale without revealing solutions. |
| Scalable small-group AI tutoring | [51] | Group-aware facilitation; targeted prompts and hints | Small-group CS sessions | Personalization extends to group dynamics and roles. |
| CS61A Bot (course assistant) | [71] | Course-aware assistant supporting task orchestration and help | Large intro CS | Shows feasibility and challenges of course-integrated assistants. |
| LLM-generated worked examples | [18] | Course-aligned, level-appropriate exemplars with stepwise explanations | Intro programming | Novices rate examples as useful; curate for quality. |
| Contextually personalized exercises | [20] | Tailored practice aligned to course context and learner profile | CS courses | Personalized items are viable; quality varies. |
| Personalized Parsons problems | [19] | Custom code-rearrangement tasks targeting concepts | CS1 and online practice | Automating generation enables individualized practice at scale. |
| Personalized Parsons (L@S) | [53] | Multi-staged, on-demand puzzles that adapt to struggle patterns | CS1 | Engaging support without giving away solutions. |
| Evidence-based formative feedback | [21] | Structured, error-specific feedback aligned to pedagogy | Classroom deployments | LLMs surface actionable feedback; prompt design matters. |
| Feedback ladders for logic errors | [54] | Graduated hints from high-level cues to specific guidance | Programming assignments | Laddered feedback supports stepwise progress. |
| Tutor-style hints with validation | [24] | GPT-4 “tutor” generates hints; GPT-3.5 “student” validates quality | Benchmarks (Python) | Improves hint precision using tests and fixes; test-driven prompting. |
| Combining LLM + test feedback | [34] | LLM feedback complements automated tests in a MOOC | MOOC programming | Hybrid feedback increases correctness and coverage in practice. |
| FEED-BOT design feedback | [55] | Structured, design-recipe-aware formative comments | CS1 design tasks | High-level, structured feedback on design-oriented tasks. |
| Autograding—test suite generation | [22] | LLM-generated unit tests tuned to specifications | CS1 tasks | Improves coverage; reveals ambiguities; audit required. |
| “Grade-like-a-human” pipelines | [23] | Rubric-guided, explanation-rich grading with exemplars | Code and short answers | Near-human reliability with explicit rubrics and calibration. |
| MCQ generation (programming) | [56,57,58] | Blueprint-aligned MCQs with difficulty control | Intro courses | Acceptable psychometrics with expert review. |
| AI-authored exams (quality) | [58] | Item generation with validity checks | Web and CS-adjacent | Viable with rigorous vetting workflows. |
| AI grading and feedback (essays) | [59,72] | Criterion-aligned evaluation with formative comments | General education tasks | Transferable patterns; configure for bias and accuracy. |
| LLM-supported grading (NCA) | [60,73] | LLMs for enhanced feedback in programming education | Programming assignments | LLM feedback augments human grading, but there is concern over reliability. |
| AI-enhanced code review (ICSE) | [62,74,75] | Personalized code critiques learned from review corpora; coverage-guided test pipelines | SE courses | Review quality improves with curated data and guardrails. |
| AI-assisted code review (JSS) | [63] | Prompted or trained reviewers with exemplars | SE courses | Consistency rises; superficial suggestions are avoided. |
| Automated code review in practice | [65,76] | PR-agent deployed at scale; quality and operations insights | Industrial; SEIP’25 | Industrial lessons for integrating LLM reviewers. |
| Evaluating LLMs for code review | [66] | GPT-4o and other models evaluated on correctness and fixes | Benchmarks | LLM reviews help but need human-in-the-loop processes. |
| Hybrid tutoring/code-review systems | [77] | Combining LLMs with existing tutoring intelligence | Informal CS support | Illustrates hybrid human+AI configurations. |
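One recurring pattern in the table above, tutor-generated hints validated by a weaker "student" model [24], can be sketched as follows. The `call_model` and `run_tests` stubs stand in for a real LLM client and the assignment's test suite; the loop is an illustrative assumption, not the cited system's implementation.

```python
# Sketch of the generate-then-validate hint pattern: a "tutor" model proposes a
# hint, and a weaker "student" model must be able to act on it before release.
def call_model(role: str, prompt: str) -> str:
    """Stub standing in for a real LLM client; replace with an actual API call."""
    return "patched_code" if role == "student" else "Check the loop bound."

def run_tests(candidate_code: str) -> bool:
    """Stub standing in for the assignment's unit-test suite."""
    return candidate_code == "patched_code"

def validated_hint(buggy_code: str, failing_test: str, max_tries: int = 3) -> str | None:
    """Release a hint only if a weaker model can use it to pass the failing test."""
    for _ in range(max_tries):
        hint = call_model("tutor", f"Hint for:\n{buggy_code}\nFailing test: {failing_test}")
        attempt = call_model("student", f"Apply this hint:\n{hint}\nto:\n{buggy_code}")
        if run_tests(attempt):
            return hint
    return None  # no validated hint; fall back to human (TA) review

print(validated_hint("def total(xs): ...", "test_empty_list"))  # -> Check the loop bound.
```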
| Application Area | Representative Sources | n (%) |
|---|---|---|
| AI-augmented assessment (tests, grading, item and exam generation) | [22,23,56,57,58,59,60,61] | 8 (25.0) |
| Tutoring and assistants | [16,17,50,51,52,71,77] | 7 (21.9) |
| Personalized learning materials | [18,19,20,53] | 6 (18.8) |
| Targeted formative feedback | [21,24,34,54,55] | 6 (18.8) |
| AI-assisted code review (SE) | [62,63,65,66] | 5 (15.6) |
| Total | | 32 (100) |
| Construct | Operationalization | Sources |
|---|---|---|
| Time-to-help | Latency from request to first actionable hint (IDE or LMS logs). | [16,17,78,79,80,81,82] |
| Error remediation | Share of failing tests resolved; next-attempt correctness; debug step count. | [17,22,74,75,76] |
| Perceived understanding and utility | Post-task Likert on clarity, usefulness, and confidence; coded rationales. | [18,20,83,84,85] |
| Feedback quality | Rubric-coded specificity, actionability, and alignment; inter-rater agreement (κ). | [21,24,54,86,87,88] |
| Grading reliability | QWK, Pearson or Spearman r, exact or adjacent agreement. | [23,60,72,73] |
| Test and coverage quality | Statement or branch coverage; mutation score; unique edge cases surfaced. | [22,74,75] |
| Item and exam quality | Difficulty (p), discrimination (point-biserial), KR-20 or Cronbach’s α; expert review. | [56,57,58] |
| Help-seeking behavior | Proportion of hint versus solution requests; escalation; prompt taxonomy counts. | [17,86,88,89] |
| Instructor and TA effort | Authoring, curation, and audit time; TA workload deltas; review pass rates. | [26,27,56,90] |
| Code-review efficacy | Precision and recall of true issues; fix acceptance; developer effort. | [62,65,66] |
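For readers implementing the grading-reliability row above, the following minimal sketch computes quadratic-weighted kappa (QWK) together with exact and adjacent agreement using scikit-learn and NumPy; the score vectors are toy data, not results from any reviewed study.

```python
# Toy computation of the grading-reliability statistics named above.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human = np.array([3, 4, 2, 5, 4, 3, 1, 4])   # instructor scores (toy data)
model = np.array([3, 4, 3, 5, 3, 3, 1, 5])   # LLM scores (toy data)

qwk = cohen_kappa_score(human, model, weights="quadratic")
exact = float(np.mean(human == model))                  # identical scores
adjacent = float(np.mean(np.abs(human - model) <= 1))   # within one point

print(f"QWK={qwk:.2f}  exact={exact:.2f}  adjacent={adjacent:.2f}")
```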
| Design Pattern | Typical Outcome Tone | Common Observations |
|---|---|---|
| Explanation-first + solution withholding | Often positive | Supports productive struggle and perceived learning; requires AI literacy to discourage answer-seeking. |
| Graduated hint ladders | Often positive | Aligns with stepwise scaffolding; development cost and tuning are non-trivial. |
| Test/rubric-grounded assessment | Often positive | Reliability improves when coupled with clear rubrics, exemplars, and audit; hallucinations surface when specs are vague. |
| Course-aligned generation (examples, exercises) | Often positive | Helps with practice at scale; quality variance highlights need for instructor review and curation. |
| Unconstrained chat interface | Often mixed or negative | Solution dumping, reduced productive struggle, integrity concerns, and over-reliance are recurrent issues. |
| Phase | Activities | Illustrative Success Indicators |
|---|---|---|
| Foundation (Months 1–2) | | Policy approved; tools vetted; volunteers trained |
| Pilot (Months 3–6) | | Data collected; no major incidents; preliminary signals encouraging |
| Evaluation (Month 7) | | Metrics and qualitative feedback suggest pedagogical value without clear harm |
| Scale (Months 8–12) | | Sustained performance; quality maintained; instructor capacity built |
| Sustain (Year 2+) | | Durable integration; evidence of learning benefits; community contribution |

