Abstract
This article formalizes AI-assisted assessment as a discrete-time policy-level design for iterative feedback and evaluates it in a digitally transformed higher-education setting. We integrate an agentic retrieval-augmented generation (RAG) feedback engine—operationalized through planning (rubric-aligned task decomposition), tool use beyond retrieval (tests, static/dynamic analyzers, rubric checker), and self-critique (checklist-based verification)—into a six-iteration dynamic evaluation cycle. Learning trajectories are modeled with three complementary formulations: (i) an interpretable update rule with explicit parameters that links next-step gains to feedback quality and the gap-to-target, yielding iteration-complexity and stability conditions; (ii) a logistic-convergence model capturing diminishing returns near ceiling; and (iii) a relative-gain regression quantifying the marginal effect of feedback quality on the fraction of the gap closed per iteration. In a Concurrent Programming course ( ), the cohort mean increased from 58.4 to 91.2 (0–100), while dispersion decreased from 9.7 to 5.8 across six iterations; a Greenhouse–Geisser-corrected repeated-measures ANOVA indicated significant within-student change. Parameter estimates show that higher-quality, evidence-grounded feedback is associated with larger next-step gains and faster convergence. Beyond performance, we engage the broader pedagogical question of what to value and how to assess in AI-rich settings: we elevate process and provenance—planning artifacts, tool-usage traces, test outcomes, and evidence citations—to first-class assessment signals, and outline defensible formats (trace-based walkthroughs and oral/code defenses) that our controller can instrument. We position this as a design model for feedback policy, complementary to state-estimation approaches such as knowledge tracing. We discuss implications for instrumentation, equity-aware metrics, reproducibility, and epistemically aligned rubrics. Limitations include the observational, single-course design; future work should test causal variants (e.g., stepped-wedge trials) and cross-domain generalization.
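For concreteness, the following is a minimal sketch of the three trajectory formulations summarized above, written in illustrative notation assumed here rather than taken from the paper: $s_t$ denotes the score at iteration $t$, $S^{*}$ the target score, $q_t$ the feedback quality at iteration $t$, and $\alpha$, $L$, $k$, $t_0$, $\beta_0$, $\beta_1$ are free parameters with $\varepsilon_t$, $u_t$ as error terms. The body of the paper defines the exact symbols and functional forms, which may differ.

\begin{align}
  % (i) Interpretable update rule: next-step gain scales with feedback quality and the remaining gap to the target.
  s_{t+1} &= s_t + \alpha\, q_t\,\bigl(S^{*} - s_t\bigr) + \varepsilon_t, \\
  % (ii) Logistic convergence: diminishing returns as scores approach the ceiling L.
  s_t &= \frac{L}{1 + e^{-k\,(t - t_0)}}, \\
  % (iii) Relative-gain regression: fraction of the gap closed per iteration as a function of feedback quality.
  \frac{s_{t+1} - s_t}{S^{*} - s_t} &= \beta_0 + \beta_1\, q_t + u_t.
\end{align}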