by Rubén Juárez, Antonio Hernández-Fernández and Claudia de Barros-Camargo

Reviewer 1: Anonymous
Reviewer 2: Tsuyoshi Usagawa
Reviewer 3: Ilya Levin

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The integration of an Agentic Retrieval-Augmented Generation (RAG) model into a dynamic feedback loop represents a worthwhile contribution to AIED research. The authors offer a sound theoretical underpinning, a clear methodological explanation, and adequate empirical evidence in a case study of a Concurrent Programming course. The formalization of the feedback loop as a linear-difference and logistic-convergence model is a particular strength, offering an intriguing and interpretable framework for learning trajectories.

Even though "Agentic RAG" is used throughout, there is no clear statement of what is "agentic" about the RAG system in this specific application. Planning, tool use, and self-criticism (from cited texts such as ReAct and Self-RAG) are mentioned but not defined for this deployment. My suggestion is that a paragraph in Section 3.2 explicitly define the usage-based features. For example: How was "planning" supported in generating feedback? What "tools" were used other than retrieval? What was the specific "self-criticism" process? A concrete example or an abridged prompt chain would be highly beneficial.

The contrast established between this work and the Knowledge Tracing (KT) literature (Section 2.4) is superficial. It is unclear why KT should concern itself only with "predictive fit" while this work concerns itself with "prescribing feedback policies." Modern KT models, especially those using deep learning, are sequential and would be appropriate to guide feedback policies. Please strengthen this section by making the novel contribution more explicit. The key difference is the interpretable and minimal update rule with explicit parameters (η_i, λ_i) relating feedback quality to learning advancement in an analytically tractable form (e.g., allowing the iteration-complexity bounds in Prop. 1). Emphasize that this is a design model for the feedback loop itself, rather than a state-estimation model of the student.

Comments on the Quality of English Language

This manuscript would benefit from a thorough proofread to polish its language and enhance clarity.

Author Response

We thank the reviewer for the careful reading and constructive suggestions. We have revised the manuscript substantially to address each point. Below we respond point-by-point and indicate exactly where changes were made.

In the new version of the manuscript, changes have not been highlighted (e.g., with a different text color) because a comprehensive editorial revision was performed, including a full English-language revision with more precise scientific terminology.

Comments 1: Agentic RAG is used throughout, but there is no clear statement of what is ‘agentic’ in this application.

Response 1. We now provide an explicit, operational definition of agentic RAG tailored to our deployment. Concretely, “agentic” refers to three controller-enforced, auditable capabilities: (i) planning (rubric-aligned task decomposition), (ii) tool use beyond retrieval (test runner; static/dynamic analyzers; rubric-check microservice; citation formatter), and (iii) self-critique (checklist verification prior to delivery). We also state this succinctly on first mention.

Where we revised the manuscript.

  • Abstract: added the operational definition (“operationalized through planning, tool use beyond retrieval, and self-critique”).
  • Introduction (§1): paragraph clarifying agentic RAG as used in this study (planning/tool-use/self-critique).
  • Theoretical Framework – Advanced AI for Personalized Feedback (§2.2): “Operational definition in this study” paragraph.
  • Materials and Methods – System Architecture (§3.2): new subsection “Usage-based agentic features (explicit and auditable)” and an abridged controller action trace (A1–A2/T1–T3/G1/V1/F). See also Algorithm 1 (Iterative Cycle) and Figure 2.

Comments 2: Define planning, tool use, and self-criticism for this deployment; include a concrete example or cut prompt chain.

Response 2. We added a usage-based, auditable definition and a concrete, abbreviated action trace showing the sequence of controllable steps (plan → retrieve → run tests/analyses → rubric check → draft → checklist self-critique → finalize with evidence citations). We list the actual tools used beyond retrieval and specify that inputs/outputs, versions, and evidence IDs are logged.
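
For readers of this response, a minimal sketch of such a controller-enforced cycle is given below. It is illustrative only: the names (Trace, feedback_cycle, and the llm/retriever/tool interfaces) are hypothetical stand-ins for the components documented in §3.2, not the project's actual code, and the assumed language is Python.

```python
# Illustrative sketch of the controller-enforced agentic cycle described in §3.2.
# All names are hypothetical; the real system's interfaces may differ.
from dataclasses import dataclass, field

@dataclass
class Trace:
    """Auditable log of controller actions (inputs, outputs, evidence IDs)."""
    steps: list = field(default_factory=list)

    def log(self, action, payload):
        self.steps.append({"action": action, "payload": payload})

def feedback_cycle(submission, rubric, retriever, llm, tools, max_revisions=1):
    trace = Trace()

    # (i) Planning: rubric-aligned task decomposition (A1-A2).
    plan = llm.plan(submission, rubric)
    trace.log("plan", plan)

    # (ii) Tool use beyond retrieval (T1-T3): evidence retrieval, test runner,
    # static/dynamic analyzers, rubric-check microservice.
    evidence = retriever.search(plan)
    trace.log("retrieve", [e.id for e in evidence])
    reports = {name: tool.run(submission) for name, tool in tools.items()}
    trace.log("analyze", reports)

    # Draft feedback grounded in the collected evidence (G1).
    draft = llm.draft_feedback(plan, evidence, reports, rubric)
    trace.log("draft", draft)

    # (iii) Self-critique: checklist verification, bounded to one revision (V1).
    for _ in range(max_revisions):
        issues = llm.checklist_review(draft, rubric, evidence)
        trace.log("self_critique", issues)
        if not issues:
            break
        draft = llm.revise(draft, issues)
        trace.log("revise", draft)

    # Finalize with evidence citations (F); the trace is stored for audit.
    return llm.attach_citations(draft, evidence), trace
```

The point of the sketch is that each agentic capability appears as an explicit, logged controller action rather than an opaque model behaviour, which is what makes the features auditable.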

Where we revised the manuscript.

  • §3.2 System Architecture:
    • “Usage-based agentic features (explicit and auditable)”, with bullet-level definitions of planning, tool use (tests; static/dynamic analyzers; rubric checker; citation formatter), and self-critique (checklist, bounded to one revision).
    • “Controller-enforced action trace (abridged)”, showing the S/C/A/T/G/V/F steps.
  • Algorithms & Figures: Algorithm 1 (cycle), Algorithm 2 (FQI computation), Figure 2 (end-to-end workflow).

Comments 3: Contrast with Knowledge Tracing (KT) is superficial; make the novel contribution explicit. Modern KT can guide feedback policies.

Response 3. We clarified the novelty and complementarity. Our work is a policy-level design model for the feedback loop itself, with explicit update mechanics and interpretable parameters (η, λ) that admit iteration-complexity and stability results (Propositions/Corollary). By contrast, KT focuses on state estimation optimized for predictive fit. We now explain how KT can still inform targeting (e.g., modulating scaffolding to raise F_{i,t}) while our update rules preserve analytical tractability and prescriptive guarantees.

Where we revised the manuscript.

  • Theoretical Framework – Relation to Knowledge Tracing (§2.4): strengthened contrast; added a Complementarity paragraph describing how KT signals can modulate F_{i,t} without changing our analytic guarantees.
  • Introduction (§1): explicitly position our approach as policy-level design vs state estimation.
  • Discussion (§4.3): “On Knowledge Tracing (KT)” reiterates the distinction and the role of KT for targeting.
  • Conclusions (§5): restate the design-model contribution and its guarantees.

Comments 4: Emphasize the interpretable, minimal update rule with explicit parameters (η_i, λ_i) and that this is a design model for the loop (iteration complexity in Prop. 1).

Response 4. We emphasized throughout that our contribution is a design model with interpretable parameters linking feedback quality to learning advancement. We highlight the iteration-complexity bound (Prop. 1), the stability condition λF < 2 (Prop. 2), the variance-contraction prediction (Cor. 1), and a practical half-life rule of thumb. These analytic properties are now explicitly tied to the empirical findings.
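
To illustrate the kind of behaviour these propositions describe, the short simulation below uses one plausible instantiation of the two update rules: a linear-difference rule S_{t+1} = S_t + ηF(S_max − S_t) and a logistic rule S_{t+1} = S_t + λF·S_t(1 − S_t/S_max). These functional forms, the ceiling S_MAX, and the parameter values are assumptions made for this response (the exact definitions appear in §2.3). Under these forms, the gap to the ceiling shrinks by a factor (1 − ηF) per iteration, which yields the half-life rule of thumb, and the ceiling is a stable fixed point of the logistic rule when 0 < λF < 2, consistent with the condition cited above.

```python
# Hypothetical simulation of the two feedback-update rules with interpretable
# parameters (eta, lambda); functional forms and values are illustrative only.
import math

S_MAX = 100.0  # assumed rubric ceiling

def linear_update(s, eta, f):
    """Linear-difference rule: close a fraction eta*f of the remaining gap."""
    return s + eta * f * (S_MAX - s)

def logistic_update(s, lam, f):
    """Logistic rule: growth proportional to current level and remaining headroom."""
    return s + lam * f * s * (1.0 - s / S_MAX)

def simulate(update, s0, rate, f, n_iter=6):
    traj = [s0]
    for _ in range(n_iter):
        traj.append(update(traj[-1], rate, f))
    return traj

if __name__ == "__main__":
    eta, lam, f = 0.35, 0.8, 0.9  # illustrative values only
    print("linear  :", [round(s, 1) for s in simulate(linear_update, 55.0, eta, f)])
    print("logistic:", [round(s, 1) for s in simulate(logistic_update, 55.0, lam, f)])

    # Geometric convergence of the linear rule: the gap shrinks by (1 - eta*f)
    # per iteration, so the half-life (iterations needed to halve the gap) is:
    half_life = math.log(2) / -math.log(1.0 - eta * f)
    print("half-life (iterations):", round(half_life, 2))

    # Stability of the logistic rule requires lam * f < 2 (the cited condition).
    assert 0.0 < lam * f < 2.0
```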

Where we revised the manuscript.

  • Theoretical Framework – Mathematical Modeling (§2.3):
    • Proposition 1 (monotonicity, geometric convergence, iteration complexity).
    • Proposition 2 (stability and monotone convergence for logistic update).
    • Corollary 1 (cohort variance contraction) and half-life rule-of-thumb note.
  • Introduction (§1): “policy-level algorithmic process” with parameters (η, λ) and reference to complexity/stability.
  • Results (§3): connect estimates to the stability condition (λF < 2) and variance contraction.
  • Discussion (§4.2, §4.3) and Conclusions (§5): explicit “design-model” language and cross-references to Props./Cor.

English Language

Response. We have thoroughly proofread the manuscript to polish language and improve clarity. Revisions include: streamlining long sentences; harmonizing terminology and notation; correcting hyphenation and punctuation; improving figure/table captions; and ensuring consistent tense and style. We also disclose in §3.6 “Statement on Generative AI Use” that ChatGPT was used only for language editing; all technical content and analyses are the sole responsibility of the authors.

Reviewer 2 Report

Comments and Suggestions for Authors

Thank you for submitting a very attractive article to Algorithms.

I would like to ask you to consider the following points to improve the readability.

  1. Details of the study design should be provided regarding the “students’ submissions.”

    I think that there are six different tasks to submit in the course, indexed by “t” (iteration). If my understanding is correct, according to Figure 1, students are asked to revise the submission, but there is no evaluation between “5. Student action (revision)” and “6. Next performance, S_{i,t+1}.” Also, in the “3.3 Dynamic Assessment Cycle” section, phase “5. Revision” says that students “prepare the next submission,” where “the next submission” means the submission for the next task.

  2. If my understanding is not correct, and the iteration is performed on the same task with the revision resubmitted up to six times, the reason why the number of iterations is limited to six should be explained.

  3. According to Figure 4, S_6 is above 90; thus more than two of the four rubric criteria are at level 5 and some are not. Is it possible to discuss this by criterion and task?

  4. According to lines 294-286, it would be better to provide the number of students in each iteration, because the total number of students, 35, is small.

  5. Minor points:
    • There are some overlaps on the right-hand side of Figure 2.
    • Line 244, typo as \eqref{}.

Author Response

Response to Reviewer

We thank you for your careful reading and constructive suggestions to improve the paper’s clarity and readability. Below we respond point-by-point and indicate the exact revisions we made. All changes preserve the contributions requested by Reviewers 1 and 3 (formal modeling, reproducibility, and engagement with broader pedagogical questions), and do not alter our results or conclusions.

Comments 1. Study design details regarding the “students’ submission” and where evaluation occurs

Reviewer’s point. The iteration index t appears to denote six submissions (tasks). Figure 1 suggests students “revise” after feedback, but there is no evaluation between “5. Student action (revision)” and “6. Next performance, S_{i,t+1}.” Section 3.3 also says that students prepare “the next submission,” which could mean the next task.

Response 1. We have clarified the iteration semantics and the delivery schedule:

  • New paragraph added in Methods §3.1 (Overview and Study Design), explicitly titled “Iteration semantics and delivery schedule,” stating that each iteration t corresponds to a distinct syllabus-aligned assignment (six in total). Students submit at iteration t, receive S_{i,t} plus feedback, and use that feedback to prepare the next submission at t+1. There is no within-iteration re-evaluation.
  • Methods §3.3 (Dynamic Assessment Cycle), Step 5 (Revision) now states explicitly that feedback from iteration t is applied to prepare the next submission at t+1, and that S_{i,t+1} is assigned on the subsequent task.
  • Figure 1 caption now clarifies that the “Revision” step prepares the submission for the next iteration and that S_{i,t+1} is measured on the subsequent task (no within-iteration grading).

Net effect. Readers can now see at a glance that (i) there are six distinct tasks, (ii) revision happens between tasks, and (iii) S_{i,t+1} is evaluated at the next scheduled task.

Comments 2. If the same task were revised up to six times, explain why the number of iterations is limited to six

Reviewer’s point. If the iteration is a resubmission of the same task, please justify the number “6.”

Response 2. We confirmed and made explicit (see edits above) that the study uses six distinct, syllabus-aligned assignments (not repeated resubmissions of a single task). The number six reflects the course’s assessment windows. No additional “cap” rationale is required.

Comments 3. Endline performance above 90 (Figure 4) and criterion-by-task discussion (rubric profile)

Reviewer’s point. With S_6 > 90, more than two of the four rubric criteria may be at level 5, while others may not. Please discuss by criteria and task.

Response 3. We added a concise, criterion-by-task analysis:

  • New paragraph in Results (right after the per-iteration descriptives) titled “Criterion-by-task rubric profile at endline,” explaining that at t = 6, Accuracy and Actionability most often reach level 5 across tasks, while Clarity and Relevance show strong but more task-contingent gains.
  • New Supplementary Table S3 reports the criterion-by-task breakdown at endline and is referenced in both Results and Discussion (epistemic implications).

This directly addresses your request to discuss the results by criterion and task.

Comments 4. Provide the number of students at each iteration (small cohort)

Reviewer’s point. Given the small total n = 35, please report the number of students contributing data at each iteration.

Response 4. We now report per-iteration counts:

  • New Table (Results): “Number of students contributing non-missing S_{i,t} per iteration,” placed immediately after Table “Descriptive statistics by iteration.”
  • We also added cross-references to this new table in Methods §3.1 and Methods §3.6 (Preprocessing and missing data) to guide readers to per-iteration sample sizes.

(For this cohort, all six iterations have n_t = 35; if a later version updates counts, the table will reflect those values.)

Minor points

  • A) Figure 2 right-side overlaps.
    We adjusted the legend placement in the pgfplots settings to avoid right-side overlaps (moved to the top-left or below the axis, depending on layout), ensuring no visual occlusion.
  • B) Typo “\eqref{}”.
    We conducted a project-wide search and fixed the stray empty \eqref{}. All equation references now point to valid labels.

Closing

We appreciate your suggestions; they substantially improved the paper’s readability. The clarifications to submission/iteration semantics, the addition of per-iteration counts, the criterion-by-task discussion, and the figure/typo fixes address all your comments while preserving the methodological and pedagogical enhancements requested by the other reviewers. The results, interpretations, and conclusions remain unchanged.

Reviewer 3 Report

Comments and Suggestions for Authors

This paper makes a meaningful and well-timed contribution to the algorithmic conceptualization of AI-supported assessment in education. The authors offer a structured formalization of the feedback loop as an explicit, stepwise algorithm, thereby creating an interpretable framework for analyzing learning dynamics in a systematic way.

A notable strength of the study lies in its empirical demonstration that high-quality, evidence-based feedback generated by a generative AI engine can lead not only to an overall improvement in learner performance but also to a reduction in disparities across a cohort. This finding resonates with the authors’ theoretical claims and provides persuasive support for the idea that iterative, well-calibrated feedback contributes to both learning gains and equity.

The practical relevance of this work is considerable. It provides a foundation for designing scalable, feedback-driven EdTech systems that can support learning at both individual and institutional levels. At the same time, the paper would benefit from a deeper engagement with the broader educational questions raised by the integration of generative AI into assessment practices.
The core pedagogical issue is no longer limited to how feedback can refine traditional modes of evaluation. It increasingly concerns what kinds of knowledge and understanding should be valued and how they should be assessed in contexts where AI systems are already capable of producing sophisticated, high-quality academic outputs. Expanding the discussion in this direction would strengthen the paper’s theoretical grounding and align it more closely with the current debates surrounding AI-mediated higher education.

Author Response

We are grateful for the reviewer’s thoughtful and constructive feedback. We respond point-by-point and summarize the concrete revisions.

In the new version of the manuscript, changes have not been highlighted (e.g., with a different text color) because a comprehensive editorial revision was performed, including a full English-language revision with more precise scientific terminology.

Comment 1: Overall evaluation (algorithmic conceptualization, empirical equity, practical relevance). The paper provides a timely, interpretable, stepwise formalization; empirically shows that high-quality, evidence-based feedback improves means and reduces disparities; and is practically relevant for scalable EdTech.

Response 1: Thank you. We have reinforced the link between the formal results and the empirical findings, and we surfaced the practitioner implications more clearly:

  • We explicitly frame our update rules as policy-level design with iteration-complexity and stability guarantees (Propositions 1–2).
  • We highlight equity metrics and tail behavior in the Results (dispersion, inter-decile spread, proficiency-tail analysis).
  • We provide an actionable implementation roadmap (instrumentation, equity-aware monitoring, reproducibility assets) in the Discussion.

Comment 2: Main request: Deeper engagement with broader educational questions (what to value and how to assess it when AI can produce high-quality outputs). Move beyond “refining feedback” to address which forms of knowledge and understanding should be valued and how they should be assessed in AI-mediated contexts.

Response 2: What we changed:

  1. Introduction — We give an operational and auditable definition of agentic RAG:
    (a) planning (rubric-aligned task decomposition),
    (b) tool use beyond retrieval (tests, static/dynamic analyzers, rubric checker), and
    (c) self-critique (checklist verification).
    This shifts emphasis from product-only scoring to process/epistemic competencies that can be measured.
  2. Theoretical Framework — We add an “Operational definition in this study” paragraph: the controller enforces the agentic steps as explicit actions (not opaque traces), producing auditable evidence (plans, analyzer results, rubric-coverage scores, citations). These artifacts let instructors value and assess epistemic practices (planning quality, evidence use, verification discipline) in addition to final performance.
  3. Materials and Methods — We document usage-based agentic features and include an abridged controller action trace (plan → retrieve → tests/analysis → rubric check → draft → self-critique → finalize with citations). This establishes concrete provenance artifacts for assessment and programmatic audit.
  4. Discussion — We add design levers that optimize equity and process, not only means: instrument the loop, systematically raise feedback quality, monitor dispersion and tails, and personalize pacing using predicted relative gains.
  5. Conclusions — We articulate a multi-objective policy view: maximize expected progress (via the relative-gain effect and logistic rate) subject to equity constraints (dispersion, tail mass) and minimum process thresholds (planning and provenance quality). We note the Feedback Quality Index can be extended with “Epistemic Alignment” and “Provenance Completeness,” both measurable from logs.
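
To make concrete how such an extended index could be computed from the controller logs, a minimal sketch follows. The criterion names, the weights, and the 1-to-5 scale are assumptions made for this response and are not the manuscript's exact definition of the Feedback Quality Index; the two proposed extensions are scored from log-derived evidence.

```python
# Hypothetical computation of an extended Feedback Quality Index (FQI) from
# controller logs; criteria, weights, and scales are illustrative assumptions.

# Four rubric criteria scored 1-5 by the rubric-check step, plus the two
# proposed log-derived extensions scored on the same scale.
WEIGHTS = {
    "accuracy": 0.25,
    "clarity": 0.20,
    "relevance": 0.20,
    "actionability": 0.20,
    "epistemic_alignment": 0.10,      # proposed extension
    "provenance_completeness": 0.05,  # proposed extension
}

def fqi(scores: dict) -> float:
    """Weighted average of criterion scores, normalized to [0, 1]."""
    total = sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
    return total / (5.0 * sum(WEIGHTS.values()))

example = {
    "accuracy": 5, "clarity": 4, "relevance": 4, "actionability": 5,
    "epistemic_alignment": 4, "provenance_completeness": 5,
}
print(round(fqi(example), 3))  # prints 0.9 with these illustrative weights
```

Normalizing to [0, 1] keeps the index usable as the bounded feedback-quality signal F_{i,t} that the update rules consume.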

Result: The manuscript now addresses “what to value” (epistemic/process competencies alongside product quality) and “how to assess it” (controller logs and provenance), aligning with ongoing debates on AI-mediated higher education.

Comment 3: Point 1 (algorithmic, interpretable formalization)

Response: We clarify that our models are design policies, not latent-state estimators. We keep the emphasis on interpretable parameters (η, λ) and make explicit the convergence/stability implications (Propositions 1–2), which are referenced from the Introduction and Theoretical Framework.

Comment 4: Point 2 (empirical evidence of improved means and reduced disparities)

Response: We retain and foreground the equity-relevant results in the Results section:

  • Standard deviation falls by ~40%, inter-decile spread contracts, and the proficiency tail collapses across iterations.
  • We explicitly connect these outcomes to the variance-contraction mechanism stated in our theory (Corollary on variance).
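
The variance-contraction mechanism can be sketched in one line under an assumed linear-difference update S_{i,t+1} = S_{i,t} + ηF(S_max − S_{i,t}); this is an illustration consistent with the Corollary, not a restatement of it. If the product ηF is roughly common across the cohort, every student's gap to the ceiling shrinks by the same factor, so the cohort variance contracts geometrically:

```latex
% Sketch under the assumed linear rule S_{i,t+1} = S_{i,t} + \eta F (S_{\max} - S_{i,t}),
% with a roughly common product \eta F across students i.
\begin{align*}
  S_{\max} - S_{i,t+1} &= (1 - \eta F)\,(S_{\max} - S_{i,t}), \\
  \operatorname{Var}_i(S_{i,t+1}) &= (1 - \eta F)^2 \operatorname{Var}_i(S_{i,t})
    < \operatorname{Var}_i(S_{i,t}) \quad \text{whenever } 0 < \eta F < 2 .
\end{align*}
```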

Comment 5: Point 3 (practical relevance; broaden educational analysis)

Response: We expand practice-facing guidance and the educational discussion by:

  • Treating process/provenance as first-class assessment targets, enabled by the agentic controller and logs (Methods, Discussion).
  • Positioning real deployments as auditable pipelines with reproducibility assets (prompt templates, tests, configuration files).
  • Framing deployment as multi-objective optimization (learning gains + equity + process thresholds) in the Conclusions.

Comment 6: Language polishing

We performed a global edit for clarity and consistency of terminology across the revised sections.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for the careful revisions. The current version is appropriate, and I have no further major comments.

Comments on the Quality of English Language

This manuscript would benefit from a thorough proofread to polish its language and enhance clarity.