Peer-Review Record

ExamQ-Gen: Instructor-in-the-Loop Generation of Self-Contained Exam Questions from Course Materials and Decision-Support Grading

Computers 2026, 15(3), 177; https://doi.org/10.3390/computers15030177
by Catalin Anghel 1,*, Emilia Pecheanu 1, Andreea Alexandra Anghel 2, Marian Viorel Craciun 1 and Adina Cocu 1,*
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 28 January 2026 / Revised: 3 March 2026 / Accepted: 5 March 2026 / Published: 9 March 2026

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors
  1. Ambiguity of the assessment criteria - it is necessary to provide a detailed assessment rubric according to which the instructor awarded points.
  2. Examples of responses and relevant ratings should be provided for reproducibility.
  3. Insufficient discussion of practical relevance - it is unclear exactly how teachers should use the research results to improve their practice. A section explaining this issue should be added.
  4. More nuanced definitions of "sustainable issues" and "distributional shifts" need to be developed.

Author Response

REVIEWER 1

  1. Ambiguity of the assessment criteria - it is necessary to provide a detailed assessment rubric according to which the instructor awarded points.

We thank the reviewer for this suggestion. We have clarified the assessment criteria by adding an explicit instructor grading rubric for the 1–10 scale. The revised manuscript now specifies how points were awarded (by comparison to the reference solution) and how the score relates to the categorical correctness label (correct/partial/incorrect) and to the operational pass indicator (exam_point; pass if score ≥ 9).

  2. Examples of responses and relevant ratings should be provided for reproducibility.

We thank the reviewer for this suggestion. We added a table with representative graded examples (question, reference solution, model answer, and instructor rating) to support reproducibility and to illustrate the application of the instructor rubric across both instructor-authored (HUMAN) and pipeline-generated (PIPELINE) items. The examples include correct, partially correct, and incorrect cases; text is abbreviated for readability.

  3. Insufficient discussion of practical relevance - it is unclear exactly how teachers should use the research results to improve their practice. A section explaining this issue should be added.

We thank the reviewer for this suggestion. We expanded the Discussion to clarify how instructors can use the results in practice, including (i) positioning LLM-based grading as decision support under strict pass policies, (ii) pre-validating pipeline-generated questions before deployment, (iii) using the grader for triage and feedback with instructor review near the pass boundary, and (iv) monitoring severe false-fail regimes (e.g., credit denial) alongside aggregate agreement.
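Point (iii) above could be operationalized along the following lines. This is a minimal sketch, not the authors' implementation: the function name `triage`, the width of the review band (`margin`), and the routing labels are assumptions introduced for illustration; only the pass threshold of 9 is taken from the response.

```python
def triage(grader_score: int, threshold: int = 9, margin: int = 1) -> str:
    """Route one automatically graded answer.

    Scores in the band [threshold - margin, threshold] go to the instructor;
    scores above the band pass automatically and scores below it fail.
    The band width is a hypothetical parameter, not a value from the study.
    """
    if threshold - margin <= grader_score <= threshold:
        return "review"
    return "pass" if grader_score > threshold else "fail"


# Route every possible score on the 1-10 scale.
routes = {score: triage(score) for score in range(1, 11)}
print(routes)
```

A wider `margin` sends more answers to the instructor, trading review effort against the risk of an unreviewed false fail.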

  4. More nuanced definitions of "sustainable issues" and "distributional shifts" need to be developed.

We thank the reviewer for this suggestion. We clarified the terminology by providing explicit definitions of distributional shift (a workflow-induced change in the distribution of exam artifacts, operationalized here via HUMAN vs. PIPELINE) and sustainable issues (systematic, deployment-relevant failure modes that persist under repeated use rather than isolated outliers), and we aligned the manuscript’s discussion with these definitions.

Reviewer 2 Report

Comments and Suggestions for Authors

The paper is highly relevant and discusses the current state of AI applications in education. Here are some proposals for improvements:

  1. The research uses only two students (LLMs), which prevents generalisation at the model level. You should expand the number of LLMs, or make the conclusions less LLM-general and more precise. It would be useful to include GPT-4, Claude, Gemini.
  2. The research uses only one course and one teacher, which further limits generalisation. Here again, you should state the conclusions less generally and focus on the actual context in which the research was conducted.
  3. The passing criterion is very high (>=9/10). How do you argue this? Can you perform a sensitivity analysis by decreasing the passing criterion, which is common in many systems (e.g., 50% or 60%)?
  4. Human grading is highly subjective because it is based on a single person's judgment. Maybe the professor is very strict or very mild. You should introduce additional teachers and also calculate inter-rater agreement.
  5. The AI-generated questions (PIPELINE) are not validated. If they are too AI-friendly, could this be why Mistral performs better? Could the questions be poorly designed?
  6. In grading, the distribution is not normal; it is mostly binary, even though the scale is 1-10. It is questionable whether you really have a 1-10 scale or whether it is practically binary. Can you present examples of questions and the criteria (rubric) for intermediate scores on the scale?
  7. The current AI grading tool is misleading - you should use a stronger grader model or rubric-based prompting.

 

Author Response

REVIEWER 2

The paper is highly relevant and discusses the current state of AI applications in education. Here are some proposals for improvements:

1. The research uses only two students (LLMs), which prevents generalisation at the model level. You should expand the number of LLMs, or make the conclusions less LLM-general and more precise. It would be useful to include GPT-4, Claude, Gemini.

We thank the reviewer for this suggestion. We agree that using two student LLMs limits model-level generalization; accordingly, we revised the manuscript to avoid LLM-general claims and to state model-level conclusions specifically for the two evaluated student models (Llama3-8B-Instruct and Mistral-7B-Instruct), with wording harmonized throughout the text (research questions, contributions, and conclusions). We also note that extending the comparison to additional student models, including strong proprietary systems (e.g., GPT-4, Claude, Gemini), is an important direction for future work.

2. The research uses only one course and one teacher, which further limits generalisation. Here again, you should state the conclusions less generally and focus on the actual context in which the research was conducted.

We thank the reviewer for this suggestion. We agree that the study context (a single course and a single instructor reference) limits context-level generalization; accordingly, we revised the manuscript to avoid broad claims about courses or teachers and to state conclusions specifically for the studied IA2 exam setting.

3. The passing criterion is very high (>=9/10). How do you argue this? Can you perform a sensitivity analysis by decreasing the passing criterion, which is common in many systems (e.g., 50% or 60%)?

We thank the reviewer for raising this point. The ≥ 9/10 pass criterion reflects the local IA2 examination policy for technical, reference-solution–anchored items, and our primary decision-level analysis is therefore reported under this operational policy. To address sensitivity, we recomputed pass/fail outcomes at lower thresholds (t = 6 and t = 5) using the same rule for both instructor and grader scores and report the resulting false-fail/false-pass trade-off in Table 13.
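The recomputation described above can be illustrated as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function name `pass_fail_tradeoff`, the dictionary layout, and the score pairs are hypothetical, and the toy data are invented rather than drawn from Table 13. The instructor score is treated as the reference, and the same rule (score ≥ t) is applied to both scores.

```python
from typing import Dict, Iterable, Tuple


def pass_fail_tradeoff(pairs: Iterable[Tuple[int, int]], threshold: int) -> Dict[str, int]:
    """Count grader false fails and false passes at one pass threshold.

    pairs: (instructor_score, grader_score) on the 1-10 scale.
    A false fail is an answer the instructor passes but the grader fails;
    a false pass is the reverse.
    """
    false_fail = false_pass = 0
    for instructor, grader in pairs:
        ref_pass = instructor >= threshold
        grader_pass = grader >= threshold
        if ref_pass and not grader_pass:
            false_fail += 1
        elif grader_pass and not ref_pass:
            false_pass += 1
    return {"threshold": threshold, "false_fail": false_fail, "false_pass": false_pass}


# Hypothetical score pairs, invented for illustration only (not study data).
toy_pairs = [(10, 10), (10, 7), (9, 5), (6, 9), (1, 1), (5, 6)]

# Operational threshold (9) followed by the two lower thresholds (6 and 5).
results = [pass_fail_tradeoff(toy_pairs, t) for t in (9, 6, 5)]
for row in results:
    print(row)
```

Applying one rule to both score columns keeps the comparison symmetric, so any change in the false-fail/false-pass balance is attributable to the threshold alone.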

4. Human grading is highly subjective because it is based on a single person's judgment. Maybe the professor is very strict or very mild. You should introduce additional teachers and also calculate inter-rater agreement.

We thank the reviewer for this suggestion. We agree that using a single-instructor reference limits rater-level generalization; accordingly, we clarified this as an explicit limitation and framed the human grades as instructor-specific ground truth for the studied IA2 setting. We also note that the technical, reference-solution–anchored format and the explicit grading rubric constrain subjectivity, although they do not eliminate potential instructor-specific strictness or leniency.

5. The AI-generated questions (PIPELINE) are not validated. If they are too AI-friendly, could this be why Mistral performs better? Could the questions be poorly designed?

We thank the reviewer for this comment. We clarified the PIPELINE generation workflow by describing the concrete checks implemented in the pipeline (prompt-level self-check and a post-generation numeric-literal consistency check with manual-review flagging) and by avoiding wording that could be interpreted as full validation. We also emphasize that HUMAN vs. PIPELINE is treated as a workflow-induced condition shift in our study, and performance differences across these sources should be interpreted within this context rather than as model-general superiority.
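The response does not specify how the numeric-literal consistency check is implemented. One plausible reading, shown here purely as an assumption, is that every number appearing in a generated question should also appear in the course material it was generated from, with any mismatch flagged for manual review. The names `numeric_literals` and `flag_for_review`, the regular expression, and the example texts are all hypothetical.

```python
import re

# Matches plain integers and decimals such as "3" or "0.01".
# Signs, units, and scientific notation are deliberately ignored in this sketch.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numeric_literals(text: str) -> set:
    """Return the set of numeric literals found in the text, as strings."""
    return set(NUMBER.findall(text))


def flag_for_review(question: str, source_text: str) -> list:
    """Return the numeric literals in the question that are absent from the source.

    An empty list means the check passes; a non-empty list marks the item
    for manual review (assumed behaviour, not confirmed by the manuscript).
    """
    return sorted(numeric_literals(question) - numeric_literals(source_text))


# Invented example texts, for illustration only.
source = "The network has 3 layers and a learning rate of 0.01."
question = "A network with 3 layers uses a learning rate of 0.1. Describe one training step."

unmatched = flag_for_review(question, source)
print(unmatched)
```

Comparing literals as strings is a simplification: it treats "0.10" and "0.1" as different values, which errs on the side of flagging more items for review.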

6. In grading, the distribution is not normal; it is mostly binary, even though the scale is 1-10. It is questionable whether you really have a 1-10 scale or whether it is practically binary. Can you present examples of questions and the criteria (rubric) for intermediate scores on the scale?

We thank the reviewer for raising this point. We clarified that the 1–10 scale is operational, but the observed endpoint-heavy distribution arises because the items are technical and reference-solution–anchored, so partially correct answers are relatively rare. We also explicitly described how intermediate scores are assigned under the instructor rubric and provided representative intermediate-score examples in the revised manuscript.

7. The current AI grading tool is misleading - you should use a stronger grader model or rubric-based prompting.

We thank the reviewer for this comment. We clarified that the automatic grader is used as a rubric-guided, deterministic decision-support baseline and sanity check, while all primary analyses use the instructor reference as ground truth. We also explicitly state that the grader prompt encodes rubric guidance (comparison to the reference solution, numerical/key-concept checks) and enforces a structured output format, addressing the concern about misleading grading.

Reviewer 3 Report

Comments and Suggestions for Authors

This manuscript addresses a timely question (or two): how large language models behave when treated as “virtual students” under realistic exam conditions, and how automatic grading aligns with instructor judgment in threshold-based pass/fail settings. The instructor-in-the-loop framing and the comparison between HUMAN (instructor-authored) and PIPELINE (automatically generated) question sources represent a meaningful contribution. The study is particularly strong in its use of operational grading policies, its disaggregation of score-level and decision-level outcomes, and its identification of asymmetric false-fail behavior in automatic grading. These elements make the work relevant also for real-world educational deployment.

The manuscript would benefit from clearer positioning, particularly in the Introduction and Related Work sections. The literature review is comprehensive and extensive but somewhat dilutes the central contribution between LLM-as-judge frameworks, question generation systems, and LLM-as-exam-taker evaluations.

Methodologically, the overall design is sound, but several aspects require further clarification to strengthen transparency and reproducibility. The study relies on a single course and a single instructor as the ground-truth grader; a more explicit discussion of external validity and potential instructor bias would be valuable. Additional detail on prompt design, decoding parameters for the student models, filtering of pipeline-generated questions, and any manual post-processing would also improve replicability. The strongly polarized score distributions (predominantly 1 or 10) and the strict pass threshold (≥ 9) warrant further justification, as they substantially shape the interpretation of pass/fail outcomes.

The Results section is statistically thorough but it is also lengthy and, therefore, could be consolidated to improve readability. Some figures might be merged or moved to supplementary material, and a brief, explicit threats-to-validity subsection would help contextualize the findings. The conclusions are largely supported by the data.

With clearer framing, additional methodological detail, and some condensation of the presentation, this manuscript has the potential to make a good contribution to research on LLM-based assessment and educational benchmarking.

Author Response

REVIEWER 3

1. This manuscript addresses a timely question (or two): how large language models behave when treated as “virtual students” under realistic exam conditions, and how automatic grading aligns with instructor judgment in threshold-based pass/fail settings. The instructor-in-the-loop framing and the comparison between HUMAN (instructor-authored) and PIPELINE (automatically generated) question sources represent a meaningful contribution. The study is particularly strong in its use of operational grading policies, its disaggregation of score-level and decision-level outcomes, and its identification of asymmetric false-fail behavior in automatic grading. These elements make the work relevant also for real-world educational deployment.

We thank the reviewer for this encouraging assessment of our work. We appreciate the recognition of the instructor-in-the-loop framing, the HUMAN vs. PIPELINE comparison, and the operational, threshold-based decision analysis, and we are pleased that the reviewer considers these aspects relevant for real-world educational deployment.

2. The manuscript would benefit from clearer positioning, particularly in the Introduction and Related Work sections. The literature review is comprehensive and extensive but somewhat dilutes the central contribution between LLM-as-judge frameworks, question generation systems, and LLM-as-exam-taker evaluations.

We thank the reviewer for this comment. We improved the positioning in the Introduction/Related Work by explicitly separating the three relevant research threads and by adding a focused research-gap statement that makes the manuscript’s central contribution and scope clearer, thereby avoiding conflation between LLM-as-judge, question generation, and LLM-as-exam-taker evaluations.

3. Methodologically, the overall design is sound, but several aspects require further clarification to strengthen transparency and reproducibility. The study relies on a single course and a single instructor as the ground-truth grader; a more explicit discussion of external validity and potential instructor bias would be valuable. Additional detail on prompt design, decoding parameters for the student models, filtering of pipeline-generated questions, and any manual post-processing would also improve replicability. The strongly polarized score distributions (predominantly 1 or 10) and the strict pass threshold (≥ 9) warrant further justification, as they substantially shape the interpretation of pass/fail outcomes.

We thank the reviewer for these suggestions. We strengthened transparency and reproducibility by (i) making external-validity constraints explicit (single course and single-instructor reference) and clarifying the potential for instructor-specific bias, (ii) reporting the key decoding parameters used for the student models, (iii) clarifying the concrete checks applied to pipeline-generated questions and stating that the PIPELINE set was used as generated without additional filtering or manual post-processing, and (iv) justifying the strict pass policy and documenting threshold sensitivity via an additional analysis at lower pass thresholds.

4. The Results section is statistically thorough but it is also lengthy and, therefore, could be consolidated to improve readability. Some figures might be merged or moved to supplementary material, and a brief, explicit threats-to-validity subsection would help contextualize the findings. The conclusions are largely supported by the data.

We thank the reviewer for this suggestion. To improve readability, we consolidated the Results by removing redundant visualizations while keeping the key figures that directly support the main claims. In particular, figures that duplicated information already reported in tables were removed, and the narrative was tightened to reduce repetition while preserving the statistical support for the conclusions.

5. With clearer framing, additional methodological detail, and some condensation of the presentation, this manuscript has the potential to make a good contribution to research on LLM-based assessment and educational benchmarking.

We thank the reviewer for this encouraging summary. Following these suggestions, we clarified the framing in the Introduction/Related Work, added methodological details to strengthen reproducibility, and condensed the Results presentation by removing redundant elements while retaining the key evidence supporting our conclusions.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors took all the comments into account; the work does not require additional corrections.
