Article
Peer-Review Record

AI-Assisted Exam Variant Generation: A Human-in-the-Loop Framework for Automatic Item Creation

Educ. Sci. 2025, 15(8), 1029; https://doi.org/10.3390/educsci15081029
by Charles MacDonald Burke
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 23 June 2025 / Revised: 6 August 2025 / Accepted: 9 August 2025 / Published: 11 August 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for the opportunity to review this manuscript. The author presents an interesting description of what appears to have been a rigorous process of using chatbots to develop test items in ways that allow for the creation of multiple forms of a standardized test. The paper and its purpose are clear, and the framework for generating items seems to be well supported and grounded in relevant literature. While I do think it could be useful to actually examine some of the results from the exams that were implemented, I also recognize that this might be beyond the scope of the present work. Still, it would strengthen the impact of the framework if there were some evidence that these exams performed at least as well as previous exams that were fully human-generated. There are some good lessons learned and solid recommendations, but they could be even more convincing with some evidence that the approach was effective in meaningfully assessing student outcomes.

Line 60. The author uses the word "understand" to describe what LLMs can do with natural language, but I would caution against that word choice; "interpret" might be more appropriate. LLMs do not "understand" in the way that we think about it more traditionally, and I think it is helpful to delineate what LLMs do that *resembles* thinking as separate and distinct from what humans do when it comes to the word understanding.

The process of developing items and refining those items is meticulously documented and helps to provide a model for other educators working on this process.

If the instructor comes to the test with their own biases, wouldn't these items still be susceptible to the bias that is "baked in" to the instructor?

The checklists on line 628 seem to be useful for these specific classes, but I suspect they might differ for, say, STEM courses. I am not saying you need to make the checklist universal, but it might be useful to acknowledge that these questions were effective in your context but might differ in other fields.

One other question I have is about when it might be appropriate to use these kinds of tools in the arc of one's career. What I mean is that the instructor in this case study likely built up his or her expertise WITHOUT assistance from AI, and so in this case, AI is only helpful because of the work the instructor had already done to acquire that expertise in item creation. Would the author recommend this approach for new instructors who have never created their own exam items before, or who have never had to think about things like item response theory? I realize this might be outside the scope of the work, but it seems important that the success of this approach is predicated on the development of expertise that was not assisted by AI. So, when is the right time to introduce this tool?

Author Response

Thank you for your thoughtful and constructive feedback on our manuscript. In the attached document we respond point-by-point to your insightful comments.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The article should be rearranged so that the case-study character of the work stands out clearly: first a brief, separate Introduction and Literature Review (both shortened so that they do not mix and clutter the text), followed by a transparent methodological section with a precise description of the procedure. That section should provide specific examples of how the teachers worked with the LLM, where prompts had to be repeatedly modified, and which types of tasks required the most intervention, and should be supplemented with quantitative data on the number of corrections and other results. Qualitative examples of use are welcome, but at the same time it is necessary to state clearly how the tests created with the LLM are connected to the real exams, rather than stating that they were generated only for testing.

Author Response

Thank you for taking the time to provide comments and feedback on our manuscript. We appreciate your input and recognize the intention behind your suggestions. In the attached document we address each of your points individually, clarifying our stance where necessary.

Author Response File: Author Response.pdf
