Review Reports
Branislav Bédi, Hakeem Beedar and Belinda Chiera, et al.
Reviewer 1: Sara Badia-Climent
Reviewer 2: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
General Comment
Overall, this manuscript offers a valuable and timely contribution to the study of AI-generated pedagogical materials, and it clearly illustrates how C-LARA operates when creating multimodal resources for foreign-language learning. The workflow, the methodological structure, and the design of the questionnaires are described in detail, and the potential of the platform is convincingly presented. The visual quality and consistency of the books are also addressed across studies.
However, some issues require clarification or improvement:
- In the state of the art, although the introduction states that the study is framed within computer science, linguistics, and language education, the literature cited is almost exclusively related to language education. No references are provided to current work on computer-generated image consistency, LLM-based text generation, or multimodal alignment, despite these being core components of the workflow. While it is possible that not all of these areas require extensive discussion, the current framing should be revised to ensure coherence between the claimed interdisciplinary stance and the literature reviewed.
- The student questionnaire is limited to only two global items: engagement and likelihood of self-study use. Both questions are affective in nature and focus on “liking” or perceived usefulness. This restricts the interpretability of the student data and contrasts sharply with the teacher questionnaire, which is much more detailed and analytically oriented.
- This question is linked to the last one. As noted on page 16, while it is understandable that identical questions cannot be posed to teachers and students, the lack of shared items prevents meaningful comparison between the two groups. Including at least a minimal set of parallel constructs would strengthen the study’s interpretability.
- The manuscript does not report how many student participants took part in Study 1 and Study 2. This omission limits the validity of the analyses.
Last, and importantly, the use of Generative AI in the manuscript. On page 13, the authors explicitly state that GPT-5 contributed substantially to the writing of the manuscript, at a level that “clearly exceeded that of many of the human authors.” This raises a significant concern, as MDPI guidelines expressly prohibit the use of generative AI tools for drafting, expanding, or substantively writing scientific manuscripts. The authors also indicate their disagreement with this policy:
“We note our principled disagreement with this policy… [GPT-5’s contributions] clearly exceeded that of many of the human authors…”
Given the explicitness of this disclosure, I believe this issue must be evaluated by the editors to determine whether the manuscript complies with the journal’s publication policies.
Other minor comments:
Page 5:
- It may be preferable to present Figures 1 and 2 together, rather than separating them with such a brief intervening section. Combining them could improve visual coherence and reader flow.
- Additionally, the new section introduced here refers to two tables that could be placed adjacent to it for greater clarity and cohesion.
- In the sentence “To give a quick intuitive idea of what C-LARA output is like, Figure 1 shows a typical page; Figure 2 presents all but one of the images from ‘A Day in the Life of a Pet Detective’)”, the closing parenthesis is not matched with an opening one. Please revise this typographical issue.
Page 8:
- Table placement is inconsistent: some tables display their titles above, while others place them below. Please review the formatting to ensure uniformity throughout the manuscript.
- It may help the reader if you add a link or explicit cross-reference to the section in which the tables are discussed (e.g., Section 2.5.3). A similar addition would be useful for the tables linked to the previous studies.
Page 9:
- The word “typical” appears repeated; consider revising for stylistic clarity.
Author Response
Thank you for your helpful comments! We have revised the paper; here are our responses:
- In the state of the art, although the introduction states that the study is framed within computer science, linguistics, and language education, the literature cited is almost exclusively related to language education. No references are provided to current work on computer-generated image consistency, LLM-based text generation, or multimodal alignment, despite these being core components of the workflow. While it is possible that not all of these areas require extensive discussion, the current framing should be revised to ensure coherence between the claimed interdisciplinary stance and the literature reviewed.
Our response: we have revised the Introduction to include more references. We have also added two pages in §2.1 describing C-LARA's workflow in detail, so that the reader can see how we address issues of text generation, text annotation, and image generation.
- The student questionnaire is limited to only two global items: engagement and likelihood of self-study use. Both questions are affective in nature and focus on “liking” or perceived usefulness. This restricts the interpretability of the student data and contrasts sharply with the teacher questionnaire, which is much more detailed and analytically oriented.
Our response: we justify our reasons for presenting the teachers and students with different questionnaires in §2.2.1. As we say there, the basic problem is that these two groups engage with the learning materials from fundamentally different standpoints.
- This question is linked to the last one. As noted on page 16, while it is understandable that identical questions cannot be posed to teachers and students, the lack of shared items prevents meaningful comparison between the two groups. Including at least a minimal set of parallel constructs would strengthen the study’s interpretability.
Our response: same as our response to the previous comment.
- The manuscript does not report how many student participants took part in Study 1 and Study 2. This omission limits the validity of the analyses.
Our response: information about the numbers of participants has now been added in all relevant places, both in the tables reporting the results and in the text.
Last, and importantly, the use of Generative AI in the manuscript. On page 13, the authors explicitly state that GPT-5 contributed substantially to the writing of the manuscript, at a level that “clearly exceeded that of many of the human authors.” This raises a significant concern, as MDPI guidelines expressly prohibit the use of generative AI tools for drafting, expanding, or substantively writing scientific manuscripts. The authors also indicate their disagreement with this policy:
“We note our principled disagreement with this policy… [GPT-5’s contributions] clearly exceeded that of many of the human authors…”
Given the explicitness of this disclosure, I believe this issue must be evaluated by the editors to determine whether the manuscript complies with the journal’s publication policies.
Our response: we have already discussed this issue at length with the editors and believe we are in good agreement.
Reviewer 2 Report
Comments and Suggestions for Authors
This is a well-written and well-documented paper, with a clear and logical structure. The manuscript is highly professional in tone and a strong fit for the journal’s scope. The proposal to use AI-generated comics as multimodal textbooks is timely and innovative, especially in the context of multimodal learning, digital storytelling, and CALL. The comparison between materials in several languages is also very interesting.
The approach is likely to be positively received within the scientific community for its creativity and for the way it leverages generative AI to rapidly produce rich and engaging language-learning input. The integration of images, audio, highlighted text, and translations offers valuable opportunities to scaffold learning through multimodal means.
At the same time, the enthusiasm surrounding AI in education should be tempered by attention to broader concerns. A growing body of research points out that AI has the potential to encourage passive learning, weaken critical thinking, and reduce opportunities for creativity and problem-solving. It would strengthen the paper to engage with these issues more directly. Yan, Sha, Zhao et al. (2023), Practical and Ethical Challenges of Large Language Models in Education: A Systematic Scoping Review, British Journal of Educational Technology 55(1), provide a comprehensive overview of practical and ethical challenges, including transparency, replicability, and the risk that automation may erode meaningful educational engagement.
Several practical aspects of the paper would benefit from clarification. Some links to the generated materials are not functional and should be checked. The source of inspiration for the AI-generated books also needs greater transparency. For instance, it is unclear whether the resemblance between the generated book A Day in the Life of a Pet Detective and the existing manga The (Pet) Detective Agency by Tantei Jimusho no Kainushi-sama (French translation Des détectives au poil, 2020) comes from explicit prompting or simply reflects the model’s autonomous output, which also raises potential copyright concerns. Making this process more explicit would enlighten the reader on the outcomes of the study.
The presentation of participant demographics is also insufficient. While the language level of the students is given, their age is not. A textbook appropriate for young adults may be perceived very differently by older learners. Information on how participants were recruited is missing as well, and socio-professional background can influence reactions to pedagogical materials. It is also unclear how many participants took part in the study: these details appear only vaguely here and there in the text (p. 12, mention of five teachers and one student), and greater clarity earlier in the paper would be beneficial.
The evaluation design could also be enriched. It is surprising that no questions addressed the quality of the audio files, given that audio constitutes an essential multimodal component of the materials. Moreover, evaluators did not seem to have the possibility of contributing open-ended qualitative comments, which would have yielded deeper insight into their perceptions.
Minor linguistic issues:
- p.14/317:
“Study 2 examined six AI-generated books English tailored for low-intermediate East-Asian adults who recently moved to Adelaide.”
→ should read: “Study 2 examined six AI-generated books in English tailored for low-intermediate East-Asian adults who recently moved to Adelaide.”
- p.17/390:
“Ukrainian is both a much smaller language than the other two”
→ A more diplomatic formulation would be: “Ukrainian has a more limited number of speakers compared to the other two and is less documented.”
Overall the manuscript offers a valuable and original perspective on the pedagogical use of AI-generated multimodal materials. Addressing the points above would further enhance the clarity, balance, and methodological transparency of the paper.
Author Response
Thank you for your kind and helpful comments! Here are our replies:
1. At the same time, the enthusiasm surrounding AI in education should be tempered by attention to broader concerns. A growing body of research points out that AI has the potential to encourage passive learning, weaken critical thinking, and reduce opportunities for creativity and problem-solving. It would strengthen the paper to engage with these issues more directly. Yan, Sha, Zhao et al. (2023), Practical and Ethical Challenges of Large Language Models in Education: A Systematic Scoping Review, British Journal of Educational Technology 55(1), provide a comprehensive overview of practical and ethical challenges, including transparency, replicability, and the risk that automation may erode meaningful educational engagement.
Our reply: we have added more references in the Introduction where we discuss these issues.
2. Several practical aspects of the paper would benefit from clarification. Some links to the generated materials are not functional and should be checked. The source of inspiration for the AI-generated books also needs greater transparency. For instance, it is unclear whether the resemblance between the generated book A Day in the Life of a Pet Detective and the existing manga The (Pet) Detective Agency by Tantei Jimusho no Kainushi-sama (French translation Des détectives au poil, 2020) comes from explicit prompting or simply reflects the model’s autonomous output, which also raises potential copyright concerns. Making this process more explicit would enlighten the reader on the outcomes of the study.
Our response: we have fixed the broken links. We have clarified how we generated the texts in Experiment 1 (beginning of §2.2.1) and in particular explained that the apparent resemblance to Tantei Jimusho no Kainushi-sama’s manga is purely coincidental (footnote on p. 7).
3. The presentation of participant demographics is also insufficient. While the language level of the students is given, their age is not. A textbook appropriate for young adults may be perceived very differently by older learners. Information on how participants were recruited is missing as well, and socio-professional background can influence reactions to pedagogical materials. It is also unclear how many participants took part in the study: these details appear only vaguely here and there in the text (p. 12, mention of five teachers and one student), and greater clarity earlier in the paper would be beneficial.
Our response: we have clearly given the number of evaluators for each experiment (tables in Results section and also main text). Unfortunately, we do not have detailed data about student demographics available.
4. The evaluation design could also be enriched. It is surprising that no questions addressed the quality of the audio files, given that audio constitutes an essential multimodal component of the materials. Moreover, evaluators did not seem to have the possibility of contributing open-ended qualitative comments, which would have yielded deeper insight into their perceptions.
Our response: we clarify (lines 237-239) that all audio used was produced by third-party text-to-speech (TTS) engines, so we considered evaluation of its quality to be outside the scope of the experiment. Regarding open-ended responses: we agree it would have been valuable to include them in the questionnaires, but none of our evaluators were paid to participate, and it was already difficult to persuade them to do as much as they did.