1. Introduction
Well-designed digital picture books, supported by affordances such as page-aligned audio/voice reading, can be effective tools for enhancing vocabulary and story comprehension (Tunkiel & Bus, 2022). In today’s digital environment, platforms that provide free access to multilingual digital books further increase opportunities for exposure, particularly when voice options are available (Bus et al., 2023). Importantly, recent work indicates that picture books are not limited to children’s learning contexts but can also benefit adult learners: Leow and Shaari (2025) argue that picture books can play a valuable role in adult foreign-language learning by providing simple, engaging, context-rich materials that reduce culture shock and build confidence. Relative to adult literature, picture books typically use accessible language, predictable structures, and meaningful illustrations, features that promote comprehensible input and discussion. In parallel, reviews of visual literacy and children’s literature synthesise how purposeful image use supports meaning-making and critical engagement, not merely decoration (Farrar et al., 2024).
Beyond their linguistic and cultural benefits, the effectiveness of picture books is also supported by research on multimodal learning, which shows that combining visual, auditory, and textual input enhances comprehension and memory. Li and Lan (2022) emphasise that multimodal resources (reading, listening, and visual input) make language learning more contextualised, personalised, and cognitively richer. More specifically, visual components accompanying reading positively affect vocabulary learning and grammar comprehension, while auditory elements support listening and reading comprehension (Ed-dali, 2024). Empirical findings further show that pairing words with pictures improves memory and metacognitive judgements in L2 vocabulary learning, notably for Chinese vocabulary among international learners (Martín-Luengo et al., 2023). Consistent with this, learners generally remember paired pictures and words better than words alone, aligning with Mayer’s Multimedia Learning framework, in which learners construct understanding by integrating information across verbal and pictorial channels (Mayer, 2002). This is in line with Teng’s (2023) research supporting the combination of multimedia input (text, sound, image, animation, and subtitles) for reinforcing learners’ imagery and language comprehension.
A complementary strand highlights how images can lead meaning-making, including in (almost) wordless formats where interpretation is negotiated by the reader, offering rich opportunities for discussion and inference in instructional settings (Arizpe, 2014; Mantei & Kervin, 2015). For oral skills in particular, storytelling through picture description has been shown to enhance young EFL learners’ production (Arguello San Martín et al., 2020). This motivates the observation by Derakshan (2021) that images in educational materials should not be “merely decorative”: cultural representations, roles, and contexts matter for learner identity, engagement, and appropriateness. Arguments like these reinforce the need for principled image selection and generation in L2 resources.
To summarise, illustrated digital picture books—short narratives with page-aligned images, audio, and word- or segment-level support (glosses, translations)—are, for good reasons, widely used in foreign-language classrooms to improve reading skills and vocabulary. Until recently, producing such resources typically required collaboration between experts such as teachers, illustrators, and technologists, and outputs were often targeted at a “generic learner”. Rapid progress in generative AI (GenAI) now, however, makes it plausible for teachers and self-learners to assemble tailored, multimodal courses with little or no special expertise and effort measured in hours or even minutes.
In practice, three requirements remain non-negotiable for classroom use: (i) page-accurate text–image alignment with consistent visual style across pages and recurring elements; (ii) reliable linguistic support, including trustworthy glosses, translations, and explicit handling of multi-word expressions (MWEs) (e.g., give up and how much in English, se lever and tout à fait in French, and fangen an and ein bisschen in German); and (iii) pedagogical fit to a target learner profile (level, L1, and cultural context), including image choices that are culturally appropriate rather than merely decorative (Derakshan, 2021).
This paper asks a practical question for teachers, self-learners, and materials developers:
How close are current GenAI tools to delivering ready-to-use, tailor-made picture books that meet these three requirements? We address this via three initial case studies using the open-source ChatGPT-based Learning And Reading platform (C-LARA,
https://www.c-lara.org/), which integrates state-of-the-art GenAI text models for authoring and linguistic annotation together with a GenAI image model for per-page illustration. We focus primarily on English source texts with glosses in Chinese, and to a lesser extent French and Ukrainian, and include a small German→English pilot designed for a single user. We evaluate the resulting resources and in some cases estimate the residual human effort needed to adapt materials for a defined learner group. To answer the core question, it is essential to adopt an interdisciplinary stance which combines the perspectives of computer science (how do we build these digital texts?), linguistics (how do we evaluate their formal adequacy?), and education science (how do we determine whether they are useful to learners?). We address three research questions:
- RQ1. Content and imagery quality: To what extent do AI-generated books provide adequate linguistic support (text+annotations) and coherent, culturally appropriate page-aligned images?
- RQ2. Human effort: Which steps still require teacher/editor intervention to achieve reliable annotations and satisfactory illustrations, and roughly how much effort is involved?
- RQ3. Pedagogical tailoring: How effectively can the C-LARA workflow produce context-appropriate texts for a specified learner demographic, as judged by teachers and students?
Our contributions are as follows:
- C1. Workflow and examples: An overview of a teacher- or self-learner-operable end-to-end workflow for generating and editing AI-illustrated picture books (Section 2.1).
- C2. Annotation quality: An evaluation of linguistic annotation for six sample texts with glosses in French, Chinese, and Ukrainian (Section 2.2.1).
- C3. Group tailoring: A demographic-tailoring experiment with six texts for a specified learner group (Section 2.2.2).
- C4. Individual tailoring: A single-learner case study for an individual target user (Section 2.2.3).
Images and Chinese glossing scored consistently high; French glossing was strong but showed issues with gender agreement, register, and naturalness of phrasing; and Ukrainian glossing underperformed with systematic morphosyntax and idiom errors. Students rated demographically tailored English texts positively, whereas teachers requested tighter briefs and curricular alignment. The individually tailored German pilot was engaging and largely usable, with minor image-consistency and cultural-detail issues. Overall, for at least some well-supported language pairs (most notably, the very important language pair English–Chinese), the workflow appears close to classroom and self-study usability. Other language pairs require improvements to language handling, with better handling of multi-word expressions and glossing being the apparent priority tasks.
Positioning Within Prior Work
Teacher-facing ecosystems. Classroom practice already draws on large graded libraries with dashboards (e.g.,
CommonLit,
Raz +), levelled news (
Newsela), open repositories of pre-illustrated readers (
StoryWeaver), and bring-your-own-text readers with click-to-translate (
LingQ,
Readlang) (
CommonLit—About, n.d.;
LingQ—Language Learning That Works, n.d.;
Newsela—The AI-powered text leveler, n.d.;
Raz-Plus—Leveled books and resources, n.d.;
Readlang—Learn languages by reading, n.d.;
StoryWeaver—Open platform for multilingual children’s stories, n.d.). These ecosystems attest to sustained demand for levelled content, in-context support, and teacher-operable workflows. However, three capabilities remain uncommon: (1)
explicit, persistent MWE handling as a teachable unit; (2)
per-page illustration for custom teacher-authored texts with cross-page stylistic consistency; and (3)
rapid end-to-end tailoring from a teacher brief to a specified demographic (level + L1 + cultural context). In our spot checks (September 2025),
Readlang and
LingQ support user texts with word/phrase translation but lack persistent MWE tracking and page-aligned image generation;
Raz + and
StoryWeaver host pre-illustrated readers but do not generate images for
new teacher-authored stories (
LingQ—Language Learning That Works, n.d.;
Raz-Plus—Leveled books and resources, n.d.;
Readlang—Learn languages by reading, n.d.;
StoryWeaver—Open platform for multilingual children’s stories, n.d.). See
Table 1.
AI in language education: reviews and syntheses. Recent syntheses describe broad uptake of AI for reading, writing, vocabulary, grammar, speaking, and listening (e.g., large-scale reviews by
Huang et al., 2023;
Kristiawan et al., 2024;
Woo & Choi, 2021). Reported benefits include personalisation, increased engagement, and measurable proficiency gains; recurrent cautions include motivation sustainability, feedback reliability, privacy, over-reliance, and the need for teacher preparation to maximise pedagogical effectiveness. Complementary discussions in CALL/EdTech likewise highlight the promise of GenAI alongside requirements for transparency, cultural fit, verifiability, and classroom-aligned evaluation (
AbuSahyon et al., 2023;
Ali et al., 2025;
De la Vall & Araya, 2023;
National Council of Teachers of English, 2025;
Schmidt & Strasser, 2022;
Son et al., 2025;
UNESCO, 2023). Within this literature, most tools emphasise authoring aids, adaptive practice, or feedback and assessment workflows; fewer address the
multimodal book as a unit of instruction with auditable linguistic annotation and
teacher-controlled illustration coherence.
Integrating different GenAI tools into the teaching process can provide opportunities for creating learner-oriented multimodal resources that can be purposefully used for enhancing different skills in the target language (
Jiang & Lai, 2025). Using GenAI tools for simplifying complicated texts and adjusting them to different learner levels positively influences reading comprehension (
Çelik et al., 2024). Moreover, GenAI tools can provide supportive reading instruction in language classes because they help increase learners’ confidence in reading texts in the target language depending on the quality of the tools used and the instructional context (
Daweli & Mahyoub, 2024). In addition, GenAI tools developed specifically for language learning through reading may reduce learners’ reading anxiety thanks to features such as voice and translation assistance, which support comprehension of the text (Zheng, 2024).
Gap our study addresses. We target the less-common capabilities above—(i) page-aligned, coherent GenAI illustration for teacher-authored texts; (ii) reliable linguistic support with
explicit MWE glossing; and (iii) efficient demographic tailoring—and evaluate them with teacher/student questionnaires and qualitative feedback. In this sense, our work complements survey-level findings by providing a concrete, auditable workflow (
Section 2.1) and small, quickly replicable studies (
Section 2.2.1,
Section 2.2.2,
Section 2.2.3) that quantify where the workflow is already usable (in particular, English–Chinese) and where improvements are still required (e.g., MWE handling and glossing for more challenging language pairs, like English–Ukrainian).
2. Materials and Methods
2.1. Platform and Workflow (C–LARA)
We used the open-source ChatGPT-based Learning And Reading (C–LARA) platform (https://www.c-lara.org/) to generate multimodal pedagogical picture books. The end-to-end workflow comprises four stages: (1) text creation from a brief user-supplied specification (or import of an existing text); (2) automatic linguistic annotation (segmentation into pages and segments; identification of multi-word expressions (MWEs); lemma tagging; segment translations; word/phrase glossing; and audio tagging); (3) illustration generation with page-aligned images; and (4) web deployment as an interactive HTML document with audio, translations, and mouseover glosses.
Extensive descriptions of this processing can be found in other publications about C-LARA, in particular (
Bédi et al., 2025) for text processing and web deployment, and (
Rendina et al., 2025) for image generation. In order to make the current paper more self-contained, we elaborate slightly on the functioning of the four steps in the C-LARA workflow.
Text generation: The user creating the text starts by creating a new project, specifying the text language and the glossing language. They then either paste in a piece of existing text that they wish to use or (more commonly) ask C-LARA to generate text from a specification they provide. The specification is typically short, perhaps one to three sentences. C-LARA passes it to an AI, in these experiments OpenAI’s GPT-5, with instructions to create a piece of text based on the specification. This is shown to the user, who has the option of editing or regenerating it if necessary.
Automatic linguistic annotation: Once the user is satisfied with the text, they use controls to invoke a series of automatic linguistic annotation operations which C-LARA again carries out using calls to the AI. The user can either ask C-LARA to perform the entire sequence as a single operation or review the output of each stage before proceeding to the next one in case they wish to edit it. The stages are as follows:
Segmentation: This stage is divided into two parts. In the first part, the AI is asked to divide the text into pages, with each page further divided into “segments”. Examples of appropriate divisions are shown, with some rough guidelines. For instance, in a prose text a page might be one to three paragraphs, and a segment might be a sentence. In a poem or song, a page might be one to three verses, and a segment might be one to two lines.
In the second part, the AI is asked to divide each segment into lexical tokens. Most often, these lexical tokens are single words delimited by spaces, but sometimes words need to be split into smaller units: for example, in English “they’ll” should be split into the tokens “they” and “’ll”, in French “l’autre” should be split into “l’” and “autre”, and in German “Gebrauchtwagen” should be split into “Gebraucht” and “wagen”.
For efficiency, splitting of the segments in the second part is parallelised, with each segment sent independently to the AI at the same time.
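The parallel dispatch described above can be sketched as follows. This is a minimal illustration rather than C-LARA's actual code: `tokenize_segment` is a hypothetical stand-in for a single AI call and simply splits on whitespace so that the example runs offline, and the thread pool and worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize_segment(segment: str) -> list[str]:
    """Hypothetical stand-in for one AI call that tokenises a segment.

    A real call would send the segment to the language model; here we
    only split on whitespace so that the sketch runs offline.
    """
    return segment.split()


def tokenize_all(segments: list[str], max_workers: int = 8) -> list[list[str]]:
    """Send every segment independently and collect results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the inputs, even
        # though the individual calls may finish in any order.
        return list(pool.map(tokenize_segment, segments))


segments = ["They will arrive soon .", "She threw it out ."]
tokens = tokenize_all(segments)
```

Because each segment is processed independently, the same pattern would apply to the translation, MWE, lemma, and gloss stages, with only the per-segment function changing.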
Translation: In this step, the surface text of each segment is passed to the AI with a request to translate it into the glossing language. As in the segmentation step, all the translation requests are processed in parallel.
Multi-word expression identification: Once again using parallel processing, each segment is passed to the AI with a request to mark possible multi-word expressions (MWEs). Examples are included showing typical MWEs. For instance, many English and German MWEs are phrasal/separable verbs (“throw out”, “stehe auf”), while a common French MWE is the existential construction “il y a”. All three languages have many MWEs that are set expressions (English, “how much”; French, “tout à fait”; German, “ein bisschen”).
Lemma tagging: Similarly using parallel processing, and including the results of the MWE identification stage, each segment is passed to the AI with a request to add lemma and part of speech (POS) information to each lexical token in a way that respects the MWE annotations, i.e., assigns the same lemma and POS to all tokens that are marked as being part of the same MWE. So for example, if in the segment “She threw it out” “threw” and “out” are both marked as parts of the MWE “threw out”, then each word should be tagged with the lemma “throw out” and the POS “VERB”.
Gloss tagging: The gloss tagging stage is similar to the lemma tagging stage. Again with parallel processing, and including the results of the MWE identification stage, segments are passed to the AI with a request to add a gloss in the glossing (L1) language while respecting the MWE annotations. Continuing the example with “She threw it out” from the preceding paragraph, if we are glossing in French we might thus tag both “threw” and “out” with “a jeté”.
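The constraint shared by the lemma and gloss stages, namely that all tokens of one MWE receive the same annotation, can be illustrated with a small data-structure sketch. The `Token` class, its field names, and the two helper functions are hypothetical and chosen only for illustration; they are not taken from the C-LARA code base.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    surface: str
    mwe_id: Optional[int] = None  # tokens sharing an id belong to one MWE
    lemma: Optional[str] = None
    pos: Optional[str] = None
    gloss: Optional[str] = None


def apply_mwe_annotation(tokens, mwe_id, lemma, pos, gloss):
    """Copy one lemma, POS, and gloss onto every token of the given MWE."""
    for tok in tokens:
        if tok.mwe_id == mwe_id:
            tok.lemma, tok.pos, tok.gloss = lemma, pos, gloss


def mwe_is_consistent(tokens, mwe_id):
    """Check that all tokens of one MWE carry identical annotations."""
    members = [t for t in tokens if t.mwe_id == mwe_id]
    return len({(t.lemma, t.pos, t.gloss) for t in members}) <= 1


# "She threw it out": "threw" and "out" form one MWE, glossed in French.
segment = [Token("She"), Token("threw", mwe_id=1), Token("it"), Token("out", mwe_id=1)]
apply_mwe_annotation(segment, 1, lemma="throw out", pos="VERB", gloss="a jeté")
```

A check such as `mwe_is_consistent` could be run on AI output before it is shown to the user, flagging segments where the model has annotated the parts of an MWE differently.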
Audio tagging: In the audio tagging stage, audio files are attached to words and segments. In the study reported here, audio tagging was exclusively performed using third party Text-To-Speech (TTS) engines, usually the Google TTS Engine. We consequently felt that evaluation of audio quality was beyond the scope of the experiment.
Image creation: In the image creation phase, the overall goal is to create one illustration for each page of the text in a way that is consistent both in terms of style and content. OpenAI’s image generation model GPT-Image-1, released in April 2025, makes it easy to achieve these objectives; with earlier models, the task was extremely challenging. In the currently implemented C-LARA pipeline, image generation goes through the following three phases:
Style: In the first stage, the user creates the style. They give C-LARA a brief specification, e.g., “amusing manga-inspired style suitable for mid-teen students at an Australian school”. C-LARA passes this to the AI with a request to elaborate it into a comprehensive description with information about line style, colour palette, tone, etc.; this description is then passed to GPT-Image-1, with instructions to use it when creating a sample image relevant to the text. The user is shown both the sample image and the elaborated description, with the option of editing the description and regenerating before proceeding to the next stage.
Elements: The purpose of the second stage is to create visual representations of the “elements”, the people, things, locations, etc., that will occur in more than one image. C-LARA starts by sending the text to the AI with a request to produce a list of such elements. This is shown to the user, who is able to add or remove elements. C-LARA then, in parallel, sends requests to the AI to produce elaborated descriptions of each element which take account of the style description and finally passes the elaborated element descriptions to GPT-Image-1 to produce actual images for each element. The user can once again review these and, if they wish, intervene to edit or regenerate.
Images: When the first and second stages have been completed, C-LARA uses the results to create an image for each page of the text; as usual, this is performed in parallel. As we shall see later in this paper, the results of the image generation process are typically quite good.
Web deployment: Once C-LARA has completed the preceding steps of text generation, linguistic annotation, and image creation, the final step is to combine everything into an HTML document. The user reviews this and when they are satisfied has the option of posting it on the web so that it is accessible to third parties. Compiled texts can, if desired, be password-protected.
To give a quick, intuitive idea of C-LARA output,
Figure 1 shows a typical page;
Figure 2 presents all but one of the images from “A Day in the Life of a Pet Detective” (we omitted the title page for layout reasons), illustrating the image workflow’s ability to produce engaging, amusing images while maintaining good consistency of style and recurring elements across the text.
2.2. Materials
We used AI-generated English-language picture books hosted on the C–LARA platform to conduct three qualitative studies.
2.2.1. Study 1
For Study 1, we triaged 18 candidate books, which had been created at various times and posted on the C-LARA platform. All the books had been created using short prompts, typically of one or two sentences, and were designed for educational or entertainment usage.
Table 2 presents a hyperlinked list of the candidate books, and triaging was performed with a three-item Likert instrument (
Table 3); the six highest-scoring books, those at the bottom of the table, were selected for detailed evaluation together with the three glossing languages French, Ukrainian, and Chinese (
Table 4), using distinct teacher and student Likert instruments (see
Section 2.5). Throughout this study, we used 5-point Likert scales ranging from 1 (totally unacceptable/do not agree at all) to 5 (perfect/agree completely). The reader will note that the teacher and student questionnaires are quite different. We discussed the idea of offering similar or overlapping questionnaires but decided against it. Teachers and students engage with learning materials from fundamentally different standpoints: teachability and accuracy of the content vs. learnability and enjoyment. Clearly, using the same questionnaire for both groups would not have been appropriate. We tried to capture a minimal set of parallel constructs, namely, engagement with the materials and perceived usefulness, while tailoring other questions to each group’s perspective.
2.2.2. Study 2
Study 2 probes C-LARA’s ability to contextually tailor picture books when the platform is given no more than a one-sentence demographic brief—here, adult East-Asian migrants with low-intermediate English who have just moved to Australia.
Originally we expected an EFL specialist to draft the three C-LARA prompts (text generation, image background, and image style) for each book; instead, we experimented with a zero-shot workflow in which the OpenAI o3 model generated the triplets directly from the demographic description via the ChatGPT web interface, without human rewrites. The texts are listed in
Table 5; evaluation used distinct teacher and student Likert instruments introduced in
Section 2.5.
2.2.3. Study 3
Study 3 focuses on an interesting possibility, which we have only recently begun to investigate: taking the idea of demographic tailoring to its logical extreme and creating content designed for a single user. In our initial study, we made no attempt to select the subject in a methodical way but simply created material for someone we knew who had personal reasons for wanting to improve their skills in a particular language and was enterprising enough to trial this new technology seriously over an extended period. The person in question, Sarah Wright (also a co-author of the current paper), is an engineering student and flautist who is considering spending a year in Bavaria to study a course in green hydrogen technology. As with the demographic tailoring experiment (Study 2), the AI performs all the work of creating the courses, based only on a paragraph from Ms Wright summarising her reasons for wanting to improve her command of German together with a brief description that lets the AI depict the central character as an idealised cartoon version of her.
Figure 3 shows an example page.
We release the texts at a rate of one or two a week, following Ms Wright’s schedule; the first six are shown in
Table 6. Evaluation used distinct teacher and student Likert instruments introduced in
Section 2.5. In this case, the teachers are two germanophone people with teaching experience, and the student is Ms Wright herself.
2.3. Study Design and Research Questions
We address three questions introduced in
Section 1: (RQ1) content and imagery quality, (RQ2) residual human effort, and (RQ3) effectiveness of demographic/individual tailoring. For each study, the material used is online C-LARA multimodal texts, and the instruments are online page-level and whole-book Likert questionnaires filled out by teachers and students, as described in
Section 2.2.
Study 1 (EFL picture book quality; RQ1–RQ2). We first triaged 18 books to select six for detailed evaluation; each selected book was regenerated with French, Ukrainian, and Chinese glosses and evaluated.
Study 2 (Group tailoring; RQ3). We defined a concrete learner profile, prompted C–LARA to generate six English books for that demographic, and evaluated.
Study 3 (Single-user tailoring; RQ3). We generated a six-episode German mini-course for one learner and evaluated.
2.4. Participants, Recruitment, and Ethics
Adult participants with relevant language expertise (teachers and advanced L2 users) were recruited from the authors’ professional networks and the C–LARA community. Student raters were adult EFL learners in informal university or community settings. Except for Sarah Wright, whose much larger contribution resulted in her also being listed as an author, no personally identifying information beyond self-reported language expertise was collected, and all responses were anonymous or de-identified prior to analysis.
Ethical approval: This work involved minimal-risk questionnaires with adult volunteers and no sensitive personal data. Given standard policies in the countries concerned and experience with related previous papers, we did not consider it necessary to seek formal ethical approval.
2.5. Instruments
The instruments used were in all cases 5-point Likert questionnaires hosted on the C-LARA platform, using a format specially designed for studies like the ones conducted here. When creating a Likert questionnaire of this kind, the user specifies a list of C-LARA texts and a list of Likert-scale questions, each of which is classified as being either “book-level” (the question is posed once for the book as a whole) or “page-level” (the question is posed separately for each individual page). A typical book-level question is “How likely would you be to use this text as a self-learning tool?”, and a typical page-level question is “How well does the image correspond to the text?”, where it is implicitly understood that the image and text are those for the page currently being displayed. We now present the details for the three studies.
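The distinction between book-level and page-level questions can be sketched as a small data structure. The class and function names below are hypothetical and do not reflect the platform's internal representation; the two question texts are the examples quoted above, and the book title and page count are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    text: str
    level: str  # "book" or "page"


def expand_prompts(questions, book_pages):
    """List every (book, page, question) prompt a rater would see.

    book_pages maps a book title to its page count. The page entry is
    None for book-level questions, which are posed once per book.
    """
    prompts = []
    for book, n_pages in book_pages.items():
        for q in questions:
            if q.level == "page":
                prompts.extend((book, page, q.text) for page in range(1, n_pages + 1))
            else:
                prompts.append((book, None, q.text))
    return prompts


questions = [
    Question("How well does the image correspond to the text?", "page"),
    Question("How likely would you be to use this text as a self-learning tool?", "book"),
]
prompts = expand_prompts(questions, {"Example book": 3})
```

For a three-page book this yields four prompts: one per page for the page-level question and a single prompt for the book-level question.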
2.5.1. Study 1
For Study 1, we used three Likert questionnaires: one for the initial triaging step and two for the main evaluation carried out on the six selected texts. The triage questionnaire comprised three 5-point Likert items (
Table 3). The main evaluation used (i) a teacher-viewpoint page-level+global instrument with seven 5-point items spanning image–text correspondence, gloss and translation accuracy, style/element consistency, cultural appropriateness, and overall appeal (
Table 7) and (ii) a student-viewpoint global instrument with two 5-point items targeting engagement and self-study likelihood (
Table 8).
2.5.2. Study 2
The organisation of the questionnaires for Study 2 (creation of texts adapted to a given user demographic) resembles that for Study 1 but is simpler since no triaging phase was used. As before, we have a teacher-viewpoint questionnaire and a student-viewpoint questionnaire. These are shown in
Table 9 and
Table 10.
2.5.3. Study 3
The questionnaires for Study 3, creation of a course adapted to a single user, are similar to those for Study 2. The teacher-viewpoint and learner-viewpoint instruments are listed in
Table 11 and
Table 12.
2.6. Procedure
Study 1
Triage: Teachers skimmed each of 18 books (max. ∼3 min/book) and rated three items;
Selection: The top six books advanced;
Regeneration: Books were recompiled with GPT-5 and GPT-Image-1;
Get ratings: For each book and glossing language, teacher raters answered the questionnaire shown in
Table 7; student raters answered the questionnaire shown in
Table 8.
Study 2
Define demographic: After a short discussion between an EFL teacher author and a C-LARA expert author, we agreed on a suitable target demographic.
Create book generation prompts: We used GenAI to draft prompts for six books potentially useful for this demographic;
Create books: Books were generated in C–LARA;
Get ratings: For each book, teacher raters answered the questionnaire shown in
Table 9; student raters answered the questionnaire shown in
Table 10.
Study 3
Collect learner brief: The learner provided a short paragraph describing her reasons for wishing to improve her German and a sentence describing how she wished her cartoon alter ego to be presented.
Generate German book prompts: GPT-5 was used to create prompts for the various books. This was performed iteratively as the experiment progressed; in many cases we gave the AI feedback we had received from the student so that it could target the stories more effectively.
Create books: Books were generated in C–LARA;
Get ratings: For each book, teacher raters answered the questionnaire shown in
Table 11; the single student answered the questionnaire shown in
Table 12.
2.7. Outcome Measures and Scoring
All items used 5-point Likert scales. For page-level items, we averaged per-page scores to yield a book-level value. For book-level items (teacher and student), responses were used as-is. For Study 1 triage, we computed per-book averages over the three items and raters to rank-order candidates. No free-text responses were used in quantitative summaries; qualitative comments were thematically summarised in the Results.
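The scoring rules above amount to simple averaging, as the following sketch shows. The function names and all ratings are invented for illustration; they are not data from the studies.

```python
from statistics import mean


def book_score(page_scores):
    """Average the per-page Likert scores of one page-level item for one book."""
    return mean(page_scores)


def rank_books(triage):
    """Rank books by their mean rating over all items and raters.

    triage maps a book title to a flat list of ratings (items x raters).
    """
    averages = {book: mean(ratings) for book, ratings in triage.items()}
    return sorted(averages, key=averages.get, reverse=True)


# Invented ratings, for illustration only.
triage = {
    "Book A": [4, 5, 4, 5, 4, 5],
    "Book B": [3, 3, 4, 2, 3, 3],
    "Book C": [5, 5, 5, 4, 5, 5],
}
ranking = rank_books(triage)
top_two = ranking[:2]
```

In Study 1 the analogous step would take the six highest-ranked of the 18 candidate books.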
2.8. Statistical Treatment
We report means of Likert scores and, where the number of independent raters allows, inter-rater reliability metrics. For Study 1 (Chinese), which includes five teacher evaluators and sufficient page-level data, we specifically present Kendall’s W, the intraclass correlation coefficient (ICC), and Cronbach’s α to evaluate the coherence and reliability of teacher judgements.
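For readers unfamiliar with two of these statistics, the following sketch shows how Cronbach’s α and Kendall’s W can be computed from raw ratings. It uses the textbook formulas, with the basic form of W that omits the correction for ties, and is not the analysis code used in this study; the ICC is omitted because it requires a variance-components model. All function names and the toy data in the usage note are illustrative.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: list of k lists, one per item, each with one score per observation."""
    k = len(items)
    totals = [sum(obs) for obs in zip(*items)]
    item_var = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var / variance(totals))


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kendalls_w(ratings):
    """ratings: list of m lists (one per rater), each scoring the same n objects.

    Basic formula without the correction for ties.
    """
    m, n = len(ratings), len(ratings[0])
    rank_rows = [average_ranks(r) for r in ratings]
    rank_sums = [sum(col) for col in zip(*rank_rows)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

As a sanity check, three raters who rank four pages identically, e.g. `kendalls_w([[1, 2, 3, 4]] * 3)`, give W = 1, and two perfectly correlated items give α = 1. With Likert data, ties are frequent, so a tie-corrected W would normally be preferred in practice.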
2.9. Availability of Materials, Data, and Code
2.10. Use of Generative AI (Required Disclosure)
Generative AI systems were used extensively in this study. Apart from their use inside C-LARA itself, a GenAI-based platform, GPT-5 also participated actively in discussions of overall project goals, assisted with drafting many sections of this manuscript, and wrote nearly all of the new code required in the platform. This, in particular, included the nontrivial modules used to administer online questionnaires and format the resulting data as CSV files and LaTeX tables. All this material was carefully reviewed by the human authors, who formally take responsibility for it. In accordance with MDPI policy, GPT-5 is not credited as an author. We note our principled disagreement with this policy given the AI’s substantive technical and writing contributions, which clearly exceeded those of many of the human authors, and observed behaviour which strongly suggested an ability to understand the content in a reasonable intuitive sense of the word “understand”. Nevertheless, we comply and provide this explicit disclosure so that readers can evaluate the provenance and auditability of the results.
4. Discussion
We discuss the results, dividing first by study and then, where necessary, by language.
4.1. Study 1
4.1.1. Chinese
Both teacher and student evaluations of the Chinese glossing condition were strong and internally consistent. Student ratings were uniformly high, typically near the ceiling on the two student book-level questions, corroborating the teacher view that Chinese L1 support is already usable with little or no post-editing. For the teacher evaluations, where we had sufficient page-level data (three page-level questions for a total of 97 pages), we could formally evaluate inter-rater agreement as moderate to good: Kendall’s W indicated a shared structure in how pages were ranked, and ICC(2,k) = 0.64 (95% CI [0.51, 0.74]) signified good absolute agreement among the five teachers. Cronbach’s α confirmed that the three page-level items (image–text correspondence, gloss accuracy, and translation accuracy) measured distinct facets rather than a single construct. Together, these metrics quantitatively support the qualitative impression that the Chinese materials were evaluated as coherent, high-quality resources.
4.1.2. French
Teacher judgements for French were also positive overall. In line with the educator feedback we received, the AI’s glosses and translations attained mean evaluations around the mid-to-high 4s (e.g., ∼4.5 for glosses and ∼4.3–4.4 for translations). Qualitative comments identified two recurrent issues that slightly depress scores without compromising pedagogical usability: (i) incorrect gender resolution (e.g., masculine verb morphology used for a female narrator; masculine ils instead of feminine elles for female groups) and (ii) occasional lexical choices that were comprehensible but sub-optimal in register or collocation. The teacher evaluator noted that these issues are straightforward to correct in light post-editing and did not prevent use of the materials in class.
4.1.3. Ukrainian (Student Only; Teacher Evaluation Not Performed)
In contrast, the Ukrainian condition underperformed. Student scores were low and variable, and a brief review from a native-speaker teacher highlighted systematic grammatical and lexical errors: incorrect gender/morphology in first-person forms, missing function words (e.g., negation particles), mishandled case and prepositional government, literal renderings of idioms (e.g., a fly on the wall) without idiomatic equivalents, and English-like compound noun order. Given both the error density and the current situation in Ukraine, we judged it inappropriate to pursue teacher-rater recruitment for this condition. Instead, we treat these outcomes as design signals for workflow changes (see Section 4.2).
4.1.4. Inter-Group Comparisons
A Kruskal–Wallis test across all languages indicated a significant overall effect of language on ratings (, , , medium effect). Post-hoc Dunn’s tests showed that the Chinese materials received significantly higher ratings than both the French and Ukrainian versions. However, Chinese–French and French–Ukrainian differences were not statistically reliable when mean ratings were aggregated by book (), where the effect of language remained significant () with a large effect size ( = 0.88). This pattern indicates that most of the variance in book-level ratings is explained by language, with Chinese materials standing out as particularly well received.
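As an illustration of the omnibus test described above, the following sketch computes a tie-corrected Kruskal–Wallis H statistic in plain Python. It is not the analysis code used in the study, and the sample scores are hypothetical. The effect size shown is epsilon-squared, H / (N − 1), which is one common choice and is an assumption here; the p-value helper is valid only for three groups (two degrees of freedom), where the chi-square survival function has a closed form. Dunn's post-hoc comparisons are not included.

```python
import math


def kruskal_wallis(groups):
    """Return (H, epsilon_squared) for a list of samples, corrected for ties."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)
    rank_of = {}   # distinct value -> mean rank in the pooled sample
    tie_term = 0   # sum of (t^3 - t) over groups of tied values
    pos = 0
    while pos < n_total:
        end = pos
        while end + 1 < n_total and pooled[end + 1] == pooled[pos]:
            end += 1
        rank_of[pooled[pos]] = (pos + end) / 2 + 1
        t = end - pos + 1
        tie_term += t ** 3 - t
        pos = end + 1
    rank_sum_part = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_sum_part - 3 * (n_total + 1)
    h /= 1 - tie_term / (n_total ** 3 - n_total)
    return h, h / (n_total - 1)


def p_value_three_groups(h):
    """Chi-square approximation with 2 degrees of freedom: P(X > h) = exp(-h / 2)."""
    return math.exp(-h / 2)


# Hypothetical Likert scores for three conditions (not the study data).
toy_groups = [[5, 5, 4, 5, 4], [4, 4, 5, 4, 3], [2, 3, 2, 3, 1]]
h, eps_sq = kruskal_wallis(toy_groups)
print(f"H = {h:.2f}, epsilon^2 = {eps_sq:.2f}, p = {p_value_three_groups(h):.4f}")
```

With small samples and heavy ties, as is typical for book-level Likert means, the chi-square approximation should be read with caution.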
Students tended to give higher ratings overall, reflecting the materials’ engaging nature, but because the teacher and student questionnaires targeted different constructs we judged that a direct statistical comparison would not be meaningful.
4.1.5. Interim Takeaways
Chinese: For Chinese linguistic annotation, the present workflow already achieves near-deployment quality, with average teacher Likert scores for text–image alignment, linguistic support, and visual coherence of 4.5 or better and near-ceiling student scores.
French: For French, small but systematic linguistic errors (gender/number agreement; pronoun choice; collocations) remain. Agreement and pronoun choice are predictable and may be relatively straightforward to fix; collocations are more challenging.
Ukrainian: For Ukrainian, core grammatical control and idiomaticity are not yet reliable; nontrivial work is required before classroom deployment would be responsible.
4.2. Overall Discussion of Study 1
Overall, the picture is coherent: when glossing in Chinese, and to a large extent in French, both teacher and student judgements indicate that the workflow is already close to achieving the desired goals for classroom use. In Ukrainian, systemic grammatical and idiomatic gaps remain visible at page level and accumulate into lower whole-book acceptability. These observations are consistent with the facts that Chinese and French are both large languages; Chinese is substantially larger than French and also has a substantially simpler grammar with almost no morphology, while Ukrainian is both a much smaller language than the other two and has a much more complex morphology.
If we wish to improve the quality of the final annotated texts, we have three main options:
Human post-editing: A straightforward approach is to have humans post-edit the annotations. Unfortunately, our experience suggests that most teachers and self-learners do not have time to devote to this kind of task. This means that the post-editing strategy can only be used by experts who distribute the results widely, and is incompatible with the core goal of producing user-tailored texts.
Waiting for better models: An even more straightforward approach is to wait for better models to become available. So far, we have observed a steady upward trend. It is by no means impossible that the models available in late 2026 will yield adequate performance without us needing to do anything. It is also, of course, possible that this will not happen, or will happen more slowly, especially in languages of lesser commercial interest like Ukrainian.
Machine post-editing: The third approach is to use the existing model to post-edit the annotations. If the issues in the original annotations usually fall into a small number of categories, e.g., gender/number agreement or incorrect choice of case-marking, then it is not implausible that a suitably designed prompt or sequence of prompts may find and correct most of them; even if there is a larger range, it is conceivable that GenAI post-editing may produce useful gains. This idea could be investigated quickly and easily.
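The machine post-editing option can be sketched as a sequence of narrowly scoped correction passes, one per error category. The code below is an illustrative sketch only, not part of C-LARA: the function `post_edit`, the prompt wording, and the `ask_model` callable (standing in for whatever GenAI client is used) are all hypothetical. A stub model is included so that the example runs without network access.

```python
from typing import Callable

# Error categories mentioned in the text; the list is illustrative, not exhaustive.
CATEGORIES = ["gender/number agreement", "choice of case marking"]


def post_edit(text: str, annotation: str, ask_model: Callable[[str], str]) -> str:
    """Run one correction pass per category, feeding each result into the next."""
    for category in CATEGORIES:
        prompt = (
            f"Source text: {text}\n"
            f"Current annotation: {annotation}\n"
            f"Check only for errors of {category}. "
            "Return the corrected annotation, or the annotation unchanged "
            "if there is no such error."
        )
        reply = ask_model(prompt).strip()
        if reply:  # keep the previous version if the model returns nothing
            annotation = reply
    return annotation


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call: fixes one known agreement error."""
    current = prompt.split("Current annotation: ")[1].split("\n")[0]
    if "gender/number agreement" in prompt:
        return current.replace("ils", "elles")
    return current


print(post_edit("The girls are singing.", "ils chantent", stub_model))
```

Restricting each prompt to a single category keeps the instructions short and makes it easier to measure, category by category, whether a pass helps or introduces new errors.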
4.3. Study 2
In Study 2 (creation of EFL texts tailored to a specified demographic), student ratings were consistently higher than teacher ratings. Students tended to score page-level items (Q1–Q2) in the high positive range, indicating that the materials felt engaging and usable. Teachers, by contrast, gave mid-scale means (roughly Likert mid-3s to low-4s) across page-level quality and whole-book items (Q3–Q5), suggesting reservations about linguistic targeting, cultural/contextual fit, or classroom alignment.
Across all six texts, teacher evaluations clustered around the mid-scale range (means ∼3.3–4.2), with the lowest values observed for Q5, “How likely are you to use this book with students similar to the above demographic?” (means ∼2.5–3.0). In contrast, student responses to the most comparable item, “Would this text have taught you vocabulary and grammar that later might have been useful to you?”, were consistently positive (means ∼4.0–4.3). This suggests that while students perceived the materials as engaging and beneficial, teachers were less convinced of their classroom suitability. Because the two questions target different perspectives—pedagogical appropriateness versus perceived personal gain—we again judged that a direct statistical comparison was not useful.
Here, the impression is that we overestimated the AI’s ability to determine what kinds of texts teachers would consider appropriate; we only engaged superficially with one teacher before we created the texts and distributed the rating questionnaires. In the next iteration, a sensible way to refine the first steps of the workflow might be to agree on a process which allows a group of three to five teachers to participate in a couple of rounds where they are shown a range of AI-created starting prompts for generating the texts, select the ones they consider most useful, and optionally add comments. This could be performed efficiently over a web interface so that it does not represent a large investment of effort for the teachers and could create content that much more closely matches the teachers’ conception of what is appropriate.
4.4. Study 3
Finally, we discuss the findings of Study 3, tailoring to a single user. We present the learner perspective (Sarah Wright, the single user in question), the perspectives of the two germanophone teacher evaluators, and the overall takeaways.
4.4.1. Learner Perspective
The learner, Ms Wright, reports that all six episodes closely reflected her brief and “paint a cohesive idealized narrative” of her cartoon self, covering everyday settling-in tasks (e.g., housing, admin, and lab safety) and personal interests (e.g., first rehearsal in a wind orchestra). Perceived difficulty matched B1/B2, with C-LARA glosses helping on technical terms; on request, the system produced slightly more challenging variants. The main issues were image-side: (i) the title page sometimes depicted a different character from the body pages, (ii) one anatomy artefact (“three hands”), (iii) text rendered inside images was sometimes unrelated or in English, and (iv) occasional orientation mistakes (papers upside-down/facing away). Despite these, the learner judged the texts engaging and useful, with clear potential as an at-home complement to classes.
4.4.2. Teacher Perspective
The teachers judged the topics appropriate for exchange students (registration, accommodation, etc.) but flagged two categories of refinement before classroom use: cultural authenticity and adjustment to the learner’s linguistic level. On culture, one accommodation storyline (“Wohnungssuche in Burghausen”) was seen as an oversimplified path that is atypical for students, and one office scene showed a staff name badge with a first name rather than the more appropriate title plus surname. On language, the texts landed closer to B1 than the intended B2; this was defended pedagogically as supporting focus on key vocabulary, but it should be explicit in design. Glosses/translations were “generally very appropriate,” though occasional inconsistencies were noted (e.g., wäre (roughly, “would be”) glossed as “would” in Erste Probe im Blasorchester). Overall verdict: This is a strong first pass that will become learner-ready with targeted human post-editing for culture and a tighter level specification.
4.4.3. Inter-Group Comparisons
Because the teacher and student questionnaires in this single-user study addressed mainly different constructs—teachers focused on linguistic accuracy, cultural fit, and visual correspondence, while the learner assessed engagement, personal relevance, and image consistency—and because ratings were uniformly high, we again judged that a formal statistical comparison was not appropriate. Instead, the complementary teacher and student perspectives were interpreted qualitatively, revealing broad agreement that the materials were engaging, relevant, and pedagogically valuable, with minor refinements needed for cultural detail and consistency in images.
4.4.4. Takeaways and Next Steps
Relevance: Both teacher raters thought the texts related well to the real experience of being a student in Bavaria and consequently saw clear alignment with the learner’s goals and situations; the student found all six episodes relevant and engaging.
Images: Image workflow needs two consistency checks: identity locking for recurring characters (so title and body pages match) and automatic screening of text inside images (language, relevance, and orientation).
CEFR level: Levelling should be explicitly set and checked (B1 vs. B2), with a quick loop that regenerates sentences above/below target proficiency.
Cultural issues: A short checklist of cultural issues (e.g., realistic housing pathways; badge conventions) would catch many of the kinds of mismatches observed here.
These changes could be integrated into C-LARA’s generation and review workflow with minimal friction and should materially reduce post-editing effort in subsequent iterations.
5. Conclusions and Further Directions
We have described a range of experiments carried out using the C-LARA platform, where we evaluated various kinds of automatically and near-automatically generated multimodal illustrated pedagogical texts designed to support L2 learning, both in the classroom and for individual learners. Although the results vary depending on the language pair, those for the best language pairs are clearly promising. In particular, both educators and learners considered performance for the important Chinese–English language pair to be of high quality. The platform has attained a level approaching deployment readiness, as evidenced by favourable evaluations at both the local (page) and global (book) levels concerning text–image alignment, linguistic support, and visual coherence. Consequently, the research team intends soon to conduct a larger experiment which will use C-LARA within an authentic Chinese educational setting with a substantial number of students. The objective will be to investigate the differences in learning outcomes between a conventional textbook-based English instruction method and a C-LARA-supported multimodal teaching approach, as well as to compare the effects of studying ChatGPT-generated texts versus traditionally authored texts.
Looking further ahead, a natural extension of the platform would be to include spoken interaction; this is very far from representing the huge step it would have been just a few years ago. We have already carried out some initial experiments using OpenAI’s Advanced Voice Mode. All that is necessary is to enhance the posted C-LARA content so that it also contains a version of the content that can be accessed by GPT-5 via the web. The user then enters Advanced Voice Mode and tells GPT-5 to go to the page in question and read the content. It is then possible to carry out a spoken discussion. This packaging, needless to say, is not convenient for practical use, but better packagings can easily be envisaged and should not be very challenging to implement.
We think it likely that CALL apps similar in nature to C-LARA including voice interaction will be readily available in the near future. The experiments reported here suggest that these kinds of platform are popular with both students and teachers, and they are no longer hard to build.