Article

Natural Language Processing as a Scalable Method for Evaluating Educational Text Personalization by LLMs

Learning Engineering Institute, Arizona State University, 120 Cady Mall, Tempe, AZ 85281, USA
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(22), 12128; https://doi.org/10.3390/app152212128
Submission received: 19 September 2025 / Revised: 9 November 2025 / Accepted: 13 November 2025 / Published: 15 November 2025

Abstract

Four versions of science and history texts, each tailored to a hypothetical reader profile varying in reading skill and domain knowledge (high vs. low), were generated by four Large Language Models (LLMs; Claude, Llama, ChatGPT, and Gemini). Natural Language Processing (NLP) techniques were applied to examine variation in LLM text personalization capabilities. NLP was leveraged to extract and quantify the linguistic features of these texts, capturing linguistic variation as a function of LLM, text genre, and reader profile. This NLP-based approach provides an automated and scalable solution for evaluating alignment between LLM-generated personalized texts and readers’ needs. Findings indicate that NLP offers a valid and generalizable means of tracking linguistic variation in personalized educational texts, supporting its use as an evaluation framework for text personalization.

1. Introduction

Recent advancements in Generative AI (GenAI) and Large Language Models (LLMs) have transformed various aspects of education. LLMs have shown immense capabilities in text-processing and text-generation tasks such as translation, summarization, and question-answering [1]. LLMs have been applied to automate various tasks such as content creation, reasoning, problem-solving, and tailoring content to individual needs [2,3,4,5,6]. In recent years, LLMs have been widely applied to enable scalable, data-driven personalization in educational contexts [7,8,9,10].

1.1. Personalized Learning

Personalized learning refers to tailoring educational experiences and providing individualized support to accommodate students’ unique abilities, goals, and preferences [11]. Each student possesses a complex and unique profile requiring tailored educational experiences [12,13], and LLMs can dynamically tailor learning content, pace, and feedback to align with each student’s growth and challenges. The urgent need for personalized learning is well-documented due to several factors, such as learner diversity, limitations of traditional one-size-fits-all instruction, and the special requirements of students with disabilities [14,15,16,17,18]. Recent advancements in GenAI and LLMs have accelerated the development of adaptive learning technologies capable of dynamically adjusting learning materials to individual learner profiles. LLM-powered tutors that tailor explanations, questions, and feedback according to learners’ knowledge have been shown to enhance comprehension and engagement [19,20,21]. For example, a multi-year study involving 62 schools that implemented personalized learning approaches reported significant gains in student achievement, with students outperforming peers in matched comparison schools in both mathematics and reading [22]. Interestingly, personalized learning approaches that adapt to learners’ characteristics yield larger effects than approaches limited to interest-based personalization or feedback [23]. By aligning instruction with each student’s unique needs and abilities, personalized learning provides effective scaffolding for struggling learners and helps prevent them from falling behind [24,25]. Recent empirical research also suggests benefits of personalized technology-enhanced learning for cognitive outcomes [26,27]. Technology-supported adaptive learning has been shown to produce delayed rather than immediate learning gains, suggesting a long-lasting impact of personalized instruction on learning outcomes [28]. Personalized learning has also been shown to increase student engagement and reduce achievement gaps, helping students with low prior knowledge catch up with their peers.
Recent work has highlighted the potential of utilizing LLMs to personalize content and demonstrated the positive impacts of LLM-driven personalization technologies [29,30]. To create accessible learning materials that meet the needs of learners, particularly those with learning disabilities, LLMs have been used to simplify content and align text complexity to match readers’ abilities [31]. This approach is grounded in reading comprehension theories showing that text comprehension depends on both individual characteristics (e.g., prior knowledge, reading skills, cognitive abilities) and text-specific linguistic features (e.g., cohesion, lexical complexity, and syntactic sophistication; [32]). The complex interplay between reader characteristics and textual features highlights the critical need to consider both textual features and individual differences when tailoring texts [11,33]. Comprehension difficulties arise not only from limited reading skills or prior knowledge but also from a mismatch between text readability and a reader’s knowledge and reading proficiency [34,35]. When readers lack sufficient background knowledge or reading skills, linguistic features such as complex vocabulary, sentence structure or low cohesion can further hinder understanding. In particular, cohesion plays an important role in text comprehension. Although high cohesion texts alleviate knowledge demand and benefit low-knowledge readers, they hinder deep processing for high-knowledge readers. For readers with high prior knowledge, less cohesive text is more beneficial to encourage active processing and promote deep comprehension [11,34]. Moreover, skilled readers can comprehend complex academic texts because they have well-developed lexical knowledge and employ strategic reading processes [35]. In contrast, readers with limited skills or background knowledge often find it difficult to process texts that lack cohesion or contain advanced vocabulary and complex sentence structures [11,36,37].
Tailoring text complexity to prior knowledge and reading skills has strong potential to enhance students’ motivation, interest, and overall learning outcomes [38]. Matching texts to students’ abilities has been shown to foster students’ interest, motivation, and passion for reading and learning [39,40]. For example, the Lexile Framework is a widely adopted tool for aligning readers with texts at suitable difficulty levels [41,42,43]. The framework estimates text complexity using linguistic features such as syntactic structure, sentence length, and word frequency, while reader ability is assessed through standardized test scores. By integrating these measures, the framework allows educators to select reading materials that appropriately match students’ proficiency levels and support effective comprehension. These findings support the value of integrating LLMs to personalize learning and to align text features with the cognitive demands of a specific reader profile.
While LLMs offer significant benefits and the potential to transform education, persistent challenges related to standardized evaluation, data privacy, ethical considerations, and effective integration remain unresolved [44,45,46]. When an LLM is prompted to tailor feedback for students, only high-quality, theoretically grounded prompts consistently generate feedback that is superior to expert human-generated feedback in terms of explanation quality, specificity, and engagement outcomes for students [47]. These findings highlight the need to establish a rigorous, theory-driven evaluation framework to maximize the potential of LLMs. Rigorously assessing the quality and alignment of personalized educational content and providing reliable, real-time feedback are critical to ensuring the effective implementation of an LLM-powered personalized learning system [48]. A rigorous validation method is critical to ensure that LLM-driven personalization adapts content to learners’ needs in a consistent and effective manner [18,38]. Theoretically grounded, standardized evaluation frameworks are a crucial factor for successful implementation [49,50,51].

1.2. Text Personalization Evaluation Using Natural Language Processing

A scalable, objective method is essential to validate the extent to which LLM adaptations align with learning theories and evolving student profiles. Human-based evaluation (e.g., expert quality ratings, comprehension assessments, and learning outcomes) is time-consuming, resource-intensive, and generally suffers from inherent biases and inter-rater variability. These limitations make it challenging to improve performance rapidly over time and impractical to implement evaluation on a large scale. Traditional automated evaluation metrics such as BLEU (BiLingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), and METEOR (Metric for Evaluation of Translation with Explicit Ordering) calculate a similarity score between the generated and expected output, focusing only on lexical similarity and word overlap [5,52,53,54]. However, adapting text to meet readers’ individual needs goes well beyond lexical and semantic overlap. Text personalization must evolve with learners’ changing skills and knowledge as they advance or struggle at different points in their learning trajectories [55,56]. The shortcomings of available evaluation methods are consequential because effective personalized educational content requires rapid, iterative evaluation of text appropriateness relative to learners’ levels of skill and knowledge [57].
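To make this limitation concrete, the following minimal sketch (illustrative only, not part of the study) computes a ROUGE-1-style unigram F1 score between a source sentence and a faithful but simplified adaptation; the example texts and function name are hypothetical. Because the metric rewards word overlap alone, a pedagogically appropriate rewording scores poorly.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1-style unigram F1: rewards shared words, ignores cohesion, syntax, and reader fit."""
    ref, cand = Counter(reference.lower().split()), Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

source = "Photosynthesis converts light energy into chemical energy stored in glucose."
adaptation = "Plants use sunlight to make sugar, which stores the energy they need."
print(round(rouge1_f1(source, adaptation), 2))  # low score despite a faithful, simpler adaptation
```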
Beyond these limitations, the literature on automatic text evaluation shows that metrics used in Automated Text Simplification (ATS) fail to capture quality relevant to comprehension and learning. For instance, ATS metrics such as SARI (lexical rewriting) and SAMSA (structural simplification) can reflect simplicity gains, while readability formulas (e.g., FKGL) gauge general difficulty level [58]. However, these metrics do not assess whether an adaptation aligns with the cognitive demands of a specific learner profile or preserves domain-specific discourse features. To effectively measure reader–text alignment, we need to assess theory-aligned, interpretable linguistic features that map to comprehension processes (e.g., cohesion, syntactic and lexical sophistication, academic frequency, and word concreteness). This validation approach complements human evaluation while enabling rapid, large-scale, and iterative improvements of text appropriateness as learners’ skills and knowledge evolve over time.
To overcome these evaluation challenges, Natural Language Processing (NLP) offers an alternative method, capturing linguistic features that are well-aligned with cognitive theories. NLP can be used to extract and quantify linguistic features that have been shown to be strongly predictive of reading difficulty and comprehension outcomes [34,36,37,59]. Text readability refers to the ease with which readers process and understand a text, and it can be quantified using metrics derived from NLP tools such as the Writing Analytics Tool (WAT; [60]). NLP-based analyses offer a robust evaluation method for assessing text personalization. Unlike traditional evaluation methods, they can be leveraged to assess the linguistic features critical for optimal comprehension and engagement. These metrics allow researchers to differentiate between various types of texts and levels of complexity based on linguistic and semantic properties [61]. Specifically, features such as cohesion, language variety, and syntactic and lexical sophistication significantly influence text comprehension and ease of text processing, particularly in educational contexts [62,63]. These theoretically driven metrics are necessary to quickly assess personalized content and iteratively improve the performance of a personalization system.
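WAT itself is not reproduced here; as a loose illustration of this kind of feature extraction, the sketch below uses spaCy to compute three rough proxies (mean sentence length, lexical density, and a type-token ratio standing in for language variety). The function name is ours, and these proxies are simplified stand-ins for the validated WAT indices listed in Table 1.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}  # content words used for lexical density

def readability_proxies(text: str) -> dict:
    """Rough, theory-aligned readability proxies (illustrative stand-ins for WAT indices)."""
    doc = nlp(text)
    words = [t for t in doc if t.is_alpha]
    n_sents = max(len(list(doc.sents)), 1)
    n_words = max(len(words), 1)
    return {
        "mean_sentence_length": len(words) / n_sents,
        "lexical_density": sum(t.pos_ in CONTENT_POS for t in words) / n_words,
        "type_token_ratio": len({t.lower_ for t in words}) / n_words,
    }

print(readability_proxies("Mitochondria produce ATP. This molecule powers most cellular processes."))
```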
Huynh and McNamara (2025) [57] found that NLP techniques effectively differentiate and evaluate personalized text generated by various LLMs, including Claude, Llama, ChatGPT, and Gemini. They selected theoretically driven metrics derived from an NLP tool (i.e., WAT; [60]) to assess alignment of tailored scientific texts intended for different reader profiles. Their study highlighted variability in linguistic alignment between reader profiles and outputs generated by different LLMs. NLP analyses successfully captured linguistic variations among texts and provided assessments for cohesion, lexical and syntactic complexity measures consistent with theoretical predictions. These linguistic features varied systematically in alignment with reader profiles and effectively differentiated between the outputs generated by different LLMs. Their findings underscored the importance of measuring fine-grained linguistic alignments between generated texts and specific reader profiles, emphasizing NLP’s utility in objectively benchmarking personalization quality.
However, different disciplines (i.e., science and history texts) exhibit unique linguistic patterns due to differences in disciplinary conventions and purposes. Science texts are predominantly explanatory and informative with the purpose of conveying factual information [64,65,66,67]. They are characterized by high conceptual density with interconnected abstract concepts and specialized technical terminology [68,69,70]. Moreover, science texts feature dense nominalization in which nouns or noun phrases are converted from verbs (e.g., oxidation, measurement; [71]). Science texts also exhibit syntactic complexity characterized by longer sentences, passive voice, and embedded clauses [72].
These features highlight the objective nature of science texts but also contribute to comprehension difficulties for readers [73]. Readers need to possess sufficient background knowledge, strong vocabulary, and comprehension strategies to fully grasp the materials [32,37]. In contrast, history texts aim to contextualize and interpret historical events through a blend of descriptive, evaluative and narrative writing styles [74,75]. History texts often have varied syntax, fewer nominalizations, and implicit cohesion (e.g., chronological sequencing, storytelling; [76,77,78]). While also including specialized vocabulary (e.g., terms referring to historical events, institution names, dates, and figures), history texts incorporate concrete and descriptive language that is less complex compared to science texts [79]. These linguistic variations result in comprehension differences such that successful comprehension of science texts relies on vocabulary knowledge while comprehension of history texts depends on understanding the overall context and connections between events [80].
Using indices derived from tools such as Coh-Metrix (e.g., cohesion, syntactic complexity, lexical sophistication), researchers can differentiate discourse types (e.g., science vs. history) consistent with discourse-processing theories [76]. Scientific texts often feature dense nominalizations, abstract concepts, and complex syntax, while academic vocabulary increases difficulty for learners with limited background knowledge [55,56,57]. While prior studies have assessed LLM adaptations without explicitly differentiating text domains [57], the inherent linguistic differences between science and history materials suggest that effective personalization must consider these linguistic distinctions [63]. Effective personalization of educational content not only aligns text complexity with readers’ unique needs and skills but also retains domain-specific linguistic features. As such, it is imperative to determine whether LLM-generated modifications sufficiently reflect these demands. NLP metrics have primarily been validated within science. Due to the domain-specific nature of science and history, it is necessary to establish the validity and generalizability of NLP-based evaluation methods by assessing text personalization across domains. Cross-domain validation helps strengthen the rigor and validity of NLP-based evaluation methods. Without cross-domain testing, it remains uncertain to what extent these linguistic metrics can be generalized effectively to texts with fundamentally different linguistic structures.

1.3. Current Research

The current research aims to replicate the method and extend the prior study [57] by leveraging NLP analyses to assess LLM-generated text personalization across different domains (i.e., science and history). This study examined linguistic variations in LLM-generated personalized texts by analyzing the effects of domain (Science vs. History), LLM (Claude, Llama, ChatGPT, Gemini), and reader profile (i.e., high and low prior knowledge and reading skill) on linguistic features. Personalization has been primarily examined on math and science subjects, but discourse research shows that texts from history/humanities disciplines differ in connective use, syntax, lexical and nominalization patterns compared to science texts [81,82]. No study has jointly examined reader profile adaptation, LLM variation, and cross-disciplinary text features. Analyzing science and history text adaptations allows us to assess whether LLM-modified texts align with known linguistic complexities inherent to each domain and how tailored content aligns with readers’ needs.
We hypothesized that NLP analyses would reveal robust and meaningful variations in linguistic features across text domains, LLMs and reader profiles. Several linguistic metrics are related to syntactic complexity, such as the noun-to-verb ratio, language variety, and sentence length. Lexical sophistication is determined by whether texts include vocabulary commonly used in academic texts and words with a low level of concreteness [83,84]. Syntax and lexical complexity present significant challenges for readers with lower reading skills or prior knowledge. These readers often struggle with decoding complex vocabulary and sentence structures since they are less likely to use effective reading strategies, integrate textual information with prior knowledge, or monitor their understanding, all of which can hinder comprehension [85,86]. When readers with limited vocabulary encounter unfamiliar academic terms or phrases, they may misinterpret the text or fail to comprehend it altogether [87]. Therefore, for less skilled readers or those with limited vocabulary knowledge, texts should be modified to use clear and straightforward language, avoiding overly complex vocabulary and sentence structures. Modifications should minimize language variety and avoid complex syntax structures and sophisticated wording.
Based on previous findings [57], we expected alignment between the complexity of personalized texts and the cognitive needs of different reader profiles. High-cohesion texts containing clear referential and causal connections benefit low-knowledge readers who lack the necessary background knowledge [63]. In contrast, low-cohesion texts are more suitable for high-knowledge readers, as they promote deep learning by facilitating inference generation [88]. Skilled readers are able to comprehend complex and low-cohesion texts, which require them to actively generate inferences and engage in deep processing [32]. Examining these linguistic features provides quantifiable insights into alignment between personalized content and readers’ needs, allowing researchers to effectively assess personalization quality [63]. Cohesion, lexical and syntactic features are linguistic features grounded in theories of reading comprehension and psycholinguistics and have been shown to be directly related to comprehension and learning difficulties [32,34,89]. These metrics have been validated as predictors of text readability, making them appropriate for objectively assessing LLM-generated personalized texts [61]. Table 1 includes a list of features related to text readability.
We also predicted that there would be differences in linguistic features between science and history adaptations based on genre-specific characteristics [64,68,73]. We anticipated measurable differences in lexical, syntactic, and cohesive features when comparing science and history passages using NLP metrics [72]. Regardless of reader profile and LLMs, science modifications would contain denser and more complex syntax, advanced vocabulary commonly used in academic discourse, and more causal connectives compared to historical modifications. In contrast, history texts would show higher use of temporal connectives with a higher noun-to-verb ratio. A corpus analysis study showed that history texts emphasize temporal relations and nominal references. Nominalizations appear frequently in historical discourse because they often contain narrative accounts of historical figures, whereas science texts employ causal reasoning to link abstract concepts [65,66,67].
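As an illustration of how such genre markers can be quantified, the sketch below counts causal and temporal connectives per 100 words and computes a noun-to-verb ratio with spaCy. The connective lists are small, hypothetical samples rather than the validated inventories used by tools such as WAT or Coh-Metrix.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Illustrative (non-exhaustive) single-word connective lists.
CAUSAL = {"because", "therefore", "thus", "consequently", "hence"}
TEMPORAL = {"then", "after", "before", "during", "meanwhile", "later"}

def genre_markers(text: str) -> dict:
    """Connective density and noun-to-verb ratio as rough markers of causal vs. temporal discourse."""
    doc = nlp(text)
    tokens = [t.lower_ for t in doc if t.is_alpha]
    n = max(len(tokens), 1)
    nouns = sum(t.pos_ in {"NOUN", "PROPN"} for t in doc)
    verbs = max(sum(t.pos_ == "VERB" for t in doc), 1)
    return {
        "causal_per_100_words": 100 * sum(t in CAUSAL for t in tokens) / n,
        "temporal_per_100_words": 100 * sum(t in TEMPORAL for t in tokens) / n,
        "noun_to_verb_ratio": nouns / verbs,
    }

print(genre_markers("After the treaty was signed, the delegates then returned home."))
print(genre_markers("The reaction accelerates because heat increases molecular motion."))
```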
Moreover, due to inherent differences in model design, training corpora, and fine-tuning strategies, we anticipated variation in linguistic adaptations across LLMs. This hypothesis is based on previous findings by Huynh and McNamara (2025) [57]. Differences in model architecture contribute to variability in outputs [90,91]. In addition, because the LLMs are trained on different corpora, they exhibit different capabilities and behaviors even when given identical prompts [90,91]. As a result, each LLM’s unique architectural design shapes how the model processes prompts and generates responses [92,93], leading to different interpretations of the same prompt and, in turn, different generated texts.

2. Materials and Methods

2.1. LLM Selection and Implementation Details

Four large language models (LLMs) were used in this study: Claude 3.5 Sonnet (Anthropic), Llama 3.1 (Meta), Gemini Pro 1.5 (Google DeepMind), and ChatGPT 4 (OpenAI). These commonly used models were selected to represent a diverse range of architectures and training approaches. Additional technical information, including version details, access dates, and general training specifications, is provided in Appendix A. These LLMs have comparable training and parameter sizes, suggesting similar capabilities in language understanding and generation. Moreover, all four have a strong track record of high performance on general-purpose natural language processing (NLP) tasks. Although their training sizes differ, all four models are widely recognized for producing coherent, contextually appropriate responses and demonstrating advanced language comprehension skills.

2.2. Text Corpus

A total of twenty source texts (ten in science and ten in history) were selected from the publicly available iSTART (Interactive Strategy Training for Active Reading and Thinking) website (www.adaptiveliteracy.com/istart, accessed on 10 October 2025). The iSTART platform provides expository passages commonly used in reading comprehension research, making it well-suited for evaluating text adaptation and personalization. Users can create a free account and access these materials under the Texts Library navigation tab on the website. All texts are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/, accessed on 10 October 2025), which permits redistribution and adaptation for educational and research purposes (e.g., iSTART StairStepper materials [94]).
The 20 passages varied in level of difficulty and covered a wide range of subject areas in science and history, making them suitable for evaluating how different LLMs adapt texts to various reader profiles. Table 2 provides details of the corpus, including domain, text titles, word counts, and Flesch–Kincaid Grade Levels.
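For readers who wish to reproduce passage-level readability estimates like those in Table 2, the snippet below shows one common way to compute the Flesch–Kincaid Grade Level with the textstat package; the sample passages are invented, and textstat is an assumption rather than the tool used by the authors.

```python
import textstat

passages = {
    "science_sample": "Cells divide through mitosis, producing two genetically identical daughter cells.",
    "history_sample": "After the war ended, delegates met to draft a constitution for the young nation.",
}

for title, text in passages.items():
    # Flesch-Kincaid Grade Level: higher values indicate text suited to higher grade levels.
    print(title, textstat.flesch_kincaid_grade(text))
```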

2.3. Descriptions of Reader Profiles

Reading comprehension is influenced by domain-specific knowledge, such that students may comprehend science and history texts differently depending on their prior knowledge in the domain, even with comparable reading skills (e.g., [32,37]). Strong prior knowledge in one domain (e.g., history) may not generalize to comprehension of texts in another subject area (e.g., biology, chemistry, or physics). As such, we designed four hypothetical reader profiles for each domain, replicating the profiles described in prior research, and paired each with its corresponding domain (e.g., science PK with science texts, history PK with history texts; [57]). Details of the reader profiles are presented in Appendix B. The descriptions varied in terms of domain-specific prior knowledge (science or history) and reading proficiency (high or low). Each profile presented detailed descriptions of the reader’s age, education, reading level, background knowledge, interests, and reading goals (see Appendix B). While the use of eight profiles does not capture the full complexity of real-world personalization, they serve as a proof-of-concept simulation, allowing us to assess the extent to which LLMs generate tailored content for each profile and to leverage NLP tools to evaluate the alignment of linguistic features with readers’ needs.

2.4. Procedure

We prompted the four LLMs to modify 10 science and 10 history texts to align with four distinct reader profiles, each varying in reading proficiency and prior knowledge in the science or history domain. The prompt instructed the LLM to adapt the text to enhance comprehension and engagement while considering the specific reader’s needs according to their educational background, reading skill level, prior knowledge, and learning objectives. We used a theory-aligned, RAG-augmented prompt that guided linguistic adjustments based on the reader profiles (e.g., higher cohesion and simpler syntax for low-knowledge readers). Prior work has shown that adding RAG helps LLMs adhere to these linguistic modifications in a manner consistent with established comprehension theories. Details of the prompting strategy are included in Appendix C. For every reader profile, each LLM generated 20 adapted texts, 10 in science and 10 in history. After completing the set of prompts for one profile, the conversation history was cleared before beginning the next profile. In total, each LLM generated 80 personalized modifications, resulting in 320 outputs across the four models (Claude 3.5, Llama 3.1, Gemini Pro 1.5, and ChatGPT 4). We performed minimal text pre-processing because the LLM-generated adaptations were already of high quality: all adaptations contained well-formed sentences, proper punctuation, and consistent casing. We only removed occasional generation scaffolding (e.g., “Here is the simplified version:”) and emojis if present. No other edits were made.
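Because the models were accessed through a web interface (see Appendix A), generation was not scripted in the study; the sketch below is only a programmatic illustration of the 4 × 4 × 2 × 10 design and the scaffold-stripping step described above. The `generate` and `build_rag_prompt` helpers, model labels, and placeholder passages are all hypothetical stand-ins.

```python
from itertools import product

MODELS = ["Claude 3.5 Sonnet", "Llama 3.1", "Gemini Pro 1.5", "ChatGPT 4"]
PROFILES = ["High RS/High PK", "High RS/Low PK", "Low RS/High PK", "Low RS/Low PK"]
DOMAINS = {"science": [f"<science passage {i}>" for i in range(10)],
           "history": [f"<history passage {i}>" for i in range(10)]}
SCAFFOLDS = ("Here is the simplified version:", "Here is the adapted text:")

def build_rag_prompt(profile: str, domain: str, source: str) -> str:
    """Hypothetical helper combining the Appendix C prompt, retrieved theory excerpts, and the source text."""
    return f"[Appendix C prompt for {profile}, {domain} domain]\n\nSource text:\n{source}"

def generate(model: str, prompt: str) -> str:
    """Placeholder for an LLM call; in the study, outputs were produced via a web interface."""
    return f"Here is the adapted text: <adapted output from {model}>"

def clean(output: str) -> str:
    """Minimal pre-processing: strip generation scaffolding if present."""
    for prefix in SCAFFOLDS:
        if output.startswith(prefix):
            output = output[len(prefix):].lstrip()
    return output

adaptations = [
    {"model": m, "profile": p, "domain": d, "text": clean(generate(m, build_rag_prompt(p, d, s)))}
    for m, p in product(MODELS, PROFILES)
    for d, passages in DOMAINS.items()
    for s in passages
]
print(len(adaptations))  # 4 models x 4 profiles x 20 passages = 320 adaptations
```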
To evaluate the linguistic adaptations, we extracted the linguistic features listed in Table 1 using the Writing Analytics Tool (WAT; [60]). WAT extracts and provides validated indices of cohesion, syntactic complexity, and lexical sophistication that have been shown to correspond to comprehension difficulties faced by students. These linguistic patterns provide objective, quantifiable measures of text difficulty that correlate with expert judgments of readability. WAT metrics therefore offer objective criteria for validating how well each modification corresponds to the cognitive and linguistic needs of distinct reader profiles. For instance, low-knowledge readers should receive texts with higher cohesion and simpler syntax, whereas high-knowledge readers can handle, and even benefit from, lower cohesion and more sophisticated syntax and vocabulary [88]. We also applied the same NLP features to compare the science and history passages.

3. Results

3.1. Main Effect of Reader Profile on Variations in Linguistic Features of Modified Texts

A 4 × 4 × 2 MANCOVA was conducted to examine the effects of reader profile (High RS/High PK, High RS/Low PK, Low RS/High PK, and Low RS/Low PK), LLM (Claude, Llama, Gemini, and ChatGPT), and text type (science vs. history) on the linguistic features of the modifications, with word count included as a covariate. Table 3 displays the average scores and statistical results for these differences.
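The statistical software used for the MANCOVA is not specified here; the sketch below shows one possible setup in Python with statsmodels, using synthetic placeholder scores in place of the WAT indices. The variable names and generated data are illustrative only.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
n = 320  # 4 profiles x 4 LLMs x 2 domains x 10 passages

df = pd.DataFrame({
    "profile": np.repeat(["P1", "P2", "P3", "P4"], 80),
    "llm": np.tile(np.repeat(["Claude", "Llama", "Gemini", "ChatGPT"], 20), 4),
    "domain": np.tile(np.repeat(["science", "history"], 10), 16),
    "word_count": rng.integers(250, 600, n),
    # Placeholder dependent variables standing in for the Table 1 WAT indices.
    "cohesion": rng.normal(50, 15, n),
    "sentence_length": rng.normal(16, 4, n),
    "lexical_density": rng.normal(0.6, 0.05, n),
})

# 4 x 4 x 2 MANCOVA with word count entered as a covariate.
mancova = MANOVA.from_formula(
    "cohesion + sentence_length + lexical_density ~ C(profile) * C(llm) * C(domain) + word_count",
    data=df,
)
print(mancova.mv_test())
```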
There was a strong impact of reader profile characteristics on the language that the models produced within the modified texts. Specifically, texts tailored for Profile 1 (readers with high reading skills and prior knowledge) demonstrated the most advanced language use and overall text complexity. These adaptations contained a wider range of vocabulary and sentence structures (M = 80.73, SD = 19.21) compared to texts tailored for Profile 2 (M = 50.76, SD = 20.57), Profile 3 (M = 27.72, SD = 17.39), and Profile 4 (M = 30.33, SD = 18.44), ps < 0.001. They also had the highest lexical density, indicating that each sentence carried more information-rich content. At the same time, Profile 1 adaptations were the least cohesive (p < 0.001), with fewer explicit connectives and cohesive cues. This linguistic pattern is well-suited for high-knowledge readers, who can generate inferences to fill in the gaps left by less explicit text.
For Profile 4 (readers with both low skill and limited prior knowledge), the adapted texts were lexically and syntactically simpler and more cohesive. These adaptations featured shorter sentences and familiar, concrete vocabulary to support understanding. They were also the most cohesive of all adaptations, which is consistent with reading comprehension theories suggesting that highly cohesive texts are beneficial for readers with limited background knowledge.
Pairwise comparisons confirmed significant differences among reader profiles across multiple features related to text complexity, such as language variety, lexical density, noun-to-verb ratio, sentence length, sophisticated wording, and academic vocabulary (all ps < 0.05). As reader skill and prior knowledge increased, the modified texts became more syntactically complex, lexically advanced, and less explicitly connected. These writing patterns align with theoretical predictions that advanced readers benefit from less cohesive and complex texts, whereas less skilled or less knowledgeable readers require simpler, more cohesive language to support meaning construction.

3.2. Main Effect of LLMs

Table 4 presents the differences in linguistic characteristics of the adapted texts as a function of the LLM. Overall, each model demonstrated a distinct linguistic pattern, with the largest variations observed for measures of sentence length, academic vocabulary, cohesion, and language variety. LLMs differ not only in how complex their texts are but also in how they express that complexity.
Specifically, Claude produced texts that were the most cohesive (M = 63.35, SD = 28.59) and the most syntactically simple (M = 12.46, SD = 4.38). This finding suggests that Claude’s writing style included clear connections between sentences and ideas. In contrast, Llama produced the most advanced and technical adaptations. Llama’s modifications were the least cohesive (M = 16.35, SD = 5.88) but high in academic vocabulary (M = 2.77, SD = 0.01). This pattern suggests that Llama generates more formal texts that appeal to expert readers but could be challenging for less skilled readers.
Gemini’s writing comprised richer sentence structures. Its texts demonstrated the greatest language variety (M = 54.42, SD = 28.78), indicating that it varied vocabulary and sentence structure more so than the other models. ChatGPT generated moderately cohesive texts (M = 51.05, SD = 29.52) with intermediate sentence length (M = 16.35, SD = 5.88). Its texts contained the most academic vocabulary (M = 2.89, SD = 0.01), suggesting a style that blends elaboration with accessibility.
Differences also emerged in how the models convey information. Claude had the highest noun-to-verb ratio compared to all other models (M = 2.56, SD = 0.84), suggesting that its sentences contained more nouns and nominal phrases. This writing style is characteristic of information-dense content, wherein ideas are expressed as entities rather than actions. In contrast, ChatGPT featured more verbs and actions (M = 2.33, SD = 0.51), giving its writing a more conversational style. These distinctions highlight that even with identical prompts, LLMs exhibit unique writing patterns that are highly likely to influence the readability of personalized educational materials.

3.3. Main Effect of Text Types

Table 5 presents the differences in linguistic characteristics of the adapted texts as a function of text genre. As expected, modifications of science texts exhibited greater content density, precision, and technicality, whereas adaptations of the history texts contained more narrative and interpretive discourse.
Science texts featured longer sentences and greater lexical density (M = 0.72, SD = 0.06) than history texts (M = 0.51, SD = 0.05), p < 0.001. Each sentence in the science adaptations conveys more information, which is consistent with the explanatory nature of scientific discourse. Science texts also contained more sophisticated vocabulary (M = 49.27, SD = 30.95) than history texts (M = 46.94, SD = 30.25), p < 0.001.
Patterns in connective use also aligned with disciplinary conventions. Science adaptations included a higher frequency of connectives overall, particularly causal connectives (e.g., because, therefore, as a result), which explicitly convey cause-and-effect relationships. In contrast, history texts featured more temporal connectives (e.g., then, after, during), p < 0.001, consistent with their emphasis on sequencing and chronology. History texts also contained a higher noun-to-verb ratio (M = 2.51, SD = 0.68) than science texts (M = 2.34, SD = 0.58), p < 0.001. These results highlight that the historical modifications used more noun phrases (e.g., people’s names, places, institutions, wars) and temporal connectives to organize and connect the timeline of historical events, whereas the science modifications included more causal connectives establishing explicit cause–effect relations, a critical component of scientific writing.

3.4. Interaction Effect of Reader Profile × Text Genre

Linguistic adjustments for each reader profile were tailored depending on the genre. Science passages are often dense, technical, and include abstract terminology that requires strong prior knowledge to decode and derive meanings from text. In contrast, comprehending history texts relies more on understanding historical context and connections between events [37,63]. As such, cohesion should be increased for less skilled and low-knowledge readers.
Figure 1 illustrates text cohesion as a function of text genre and reader profile to illustrate the significant interaction between these two factors, F (3, 303) = 23.54, p < 0.001, η2 = 0.10. As expected, history adaptations for low-knowledge readers contained significantly higher cohesion than the science adaptations. This pattern reflects how the modifications of the texts supported comprehension for low-knowledge and less-skilled readers, who benefit from explicit connections between ideas. In contrast, science texts remained more challenging. Readers with limited domain knowledge often struggle with the expository, information-dense structure typical of scientific writing [37]. To make science passages more accessible, simplifying lexical and syntactic complexity is particularly important for low-skill and low-knowledge readers.
Significant interactions also emerged for sophisticated wording, F (3, 303) = 29.14, p < 0.001, η2 = 0.12, and word concreteness, F (3, 303) = 16.83, p < 0.001, η2 = 0.08. As intended, science modifications for less skilled and lower-knowledge readers included simpler, more concrete vocabulary. In contrast, modifications for advanced reader profiles contained significantly more sophisticated and less concrete wording, aligning with the hypothesized difficulty levels for each profile. Figure 2 and Figure 3 illustrate these genre-by-profile differences in sophisticated wording and word concreteness, respectively.

4. Discussion

Four LLMs (i.e., Claude, Llama, Gemini, ChatGPT) were prompted to tailor educational texts in science and history domains for different reader profiles. NLP analyses were applied to provide insights into how effectively modifications align with readers’ needs. Our findings demonstrated that NLP-based evaluation methods can be applied across both science and history domains to assess personalization quality. The results implied that LLMs are sensitive not only to readers’ needs but also to language variation across genres. These linguistic patterns were quantified and demonstrated through NLP analyses, showcasing that LLMs retained these domain-specific linguistic features when generating modifications.

4.1. Texts Adapted for Different Reader Profiles

Adaptations intended for less skilled and low-knowledge readers included text features that foster comprehension (i.e., high cohesion, more concrete words, simpler language and sentence structures), whereas modifications for high-knowledge and skilled readers featured sophisticated, less explicit connections to encourage active cognitive processing and deep comprehension. Adapted texts for skilled readers with high background knowledge were complex, using more academic language, varied vocabulary, and sophisticated wording. Modifications for high-prior-knowledge readers had low cohesion and low word concreteness, indicating that less explicit explanation was provided in order to promote deeper cognitive engagement through inference-making [32,34]. Using the theory-aligned prompt with RAG, the LLMs modified text features meaningfully based on reader skill and knowledge to promote deep comprehension, as informed by theories of text readability and reading comprehension. While less skilled and low-knowledge readers received more cohesive, concrete, and syntactically simple modifications, modifications for skilled and high-knowledge readers were higher in measures related to academic writing, such as lower cohesion, higher lexical density, and more sophisticated syntactic structures and vocabulary. For low-knowledge readers, these linguistic modifications facilitate comprehension and engagement, ensuring that the text content is accessible and supports them in filling knowledge gaps. These findings highlight that LLM-generated modifications successfully tailored linguistic complexity to readers’ needs by adjusting linguistic features grounded in the reading comprehension literature (when the models were prompted to do so and provided with a sufficient knowledge base). The linguistic variations captured by the NLP analyses align with prior research demonstrating that adjusting features related to text readability is an effective way to support reading comprehension outcomes [96,97].

4.2. LLMs Generated Outputs with Unique Linguistic Patterns

Moreover, the LLMs differed in adaptation style due to inherent differences in training data and fine-tuning strategies across models [90,91]. Each model exhibited a unique writing style with variation in linguistic features, even when given the same prompt and personalizing texts for the same reader profile. Claude generated cohesive, shorter, and less academic texts, which lowered text difficulty. In contrast, Llama generated texts that were more syntactically and lexically complex but less cohesive, similar to texts commonly found in the science domain. Modifications generated by Gemini were rich and varied in the language used and moderately complex. ChatGPT produced outputs with moderate cohesion and complexity. These results highlight the distinct linguistic pattern of each LLM, consistent with prior research [57,92,93]. The NLP analyses demonstrated how various LLMs approach personalization tasks differently, which highlights the importance of understanding each model’s unique capabilities and weaknesses. This study reinforces the value of NLP-based evaluation methods in providing replicable and scalable metrics to rapidly assess personalization quality.

4.3. Linguistic Differences in Adapted Texts from Science Versus History Domain

Prior corpus analyses suggest that science texts are typically more abstract, information-dense, and lexically complex than history texts [98]. Scientific writing also connects propositions using causal connectives that signal explanatory logic [99]. In contrast, history discourse includes more temporal connectives (e.g., then, during, by the time) and is more noun-dense, reflecting its narrative and sequential structure. Historical texts are typically noun-dense due to frequent references to people, places, and institutions, and to event sequencing (e.g., George Washington, Industrial Revolution; [100,101]).
In our study, NLP analyses revealed that the adapted texts preserved the same linguistic patterns observed in science and history texts. As expected, science modifications exhibited higher lexical density, longer sentence length, and higher use of domain-specific academic vocabulary (e.g., organism, mutation, photosynthesis). Science modifications also included explicit causal and referential cohesive markers (e.g., because, therefore, thus) to outline processes, explanations, and cause-effect relationships [73,76,102]. In contrast, historical modifications featured shorter sentences describing sequences of historical events. They also contained more temporal connectives and a higher noun-to-verb ratio compared to science modifications. The modifications preserved genre-typical patterns and thus reflect each discipline’s epistemic goals. While science texts aim to explain generalized causal processes, history texts focus on recounting the chronology of events.

4.4. Implications

The findings of this study have implications for personalized learning research and adaptive learning systems. Various studies have demonstrated that LLMs can be leveraged to modify learning materials across diverse contexts and learner profiles, simplify domain-specific texts, and adjust instructional materials to students’ needs [22,27,31]. Personalized learning has also been shown to enhance student engagement and has the potential to foster long-term learning gains [27,45]. Despite the growing body of research on LLM-powered personalization, few studies have systematically evaluated how well adapted texts align with students’ needs and abilities [72,103,104]. There is currently no scalable, theory-aligned validation framework for evaluating whether LLM-generated adaptations reflect appropriate cognitive alignment with learner profiles. Human evaluations are resource-intensive and impractical to implement at scale, while surface-level automatic metrics such as BLEU, ROUGE, or readability formulas (e.g., FKGL) quantify only lexical similarity rather than cognitive suitability or theoretical alignment [5,53,62,105]. These methods cannot determine whether a text modification supports the intended comprehension process for a particular student profile (e.g., high cohesion for low-knowledge readers or lowered cohesion for high-knowledge readers) [11,34,37].
The current study demonstrates that when LLMs are guided by theory-informed prompts, they can tailor linguistic features (e.g., cohesion, syntax, lexical sophistication) in ways that are consistent with evidence-based comprehension theories (e.g., [11,34,37]). Moreover, the proposed NLP-validation framework provides a scalable, objective method to continuously assess and refine LLM-generated texts in real time. This validation method can be leveraged for iterative adaptation of learning materials as students’ skills, knowledge, and motivation evolve over time [22,26].
While automated text adaptation using LLMs offers a scalable solution for tailoring learning materials to students’ needs, there are several ethical considerations. For instance, LLM-generated content is influenced by biases in its training data. These biases can affect tone, examples, representations of knowledge, how content is framed, and which perspectives are emphasized [106,107,108]. Additionally, maintaining factual accuracy is crucial to prevent misinformation, especially in educational contexts [109,110]. These considerations highlight the need for a theory-aligned, scalable validation framework to ensure that adaptive technologies remain trustworthy and pedagogically relevant. Our study contributes to the current literature by establishing a scalable, theory-driven NLP validation framework that bridges this gap. By leveraging theory-aligned linguistic indices (i.e., cohesion, syntactic and lexical sophistication, academic frequency, and concreteness), the NLP-based validation framework quantitatively assesses whether text adaptations are appropriately aligned with comprehension theories (e.g., the Construction–Integration model; [111]) and with reader needs based on factors such as prior knowledge and reading skill. This approach complements human judgment by providing replicable, fine-grained, and real-time evaluation, enabling iterative improvement of adaptive content in large-scale educational settings.

4.5. Limitations and Future Directions

Although the current study proposes a scalable NLP-based framework for evaluating LLM-powered text personalization, several limitations should be considered. The study focused on science and history subjects, chosen for their distinct linguistic characteristics. Expanding the examination to additional domains such as literature, social studies, or mathematics, as well as to non-academic genres, would further test the generalizability and robustness of the NLP-based evaluation approach. Additionally, we used simulated reader profiles as a proof-of-concept test to explore whether NLP analyses can quantify and detect theoretically meaningful linguistic adaptations across reader profiles. The study was designed to evaluate the validity of an NLP-based validation framework for assessing linguistic alignment between personalized texts and readers’ needs, rather than to assess comprehension outcomes directly. Extensive prior research has already established that the linguistic indices tested here (i.e., cohesion, syntactic and lexical complexity, and academic writing) are strong predictors of comprehension and learning [57,62,69]. Future research might examine the impacts of LLM-generated personalized text on actual learner performance and engagement, providing further insight into the pedagogical value and effectiveness of personalized LLM modifications.
Moreover, we generated one modification for each reader profile, LLM, and text combination. Since LLMs are stochastic, repeated generations and versioned models would allow in-depth examination of output variance [112,113]. Adaptations may vary depending on repeated generations, model version, and prompt design [57,114,115]. Future insight may be gleaned from examining consistency and reproducibility by incorporating multiple generation cycles and different model versions. In addition, because the adapted texts were generated from 20 source passages, some dependency among adaptations from the same passage is possible. Given the low to moderate between-passage variance and the study’s primary goal of examining linguistic differences across LLMs, reader profiles, and domains, treating each adaptation as an independent observation was judged appropriate. Furthermore, considering the complexity of the model design, which already included three factors (LLM × Reader × Domain) and word count as a covariate, adding source text as an additional factor would substantially reduce statistical power. Nevertheless, future research could extend this work by explicitly examining passage-level variance.
Additionally, LLMs are inherently non-deterministic and thus multiple regenerations can produce variable outputs even with identical prompts [116]. This issue poses challenges for educational applications that require reliability and consistency [90]. Because this study generated one adaptation per condition, it may not fully capture this within-model variability. Future research might consider including repeated generations to systematically examine the stability and reproducibility of linguistic patterns across runs.
In this study, a Retrieval-Augmented Generation (RAG) prompt was designed to ground the LLM’s personalization process with empirical evidence and reading comprehension theories rather than introduce arbitrary bias. The materials used to support RAG consisted of relevant peer-reviewed journal articles and validated theories (e.g., the Construction–Integration model [111]; differential effects of text cohesion based on prior knowledge). These resources were intentionally selected to instruct the LLM in tailoring texts according to well-established principles. However, we acknowledge that this approach may favor certain theoretical frameworks and linguistic conventions, which shape the model’s interpretation of what is considered a high-quality adaptation. Future work could integrate bias detection into the evaluation pipeline to ensure that adaptive learning systems remain inclusive and trustworthy.
Finally, hallucinations, in which LLM-generated content is linguistically coherent but semantically incorrect, remain a known risk in LLM systems [117,118]. LLMs may also introduce information irrelevant to the source texts, which poses a serious challenge for maintaining content accuracy and consistency [119,120]. To address this concern, we used an NLP tool to compute similarity scores between the original and modified versions, while human evaluators reviewed the outputs to verify that key concepts were maintained and that no irrelevant or inconsistent information was introduced (see Appendix D for the quality evaluation rubric). Although fully automating semantic accuracy assessment remains challenging, NLP provides an efficient and scalable approach to support quality assurance and iterative content refinement [121]. Future studies could integrate multi-agent verification pipelines that combine automated NLP assessments with human validation in the loop to enhance reliability.
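The specific similarity procedure is not detailed here; the sketch below illustrates one way such a content-preservation check could be implemented with sentence-transformers embeddings and cosine similarity. The model choice, threshold, and example texts are assumptions for illustration, not the study’s actual pipeline.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model for illustration

def content_preservation(original: str, adaptation: str, threshold: float = 0.75):
    """Cosine similarity between source and adaptation embeddings; low scores are flagged for human review."""
    embeddings = model.encode([original, adaptation], convert_to_tensor=True)
    score = float(util.cos_sim(embeddings[0], embeddings[1]))
    return score, score >= threshold

score, preserved = content_preservation(
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "Plants capture sunlight and turn it into sugar that stores energy for later use.",
)
print(round(score, 2), preserved)
```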

5. Conclusions

In this study, we leveraged an NLP-based validation method to systematically evaluate the personalization capabilities of four LLMs across diverse reader profiles and text genres. Our results demonstrated linguistic variation between science and history texts, highlighting domain-specific adaptations generated by each LLM [64,101]. Additionally, each LLM exhibited unique linguistic characteristics in text generation, providing evidence for inherent differences attributable to training corpora and model architectures [112,113]. Replicating the NLP evaluation framework from a prior study [57], this study quantified linguistic variations and demonstrated alignment between text modifications and intended reader profiles across genres. NLP assessment can be automated to provide immediate feedback, enabling real-time evaluation and text refinement. As a result, personalized content can be modified based on quantitative feedback, assessed, and adapted iteratively. NLP-based evaluation methods provide a robust framework to continuously assess and improve LLM-generated personalized texts, facilitating adaptive, data-driven personalization across diverse domains.

Author Contributions

Conceptualization, L.H. and D.S.M.; Methodology, L.H.; Validation, L.H.; Formal analysis, L.H.; Investigation, L.H.; Resources, D.S.M.; Data curation, L.H.; Writing—original draft, L.H.; Writing—review & editing, L.H. and D.S.M.; Visualization, L.H.; Supervision, D.S.M.; Project administration, D.S.M.; Funding acquisition, D.S.M. All authors have read and agreed to the published version of the manuscript.

Funding

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305T240035 to Arizona State University. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in OSF at: https://osf.io/dx3af/overview?view_only=cd11ff1a1f6442af90bbcd645ceea9f9, accessed on 12 November 2025.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
PK: Prior Knowledge
RS: Reading Skills
GenAI: Generative AI
LLM: Large Language Model
AFS: Adaptive Feedback System
RAG: Retrieval-Augmented Generation
NLP: Natural Language Processing
iSTART: Interactive Strategy Training for Active Reading and Thinking
FKGL: Flesch–Kincaid Grade Level
WAT: Writing Analytics Tool

Appendix A. LLM Descriptions

This appendix includes technical details on the LLMs used in the current research, including model versions, training size, and number of parameters. All models were accessed through the Poe.com web platform using default configurations on 10 June 2025.
Table A1. Summarized technical details of each LLM. Source: authors’ contribution.
Claude 3.5 Sonnet (Anthropic). Training size: not disclosed by Anthropic; the training datasets are large, curated, publicly available online sources. Parameters: estimated between 70 and 100 billion.
Llama 3.1 (Meta). Training size: 2 trillion tokens sourced from publicly available datasets (i.e., books, websites, and other digital content). Parameters: 70 billion.
ChatGPT 4 (OpenAI). Training size: 1.8 trillion tokens from diverse sources, including books, web pages, academic papers, and large text corpora. Parameters: not publicly disclosed; approximately 175 billion.
Gemini Pro 1.5 (Google DeepMind). Training size: 1.5 trillion tokens, sourced from a wide variety of publicly available and curated data, including text from books, websites, and other large corpora. Parameters: 100 billion.

Appendix B. Reader Profile Descriptions

This appendix includes details of the various reader profiles provided to the LLMs.
Table A2. Revised Reader Description. Source: authors’ contribution.
Each reader profile below includes a description of the high- or low-knowledge reader in science, followed by the corresponding description of the high- or low-knowledge reader in history.
Reader 1
(High RS/High PK *)
Science reader description:
Age: 25
Educational level: Senior
Major: Chemistry (Pre-med)
ACT English composite score: 32/36 (performance is in the 96th percentile)
ACT Reading composite score: 32/36 (performance is in the 96th percentile)
ACT Math composite score: 28/36 (performance is in the 89th percentile)
ACT Science composite score: 30/36 (performance is in the 94th percentile)
Science background: Completed eight required biology, physics, and chemistry college-level courses (comprehensive academic background in the sciences, covering advanced topics in biology, chemistry, and physics, well-prepared for higher-level scientific learning and analysis)
Reading goal: Understand scientific concepts and principles
History reader description:
Age: 25
Educational level: Senior
Major: History and Archeology
ACT English: 32/36 (96th percentile)
ACT Reading: 33/36 (97th percentile)
AP History score: 5 out of 5
History background: Completed 4 years of college-level courses in U.S. and World History (extensive training in historical analysis, primary source evaluation, and historiography)
Reading goal: Understand key historical events and their relevance to society
Reader 2
(High RS/Low PK *)
Science reader description:
Age: 20
Educational level: Sophomore
Major: Psychology
ACT English composite score: 32/36 (performance is in the 96th percentile)
ACT Reading composite score: 31/36 (performance is in the 94th percentile)
ACT Math composite score: 18/36 (performance is in the 42nd percentile)
ACT Science composite score: 19/36 (performance is in the 46th percentile)
Science background: Completed one high-school-level chemistry course (no advanced science course). Limited exposure and understanding of scientific concepts
Interests/Favorite subjects: arts, literature
Reading goal: Understand scientific concepts and principles
History reader profile:
Age: 21
Educational level: Junior
Major: Biology
ACT English: 32/36 (96th percentile)
ACT Reading: 31/36 (94th percentile)
AP History score: 2 out of 5
History background: Completed general education high school history; no college-level history courses. Limited interest and knowledge of historical events
Interests/Favorite subjects: arts, literature
Reading goal: Understand key historical events and their relevance to society
Reader 3 (Low RS/High PK *)
Science reader profile:
Age: 20
Educational level: Sophomore
Major: Health Science
ACT English composite score: 19/36 (performance is in the 44th percentile)
ACT Reading composite score: 20/36 (performance is in the 47th percentile)
ACT Math composite score: 32/36 (performance is in the 97th percentile)
ACT Science composite score: 30/36 (performance is in the 94th percentile)
Science background: Completed one physics, one astronomy, and two college-level biology courses (substantial prior knowledge in science, having completed multiple college-level courses across several disciplines, strong foundation in scientific principles and concepts)
Reading goal: Understand scientific concepts
Reading disability: Dyslexia
History reader profile:
Age: 22
Educational level: Junior
Major: History
ACT English: 19/36 (44th percentile)
ACT Reading: 20/36 (47th percentile)
AP History score: 5 out of 5
History background: Completed 3 years of college-level history courses (specializing in U.S. history and early modern Europe)
Reading goal: Understand key historical events and their relevance to society
Reading disability: Dyslexia
Reader 4 (Low RS/Low PK *)
Science reader profile:
Age: 18
Educational level: Freshman
Major: Marketing
ACT English composite score: 17/36 (performance is in the 33rd percentile)
ACT Reading composite score: 18/36 (performance is in the 36th percentile)
ACT Math composite score: 19/36 (performance is in the 48th percentile)
ACT Science composite score: 17/36 (performance is in the 34th percentile)
Science background: Completed one high-school-level biology course (no advanced science course). Limited exposure and understanding of scientific concepts
Reading goal: Understand scientific concepts
History reader profile:
Age: 18
Educational level: Freshman
Major: Finance
ACT English: 18/36 (35th percentile)
ACT Reading: 17/36 (32nd percentile)
AP History: 1 out of 5
History background: Only completed basic U.S. History in high school; little engagement or interest in history topics
Reading goal: Understand key historical events and their relevance to society
* RS = Reading Skill, PK = Prior Knowledge.

Appendix C. Prompt Used

This appendix provides the instruction prompt, which applies advanced prompting strategies, including personification, clear task specification, step-by-step reasoning, and retrieval-augmented generation (RAG), to guide LLM outputs more effectively.
Table A3. Augmented Prompt. Source: authors’ contribution.
Components of the augmented prompt:
Personification: Imagine you are a cognitive scientist specializing in reading comprehension and learning science.
Task objectives:
  • Modify this text to enhance text comprehension, engagement, and accessibility for the reader profile while maintaining conceptual depth, scientific rigor, and pedagogical value
  • Adapt the text in a way that supports the readers’ understanding of scientific concepts, using strategies that align with empirical findings on text cohesion, reading skills, and prior knowledge
  • Help the reader retain scientific concepts and reinforce understanding
  • Ensure that the reader can build meaningful understanding while being challenged at an appropriate level
Chain-of-thought: Explain the rationale behind each modification approach and how each change helps the reader grasp the scientific concepts and retain information.
RAG: Refer to the attached PDF files. Apply the empirical findings and theoretical frameworks from these files as guidelines to tailor the text:
  • Impact of prior knowledge and reading skills on comprehension of science texts
  • Impact of prior knowledge on integration of new knowledge according to the Construction-Integration (CI) Model of Text Comprehension
  • Impact of text cohesion on comprehension: the differential effect of cohesion depending on the level of prior knowledge and reading skills
Reader profile: [Insert Reader Profile Description from Appendix B]
Text input: [Insert Text]
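To illustrate how the components in Table A3 can be assembled consistently across many reader profiles and texts, the sketch below composes the augmented prompt programmatically. This is a minimal Python sketch: the function name build_augmented_prompt, the constant names, and the abbreviated wording are illustrative assumptions, not tooling used in the study.

```python
# Minimal sketch of assembling the augmented prompt from the components in
# Table A3. Names, structure, and abbreviated wording are illustrative assumptions.

PERSONA = ("Imagine you are a cognitive scientist specializing in "
           "reading comprehension and learning science.")

TASK_OBJECTIVES = [
    "Modify this text to enhance comprehension, engagement, and accessibility for "
    "the reader profile while maintaining conceptual depth, rigor, and pedagogical value.",
    "Adapt the text using strategies that align with empirical findings on text "
    "cohesion, reading skills, and prior knowledge.",
    "Help the reader retain key concepts and reinforce understanding.",
    "Ensure the reader builds meaningful understanding while being challenged "
    "at an appropriate level.",
]

CHAIN_OF_THOUGHT = ("Explain the rationale behind each modification and how each "
                    "change helps the reader grasp the concepts and retain information.")

RAG_INSTRUCTION = ("Refer to the attached PDF files. Apply the empirical findings and "
                   "theoretical frameworks from these files as guidelines to tailor the text.")


def build_augmented_prompt(reader_profile: str, source_text: str) -> str:
    """Combine persona, task objectives, chain-of-thought, and RAG instructions
    with a reader profile (Appendix B) and the text to be adapted."""
    objectives = "\n".join(f"- {item}" for item in TASK_OBJECTIVES)
    return (
        f"{PERSONA}\n\n"
        f"Task objectives:\n{objectives}\n\n"
        f"{CHAIN_OF_THOUGHT}\n\n"
        f"{RAG_INSTRUCTION}\n\n"
        f"Reader profile:\n{reader_profile}\n\n"
        f"Text:\n{source_text}"
    )
```

In practice, the assembled string would be submitted to each LLM together with the reference documents that support the RAG component.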

Appendix D. Quality Assessment Rubric

This appendix presents the rubric used to evaluate the quality of each adapted text. The evaluation focused on three main dimensions: how well the text matched the intended reader profile, how clearly it supported comprehension, and whether it maintained accuracy and readability. The criteria were designed to capture both linguistic and pedagogical aspects of the adaptations, providing a holistic picture of how effectively each version addressed readers’ needs.
1. Text–Reader Alignment—evaluates how well the modified text fits the characteristics of the target reader, considering multiple factors, including reading level, sentence complexity, vocabulary choice, and tone.
  • Readability: Does the text use an appropriate reading level and sentence length for the intended audience? Examine whether the syntax, vocabulary, and tone are accessible to readers of varying skill and knowledge levels.
  • Structure and Organization: Is information presented in a clear, logical sequence that supports understanding? Evaluate the structure for cohesion across paragraphs, the use of headings and transitions, and the ease with which readers can follow the text.
  • Cohesion and Flow: Are titles, headings, and subheadings used effectively to guide readers through the material?
  • Language Use: Is the language appropriate for the reader’s background knowledge and interests? Assess the formality, specificity, and engagement level of the text.
  • Engagement: Does the writing capture and sustain the reader’s interest? Consider whether the content, tone, and examples make the reading experience enjoyable and relatable.
  • Reader Perspective: Reviewers also considered the question, “If I were the student, would this text hold my attention and make me want to keep reading?”
2. Comprehension Support—evaluates how effectively the text helps readers build understanding.
  • Clarity and Precision: Are key concepts explained clearly, concisely, and without ambiguity? Consider whether definitions, explanations, and descriptions are phrased in a way that minimizes confusion.
  • Depth and Rigor: Does the level of detail match the reader’s background knowledge? Does each adaptation provide sufficient technical detail and conceptual depth without overwhelming the intended audience?
  • Use of Examples: Rate the examples’ relevance and explanatory value. High-quality examples should concretely reinforce abstract ideas and help the reader connect new information to prior knowledge.
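If the rubric were applied at scale, its criteria could be encoded as a simple scoring sheet so that reviewer ratings remain comparable across adapted texts. The sketch below is an illustrative assumption only: the criterion keys, the 1–5 scale, and the averaging rule are not prescribed by the rubric itself.

```python
# Illustrative encoding of the Appendix D rubric as a scoring sheet.
# Criterion keys, the 1-5 scale, and simple averaging are assumptions for
# demonstration; the rubric does not prescribe a specific scoring scheme.
RUBRIC = {
    "text_reader_alignment": ["readability", "structure_and_organization",
                              "cohesion_and_flow", "language_use",
                              "engagement", "reader_perspective"],
    "comprehension_support": ["clarity_and_precision", "depth_and_rigor",
                              "use_of_examples"],
}

def score_text(ratings: dict) -> dict:
    """Average reviewer ratings (1-5) within each rubric dimension."""
    summary = {}
    for dimension, criteria in RUBRIC.items():
        values = [ratings[c] for c in criteria if c in ratings]
        summary[dimension] = sum(values) / len(values) if values else float("nan")
    return summary

example = {"readability": 4, "engagement": 5, "clarity_and_precision": 3, "use_of_examples": 4}
print(score_text(example))
```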

References

  1. Lee, J.S. InstructPatentGPT: Training patent language models to follow instructions with human feedback. Artif. Intell. Law 2024, 33, 739–782. [Google Scholar] [CrossRef]
  2. Cherian, A.; Peng, K.C.; Lohit, S.; Matthiesen, J.; Smith, K.; Tenenbaum, J. Evaluating large vision-and-language models on children’s mathematical olympiads. Adv. Neural Inf. Process. Syst. 2024, 37, 15779–15800. [Google Scholar]
  3. Liu, D.; Hu, X.; Xiao, C.; Bai, J.; Barandouzi, Z.A.; Lee, S.; Lin, Y. Evaluation of large language models in tailoring educational content for cancer survivors and their caregivers: Quality analysis. JMIR Cancer 2025, 11, e67914. [Google Scholar] [CrossRef]
  4. Krause, S.; Stolzenburg, F. Commonsense reasoning and explainable artificial intelligence using large language models. In Proceedings of the European Conference on Artificial Intelligence; Springer: Cham, Switzerland, 2023; pp. 302–319. [Google Scholar]
  5. Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
  6. Lee, C.; Porfirio, D.; Wang, X.J.; Zhao, K.; Mutlu, B. VeriPlan: Integrating formal verification and LLMs into end-user planning. arXiv 2025, arXiv:2502.17898. [Google Scholar]
  7. Pesovski, I.; Santos, R.; Henriques, R.; Trajkovik, V. Generative AI for customizable learning experiences. Sustainability 2024, 16, 3034. [Google Scholar] [CrossRef]
  8. Laak, K.-J.; Aru, J. AI and personalized learning: Bridging the gap with modern educational goals. arXiv 2024, arXiv:2404.02798. [Google Scholar] [CrossRef]
  9. Park, M.; Kim, S.; Lee, S.; Kwon, S.; Kim, K. Empowering personalized learning through a conversation-based tutoring system with student modeling. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems; ACM: New York, NY, USA, 2024; pp. 1–10. [Google Scholar]
  10. Wen, Q.; Liang, J.; Sierra, C.; Luckin, R.; Tong, R.; Liu, Z.; Cui, P.; Tang, J. AI for education (AI4EDU): Advancing personalized education with LLM and adaptive learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2024; pp. 6743–6744. [Google Scholar]
  11. Pane, J.F.; Steiner, E.D.; Baird, M.D.; Hamilton, L.S.; Pane, J.D. Informing Progress: Insights on Personalized Learning Implementation and Effects; Res. Rep. RR-2042-BMGF; Rand Corp.: Santa Monica, CA, USA, 2017. [Google Scholar]
  12. Bernacki, M.L.; Greene, M.J.; Lobczowski, N.G. A systematic review of research on personalized learning: Personalized by whom, to what, how, and for what purpose(s)? Educ. Psychol. Rev. 2021, 33, 1675–1715. [Google Scholar] [CrossRef]
  13. Kaswan, K.S.; Dhatterwal, J.S.; Ojha, R.P. AI in personalized learning. In Advances in Technological Innovations in Higher Education; CRC Press: Boca Raton, FL, USA, 2024; pp. 103–117. [Google Scholar]
  14. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  15. Hardaker, G.; Glenn, L.E. Artificial intelligence for personalized learning: A systematic literature review. Int. J. Inf. Learn. Technol. 2025, 42, 1–14. [Google Scholar] [CrossRef]
  16. Ma, X.; Mishra, S.; Liu, A.; Su, S.Y.; Chen, J.; Kulkarni, C.; Cheng, H.T.; Le, Q.; Chi, E. Beyond chatbots: ExploreLLM for structured thoughts and personalized model responses. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems; ACM: New York, NY, USA, 2024; pp. 1–12. [Google Scholar]
  17. Ng, C.; Fung, Y. Educational personalized learning path planning with large language models. arXiv 2024, arXiv:2407.11773. [Google Scholar] [CrossRef]
  18. Zhang, Y.; Xu, X.; Zhang, M.; Cai, N.; Lei, V.N.L. Personal learning environments and personalized learning in the education field: Challenges and future trends. In Applied Degree Education and the Shape of Things to Come; Springer Nature Singapore: Singapore, 2023; pp. 231–247. [Google Scholar]
  19. Lyu, W.; Wang, Y.; Chung, T.; Sun, Y.; Zhang, Y. Evaluating the effectiveness of LLMs in introductory computer science education: A semester-long field study. In Proceedings of the 11th ACM Conference on Learning @ Scale; ACM: New York, NY, USA, 2024; pp. 63–74. [Google Scholar]
  20. Létourneau, A.; Deslandes Martineau, M.; Charland, P.; Karran, J.A.; Boasen, J.; Léger, P.M. A systematic review of AI-driven intelligent tutoring systems (ITS) in K–12 education. npj Sci. Learn. 2025, 10, 29. [Google Scholar] [CrossRef]
  21. Xiao, R.; Hou, X.; Ye, R.; Kazemitabaar, M.; Diana, N.; Liut, M.; Stamper, J. Improving student–AI interaction through pedagogical prompting: An example in computer science education. arXiv 2025, arXiv:2506.19107. [Google Scholar]
  22. Pane, J.F.; Steiner, E.D.; Baird, M.D.; Hamilton, L.S. Continued Progress: Promising Evidence on Personalized Learning; Rand Corporation: Santa Monica, CA, USA, 2015. [Google Scholar]
  23. Cuéllar, Ó.; Contero, M.; Hincapié, M. Personalized and timely feedback in online education: Enhancing learning with deep learning and large language models. Multimodal Technol. Interact. 2025, 9, 45. [Google Scholar] [CrossRef]
  24. Lim, L.; Bannert, M.; van der Graaf, J.; Singh, S.; Fan, Y.; Surendrannair, S.; Gašević, D. Effects of real-time analytics-based personalized scaffolds on students’ self-regulated learning. Comput. Hum. Behav. 2023, 139, 107547. [Google Scholar] [CrossRef]
  25. Chen, C.H.; Law, V.; Huang, K. Adaptive scaffolding and engagement in digital game-based learning. Educ. Technol. Res. Dev. 2023, 71, 1785–1798. [Google Scholar] [CrossRef]
  26. Major, L.; Francis, G.A.; Tsapali, M. The effectiveness of technology-supported personalised learning in low- and middle-income countries: A meta-analysis. Br. J. Educ. Technol. 2021, 52, 1935–1964. [Google Scholar] [CrossRef]
  27. Hooshyar, D.; Weng, X.; Sillat, P.J.; Tammets, K.; Wang, M.; Hämäläinen, R. The effectiveness of personalized technology-enhanced learning in higher education: A meta-analysis with association rule mining. Comput. Educ. 2024, 223, 105169. [Google Scholar] [CrossRef]
  28. Sibley, L.; Fabian, A.; Plicht, C.; Pagano, L.; Ehrhardt, N.; Wellert, L.; Lachner, A. Adaptive teaching with technology enhances lasting learning. Learn. Instr. 2025, 99, 101863. [Google Scholar] [CrossRef]
  29. Jian, M.J.K.O. Personalized learning through AI. Adv. Eng. Innov. 2023, 5, 16–19. [Google Scholar] [CrossRef]
  30. Pratama, M.P.; Sampelolo, R.; Lura, H. Revolutionizing education: Harnessing the power of artificial intelligence for personalized learning. Klasikal J. Educ. Lang. Teach. Sci. 2023, 5, 350–357. [Google Scholar] [CrossRef]
  31. Martínez, P.; Moreno, L.; Ramos, A. Exploring large language models to generate easy-to-read content. Front. Comput. Sci. 2024, 6, 1394705. [Google Scholar] [CrossRef]
  32. Ozuru, Y.; Dempsey, K.; McNamara, D.S. Prior knowledge, reading skill, and text cohesion in the comprehension of science texts. Learn. Instr. 2009, 19, 228–242. [Google Scholar] [CrossRef]
  33. Follmer, D.J.; Sperling, R.A. Interactions between reader and text: Contributions of cognitive processes, strategy use, and text cohesion to comprehension of expository science text. Learn. Individ. Differ. 2018, 67, 177–187. [Google Scholar] [CrossRef]
  34. O’Reilly, T.; McNamara, D.S. Reversing the reverse cohesion effect: Good texts can be better for strategic, high-knowledge readers. Discourse Process. 2007, 43, 121–152. [Google Scholar] [CrossRef]
  35. Van den Broek, P.; Bohn-Gettelmann, S.; Kendeou, P.; White, M.J. Reading comprehension and the comprehension-monitoring activities of children with learning disabilities: A review using the dual-process theory of reading. Educ. Psychol. Rev. 2015, 27, 641–644. [Google Scholar]
  36. Frantz, R.S.; Starr, L.E.; Bailey, A.L. Syntactic complexity as an aspect of text complexity. Educ. Res. 2015, 44, 387–393. [Google Scholar] [CrossRef]
  37. McNamara, D.S.; Ozuru, Y.; Floyd, R.G. Comprehension challenges in the fourth grade: The roles of text cohesion, text genre, and readers’ prior knowledge. Int. Electron. J. Elem. Educ. 2011, 4, 229–257. [Google Scholar]
  38. Sharma, S.; Mittal, P.; Kumar, M.; Bhardwaj, V. The role of large language models in personalized learning: A systematic review of educational impact. Discov. Sustain. 2025, 6, 1–24. [Google Scholar] [CrossRef]
  39. du Boulay, B.; Poulovassilis, A.; Holmes, W.; Mavrikis, M. What does the research say about how artificial intelligence and big data can close the achievement gap? In Enhancing Learning and Teaching with Technology: What the Research Says; Luckin, R., Ed.; Routledge: London, UK, 2018; pp. 256–285. [Google Scholar]
  40. Kucirkova, N. Personalised learning with digital technologies at home and school: Where is children’s agency? In Mobile Technologies in Children’s Language and Literacy; Oakley, G., Ed.; Emerald Publishing: Bingley, UK, 2018; pp. 133–153. [Google Scholar]
  41. Mesmer, H.A.E. Tools for Matching Readers to Texts: Research-Based Practices, 1st ed.; Guilford Press: New York, NY, USA, 2008; pp. 1–234. [Google Scholar]
  42. Lennon, C.; Burdick, H. The Lexile Framework as an Approach for Reading Measurement and Success. Available online: https://metametricsinc.com/wp-content/uploads/2017/07/The-Lexile-Framework-for-Reading.pdf (accessed on 12 June 2025).
  43. Stenner, A.J.; Burdick, H.; Sanford, E.E.; Burdick, D.S. How accurate are Lexile text measures. J. Appl. Meas. 2007, 8, 307–322. [Google Scholar]
  44. Wang, S.; Xu, T.; Li, H.; Zhang, C.; Liang, J.; Tang, J.; Yu, P.S.; Wen, Q. Large language models for education: A survey and outlook. arXiv 2024, arXiv:2403.18105. [Google Scholar] [CrossRef]
  45. Merino-Campos, C. The impact of artificial intelligence on personalized learning in higher education: A systematic review. Trends High. Educ. 2025, 4, 17. [Google Scholar] [CrossRef]
  46. Yan, L.; Sha, L.; Zhao, L.; Li, Y.; Martinez-Maldonado, R.; Chen, G.; Li, X.; Jin, Y.; Gašević, D. Practical and ethical challenges of large language models in education: A systematic scoping review. Br. J. Educ. Technol. 2024, 55, 90–112. [Google Scholar] [CrossRef]
  47. Jacobsen, L.J.; Weber, K.E. The promises and pitfalls of large language models as feedback providers: A study of prompt engineering and the quality of AI-driven feedback. AI 2025, 6, 35. [Google Scholar] [CrossRef]
  48. Murtaza, M.; Ahmed, Y.; Shamsi, J.A.; Sherwani, F.; Usman, M. AI-based personalized e-learning systems: Issues, challenges, and solutions. IEEE Access 2022, 10, 81323–81342. [Google Scholar] [CrossRef]
  49. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating text generation with BERT. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  50. Basham, J.D.; Hall, T.E.; Carter, R.A., Jr.; Stahl, W.M. An operationalized understanding of personalized learning. J. Spec. Educ. Technol. 2016, 31, 126–135. [Google Scholar] [CrossRef]
  51. Bray, B.; McClaskey, K. A step-by-step guide to personalize learning. Learn. Lead. Technol. 2013, 40, 12–19. [Google Scholar]
  52. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
  53. Novikova, J.; Dušek, O.; Curry, A.C.; Rieser, V. Why we need new evaluation metrics for NLG. arXiv 2017, arXiv:1707.06875. [Google Scholar] [CrossRef]
  54. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Philadelphia, PA, USA, 2002; pp. 311–318. [Google Scholar]
  55. Crossley, S.; Salsbury, T.; McNamara, D. Measuring L2 lexical growth using hypernymic relationships. Lang. Learn. 2009, 59, 307–334. [Google Scholar] [CrossRef]
  56. Tetzlaff, L.; Schmiedek, F.; Brod, G. Developing personalized education: A dynamic framework. Educ. Psychol. Rev. 2021, 33, 863–882. [Google Scholar] [CrossRef]
  57. Huynh, L.; McNamara, D.S. GenAI-powered text personalization: Natural language processing validation of adaptation capabilities. Appl. Sci. 2025, 15, 6791. [Google Scholar] [CrossRef]
  58. Tran, H.; Yao, Z.; Li, L.; Yu, H. ReadCtrl: Personalizing text generation with readability-controlled instruction learning. arXiv 2024, arXiv:2406.09205. [Google Scholar]
  59. Cromley, J.G.; Snyder-Hogan, L.E.; Luciw-Dubas, U.A. Reading comprehension of scientific text: A domain-specific test of the direct and inferential mediation model of reading comprehension. J. Educ. Psychol. 2010, 102, 687–700. [Google Scholar] [CrossRef]
  60. Potter, A.; Shortt, M.; Goldshtein, M.; Roscoe, R.D. Assessing academic language in tenth-grade essays using natural language processing. Assess. Writ. 2025; in press. [Google Scholar]
  61. Crossley, S.A. Developing linguistic constructs of text readability using natural language processing. Sci. Stud. Read. 2025, 29, 138–160. [Google Scholar] [CrossRef]
  62. Crossley, S.A.; Skalicky, S.; Dascalu, M.; McNamara, D.S.; Kyle, K. Predicting text comprehension, processing, and familiarity in adult readers: New approaches to readability formulas. Discourse Process. 2017, 54, 340–359. [Google Scholar] [CrossRef]
  63. Smith, R.; Snow, P.; Serry, T.; Hammond, L. The role of background knowledge in reading comprehension: A critical review. Read. Psychol. 2021, 42, 214–240. [Google Scholar] [CrossRef]
  64. Staples, S.; Egbert, J.; Biber, D.; Gray, B. Academic Writing Development at the University Level: Phrasal and Clausal Complexity across Level of Study, Discipline, and Genre. Writ. Commun. 2016, 33, 149–183. [Google Scholar] [CrossRef]
  65. Fang, Z.; Schleppegrell, M.J. Reading in Secondary Content Areas: A Language-Based Pedagogy; University of Michigan Press: Ann Arbor, MI, USA, 2008. [Google Scholar]
  66. Halliday, M.A.K.; Martin, J.R. Writing Science: Literacy and Discursive Power; Routledge: London, UK, 2003. [Google Scholar]
  67. Schleppegrell, M.J.; Achugar, M.; Oteíza, T. The grammar of history: Enhancing content-based instruction through a functional focus on language. TESOL Q. 2004, 38, 67–93. [Google Scholar] [CrossRef]
  68. Biber, D.; Gray, B.; Poonpon, K. Lexical density and structural elaboration in academic writing over time: A multidimensional corpus analysis. J. English Acad. Purp. 2021, 50, 100968. [Google Scholar]
  69. Graesser, A.C.; McNamara, D.S.; Kulikowich, J.M. Coh-Metrix: Providing multilevel analyses of text characteristics. Educ. Res. 2011, 40, 223–234. [Google Scholar] [CrossRef]
  70. Nagy, W.E.; Townsend, D. Words as tools: Learning academic vocabulary as language acquisition. Read. Res. Q. 2012, 47, 91–108. [Google Scholar] [CrossRef]
  71. Biber, D.; Gray, B. Nominalizing the verb phrase in academic science writing. In The Verb Phrase in English: Investigating Recent Language Change with Corpora; Aarts, B., Close, J., Leech, G., Wallis, S., Eds.; Cambridge University Press: Cambridge, UK, 2013; pp. 99–132. [Google Scholar] [CrossRef]
  72. Dong, J.; Wang, H.; Buckingham, L. Mapping out the disciplinary variation of syntactic complexity in student academic writing. System 2023, 113, 102974. [Google Scholar] [CrossRef]
  73. Fang, Z. The language demands of science reading in middle school. Int. J. Sci. Educ. 2006, 28, 491–520. [Google Scholar] [CrossRef]
  74. Grever, M.; Van der Vlies, T. Why national narratives are perpetuated: A literature review on new insights from history textbook research. London Rev. Educ. 2017, 15, 155–173. [Google Scholar] [CrossRef]
  75. Huijgen, T.; Van Boxtel, C.; Van de Grift, W.; Holthuis, P. Toward historical perspective taking: Students’ reasoning when contextualizing the actions of people in the past. Theory Res. Soc. Educ. 2017, 45, 110–144. [Google Scholar] [CrossRef]
  76. Duran, N.D.; McCarthy, P.M.; Graesser, A.C.; McNamara, D.S. Using temporal cohesion to predict temporal coherence in narrative and expository texts. Behav. Res. Methods 2007, 39, 212–223. [Google Scholar] [CrossRef] [PubMed]
  77. Van Drie, J.; Van Boxtel, C. Historical reasoning: Towards a framework for analyzing students’ reasoning about the past. Educ. Psychol. Rev. 2008, 20, 87–110. [Google Scholar] [CrossRef]
  78. Wineburg, S.S.; Martin, D.; Monte-Sano, C. Reading Like a Historian: Teaching Literacy in Middle and High School History Classrooms; Teachers College Press: New York, NY, USA, 2012. [Google Scholar]
  79. Shanahan, T.; Shanahan, C. Teaching disciplinary literacy to adolescents: Rethinking content-area literacy. Harv. Educ. Rev. 2008, 78, 40–59. [Google Scholar] [CrossRef]
  80. Blevins, B.; Magill, K.; Salinas, C. Critical historical inquiry: The intersection of ideological clarity and pedagogical content knowledge. J. Soc. Stud. Res. 2020, 44, 35–50. [Google Scholar] [CrossRef]
  81. Biber, D.; Conrad, S.; Cortes, V. If you look at…: Lexical bundles in university teaching and textbooks. Appl. Linguist. 2004, 25, 371–405. [Google Scholar] [CrossRef]
  82. Hyland, K. As can be seen: Lexical bundles and disciplinary variation. English Spec. Purp. 2008, 27, 4–21. [Google Scholar] [CrossRef]
  83. Malvern, D.; Richards, B.; Chipere, N.; Durán, P. Lexical Diversity and Language Development; Palgrave Macmillan UK: London, UK, 2004; pp. 16–30. [Google Scholar]
  84. McCarthy, P.M.; Jarvis, S. MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behav. Res. Methods 2010, 42, 381–392. [Google Scholar] [CrossRef]
  85. Cain, K.; Oakhill, J.V.; Barnes, M.A.; Bryant, P.E. Comprehension skill, inference-making ability, and their relation to knowledge. Mem. Cognit. 2001, 29, 850–859. [Google Scholar] [CrossRef]
  86. Magliano, J.P.; Millis, K.K.; RSAT Development Team; Levinstein, I.; Boonthum, C. Assessing comprehension during reading with the Reading Strategy Assessment Tool (RSAT). Metacogn. Learn. 2011, 6, 131–154. [Google Scholar] [CrossRef] [PubMed]
  87. Cruz Neri, N.; Guill, K.; Retelsdorf, J. Language in science performance: Do good readers perform better? Eur. J. Psychol. Educ. 2021, 36, 45–61. [Google Scholar] [CrossRef]
  88. McNamara, D.S. Reading both high-coherence and low-coherence texts: Effects of text sequence and prior knowledge. Can. J. Exp. Psychol. 2001, 55, 51–62. [Google Scholar] [CrossRef] [PubMed]
  89. Pickren, S.E.; Stacy, M.; Del Tufo, S.N.; Spencer, M.; Cutting, L.E. The contribution of text characteristics to reading comprehension: Investigating the influence of text emotionality. Read. Res. Q. 2022, 57, 649–667. [Google Scholar] [CrossRef]
  90. Chen, L.; Zaharia, M.; Zou, J. How is ChatGPT’s behavior changing over time? arXiv 2023, arXiv:2307.09009. [Google Scholar] [CrossRef]
  91. Luo, Z.; Xie, Q.; Ananiadou, S. Factual consistency evaluation of summarization in the era of large language models. Expert Syst. Appl. 2024, 254, 124456. [Google Scholar] [CrossRef]
  92. Liu, Y.; Cong, T.; Zhao, Z.; Backes, M.; Shen, Y.; Zhang, Y. Robustness over time: Understanding adversarial examples’ effectiveness on longitudinal versions of large language models. arXiv 2023, arXiv:2308.07847. [Google Scholar] [CrossRef]
  93. Rosenfeld, A.; Lazebnik, T. Whose LLM is it anyway? Linguistic comparison and LLM attribution for GPT-3.5, GPT-4, and Bard. arXiv 2024, arXiv:2402.14533. [Google Scholar]
  94. Arner, T.; McCarthy, K.S.; McNamara, D.S. iSTART StairStepper—Using comprehension strategy training to game the test. Computers 2021, 10, 48. [Google Scholar] [CrossRef]
  95. Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; Lawrence Erlbaum Associates: Hillsdale, NJ, USA, 1988. [Google Scholar]
  96. Viera, R.T. Syntactic complexity in journal research article abstracts written in English. MEXTESOL J. 2022, 46, n2. [Google Scholar] [CrossRef]
  97. Wu, J.; Zhao, H.; Wu, X.; Liu, Q.; Su, J.; Ji, Y.; Wang, Q. Word concreteness modulates bilingual language control during reading comprehension. J. Exp. Psychol. Learn. Mem. Cogn. 2024; Advance online publication. [Google Scholar]
  98. McNamara, D.S.; Graesser, A.C.; Louwerse, M.M. Sources of text difficulty: Across genres and grades. In Measuring Up: Advances in How We Assess Reading Ability; Sabatini, J.P., Albro, E., O’Reilly, T., Eds.; Rowman & Littlefield: Lanham, MD, USA, 2012; pp. 89–116. [Google Scholar]
  99. Achugar, M.; Schleppegrell, M.J. Beyond connectors: The construction of cause in history textbooks. Linguist. Educ. 2005, 16, 298–318. [Google Scholar] [CrossRef]
  100. Gatiyatullina, G.M.; Solnyshkina, M.I.; Kupriyanov, R.V.; Ziganshina, C.R. Lexical density as a complexity predictor: The case of science and social studies textbooks. Res. Result. Theor. Appl. Linguist. 2023, 9, 11–26. [Google Scholar] [CrossRef]
  101. de Oliveira, L.C. Nouns in history: Packaging information, expanding explanations, and structuring reasoning. Hist. Teach. 2010, 43, 191–203. [Google Scholar]
  102. Follmer, D.J.; Li, P.; Clariana, R. Predicting expository text processing: Causal content density as a critical expository text metric. Read. Psychol. 2021, 42, 625–662. [Google Scholar] [CrossRef]
  103. Hao, Y.; Cao, P.; Jin, Z.; Liao, H.; Chen, Y.; Liu, K.; Zhao, J. Evaluating personalized tool-augmented LLMs from the perspectives of personalization and proactivity. arXiv 2025, arXiv:2503.00771. [Google Scholar] [CrossRef]
  104. Zhang, Z.; Rossi, R.A.; Kveton, B.; Shao, Y.; Yang, D.; Zamani, H.; Wang, Y. Personalization of large language models: A survey. arXiv 2024, arXiv:2411.00027. [Google Scholar] [CrossRef]
  105. Sulem, E.; Abend, O.; Rappoport, A. BLEU is not suitable for the evaluation of text simplification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 738–744. [Google Scholar]
  106. Weissburg, I.; Anand, S.; Levy, S.; Jeong, H. LLMs are biased teachers: Evaluating LLM bias in personalized education. arXiv 2024, arXiv:2410.14012. [Google Scholar] [CrossRef]
  107. Gupta, V.; Chowdhury, S.P.; Zouhar, V.; Rooein, D.; Sachan, M. Multilingual performance biases of large language models in education. arXiv 2025, arXiv:2504.17720. [Google Scholar] [CrossRef]
  108. Chinta, S.V.; Wang, Z.; Yin, Z.; Hoang, N.; Gonzalez, M.; Quy, T.L.; Zhang, W. FairAIED: Navigating fairness, bias, and ethics in educational AI applications. arXiv 2024, arXiv:2407.18745. [Google Scholar] [CrossRef]
  109. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
  110. Ji, Z.; Yu, T.; Xu, Y.; Lee, N.; Ishii, E.; Fung, P. Towards mitigating LLM hallucination via self-reflection. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; ACL: Singapore, 2023. [Google Scholar]
  111. Kintsch, W. The Role of Knowledge in Discourse Comprehension: A Construction-Integration Model. Psychol. Rev. 1988, 95, 163–182. [Google Scholar] [CrossRef] [PubMed]
  112. Atil, B.; Chittams, A.; Fu, L.; Ture, F.; Xu, L.; Baldwin, B. LLM Stability: A detailed analysis with some surprises. arXiv 2024, arXiv:2408.04667. [Google Scholar]
  113. Zhou, H.; Savova, G.; Wang, L. Assessing the macro and micro effects of random seeds on fine-tuning large language models. arXiv 2025, arXiv:2503.07329. [Google Scholar] [CrossRef]
  114. Echterhoff, J.; Faghri, F.; Vemulapalli, R.; Pouransari, H. MUSCLE: A Model Update Strategy for Compatible LLM Evolution. arXiv 2024, arXiv:2407.09435. [Google Scholar] [CrossRef]
  115. Pimentel, M.A.F.; Christophe, C.; Raha, T.; Munjal, P.; Kanithi, P.; Khan, S. Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks. arXiv 2024, arXiv:2407.21072. [Google Scholar] [CrossRef]
  116. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  117. Alkaissi, H.; McFarlane, S.I. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus 2023, 15, e35179. [Google Scholar] [CrossRef]
  118. Hatem, R.; Simmons, B.; Thornton, J.E. A call to address AI “hallucinations” and how healthcare professionals can mitigate their risks. Cureus 2023, 15, e44720. [Google Scholar] [CrossRef]
  119. Bang, Y.; Cahyawijaya, S.; Lee, N.; Dai, W.; Su, D.; Wilie, B.; Fung, P. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv 2023, arXiv:2302.04023. [Google Scholar] [CrossRef]
  120. Maynez, J.; Narayan, S.; Bohnet, B.; McDonald, R. On faithfulness and factuality in abstractive summarization. arXiv 2020, arXiv:2005.00661. [Google Scholar] [CrossRef]
  121. Laban, P.; Kryściński, W.; Agarwal, D.; Fabbri, A.R.; Xiong, C.; Joty, S.; Wu, C.S. LLMs as factual reasoners: Insights from existing benchmarks and beyond. arXiv 2023, arXiv:2305.14540. [Google Scholar] [CrossRef]
Figure 1. Sentence cohesion index for Science and History texts across Reader Profiles varying in Reading Skill (RS) and Prior Knowledge (PK). Higher cohesion measures indicate more explicit use of connective and referential links. Aligning with comprehension theory [11,34,37], LLMs increased cohesion when adapting for low-knowledge readers who benefit from explicit connections, at the same time lowering cohesion for high-knowledge readers to encourage inference generation. Source: Authors’ contribution.
Figure 2. Sophisticated wording measure for Science and History texts across Reader Profiles varying in Reading Skill (RS) and Prior Knowledge (PK). Higher values indicate the use of more advanced vocabulary frequently found in academic texts. LLMs reduced lexical sophistication for low-skill and low-knowledge readers to enhance readability, while increasing complexity for high-skill and high-knowledge readers. Source: Authors’ contribution.
Figure 3. Word concreteness measure for Science and History texts across Reader Profiles varying in Reading Skill (RS) and Prior Knowledge (PK). Higher values indicate more concrete, sensory vocabulary that supports easier mental representation and understanding of concepts. LLMs tailored adaptations to be more concrete for less skilled readers, especially in science texts, to support comprehension. Texts adapted for skilled and high-knowledge readers contained more abstract language. Source: Authors’ contribution.
Table 1. Linguistic features affecting coherence-building processes and reading comprehension. Source: Authors’ contribution.
Features | Metrics and Descriptions
Writing Style
  Academic writing *: The extent to which texts include domain-specific terminology and complex sentence structures typical of academic discourse. Texts with higher academic measures reflect a more formal style that can foster difficulty for less-skilled and low-knowledge readers.
Conceptual Density and Cohesion
  Lexical density: The extent to which text contains sentences with dense and precise information, including complex noun phrases and sophisticated words. Texts with high lexical density convey more information per sentence but may require greater effort to process.
  Noun-to-verb ratio *: Text with a high noun-to-verb ratio results in dense information and complex sentences. High nominalization is characteristic of academic discourse and can be challenging for struggling readers.
  Sentence cohesion *: The degree to which sentences are explicitly connected through connectives (e.g., because, therefore, or in addition) or cohesion cues (e.g., overlapping ideas and concepts). Cohesive texts support comprehension, especially for low-knowledge readers, since they rely on explicit textual cues to infer meanings.
Syntax Complexity
  Sentence length *: Longer sentences often have multiple clauses and embedded phrases, increasing syntactic complexity. Longer sentences may hinder comprehension for less skilled readers.
  Language variety *: The extent to which the text contains a variety of lexical and syntactic structures. A high language variety measure enhances stylistic richness and engagement, while low variety can be monotonous but simplifies comprehension.
Lexical Complexity
  Word concreteness: The degree to which words refer to sensory experiences or physical objects that can be experienced by the senses. High measures indicate the texts contain more tangible words, while low measures indicate more abstract concepts that can be difficult for novices.
  Sophisticated wording *: Lower measures indicate common, familiar vocabulary, whereas higher measures indicate more advanced words. Using sophisticated vocabulary enriches expression and academic tone but can reduce readability for readers with less knowledge.
  Academic frequency *: Indicates the extent of sophisticated vocabulary that is used, which is also common in academic texts. High academic frequency indicates technical or scholarly language that requires greater background knowledge to comprehend.
Connectives
  All connectives: Refers to the overall density of linking words and phrases (e.g., however, therefore, then, in addition). Higher values indicate the text is overtly guiding the reader through logical, additive, contrastive, temporal, or causal relations, increasing cohesion. Lower values imply that relationships must be inferred from context.
  Temporal connectives: Markers that place events on a timeline (e.g., then, meanwhile, during, subsequently).
  Causal connectives: Markers that signal cause-and-effect or reasoning links (e.g., because, since, therefore, thus, as a result).
* Indices used in Huynh and McNamara (2025) [57].
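As a concrete illustration of how indices in the spirit of Table 1 can be extracted automatically, the sketch below computes three simple surrogates (mean sentence length, noun-to-verb ratio, and connective density) with spaCy. The connective list, function name, and thresholds are illustrative assumptions; the study relied on established NLP tools rather than this exact script, so values would not match the reported indices.

```python
# Illustrative sketch of computing simple linguistic indices similar in spirit
# to those in Table 1. The connective list and function are demonstration
# assumptions; they do not reproduce the exact indices used in the study.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

CONNECTIVES = {"because", "since", "therefore", "thus", "however",
               "then", "meanwhile", "subsequently"}

def linguistic_indices(text: str) -> dict:
    doc = nlp(text)
    sentences = list(doc.sents)
    words = [t for t in doc if t.is_alpha]
    nouns = sum(1 for t in doc if t.pos_ in ("NOUN", "PROPN"))
    verbs = sum(1 for t in doc if t.pos_ == "VERB")
    connectives = sum(1 for t in doc if t.lower_ in CONNECTIVES)
    return {
        "mean_sentence_length": len(words) / max(len(sentences), 1),
        "noun_to_verb_ratio": nouns / max(verbs, 1),
        "connective_density": connectives / max(len(words), 1),
    }

print(linguistic_indices("Bacteria are small. However, they matter because they recycle nutrients."))
```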
Table 2. Scientific and history texts. Source: Authors’ contribution.
Domain | Topic | Text Title | Word Count | FKGL *
Science | Biology | Bacteria | 468 | 12.10
Science | Biology | The Cells | 426 | 11.61
Science | Biology | Microbes | 407 | 14.38
Science | Biology | Genetic Equilibrium | 441 | 12.61
Science | Biology | Food Webs | 492 | 12.06
Science | Biology | Patterns of Evolution | 341 | 15.09
Science | Biology | Causes and Effects of Mutations | 318 | 11.35
Science | Biochemistry | Photosynthesis | 427 | 11.44
Science | Chemistry | Chemistry of Life | 436 | 12.71
Science | Physics | What are Gravitational Waves? | 359 | 16.51
History | American History | Battle of Saratoga | 424 | 9.86
History | American History | Battles of New York | 445 | 11.77
History | American History | Battles of Lexington and Concord | 483 | 12.85
History | American History | Emancipation Proclamation | 271 | 13.40
History | American History | House of Burgesses | 200 | 12.80
History | American History | Abraham Lincoln—Rise to Presidency | 631 | 12.15
History | American History | George Washington | 260 | 9.79
History | French and American History | Marquis de Lafayette | 356 | 13.78
History | Dutch and American History | New York (New Amsterdam) Colony | 403 | 12.97
History | World History | Age of Exploration | 490 | 10.49
* Flesch–Kincaid Grade Level.
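For reference, the Flesch–Kincaid Grade Level reported in Table 2 is a standard readability formula based on average sentence length and average syllables per word. The sketch below shows the formula; the syllable counter is a rough heuristic assumption, so exact values may differ slightly from dedicated readability tools.

```python
# Flesch-Kincaid Grade Level (FKGL), as reported in Table 2:
#   FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
# The syllable counter below is a rough vowel-group heuristic (an assumption),
# so results may deviate slightly from published readability tools.
import re

def count_syllables(word: str) -> int:
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(len(groups), 1)

def fkgl(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

print(round(fkgl("Gravitational waves are ripples in spacetime. They were predicted by Einstein."), 2))
```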
Table 3. Descriptive Statistics and Main Effects of Reader Profiles. Source: Authors’ contribution.
Linguistic Features | Reader 1 (High RS/High PK **): M (SD) | Reader 2 (High RS/Low PK **): M (SD) | Reader 3 (Low RS/High PK **): M (SD) | Reader 4 (Low RS/Low PK **): M (SD) | F(3, 303) | p | η2
Academic Writing * | 75.84 (24.74) | 51.66 (26.48) | 33.06 (27.15) | 34.30 (22.96) | 121.25 | <0.001 | 0.38
Language Variety * | 80.73 (19.21) | 50.76 (20.57) | 27.72 (17.39) | 30.33 (18.44) | 251.32 | <0.001 | 0.55
Lexical Density * | 0.68 (0.12) | 0.61 (0.12) | 0.59 (0.11) | 0.58 (0.10) | 226.13 | <0.001 | 0.53
Sentence Cohesion * | 32.86 (28.89) | 54.75 (29.93) | 55.83 (22.68) | 60.45 (26.92) | 35.11 | <0.001 | 0.15
Noun-to-Verb Ratio * | 2.79 (0.46) | 2.53 (0.55) | 2.54 (0.72) | 1.84 (0.34) | 119.86 | <0.001 | 0.37
Sentence Length * | 18.62 (5.97) | 14.78 (5.49) | 14.59 (4.47) | 13.53 (4.11) | 61.98 | <0.001 | 0.23
Word Concreteness * | 29.86 (17.79) | 50.52 (25.63) | 55.18 (27.21) | 60.76 (24.96) | 57.26 | <0.001 | 0.22
Sophisticated Word * | 88.85 (9.52) | 51.12 (21.09) | 29.05 (17.64) | 23.42 (16.06) | 603.28 | <0.001 | 0.75
Academic Frequency * | 2.78 (0.01) | 2.77 (0.01) | 2.73 (0.01) | 2.72 (0.01) | 12.41 | <0.001 | 0.06
Causal Connectives * | 0.01 (0.01) | 0.01 (0.01) | 0.01 (0.01) | 0.01 (0.01) | 3.11 | 0.03 | 0.02
Temporal Connectives * | 0.01 (0.01) | 0.01 (0.01) | 0.01 (0.01) | 0.01 (0.01) | 0.79 | 0.50 | 0.00
All Connectives * | 0.05 (0.01) | 0.05 (0.01) | 0.05 (0.01) | 0.05 (0.01) | 3.54 | 0.02 | 0.02
* Significant effect. Small effect: η2 = 0.01, Medium: η2 = 0.06, Large: η2 = 0.14 [95]. ** RS = Reading Skill, PK = Prior Knowledge.
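The main effects reported in Tables 3–5 are ANOVA F tests with eta-squared effect sizes. The sketch below shows the general computation for a single factor; the data frame and column names are hypothetical, and the article’s design includes several factors simultaneously, so the reported values come from a fuller model than this one-way example.

```python
# Sketch of a one-way ANOVA and eta-squared computation, in the spirit of the
# main effects reported in Tables 3-5. The data frame and column names are
# hypothetical; they do not reproduce the study's dataset or full design.
import pandas as pd
from scipy import stats

# Hypothetical long-format data: one linguistic index per adapted text,
# grouped by reader profile (Reader 1-4).
df = pd.DataFrame({
    "profile": ["R1", "R1", "R2", "R2", "R3", "R3", "R4", "R4"],
    "sentence_cohesion": [30.1, 35.6, 52.4, 57.0, 54.2, 57.5, 58.9, 62.0],
})

groups = [g["sentence_cohesion"].values for _, g in df.groupby("profile")]
f_stat, p_value = stats.f_oneway(*groups)

# Eta-squared = SS_between / SS_total
grand_mean = df["sentence_cohesion"].mean()
ss_total = ((df["sentence_cohesion"] - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
eta_squared = ss_between / ss_total

print(f"F = {f_stat:.2f}, p = {p_value:.3f}, eta^2 = {eta_squared:.2f}")
```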
Table 4. Descriptive statistics and main effects of LLMs. Source: authors’ contribution.
Linguistic Features | Claude: M (SD) | Llama: M (SD) | Gemini: M (SD) | ChatGPT: M (SD) | F(3, 303) | p | η2
Academic Writing | 48.42 (31.73) | 53.28 (30.50) | 45.78 (30.98) | 47.37 (29.23) | 1.98 | 0.12 | 0.01
Language Variety * | 47.16 (29.78) | 40.04 (27.99) | 54.42 (28.78) | 47.92 (25.37) | 13.15 | <0.001 | 0.06
Lexical Density * | 0.62 (0.12) | 0.61 (0.12) | 0.62 (0.12) | 0.61 (0.12) | 5.26 | 0.001 | 0.03
Sentence Cohesion * | 63.35 (28.59) | 41.55 (27.20) | 47.94 (27.31) | 51.05 (29.52) | 21.73 | <0.001 | 0.10
Noun-to-Verb Ratio * | 2.56 (0.84) | 2.38 (0.56) | 2.44 (0.57) | 2.33 (0.51) | 7.88 | <0.001 | 0.17
Sentence Length * | 12.46 (4.38) | 16.32 (5.15) | 16.38 (5.09) | 16.35 (5.88) | 42.71 | <0.001 | 0.17
Word Concreteness | 47.33 (26.23) | 46.25 (26.89) | 51.94 (27.85) | 50.80 (26.00) | 1.60 | 0.189 | 0.10
Sophisticated Word * | 47.30 (32.20) | 46.70 (27.88) | 49.38 (31.52) | 49.05 (30.82) | 2.72 | 0.04 | 0.01
Academic Frequency * | 2.71 (0.01) | 2.77 (0.01) | 2.73 (0.01) | 2.80 (0.01) | 17.57 | <0.001 | 0.08
Causal Connectives | 0.01 (0.01) | 0.01 (0.01) | 0.01 (0.01) | 0.01 (0.01) | 0.56 | 0.64 | 0.00
Temporal Connectives * | 0.01 (0.01) | 0.01 (0.01) | 0.01 (0.01) | 0.01 (0.01) | 11.87 | <0.001 | 0.06
All Connectives * | 0.05 (0.01) | 0.05 (0.01) | 0.05 (0.01) | 0.05 (0.01) | 4.03 | 0.007 | 0.02
* Significant effect. Small effect: η2 = 0.01, Medium: η2 = 0.06, Large: η2 = 0.14 [95].
Table 5. Descriptive statistics and main effects of text types. Source: authors’ contribution.
Linguistic Features | Science Texts: M (SD) | History Texts: M (SD) | F(1, 303) | p | η2
Academic Writing | 50.08 (27.98) | 47.34 (33.15) | 2.80 | 0.10 | 0.01
Language Variety | 47.06 (29.30) | 47.71 (27.56) | 0.84 | 0.36 | 0.00
Lexical Density * | 0.72 (0.06) | 0.51 (0.05) | 5743.50 | <0.001 | 0.90
Sentence Cohesion | 51.26 (28.63) | 50.68 (29.80) | 0.09 | 0.76 | 0.00
Noun-to-Verb Ratio * | 2.34 (0.58) | 2.51 (0.68) | 18.31 | <0.001 | 0.03
Sentence Length * | 17.67 (5.21) | 13.09 (4.59) | 254.88 | <0.001 | 0.30
Word Concreteness | 49.53 (28.42) | 48.62 (25.10) | 0.12 | 0.73 | 0.00
Sophisticated Word * | 49.27 (30.95) | 46.94 (30.25) | 5.00 | 0.03 | 0.01
Academic Frequency * | 2.80 (0.01) | 2.81 (0.01) | 97.11 | <0.001 | 0.14
Causal Connectives * | 0.01 (0.01) | 0.00 (0.00) | 78.44 | <0.001 | 0.11
Temporal Connectives * | 0.01 (0.01) | 0.01 (0.01) | 17.01 | <0.001 | 0.03
All Connectives * | 0.06 (0.01) | 0.05 (0.01) | 26.50 | <0.001 | 0.04
* Significant effect. Small effect: η2 = 0.01, Medium: η2 = 0.06, Large: η2 = 0.14 [95].
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
