Article

Multi-Agentic LLMs for Personalizing STEM Texts

by Michael Vaccaro, Jr. *, Mikayla Friday and Arash Zaghi *

School of Civil and Environmental Engineering, College of Engineering, University of Connecticut, Storrs, CT 06269, USA

* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 7579; https://doi.org/10.3390/app15137579
Submission received: 9 June 2025 / Revised: 2 July 2025 / Accepted: 4 July 2025 / Published: 6 July 2025

Abstract

Multi-agent large language models promise flexible, modular architectures for delivering personalized educational content. Drawing on a pilot randomized controlled trial with middle school students (n = 23), we introduce a two-agent GPT-4 framework in which a Profiler agent infers learner-specific preferences and a Rewrite agent dynamically adapts science passages via an explicit message-passing protocol. We implement structured system and user prompts as inter-agent communication schemas to enable real-time content adaptation. The results of an ordinal logistic regression analysis hinted that students may be more likely to prefer texts aligned with their profile, demonstrating the feasibility of multi-agent system-driven personalization and highlighting the need for additional work to build upon this pilot study. Beyond empirical validation, we present a modular multi-agent architecture detailing agent roles, communication interfaces, and scalability considerations. We discuss design best practices, ethical safeguards, and pathways for extending this framework to collaborative agent networks—such as feedback-analysis agents—in K-12 settings. These results advance both our theoretical and applied understanding of multi-agent LLM systems for personalized learning.

1. Introduction

Recent advancements in large language models (LLMs) and multi-agent systems (MASs) have opened new opportunities for tackling complex challenges in education, such as the on-demand personalization of learning materials. Traditionally, MAS designs have followed rule-based or probabilistic methods, assigning clearly defined roles and tasks to specialized agents that interact through predefined communication protocols [1]. At times, these inter-agent communications have lacked interpretability (e.g., [2]), limiting the potential for human-in-the-loop artificial intelligence (AI) in MASs. However, the development of powerful LLMs has revealed a new frontier in MASs. These so-called LLM-based MASs utilize multiple instances of an LLM, with each instance acting as an agent. These agents communicate by passing data in natural language between their outputs and inputs [3,4]. This greatly improves the interpretability of agent communications, allowing humans to understand the data that the MAS is processing and to join the conversation. Past works have recognized explainable AI as a critical area of research that fosters trust between humans and AI algorithms and ensures that model output and decisions are ethically aligned with human values and norms [5]. Given the novelty of LLM-based MASs, there remains a lack of research on their ability to serve as personalization tools—especially in K-12 education. As an early step towards filling this gap, this paper presents and tests a two-agent GPT-4 framework for personalizing science texts. Specifically, we explore the MAS's ability to identify and adapt texts to students' unique learning preferences through a pilot randomized controlled trial (RCT) in a middle school setting.
To understand the significance of these advancements, we first consider the historical challenges and aspirations surrounding personalized learning (PL), which has long been recognized as an essential yet challenging educational goal. Because of this, the National Academy of Engineering named advancements in PL one of the fourteen grand challenges of the 21st century, thus reinvigorating a centuries-old aspiration to individualize instruction [6,7]. PL is now a broad term that encompasses a diverse range of interventions and educational programs [8,9]. Numerous definitions have been proposed, each differing in its incorporation of learner characteristics, instructional designs, and learning outcomes [10,11]. Given this variability, this study defines PL as “instruction in which the pace of learning and the instructional approach are optimized for the needs of each learner,” where “learning objectives, instructional approaches, and instructional content…may all vary based on learner needs” [12] (p. 9). In this paper, we use the term “adapt” to refer to the development of personalized materials.
In defining PL, we note that the term “personalization” is closely related to, but distinct from, terms like “differentiation” and “individualization”. Differentiation provides varying instructional paths (e.g., learning materials, difficulty, assessments, and final products) to help all students reach the same learning objectives. Individualization, by contrast, adjusts the pace and method of instruction at the single-student level to suit each learner’s unique needs and to help them meet common learning goals [13].
Despite its potential, the implementation of PL in K-12 classrooms has often been burdened by constraints such as teacher workload and resource availability [14,15,16]. To overcome these challenges, some educators have turned to technologies like intelligent tutoring systems (ITSs), which use domain-specific knowledge and student achievement tracking to differentiate or individualize learning materials [17]. In some cases, an ITS may be implemented as an MAS (e.g., [18]), with multiple agents working together to create adaptive learning environments. This link between MASs and ITSs is also demonstrated by researchers like Giuffra et al. [19], who developed a two-agent ITS to analyze student learning, provide feedback, and share achievement-level-appropriate content.
MASs evolved from the field of distributed AI, which recognizes that single AI systems often lack the ability to solve complex problems [20]. The agents in these systems may either share common characteristics and functions or may be specialized for distinct tasks [21]. Building upon the principles of distributed AI and MASs, LLMs present a unique opportunity for PL. Owing to the transformer architecture and their few-shot learning capabilities [22,23], LLMs can quickly synthesize information and adapt their outputs to diverse inputs. This ability suggests that an LLM-based MAS could infer students’ individual needs and preferences; however, more research is needed to confirm this ability. LLMs also possess broad interdisciplinary knowledge, potentially moving PL beyond the traditional limits of domain-specific agents in multi-agent ITSs [24,25,26].
In this paper, we contribute an MAS PL architecture in which two LLM agents—a Profiler and a Rewrite agent—are connected via a formal message-passing protocol. Because LLMs are used, we specify our inter-agent protocol using the models’ natural language system and user message inputs. This enables clear role definitions and enhanced interpretability, crucial to the development of ethical and explainable AI systems [5]. The main contribution of this work is the testing of a two-agent LLM message-passing protocol designed to identify and adapt texts to students’ preferences in a real-world setting with human participants. We hypothesized that the multi-agent network would show promising results with human participants based on the authors’ prior simulation work [27], which used LLMs only. We assess this potential with a small cohort of middle school students (n = 23). Middle school was selected as the population of interest as the early teen years represent a crucial time for fostering student interest in STEM [28].
The remainder of this paper is organized as follows. Section 2 introduces relevant research on ITSs and multi-agent LLMs. Section 3 then describes the pilot RCT design, participant pool, and procedure. Section 4 and Section 5 present the results and a discussion of the study’s significance and limitations, respectively. Section 5 also notes ethical considerations surrounding AI in education. Section 6 then concludes this paper.

2. Related Work

2.1. Intelligent Tutoring Systems in Education

In general, the goal of PL is to adjust both the learning process and educational materials to each student’s preferences and interests. This goal, whether technology-enabled or human-driven, is not a modern concept. PL has a rich history in the United States that stretches as far back as the origins of its education system [6]. As legislation emerged during the Progressive era requiring school attendance for children [29], so too did several theories on how best to educate large groups of students. As Dockterman [6] points out, the early 1800s competency-based approach to education was overwhelmed by rapidly rising enrollments. This led to calls for PL and increasingly numerous investigations into technology-enhanced learning as computers grew in popularity.
Researchers have investigated the potential of technology-enabled PL since the 1960s [30]. After six decades of research, no single platform has been recognized as optimal. Thus, researchers have continued to develop and evaluate new technology-enabled PL frameworks, including ITSs, in K-12 and post-secondary education [31,32,33,34,35]. This earlier research has set the stage for PL using modern AI models. In fact, several of the principles used to develop early ITSs can still be found in more recent applications. As discussed in a foundational review by Shute and Psotka [36], an ITS must satisfy the following three criteria: the system must (i) possess relevant content knowledge, (ii) be able to describe a student’s existing understanding, and (iii) be able to implement different teaching strategies. As an early example, Reiser et al.’s [37] Lisp ITS met these three criteria, using an ideal-student model to provide the ITS with domain-specific knowledge, a model tracing module with a bug catalog to model errors and to track student progress, and a tutoring module to provide guidance to students. Although techniques and approaches have varied over time, the definition provided by Shute and Psotka has remained essentially unchanged, as evidenced by more recent ITS developments [38,39,40].
Modern ITSs have been shown to be effective learning tools when compared with human-led tutoring [41,42]. For example, Contrino et al. [43] developed an adaptive learning environment for business students in an introductory statistics course and attributed modest grade improvements to the adaptive technology. In another study, McCarthy et al. [44] demonstrated how a natural language processing (NLP) model could evaluate and improve high school students’ reading comprehension. In their study, student comprehension was first scored using the NLP model. These scores were then used to select the next passage. This process was found to provide students with meaningful scaffolds as the difficulty increased [44].
As Bulger [45] points out, however, modern PL systems often act as “responsive” rather than “adaptive”. This is because modern systems assign students to predetermined materials based on performance, rather than actively generating unique content aligned with their needs and learning preferences. Unlike responsive systems, LLMs can be prompted with critical information about a learner. This information can come in several forms, ranging from examples of past performance to specific hobbies or interests [46]. To move beyond the decision-tree approach of responsive systems, the present study deploys a two-agent GPT-4 system to both identify and adapt science texts to student preferences. Student preferences, as operationalized in this study, represent the types of educational texts a student would select when given the option [47]. We consider these preferences to be non-mutually exclusive and potentially dynamic.

2.2. Multi-Agent Large Language Models

LLMs have been applied in diverse fields including psychology [48], neuroscience [49], and medicine [50]. Studies in these fields have largely been successful, reflecting the diversity of knowledge captured in LLMs’ training data [51]. In addition to their training data, the multi-disciplinary success of these models can be attributed to the transformer architecture, which allows the LLM to parse context and generate relevant responses [23,52]. While LLMs can make mistakes, their accuracy has continued to improve with newer releases [53,54]. While many studies have investigated teachers’ use of LLMs (e.g., grading [55], assessment development [56]), fewer have assessed how students can benefit from these models. In one recent work, Guo et al. [57] tested whether integrating LLMs in science and engineering education would improve middle school students’ knowledge acquisition. The results of their controlled study showed that students who used the LLM achieved significantly higher post-test scores than students who learned in a traditional format. Similar work integrating ChatGPT-4o into first-year undergraduate computer science education showed a significant improvement in student knowledge following an eight-week period [58].
Much research has focused on understanding the relationships between user-provided prompts and the outputs that LLMs generate. Research of this kind has led to the new field of prompt engineering [59], which focuses on crafting inputs that elicit the best possible outputs [60]. While some characteristics of a good prompt are transferable between use cases, they are often task-specific. This makes prompt design and LLM-powered MAS architectures an important area of ongoing research. Prompt engineering is especially important in education, where prompts must be designed to be interpretable and ensure that accurate information is generated across subject areas. Chain-of-thought prompting, which decomposes large tasks into smaller steps, has been found to improve model performance [61] and to let users inspect the LLMs’ reasoning. This prompting strategy could be valuable to both teachers and students, allowing them to better understand the model’s internal logic and the output it generates. For students, the combination of chain-of-thought prompting with LLMs’ rapid feedback may help scaffold the learning process [62,63].
LLMs have also demonstrated improved performance when complex tasks are decomposed among multiple specialized agents. For instance, in education, we may envision one model that describes the status of a student’s learning, another that focuses on the content to be taught, and a third that decides on an optimal teaching strategy for the student. By passing the output of one of these models to the others, we may achieve a framework capable of on-demand personalization. Reminiscent of ITSs, we recognize that this structure—where multiple LLMs serve as different agents in an ITS—is not unique to the present work [64,65]. As just one example, the GenMentor ITS of Wang et al. [66] uses five LLM agents to develop a goal-oriented learning environment.
Despite the existing research on LLMs in education, there is a notable lack of experimental literature exploring the ability of multi-agent LLM systems to identify and adapt science texts to students’ learning preferences. This is especially true of research in K-12, which is scarce when compared to post-secondary settings. Thus, the present study systematically tests the feasibility of using a two-agent LLM system to adapt educational materials to individual middle school students’ learning preferences. We intend our findings to serve as a starting point for future applications of LLMs in K-12 education.

3. Methods

3.1. Study Design

This pilot RCT evaluated a two-agent GPT-4 system’s ability to develop personalized science texts for middle school students. Personalization was achieved by tuning GPT-4’s outputs to each student’s learning preferences using carefully designed system and user prompts [67,68]. These prompts were engineered and iteratively refined by the research team in a previous five-agent simulation study [27]. This work uses the Profiler and Rewrite agents developed in that study.
In the simulation study by Friday et al. [27], the authors used a Similarity Checker agent to assess the accuracy of the profiles generated by the Profiler on a scale from 0 to 100. The average profile accuracy score given by the Similarity Checker was 79 out of 100. An intraclass correlation coefficient of 0.710 (95% confidence interval (CI): (0.620, 0.785)) was computed between the Similarity Checker and the first two authors’ independently rated scores on a subset of one hundred student profiles. This indicated that the Similarity Checker, and thus the Profiler agent, performed well. Readers interested in further details about the development of the Profiler and Rewrite agents are directed to the authors’ past work [27].

3.1.1. Component 1: Training Session

The present study comprises three components, summarized in Figure 1. The first component, referred to as the training session, aims to identify the students’ learning preferences. During the training session, the student selects their preferred paragraph from a pair of science texts a total of four times. Each pair discusses the same topic and contains the same content. To capture a range of text presentations, the paragraphs within each pair were rewritten using GPT-4 to target the extremes on one dimension of the Felder–Silverman model [69]. This model consists of four dimensions assessing how students best perceive (sensing or intuitive), receive (visually or verbally), process (actively or reflectively), and understand (sequentially or globally) information. Because this work focuses on text as an educational medium, the visual/verbal dimension was implemented as an imagery/no-imagery dimension.
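As an illustration of this design, the four binary choices from the training session can be mapped onto the model’s four dimensions. The sketch below is hypothetical: the dimension and pole labels follow Felder and Silverman (with imagery/no-imagery substituted for visual/verbal), but the function name and data layout are ours, not the study’s actual implementation.

```python
# Illustrative mapping of the four training-session choices onto the
# Felder-Silverman dimensions used to construct the paired paragraphs.
# Each training pair targets the two extremes of one dimension.
DIMENSIONS = [
    ("perceive", ("sensing", "intuitive")),
    ("receive", ("imagery", "no-imagery")),  # visual/verbal adapted for text
    ("process", ("active", "reflective")),
    ("understand", ("sequential", "global")),
]

def choices_to_preferences(choices):
    """Map four binary choices (0 = first pole, 1 = second pole)
    to a dict of inferred preference poles, one per dimension."""
    if len(choices) != len(DIMENSIONS):
        raise ValueError("expected one choice per dimension")
    return {name: poles[c] for (name, poles), c in zip(DIMENSIONS, choices)}
```

For example, a student who chose the sensing, imagery, reflective, and global paragraphs would yield a preference record spanning all four dimensions, which can then be verbalized into a natural-language profile.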
Before continuing, we find it important to note that traditional learning-style models have been heavily criticized. For instance, some researchers have rejected the idea of placing students into one-size-fits-all categories [70]. The authors of this work agree and recognize that student preferences are likely to fall on a spectrum within each dimension. With this in mind, we emphasize that students’ learning preferences as originally forwarded by Felder and Silverman [69] (i.e., using a formal survey) were not assessed in this study. Rather, the Felder–Silverman model was used to develop diverse texts for the training session (see Figure 1, part 1). Because student profiles were developed from these samples, it was crucial that the texts be substantially distinct from one another. This method of discerning students’ preferences was used as it avoids collecting sensitive data, which may raise ethical concerns if handled improperly.
The paragraphs used in the training session were designed for the simulation study noted above [27]. The topics of these paragraph pairs, chosen to be appropriate for Connecticut middle school students, include the water cycle, climate change, photosynthesis, and the states of matter. These topics are aligned with the Next-Generation Science Standards (NGSS), which were adopted by Connecticut in 2015. NGSS emphasizes a comprehensive, interdisciplinary three-dimensional approach to education, ensuring that students engage in hands-on learning and critical thinking bounded by science and engineering practices, disciplinary core ideas, and cross-cutting concepts [71]. LLMs may help achieve these goals, exposing students to content-specific scientific practices, relating topics to students’ life experiences, and making parallels across adjacent subject areas.

3.1.2. GPT-4 Agents and Component 2: Test Session

Data from the training session is fed into two GPT-4 agents (v. gpt-4-1106-preview, temperature: 1.0). The first agent uses the student’s choices to generate a short description (or “profile”) of the student’s preferences. Once generated, this profile is fed into the user message of the Rewrite agent. The Rewrite agent then outputs a profile-aligned version of a given science-related text. We have made these training paragraphs along with the two agents’ prompts available on the first author’s GitHub page [72]. The main components of the prompts for these agents are highlighted in Figure 2.
In total, there are four sequential calls to the OpenAI API for each student: two by the Profiler and two by the Rewrite agent. On average, each API call to the Profiler and Rewrite agents took 6.81 ± 1.73 and 9.07 ± 3.38 s to complete, respectively.
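The message-passing protocol between the two agents can be sketched as follows. The prompt wording below is placeholder text for illustration only; the actual system and user prompts are the ones published on the first author’s GitHub page [72], and the function names are ours.

```python
# Sketch of the inter-agent protocol: the Profiler's output (a student
# profile) becomes part of the Rewrite agent's user message.
MODEL = "gpt-4-1106-preview"  # model version reported in this paper

def profiler_messages(training_choices):
    """Build the chat payload asking the Profiler to infer a profile
    from the student's training-session selections."""
    return [
        {"role": "system", "content": "You describe a student's learning "
                                      "preferences from their text choices."},
        {"role": "user", "content": "Selected paragraphs:\n" +
                                    "\n---\n".join(training_choices)},
    ]

def rewrite_messages(profile, source_text):
    """Build the chat payload asking the Rewrite agent to adapt a science
    text to the profile produced by the Profiler."""
    return [
        {"role": "system", "content": "Rewrite science texts for a middle "
                                      "school student matching the profile."},
        {"role": "user", "content": f"Profile: {profile}\n\nText: {source_text}"},
    ]
```

Each payload would then be sent through OpenAI’s chat completions endpoint (e.g., `client.chat.completions.create(model=MODEL, messages=..., temperature=1.0)`), with the Profiler’s reply inserted into the Rewrite agent’s user message.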
Once generated, the personalized texts are tested against a generic text, as shown in Figure 1. To ensure consistency in content, the generic texts were derived from the same source texts as those used to develop the personalized options. In addition, the generic texts were rewritten by GPT-4 to ensure consistency in the tone of the two options. The generic texts were kept the same across participants as they were not adapted to any specific learning preferences.

3.1.3. Component 3: Profile Evaluation

The final component of this study is the profile evaluation. In this phase, the accuracy of the student profile used to create the personalized texts is assessed. Participants select between the true and opposite profiles (see Figure 1), which are presented without labels to avoid bias. Participants are also asked to give a brief written or verbal justification for their choice between the two profiles.

3.1.4. Control Group and Summary

Figure 1 outlines the study’s structure for the experimental group. For the control group, the opposite profile is used to generate the personalized texts in the test session. The opposite profile was generated by deliberately inverting the student’s choices from the training session and, thus, their measured preferences. This can be seen in Table 1, which provides an example of a true and opposite profile pair generated for one participant. This structure was chosen to heighten the contrast between the personalized texts generated under the experimental and control conditions. Participants assigned to the experimental group were thus expected to select the personalized texts, while those in the control group were expected to select the generic texts.
The third and fourth rows of Table 1 show a sample of a text which was adapted by the Rewrite agent to match the student’s true profile. Because the agent’s system message instructed it to ensure the text was accessible to a middle school student, the Rewrite agent simplified some of the language and adopted a more conversational tone. We also see that the Rewrite agent successfully incorporated aspects of the true profile. Examples include the use of imagery in the first sentence of the personalized text as well as clear and organized descriptions.

3.2. Participants

This study was approved by the University of Connecticut’s IRB (H23-0348, approved 31 July 2023). A total of twenty-four seventh- and eighth-grade students (ages 11 to 13) from one middle school in Connecticut were recruited. The middle school where this study took place has approximately 500 students, out of which approximately 200 to 250 are in grades 7 and 8. An IRB-approved recruitment email with an advertisement flyer was sent to all parents and/or legal guardians of seventh- and eighth-grade students by school administrators in order to recruit participants. Hard copies of the flyer were also available for these students to bring home. Emails and flyers contained a link to an online permission form which, once signed by a parent/legal guardian, enrolled students in the study. As this research involved minors, participants were also required to provide assent. Of the potential participant pool, twenty-four students’ parents/legal guardians filled out the online permission form, creating a self-selected (i.e., volunteer) sample. One participant withdrew prior to data collection (valid n = 23). Following completion of all activities, participants were compensated with either a pocket microscope or a scientific calculator, both valued near USD 20. Participants were free to select the item they wished to receive.
The full sample was divided into two groups (experimental and control) using stratified random sampling, with strata formed based on gender (male/female/non-binary). Twelve participants were randomly assigned to the experimental group, and the remaining eleven participants were assigned to the control group. Although stratified random sampling was only conducted on gender, the experimental and control groups were well-balanced in terms of participant age, gender, and ethnicity. Table 2 summarizes these group demographics. To be eligible to participate in this study, students needed to be fluent in the English language and to have no uncorrected vision/hearing impairments that would prohibit them from reading text on a computer screen or from listening to and following verbal instructions. We did not assess students’ reading level in this study.
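The stratified assignment described above can be sketched in a few lines. This is an illustrative reconstruction, not the study’s actual assignment code; the function name and data layout are ours.

```python
import random

def stratified_assign(participants, seed=0):
    """Randomly split participants into experimental/control groups within
    each gender stratum so that group demographics stay balanced.
    `participants` is a list of (participant_id, gender) tuples."""
    rng = random.Random(seed)
    strata = {}
    for pid, gender in participants:
        strata.setdefault(gender, []).append(pid)
    experimental, control = [], []
    for members in strata.values():
        rng.shuffle(members)          # randomize order within the stratum
        half = (len(members) + 1) // 2  # odd strata give the extra member
        experimental.extend(members[:half])
        control.extend(members[half:])
    return experimental, control
```

With 12 participants in one stratum and 11 in another, this yields the 12/11 experimental/control split reported above.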

3.3. Research Procedure

Data collection took place in the middle school to minimize barriers to participation. Students were scheduled into one 45 min timeslot between 8 a.m. and noon by school administrators on one of five days. The study was conducted in an available office space and participants were not told if they had been assigned to the experimental or control group. The office space featured a central round table where researchers and the participant could sit during the study. A separate desk was available for a school-approved faculty member who was present to supervise researchers’ interactions with the students, as required by the IRB-approved study protocol and by the school administration. The school faculty member was not involved in the study activities. Data collection took place between February and March of 2024.
One student participated in the study at a time. On average, each student needed 25 to 35 min to complete the activities. After students assented to participation, computer screen and audio recordings were started to capture any verbal responses or reactions from the participants and to serve as a back-up in case any issues occurred in data saving. The study procedure contained in Figure 1 was implemented using a custom-built graphical user interface (GUI) developed in Python v. 3.11.6 using CustomTkinter v. 5.2.2 [74]. The GUI interacted with GPT-4 through OpenAI’s API v. 1.9.0 and implemented the study program for both groups. The code for the GUI is available for download [72].
A screenshot of the GUI during the training session is presented in Figure 3. The same layout was also used in the test session. The GUI was designed to be simple and intuitive. In addition to stating the topic of the paragraph pair and providing the text options, a related, pre-generated image was included to add some visual intrigue. The images used across the six science topics were the same for each participant. Participants were provided with the simple instruction to “Please select your preferred paragraph from the two options below” and the gentle note to “Remember there is no correct response”.
After the test session, the GUI displayed a screen which asked participants to select the paragraph they believed best described their preferences. This corresponds to the profile evaluation phase of the study and is shown in Figure 4. The location of the true profile, i.e., Option 1 or Option 2, was randomized for each participant.

4. Results

All analyses were performed using IBM SPSS Statistics v. 29. All participants completed the study activities; thus, the dataset contains no missing or incomplete entries. A choice score was calculated for each participant based on data collected during the test session. This score was equal to the number of times a participant selected the personalized paragraph over the generic text. Recall that participants in the experimental group were expected to select the personalized texts, while those in the control group were expected to select the generic texts. Because the test session consisted of two paragraph pairs, each participant’s final choice score could be either 0, 1, or 2. Table 3 summarizes the results of the test session for the experimental and control groups.
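A minimal sketch of the choice-score computation (function and label names are ours, for illustration):

```python
def choice_score(selections):
    """Number of times the participant picked the personalized paragraph
    over the generic one across the two test-session pairs (0, 1, or 2).
    `selections` holds the pick per pair: "personalized" or "generic"."""
    if len(selections) != 2:
        raise ValueError("test session has exactly two paragraph pairs")
    return sum(1 for s in selections if s == "personalized")
```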
Given the count nature of the dependent variable, we first considered a negative binomial regression model. However, our data are not over-dispersed (the grand mean and variance are 1.0 and 0.545, respectively), and over-dispersion is the core condition motivating a negative binomial model. In addition, our dependent variable is strictly bounded to integers between 0 and 2, which precludes the use of count models that assume an unbounded range of non-negative integers.
With this in mind, ordinal logistic regression was selected to model the effect of group membership (i.e., experimental or control) on the ability of the two-agent GPT-4 MAS to identify and adapt text to the participant’s learning preferences. In this model, the dependent variable is treated as an ordinal variable. The test of parallel lines was used to check the assumption of proportional odds (χ2(1) = 1.193, p = 0.275 > 0.05). Since p > 0.05, we fail to reject the null hypothesis that the proportional odds assumption holds. Other assumptions, such as no multicollinearity between independent variables and the independence of observations, are satisfied by the study design (i.e., only one independent variable—study group—is tested, and all participants are only observed once). Thus, an ordinal logistic regression model was determined to be appropriate for these data.
The final model trended towards significance at the 0.05-level when compared to the intercept-only model (χ2(1) = 3.183, p = 0.074). Students in the experimental group (i.e., students who had texts personalized to their preferences) were estimated to be 4.373 (95% CI: (0.815, 23.454)) times more likely (Wald χ2(1) = 2.965, p = 0.085) than students in the control group to prefer the personalized text. This odds ratio trended towards significance at the 0.05-level, suggesting that participants in the experimental group may have been more likely to select the personalized paragraphs during the test session. We obtain a pseudo-R2 (Nagelkerke) of 0.147, meaning that our model explains approximately 14.7% of the variance observed in our dependent variable. Finally, we note that the confidence interval of the odds ratio is quite wide due to the limited sample size and exploratory nature of the study, so we suggest that the results of this pilot RCT be interpreted with extreme caution. While we find the results to be suggestive of a potential relationship within the context of this feasibility study, we caution readers against generalizing beyond this work.
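For readers who wish to check the arithmetic, the reported odds ratio and its confidence interval follow directly from the underlying log-odds coefficient via a Wald-type interval. The standard error below is back-calculated from the reported interval and is illustrative, not taken from the SPSS output.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a logistic-regression coefficient (log-odds) and its
    standard error into an odds ratio with a Wald-type 95% CI."""
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))

# The reported values imply beta = ln(4.373) ≈ 1.476 with SE ≈ 0.857,
# reproducing the interval (0.815, 23.454) to within rounding.
```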
Of the 23 participants enrolled in this study, 14 selected their true profile in the profile evaluation phase (Table 4). This proportion does not differ significantly (p = 0.405) from chance under a binomial distribution with a success probability of 0.5, suggesting that, on average, participants performed no better than random when tasked with selecting their true profile. A z-test for independent proportions (Z = 1.450, p = 0.147) suggests that the differences in the group proportions observed in Table 4 are no greater than those expected by chance.
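The reported p-value can be reproduced with an exact two-sided binomial test against chance (null proportion 0.5), using only the standard library:

```python
import math

def binom_two_sided_p(k, n):
    """Exact two-sided binomial test against chance (p = 0.5):
    doubles the probability of the smaller tail, capped at 1."""
    tail = min(k, n - k)
    p_tail = sum(math.comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * p_tail)
```

With k = 14 successes out of n = 23 participants, this returns approximately 0.405, matching the value reported above.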
Of the 9 students who selected their opposite profile during the profile evaluation phase, 7 explicitly referenced themes that were discussed in the options presented. As an example, one student justified their selection of the opposite profile by saying the following: “I enjoy straightforward, factual information because it makes more sense. I like factual information because it’s more clearly described so that I can more easily understand it.” Table 5 presents this student’s true and opposite profiles for reference. While the preferences noted in the true profiles did not resonate with these 7 students, we note that 14 students did select their true profile. This suggests that more data (e.g., qualitative descriptions of preferences from structured interviews or surveys, additional preference measures like past achievement) may be needed in the training session to accurately capture these dimensions of students’ learning preferences.

5. Discussion

This RCT evaluated the ability of a two-agent GPT-4 framework to systematically adapt educational science-related texts to the unique learning preferences of individual middle school students. To evaluate this ability, participants in the experimental group selected their preferred paragraph between (i) a text aligned with their true profile and (ii) a generic text, while those in the control group chose between (i) a text aligned with their opposite profile and (ii) the same generic text. An ordinal logistic regression suggested that participants in the experimental group may have been more likely to select the personalized paragraphs during the test session (p = 0.085). These data suggest that LLMs may be able to effectively adapt educational texts to the preferences of individual students. This finding is consistent with the results of past research, which has used LLMs to adapt text to academic achievement levels [65] and has developed a multi-agent LLM framework to implement ITSs capable of identifying learner needs and providing tailored content [66]. With this in mind, we again caution the reader that this study was exploratory in nature and that more research is needed to confirm these findings.
The definition of PL provided earlier notes that a personalized environment must be capable of both pacing and tailoring the content of instruction to individual students and their personal interests [12]. The Felder–Silverman model was used here to develop the training paragraphs and, by extension, the initial profile describing each student’s learning preferences. The profile was therefore limited to the model’s four dimensions, which do not explicitly account for other critically important characteristics such as a learner’s preferred pace, specific interests, or existing knowledge. As discussed in Section 4, more diverse types of data (e.g., additional training examples, student input, teacher input, formal questionnaires) should be incorporated to generate more accurate profiles. Moreover, the profiles remained static during the experiment, limiting the amount of information that could be gleaned about each student in relation to the four choices made during the training session. Despite these limitations, this study provides a starting point for developing adaptive (see Bulger [45]) LLM environments. NLP models, of which LLMs are a subset, have previously performed well when tasked with evaluating student performance and knowledge [44]. Thus, profiles can likely be updated as students continue to interact with the tool, based on their performance and evolving preferences. Specifically, the present two-agent framework may be expanded to include more agents capable of revising the initial profiles.
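The modularity that makes such an extension feasible can be sketched as a simple pipeline in which the profile string is the message passed between agents. The function and prompt wording below are illustrative, not the authors' implementation; in practice, the `llm` callable would wrap a GPT-4 chat completion call, and a profile-revision agent would be one more function consuming feedback and returning an updated profile.

```python
# Hedged sketch of the two-agent message-passing pattern; names and
# prompts are illustrative, not the study's actual implementation.
from typing import Callable

LLM = Callable[[str, str], str]  # (system_prompt, user_prompt) -> reply

def profiler_agent(llm: LLM, training_choices: list[str]) -> str:
    """Infer a learner profile from the texts a student chose in training."""
    system = ("You are a Profiler agent. From the student's chosen training "
              "paragraphs, describe their learning preferences along the "
              "four Felder-Silverman dimensions.")
    user = "Chosen paragraphs:\n" + "\n---\n".join(training_choices)
    return llm(system, user)

def rewrite_agent(llm: LLM, profile: str, source_text: str) -> str:
    """Adapt a source passage to the profile produced by the Profiler."""
    system = ("You are a Rewrite agent. Rewrite the passage so its style "
              "matches the learner profile. Preserve all facts.")
    user = f"Learner profile:\n{profile}\n\nPassage:\n{source_text}"
    return llm(system, user)

def personalize(llm: LLM, training_choices: list[str], source_text: str) -> str:
    """Profiler -> Rewrite pipeline; the profile string is the passed message."""
    profile = profiler_agent(llm, training_choices)
    return rewrite_agent(llm, profile, source_text)
```

Because each agent is just a function over (system, user) prompts, adding a feedback-analysis or profile-revision agent amounts to inserting another stage in `personalize` rather than redesigning the system.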
It is important to recognize that we do not expect an LLM-powered multi-agent PL system to be implemented exactly as it is presented in this pilot RCT. Beyond the creation of the initial profile, students’ use of a multi-agent PL system may shift toward a chat-style interaction. In this way, students could ask the LLM as many questions as needed without the fear of judgment or the fear that the model will lose interest [75,76]. LLMs can also provide quick, valuable feedback that can help keep students motivated, especially when they are engaged in difficult content [77]. Finally, further interactions between students and an LLM-based PL tutor could provide valuable insight into student learning [16,78]. These insights could then be incorporated into a student’s profile, potentially leading to an improvement in the text personalization achieved over time.
Owing to their diverse capabilities and growing context awareness [52,79], LLMs also have the unique potential to support neurodiverse students [80]. These students are often disadvantaged by traditional education systems, which intend to accommodate diverse ways of thinking and learning through approaches like universal design for learning but do so from a deficit-based lens [81]. This study, however, highlights the importance of approaching PL from a strengths-based perspective when leveraging LLM-based tools. Specifically, we propose that LLMs may help neurodivergent students, such as those with dyslexia, by reorganizing or simplifying difficult texts. Research has also shown that PL may be extended beyond the adaptation of learning materials to formative or summative assessments [82]. In the context of this work, student profiles may be used to inform the design of tests or homework assignments. This extension could offer students a more holistic PL experience that creates a more level playing field in educational spaces. This potential should be considered in future research, especially as the role of LLMs in education continues to evolve.

5.1. Limitations and Future Research

While this pilot RCT showed promise for LLM-enhanced PL, there are several key limitations that affect the generalizability of the results. Specifically, the generalizability of this study is limited by the small sample size (n = 23), marginally significant findings (p = 0.085), and wide 95% CI (0.815, 23.454). While the results do not confirm an effect, they highlight the need for more research on multi-agent LLMs in education.
Despite emailing several middle school administrations, we received approval to conduct the study at only one school. This greatly limited the number and diversity of students we could sample. As a result, our sample was limited to students who lived in a predominantly affluent region of Connecticut. Other factors contributing to the small sample size may have included student self-selection and the level of comfort parents/legal guardians had with their child participating in an AI-based research study. Our sample was also limited to students in grades 7 and 8. Future research should aim to evaluate the broader efficacy of LLMs in personalizing education with larger samples and for a wider range of students before, during, and after the middle school years.
The texts included in this research study were limited to a handful of middle school science topics. These texts were intended to be introductory so that they would be accessible to students of all achievement levels. As a result, students did not need to follow extensive lines of reasoning or solve problems mathematically. However, these are all competencies that should be incorporated into a PL system. Therefore, future research should also develop LLM-based PL tools that are broadly applicable and not specialized to one topic or one type of learning task.
Although this study provided support for the ability of GPT-4 to adapt texts to students’ preferences based on their prior choices, the period of data collection was limited to a single short session. In addition, this study did not assess differences in student learning or achievement between the experimental and control groups. This was a conscious choice in the study’s design, which allowed students to focus on the delivery of each text without the stress that may be caused by learning a new topic and an upcoming assessment. Given this, we recognize that more research is needed to evaluate the long-term effects of LLM-powered PL on student success and to determine whether the benefits of personalization observed in this study persist as students use the platform over time. Future RCTs should develop content-specific pre/post-comprehension tests to measure learning gain differences between students in an experimental (LLM-personalized) versus control (non-personalized) setting. There is also a need for more longitudinal studies that measure long-term learning associated with LLM use over academic units or years. In addition to pre/post-comprehension tests, students may also demonstrate their knowledge through qualitative works such as written essays, open response questions, or presentations.

5.2. Ethical Considerations

There are several key ethical considerations when integrating advanced AI into education. In the context of AI-enhanced education specifically, researchers have expressed concerns regarding the protection of sensitive student data, the propagation of biases by LLMs, and the need to ensure that AI is used to enhance, and not replace, the learning process [83]. The remainder of this section briefly describes each of these concerns and discusses them in the context of the present study.
Many of the debates surrounding data privacy with advanced AI are similar to those that have played out for educational data mining. While student data including grades and demographic information have been shown to contain valuable information on academic performance, researchers have recognized that these data are private and that steps should be taken to safeguard them [84,85]. When using LLMs, there are instances where students’ private data could be stored for future model training. In addition, modern LLMs have memory and history features, which could pose a security risk if log-in information is compromised. As such, students and researchers should avoid using personal data as inputs [86]. In the present work, experiments were designed to eliminate the potential for students to share private information with the LLM. Specifically, learning preferences and profiles were gleaned from student choices without information like name, gender, age, and past academic performance. As noted earlier, future work may aim to build upon the profiles generated in this study. However, this should be performed cautiously to ensure sensitive data remain protected.
Even though data security issues are of great importance, the ethical application of AI in education extends well beyond security and privacy. Specifically, it is important to recognize the inherent biases (e.g., racial disparity, gender, outgroup hostility [87]) present in machine learning systems that result from their training data. LLMs are trained on a large corpus of data, and their outputs are biased toward this training data [88]. Thus, we recognize that extreme care must be taken when integrating AI into K-12 education. Oversight of AI use in the classroom is likely to come from teachers, who should maintain an active role in their students’ education [83]. In this work, the authors watched students’ screens during the experiment for off-topic or biased outputs. This was performed from a separate computer using macOS’s screen-mirroring application. Given the small scale of this study and the factual nature of the source texts used in the Rewrite agent, systemic bias was not expected to be a major concern. However, we recognize that our platform generated texts “on the fly” and, thus, there was potential for biases to appear. A dedicated “Refresh” (i.e., regenerate) button was added to the GUI during the test session that researchers could click if warranted. We note that there were no cases in this study where researchers had to click the Refresh button.
Lastly, we recognize that the ethical use of AI models in education requires that end-users—namely teachers, students and, by extension of informed consent, parents/legal guardians—be AI-literate [89]. Here, literacy is specifically tied to the use of AI as a learning tool and requires full transparency surrounding data sharing, data storage, and the strengths and weaknesses associated with the model’s use. Such transparency is necessary as it allows users to fully understand both the accuracy of an output and the reasons why it was generated [90]. In this work, parents/legal guardians signed an informed consent form that described how no sensitive personal information would be collected by the LLM. This was also discussed with participants before obtaining their assent. We believe that students should understand what a model does and does not know when interpreting its suggestions, ensuring that AI serves as a supportive tool rather than as a source of confusion, frustration, and error.

6. Conclusions

Large language models have made considerable progress in recent years, making strides in text generation and expanding into other media formats such as images, audio, and video [91]. Together, these abilities offer intriguing possibilities for PL. The main contribution of the present work was testing the message-passing protocol with human participants, in which two GPT-4 agents were used to infer a student’s learning preferences and to adapt educational material accordingly. We assessed this potential in a pilot RCT with n = 23 participants. The results suggested that students may have been more likely to prefer a personalized text during the test session when it was rewritten in accordance with their profile, a result that trended towards significance at the 0.05 level.
These findings are promising for the future of PL as online resources like LLMs become increasingly available to students. Future work in LLM-powered PL should aim to build upon the results of this study, especially given its small sample size and the limited generalizability of its results. Future studies should strive to involve a larger, more diverse set of students and should assess student learning gains relative to traditional education using pre- and post-comprehension tests. While this study demonstrated that GPT-4 could develop an initial profile, future efforts should be dedicated to designing more accurate and dynamic profiles. In addition to the constructs included in this pilot study, future profiles should contain information that can help adapt the pace of content presentation, account for learners’ interests, and model students’ prior knowledge. In future studies, pacing and interests may be controlled by student interactions (e.g., “I am confused about…”, “can you explain that using a sports analogy?”, etc.). Other aspects like prior knowledge may require more sophisticated hierarchical models that can be updated by teachers and/or the LLM as students meet new objectives. These preferences may be modeled on continuous scales in a high-dimensional feature space, avoiding the reduction of learner preferences to exclusive categories, a common criticism of models like that of Felder and Silverman [70]. As in any application, the use of AI must be ethical, so careful attention must be given to the design and implementation of these systems.
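As a toy illustration of such continuous modeling, a profile could be kept as a vector over the four Felder–Silverman axes and nudged toward each newly observed preference signal. The axis encoding and update rule below are assumptions for illustration, not a proposal from this study.

```python
# Hedged sketch: preferences as a continuous vector rather than exclusive
# categories. The axis encoding and update rule are illustrative assumptions.
import numpy as np

DIMS = ["active-reflective", "sensing-intuitive",
        "visual-verbal", "sequential-global"]  # Felder-Silverman axes

def update_profile(profile: np.ndarray, observation: np.ndarray,
                   rate: float = 0.2) -> np.ndarray:
    """Exponential moving average toward each observed preference signal;
    values stay in [-1, 1], with one pole of the axis at each end."""
    return np.clip((1 - rate) * profile + rate * observation, -1.0, 1.0)

profile = np.zeros(len(DIMS))                     # start neutral
choice_signal = np.array([0.8, -0.4, 1.0, 0.0])   # inferred from one choice
profile = update_profile(profile, choice_signal)
```

A smoothed update of this kind lets a single atypical choice shift the profile only slightly, while a consistent pattern of choices gradually pulls each axis toward the observed pole.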
Applications of AI in education are still in their infancy. LLMs will continue to evolve in the near term, becoming increasingly powerful as new models are developed and released. As this happens, our responsibility as educators to prepare students for their future will remain unchanged. In such a rapidly changing field, teachers must become heavily involved in research to ensure all students benefit from, and have the experience necessary to succeed with, AI.

Author Contributions

Conceptualization, M.V.J., M.F. and A.Z.; methodology, M.V.J., M.F. and A.Z.; software, M.V.J. and M.F.; validation, M.V.J., M.F. and A.Z.; formal analysis, M.V.J. and M.F.; investigation, M.V.J. and M.F.; resources, A.Z.; data curation, M.V.J. and M.F.; writing—original draft preparation, M.V.J.; writing—review and editing, M.V.J., M.F. and A.Z.; visualization, M.V.J.; supervision, A.Z.; project administration, A.Z.; funding acquisition, A.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This material is based upon work supported by the National Science Foundation under Award No. 2120888. The first author (M.V.) was supported by a National Science Foundation Research Traineeship (NRT) under Award No. 2152202 “TRANSCEND”.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of the University of Connecticut (protocol code H23-0348; Approval Date: 31 July 2023).

Informed Consent Statement

As the subjects of this study were minors, informed consent was obtained from all subjects’ parents/legal guardians prior to their involvement in the study. Assent was also obtained from all subjects involved in the study.

Data Availability Statement

The de-identified, raw data supporting the conclusions of this article will be made available by the authors on request. All prompts and code used to develop and run the GPT-4 models discussed in this article are available on the first author’s GitHub page at https://github.com/m-vaccaro/LLMs-and-Personalized-Learning (accessed on 18 March 2025).

Acknowledgments

The authors greatly appreciate the support of Larry Barlow, Danielle Vliet, Jada Vercosa, Trent Alsup, Nikhil Ghosh, and Zeynep G Akdemir-Beveridge. During the preparation of this manuscript, the authors used OpenAI’s GPT models as a writing assistant to check grammar and enhance the clarity of the written text. These models were used with extreme oversight and care. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI	Artificial intelligence
API	Application programming interface
CI	Confidence interval
GPT	Generative pre-trained transformer
GUI	Graphical user interface
ITS	Intelligent tutoring system
LLM	Large language model
MAS	Multi-agent system
NGSS	Next Generation Science Standards
PL	Personalized learning

References

  1. Wooldridge, M. An Introduction to Multiagent Systems, 2nd ed.; John Wiley & Sons Ltd.: West Sussex, UK, 2009. [Google Scholar]
  2. Havrylov, S.; Titov, I. Emergence of Language with Multi-Agent Games: Learning to Communicate with Sequences of Symbols. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U., Von Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  3. Han, S.; Zhang, Q.; Yao, Y.; Jin, W.; Xu, Z. LLM Multi-Agent Systems: Challenges and Open Problems. arXiv 2024, arXiv:2402.03578. [Google Scholar]
  4. Li, G.; Al Kadeer Hammoud, H.A.; Itani, H.; Khizbullin, D. CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society. In Proceedings of the Advances in Neural Information Processing Systems 36, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  5. Al-Dahdooh, R.; Marouf, A.; Ghali, M.J.A.; Mahdi, A.O.; Abu-Nasser, B.S.; Abu-Naser, S.S. Explainable AI (XAI). Int. J. Acad. Eng. Res. 2024, 8, 65–70. [Google Scholar]
  6. Dockterman, D. Insights from 200+ Years of Personalized Learning. NPJ Sci. Learn. 2018, 3, 15. [Google Scholar] [CrossRef] [PubMed]
  7. National Academy of Engineering Grand Challenges for Engineering—Advance Personalized Learning. Available online: http://www.engineeringchallenges.org/challenges/learning.aspx (accessed on 7 June 2024).
  8. Shemshack, A.; Spector, J.M. A Systematic Literature Review of Personalized Learning Terms. Smart Learn. Environ. 2020, 7, 33. [Google Scholar] [CrossRef]
  9. Walkington, C.; Bernacki, M.L. Appraising Research on Personalized Learning: Definitions, Theoretical Alignment, Advancements, and Future Directions. J. Res. Technol. Educ. 2020, 52, 235–252. [Google Scholar] [CrossRef]
  10. Bernacki, M.L.; Greene, M.J.; Lobczowski, N.G. A Systematic Review of Research on Personalized Learning: Personalized by Whom, to What, How, and for What Purpose(s)? Educ. Psychol. Rev. 2021, 33, 1675–1715. [Google Scholar] [CrossRef]
  11. Zhang, L.; Basham, J.D.; Yang, S. Understanding the Implementation of Personalized Learning: A Research Synthesis. Educ. Res. Rev. 2020, 31, 100339. [Google Scholar] [CrossRef]
  12. U.S. Department of Education. National Education Technology Plan Update; U.S. Department of Education: Washington, DC, USA, 2017.
  13. Zhang, L.; Carter, R.A.; Bernacki, M.L.; Greene, J.A. Personalization, Individualization, and Differentiation: What Do They Mean and How Do They Differ for Students with Disabilities? Exceptionality 2024. [Google Scholar] [CrossRef]
  14. Adelman, H.S.; Taylor, L.L. Addressing Barriers to Learning: In the Classroom and Schoolwide; University of California, Los Angeles: Los Angeles, CA, USA, 2018. [Google Scholar]
  15. Arnaiz Sánchez, P.; de Haro Rodríguez, R.; Maldonado Martínez, R.M. Barriers to Student Learning and Participation in an Inclusive School as Perceived by Future Education Professionals. J. New Approaches Educ. Res. 2019, 8, 18–24. [Google Scholar] [CrossRef]
  16. Tetzlaff, L.; Schmiedek, F.; Brod, G. Developing Personalized Education: A Dynamic Framework. Educ. Psychol. Rev. 2021, 33, 863–882. [Google Scholar] [CrossRef]
  17. Kulik, J.A.; Fletcher, J.D. Effectiveness of Intelligent Tutoring Systems: A Meta-Analytic Review. Rev. Educ. Res. 2016, 86, 42–78. [Google Scholar] [CrossRef]
  18. Cardoso, J.; Bittencourt, G.; Frigo, L.B.; Pozzebon, E.; Postal, A. Mathtutor: A Multi-Agent Intelligent Tutoring System. In Artificial Intelligence Applications and Innovations; Bramer, M., Devedvic, V., Eds.; Springer: Boston, MA, USA, 2004; Volume 154, pp. 231–242. [Google Scholar]
  19. Giuffra, P.; Cecelia, E.; Silveria, R. A Multi-Agent System Model to Integrate Virtual Learning Environments and Intelligent Tutoring Systems. Int. J. Interact. Multimed. Artif. Intell. 2013, 2, 51–58. [Google Scholar] [CrossRef]
  20. Tweedale, J.; Ichalkaranje, N.; Sioutis, C.; Jarvis, B.; Consoli, A.; Phillips-Wren, G. Innovations in Multi-Agent Systems. J. Netw. Comput. Appl. 2007, 30, 1089–1115. [Google Scholar] [CrossRef]
  21. Dorri, A.; Kanhere, S.S.; Jurdak, R. Multi-Agent Systems: A Survey. IEEE Access 2018, 6, 28573–28593. [Google Scholar] [CrossRef]
  22. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language Models Are Few-Shot Learners. Adv. Neural. Inf. Process Syst. 2020, 33, 1877–1901. [Google Scholar]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U., Von Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  24. Chen, J.; Liu, Z.; Huang, X.; Wu, C.; Liu, Q.; Jiang, G.; Pu, Y.; Lei, Y.; Chen, X.; Wang, X. When Large Language Models Meet Personalization: Perspectives of Challenges and Opportunities. World Wide Web 2024, 27, 42. [Google Scholar] [CrossRef]
  25. Henkel, O.; Hills, L.; Boxer, A.; Roberts, B.; Levonian, Z. Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability To Mark Short Answer Questions in K-12 Education. In Proceedings of the Eleventh ACM Conference on Learning @ Scale, Atlanta, GA, USA, 18–20 July 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 300–304. [Google Scholar]
  26. Nunes, D.; Primi, R.; Pires, R.; Lotufo, R.; Nogueira, R. Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams. arXiv 2023. [Google Scholar] [CrossRef]
  27. Friday, M.; Vaccaro, M.; Zaghi, A. Leveraging Large Language Models for Early Study Optimization in Educational Research. In Proceedings of the 2025 ASEE Annual Conference & Exposition, Montreal, QC, Canada, 22–25 June 2025. [Google Scholar]
  28. Maltese, A.V.; Melki, C.S.; Wiebke, H.L. The Nature of Experiences Responsible for the Generation and Maintenance of Interest in STEM. Sci. Educ. 2014, 98, 937–962. [Google Scholar] [CrossRef]
  29. Mendez, S.L.; Yoo, M.S.; Rury, J.L. A Brief History of Public Education in the United States. In The Wiley Handbook of School Choice; Fox, R.A., Buchanan, N.K., Eds.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2017; pp. 13–27. [Google Scholar]
  30. Bartolomé, A.; Castañeda, L.; Adell, J. Personalisation in Educational Technology: The Absence of Underlying Pedagogies. Int. J. Educ. Technol. High. Educ. 2018, 15, 14. [Google Scholar] [CrossRef]
  31. Hooshyar, D.; Pedaste, M.; Yang, Y.; Malva, L.; Hwang, G.-J.; Wang, M.; Lim, H.; Delev, D. From Gaming to Computational Thinking: An Adaptive Educational Computer Game-Based Learning Approach. J. Educ. Comput. Res. 2021, 59, 383–409. [Google Scholar] [CrossRef]
  32. Huang, X.; Craig, S.D.; Xie, J.; Graesser, A.; Hu, X. Intelligent Tutoring Systems Work as a Math Gap Reducer in 6th Grade After-School Program. Learn Individ. Differ. 2016, 47, 258–265. [Google Scholar] [CrossRef]
  33. Liu, M.; McKelroy, E.; Corliss, S.B.; Carrigan, J. Investigating the Effect of an Adaptive Learning Intervention on Students’ Learning. Educ. Technol. Res. Dev. 2017, 65, 1605–1625. [Google Scholar] [CrossRef]
  34. Price, T.W.; Dong, Y.; Lipovac, D. ISnap: Towards Intelligent Tutoring in Novice Programming Environments. In Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education, Seattle, WA, USA, 8–11 March 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 483–488. [Google Scholar]
  35. Sibley, L.; Lachner, A.; Plicht, C.; Fabian, A.; Backfisch, I.; Scheiter, K.; Bohl, T. Feasibility of Adaptive Teaching with Technology: Which Implementation Conditions Matter? Comput. Educ. 2024, 219, 105108. [Google Scholar] [CrossRef]
  36. Shute, V.J.; Psotka, J. Intelligent Tutoring Systems: Past, Present, and Future. In Handbook of Research for Educational Communications and Technology; Macmillian: New York, NY, USA, 1994; pp. 570–600. [Google Scholar]
  37. Reiser, B.J.; Anderson, J.R.; Farrel, R.G. Dynamic Student Modelling in an Intelligent Tutor for Lisp Programming. In Proceedings of the IJCAI 1985, Los Angeles, CA, USA, 18–23 August 1985. [Google Scholar]
  38. Ford, L. A New Intelligent Tutoring System. Br. J. Educ. Technol. 2008, 39, 311–318. [Google Scholar] [CrossRef]
  39. Heffernan, N.T.; Koedinger, K.R. An Intelligent Tutoring System Incorporating a Model of an Experienced Human Tutor. In International Conference on Intelligent Tutoring Systems; Springer: Berlin/Heidelberg, Germany, 2002; pp. 596–608. [Google Scholar]
  40. Keleş, A.; Ocak, R.; Keleş, A.; Gülcü, A. ZOSMAT: Web-Based Intelligent Tutoring System for Teaching–Learning Process. Expert Syst. Appl. 2009, 36, 1229–1239. [Google Scholar] [CrossRef]
  41. VanLehn, K. The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems. Educ. Psychol. 2011, 46, 197–221. [Google Scholar] [CrossRef]
  42. Xu, Z.; Wijekumar, K.; Ramirez, G.; Hu, X.; Irey, R. The Effectiveness of Intelligent Tutoring Systems on K-12 Students’ Reading Comprehension: A Meta-analysis. Br. J. Educ. Technol. 2019, 50, 3119–3137. [Google Scholar] [CrossRef]
  43. Contrino, M.F.; Reyes-Millán, M.; Vázquez-Villegas, P.; Membrillo-Hernández, J. Using an Adaptive Learning Tool to Improve Student Performance and Satisfaction in Online and Face-to-Face Education for a More Personalized Approach. Smart Learn. Environ. 2024, 11, 6. [Google Scholar] [CrossRef]
  44. McCarthy, K.S.; Watanabe, M.; Dai, J.; McNamara, D.S. Personalized Learning in ISTART: Past Modifications and Future Design. J. Res. Technol. Educ. 2020, 52, 301–321. [Google Scholar] [CrossRef]
  45. Bulger, M. Personalized Learning: The Conversations We’re Not Having. Data Soc. 2016, 22, 1–29. [Google Scholar]
  46. Chen, E.; Lee, J.-E.; Lin, J.; Koedinger, K. GPTutor: Great Personalized Tutor with Large Language Models for Personalized Learning Content Generation. In Proceedings of the Eleventh ACM Conference on Learning @ Scale, Atlanta, GA, USA, 18–20 July 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 539–541. [Google Scholar]
  47. Kärger, P.; Olmedilla, D.; Abel, F.; Herder, E.; Siberski, W. What Do You Prefer? Using Preferences to Enhance Learning Technology. IEEE Trans. Learn. Technol. 2008, 1, 20–33. [Google Scholar] [CrossRef]
  48. Dhingra, S.; Singh, M.; Vaisakh, S.B.; Malviya, N.; Gill, S.S. Mind Meets Machine: Unravelling GPT-4’s Cognitive Psychology. BenchCouncil Trans. Benchmarks Stand. Eval. 2023, 3, 100139. [Google Scholar] [CrossRef]
  49. Lee, D.H.; Chung, C.K. Enhancing Neural Decoding with Large Language Models: A GPT-Based Approach. In Proceedings of the 12th International Winter Conference on Brain-Computer Interface (BCI), Gangwon, Republic of Korea, 26–28 February 2024; IEEE: Piscataway, NJ, USA, 2024. [Google Scholar]
  50. Waisberg, E.; Ong, J.; Masalkhi, M.; Kamran, S.A.; Zaman, N.; Sarker, P.; Lee, A.G.; Tavakkoli, A. GPT-4: A New Era of Artificial Intelligence in Medicine. Ir. J. Med. Sci. 2023, 192, 3197–3200. [Google Scholar] [CrossRef]
  51. OpenAI GPT-4. Available online: https://openai.com/index/gpt-4-research/ (accessed on 27 June 2024).
  52. Bansal, G.; Chamola, V.; Hussain, A.; Guizani, M.; Niyato, D. Transforming Conversations with AI—A Comprehensive Study of ChatGPT. Cognit. Comput. 2024, 16, 2487–2510. [Google Scholar] [CrossRef]
  53. Katz, D.M.; Bommarito, M.J.; Gao, S.; Arredondo, P. GPT-4 Passes the Bar Exam. Philos. Trans. R. Soc. A 2024, 382, 20230254. [Google Scholar] [CrossRef]
  54. Roy, S.; Khatua, A.; Ghoochani, F.; Hadler, U.; Nejdl, W.; Ganguly, N. Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions. In Proceedings of the SIGIR ’24, 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1073–1082. [Google Scholar]
  55. Kortemeyer, G. Toward AI Grading of Student Problem Solutions in Introductory Physics: A Feasibility Study. Phys. Rev. Phys. Educ. Res. 2023, 19, 20163. [Google Scholar] [CrossRef]
  56. Doughty, J.; Wan, Z.; Bompelli, A.; Qayum, J.; Wang, T.; Zhang, J.; Zheng, Y.; Doyle, A.; Sridhar, P.; Agarwal, A.; et al. A Comparative Study of AI-Generated (GPT-4) and Human-Crafted MCQs in Programming Education. In Proceedings of the 26th Australasian Computing Education Conference, Sydney, NSW, Australia, 29 January–2 February 2024; pp. 114–123. [Google Scholar]
  57. Guo, Q.; Zhen, J.; Wu, F.; He, Y.; Qiao, C. Can Students Make STEM Progress With the Large Language Models (LLMs)? An Empirical Study of LLMs Integration Within Middle School Science and Engineering Practice. J. Educ. Comput. Res. 2025, 63, 372–405. [Google Scholar] [CrossRef]
  58. Mohammed, I.A.; Bello, A.; Ayuba, B. Effect of Large Language Models Artificial Intelligence Chatgpt Chatbot on Achievement of Computer Education Students. Educ. Inf. Technol. 2025, 30, 11863–11888. [Google Scholar] [CrossRef]
  59. Mizrahi, G. Understanding Prompting and Prompt Techniques. In Unlocking the Secrets of Prompt Engineering: Master the Art of Creative Language Generation to Accelerate Your Journey from Novice to Pro; Packt Publishing Limited: Birmingham, UK, 2024; ISBN 978-1-83508-383-3. [Google Scholar]
  60. Heston, T.F.; Khun, C. Prompt Engineering in Medical Education. Int. Med. Educ. 2023, 2, 198–205. [Google Scholar] [CrossRef]
  61. Lee, A.V.Y.; Teo, C.L.; Tan, S.C. Prompt Engineering for Knowledge Creation: Using Chain-of-Thought to Support Students’ Improvable Ideas. AI 2024, 5, 1446–1461. [Google Scholar] [CrossRef]
  62. Leinonen, J.; Denny, P.; MacNeil, S.; Sarsa, S.; Bernstein, S.; Kim, J.; Tran, A.; Hellas, A. Comparing Code Explanations Created by Students and Large Language Models. In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1 (ITiCSE 2023), Turku, Finland, 10–12 July 2023; pp. 124–130. [Google Scholar]
  63. Zhang, P.; Tur, G. A Systematic Review of ChatGPT Use in K-12 Education. Eur. J. Educ. 2024, 59, e12599. [Google Scholar] [CrossRef]
  64. Chu, Z.; Wang, S.; Xie, J.; Zhu, T.; Yan, Y.; Ye, J.; Zhong, A.; Hu, X.; Liang, J.; Yu, P.S.; et al. LLM Agents for Education: Advances and Applications. arXiv 2025, arXiv:2503.11733. [Google Scholar]
  65. Jauhiainen, J.S.; Guerra, A.G. Generative AI and ChatGPT in School Children’s Education: Evidence from a School Lesson. Sustainability 2023, 15, 14025. [Google Scholar] [CrossRef]
  66. Wang, T.; Zhan, Y.; Lian, J.; Hu, Z.; Yuan, N.J.; Zhang, Q.; Xie, X.; Xiong, H. LLM-Powered Multi-Agent Framework for Goal-Oriented Learning in Intelligent Tutoring System. In Proceedings of the Companion ACM on Web Conference 2025, Sydney, NSW, Australia, 28 April–2 May 2025; ACM: New York, NY, USA, 2025; pp. 510–519. [Google Scholar]
  67. OpenAI. Prompt Engineering. Available online: https://platform.openai.com/docs/guides/prompt-engineering (accessed on 31 May 2024).
  68. Lee, U.; Jung, H.; Jeon, Y.; Sohn, Y.; Hwang, W.; Moon, J.; Kim, H. Few-Shot Is Enough: Exploring ChatGPT Prompt Engineering Method for Automatic Question Generation in English Education. Educ. Inf. Technol. 2024, 29, 11483–11515. [Google Scholar] [CrossRef]
  69. Felder, R.M.; Silverman, L.K. Learning and Teaching Styles in Engineering Education. Eng. Educ. 1988, 78, 674–681. [Google Scholar]
  70. Kirschner, P.A. Stop Propagating the Learning Styles Myth. Comput. Educ. 2017, 106, 166–171. [Google Scholar] [CrossRef]
  71. NGSS Lead States. Next Generation Science Standards: For States, By States; National Academies Press: Washington, DC, USA, 2013. [Google Scholar]
  72. Vaccaro, M.; Friday, M.; Zaghi, A. LLMs and Personalized Learning. Available online: https://github.com/m-vaccaro/LLMs-and-Personalized-Learning (accessed on 18 March 2025).
  73. Wikipedia Contributors. Electricity. Available online: https://en.wikipedia.org/w/index.php?title=Electricity&oldid=1194309762 (accessed on 11 January 2024).
  74. Schimansky, T. CustomTkinter. Available online: https://customtkinter.tomschimansky.com/ (accessed on 5 June 2024).
  75. Choi, D.; Lee, S.; Kim, S.-I.; Lee, K.; Yoo, H.J.; Lee, S.; Hong, H. Unlock Life with a Chat(GPT): Integrating Conversational AI with Large Language Models into Everyday Lives of Autistic Individuals. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; Floyd Mueller, F., Kyburz, P., Williamson, J.R., Sas, C., Wilson, M.L., Toups Dugas, P., Shklovski, I., Eds.; Association for Computing Machinery: New York, NY, USA, 2024; p. 72. [Google Scholar]
  76. Rogers, M.P.; Hillberg, H.M.; Groves, C.L. Attitudes Towards the Use (and Misuse) of ChatGPT: A Preliminary Study. In Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1, Portland, OR, USA, 20–23 March 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1147–1153. [Google Scholar]
  77. Meyer, J.; Jansen, T.; Schiller, R.; Liebenow, L.W.; Steinbach, M.; Horbach, A.; Fleckenstein, J. Using LLMs to Bring Evidence-Based Feedback into the Classroom: AI-Generated Feedback Increases Secondary Students’ Text Revision, Motivation, and Positive Emotions. Comput. Educ. Artif. Intell. 2024, 6, 100199. [Google Scholar] [CrossRef]
  78. Chin, C. Student-Generated Questions: What They Tell Us about Students’ Thinking. In Proceedings of the Annual Meeting of the American Educational Research Association, Seattle, WA, USA, 10–14 April 2001. [Google Scholar]
  79. Elyoseph, Z.; Hadar-Shoval, D.; Asraf, K.; Lvovsky, M. ChatGPT Outperforms Humans in Emotional Awareness Evaluations. Front. Psychol. 2023, 14, 1199058. [Google Scholar] [CrossRef]
  80. Addy, T.; Kang, T.; Laquintano, T.; Dietrich, V. Who Benefits and Who Is Excluded?: Transformative Learning, Equity, and Generative Artificial Intelligence. J. Transform. Learn. 2023, 10, 92–103. [Google Scholar]
  81. Chrysochoou, M.; Zaghi, A.E.; Syharat, C.M. Reframing Neurodiversity in Engineering Education. Front. Educ. 2022, 7, 995865. [Google Scholar] [CrossRef]
  82. Maya, J.; Luesia, J.F.; Pérez-Padilla, J. The Relationship between Learning Styles and Academic Performance: Consistency among Multiple Assessment Methods in Psychology and Education Students. Sustainability 2021, 13, 3341. [Google Scholar] [CrossRef]
  83. Kasneci, E.; Seßler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E. ChatGPT for Good? On Opportunities and Challenges of Large Language Models for Education. Learn. Individ. Differ. 2023, 103, 102274. [Google Scholar] [CrossRef]
  84. Kop, R.; Fournier, H.; Durand, G. A Critical Perspective on Learning Analytics and Educational Data Mining. In Handbook of Learning Analytics; Lang, C., Siemens, G., Wise, A., Gašević, D., Eds.; Society for Learning Analytics Research (SoLAR): Beaumont, AB, Canada, 2017; Volume 319, pp. 319–326. [Google Scholar]
  85. Yağcı, M. Educational Data Mining: Prediction of Students’ Academic Performance Using Machine Learning Algorithms. Smart Learn. Environ. 2022, 9, 11. [Google Scholar] [CrossRef]
  86. Wu, X.; Duan, R.; Ni, J. Unveiling Security, Privacy, and Ethical Concerns of ChatGPT. J. Inf. Intell. 2024, 2, 102–115. [Google Scholar] [CrossRef]
  87. Hu, T.; Kyrychenko, Y.; Rathje, S.; Collier, N.; van der Linden, S.; Roozenbeek, J. Generative Language Models Exhibit Social Identity Biases. Nat. Comput. Sci. 2024, 5, 65–75. [Google Scholar] [CrossRef] [PubMed]
  88. Navigli, R.; Conia, S.; Ross, B. Biases in Large Language Models: Origins, Inventory, and Discussion. J. Data Inf. Qual. 2023, 15, 10. [Google Scholar] [CrossRef]
  89. UNESCO. Recommendation on the Ethics of Artificial Intelligence; UNESCO: Paris, France, 2022. [Google Scholar]
  90. Potgieter, I. Privacy Concerns in Educational Data Mining and Learning Analytics. Int. Rev. Inf. Ethics 2020, 28. [Google Scholar] [CrossRef]
  91. OpenAI. Creating Video from Text. Available online: https://openai.com/index/sora/ (accessed on 18 June 2024).
Figure 1. Three components of the study design used to generate and evaluate personalized texts: (1) training session, (2) test session, and (3) profile evaluation. The process shown corresponds to the experimental group. The study components are the same for the control group, except that the GPT-4 Rewrite agent takes the opposite profile as input in place of the true profile. In (1), “A” and “B” represent two options, one of which can be chosen.
Figure 2. Major components of the Profiler and Rewrite agents’ system and user messages. System messages are structured by defining the agent’s role, assigning the task, and defining and restricting the output. Inputs are entered into the user message.
Figure 3. Screenshot of graphical user interface (GUI). One paragraph pair used in the training session is shown (Topic: water cycle).
Figure 4. Profile evaluation component of the study. The true and opposite profiles generated for one participant by GPT-4 are shown.
Table 1. Sample true and opposite student profiles generated by the Profiler agent, along with a sample original text and its personalized version based on the true profile.
True Profile: You learn best when information is given to you in a clear and organized way. You prefer details and facts, and like steps and sequences in what you’re learning. You also appreciate learning with pictures and diagrams. When it comes to working with new information, you often like to work with others and talk about the ideas.
Opposite Profile: You like to see the big picture and understand how ideas are connected in a broad sense. You enjoy when learning includes examples and stories that make a topic real and lively. When reading and writing, you’re good at taking in lots of information and like different viewpoints. It’s also fun for you to work in groups where you can share ideas and talk about what you’re learning.
Original Text (source: [73]): The movement of electric charge is known as an electric current, the intensity of which is usually measured in amperes. Electric current can flow through some things, electrical conductors, but will not flow through an electrical insulator. By historical convention, a positive current is defined as having the same direction of flow as any positive charge it contains, or to flow from the most positive part of a circuit to the most negative part. Current defined in this manner is called conventional current.
Personalized Text: Electric current is like a stream of tiny charged particles moving through a wire, and we measure its strength in units called amperes. It can only travel through materials that let it pass, called conductors. It can’t move through insulators, which block the flow. We say that electric current flows from the positive side to the negative side of a circuit, even though it’s really the electrons moving the other way. This kind of current is called conventional current because that’s the way scientists have agreed to talk about it.
Table 2. Demographics of the experimental (n = 12) and control (n = 11) groups (Total n = 23).
Group | Grade (7 / 8) | Age (11 / 12 / 13) | Gender 1 (M / F / NB) | Ethnicity 1 (White / Asian / Other 2)
Experimental | 5 / 7 | 0 / 4 / 8 | 7 / 4 / 1 | 8 / 3 / 1
Control | 4 / 7 | 1 / 3 / 7 | 7 / 4 / 0 | 6 / 3 / 2
Total | 9 / 14 | 1 / 7 / 15 | 14 / 8 / 1 | 14 / 6 / 3
1 Gender and ethnicity were reported by a parent or legal guardian. M = male; F = female; NB = non-binary. 2 Other ethnicity includes “American Israeli” (Count: 1, Group: experimental), “Asian/White” (Count: 1, Group: control), and “Nepali” (Count: 1, Group: control).
Table 3. Results of the test session comparing the number of personalized paragraphs selected by the experimental and control groups.
Study Group | No. Selected Personalized Paragraphs (0 / 1 / 2) | Mean | Std. Dev.
Experimental | 1 / 7 / 4 | 1.25 | 0.622
Control | 5 / 4 / 2 | 0.73 | 0.786
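As a sanity check on Table 3, the reported means and standard deviations can be reproduced from the selection counts alone. The sketch below is not part of the study’s released code; it simply expands each group’s count distribution into per-student scores and recomputes the summary statistics with Python’s standard library:

```python
from statistics import mean, stdev

# Counts of personalized paragraphs selected (0, 1, or 2) per
# participant, as reported in Table 3.
counts = {
    "Experimental": {0: 1, 1: 7, 2: 4},  # n = 12
    "Control":      {0: 5, 1: 4, 2: 2},  # n = 11
}

for group, dist in counts.items():
    # Expand the count distribution into raw per-student scores,
    # e.g. {0: 1, 1: 7, 2: 4} -> [0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2].
    scores = [value for value, n in dist.items() for _ in range(n)]
    print(f"{group}: mean = {mean(scores):.2f}, sd = {stdev(scores):.3f}")
```

Running this reproduces the tabulated values (mean 1.25, SD 0.622 for the experimental group; mean 0.73, SD 0.786 for the control group), which also confirms that Table 3 reports the sample (n − 1) standard deviation.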
Table 4. Summary of data collected in profile evaluation phase of study.
Table 4. Summary of data collected in profile evaluation phase of study.
GroupSelected Profile 1
True ProfileOpposite Profile
Experimental (n = 12)9 (0.75)3 (0.25)
Control (n = 11)5 (0.455)6 (0.545)
1 Counts are shown. Values in parentheses indicate proportions.
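The proportions in Table 4 follow directly from the counts. A minimal sketch (hypothetical, not from the study’s released code) that recomputes the true-profile selection rate for each group:

```python
# Profile-evaluation results from Table 4: how often each group
# selected its true profile over the opposite profile.
results = {
    "Experimental": {"true": 9, "opposite": 3},
    "Control":      {"true": 5, "opposite": 6},
}

for group, picks in results.items():
    n = picks["true"] + picks["opposite"]  # group size
    rate = picks["true"] / n               # proportion choosing true profile
    print(f"{group} (n = {n}): true-profile rate = {rate:.3f}")
```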
Table 5. True and opposite student profiles generated by the Profiler agent.
True Profile: You seem to like learning in a way that connects ideas together, showing how things work in the big picture. You enjoy explanations that make you feel involved in a story, bringing science to life in a way that is less about lists and more about the flow of ideas. You also appreciate clear, straightforward facts that show how things work step by step at times.
Opposite Profile: You seem to learn best when facts and steps are laid out clearly for you. You like to see information presented in a sequence or process that you can follow from beginning to end. You often prefer descriptions that are detailed, explaining the how and why of things in a straightforward way. However, sometimes you also appreciate a clear demonstration of concepts, where you can see how things change or behave under different conditions.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.