Exploring the Educational Applications of Large Language Models: A Systematic Review and Topic Analysis

Cibu, Bianca-Raluca; Crăciun, Liliana; Molănescu, Anca Gabriela; Cotfas, Liviu-Adrian

doi:10.3390/electronics14234683

Open Access | Editor’s Choice | Systematic Review

by Bianca-Raluca Cibu ¹, Liliana Crăciun ², Anca Gabriela Molănescu ² and Liviu-Adrian Cotfas ¹,*

¹ Department of Economic Informatics and Cybernetics, Bucharest University of Economic Studies, 0105552 Bucharest, Romania

² Department of Economics and Economic Policies, Bucharest University of Economic Studies, 0105552 Bucharest, Romania

* Author to whom correspondence should be addressed.

Electronics 2025, 14(23), 4683; https://doi.org/10.3390/electronics14234683

Submission received: 18 October 2025 / Revised: 15 November 2025 / Accepted: 25 November 2025 / Published: 27 November 2025

(This article belongs to the Section Artificial Intelligence)

Abstract

In light of the accelerated growth of artificial intelligence (AI), large language models (LLMs) have become a central topic of interest in scientific research and practical applications across various fields. The present paper aims to perform a comprehensive systematic review of the scientific literature on LLMs in education published between 2023 and 2024, based on a dataset from the Web of Science, which includes 507 documents from 322 sources. The accelerated dynamics of research in this field are confirmed by the high annual growth rate of 369.66%. The study identifies the themes presented in the scientific literature by using thematic maps and analyzing the evolution of said thematic maps. In addition, Latent Dirichlet Allocation (LDA) and BERTopic are used to outline the research field more clearly. Due to LDA’s ability to discover high-level research topics using probabilistic discovery and BERTopic’s ability to capture deeper semantic patterns and the emergence of various topics by searching, this paper first identifies the main research topics in the extracted dataset, which are then discussed in the paper through a review of applications. As a result, a range of applications are discovered in areas related to teaching and learning, academic assessment, integrity, academic feedback, medical education, ethics, bias, regulation, and social challenges. The conclusions provide a roadmap for researchers, practitioners and stakeholders in highlighting the current situation of LLMs in educational practice, while opening the door for future explorations in this domain.

Keywords: artificial intelligence; large language models; systematic review; LDA; BERTopic

1. Introduction

A few years ago, when discussing new technologies, attention was focused on techniques for using the internet and computers, which were considered particularly significant in terms of societal development [1]. Although in the past artificial intelligence (AI) was not considered a useful tool in various discussions about technology, this perspective has now changed. This shift in perspective is not just a meaningful technological change, but also has an influence on the dynamics in most contexts [2]. AI’s ability to create customized learning environments tailored to each student’s needs is one of its greatest strengths, increasing positive outcomes and engagement [3]. In addition, natural language processing systems (various virtual assistants or chatbots) have the ability to provide immediate assistance to students by helping them with career development advice, mental health resources, or academic tasks [4]. AI also supports scientific research by enabling the analysis of large quantities of data in a short amount of time. Automated learning algorithms help teachers recognize patterns and gain new insights that were previously difficult to access, accelerating the advancement of knowledge in fields such as engineering, social sciences and medicine. Predictive analytics also provides educational institutions with tools to improve academic performance and formulate data driven interventions [5,6].

Thanks to the benefits offered by AI and machine learning (ML), many higher education institutions have come to recognize their essential role in both current and future education. These technologies have the potential to offer students an advanced and interactive learning experience, and 65% of US universities are already using AI- and ML-assisted learning solutions. At the same time, these systems actively support teachers, improving the teaching process [7]. AI use in education in the US was also projected to grow substantially, by 47.5% between 2017 and 2021 [8]. According to a study conducted on a sample of 1045 teenagers by Common Sense Media in 2024, 7 out of 10 teenagers use generative AI specifically for homework, with a teacher’s approval [9,10]. Using AI has improved students’ performance and helped create a more inclusive environment [11,12], facilitated the supervision of online assessment [13] and offered benefits for students with learning disabilities [14].

LLMs are massive models with transformative capabilities trained on large datasets [15]. ChatGPT is an LLM launched by OpenAI using Generative Pre-trained Transformer (GPT) models [16]. The GPT-3 model alone reached 175 billion parameters by 2021 [17]. Meta’s LLaMA was trained on 1.4 trillion tokens [18], while the Bidirectional Encoder Representations from Transformers (BERT) model, which was trained on a dataset of 3.3 billion words, has around 110 million parameters [19]. Despite their considerable success, these models face various challenges, such as the possibility of absorbing errors present in the training dataset or the constant need for massive computational resources. Given their widespread use, it is very important to ensure that they are used in a responsible and ethical manner [16]. It is also important to note that LLMs are a subcategory of AI, more specifically of ML-based AI and, in particular, deep learning, designed to work with human language.

Due to the growing interest in AI, empirical studies have been conducted on its involvement in education, more specifically in higher education: one study analyzed whether AI can revolutionize the higher education environment [20], while another used structural equation modeling [21]. Researchers have also begun to analyze the challenges and opportunities that AI integration brings to higher education [7]. Song and Wang [22] performed a bibliometric analysis covering the 2000–2019 period, using a database of Scopus-indexed papers, and observed a rising appetite for AI in education. Li and Wang [23] observed an increase in personalized learning in a bibliometric analysis covering the period 2000–2022, using papers from both the Web of Science (WoS) and Scopus databases. Sekwatlakwatla and Malele [24] analyzed, through a bibliometric approach, how LLMs are evaluated. Their paper highlights that LLMs can provide incorrect but convincing answers. Furthermore, it explores the relevant scientific literature and highlights the main techniques and research directions used in evaluating LLMs. Among the key areas identified are natural language processing (NLP), information classification, and computational linguistics.

Although the works mentioned above have highlighted the interest in LLM research in education, a systematic analysis for the period 2023–2024 complements the bibliometric approaches they describe. This field is changing fast, and new publications can reveal major changes in research trends, methods, or the ways LLMs can be used. In addition, because the period analyzed is short and recent, the analysis can focus on the most current works, providing a clear, accurate and well-defined snapshot of the present moment in research.

Given the extent of AI use in recent years and, implicitly, the use of LLMs in all fields of activity, especially in education, it was deemed necessary to conduct a study of how the academic community’s interest in this topic has materialized. The choice of a review enhanced with topic modeling is justified by several key arguments related to both the depth of the method and the relevance of the topic addressed. Firstly, this approach offers a systematic examination of academic output in the field through extensive and established databases, such as WoS. Additionally, topic analysis highlights the main research directions of LLMs in education, supporting the review of the works in the field.

To complement the main objective, this paper aims to answer the following questions:

SQ1: What are the current publication trends, collaboration patterns, and disciplinary intersections in research on large language models (LLMs) in education during 2023–2024?
SQ2: How do technological, pedagogical, and ethical dimensions interact within the main research themes related to the use of LLMs in education?
SQ3: What emerging paradigms and methodological directions (e.g., agentic AI, multimodal LLMs, and RLHF-based feedback systems) can inform a future roadmap for responsible and effective integration of LLMs in education?

The paper consists of five sections. Section 1 (“Introduction”) gives a historical review of the rapid development of LLMs in AI and sets out the research questions. Section 2 (“Method of Review”) describes the methodologies used for collecting, cleaning, and reviewing the extracted dataset; it details the criteria for selecting publications, the reference period, the keywords, and the tools used, and justifies the choice of the WoS database. Section 3 (“Results”) is the most complex and is divided into several subsections. “Dataset Description” provides an overview of the dataset: the number of papers, annual growth rate, sources, average number of citations per article, average age of documents, number of references, etc. “Top-10 Most Cited Articles Review” analyzes the most cited papers, i.e., those that have most influenced the field, as well as the main topics they cover. “Thematic maps” presents a comprehensive analysis of the most commonly used keywords and author keywords. “Theme analysis” uses Latent Dirichlet Allocation (LDA) and BERTopic to identify the themes of the research papers in the dataset. “Systematic review based on identified themes” analyzes the directions indicated by the thematic maps and theme analysis. Section 4 (“Discussions and limitations”) interprets the results, relates them to the larger context of LLM research, and points out the limitations of the study. Section 5 (“Conclusions”) summarizes the main findings, highlighting the dominant trends in LLM research in line with the proposed aims.

2. Method of Review

2.1. Choosing the Database

Firstly, attention was focused on identifying the most suitable database from which to extract relevant articles for further analysis. By considering a series of research papers from various fields, such as papers in life and environmental sciences [25,26], computer science [27], medical sciences [28], information technology [29], intelligent transportation systems [30], or industrial revolution [31], it was found that the best choice for the database used for extracting the papers in the field of LLMs in education would be the Clarivate Analytics WoS database.

Previous bibliometric analyses have shown that the WoS database has the capacity to provide rigorous and selective coverage of research in the field of education and educational technology. Hinojo-Lucena et al. [32] conducted a study showing that scientific output related to artificial intelligence in higher education indexed in WoS is still in its infancy but is on an upward trend, characterized by slow and steady growth. In turn, Urzua et al. [33] investigated the impact of generative chats as a feedback tool in the academic writing process for higher education students, using, among other things, the WoS database. To capture relevant publications in the field of education, Osorio Vanegas et al. [34] also used the WoS database to synthesize, through a systematic review, the competencies, skills, training models, and methods used in teacher training in the field of educational technology.

Over time, WoS has been considered one of the most comprehensive and well-known databases of scientific literature; it is very diverse and covers some of the most important results of research studies conducted worldwide [35]. Although the Scopus database contains a larger number of publications, the literature emphasizes that the academic influence of journals indexed in WoS is generally greater [36]. Another argument for choosing WoS is the Keywords Plus feature, which provides additional terms automatically extracted from the titles of cited references, thus facilitating a broader and more accurate thematic exploration of the field studied [37,38,39,40,41]. This feature was useful for identifying significant works and establishing the background for the study.

With regard to the process of documenting and extracting relevant scientific publications, a detailed analysis was carried out of how to access and use the WoS database provided by Clarivate Analytics. According to the guidelines provided by Liu [42,43], it is essential for researchers to follow clear steps in the data extraction process, taking into account the type of subscription they have. Access to WoS is subscription-based, which can significantly influence the number and type of articles available for analysis [44,45,46,47,48,49]. Some subscriptions provide access to only certain indexes, while others cover a wider range of indexes and publications. As Liu [42] points out, it is important for authors to state explicitly which indexes were available through their subscription, as this affects the size and relevance of the analyzed dataset. In this case, the accessible indexes are presented in the following list:

Book Citation Index—Science (BKCI-S)—2010–present;
Book Citation Index—Social Sciences and Humanities (BKCI-SSH)—2010–present;
Index Chemicus (IC)—2010–present;
Current Chemical Reactions (CCR-Expanded)—2010–present;
Emerging Sources Citation Index (ESCI)—2005–present;
Arts and Humanities Citation Index (A&HCI)—1975–present;
Social Sciences Citation Index (SSCI)—1975–present;
Conference Proceedings Citation Index—Social Sciences and Humanities (CPCI-SSH)—1990–present;
Conference Proceedings Citation Index—Science (CPCI-S)—1990–present;
Science Citation Index Expanded (SCIE)—1900–present.

As observed in this study, all publications indexed in the WoS core collections were included, without any restriction to Q1 journals. This decision was made in order to obtain a complete and representative picture of the scientific field related to the use of LLMs in education. Given that this field of research is new and rapidly expanding, many relevant contributions initially appear in Q2–Q4 journals or in journals indexed in ESCI, which, although not classified in Q1, play a significant role in shaping research directions [50,51]. Thus, a comprehensive selection of all publications indexed in WoS allows for a more complex analysis of the dynamics and diversity of the field, reflecting all relevant works and authors at the beginning of the formation of a research community.

2.2. Application of the PRISMA 2020 Guideline

To ensure transparency and rigor in the process of selecting the studies included in this analysis, the “Preferred Reporting Items for Systematic Reviews and Meta-Analyses” (PRISMA) 2020 guidance was applied. The paper complies with the PRISMA guidelines and a protocol registration has been made available at the following link: https://www.protocols.io/private/A5D41821AE4811F087950A58A9FEAC02 (accessed on 15 November 2025). Additional information regarding the PRISMA flow diagram can be found in Supplementary Materials.

The PRISMA diagram (Figure 1) summarizes the steps taken in the process of searching, filtering, and including articles, highlighting the number of studies initially identified, those eliminated following the evaluation of eligibility criteria, and those included in the final analysis.

The PRISMA 2020 guideline [52] helps ensure the transparency and scientific rigor of the reporting process in a systematic review. Complete reporting of all elements facilitates replication of the study, subsequent updating of the review, and its integration into meta-analyses or good practice guidelines, thereby reducing waste of resources and research efforts [51,52,53]. Thus, in this paper, the PRISMA 2020 guideline was applied in stages. In the “Identification” stage, relevant publications were selected from WoS based on predefined keywords; in the “Screening” stage, duplicates and irrelevant articles were removed; in the “Eligibility” stage, the full texts were analyzed according to the inclusion and exclusion criteria; and in the final stage (“Included”), only the studies relevant to the analysis were retained.

To determine the dataset suitable for this review on the use of LLM in education, the initial step was to identify specific terms in the title (TI), abstract (AB), or authors’ keywords (AKs). Using the initial group of words “large_language_model”, a total of 19,830 papers were identified. Subsequently, searching for “education” yielded a much higher number of papers, as this is a much broader and more complex field, with a total of 1,853,954 papers. By combining the two terms, a total of 2227 articles were found in the database, according to the “Identification” stage in the PRISMA diagram. It is important to note that the use of the singular form of the terms, with an asterisk (“*”) at the end, was employed. This is a common technique in bibliographic searches that allows for the automated retrieval of all the derived forms, including the plural or any particular endings. Furthermore, to increase the accuracy of searches and avoid extracting irrelevant articles containing only isolated terms, underscores (“_”) were used between words, forcing the system to search for exact expressions in which the terms appear together in the same order. Thus, the phrase “large_language_model” allowed the identification of articles containing the complete expression “large language model” in the title, abstract, or authors’ keywords, and not just separate terms such as “large”, “language” or “model”.
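
The matching behaviour described above (an exact phrase whose last word may take any ending, combined with a second term) can be illustrated locally with a short Python sketch. This is only an approximation built on regular expressions; it does not reproduce the WoS search engine or its query syntax, and the helper names and sample titles are hypothetical.

```python
import re


def phrase_with_wildcard(phrase: str) -> re.Pattern:
    """Compile a pattern for an exact phrase whose last word may carry any suffix."""
    words = phrase.split()
    # Words must be adjacent and in order; "\w*" plays the role of the trailing "*".
    body = r"\s+".join(re.escape(word) for word in words) + r"\w*"
    return re.compile(r"\b" + body, flags=re.IGNORECASE)


# The two search terms named in the text, each with a trailing wildcard.
LLM_TERM = phrase_with_wildcard("large language model")
EDU_TERM = phrase_with_wildcard("education")


def matches_both(text: str) -> bool:
    """Return True when the text contains both terms (the combined search)."""
    return bool(LLM_TERM.search(text)) and bool(EDU_TERM.search(text))


# Hypothetical titles, used only to show which records would be retained.
titles = [
    "Large language models in medical education",
    "A large model of language for educational games",
    "Large language models for code generation",
]
retained = [title for title in titles if matches_both(title)]
```

Only the first title is retained: the second contains the separate words but not the full expression, and the third lacks the education term.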

During the “Screening” stage, three groups of works were eliminated: 16 that had not been written in English, 934 that had not been marked as articles, and 436 published in 2025, which were considered too recent to offer stable perspectives and could distort the interpretation of trends. This step resulted in a total of 841 papers. The “Eligibility” stage played a very important role in creating the dataset. The field chosen for this analysis is very complex and required a manual selection of the remaining papers, which led to the elimination of 334 papers that were not in line with the set objectives. In the final stage (“Included”), a total of 507 articles remained, of which 10 will be described in detail in this paper.
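
The counts reported for the PRISMA stages can be checked with simple arithmetic. The following Python sketch uses only the figures given in the text; the variable names are illustrative.

```python
# Records found by combining the two search terms ("Identification").
identified = 2227

# Exclusions applied after identification, as reported in the text.
screening_exclusions = {
    "not written in English": 16,
    "not marked as article": 934,
    "published in 2025": 436,
}
after_screening = identified - sum(screening_exclusions.values())

# Papers removed by manual selection in the "Eligibility" stage.
eligibility_exclusions = 334
included = after_screening - eligibility_exclusions

print(f"After screening: {after_screening}")  # 841
print(f"Included: {included}")                # 507
```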

Furthermore, the simultaneous inclusion of multiple databases could have led to issues of deduplication and terminological inconsistency, as the same works may be indexed differently depending on the publisher and field, an issue also highlighted by Mongeon & Paul-Hus [50] and subsequently confirmed by Harzing & Alakangas [54] and Martín-Martín et al. [55]. For this paper, the main objective was to obtain a homogeneous, high-quality corpus that reflects relevant works of scientifically validated research on LLM in education. The study did not include preprints or conference papers, as these do not undergo the full peer-review process and may therefore introduce significant methodological variations or incomplete citation data [56]. The choice of the period 2023–2024 was based on considerations of data stability and indexing completeness, as many works from 2025 are still in the process of being registered bibliometrically [57].

To justify the selection steps used in the dataset extraction, a series of papers have been considered. Cerqueira et al. [58] investigated how the concept of trust in LLMs is defined, discussed, and put into practice in academic research. The study combines a bibliometric analysis of a sample of 2006 publications from 2019 to 2025 using the Bibliometrix package with an in-depth manual review of 68 articles. Like the present paper, that study extracted its dataset from WoS, used the expression “Large Language Model*”, selected only conference papers and peer-reviewed journal articles, and included only articles in English. In another bibliometric analysis of LLMs, conducted for the period 2017–2023, papers were selected from the WoS Core Collection using the filters “large language model” and “large language models” at the topic level [59]. Furthermore, in a bibliometric analysis of the integration of LLMs in the medical sector, Gencer and Gencer [60] kept only English-language works in their dataset extracted from WoS, restricted to published articles or reviews.

2.3. Steps Followed in the Analysis

A series of steps have been considered for the systematic review, as highlighted in Figure 2.

Step 1 focused on the general characterization of the dataset extracted, providing an overall picture of the volume and diversity of the publications included. The aim was to highlight the high level of interest in LLMs in education within a short period of time, given that the papers analyzed are from a recent and short period. The emphasis was placed on the diversity of publication sources, the presentation of document types and the degree of collaboration between researchers, including their international dimension. At the same time, elements related to scientific impact were analyzed, such as the visibility of the papers through citations and the degree of maturity of the field reflected by the relatively young age of the articles. Another aspect that was examined was thematic diversity, highlighted by the high number of keywords used by authors and by Keywords Plus.

Step 2 focused on the most influential works in the dataset, i.e., articles that accumulated a higher number of citations in a short time. The aim was to identify the research directions considered essential by researchers and to understand how they have helped shape and develop the field. The analysis included not only the number of citations, but also the number of authors, the objectives of the studies, and the types of data used.

This approach aimed to highlight the topics addressed in the literature. Another objective was to capture the diversity of approaches, from theoretical studies to experiments applied in educational or medical contexts.

Step 3 highlights the main thematic directions in the analyzed literature. This subchapter aims to understand how different concepts, ideas, or themes constantly recur in works dedicated to LLM in education. Using keywords and expressions extracted from titles and abstracts, thematic maps and conceptual groups were constructed, providing an illustration of the connections between terms and identifying major areas of interest (the technological foundations of LLMs, their impact on performance and the learning process, the quality of AI-generated learning content, and ethical issues and responsibility in their use). The analysis sought to capture the balance between well-developed areas and those still in their infancy.

Step 4 consisted of two complementary analysis methods: LDA and BERTopic. Through LDA, attention was focused on identifying dominant topics that capture how LLMs are integrated into education. BERTopic was used to create different clusters, which include more detailed and applied discussions, such as the general uses of ChatGPT and AI in education, as well as very specific areas.
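
To make the idea behind LDA more concrete, the following is a minimal sketch of topic inference by collapsed Gibbs sampling, written in plain Python. It is not the pipeline used in this paper, whose implementation details are not given here: the function name, the hyperparameters `alpha` and `beta`, the number of iterations, and the toy documents are all assumptions chosen for illustration. BERTopic is not sketched, since it relies on pretrained embedding models.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling (illustrative)."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens of word w in topic k
    topic_total = [0] * n_topics                           # tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(n_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]
                # Remove the token, then resample its topic given all other tokens.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed document-topic proportions and the three most frequent words per topic.
    theta = [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in row]
        for row, doc in zip(doc_topic, docs)
    ]
    top_words = [
        [vocab[i] for i in sorted(range(n_words), key=lambda i: -row[i])[:3]]
        for row in topic_word
    ]
    return theta, top_words


# Hypothetical tokenized abstracts, used only to exercise the function.
toy_docs = [
    ["chatgpt", "essay", "feedback", "student"],
    ["student", "essay", "writing", "feedback"],
    ["patient", "clinical", "exam", "medical"],
    ["medical", "exam", "licensing", "clinical"],
]
theta, top_words = lda_gibbs(toy_docs, n_topics=2)
```

Each row of `theta` gives the estimated share of each topic in one document, and `top_words` lists the most frequent words assigned to each topic. In practice, a tested library implementation would normally be preferred over a hand-written sampler.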

Step 5 brings together the results of thematic analyses and topic discovery in order to provide an overview of the major research directions in the domain. This allowed the field to be structured into four main areas: “Technological Foundations of LLMs in Education,” “Impact on Students, Performance, and Learning Outcomes,” “Content Quality, Readability, and Literacy in AI Outputs,” and “Ethics, Personalization, and Responsible Use of LLMs.”

2.4. Rationale for Selecting the 10 Most Frequently Cited Articles

A detailed examination of the 10 most cited articles was added to complement the broad analysis of the 507 papers included in the dataset. These articles were selected in order to obtain more detailed insights into the field under analysis.

It should be noted that the identification of the most frequently cited articles was based on indicators such as the total number of citations (TCs), the number of citations per year (TCY), and the normalized number of citations (NTCs), which corrected for differences in publication year and allowed for comparison of influence between papers from different years. These highly cited articles were treated as works that significantly shaped the evolution of the field through the ideas, theoretical frameworks, or methodologies proposed.

In addition, the selection of these articles was guided by a number of further criteria. The use or implications of LLMs in educational contexts were taken into account; both conceptual works and empirical studies, interdisciplinary analyses, and applied research were included; the selected articles come from areas such as medical education, university education, academic ethics, and educational policies, covering perspectives from multiple regions and disciplines; and each article was correlated with the main themes and topics resulting from the automatic LDA and BERTopic analyses applied to the entire corpus.

Some of the graphs and analyses discussed below were created using RStudio, chosen for the range of graphical and statistical tools it offers [61]. The version of RStudio used was 4.2.1, and the version of the Biblioshiny library was 4.2.3. The remaining figures, specifically those for the topic analysis through LDA and BERTopic, were generated using Python 3.12 [62].

3. Results

This section discusses the results obtained, following the steps described in Figure 2.

3.1. Dataset Description

The main information related to the extracted dataset is summarized below. The average age indicator for documents represents the difference between the current year and the year of each document’s publication, and then the average is calculated. Its value of 1.18 is due to the fact that most publications are very recent. The mean number of citations received by each article in the dataset (the mean number of citations per document) is 15.03, which indicates a rapid scientific impact, with the field quickly attracting the attention of the academic community.

As shown in Table 1, given the novelty of the phenomenon analyzed (LLMs in education), the period covered is, as expected, quite limited: the selected works were published within the last two years (2023–2024). This reflects the emerging nature of the research, with a total of 507 published documents.

Although the time period is short, the total number of articles (507) is quite high for a two-year window. This indicates a rapid expansion of the field, which is also supported by the high annual growth rate (369.66%). In terms of document type, the total includes 2 book chapters and 49 early access papers. In addition, the 322 sources point to a wide dissemination of research across various fields, while the 19,016 references indicate a solid theoretical basis.

Regarding the authors of articles in the dataset (Table 2), a total of 52 articles were written by a single author, representing approximately 10.26% of all articles. On average, there were four authors per paper, suggesting a preference for collaboration in conducting research. The global character of this field can be deduced from the international co-authorship indicator (26.04%), which allows for educational comparisons between different countries. The number of authors’ keywords (1451) reflects the thematic diversity of the field.

Looking more closely at the annual scientific output, most papers were published in 2024 (418 articles), compared with only 89 in 2023. This reflects the ever-increasing expansion of AI, of which LLMs are a sub-branch. A similar pattern appears in other fields: Wu et al. [63] conducted a bibliometric analysis to observe, systematically and comprehensively, how AI contributes to technological innovation in green buildings. As in the present study, by tracking scientific output, the authors identified substantial growth in the number of articles during 2014–2015.
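
Several of the reported dataset indicators can be reproduced from the yearly counts. The Python sketch below rests on two assumptions that are not stated in the text: that the annual growth rate is a compound rate between the first and last year, and that document age is measured against 2025 as the reference year. Under these assumptions, the results agree with the reported values.

```python
# Yearly paper counts reported in the text.
papers_per_year = {2023: 89, 2024: 418}
REFERENCE_YEAR = 2025  # assumption: the year in which the indicators were computed


def annual_growth_rate(counts: dict[int, int]) -> float:
    """Compound annual growth rate (in %) between the first and last year."""
    years = sorted(counts)
    span = years[-1] - years[0]
    ratio = counts[years[-1]] / counts[years[0]]
    return (ratio ** (1 / span) - 1) * 100


def average_age(counts: dict[int, int], reference_year: int) -> float:
    """Mean of (reference year - publication year) over all documents."""
    total = sum(counts.values())
    return sum((reference_year - year) * n for year, n in counts.items()) / total


growth = annual_growth_rate(papers_per_year)         # about 369.66
age = average_age(papers_per_year, REFERENCE_YEAR)   # about 1.18
single_author_share = 52 / sum(papers_per_year.values()) * 100  # about 10.26
```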

The total average citations per article (MeanTCperArt) is the average number of citations accumulated by each article published in a given year (regardless of how many years have passed since publication), while the annual average citations per article (MeanTCperYear) adjusts citations for the age of the articles, so that articles from different years can be compared. Thus, older articles (2023) have had a longer period of time to accumulate citations, and, even on an annual basis, they register a higher number than more recent ones (2024).

3.2. Top-10 Most Cited Articles Review

The following section will describe in more detail the 10 most-cited papers, taking into account the author count and discussing the purpose of the articles, the data sources used, and the results obtained.

First, for each of the 10 most cited works, some general information is presented (Table 3): the first author’s name, publication date, journal, total number of authors, total number of citations (TCs), total citations per year (TCY, calculated as the ratio between the paper’s total citations and the number of years since its publication), and the normalized total citations (NTCs, calculated as the ratio between the TC of the paper and the mean number of citations of all papers in the dataset published in the same year). NTCs thus compare an article’s total citations with the average of the articles published in the same year. As can be seen, the article by Dwivedi et al. [64] is the most cited article on the list. The very large number of authors (73), 1336 TCs, TCY of 445.33, and 22.08 NTCs suggest complex international collaboration. Cooper [65], as a single author, recorded some quite impressive indicator values: 415 TCs, a TCY of 138.33, and 7.17 NTCs. These suggest an extremely relevant and original contribution to science education.
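
The two citation indicators can be sketched in Python as follows. The reported TCY values are consistent with dividing TC by three years in print (for example, 1336 / 3 ≈ 445.33), so the sketch counts years inclusively up to an assumed reference year of 2025. The function names are illustrative, and the citation counts passed to `normalized_citations` are hypothetical, since the per-year citation data are not reproduced here.

```python
from statistics import mean


def citations_per_year(tc: int, pub_year: int, reference_year: int = 2025) -> float:
    """TCY: total citations divided by the number of years in print (inclusive)."""
    years_in_print = reference_year - pub_year + 1
    return tc / years_in_print


def normalized_citations(tc: int, same_year_tcs: list[int]) -> float:
    """NTC: total citations divided by the mean TC of papers from the same year."""
    return tc / mean(same_year_tcs)


# Values reported in the text for the two most cited papers (both from 2023).
tcy_dwivedi = citations_per_year(1336, 2023)  # about 445.33
tcy_cooper = citations_per_year(415, 2023)    # about 138.33

# Hypothetical same-year citation counts, used only to show the calculation.
ntc_example = normalized_citations(30, [10, 20, 30])  # 30 / 20 = 1.5
```

An NTC above 1 means the paper is cited more often than the average paper published in the same year, which is what allows comparison across publication years.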

Dwivedi et al. [64] brought together 43 interdisciplinary contributions from diverse fields, including marketing, education, IT systems, public policy, management, publishing and healthcare, to analyze the impact of transformative generative AI technologies such as ChatGPT. These tools, which are capable of producing sophisticated texts that are hard to differentiate from human-authored ones, offer substantial opportunities for productivity gains in sectors such as tourism, banking, management, IT, and marketing. However, the authors also highlight the associated challenges: technological limitations, security and privacy risks, possible imbalances caused by errors and biases in training data, and the dangers of misinformation or misuse. Opinions on the need for regulation are divided.

Cooper [65] explores the potential of generative AI, particularly ChatGPT, in the field of science education. The study follows three main directions: how ChatGPT responds to questions related to science education, how teachers could integrate this tool into teaching, and the author’s reflections on its use as a research tool. The results show that ChatGPT’s responses are often coherent and well matched with the core themes of science education. However, the author warns of the risk that AI may be perceived as an absolute epistemic authority, presenting information without sufficient evidence or nuance. Ethical concerns include the risk of violating certain rights, the impact on the environment, and existing challenges related to content moderation.

Yeo et al. [66] investigate how accurate and useful ChatGPT is in providing information about cirrhosis and hepatocellular carcinoma (HCC). ChatGPT answered most questions correctly (79.1% for cirrhosis and 74% for HCC), but only some of the answers were considered comprehensive. It performed better on questions related to lifestyle and treatments, but weaker on diagnosis and prevention. It failed to capture regional differences in medical guidelines, but provided practical and emotional support to patients. The authors conclude that ChatGPT may be a useful complementary instrument to inform patients and support physicians.

Rahman and Watanobe [67] set out to analyze the opportunities and threats presented by ChatGPT for education and research, from the perspective of both learners and teachers. The research also sought to highlight how ChatGPT can support programming learning by generating code, pseudocode, and error correction. For validation, practical experiments and questionnaires were conducted with students and teachers.

Perkins [68] analyzes the implications of using AI, particularly LLMs such as ChatGPT, for academic integrity in the formal assessment context. The benefits and risks of these tools are highlighted, particularly their ability to generate original texts that can evade plagiarism detection. The author emphasizes that it is not the use of AI itself that constitutes a violation, but rather the lack of transparency in declaring this use. Thus, the responsibility lies with higher education institutions, which must update their academic integrity policies to reflect the new educational realities.

Jeon and Lee [69] investigated how teachers can collaborate with ChatGPT in the educational process, analyzing the relationship between technology and the role of teachers. A group of 11 foreign language teachers used ChatGPT for two weeks in their teaching activities, and were subsequently interviewed and their interaction logs analyzed. The results identified four different roles for ChatGPT (evaluator, content provider, teaching assistant, and interlocutor) and three key roles for teachers (critical thinking stimulators, pedagogical coordinators, and promoters of ethics in AI use). The conclusion highlights the complementarity between teachers’ pedagogical expertise and AI functionalities, providing directions for effective human–machine collaboration in future education.

Lyu et al. [70] investigated the usefulness of ChatGPT for translating radiology reports into language that is as accessible as possible for both patients and healthcare providers, with the aim of providing better medical education. A total of 138 reports (62 chest CT scans and 76 brain MRIs) were analyzed, and radiologists rated the translations generated by ChatGPT with a mean rating of 4.27 out of 5. The model produced very few omissions or instances of misinformation and provided relevant recommendations in approximately 37% of cases. There were cases of overly simplified responses or omitted information, but these limitations can be reduced by formulating more detailed requests.

Alqahtani et al. [71] provide a detailed overview of natural language processing through AI and LLMs, such as BARD and GPT-4, highlighting the powerful impact these could have on research and education. The researchers discuss the challenges, benefits, and innovative new applications of these technologies, such as data analysis, text generation, personalized educational assessment and support, and literature review. The article emphasizes the importance of addressing ethical issues and algorithmic biases to maximize the benefits of AI in the field. The goal is to contribute to the debate on the role of AI in education and research, highlighting its potential to improve outcomes for teachers, students, and researchers.

Garcia-Penalvo [72] researched the impact of ChatGPT. Although artificial intelligence already existed, ChatGPT has brought both its advantages and risks back into discussion, particularly in the area of education, due to its capability to generate near-human texts. User reactions to this technology have been mixed, ranging from enthusiasm to fear. The authors emphasize that, although ChatGPT is a disruptive force, its success depends on how it will be understood and integrated into practice. Banning or denying it will not stop its spread, which is why it is considered essential to understand its advantages and risks in order to minimize the latter.

Oh et al. [73] evaluated the capacity of ChatGPT models, specifically GPT-3.5 and GPT-4, to understand complex information in the context of general surgery, using 280 questions found in surgery exams from 2020 to 2022. Findings showed that GPT-3.5 achieved an accuracy of 46.8%, while GPT-4 performed significantly better, with an accuracy of 76.4%. GPT-4 performed consistently across all surgical specialties, with accuracy rates ranging from 63.6% to 83.3%. The conclusion is that GPT-4 demonstrates a very good ability to understand information, but its use must be complementary to human judgment and expertise.

A brief description of the items described above is presented in Table 4.

3.3. Thematic Analysis

To gain a thorough understanding of the structure of data and the evolution of topics of interest within the field of study, in this section of the paper, we will apply an analytical approach that combines exploration of the relationships between concepts with an analysis of how they develop and interact over time. This perspective allows us to identify the main thematic directions, highlight significant connections between terms, and delineate coherent groups of ideas or topics.

Factor analysis is a method used to reduce an initial set of interrelated variables to a smaller number of factors that reflect the internal structure and common relationships between these variables. Its purpose is to identify latent dimensions (factors) that explain common variability, without referring to an output variable, but exclusively to the relationships between the analyzed variables [74]. Figure 3 illustrates such an analysis, based on the 40 most frequently used Keywords Plus, which were grouped into three clusters.

The factor analysis identified two significant dimensions (Dim.1 and Dim.2). Dim.1 has an educational and scientific component, consisting of terms such as “performance,” “skills,” “higher education,” “attitudes,” “students,” and “science,” thus presenting high positive values, indicating a strong association with topics related to the learning process, assessment, and academic progress. On the other hand, terms such as “literacy,” “readability,” “information,” and “health” have significant negative values, suggesting possible difficulties or concerns related to the clarity and effective transmission of information. Dim.2 highlights the technological and innovation component, consisting of terms such as “generation,” “design,” “technologies,” “teachers,” “big data,” and “management,” which are positively correlated, indicating an increased interest in integrating technologies into education and educational management processes. In contrast, terms such as “skills,” “science,” and “impact” have negative scores, which may signal challenges in the practical application of innovations in educational contexts.

Based on the two dimensions identified, three main clusters were subsequently created. Cluster 1 is the largest group, bringing together concepts related to education, artificial intelligence, and communication (“chatgpt,” “knowledge,” “education,” “communication,” “literacy,” “motivation”). Cluster 2 includes terms with high values on Dim.1, such as “performance,” “impact,” “skills,” “higher education,” “perceptions,” “attitudes,” “science,” and “metaanalysis,” representing the area associated with academic results and evaluation processes. Cluster 3 is positively correlated with Dim.2, with terms such as “big data,” “technologies,” and “management,” reflecting a concern for digital transformation and innovation in the administration of educational processes.

A factor analysis was then performed on the bigrams found in the titles and, as in the previous case, two dimensions and three clusters were identified (Figure 4). The first dimension included terms related to the technological side of AI infrastructure: “generative pre-trained” (referring to the architecture and learning mode behind modern AI models), “pre-trained transformer” (the technical term describing the technological foundation of LLMs), and “generative chat” (an applied term describing the interactive functionality of generative models). In the second dimension, groups of words related to advanced, theoretical, or reflective education were identified: “self-regulated learning” (describes the student’s ability to control their own learning process), “science education” (teaching and learning scientific disciplines), as well as “intelligence tools” (refers to applications that use AI).
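The bigram extraction that underlies this analysis can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the tokenization rule, the absence of stop-word removal, and the two example titles are all assumptions made for the example.

```python
import re
from collections import Counter

def title_bigrams(titles):
    """Count adjacent word pairs (bigrams) across a list of titles."""
    counts = Counter()
    for title in titles:
        # Lowercase and keep hyphenated words such as "pre-trained" whole.
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", title.lower())
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts

# Hypothetical titles, for illustration only.
titles = [
    "Generative Pre-trained Transformer in Science Education",
    "Self-Regulated Learning with a Generative Pre-trained Transformer",
]
bigram_counts = title_bigrams(titles)
for bigram, count in bigram_counts.most_common(3):
    print(bigram, count)
```

A real analysis would typically filter stop words and apply a minimum frequency threshold before feeding the resulting term matrix to the factor analysis.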

Based on these two dimensions, three main thematic clusters were identified. Cluster 1 consists of terms referring to the use of artificial intelligence in education, the development of teaching materials, and academic assessment. These include “language models”, “artificial intelligence”, “prompt engineering”, “academic integrity”, “machine learning”, “comparative study” and “nursing education”. This cluster highlights the interest in integrating AI technologies into educational and research processes. Cluster 2 brings together more theoretical terms such as “generative artificial”, “science education”, “intelligence chatbots”, “self-regulated learning” and “intelligence tools”. These concepts indicate researchers’ concern for pedagogical aspects. Cluster 3 consists of the basic concepts of modern AI architectures, such as “generative pre-trained”, “pre-trained transformer” and “generative chat”, which define the structure and technical operating principles of LLMs.

Figure 5 shows the 150 most common Keywords Plus, with a minimum cluster frequency of 15, grouped by thematic context. The thematic map often features two metrics. On the one hand, there is centrality, which can be used to deduce the impact that certain keywords have on the context; and on the other hand, there is density, which serves to identify the degree of development [75].
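A minimal sketch of how the two metrics can be derived from keyword co-occurrences is given below. It assumes Callon's formulation as commonly used in co-word analysis (equivalence index, centrality as ten times the sum of external links, density as one hundred times the sum of internal links divided by the number of keywords in the theme); the text does not state the exact formulas or cut-off values behind Figure 5, and all frequencies, co-occurrence counts, and thresholds here are hypothetical.

```python
from itertools import combinations

def equivalence(cooc, freq, a, b):
    """Equivalence index e_ab = c_ab^2 / (c_a * c_b)."""
    c_ab = cooc.get(frozenset((a, b)), 0)
    return c_ab * c_ab / (freq[a] * freq[b])

def callon_measures(cluster, keywords, cooc, freq):
    """Return (centrality, density) of one theme (assumed Callon formulas)."""
    members = sorted(cluster)
    # Links among the keywords inside the theme.
    internal = sum(equivalence(cooc, freq, a, b)
                   for a, b in combinations(members, 2))
    # Links from the theme's keywords to keywords of other themes.
    external = sum(equivalence(cooc, freq, a, b)
                   for a in members for b in keywords if b not in cluster)
    return 10 * external, 100 * internal / len(members)

def quadrant(centrality, density, centrality_cut, density_cut):
    """Map a theme to a thematic-map quadrant given two cut-off values."""
    high_c = centrality >= centrality_cut
    high_d = density >= density_cut
    if high_c and high_d:
        return "Motor Themes"
    if high_d:
        return "Niche Themes"
    if high_c:
        return "Basic Themes"
    return "Emerging or Declining Themes"

# Hypothetical keyword frequencies and co-occurrence counts.
freq = {"chatgpt": 4, "education": 4, "readability": 2, "literacy": 4}
cooc = {
    frozenset(("chatgpt", "education")): 4,
    frozenset(("readability", "literacy")): 2,
    frozenset(("education", "literacy")): 2,
}
theme_a = {"chatgpt", "education"}
theme_b = {"readability", "literacy"}
measures_a = callon_measures(theme_a, freq.keys(), cooc, freq)
measures_b = callon_measures(theme_b, freq.keys(), cooc, freq)
print(measures_a, measures_b)
```

Under these formulas, a theme with high centrality but low density falls in the “Basic Themes” quadrant, which is how broadly connected but internally loose keyword groups are usually read.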

As can be seen, four groups have been formed. The “Basic Themes” quadrant consists of two words and 38 occurrences: “chatgpt” (16 occurrences) and “artificial intelligence” (22 occurrences). This group highlights the technological dimension, particularly AI tools, with a focus on ChatGPT. The terms are associated with discussions on the concrete use of AI in education, but also in other fields. The second group, in the “Niche Themes” quadrant, contains three words and 32 occurrences: “students” (12 occurrences), “performance” (11 occurrences), and “impact” (9 occurrences). When researching the field of education, it is normal for these words to appear in the “Basic Themes” quadrant, as the main subjects of the educational process are students, and the goal of all applied techniques is the performance they achieve. In fact, the emphasis is on assessing the impact of AI on learning outcomes. The third group, also in the “Basic Themes” quadrant, is composed of two words and 19 occurrences: “readability” (11 occurrences) and “literacy” (8 occurrences). These terms suggest a concern for understanding digital content and the level of literacy required to use educational or AI-generated materials. The last cluster, from the “Emerging or Declining Themes” quadrant, is the smallest, with a single word (“education”) appearing 10 times. The uniqueness of this term may suggest the connectivity it creates with the other clusters. It can be interpreted as a central semantic node in the network, and is, in fact, the common background of the entire analysis.

Similarly to the previous figure, another thematic map was created (Figure 6), this time focusing on bigrams (two-word groups) found in the article abstracts in the database. A total of 250 words were picked, with a minimum appearance frequency in groups of 15. Even though the number of groups is smaller (only three), the occurrence of the words that form them is much higher compared to the previous situation.

The highest word count was in the green group, which is on the border of the “Basic Themes” and “Motor Themes” quadrants, totaling 78 word groups. Although most occurrences were recorded by the word groups “artificial intelligence” (224 occurrences), “intelligence ai” (111 occurrences), “generative ai” (65 occurrences), and “generative artificial” (63 occurrences), this cluster reflects both the impact and potential that AI has on the educational process, ethics, and personalized training. In other words, it is a cluster that synthesizes the vision of AI use in education. The next-largest cluster by number of components, shown in pink, is located on the border between the “Emerging or Declining Themes” quadrant and the “Basic Themes” quadrant, totaling 35 word groups.

Even though it consists of 35 word groups, the number of occurrences that some of them have recorded is higher than in the previous cluster. In this category, the following groups can be mentioned: “language models” (329 occurrences), “models llms” (919 occurrences) or “language models” (114 occurrences). This cluster focused on LLM technology, how it works, how it is trained (“pre-trained transformers”), but also its direct applications in education (“multiple-choice questions”, “reading comprehension”), research, or health (“clinical practice”). In other words, it refers to the mechanisms and technical validations of AI models. The last cluster, located in the “Niche Themes” quadrant, consists of 24 word groups. Within this cluster, the most frequent occurrences were recorded by the following groups of words: “patient education” (31 occurrences), “grade level” (20 occurrences), “education materials” (18 occurrences) and “reading level” (15 occurrences). This group focuses on quality assessment of AI-generated texts, particularly for educational and medical purposes. Readability measurement tools (“Flesch–Kincaid grade”, “Likert scale”) are mentioned, and the suitability of content created by AI models for patients, students, or general users is investigated.
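For reference, the Flesch–Kincaid grade mentioned above can be computed from three counts, as in the following sketch. The formula is the standard published one; the function name and the example counts are hypothetical, and counting syllables automatically is itself heuristic, so the counts are taken as inputs here.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from word, sentence, and syllable counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Hypothetical counts for a short patient-education text.
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(round(grade, 2))
```

Longer sentences and more syllables per word both raise the grade, which is why studies of AI-generated patient materials report it alongside the reading level.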

Based on the above considerations related to the thematic maps, a series of key interest themes can be highlighted:

Theme 1—Technological Foundations of LLMs in education—pointing towards the technical aspects of LLMs in education, highlighting how AI and LLMs are positioned as transformative tools;
Theme 2—Educational Impact for Students, Performance, and Learning Outcomes—focusing on how AI influences learners and their achievements;
Theme 3—Content Quality, Readability, and Literacy—including works that focus on the concerns raised regarding the accessibility and appropriateness of AI-generated content;
Theme 4—Ethics, Personalization, and Responsible Use of AI in Education—advocating for the responsible use and adoption of LLMs in education, touching on themes related to fairness, personalization, and potential risks.

Considering similar works from the field which have considered thematic map analysis, the research conducted by Lampropoulos [76] aimed to observe how intelligent tutoring systems behave in virtual reality and augmented reality environments by conducting a synthetic analysis of the relevant literature in the field, published from 2015 to 2024. According to the author, these systems can transform traditional teaching and learning processes and transform the quality of education at all levels. By creating a thematic map, the author identified five themes: a theme related to virtual reality, intelligent tutoring systems, and AI; a theme related to learning, students, teachers, and teaching; a theme related to architecture, design, and frameworks; a theme relating to training, healthcare education, simulation, and clinical competency; and a theme relating to intelligent systems, intelligent tutors, or intelligent agents. Comparing the results obtained in Lampropoulos [76] with those obtained in the current study, there is consistency in the themes associated with technological foundations, as well as an emphasis on students, teaching, and learning, and a partial overlap in areas related to medical education.

3.4. Topic Discovery

The discovery of themes was achieved using LDA and BERTopic. The results obtained are discussed below.

3.4.1. Topic Discovery Through LDA

Through the LDA analysis, four topics have been identified across the corpus, which capture the works focusing on the use of LLMs in education, as shown in Figure 7.

The largest topic, Topic 1, gathering 65.7% of tokens as depicted in Figure 8, accounts for the majority of terms and focuses on students, learning, and the integration of chatbots and generative AI in providing feedback, assessment, and content generation, as suggested by the most salient keywords, namely student, generative_ai, feedback, gpt, design, technology, learning, support, enhance, impact, educator, and educational.

The second-largest topic, referred to as Topic 2, gathers 21.6% of tokens and highlights a series of the most relevant terms, including keywords such as question, response, patient, patient_education, accuracy, readability, assess, safety, exam, evaluate, analysis, and test, as shown in Figure 9. The topic focuses on evaluating LLM-generated responses in various contexts, such as patient education and assessment solving.

The remainder of the identified topics (Topic 3 in Figure 10 and Topic 4 in Figure 11) each retain almost the same amount of tokens, namely 6.4% for Topic 3 and 6.3% in the case of Topic 4.
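One way such token shares can be obtained is sketched below, assuming that each topic's share is the sum of the per-document topic proportions weighted by document length (the convention used by common LDA visualization tools). The text does not state how the percentages in Figures 8–11 were computed; the function name, document-topic proportions, and document lengths are hypothetical.

```python
def topic_token_shares(doc_topic, doc_lengths):
    """Percentage of corpus tokens attributed to each topic.

    doc_topic: per-document topic proportions (each row sums to 1).
    doc_lengths: number of tokens in each document.
    """
    n_topics = len(doc_topic[0])
    totals = [0.0] * n_topics
    for theta, length in zip(doc_topic, doc_lengths):
        for k, proportion in enumerate(theta):
            totals[k] += proportion * length
    grand_total = sum(totals)
    return [100 * t / grand_total for t in totals]

# Hypothetical two-topic example with three documents.
doc_topic = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
doc_lengths = [100, 100, 50]
shares = topic_token_shares(doc_topic, doc_lengths)
print(shares)

# Sanity check on the shares reported for Topics 1-4 in the text.
reported = [65.7, 21.6, 6.4, 6.3]
print(round(sum(reported), 1))
```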

Considering the words listed in the category of 30 most relevant terms, in the case of Topic 3, it can be observed that these gravitate around key terms such as chatbot, accuracy, readability, health_literacy, patient, evaluate, quality, expert, summary, report, and caregiver. As a result, it can be stated that Topic 3 is dedicated to chatbots for healthcare/educational purposes, with emphasis on health literacy, evaluation of quality, readability of AI-generated content, and caregiver/patient use cases.

On the other hand, Topic 4 provides a series of keywords such as child, student, learning, creativity, emotion, development, bias, literacy, disparity, ai_driven, science, and classifier, highlighting that the works associated with this topic focus on broader educational and ethical dimensions: student development, creativity, bias, disparities, literacy, and fairness. Thus, it connects LLMs to responsible and inclusive use in education.

Taking into account the findings of LDA results and comparing them with previous results obtained through thematic maps, it can be seen that a large proportion of topics are supported by the discovered themes, confirming the thematic synthesis and further highlighting research directions in the domain of LLM in education.

More specifically, Topic 1 links the technological foundations of LLMs and their educational impact, showing how the tools built on them (e.g., ChatGPT) are discussed simultaneously as technical innovations and as practical support for students and educators. Topic 2 connects educational impact with content quality, and emphasizes issues of accuracy, readability, and evaluation in both educational and patient education contexts. Furthermore, Topic 3 covers content quality and literacy themes, with particular emphasis on chatbot applications and health literacy, supporting the growing cross-domain relevance of LLM outputs. Finally, Topic 4 can be connected with the theme of ethics and responsible use, while also overlapping with educational impact by addressing student development, creativity, and disparities.

Overall, the LDA analysis not only validates the four key themes identified through thematic mapping but also reveals cross-cutting overlaps, showing that technological, pedagogical, quality, and ethical concerns are often intertwined in current research on LLMs in education. The results are summarized in Table 5.

3.4.2. Topic Discovery Through BERTopic

In this paper, BERTopic has been used to uncover the topics related to LLMs in education covered by the papers included in the dataset. As a result of running this analysis, four topics have been identified, as depicted in Figure 12.

For each of the four topics, a list of the 10 most used words has been extracted. The list of the top five keywords is depicted in Figure 13 for each topic, noted in this case from Topic 0 to Topic 3.

Topic 0 is the dominant topic, characterized by words such as AI, education, ChatGPT, learning, students, research, and study. Given the broad terms extracted for this topic, it can be stated that Topic 0 is basically a general discourse cluster, capturing broad discussions on how AI/ChatGPT is applied in education and research.

Topic 1 gravitates around articles discussing the accuracy and evaluation of ChatGPT’s responses, especially in medical/dental education contexts, featuring a series of domain-specific keywords such as ChatGPT, questions, responses, patient, accuracy, surgery, dental, education, and information.

The last two topics are reduced in size even in this case, focusing on either comparisons between ChatGPT and other models (Google/Gemini) in patient education and readability of medical information (Topic 2) or use of GPT-4 in radiology for generating patient pamphlets and summaries (Topic 3).

Furthermore, it should be mentioned that for Topic 2, which is characterized by keywords such as Google, responses, patients, questions, ChatGPT, readability, floaters, ophthalmologists, and Gemini, the mention of floaters/ophthalmologists shows a niche application in ophthalmology.

In the case of Topic 3, the keyword pamphlet, listed alongside the other keywords specific to this topic (Radiology, reports, summaries, radiologists, GPT-4, and accuracy), might at first glance look like an error. However, it makes sense in the context of the medical/clinical education literature, where it refers to patient education pamphlets (also called leaflets or brochures), a common format for communicating medical information in an accessible way. Radiology departments often produce pamphlets or informational leaflets to help patients understand diagnostic procedures, results, or risks. In recent studies, LLMs (e.g., ChatGPT, GPT-4) have been tested for their ability to generate simplified patient education pamphlets from technical radiology reports. The keyword is therefore tied to the content quality and readability dimension of AI outputs in radiology, where the model is used to generate easy-to-read summaries for non-experts, contributing to patient education.

Considering the content of BERTopic 2 and BERTopic 3, it can be further noted that both clusters highlight highly domain-specific applications in healthcare, ophthalmology and radiology, respectively, which both emphasize content quality and literacy, particularly in relation to patient education and the accessibility of AI-generated medical content.

After considering the most used keywords for each topic, the following topics have been extracted: BERTopic 0—General Applications of AI/ChatGPT in Education with Focus on Student and Learning; BERTopic 1—Accuracy and Evaluation of ChatGPT in Medical or Dental Education; BERTopic 2—Comparisons of ChatGPT vs. Gemini in Patient Education with a Focus on Ophthalmology; and BERTopic 3—GPT-4 in Radiology for Patient Education.

Furthermore, if we refer to the themes identified through the thematic maps, it can be observed that the BERTopics align with Themes 1–3 in thematic maps, as highlighted in Table 6. Moreover, unlike the LDA model, BERTopic did not generate a distinct theme corresponding to ethics, personalization, and responsible use. This suggests that while discussions of fairness and bias are present in the literature, they are less tightly clustered in specific contexts than the more concrete domains of student learning and patient education. Taken together, the BERTopic results confirm the centrality of Themes 1–3 in the discourse on LLMs in education, while the LDA results add nuance by highlighting Theme 4 as a cross-cutting but less domain-specific concern.

3.5. Review Based on Identified Themes and Topics

Analysis of the themes identified by the LDA and BERTopic models highlights an increasingly strong link between the technological and ethical dimensions of research on the use of LLMs in education. The studies identified in 2023 showed technological enthusiasm, emphasizing the potential of LLMs for efficiency, automation, and content generation. By contrast, the 2024 studies show a shift of interest towards the responsibility of use and the impact on academic integrity, data privacy, and algorithmic fairness. This temporal evolution suggests an emerging tension between innovation and regulation, in which technological development is accompanied by a rise in ethical and social concerns. In fact, a progressive link can be observed between the pedagogical, technological, and ethical dimensions, as seen in the common orientation toward personalized learning and student-centered education, but also in attempts to ensure a balance between automation and human critical thinking. This correlation between development directions indicates that the field of LLM-based education is evolving from a performance-centered technological phase to an interdisciplinary and reflective one, in which ethics, pedagogy, and innovation play very important roles in educational development.

Based on the results provided by the thematic maps, LDA and BERTopic, four research areas are identified, as discussed below.

3.5.1. Technological Foundations of LLMs in Education

This area derives from the researchers’ interests observed as a result of Theme 1, LDA Topic 1, and BERTopic 0 and it covers a wide range of research focusing on AI tools, ChatGPT, GPT-4, generative AI, integration into classrooms, and educational research.

Over time, researchers such as Uribe et al. [79] have investigated the perceptions of dental educators regarding the utilization of AI chatbots and LLMs, like Gemini or ChatGPT, in dental education. A global survey conducted in May–June 2023 revealed that although 64% of respondents recognize the use potential of these types of tools in teaching, only 31% are currently using them. AI is considered beneficial for research, knowledge accumulation, and clinical decision-making, but there are concerns about reduced human interaction and the lack of clear implementation guidelines.

Another area where the integration of LLMs into education was explored was chemical engineering [80]. The main purpose for which they were used was to improve problem solving, critical thinking, and understanding of fundamental concepts. LLMs proved to be both accessible and useful when, in an experimental course, students used them to solve practical problems and build models, such as calculating the efficiency of a steam turbine cycle. Moving to a different discipline, Smith et al. [81] conducted a study in social psychiatry, exploring the utilization of ChatGPT as a learning and teaching support tool based on case studies. The researchers analyzed how ChatGPT 3.5 can be used to generate educational materials, facilitate debates, encourage self-directed learning, and provide theoretical support. A concrete example was the generation of a case vignette relevant to social psychiatry. The conclusions show that although ChatGPT has potential as an active teaching tool, there are limitations related to the accuracy and bias of information, and the involvement of human educators remains essential.

In turn, Satpute et al. [82] explored the potential of LLMs to generate code from natural language for computational experiments in materials science. The study focuses on the use of LLMs to solve the partial differential equations (PDEs) of phase-field models, which are used for modeling the microstructures of materials. Findings show that LLMs are capable of generating code for medium-complexity multiphysics problems but struggle with highly complex problems that require detailed instructions and deeper understanding. Given their widespread use in education, Azaiz et al. [83] tested the ability of LLMs (specifically GPT-3.5) to provide personalized formative feedback to programming students in a higher education context. The results show that the model correctly identified 73% of submissions and provided quality feedback in 59% of cases, but there were also problems, such as misidentifying errors.

As an area of interest, foreign language learning has also been a field in which the impact of artificial intelligence, particularly LLM-based chatbots, has been studied. In this regard, a study was conducted in which 52 students were divided into two groups, one using a chatbot and the other not, and both learned the same words for eight weeks [84]. Findings revealed that the chatbot-assisted group achieved superior results on vocabulary tests, both receptive and productive, and retained words better in the long term. While this study focused on foreign language vocabulary learning, evaluating the effectiveness of an LLM-based chatbot in supporting word memorization and retention, Liu et al. [85] argued that it is of interest to analyze the improvement of writing in English as a foreign language (EFL) through training in self-regulated learning (SRL) strategies using a model called CALLA-LLM (LLM-supported). The participants were 65 elementary school pupils. They were divided into an experimental group, which was taught using the CALLA-LLM, and a control group, which was trained using the traditional CALLA model. The results indicated that the group using the CALLA-LLM showed significant progress in writing, motivation, and SRL use, outperforming the control group.

3.5.2. Impact on Students, Performance, and Learning Outcomes

This research direction is connected with Theme 2, LDA Topics 1 and 2, and BERTopic 1 and focuses on student learning, performance, testing, comprehension, and feedback.

Mannekote et al. [86] investigated how LLMs could be used to design environments that cater to both the emotional and cognitive needs of students. The researchers identified three main challenges: understanding how LLMs represent students, tailoring feedback based on these representations, and creating educational agents based on LLMs. The goal was to develop intelligent tutors that support each student individually. Similarly, Alfirević et al. [87] examined the use of LLMs in teaching and research through an innovative method of digital ethnography, in which LLMs (ChatGPT-4 and Gemini Ultra) were treated as research participants. Their potential roles in educational processes were identified, as well as risks such as discrimination, bias, or conflicts of interest. In the same vein, Zhang and Zhang [88] investigated how LLMs can improve the learning experience and support teachers’ work, using self-determination theory (SDT) as a theoretical basis. In a study conducted in two middle schools in southwestern China, they observed the effects of using an LLM in a two-week project-based learning lesson. The results show that when LLMs respond to students’ psychological needs, they can increase engagement and learning performance. An innovative technique for teaching interior design was presented by Şekerci et al. [89], who used AI to transform multisensory ideas into visual images. By combining an LLM with the TextFX tool, students are able to express their creativity beyond traditional design.

Questions have been raised about the capacity of LLMs to perform criteria-based evaluation and about the possible impact of the precise wording of requirements on the score [90]. On this premise, it has been observed that even free LLMs can provide accurate and detailed assessments, highlighting that domain-specific understanding is more important than model complexity. This underscores the potential of LLMs to provide scalable educational feedback. Another topic addressed was the consistency of GPT-4-generated assessments of responses to various macroeconomics tasks in higher education [91]. Using the intraclass correlation coefficient (ICC), the authors found that GPT-4 provides very consistent assessments over time for both content and style, with ICC values ranging from 0.94 to 0.99. The results indicate the high reliability of the model in assessing content and style, and the prompt used is presented in detail. As this has been a rather controversial topic lately, Perkins [68] examined the effect of AI use, particularly that of LLMs like ChatGPT, on academic integrity in formal assessments. The author focused both on the risks associated with generating original texts that are difficult to detect as being AI-assisted and on the benefits of these tools in supporting digital writing and language learning. The conclusion is that AI use does not in itself constitute plagiarism, whereas a lack of transparency in declaring this use does, and that higher education policies need to be updated to respond to these challenges.
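To make the reported ICC range of 0.94 to 0.99 easier to interpret, the following is a minimal sketch of how an intraclass correlation coefficient can be computed from a table of repeated ratings. It assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1); the reviewed study [91] may have used a different ICC variant, and the function name and the example scores are purely illustrative.

```python
def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per rated answer (subject) and
    one column per rating round or rater. This ICC variant is an
    assumption made for illustration only.
    """
    n = len(ratings)          # number of rated answers
    k = len(ratings[0])       # number of rating rounds / raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with n >= 2 and k >= 2")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical scores: four answers, each rated in three separate rounds.
example_scores = [
    [7, 7, 8],
    [4, 5, 4],
    [9, 9, 9],
    [6, 6, 7],
]
print(round(icc2_1(example_scores), 3))
```

Values close to 1 indicate that almost all of the variance in scores comes from differences between the rated answers rather than from differences between rating rounds, which is the sense in which the assessments are described as consistent.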

To test the accuracy of anatomical information, Arun et al. [92] developed a customized chatbot that, based on specialized knowledge, would provide contextualized answers, and compared its performance with that of ChatGPT 3.5 in responding to queries in the thoracic anatomy field. The results showed that the custom chatbot provided significantly more factually accurate answers than ChatGPT 3.5, while for the other criteria evaluated (relevance, completeness, coherence, and fluency), there were no statistically significant differences between the responses analyzed. Similarly, Ehlert et al. [93] assessed the accuracy of LLM answers to pharmacy exam questions. When testing Chatsonic, GPT-4, and GPT-3.5, GPT-4 was found to have the highest accuracy, significantly outperforming GPT-3.5 and Chatsonic. In turn, Jeong et al. [94] compared ChatGPT, Bing Chat, Bard, and ChatGPT Plus, evaluating their responses to questions on maxillofacial and oral radiography against those of dental students. The highest scores were achieved by the students, followed by ChatGPT, Bard, ChatGPT Plus, and finally Bing Chat. Even though ChatGPT Plus managed to provide better answers in terms of basic knowledge, all chatbots had problems interpreting images.

3.5.3. Content Quality, Readability, and Literacy in AI Outputs

This research area emerged from Theme 3, LDA Topics 2 and 3, and BERTopic 2 and 3, and covers issues related to the quality of AI outputs, their readability levels, and their suitability for students and patients.

Within this research direction, the ability of LLMs to be integrated into various intelligent applications in the field of ophthalmology was analyzed [95], highlighting the potential of these models to support physicians, patients, and decision-makers in the learning, documentation, and diagnosis processes. In addition, another study assessed the quality of the responses given by different LLMs to a common ophthalmological question [96]. Although all platforms provided accurate data, they did not specify the severity of the medical problem.

In contrast to the aforementioned research, Lyu et al. [70] analyzed the feasibility of using LLMs to translate radiological reports so that they could be understood by both healthcare providers and patients. The results showed that ChatGPT is capable of translating these reports into simple language and even offering suggestions to patients based on the results in the report. Also with patients in mind, the reliability and accuracy of LLM-generated educational brochures on interventional radiology procedures was examined [97]. This time, however, the results showed that approximately 30% of the brochures contained deficiencies, such as omissions regarding sedation or inaccurate descriptions of complications. Although the language was generally coherent and clinically relevant, significant variations were observed between the responses generated for the same procedures, raising consistency issues.

Based on the same principle of educating patients medically, Cohen et al. [98] studied the quality of information about cataract surgery from ChatGPT and Google. Using frequently asked questions, specialists evaluated the responses generated by the two platforms, finding that ChatGPT was more detailed but also more complex in terms of language, requiring a university level of reading, with the rate of incorrect information being significantly lower for ChatGPT (6%) compared to Google (27%).
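Reading-level findings such as the one above are typically obtained with a readability index. As a rough illustration, the sketch below computes the Flesch–Kincaid grade level, which uses the standard formula 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The syllable counter is a simple vowel-group heuristic introduced here for illustration, not the method used in the reviewed studies, and the two sample sentences are hypothetical.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count vowel groups, drop a silent final 'e'.

    This heuristic is an assumption for illustration; dedicated
    readability tools use more refined rules or dictionaries.
    """
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    count = len(groups)
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level of `text` (approximate US school grade)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


# Hypothetical patient-education snippets, one plain and one technical.
plain = "The lens in your eye is cloudy. The doctor will swap it for a clear one."
technical = (
    "Phacoemulsification necessitates meticulous preoperative evaluation "
    "of intraocular lens calculations."
)
print(round(flesch_kincaid_grade(plain), 1))
print(round(flesch_kincaid_grade(technical), 1))
```

A higher grade for the technical snippet reflects longer sentences and more syllables per word, which is the pattern behind a text "requiring a university level of reading."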

3.5.4. Ethics, Personalization, and Responsible Use of LLMs

Even though BERTopic did not strongly surface this direction, it was highlighted by Theme 4 and LDA Topic 4 and relates to fairness, bias, disparity, personalization of learning, and creativity in the use of LLMs in education.

Warr et al. [99] reported experimental evidence of racial bias in ChatGPT 3.5 in the educational assessment context. Using identical passages from student papers but with different demographic descriptions (school type, race, socioeconomic status), the authors observed that the model assigned significantly biased scores when racial cues were implicit rather than explicit. Sequential effects were also identified, in which changes in variables within the same chat amplified bias. In turn, Wachter et al. [100] address the concept of “careless speech” generated by LLMs, a new type of risk that poses a long-term threat to science, education, and social truth in a democratic society. Even if they provide convincing answers, LLMs can provide inaccurate or misleading information, thus contributing to the degradation of knowledge. The authors propose establishing a legal obligation for LLM providers to make their models more truthful through democratic and transparent processes. The study supports the need for a “legal obligation to truth” and analyzes existing European regulations.

To demonstrate the educational capability of LLMs, a study examined the ability of ChatGPT, based on GPT-4, to generate synthetic answers to the Force Concept Inventory (FCI), a test widely used in physics education research [101]. The researchers explored the degree to which a generative model can solve the test accurately and also simulate learning behaviors specific to different groups of students or individuals with distinct preconceptions about force and mechanics concepts. The results showed that, although the model maintains a high performance in quantitative reasoning, it can also be calibrated to reproduce various cognitive patterns. While in the field of physics LLMs have been explored mainly for simulating cognitive processes and personalizing learning, other research directions have focused on the ethical and cultural dimensions of using these technologies in educational communication. A significant example is the study investigating how AI, through models such as GPT-4, can be used to generate educational content on climate issues [102]. The authors examine the promises and challenges of these models in creating texts, while also highlighting the risks associated with bias or lack of cultural depth.

As their prevalence increases, various LLMs have begun to achieve performance comparable to that of humans in complex tasks, offering significant opportunities for personalizing treatments, reducing administrative burdens, or even facilitating interdisciplinary collaboration. However, Naqvi et al. [103] drew attention to challenges related to data accuracy and algorithmic bias, both of which can affect clinical decisions and the quality of professional education. The researchers emphasize the importance of actively involving professionals in the field in the training and monitoring of AI models, promoting ethical, responsible, and human-supervised use. In this context, Rahimzadeh et al. [104] analyzed whether LLMs can replace or complement traditional education in medical ethics. The study compared the objectives of ethics training with the current capabilities of LLMs, showing that they can support the educational process through examples, reasoning, and simulated discussions, but cannot replace the human dimension of moral reflection and empathy. The integration of ChatGPT into educational programs must be conducted strategically, with clear guidelines, periodic evaluations, and attention to quality and fairness. Thus, ChatGPT may be a helpful instrument in ethics teaching, as long as it is used in a complementary manner, under human supervision, and continuously adapted to technological developments.

4. Discussion and Limitations

This section gives a brief overview of the results of the analysis, followed by a discussion of the study’s limitations.

4.1. Results Obtained

The analysis covered the period 2023–2024 and included a total of 507 articles. The involvement of LLMs in education is a novel topic, with results showing an accelerated expansion of scientific interest, supported by a remarkable annual growth rate (369.66%). Indicators regarding documents and sources demonstrate widespread and rapid dissemination, with 322 sources and over 19,000 references, aspects that highlight both multidisciplinarity and a solid theoretical basis. At an average of 15.03 citations per paper, it can be concluded that this field has rapidly gained visibility and relevance within the academic community. With regard to the authors involved in research in the field, scientific collaboration prevails, with a mean of 4.48 authors per paper and only about 10% of papers being individually authored. The global nature of the research is also supported by a significant percentage of international co-authorship (26.04%). The thematic diversity is confirmed by the large number of keywords used by authors (1451). From the perspective of annual scientific production, 2024 saw a significant peak, with 418 articles published compared to only 89 in 2023, an increase similar to trends observed in other areas of application of AI. Also, older articles (2023) received more citations, showing not only the ongoing interest in the topic, but also that scientific impact grows stronger over time.
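The reported growth rate can be reproduced from the two yearly counts given above. The sketch below assumes the rate is a compound annual growth rate between the first and last year of the window, which is the convention used by common bibliometric tools; the function name is illustrative.

```python
def annual_growth_rate(first_count, last_count, n_years):
    """Compound annual growth rate, in percent.

    `n_years` is the number of calendar years covered (2 for 2023-2024),
    so the rate is compounded over `n_years - 1` intervals. Treating the
    reported figure as a compound rate is an assumption.
    """
    if n_years < 2 or first_count <= 0:
        raise ValueError("need at least two years and a positive first count")
    return ((last_count / first_count) ** (1 / (n_years - 1)) - 1) * 100


articles_2023 = 89
articles_2024 = 418

# The two yearly counts add up to the full corpus of 507 articles.
print(articles_2023 + articles_2024)                                 # 507
print(round(annual_growth_rate(articles_2023, articles_2024, 2), 2))  # 369.66
```

With only two years in the window, the compound rate reduces to the simple year-over-year change, (418 − 89) / 89 ≈ 3.6966, which matches the 369.66% reported.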

Using various tools such as thematic maps, LDA, and BERTopic, four major research directions that structure the field were identified. The first direction, “Technological Foundations of LLMs in Education,” brings together studies investigating the differences in architecture and functionality between GPT, Gemini, or other LLMs, the evolution from the early waves of educational artificial intelligence (such as intelligent tutoring systems and virtual reality) to the current generations based on LLMs, as well as the degree of adoption of these tools by teachers and educators. The second direction, “Impact on Students, Performance, and Learning Outcomes,” focuses on the use of LLMs as assessment and feedback tools, integration into multiple-choice tests, text comprehension exercises, or exam preparation, with particular applications in medical and dental education, where the accuracy of the generated responses is evaluated. The third direction, “Content Quality, Readability, and Literacy in AI Outputs,” aims to analyze the quality and readability of AI-generated output, particularly in contexts such as patient education in ophthalmology and radiology, using readability indices (such as Flesch–Kincaid) and Likert-scale ratings, and addressing concerns about accessibility and literacy in health; here, comparisons between models such as ChatGPT and Gemini are also noteworthy. The fourth direction, “Ethics, Personalization, and Responsible Use of LLMs,” discusses the implications of ethics in the use of AI in learning (from bias and fairness to inclusion and personalization of learning paths), as well as the implications of these technologies for creativity and student development in AI-based learning environments. Thus, the analysis confirms that current research is conducted on a continuum between technological fundamentals, pedagogical impact, the quality of generated content, and the responsibilities associated with using LLMs in education.

The analysis showed that a significant proportion of the papers included in the corpus focus on the application of LLMs in medical education and patient education. This thematic predominance does not reflect a limitation of the field, but rather a natural phase of research development, in which fields with a solid scientific basis, strict ethical requirements, and a high degree of standardization become testing grounds for emerging technologies. Medicine, as an educational field, provides an ideal experimental setting for studying the performance and educational risks associated with LLMs, as it combines theoretical, practical, and relational components.

Analyzing the existing literature, a number of significant research gaps were identified that deserve to be addressed in future studies. Most works focus on descriptive aspects and general perceptions related to the use of LLMs in education, while rigorous empirical evaluations of pedagogical effectiveness, impact on learning, and cognitive skill development are still limited. Furthermore, the current literature does not sufficiently address issues related to ethics and digital sustainability, including algorithmic transparency, data protection, and teacher training regarding the responsible use of LLMs. At the same time, the interdisciplinary dimension remains incomplete: collaborations between fields such as education sciences, computer science, and cognitive psychology are rare, which limits understanding of the integration of artificial intelligence in education.

4.2. Specific LLMs

By examining the papers extracted from the database, a diverse range of LLMs used in educational research was observed, with a clear dominance of the ChatGPT model, which was present in more than half of the studies published between 2023 and 2024. This predominance can be explained by its ease of integration into teaching and assessment processes, high accessibility, and the generalist nature of the model, which allows it to be used in multiple educational fields, from academic writing to automatic feedback generation. For example, Hobensack et al. [105] used ChatGPT in medical education, highlighting its ease of integration into training processes and its potential for automatic feedback, while Kirwan et al. [106] discussed the role of ChatGPT in university activities, emphasizing its formative value and the risks associated with authentic assessment.

In contrast to ChatGPT, GPT-4 is preferred in research requiring complex reasoning, increased accuracy, and contextual interpretation, and is frequently used in comparative analyses and controlled experiments. Jablonka et al. [107] used GPT-4 to solve complex scientific problems, highlighting its superior performance and reasoning ability.

The Bard and Gemini models have been adopted to a lesser extent, particularly for evaluating multimodal capabilities and integrating artificial intelligence into STEM disciplines, while LLaMA, due to its open-source nature, is present in experimental studies focused on model customization and local training. Wang et al. [108] used Gemini for multimodal analysis, applying it to educational performance prediction, and Teckwani et al. [109] used LLaMA for local adaptation of scientific content, thus demonstrating the potential of open-source resources.

At the same time, the emergence of methodological paradigms such as retrieval-augmented generation (RAG) and prompt engineering confirms the transition from the generic use of LLMs to a stage of optimization and contextual integration, in which models are calibrated to respond to specific pedagogical and disciplinary needs. This evolution indicates a gradual maturation of the field, in which LLMs are no longer used only as tools for experimentation, but as adaptive and applicative mechanisms capable of supporting personalized learning, critical thinking, and AI-assisted formative assessment.

4.3. Student Engagement and AI Impact

When it comes to education, it is important to consider the impact that AI development has had on students, which is why Table 7 was compiled.

As an example, Lang et al. [110] set out to explore how students can use the potential of LLMs to improve the efficiency of the learning process in entrepreneurial education, as well as to experimentally validate the performance of these models in solving complex tasks involving semantic reasoning and mathematical calculation. In turn, Wang and Reynolds [111] analyze how Chinese learners use LLMs (such as ChatGPT) to learn English. Drawing on Self-Determination Theory (SDT) and the Unified Theory of Acceptance and Use of Technology (UTAUT), the research shows, through structural equation modeling (SEM) applied to 568 participants, that autonomy, relatedness, competence, perceived effort, and social influence significantly influence the use of LLMs, with perceived ease of use being the decisive factor.

In a new approach, Hershcovits et al. [112] analyzed student engagement in e-learning using Principal Component Analysis to extract patterns of behavior and construct trajectories of activity over time. Based on these trajectories, the authors identified cohorts of students with different patterns of engagement (increasing, decreasing, or constant) and showed that these differences are reflected in how students use the system and the duration of their participation. The results may support more effective interventions in the future. In contrast, Li et al. [113] investigated in depth the factors influencing student engagement in online learning using the Presage–Process–Product (3P) model and data from nearly 2000 high school students. The authors examine the role of information literacy, self-directed learning skills, and academic emotions in the level of online engagement. The results showed that all these factors play an important role in increasing student engagement, with positive emotions strongly mediating the impact of self-directed learning skills.

Table 7. Articles related to student engagement and AI impact.

First Author	Article Title	Research Focus
Lang Q. [110]	Exploring the Answering Capability of Large Language Models in Addressing Complex Knowledge in Entrepreneurship Education	Analyzing how large language models (LLMs) can be used to improve the efficiency of the learning process in entrepreneurial education
Wang X.C. [111]	Beyond the Books: Exploring Factors Shaping Chinese English Learners’ Engagement with Large Language Models for Vocabulary Learning	Identifying the psychological and technological factors that influence students’ intention and behavior to use LLMs in the process of learning English vocabulary
Morris W. [114]	Automated Scoring of Constructed Response Items in Math Assessment Using Large Language Models	Developing and testing an LLM-based approach for automatically scoring extended-response math items on the National Assessment of Educational Progress (NAEP)
Kieser F. [115]	David vs. Goliath: comparing conventional machine learning and a large language model for assessing students’ concept use in a physics problem	Comparing the performance of LLMs with that of conventional machine learning algorithms in assessing students’ use of concepts in a physics problem-solving task
Kurian N. [116]	‘No, Alexa, no!’: designing child-safe AI and protecting children from the risks of the ‘empathy gap’ in large language models	Proposing design and policy recommendations for the development of safe and ethical LLMs for children
Morris W. [117]	Formative Feedback on Student-Authored Summaries in Intelligent Textbooks Using Large Language Models	Studying the impact of AI-generated feedback on students’ self-regulation, writing skills, and motivation to engage with revision tasks
Kirwan A. [106]	ChatGPT and university teaching, learning and assessment: some initial reflections on teaching academic integrity in the age of large language models	Exploring the impact of LLM emergence, particularly that of ChatGPT, on assessment in higher education
Hershcovits H. [112]	Modeling Engagement in Self-Directed Learning Systems Using Principal Component Analysis	A new method for analyzing students’ engagement in e-learning, which tracks how their behavior evolves over time within the system and groups students according to these trajectories in order to understand who persists, who drops out, and why
Li H. [113]	Impact of information literacy, self-directed learning skills, and academic emotions on high school students’ online learning engagement: A structural equation modeling analysis	Investigating how information literacy, self-directed learning, and academic emotions influence high school students’ engagement in online learning

4.4. Complement and Validation of the Results via AI-Based Tools

Here, we searched for papers on LLMs in education using Elicit, an AI-assisted search and systematic review platform. After the search query was submitted to Elicit, a series of papers was retrieved and then filtered by the number of citations received. The results were compared with our WoS-based corpus. The comparison found no direct overlap between the ten most cited WoS articles and the most prominent Elicit results. However, several highly cited journal articles identified by Elicit (e.g., systematic reviews in the British Journal of Educational Technology, Medical Education, Frontiers in Education, and JMIR Medical Education) are very likely part of our broader 507-article WoS dataset, even though they do not appear among the top 10 most cited documents within WoS. The remaining Elicit results are predominantly arXiv preprints, conference proceedings, editorials, or 2025 publications, which fall outside our inclusion criteria (peer-reviewed journal articles indexed in WoS Core Collection, 2023–2024). This comparison suggests that AI-based tools tend to emphasize recent surveys, preprints, and conference literature, while our WoS-based approach systematically captures the peer-reviewed journal segment of the field.

Next, in order to assess the robustness of our Web of Science results, we conducted a further exploratory comparison with AI-assisted search outputs generated by Perplexity. We asked the tool to list the papers related to the use of LLMs in education. In response, the tool retrieved a set of recent surveys, scoping reviews, and conceptual papers on LLMs in education, many of which are hosted on arXiv or published in venues outside the WoS Core Collection (e.g., preprints, conference proceedings, and editorials). Comparing the retrieved list with our dataset, we observed that none of the specific papers surfaced by Perplexity overlap directly with our ten most-cited WoS articles, nor do they appear among the WoS-indexed journal articles included in our final corpus. Despite the lack of one-to-one overlap, however, the thematic content of the Perplexity results aligns closely with the findings of our WoS analysis: both sources emphasize the transformative educational potential of LLMs, their applications in teaching and assessment, the implications for higher education, and the recurring concerns regarding ethics, bias, student autonomy, and responsible use. This comparison suggests that AI-enhanced discovery tools typically foreground the most recent broad reviews and preprints, while our WoS-based approach captures the peer-reviewed journal literature. Taken together, the convergence of themes across these two retrieval methods strengthens confidence in the comprehensiveness and validity of our WoS-derived insights.
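The text does not describe how the overlap between the AI-tool results and the WoS corpus was checked. One simple way to make such a comparison reproducible is to normalize titles and intersect the two lists, as in the minimal sketch below. The function names and all titles are hypothetical placeholders, not entries from the actual dataset, and matching on DOIs would be more robust where they are available.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and spacing differences."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def find_overlap(corpus_titles, tool_titles):
    """Return the tool-retrieved titles that also appear in the corpus."""
    corpus = {normalize_title(t) for t in corpus_titles}
    return [t for t in tool_titles if normalize_title(t) in corpus]


# Hypothetical placeholder titles, for illustration only.
wos_corpus = [
    "Example Study A: LLM Feedback in Programming Courses",
    "Example Study B: Chatbots for Vocabulary Learning",
]
tool_results = [
    "Example study A - LLM feedback in programming courses",
    "Example Preprint C: A Survey of LLMs in Education",
]

shared = find_overlap(wos_corpus, tool_results)
print(len(shared), "of", len(tool_results), "tool results found in the corpus")
```

Exact matching after normalization misses retitled or translated versions of the same paper, so a zero-overlap result from a script like this would still warrant a manual check.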

4.5. Limitations

In order to provide as transparent a picture as possible of the analysis carried out, this section discusses the limitations of the study.

When creating the database, the filters used selected a total of 507 articles, flagged as “Article,” that were published in English between 2023 and 2024.

The extraction was performed only from the WoS database, selecting only articles that contained words referring to LLMs and education. The database’s keyword function is a significant advantage, as it allows the automatic identification of relevant terms in the cited bibliographies, expanding the semantic area of thematic analysis and facilitating the discovery of conceptual connections between works. Given the database’s longevity, its indexes, and the large number of articles it contains, as discussed in the earlier sections of the paper, the authors considered it appropriate to extract articles only from this source. However, even though WoS is renowned, using a single database can be considered a limitation, as it restricts coverage to publications indexed in WoS and may omit valuable articles indexed in other sources, like Scopus, IEEE Xplore, or Google Scholar. Although WoS ensures consistency, quality, and international comparability, the inclusion of additional databases could have expanded diversity, contributing to a more complete representation of the field under investigation. Nevertheless, the decision to limit the analysis to this database was also motivated by the desire to obtain a unified, clean, and comparable dataset, avoiding frequent overlap or duplication of publications.

Searching for specific keywords, even if it helps to identify articles in the field, can also be a limitation. Even if the most common words that make sense in the field analyzed were searched for, it is possible that certain relevant works were not included due to differences in terminology used by authors or alternative expressions not covered by the query. As a field in constant development, new terms and concepts emerge that may be indexed differently.

In addition, limiting the analysis to articles published exclusively in English and classified as “articles” may be considered another source of restriction in the coverage of the research. This was motivated by the fact that English is one of the most common languages for scientific communication and the most common option in studies indexed in major databases. However, such a decision may lead to the possible exclusion of valuable contributions in other languages, particularly in regional or national contexts where education and artificial intelligence research is active but not always published in international journals. Furthermore, focusing exclusively on articles implies the elimination of other relevant forms of scientific output, such as conference papers, book chapters, or technical reports, which may contain diverse perspectives. These types of sources are often the first to reflect innovative trends in rapidly expanding fields such as large language modeling.

Another aspect that needs to be discussed is the manual exclusion of 334 works. The authors chose to eliminate them because some of these publications, although they contained valuable information, did not meet the selection criteria established for the specific objectives of the research. This manual cleaning of the dataset was intended to improve the results of the analysis and reduce thematic differences, but it involved a certain degree of subjectivity in the process of making decisions.

Limiting the scope to articles published before 2025 may also be perceived as narrowing the temporal perspective, as it excludes the newest papers published in 2025 that had already been indexed in the WoS database at the moment of extraction. Although 2025 represents a natural continuation of this research, it was not included in the analysis because of the lack of complete bibliographic coverage at the moment of the research. The indexing process in major databases is often delayed, which could have led to statistical distortions or unintentional omissions.

Furthermore, the papers included in the study were extracted from the WoS Core Collection without further filtering by a stricter criterion, such as selecting only the papers in Q1 journals or in 10% of Q1 journals. Applying such a restriction to the extracted dataset could have dramatically distorted the coverage. As the analyzed field is new and rapidly evolving, excluding journals that are not in Q1 or 10% of Q1 would have hidden emerging research directions and biased the analysis toward established disciplines. Considering only Q1, and specifically 10% of Q1, might thus lead one to overlook early impactful AI-in-education papers that first appeared in Q2–Q4 journals before reaching Q1.

With all this in mind, the present paper has presented the top 10 most-cited papers in detail, as the citations a paper receives are associated with its impact on the academic community. Furthermore, we examined the quartiles in which the journals that published the most-cited papers are indexed and observed that 7 out of 10 are in Q1 according to the journal impact factor (JIF). This information is provided in Table 8 in terms of the JIF value and its quartile according to the last report released by Clarivate Analytics ISI Web of Science. The same search could be conducted for the remainder of the sources in the dataset, but as the dataset contains a high number of distinct sources (322) and each search is performed manually and individually, doing so is time-consuming. Thus, using another proxy for impactful papers, such as the number of citations received, is representative and follows the practices described in the scientific literature [40,118,119,120].

The purpose of listing these aspects was not to highlight certain disadvantages of the study, but to provide a clear picture of how the database selections were made and to describe the analysis technique applied.

5. Conclusions

This paper aimed to provide a systematic review of academic research on the use of LLMs in education, with emphasis on the recent period 2023–2024. It set out to describe in detail the 10 most-cited papers and to perform a thematic analysis identifying the most discussed topics, based on a total of 507 papers indexed in 322 academic sources. This analysis was considered important because of the rapid pace at which AI-based technologies are being integrated into educational processes, which calls for a clear, data-driven overview of how academic research is addressing this phenomenon.

Conducting a systematic analysis is essential to creating an integrated and coherent vision of applied LLM research in education. This approach allows not only the identification and synthesis of existing information, but also the highlighting of relationships, interdependencies, and emerging trends between different areas of study. In a field that is constantly expanding, systematic analysis provides a solid basis for a comprehensive understanding of the phenomenon. In addition, the period analyzed is extremely recent, providing up-to-date and concentrated information that is essential for decision-makers, researchers, and practitioners. Furthermore, the paper highlights gaps in knowledge, providing concrete directions for future research.

The results point to rapidly growing scientific interest: the annual growth rate of publications was 369.66%, and the mean age of the papers was only 1.18 years, reflecting how concentrated and recent this research topic is.
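To make these two indicators concrete, the following minimal sketch shows how they can be computed from yearly publication counts. The growth-rate formula is the usual compound annual growth rate, and the mean age is taken as the average distance between a reference year and the publication year. The per-year counts (89 and 418) and the reference year (2025) are not stated in this section; they are illustrative assumptions chosen to be consistent with the reported total of 507 papers, and the function names are hypothetical.

```python
def annual_growth_rate(first_count: int, last_count: int, n_years: int) -> float:
    """Compound annual growth rate (in %) across n_years publication years."""
    if n_years < 2:
        raise ValueError("at least two publication years are required")
    return ((last_count / first_count) ** (1 / (n_years - 1)) - 1) * 100


def mean_document_age(counts_by_year: dict[int, int], reference_year: int) -> float:
    """Average of (reference_year - publication_year) over all documents."""
    total_docs = sum(counts_by_year.values())
    total_age = sum((reference_year - year) * n for year, n in counts_by_year.items())
    return total_age / total_docs


# Illustrative counts (assumption): they sum to the 507 papers in the dataset.
counts = {2023: 89, 2024: 418}

growth = annual_growth_rate(counts[2023], counts[2024], n_years=len(counts))
age = mean_document_age(counts, reference_year=2025)  # reference year is an assumption

print(f"Annual growth rate: {growth:.2f}%")
print(f"Mean document age: {age:.2f} years")
```

With only two publication years, the compound rate reduces to the simple year-over-year change, which is why a short observation window can produce such a large percentage.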

Based on the results obtained from LDA and BERTopic analyses, four main directions in the investigation of LLM use in learning were identified: the technological foundations of LLMs, their impact on student performance and learning processes, the quality and readability of AI-generated knowledge, and the ethics and responsible use of these technologies. These directions reflect the fast evolution and diversification of the domain, marking a convergence between technological, pedagogical, and ethical aspects. Overall, the results confirm that recent studies have succeeded in establishing a solid framework for future interdisciplinary research aimed at the balanced integration of LLMs in education.

In addition, thematic maps contributed to a much more compact visual understanding of the domain, showing the existence of well-defined groups corresponding to the four major research directions. Their role was to illustrate both the maturity of some subfields and the potential for expansion of others, contributing to an integrated understanding of how LLMs are transforming the field of education.
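As a rough illustration of how a thematic map separates mature subfields from those with room to expand, the sketch below assigns themes to the four quadrants of a strategic diagram using two scores per theme: centrality (relevance to the field) and density (degree of internal development). The theme names and scores are hypothetical, and the median split is an assumption made for this example, not necessarily the thresholding used to produce the maps in this paper.

```python
from statistics import median


def classify_themes(themes: dict[str, tuple[float, float]]) -> dict[str, str]:
    """Map each theme's (centrality, density) pair to a strategic-diagram quadrant."""
    centrality_cut = median(c for c, _ in themes.values())
    density_cut = median(d for _, d in themes.values())
    labels = {}
    for name, (centrality, density) in themes.items():
        if centrality >= centrality_cut and density >= density_cut:
            labels[name] = "motor"  # central and well developed
        elif centrality < centrality_cut and density >= density_cut:
            labels[name] = "niche"  # well developed but peripheral
        elif centrality >= centrality_cut and density < density_cut:
            labels[name] = "basic"  # central but still underdeveloped
        else:
            labels[name] = "emerging or declining"  # peripheral and underdeveloped
    return labels


# Hypothetical (centrality, density) scores, for illustration only.
example_themes = {
    "theme A": (0.8, 0.3),
    "theme B": (0.9, 0.7),
    "theme C": (0.2, 0.6),
    "theme D": (0.3, 0.2),
}

for theme, quadrant in classify_themes(example_themes).items():
    print(f"{theme}: {quadrant}")
```

Under this reading, motor themes correspond to mature subfields, while basic and emerging themes indicate where further development is most likely.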

An important result of this research is the finding that the medical field and patient education constitute a major core in the current educational applications of LLMs, highlighting the direction in which they are tested, validated, and refined before being extended to other fields. Thus, it can be deduced that LLMs are mechanisms of educational transformation which, starting from fields with strict standards such as medicine, can generate replicable models in general, university, and professional education. Consequently, the consistent presence of medical themes confirms the interdisciplinary nature of AI-based education and opens up directions for future research.

With regard to future research, a first direction concerns evaluating the pedagogical effectiveness of LLMs in educational contexts by analyzing their impact on learners' achievement according to level of education or field of study. As noted in the article, the ethical implications of integrating this technology into the learning process also need further exploration, given the risks associated with plagiarism. Furthermore, research could examine how learning can be personalized with the help of LLMs, what level of accessibility to education these tools provide, and how they can support equal and adapted access to learning for all pupils and students.

Following these directions, a comprehensive and differentiated roadmap can be outlined, aimed at developing research and educational practices based on LLMs. For teachers, the priority should be the gradual introduction of AI tools into teaching and assessment activities, through the development of prompt engineering skills, the responsible use of AI-generated content, and the strengthening of critical information analysis skills. For policymakers, there is a need to develop ethical and institutional frameworks that support innovation while ensuring data protection, algorithmic transparency, and equitable access to AI-based educational resources. For technologists and researchers, the future focus should be on exploring emerging paradigms such as agentic AI, multimodal LLMs, and educational feedback systems based on reinforcement learning from human feedback (RLHF), which can lead to more adaptive, context-sensitive, and learner-centered learning environments. Future research should consider both longitudinal and mixed-methods approaches to analyze not only technological performance but also the pedagogical and ethical impact of these tools on the educational process.

Based on the four thematic directions identified in the analysis, this study proposes a roadmap designed to guide research on and practical applications of LLMs in education. For “Technological Foundations of LLMs in Education”, future research should move beyond technical comparisons between models, such as GPT vs. Gemini, and investigate how specific architectures, training datasets, and multimodal integration influence pedagogical effectiveness. Studies on the interpretability and transparency of models used in educational contexts could also be deepened. With regard to “Impact on Students, Performance, and Learning Outcomes”, systematic empirical evaluation of learning outcomes across different disciplines is recommended, with the aim of identifying the disciplines in which the use of LLMs leads to measurable improvements and those in which it may diminish cognitive engagement or critical thinking. For “Content Quality, Readability, and Literacy in AI Outputs”, comparative and longitudinal studies are recommended to assess the readability, bias, and appropriateness of AI-generated content, especially for learners with different levels of literacy or from different cultural backgrounds. For “Ethics, Personalization, and Responsible Use of LLMs”, research is needed on developing ethical and regulatory frameworks for the responsible use of artificial intelligence in education, including ethical audit mechanisms, teacher training, and the protection of students’ personal data.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics14234683/s1, File S1: LLM-PRISMA_2020_checklist. File S2: LLM-PRISMA_2020_flow_diagram.

Author Contributions

Conceptualization, B.-R.C., L.C., A.G.M. and L.-A.C.; Data curation, B.-R.C., L.C. and L.-A.C.; Formal analysis, B.-R.C., L.C. and L.-A.C.; Funding acquisition, L.-A.C.; Investigation, B.-R.C., L.C., A.G.M. and L.-A.C.; Methodology, B.-R.C., L.C. and L.-A.C.; Project administration, L.-A.C.; Resources, B.-R.C., A.G.M. and L.-A.C.; Software, B.-R.C. and L.-A.C.; Supervision, L.-A.C.; Validation, B.-R.C., L.C. and L.-A.C.; Visualization, B.-R.C. and L.-A.C.; Writing—original draft preparation, B.-R.C., L.C. and A.G.M.; Writing—review and editing, B.-R.C. and L.-A.C. All authors have read and agreed to the published version of the manuscript.

Funding

The work is supported by a grant of the Romanian Ministry of Research, Innovation and Digitalization, project CF 178/31.07.2023—‘JobKG—A Knowledge Graph of the Romanian Job Market based on Natural Language Processing’. This study was co-financed by The Bucharest University of Economic Studies during the PhD program.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Blázquez-Jiménez, C.; Sanchis, J.R. La coopetencia interempresarial. Descripción teórica y aplicación a sectores tecnológicos. RETOS. Rev. Cienc. Adm. Econ. 2023, 13, 325–340. [Google Scholar] [CrossRef]
Torres-Cruz, F.; Yucra-Mamani, Y.J. Técnicas de inteligencia artificial en la valoración de la enseñanza virtual por estudiantes de nivel universitario. Hum. Rev. Int. Humanit. Rev./Rev. Int. Humanidades 2022, 11, 1–11. [Google Scholar] [CrossRef]
Nguyen, A.; Ngo, H.N.; Hong, Y.; Dang, B.; Nguyen, B.-P.T. Ethical Principles for Artificial Intelligence in Education. Educ. Inf. Technol. 2023, 28, 4221–4241. [Google Scholar] [CrossRef]
López-Chila, R.; Llerena-Izquierdo, J.; Sumba-Nacipucha, N.; Cueva-Estrada, J. Artificial Intelligence in Higher Education: An Analysis of Existing Bibliometrics. Educ. Sci. 2024, 14, 47. [Google Scholar] [CrossRef]
Rubinger, L.; Gazendam, A.; Ekhtiari, S.; Bhandari, M. Machine Learning and Artificial Intelligence in Research and Healthcare. Injury 2023, 54, S69–S73. [Google Scholar] [CrossRef] [PubMed]
Xu, Y.; Liu, X.; Cao, X.; Huang, C.; Liu, E.; Qian, S.; Liu, X.; Wu, Y.; Dong, F.; Qiu, C.-W.; et al. Artificial Intelligence: A Powerful Paradigm for Scientific Research. Innovation 2021, 2, 100179. [Google Scholar] [CrossRef] [PubMed]
Kuleto, V.; Ilić, M.; Dumangiu, M.; Ranković, M.; Martins, O.M.D.; Păun, D.; Mihoreanu, L. Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions. Sustainability 2021, 13, 10424. [Google Scholar] [CrossRef]
Richard, C. Artificial Intelligence to Grow 47.5% in Education over Next 4 Years. The Journal, 24 March 2017. Available online: https://thejournal.com/articles/2017/03/24/ai-market-to-grow-47,-d-,5-percent-over-next-four-years.aspx (accessed on 22 June 2025).
Zara, A. Classrooms Are Adapting to the Use of Artificial Intelligence. Available online: https://www.apa.org/monitor/2025/01/trends-classrooms-artificial-intelligence (accessed on 22 June 2025).
Common Sense Media. The Dawn of the AI Era: Teens, Parents, and the Adoption of Generative AI at Home and School. Available online: https://www.commonsensemedia.org/research/the-dawn-of-the-ai-era-teens-parents-and-the-adoption-of-generative-ai-at-home-and-school (accessed on 22 June 2025).
Al-Khafaji, M.; Eryilmaz, M. Using Artificial Intelligence Methods to Predict Student Academic Achievement. In Proceedings of the Future Technologies Conference (FTC) 2021, Virtual, 28–29 October 2021; Arai, K., Ed.; Springer International Publishing: Cham, Switzerland, 2022; Volume 2, pp. 403–414. [Google Scholar]
Huang, A.Y.Q.; Lu, O.H.T.; Yang, S.J.H. Effects of Artificial Intelligence–Enabled Personalized Recommendations on Learners’ Learning Engagement, Motivation, and Outcomes in a Flipped Classroom. Comput. Educ. 2023, 194, 104684. [Google Scholar] [CrossRef]
Rahman, A. Mapping the Efficacy of Artificial Intelligence-Based Online Proctored Examination (OPE) in Higher Education during COVID-19: Evidence from Assam, India. Int. J. Learn. Teach. Educ. Res. 2022, 21, 76–94. [Google Scholar] [CrossRef]
UNESCO. 272 Million Children, Adolescents and Youth Are Out-of-School | Global Education Monitoring Report. Available online: https://www.unesco.org/gem-report/en (accessed on 22 June 2025).
Brants, T.; Popat, A.C.; Xu, P.; Och, F.J.; Dean, J. Large Language Models in Machine Translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, 28–30 June 2007; Eisner, J., Ed.; Association for Computational Linguistics: Prague, Czech Republic, 2007; pp. 858–867. [Google Scholar]
Li, J.; Dada, A.; Puladi, B.; Kleesiek, J.; Egger, J. ChatGPT in Healthcare: A Taxonomy and Systematic Review. Comput. Methods Programs Biomed. 2024, 245, 108013. [Google Scholar] [CrossRef]
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers). Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar]
Bates, T.; Cobo, C.; Mariño, O.; Wheeler, S. Can Artificial Intelligence Transform Higher Education? Int. J. Educ. Technol. High. Educ. 2020, 17, 42. [Google Scholar] [CrossRef]
Chatterjee, S.; Bhattacharjee, K.K. Adoption of Artificial Intelligence in Higher Education: A Quantitative Analysis Using Structural Equation Modelling. Educ. Inf. Technol. 2020, 25, 3443–3463. [Google Scholar] [CrossRef]
Song, P.; Wang, X. A Bibliometric Analysis of Worldwide Educational Artificial Intelligence Research Development in Recent Twenty Years. Asia Pac. Educ. Rev. 2020, 21, 473–486. [Google Scholar] [CrossRef]
Li, K.C.; Wong, B.T.-M. Artificial Intelligence in Personalised Learning: A Bibliometric Analysis. Interact. Technol. Smart Educ. 2023, 20, 422–445. [Google Scholar] [CrossRef]
Sekwatlakwatla, S.P.; Malele, V. Evaluations of Large Language Models a Bibliometric Analysis. Indones. J. Comput. Sci. 2024, 13. [Google Scholar] [CrossRef]
Mohammed, E.A.H.; Kovács, B.; Kuunya, R.; Mustafa, E.O.A.; Abbo, A.S.H.; Pál, K. Antibiotic Resistance in Aquaculture: Challenges, Trends Analysis, and Alternative Approaches. Antibiotics 2025, 14, 598. [Google Scholar] [CrossRef] [PubMed]
Sun, B.; Liu, J.; Zhang, X. A Bibliometric Analysis of the Three-North Shelter Forest Program. Forests 2025, 16, 977. [Google Scholar] [CrossRef]
Weng, M.; Qu, W.; Ma, E.; Wu, M.; Dong, Y.; Xi, X. Bibliometric Analysis of Digital Watermarking Based on CiteSpace. Symmetry 2025, 17, 871. [Google Scholar] [CrossRef]
Manchanda, N.; Chhabra, D.; Malik, M.; Kumar, K. WoS Bibliometric-Based Review on IoT in Healthcare Sector. In Proceedings of the 2023 International Conference on Communication, Security and Artificial Intelligence (ICCSAI), Greater Noida, India, 23–25 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 107–112. [Google Scholar]
Günek, B.; Yurttakal, A.H. Bibliometric Analysis of Research Papers on Blockchain Technologies. In Proceedings of the 2022 Innovations in Intelligent Systems and Applications Conference (ASYU), Antalya, Turkey, 7–9 September 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–5. [Google Scholar]
Pop, M.-D.; Micea, M.V. Visual Analysis of the Bibliometric Data Associated with the Calibration of Car-Following Models. In Proceedings of the 2024 20th International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT), Abu Dhabi, United Arab Emirates, 29 April–1 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 647–652. [Google Scholar]
Vijayalekshmi, S.; Twetwa-Dube, S.; Vinoth-Kumar, D.; Gumbo, S. The Role of Higher Education Institutions in Enabling the Fourth Industrial Revolution: A Bibliometric Analysis. In Proceedings of the 2023 IEEE AFRICON, Nairobi, Kenya, 20–22 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–3. [Google Scholar]
Hinojo-Lucena, F.-J.; Aznar-Díaz, I.; Cáceres-Reche, M.-P.; Romero-Rodríguez, J.-M. Artificial Intelligence in Higher Education: A Bibliometric Study on Its Impact in the Scientific Literature. Educ. Sci. 2019, 9, 51. [Google Scholar] [CrossRef]
Urzúa, C.A.C.; Ranjan, R.; Saavedra, E.E.M.; Badilla-Quintana, M.G.; Lepe-Martínez, N.; Philominraj, A. Effects of AI-Assisted Feedback via Generative Chat on Academic Writing in Higher Education Students: A Systematic Review of the Literature. Educ. Sci. 2025, 15, 1396. [Google Scholar] [CrossRef]
Osorio Vanegas, H.D.; Segovia Cifuentes, Y. de M.; Sobrino Morrás, A. Educational Technology in Teacher Training: A Systematic Review of Competencies, Skills, Models, and Methods. Educ. Sci. 2025, 15, 1036. [Google Scholar] [CrossRef]
Pranckutė, R. Web of Science (WoS) and Scopus: The Titans of Bibliographic Information in Today’s Academic World. Publications 2021, 9, 12. [Google Scholar] [CrossRef]
Singh, V.K.; Singh, P.; Karmakar, M.; Leta, J.; Mayr, P. The Journal Coverage of Web of Science, Scopus and Dimensions: A Comparative Analysis. Scientometrics 2021, 126, 5113–5142. [Google Scholar] [CrossRef]
ISI, W. KeyWords Plus Generation, Creation, and Changes. Available online: https://support.clarivate.com/ScientificandAcademicResearch/s/article/KeyWords-Plus-generation-creation-and-changes?language=en_US (accessed on 2 March 2024).
Sandu, A.; Diaconu, P.; Delcea, C.; Domenteanu, A. Emphasizing Grey Systems Contribution to Decision-Making Field Under Uncertainty: A Global Bibliometric Exploration. Mathematics 2025, 13, 1278. [Google Scholar] [CrossRef]
Cotfas, L.-A.; Sandu, A.; Delcea, C.; Diaconu, P.; Frăsineanu, C.; Stănescu, A. From Transformers to ChatGPT: An Analysis of Large Language Models Research. IEEE Access 2025, 13, 146889–146931. [Google Scholar] [CrossRef]
Panait, M.; Cibu, B.R.; Teodorescu, D.M.; Delcea, C. European Fund Absorption and Contribution to Business Environment Development: Research Output Analysis Through Bibliometric and Topic Modeling Analysis. Businesses 2025, 5, 45. [Google Scholar] [CrossRef]
Profiroiu, C.M.; Cibu, B.; Delcea, C.; Cotfas, L.-A. Charting the Course of School Dropout Research: A Bibliometric Exploration. IEEE Access 2024, 12, 71453–71478. [Google Scholar] [CrossRef]
Liu, W. The Data Source of This Study Is Web of Science Core Collection? Not Enough. Scientometrics 2019, 121, 1815–1824. [Google Scholar] [CrossRef]
Liu, F. Retrieval Strategy and Possible Explanations for the Abnormal Growth of Research Publications: Re-Evaluating a Bibliometric Analysis of Climate Change. Scientometrics 2023, 128, 853–859. [Google Scholar] [CrossRef] [PubMed]
Domenteanu, A.; Cibu, B.; Delcea, C.; Cotfas, L.-A. The World of Agent-Based Modeling: A Bibliometric and Analytical Exploration. Complexity 2025, 2025, 2636704. [Google Scholar] [CrossRef]
Domenteanu, A.; Cotfas, L.-A.; Diaconu, P.; Tudor, G.-A.; Delcea, C. AI on Wheels: Bibliometric Approach to Mapping of Research on Machine Learning and Deep Learning in Electric Vehicles. Electronics 2025, 14, 378. [Google Scholar] [CrossRef]
Domenteanu, A.; Delcea, C.; Florescu, M.-S.; Gherai, D.S.; Bugnar, N.; Cotfas, L.-A. United in Green: A Bibliometric Exploration of Renewable Energy Communities. Electronics 2024, 13, 3312. [Google Scholar] [CrossRef]
Sandu, A.; Cotfas, L.-A.; Stănescu, A.; Delcea, C. Guiding Urban Decision-Making: A Study on Recommender Systems in Smart Cities. Electronics 2024, 13, 2151. [Google Scholar] [CrossRef]
Sandu, A.; Cotfas, L.-A.; Delcea, C.; Ioanăș, C.; Florescu, M.-S.; Orzan, M. Machine Learning and Deep Learning Applications in Disinformation Detection: A Bibliometric Assessment. Electronics 2024, 13, 4352. [Google Scholar] [CrossRef]
Dobre, F.; Sandu, A.; Tătaru, G.-C.; Cotfas, L.-A. A Decade of Studies in Smart Cities and Urban Planning Through Big Data Analytics. Systems 2025, 13, 780. [Google Scholar] [CrossRef]
Mongeon, P.; Paul-Hus, A. The Journal Coverage of Web of Science and Scopus: A Comparative Analysis. Scientometrics 2016, 106, 213–228. [Google Scholar] [CrossRef]
Donthu, N.; Kumar, S.; Mukherjee, D.; Pandey, N.; Lim, W.M. How to Conduct a Bibliometric Analysis: An Overview and Guidelines. J. Bus. Res. 2021, 133, 285–296. [Google Scholar] [CrossRef]
Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 Statement: An Updated Guideline for Reporting Systematic Reviews. PLOS Med. 2021, 18, e1003583. [Google Scholar] [CrossRef]
Page, M.J.; Altman, D.G.; Shamseer, L.; McKenzie, J.E.; Ahmadzai, N.; Wolfe, D.; Yazdi, F.; Catalá-López, F.; Tricco, A.C.; Moher, D. Reproducible Research Practices Are Underused in Systematic Reviews of Biomedical Interventions. J. Clin. Epidemiol. 2018, 94, 8–18. [Google Scholar] [CrossRef]
Harzing, A.-W.; Alakangas, S. Google Scholar, Scopus and the Web of Science: A Longitudinal and Cross-Disciplinary Comparison. Scientometrics 2016, 106, 787–804. [Google Scholar] [CrossRef]
Martín-Martín, A.; Orduna-Malea, E.; Thelwall, M.; Delgado López-Cózar, E. Google Scholar, Web of Science, and Scopus: A Systematic Comparison of Citations in 252 Subject Categories. J. Informetr. 2018, 12, 1160–1177. [Google Scholar] [CrossRef]
Sugimoto, C.; Work, S.; Larivière, V.; Haustein, S. Scholarly Use of Social Media and Altmetrics: A Review of the Literature. J. Assoc. Inf. Sci. Technol. 2017, 68, 2037–2062. [Google Scholar] [CrossRef]
Moed, H. Citation Analysis in Research Evaluation; Springer: Dordrecht, The Netherlands, 2005; ISBN 978-1-4020-3713-9. [Google Scholar]
de Cerqueira, J.S.; Kemell, K.-K.; Rousi, R.; Xi, N.; Hamari, J.; Abrahamsson, P. Mapping Trustworthiness in Large Language Models: A Bibliometric Analysis Bridging Theory to Practice. arXiv 2025, arXiv:2503.04785. [Google Scholar]
Fan, L.; Li, L.; Ma, Z.; Lee, S.; Yu, H.; Hemphill, L. A Bibliometric Review of Large Language Models Research from 2017 to 2023. ACM Trans. Intell. Syst. Technol. 2024, 15, 91. [Google Scholar] [CrossRef]
Gencer, G.; Gencer, K. Large Language Models in Healthcare: A Bibliometric Analysis and Examination of Research Trends. J. Multidiscip. Healthc. 2025, 18, 223–238. [Google Scholar] [CrossRef]
Aria, M.; Cuccurullo, C. Bibliometrix: An R-Tool for Comprehensive Science Mapping Analysis. J. Informetr. 2017, 11, 959–975. [Google Scholar] [CrossRef]
Řehůřek, R.; Sojka, P. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, 22 May 2010. [Google Scholar] [CrossRef]
Wu, J.; Wang, Q.; Guo, Z.; Peng, C. AI-Driven Green Building Technology Innovation: Knowledge Structure, Evolution Trends, Research Paradigms and Future Prospects. Buildings 2025, 15, 1754. [Google Scholar] [CrossRef]
Dwivedi, Y.K.; Kshetri, N.; Hughes, L.; Slade, E.L.; Jeyaraj, A.; Kar, A.K.; Baabdullah, A.M.; Koohang, A.; Raghavan, V.; Ahuja, M.; et al. Opinion Paper: “So What If ChatGPT Wrote It?” Multidisciplinary Perspectives on Opportunities, Challenges and Implications of Generative Conversational AI for Research, Practice and Policy. Int. J. Inf. Manag. 2023, 71, 102642. [Google Scholar] [CrossRef]
Cooper, G. Examining Science Education in ChatGPT: An Exploratory Study of Generative Artificial Intelligence. J. Sci. Educ. Technol. 2023, 32, 444–452. [Google Scholar] [CrossRef]
Yeo, Y.H.; Samaan, J.S.; Ng, W.H.; Ting, P.-S.; Trivedi, H.; Vipani, A.; Ayoub, W.; Yang, J.D.; Liran, O.; Spiegel, B.; et al. Assessing the Performance of ChatGPT in Answering Questions Regarding Cirrhosis and Hepatocellular Carcinoma. Clin. Mol. Hepatol. 2023, 29, 721–732. [Google Scholar] [CrossRef]
Rahman, M.M.; Watanobe, Y. ChatGPT for Education and Research: Opportunities, Threats, and Strategies. Appl. Sci. 2023, 13, 5783. [Google Scholar] [CrossRef]
Perkins, M. Academic Integrity Considerations of AI Large Language Models in the Post-Pandemic Era: ChatGPT and Beyond. J. Univ. Teach. Learn. Pract. 2023, 20, 1–24. [Google Scholar] [CrossRef]
Jeon, J.; Lee, S. Large Language Models in Education: A Focus on the Complementary Relationship between Human Teachers and ChatGPT. Educ. Inf. Technol. 2023, 28, 15873–15892. [Google Scholar] [CrossRef]
Lyu, Q.; Tan, J.; Zapadka, M.E.; Ponnatapura, J.; Niu, C.; Myers, K.J.; Wang, G.; Whitlow, C.T. Translating Radiology Reports into Plain Language Using ChatGPT and GPT-4 with Prompt Learning: Results, Limitations, and Potential. Vis. Comput. Ind. Biomed. Art. 2023, 6, 9. [Google Scholar] [CrossRef] [PubMed]
Alqahtani, T.; Badreldin, H.A.; Alrashed, M.; Alshaya, A.I.; Alghamdi, S.S.; bin Saleh, K.; Alowais, S.A.; Alshaya, O.A.; Rahman, I.; Al Yami, M.S.; et al. The Emergent Role of Artificial Intelligence, Natural Learning Processing, and Large Language Models in Higher Education and Research. Res. Soc. Adm. Pharm. 2023, 19, 1236–1242. [Google Scholar] [CrossRef]
García-Peñalvo, F.J. La percepción de la Inteligencia Artificial en contextos educativos tras el lanzamiento de ChatGPT: Disrupción o pánico. Educ. Knowl. Soc. (EKS) 2023, 24, e31279. [Google Scholar] [CrossRef]
Oh, N.; Choi, G.-S.; Lee, W.Y. ChatGPT Goes to the Operating Room: Evaluating GPT-4 Performance and Its Potential in Surgical Education and Training in the Era of Large Language Models. Ann. Surg. Treat. Res. 2023, 104, 269–273. [Google Scholar] [CrossRef]
Nettleton, D. Chapter 6—Selection of Variables and Factor Derivation. In Commercial Data Mining; Nettleton, D., Ed.; Morgan Kaufmann: Boston, MA, USA, 2014; pp. 79–104. ISBN 978-0-12-416602-8. [Google Scholar]
Wilczewski, M.; Alon, I. Language and Communication in International Students’ Adaptation: A Bibliometric and Content Analysis Review. High. Educ. 2023, 85, 1235–1256. [Google Scholar] [CrossRef]
Lampropoulos, G. Augmented Reality, Virtual Reality, and Intelligent Tutoring Systems in Education and Training: A Systematic Literature Review. Appl. Sci. 2025, 15, 3223. [Google Scholar] [CrossRef]
Chuang, J.; Manning, C.D.; Heer, J. Termite: Visualization techniques for assessing textual topic models. In Proceedings of the AVI’12: Proceedings of the International Working Conference on Advanced Visual Interfaces, Capri Island, Italy, 21–25 May 2012; pp. 74–77. [Google Scholar]
Sievert, C.; Shirley, K. LDAvis: A method for visualizing and interpreting topics. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, Baltimore, MD, USA, 27 June 2014; pp. 63–70. [Google Scholar]
Uribe, S.E.; Maldupa, I.; Kavadella, A.; El Tantawi, M.; Chaurasia, A.; Fontana, M.; Marino, R.; Innes, N.; Schwendicke, F. Artificial Intelligence Chatbots and Large Language Models in Dental Education: Worldwide Survey of Educators. Eur. J. Dent. Educ. 2024, 28, 865–876. [Google Scholar] [CrossRef]
Tsai, M.-L.; Ong, C.W.; Chen, C.-L. Exploring the Use of Large Language Models (LLMs) in Chemical Engineering Education: Building Core Course Problem Models with Chat-GPT. Educ. Chem. Eng. 2023, 44, 71–95. [Google Scholar] [CrossRef]
Smith, A.; Hachen, S.; Schleifer, R.; Bhugra, D.; Buadze, A.; Liebrenz, M. Old Dog, New Tricks? Exploring the Potential Functionalities of ChatGPT in Supporting Educational Methods in Social Psychiatry. Int. J. Soc. Psychiatry 2023, 69, 1882–1889. [Google Scholar] [CrossRef]
Satpute, P.; Tiwari, S.; Gupta, M.; Ghosh, S. Exploring Large Language Models for Microstructure Evolution in Materials. Mater. Today Commun. 2024, 40, 109583. [Google Scholar] [CrossRef]
Azaiz, I.; Deckarm, O.; Strickroth, S. AI-Enhanced Auto-Correction of Programming Exercises: How Effective Is GPT-3.5? Int. J. Eng. Pedagog. (Ijep) 2023, 13, 67–83. [Google Scholar] [CrossRef]
Zhang, Z.; Huang, X. The Impact of Chatbots Based on Large Language Models on Second Language Vocabulary Acquisition. Heliyon 2024, 10, e25370. [Google Scholar] [CrossRef] [PubMed]
Liu, Z.-M.; Hwang, G.-J.; Chen, C.-Q.; Chen, X.-D.; Ye, X.-D. Integrating Large Language Models into EFL Writing Instruction: Effects on Performance, Self-Regulated Learning Strategies, and Motivation. Comput. Assist. Lang. Learn. 2024, 1–25. [Google Scholar] [CrossRef]
Mannekote, A.; Davies, A.; Pinto, J.D.; Zhang, S.; Olds, D.; Schroeder, N.L.; Lehman, B.; Zapata-Rivera, D.; Zhai, C.X. Large language models for whole-learner support: Opportunities and challenges. Front. Artif. Intell. 2024, 7, 1460364. [Google Scholar] [CrossRef] [PubMed]
Alfirević, N.; Rendulić, D.; Fošner, M.; Fošner, A. An Ethnographic Research Study of Artificial Intelligence. Informatics 2024, 11, 78. [Google Scholar] [CrossRef]
Zhang, L.; Zhang, W. Integrating large language models into project-based learning based on self-determination theory. Interact. Learn. Environ. 2024, 33, 3580–3592. [Google Scholar] [CrossRef]
Şekerci, Y.; Kahraman, M.U.; Develier, M. Enhancing interior design education with artificial intelligence: A multisensory hotel design. Interiors 2023, 13, 195–229. [Google Scholar] [CrossRef]
Zhang, D.-W.; Boey, M.; Tan, Y.Y.; Jia, A.H.S. Evaluating Large Language Models for Criterion-Based Grading from Agreement to Consistency. Npj Sci. Learn. 2024, 9, 79. [Google Scholar] [CrossRef]
Hackl, V.; Müller, A.E.; Granitzer, M.; Sailer, M. Is GPT-4 a Reliable Rater? Evaluating Consistency in GPT-4’s Text Ratings. Front. Educ. 2023, 8, 1272229. [Google Scholar] [CrossRef]
Arun, G.; Perumal, V.; Urias, F.P.J.B.; Ler, Y.E.; Tan, B.W.T.; Vallabhajosyula, R.; Tan, E.; Ng, O.; Ng, K.B.; Mogali, S.R. ChatGPT versus a Customized AI Chatbot (Anatbuddy) for Anatomy Education: A Comparative Pilot Study. Anat. Sci. Educ. 2024, 17, 1396–1405. [Google Scholar] [CrossRef]
Ehlert, A.; Ehlert, B.; Cao, B.; Morbitzer, K. Large Language Models and the North American Pharmacist Licensure Examination (NAPLEX) Practice Questions. Am. J. Pharm. Educ. 2024, 88, 101294.
Jeong, H.; Han, S.-S.; Yu, Y.; Kim, S.; Jeon, K.J. How well do large language model-based chatbots perform in oral and maxillofacial radiology? Dentomaxillofac. Radiol. 2024, 53, 390–395.
Tan, T.F.; Thirunavukarasu, A.J.; Campbell, J.P.; Keane, P.A.; Pasquale, L.R.; Abramoff, M.D.; Kalpathy-Cramer, J.; Lum, F.; Kim, J.E.; Baxter, S.L.; et al. Generative Artificial Intelligence Through ChatGPT and Other Large Language Models in Ophthalmology: Clinical Applications and Challenges. Ophthalmol. Sci. 2023, 3, 100394.
Wu, G.; Zhao, W.; Wong, A.; Lee, D.A. Patients with Floaters: Answers from Virtual Assistants and Large Language Models. Digit. Health 2024, 10, 20552076241229933.
Kooraki, S.; Hosseiny, M.; Jalili, M.H.; Rahsepar, A.A.; Imanzadeh, A.; Kim, G.H.; Hassani, C.; Abtin, F.; Moriarty, J.M.; Bedayat, A. Evaluation of ChatGPT-Generated Educational Patient Pamphlets for Common Interventional Radiology Procedures. Acad. Radiol. 2024, 31, 4548–4553.
Cohen, S.A.; Brant, A.; Fisher, A.C.; Pershing, S.; Do, D.; Pan, C. Dr. Google vs. Dr. ChatGPT: Exploring the Use of Artificial Intelligence in Ophthalmology by Comparing the Accuracy, Safety, and Readability of Responses to Frequently Asked Patient Questions Regarding Cataracts and Cataract Surgery. Semin. Ophthalmol. 2024, 39, 472–479.
Warr, M.; Oster, N.J.; Isaac, R. Implicit Bias in Large Language Models: Experimental Proof and Implications for Education. J. Res. Technol. Educ. 2024, 1–24.
Wachter, S.; Mittelstadt, B.; Russell, C. Do Large Language Models Have a Legal Duty to Tell the Truth? R. Soc. Open Sci. 2024, 11, 240197.
Kieser, F. Educational Data Augmentation in Physics Education Research Using ChatGPT. Phys. Rev. Phys. Educ. Res. 2023, 19, 020150.
Nguyen, H.; Nguyen, V.; Ludovise, S.; Santagata, R. Misrepresentation or Inclusion: Promises of Generative Artificial Intelligence in Climate Change Education. Learn. Media Technol. 2025, 50, 393–409.
Naqvi, W.M.; Shaikh, S.Z.; Mishra, G.V. Large Language Models in Physical Therapy: Time to Adapt and Adept. Front. Public Health 2024, 12, 1364660.
Rahimzadeh, V.; Kostick-Quenet, K.; Blumenthal Barby, J.; McGuire, A.L. Ethics Education for Healthcare Professionals in the Era of ChatGPT and Other Large Language Models: Do We Still Need It? Am. J. Bioeth. 2023, 23, 17–27.
Hobensack, M.; von Gerich, H.; Vyas, P.; Withall, J.; Peltonen, L.-M.; Block, L.J.; Davies, S.; Chan, R.; Van Bulck, L.; Cho, H.; et al. A Rapid Review on Current and Potential Uses of Large Language Models in Nursing. Int. J. Nurs. Stud. 2024, 154, 104753.
Kirwan, A. ChatGPT and University Teaching, Learning and Assessment: Some Initial Reflections on Teaching Academic Integrity in the Age of Large Language Models. Ir. Educ. Stud. 2024, 43, 1389–1406.
Jablonka, K.M.; Ai, Q.; Al-Feghali, A.; Badhwar, S.; Bocarsly, J.D.; Bran, A.M.; Bringuier, S.; Brinson, L.C.; Choudhary, K.; Circi, D.; et al. 14 Examples of How LLMs Can Transform Materials Science and Chemistry: A Reflection on a Large Language Model Hackathon. Digit. Discov. 2023, 2, 1233–1250.
Wang, S.; Ni, L.; Zhang, Z.; Li, X.; Zheng, X.; Liu, J. Multimodal Prediction of Student Performance: A Fusion of Signed Graph Neural Networks and Large Language Models. Pattern Recognit. Lett. 2024, 181, 1–8.
Teckwani, S.H.; Wong, A.H.-P.; Luke, N.V.; Low, I.C.C. Accuracy and Reliability of Large Language Models in Assessing Learning Outcomes Achievement across Cognitive Domains. Adv. Physiol. Educ. 2024, 48, 904–914.
Lang, Q.; Tian, S.; Wang, M.; Wang, J. Exploring the Answering Capability of Large Language Models in Addressing Complex Knowledge in Entrepreneurship Education. IEEE Trans. Learn. Technol. 2024, 17, 2053–2062.
Wang, X.; Reynolds, B.L. Beyond the Books: Exploring Factors Shaping Chinese English Learners’ Engagement with Large Language Models for Vocabulary Learning. Educ. Sci. 2024, 14, 496.
Hershcovits, H.; Vilenchik, D.; Gal, K. Modeling Engagement in Self-Directed Learning Systems Using Principal Component Analysis. IEEE Trans. Learn. Technol. 2020, 13, 164–171.
Li, H.; Zhu, S.; Wu, D.; Yang, H.H.; Guo, Q. Impact of Information Literacy, Self-Directed Learning Skills, and Academic Emotions on High School Students’ Online Learning Engagement: A Structural Equation Modeling Analysis. Educ. Inf. Technol. 2023, 28, 13485–13504.
Morris, W.; Holmes, L.; Choi, J.S.; Crossley, S. Automated Scoring of Constructed Response Items in Math Assessment Using Large Language Models. Int. J. Artif. Intell. Educ. 2025, 35, 559–586.
Kieser, F.; Tschisgale, P.; Rauh, S.; Bai, X.; Maus, H.; Petersen, S.; Stede, M.; Neumann, K.; Wulff, P. David vs. Goliath: Comparing Conventional Machine Learning and a Large Language Model for Assessing Students’ Concept Use in a Physics Problem. Front. Artif. Intell. 2024, 7, 1408817.
Kurian, N. ‘No, Alexa, No!’: Designing Child-Safe AI and Protecting Children from the Risks of the ‘Empathy Gap’ in Large Language Models. Learn. Media Technol. 2024, 1–14.
Morris, W.; Crossley, S.; Holmes, L.; Ou, C.; Dascalu, M.; McNamara, D. Formative Feedback on Student-Authored Summaries in Intelligent Textbooks Using Large Language Models. Int. J. Artif. Intell. Educ. 2025, 35, 1022–1043.
Nica, I. Bibliometric Mapping in the Landscape of Cybernetics: Insights into Global Research Networks. Kybernetes 2024, 54, 3322–3357.
Nica, I.; Georgescu, I.; Chiriță, N. Simulation and Modelling as Catalysts for Renewable Energy: A Bibliometric Analysis of Global Research Trends. Energies 2024, 17, 3090.
Domenteanu, A.; Diaconu, P.; Florescu, M.-S.; Delcea, C. The Road to Autonomy: A Systematic Review Through AI in Autonomous Vehicles. Electronics 2025, 14, 4174.

Figure 1. Dataset selection.

Figure 2. Steps in systematic review.

Figure 3. Factorial analysis—Keywords Plus.

Figure 4. Factorial analysis—title bigrams.

Figure 5. Thematic map—Keywords Plus.

Figure 6. Thematic map—abstract bigrams.

Figure 7. LDA results [77,78].

Figure 8. LDA results—Topic 1 [77,78].

Figure 9. LDA results—Topic 2 [77,78].

Figure 10. LDA results—Topic 3 [77,78].

Figure 11. LDA results—Topic 4 [77,78].

Figure 12. BERTopic results.

Figure 13. BERTopic top 5 keywords by topic.

Table 1. Data information.

Indicator	Value
Timespan	2023–2024
Sources (Journals, Books, etc.)	322
Documents	507
Annual Growth Rate %	369.66
Document Average Age	1.18
Average citations per document	15.03
References	19,016
Keywords Plus	413
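
The annual growth rate in Table 1 can be sanity-checked with a short calculation. The sketch below assumes the indicator is a compound annual growth rate between the first and last year of the timespan; that formula is an assumption here, not something stated in the table. Because the per-year document counts are not listed in Table 1, the code searches for the split of the 507 documents across 2023 and 2024 that reproduces the reported 369.66%. The function name `annual_growth_rate` is illustrative.

```python
def annual_growth_rate(first_year_docs, last_year_docs, n_years=2):
    """Compound annual growth rate, in percent, between the first and last year.

    Assumed formula: ((last / first) ** (1 / (n_years - 1)) - 1) * 100.
    """
    return ((last_year_docs / first_year_docs) ** (1 / (n_years - 1)) - 1) * 100


TOTAL_DOCS = 507        # "Documents" row of Table 1
REPORTED_RATE = 369.66  # "Annual Growth Rate %" row of Table 1

# Try every possible 2023/2024 split and keep those that match the reported rate.
candidates = [
    (n_2023, TOTAL_DOCS - n_2023)
    for n_2023 in range(1, TOTAL_DOCS)
    if round(annual_growth_rate(n_2023, TOTAL_DOCS - n_2023), 2) == REPORTED_RATE
]
print(candidates)  # [(89, 418)]
```

Under this assumption the only consistent split is 89 documents in 2023 and 418 in 2024. These counts are inferred from the reported totals and should be read as a consistency check rather than as reported data.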

Table 2. Main information about authors.

Indicator	Value
Single-authored docs	52
Co-authors per document	4.48
International co-authorships %	26.04
Authors’ Keywords	1451

Table 3. Top 10 most cited documents.

No.	Paper (First Author, Year, Journal, Reference)	Number of Authors	Total Citations (TCs)	Total Citations per Year (TCY)	Normalized TCs (NTCs)
1	Dwivedi YK, 2023, International Journal of Information Management [64]	73	1336	445.33	23.08
2	Cooper G, 2023, Journal of Science Education and Technology [65]	1	415	138.33	7.17
3	Yeo YH, 2023, Clinical and Molecular Hepatology [66]	11	320	106.67	5.53
4	Rahman MM, 2023, Applied Sciences—Basel [67]	2	304	101.33	5.25
5	Perkins M, 2023, Journal of University Teaching and Learning Practice [68]	1	222	74.00	3.84
6	Jeon J, 2023, Education and Information Technologies [69]	2	188	62.67	3.25
7	Lyu Q, 2023, Visual Computing for Industry, Biomedicine, and Art [70]	8	136	45.33	2.35
8	Alqahtani T, 2023, Research in Social and Administrative Pharmacy [71]	11	104	34.67	1.80
9	Garcia-Penalvo FJ, 2023, Education in the Knowledge Society (EKS) [72]	1	103	34.33	1.78
10	Oh N, 2023, Annals of Surgical Treatment and Research [73]	3	98	32.67	1.69
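
The two derived columns in Table 3 can be recomputed from the total citations. The sketch below rests on two assumptions that the table does not state: TCY is taken as total citations divided by three years in print (2023 to 2025), and NTC as total citations divided by the mean citation count of dataset documents published in the same year. The row values are copied from Table 3; the names `YEARS_IN_PRINT` and `citations_per_year` are illustrative.

```python
# (first author, total citations, reported TCY, reported NTC), copied from Table 3
rows = [
    ("Dwivedi YK", 1336, 445.33, 23.08),
    ("Cooper G", 415, 138.33, 7.17),
    ("Yeo YH", 320, 106.67, 5.53),
    ("Rahman MM", 304, 101.33, 5.25),
    ("Perkins M", 222, 74.00, 3.84),
    ("Jeon J", 188, 62.67, 3.25),
    ("Lyu Q", 136, 45.33, 2.35),
    ("Alqahtani T", 104, 34.67, 1.80),
    ("Garcia-Penalvo FJ", 103, 34.33, 1.78),
    ("Oh N", 98, 32.67, 1.69),
]

YEARS_IN_PRINT = 3  # assumption: papers from 2023 counted over 2023, 2024, 2025


def citations_per_year(total_citations, years=YEARS_IN_PRINT):
    """Total citations per year, rounded to two decimals as in Table 3."""
    return round(total_citations / years, 2)


# Every reported TCY value should match TC / 3.
tcy_matches = [citations_per_year(tc) == tcy for _, tc, tcy, _ in rows]

# If NTC = TC / (mean citations of same-year documents), then TC / NTC
# recovers that mean; it should be roughly constant across rows.
implied_means = [tc / ntc for _, tc, _, ntc in rows]

print(all(tcy_matches))
print(round(min(implied_means), 1), round(max(implied_means), 1))
```

All ten TCY values match a three-year divisor, and the implied 2023 mean falls in a narrow band just under 58 citations per document, which is consistent with the assumed normalization.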

Table 4. Content of the 10 most-cited papers.

No.	Paper (Primary Author, Year, Journal, Reference)	Title	Purpose
1	Dwivedi YK, 2023, International Journal of Information Management [64]	Opinion Paper: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy	Exploring the impact of generative AI technologies, such as ChatGPT, on organizations, society, and individuals in an interdisciplinary manner.
2	Cooper G, 2023, Journal of Science Education and Technology [65]	Examining Science Education in ChatGPT: An Exploratory Study of Generative Artificial Intelligence	The article aims to stimulate a broader discussion around the place of generative AI in science education.
3	Yeo YH, 2023, Clinical and Molecular Hepatology [66]	Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma	Assessment of the accuracy, comprehensiveness, and emotional support capabilities of ChatGPT in answering related questions regarding management, knowledge, and support for patients suffering from cirrhosis or hepatocellular carcinoma.
4	Rahman MM, 2023, Applied Sciences—Basel [67]	ChatGPT for Education and Research: Opportunities, Threats, and Strategies	Investigating both the opportunities and risks associated with using ChatGPT in research and education, with a particular focus on supporting programming education.
5	Perkins M, 2023, Journal of University Teaching and Learning Practice [68]	Academic Integrity considerations of AI large language models in the post-pandemic era: ChatGPT and beyond	Exploring the implications of using generative AI, such as ChatGPT, for academic integrity.
6	Jeon J, 2023, Education and Information Technologies [69]	Large language models in education: A focus on the complementary relationship between human teachers and ChatGPT	Exploring the ChatGPT/teacher relationship, focusing on identifying the complementary roles each can play in the educational process.
7	Lyu Q, 2023, Visual Computing for Industry, Biomedicine, and Art [70]	Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: results, limitations, and potential	Assessing the feasibility of using ChatGPT for translating radiology reports into clear language that is accessible to patients and healthcare professionals.
8	Alqahtani T, 2023, Research in Social and Administrative Pharmacy [71]	The emergent role of artificial intelligence, natural learning processing, and large language models in higher education and research	Detailed overview of AI, natural language processing, and LLMs, highlighting their potential impact on education and research.
9	Garcia-Penalvo FJ, 2023, Education in the Knowledge Society (EKS) [72]	The perception of artificial intelligence in educational contexts after the launch of ChatGPT: disruption or panic	Understanding and analyzing the impact of ChatGPT technology, particularly in the field of education, in order to capitalize on its benefits and prevent any negative effects, adapting existing processes to new technological realities.
10	Oh N, 2023, Annals of Surgical Treatment and Research [73]	ChatGPT goes to the operating room: evaluating GPT-4’s performance and its potential in surgical education and training in the era of large language models	Evaluating the performance of ChatGPT models (GPT-4 and GPT-3.5) for understanding complex information in the field of general surgery.

Table 5. LDA Topics parallel to themes in thematic maps.

Topic (Percentage of Tokens Retained)	Most Salient Words	LDA Topic Focus	Matching Theme
Topic 1 (65.7% of tokens)	student, generative_ai, gpt, feedback, design, technology, learning, support, educator	LDA Topic 1—Student-Centered Applications of Generative AI in Learning	Theme 1—Technological Foundations of LLMs in Education; Theme 2—Educational Impact for Students, Performance, and Learning Outcomes
Topic 2 (21.6% of tokens)	question, response, patient, patient_education, accuracy, readability, safety, exam, evaluate, test	LDA Topic 2—Evaluation, Readability, and Safety in AI-Generated Educational Content	Theme 2—Educational Impact for Students, Performance, and Learning Outcomes; Theme 3—Content Quality, Readability, and Literacy
Topic 3 (6.4% of tokens)	chatbot, accuracy, readability, health_literacy, evaluate, quality, summary, caregiver, expert	LDA Topic 3—Chatbots, Health Literacy, and Content Quality	Theme 3—Content Quality, Readability, and Literacy
Topic 4 (6.3% of tokens)	child, student, creativity, emotion, development, bias, disparity, literacy, ai_driven, science	LDA Topic 4—Ethics, Bias, and Fairness in AI-Driven Education	Theme 4—Ethics, Personalization, and Responsible Use of AI in Education; Theme 2—Educational Impact for Students, Performance, and Learning Outcomes

Table 6. BERTopics parallel to themes in thematic maps.

BERTopic (Size)	Most Salient Words	BERTopic Focus	Matching Theme
BERTopic 0 (size 397)	ai, education, chatgpt, learning, students, research, study, use, educational, using	BERTopic 0—General Applications of AI/ChatGPT in Education with Focus on Student and Learning	Theme 1—Technological Foundations of LLMs in Education; Theme 2—Educational Impact for Students, Performance, and Learning Outcomes
BERTopic 1 (size 51)	chatgpt, questions, responses, patient, accuracy, ai, surgery, education, dental, information	BERTopic 1—Accuracy and Evaluation of ChatGPT in Medical or Dental Education	Theme 2—Educational Impact for Students, Performance, and Learning Outcomes; Theme 3—Content Quality, Readability, and Literacy
BERTopic 2 (size 7)	google, responses, patients, questions, chatgpt, readability, floaters, ophthalmologists, patient, gemini	BERTopic 2—Comparisons of ChatGPT vs. Gemini in Patient Education with a Focus on Ophthalmology	Theme 3—Content Quality, Readability, and Literacy
BERTopic 3 (size 4)	radiology, pamphlets, reports, summaries, reader, 99, radiology reports, gpt4, radiologists, accuracy	BERTopic 3—GPT-4 in Radiology for Patient Education	Theme 3—Content Quality, Readability, and Literacy

Table 8. Impact factor (IF) of the journals that published the top 10 most-cited papers.

Journal	2024 JIF	JIF Quartile
International Journal of Information Management	27.0	Q1
Clinical and Molecular Hepatology	16.9	Q1
Visual Computing for Industry, Biomedicine, and Art	6.0	Q1
Journal of Science Education and Technology	5.5	Q1
Education and Information Technologies	5.4	Q1
Journal of University Teaching and Learning Practice	4.4	Q1
Education in the Knowledge Society	2.8	Q1
Research in Social & Administrative Pharmacy	2.8	Q2
Applied Sciences—Basel	2.5	Q2
Annals of Surgical Treatment and Research	1.6	Q3

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

