Article

Use of ChatGPT as a Virtual Mentor on K-12 Students Learning Science in the Fourth Industrial Revolution

by Rafael Castañeda 1,*, Andrea Martínez-Gómez-Aldaraví 2, Laura Mercadé 2, Víctor Jesús Gómez 2, Teresa Mengual 2, Francisco Javier Díaz-Fernández 2, Miguel Sinusia Lozano 2, Juan Navarro Arenas 3, Ángela Barreda 4, Maribel Gómez 2, Elena Pinilla-Cienfuegos 2 and David Ortiz de Zárate 2,*

1 IES de Benaguasil, Calle Segorbe 2, Benaguasil, 46180 València, Spain
2 Nanophotonics Technology Center (NTC), Universitat Politècnica de València (UPV), Camí de Vera s/n, 46022 València, Spain
3 Department for Quantum Technology, Faculty of Physics, 48149 Münster, Germany
4 Group of Displays and Photonics Applications, Carlos III University of Madrid, Avda. de la Universidad, 30, Leganés, 28911 Madrid, Spain
* Authors to whom correspondence should be addressed.
Knowledge 2024, 4(4), 582-614; https://doi.org/10.3390/knowledge4040031
Submission received: 10 September 2024 / Revised: 5 November 2024 / Accepted: 3 December 2024 / Published: 5 December 2024
(This article belongs to the Special Issue New Trends in Knowledge Creation and Retention)

Abstract: Education 4.0 arises to provide citizens with the technical/digital competencies and cognitive/interpersonal skills demanded by Industry 4.0. New technologies drive this change, though time-independent learning remains a challenge, because students may lack support, advice and supervision when teachers are unavailable. This study proposes complementing in-person lessons with online learning driven by ChatGPT, applied as an educational tool able to mentor K-12 students learning science at home. First, ChatGPT’s performance in the field of K-12 science is evaluated, scoring A (9.3/10 in 2023, and 9.7/10 in 2024) and providing detailed, analytic, meaningful, and human-like answers. Then, an empirical interventional study is performed to assess the impact of using ChatGPT as a virtual mentor on real K-12 students. After the intervention, the grades of students in the experimental group improved by 30%, and 70% of students reported a positive perception of the AI, suggesting a positive impact of the proposed educational approach. After discussion, the study concludes that ChatGPT might be a useful educational tool able to provide K-12 students learning science with the functional and social/emotional support they might require, democratizing a higher level of knowledge acquisition and promoting students’ autonomy, security and self-efficacy. The results prove ChatGPT’s remarkable capacity (and immense potential) to assist teachers in their mentoring tasks, laying the foundations of virtual mentoring and paving the way for future research aimed at extending the study to other areas and levels, obtaining a more realistic view of AI’s impact on education.

1. Introduction

“It’s the end of the world as we know it” is not only the title of a song, but also a proper description of current societies worldwide, facing the significant and rapid changes brought about by the Fourth Industrial Revolution (4IR) that every citizen is witnessing [1,2].
The previous industrial revolutions were driven by technological advances, and their effect on industry and societies was profound [3]. While the 1IR employed steam and water power to mechanize manufacturing, the 2IR enabled mass production through the use of electricity and the division of labor, and the 3IR was powered by electronics and computerization, allowing the automation of manufacturing and the analog-to-digital transition, which changed the world’s capacity to store information in digital format from less than 1% in the 1980s to more than 99% by 2014 [4].

1.1. Fourth Industrial Revolution (4IR)

Conversely, the present industrial revolution pursues a new paradigm of smart, autonomous, and sustainable manufacturing: the so-called Industry 4.0 [5], whose foundational pillars were conceived to empower every citizen and every government to build a better and more inclusive, human-centered world [3]. This revolution still exploits the key information-digitalization developments introduced by the Digital Revolution (3IR) as early as 1947 [6]: transistors, integrated circuits, microprocessors and computers, the Internet, digital mobile phones and even digital TV. However, the 4IR is mainly driven by a set of disruptive technologies that blur the lines between the physical, digital, and biological worlds (through cyber-physical systems) [3]. These technologies are aimed at [5,7]:
(1)
Increasing connectivity, data, and computational power (cloud technology, smart sensors and actuators—even wearables, blockchain…)
(2)
Boosting analytics and system intelligence (advanced analytics, machine learning, neural networks, artificial intelligence (AI)…)
(3)
Promoting machine–machine and human–machine interaction (extended reality (XR), encompassing virtual, augmented and mixed reality (VR, AR and MR); digital twins; robotics; automation; autonomous guided vehicles; the Internet of Things; the Internet of Systems…)
(4)
Enhancing advanced engineering (additive manufacturing such as 3D printing, ICTs, nanotechnology, renewable energies, biotechnology…)
All these disruptive technologies are leading Industry 4.0 towards the concept of smart manufacturing (as production becomes faster, closer and more responsive to customer/market requirements), exhibiting unprecedented degrees of interoperability, virtualization, decentralization, real-time capability, modularity, information transparency and technical assistance [2,8]. Consequently, the 4IR is not only improving the efficiency of business and keeping billions of people interconnected, but also enhancing sustainability through better management of resources, contributing to regenerating the natural environment and potentially reversing the damage earlier industrial revolutions provoked [9].
Thus, the 4IR is already changing the way we perceive the world and work, and the way we think, relate and live on an unprecedented scale, scope and complexity. The opportunity and the responsibility to build a better and sustainable future through Industry 4.0 must be embraced by worldwide governments, companies, industries, academia, and civil society [3].
Since technology is only one element within a complex adaptive landscape, workers and citizens should also recalibrate their skillsets to navigate a future where automated systems will increasingly assume heavier duties, leaving humans tasks related to programming, supervision and repair. Consequently, skills that remain outside the capabilities of machines are anticipated to hold substantial value within the labor market. These competencies can be categorized into three primary domains [10]:
(1)
Advanced technical skills: Proficiency in information and communication technologies (ICT), Big Data and data analysis, network management, programming, 3D printing, nano/biotechnology…
(2)
High-order cognitive skills: Abilities that encompass critical thinking, complex problem-solving, and informed decision-making.
(3)
Human and interpersonal skills: Competencies including creativity, social engagement and emotional intelligence.
Societies must empower their citizens with those skills and digital competencies, and the best way to do so is through education.

1.2. Education

Old-school education based on the Empty Container Paradigm (students are an empty container that must be filled with knowledge) presents several problems in delivering the fit-for-purpose curriculum demanded by Industry 4.0 [11,12]:
(1)
The current teaching process leaves ample room for improved interactivity (which is required for better and longer-lasting learning)
(2)
Assessments only evaluate the amount of knowledge learned, not the competencies/skills acquired
(3)
There is a wide time gap between receiving knowledge and its application in practice.

1.3. Education 4.0

Education evolved over the years until reaching the concept of Education 4.0 [13,14,15,16], aimed at providing students with not only technical/digital competencies, but also cognitive and human/interpersonal skills required by society and Industry 4.0. Indeed, it was clearly boosted worldwide by COVID-19-related school closures [17,18,19,20]. The foundations of Education 4.0 might be summarized as [11,16]:
(1)
New student-centered learning strategies (heutagogy, peeragogy, cybergogy…)
(2)
Location and time-independent learning
(3)
Personalized learning
(4)
Interactive/collaborative learning
(5)
Gamification to raise engagement
(6)
Online sources of information (web, massive open online courses—MOOCs…)
(7)
Teacher to mentor transition
Many of these pillars are currently being fulfilled by using 4IR technologies. According to the literature [17,21,22], the most frequently applied technologies within Education 4.0 are VR, AR, MR, eye-tracking, learning analytics, robotics, and simulation, as well as the tools enabling online learning such as streaming lectures, virtual classrooms, digital boards, cloud systems, and MOOCs. AI is also applied in combination with learning analytics to assess students’ progress in order to find weaknesses and adapt the education process to their particular needs, enabling a more personalized education [17,23,24,25,26].
A simple problem hampering efficient time-independent learning (part of the second pillar of Education 4.0) is the limited availability of mentors to assist students, which can never be complete. Therefore, there is a growing interest in MOOCs and virtual mentoring as effective tools complementing learning outside of school. However, MOOCs lack interactivity, mentoring capacity, and the freedom to provide personalized answers to a student’s particular doubts, while current virtual mentoring still refers to remote mentoring, that is, connecting students with their human mentors, which does not remove the availability limitation.

2. Literature Review

2.1. Origins of AI in Education: Intelligent Tutoring Systems

The use of AI in education has been explored since the 1970s, when computers were devised not only as tools but also as potential tutors [27], developing a new field of research using computers to smartly coach students, termed Intelligent Computer Assisted Instruction or Intelligent Tutoring Systems (ITS) [28,29,30,31]. Computer learning has evolved over time, integrating artificial intelligence. Thus, plenty of dialogue-based systems have been developed [29,30,31] for assisting students in learning different subjects, from STEM sciences to politics, claiming to provide students with a meaningful interaction, which is key for long-term learning [32], and pointing out students’ difficulties in order to duly address them, promoting customized learning. However, ITS effectiveness has been debated over the years [33,34]. Additionally, these solutions have been purposely designed for teaching, so they are forced to follow a pre-programmed method, which limits the capacity of an ITS to solve the particular doubts a student faces when doing homework without a tutor. This might explain why these solutions have barely been employed in experimental lessons, including problem-solving and/or decision-making in chemistry, physics, and clinical fields [30].
Other researchers have assessed the use of earlier chatbots as “real-time educational assistants for the teacher”, but these actually lack interactivity, relating to students only through pre-programmed, unintelligent answers, and their sole aim is raising engagement and generating statistics on students’ understanding levels [32,35,36,37]. Again, these systems lack the freedom to answer a student’s particular doubts.

2.2. AI Evolution: Generative AI

The growth of AI during the last few years has recently been quantified, not only in numbers but also in real applications within the market: the adoption of AI in companies (50%) has doubled since 2017 (20%), and the average number of AI capabilities they exploit (such as natural-language generation or computer vision) has also doubled, from 1.9 in 2018 to 3.8 in 2022 [38], according to a McKinsey Global Survey [39]. In 2023, 149 new AI foundation models were released (mainly from industry), more than double the number released in 2022, and the number of AI patents sharply increased worldwide (67% from 2021 to 2022). Surprisingly, while the US is the leading source of those models, China led global AI patent origins with 61% (compared to 21% of US origin) [40]. Today, AI can be applied not only to process information and make decisions (in fields such as weather forecasting [41] or chemistry [42,43], among many others), but also to generate original text, images and video content. Even though AI has already surpassed human performance on several benchmarks and domains (such as image classification, visual reasoning, and English understanding), human beings still perform better than AI on more complex tasks like competition-level mathematics, visual commonsense reasoning and planning [40].
Concerning natural language processing, in 2018 OpenAI (San Francisco, CA, USA) developed an autoregressive language model called GPT (Generative Pre-trained Transformer), exploiting deep learning to produce human-like text from image/text inputs [44]. Its latest version (at the time of writing), GPT-4, released on 14 March 2023, powers an artificial intelligence chatbot called ChatGPT, able to write and debug computer programs [45,46], generate documents such as essays [47], answer test questions [48], write poetry and lyrics, compose music [49], and even provide assistance in complex scientific tasks [50,51,52]. Despite this impressive demonstration of capabilities within a short time of existence, ChatGPT is just a large-scale multimodal model powered by AI, so its application in fields beyond the strict generation of text is currently revealing problems related to a lack of appropriate training, the guarantee of intellectual property rights, and hallucinations (mistaken or nonsensical answers that seem semantically or syntactically correct), which are allegedly limited in the latest version of the language model and should diminish with increased training [53]. As a consequence, this technology is attracting as much praise as widespread criticism from artists, ethicists, academics, educators, and journalists [54,55,56,57].

2.3. Generative AI in Education

There is a recent and growing interest in assessing the potential of AI-powered chatbots (such as ChatGPT) as educational tools in quite different fields such as language, programming, mathematics, medicine and economics, among many others [58,59,60,61]. Many of those studies focus on evaluating the chatbot’s ability to ask, and also answer, particular fact-based and test questions. Even OpenAI has evaluated its own technology by means of exams purposely designed for humans: while GPT-3.5’s performance lay in the bottom 10% of test takers, GPT-4 scored higher than the majority of human test takers (top 10%). Those previous studies have promoted an open debate around the potential benefits and limitations of ChatGPT in the field of education [44,58,62,63,64,65,66,67], suggesting its potential use as a teaching assistant [64,65,66,68,69] automatically tackling duties such as creating assessment tests, grading, guiding, and recommending learning materials, though, on the other hand, also claiming from a theoretical perspective that ChatGPT “lacks the ability to provide emotional support, and facilitate critical thinking and problem-solving skills in science learning” [68].

2.4. Generative AI in Science Education

Concerning the use of ChatGPT in the field of science education, studies have usually explored generative AI’s capability to answer a few theoretical questions [68,70,71] or applied questions such as acid/base problems [72]. Some studies have developed a theoretical framework for applying generative AI in the field of education [73]. Nevertheless, those publications conclude that future research should focus on evaluating the impact of ChatGPT (or other AI applications) on real students’ learning outcomes, such as academic achievement, motivation and engagement, in different contexts, as there is still a need for empirical and systematic evaluations of the use of ChatGPT on real students learning science [56,73]. In order to satisfy this need, this study focused on assessing whether the use of the chatbot as a virtual mentor to 15 to 16-year-old students learning science, when teachers were unavailable, had an impact on students’ learning outcomes.

3. Methods

The present exploratory study attempts to evaluate the use of generative AI by 15 to 16-year-old students learning science (within the frame of K-12 education, from kindergarten to 12th grade, the publicly supported primary and secondary education found in many countries). The hypothesis of this study is that ChatGPT is able to provide both functional and social/emotional support to students learning science when teachers are unavailable, offering not only real-time correct answers to their particular doubts, despite its current limitations, but also a meaningful interaction satisfying their needs (in agreement with Maslow’s theory of needs and Relationships Motivation Theory), promoting an improvement in students’ outcomes. The hypothesis will be validated by addressing the following research questions:
RQ1: Does ChatGPT provide a trustworthy time-independent learning experience to K-12 students, when teachers are unavailable? Assessing the performance of ChatGPT within the field of interest (chemistry and physics knowledge to be acquired by 15 to 16-year-old students) will allow us to determine whether generative AI offers a trustworthy learning experience or not.
RQ2: Can ChatGPT create meaningful interactions with K-12 students? The meaningfulness of the interaction between ChatGPT and students was evaluated as an analog of human-human interaction. Thus, factors such as the number of interactors, the participants, the activity during the interplay, the communication medium, the synchronicity, and the motivation/engagement were assessed.
RQ3: What is the real impact of using ChatGPT as a virtual mentor on K-12 students learning science when teachers are unavailable? The direct impact on students might be assessed by monitoring two outcomes: the evolution of students’ grades after the intervention, as well as their opinion regarding the utility of ChatGPT as an educational tool.
Following previous works on the use of AI within an educational context [74,75,76], this exploratory study addresses these research questions by evaluating ChatGPT’s competence to become an educational tool aimed at providing K-12 students with personalized, meaningful, and location- and time-independent learning, in a safe environment and in real time, assisting teachers in the task of mentoring students through specific duties such as correcting homework and solving doubts at home. Special focus is placed on assessing: (a) students’ proficiency before and after the intervention, and (b) students’ perception of AI as a useful educational tool, once duly evaluated. To the best of our knowledge, this is the first empirical assessment of the real impact of using ChatGPT as a virtual mentor on K-12 students learning chemistry and physics when teachers are unavailable (i.e., at home), within the frame of a blended learning pedagogical approach combining constructivist/connectivist in-person learning (Education 2.0 and 3.0) with student-centered self-regulated cybergogy (Education 4.0) [16].
This study was designed following the previously described IDEE theoretical framework for using generative AI in the field of education [73]. According to this, the main pillars of the study have been identified:
  • Desired outcomes: This empirical study aims to systematically assess the complementary and well-defined use of ChatGPT outside the traditional school environment (mainly focused on correcting specific homework assignments designed by the teacher and solving students’ particular doubts and needs) on K-12 students (15 to 16 years old) learning chemistry and physics. The outcomes monitored in order to evaluate the impact of AI on students were their proficiency (through grade evolution) and their perception of AI as an educational tool, before and after the intervention.
  • Appropriate level of automation: The study was designed within a blended learning pedagogical approach, where the teacher’s role is essential as not only mentor but also facilitator [77]. Thus, K-12 students kept constructivist/connectivist in-person learning at school in combination with online learning experiences designed by the teacher (flipped learning/cybergogy [78]). The only difference arose for students in the experimental group, who could complement their homework tasks by means of ChatGPT, employed as an educational tool able to correct assignments, solve doubts and guide students towards a better understanding of the lesson and a stronger, longer-term consolidation of knowledge. Therefore, only partial automation is considered.
  • Ethical considerations: The procedures performed in this study involving human participants were in accordance with national and European ethical standards (European Network of Research Ethics Committees), the 1964 Helsinki Declaration and its later amendments, the 1978 Belmont Report, the EU Charter of Fundamental Rights (26 October 2012), and the EU General Data Protection Regulation (2016/679). As the study involved 15 to 16-year-old students, parental informed consent was obtained for all individual participants included in the study. The main ethical concerns discussed in the literature relate to intellectual property, privacy, biases, fairness, accuracy, transparency, lack of robustness against “jailbreaking prompts”, and the electricity and water consumption needed to sustain the AI servers [79,80,81,82]. The use of ChatGPT planned in this study leaves little room for intellectual property, privacy or transparency issues, and jailbreaking prompts seem to offer students no advantage in this case. However, students misusing ChatGPT to do their homework, instead of positively exploiting AI to correct their homework and solve their doubts, might be a potential problem [56]; still, this technology is so new and attractive that students are easily engaged to test ChatGPT and its potential benefits. In any case, potential misuse can easily be detected by comparing students’ grades before and after the intervention, as the grades of students misusing the AI should never show any improvement. Another potential consideration is the generation of incorrect or biased information, as the AI’s answers are limited by its training, and some mathematical hallucinations have already been detected [83]. Thus, a prior validation of ChatGPT’s performance in the specific field of K-12 chemistry and physics was carried out. In the case of large language models, bias can be defined as the appearance of systematic misrepresentations, attribution errors or factual distortions based on learned patterns, which might drive support for certain groups or ideas over others, preserving stereotypes or even making incorrect assumptions [84]. Training data, algorithms and other factors might contribute to the rise of demographic, cultural, linguistic, temporal, confirmation, and ideological/political biases [84]. However, these potential preexisting biases within the model should not affect the utility of AI within the field of interest (K-12 science education), even if users should and will be aware of this possibility. Beyond those considerations, the impact of this study on learners focuses on achieving a better understanding of the lesson and a stronger, longer-term consolidation of knowledge. Concerning teachers, they would be assisted by AI in a time- and location-independent manner in their task of mentoring students, leaving them more time to personally address particular students’ needs.
  • Evaluation of the effectiveness: According to the literature, the gold standard for measuring change after any intervention (i.e., within educational research) is the experimental design model [85], so it was chosen to assess the proposed educational approach.

3.1. Proposition of the Experiment

The aim of the experiment was to evaluate the effect of the complementary use of ChatGPT when teachers are unavailable (acting as a virtual mentor, helping students to correct specific assignments and solve doubts) on the K-12 students’ learning outcomes: their grades and their perception of AI’s usefulness as an educational tool.

3.2. Independent Variable (Factor): Complementary Use of AI

  • Description: The independent variable or factor was the use of a learning tool (AI).
  • Levels: This variable has two levels:
    With tool (Experimental group): Students who used the learning tool.
    Without tool (Control group): Students who did not use the learning tool.
  • Objective: To assess whether using the learning tool affects both academic performance and students’ perception of its usefulness.

3.3. Response Variables (Dependent Variables): Complementary Use of AI

  • The design included two response variables, each measuring the potential effects of the independent variable.
  • Response variable 1: Students’ grades
    Description: This variable represented students’ academic performance, typically expressed on a numerical scale.
    Objective: To determine if there was a difference in academic performance between the experimental and the control group.
    Type of variable: Continuous, as grades are generally expressed in numerical values that allow for statistical calculations.
  • Response variable 2: Students’ perception of the learning tool
    Description: This variable referred to students’ perceptions regarding the use of the learning tool.
    Objective: To evaluate how students perceived the tool’s usefulness.
    Type of variable: Ordinal—the students’ perception was recorded by means of a five-level Likert scale.

3.4. Type of Design: Non-Randomized Unifactorial Design with Control Group

  • Unifactorial design: Only one factor is under study, which is the use of the tool.
  • Non-randomized design: Randomization was not an option for the present study, as forming two groups of students (experimental and control) with balanced proficiency levels (low, medium and high) avoids the potential bias arising from groups with unbalanced levels of proficiency. Despite this advantage, the approach implied a potential selection bias.
  • Control and Experimental Groups: A control group (without tool) was used to compare outcomes with the experimental group (with tool) and to observe any significant differences.

3.5. Design Analysis

  • With two response variables (students’ grades and perception of AI), the analysis was conducted separately for each response variable.
  • Comparison of means (grades): Grades were the primary measure of performance, so a difference-in-means test such as the t-test was used between groups to assess the tool’s impact (a minimal sketch follows).
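As an illustration only (the authors’ actual analysis scripts are not published), a minimal sketch of such a between-groups difference-in-means test in Python, assuming hypothetical grade vectors:

```python
# Minimal sketch of a between-groups t-test; the grade vectors are
# hypothetical placeholders, not the study's real data.
from scipy import stats

grades_experimental = [6.5, 7.0, 8.2, 5.9, 7.8, 6.1, 9.0, 7.4, 6.8, 8.5, 7.1]
grades_control      = [6.0, 6.8, 7.5, 5.5, 7.0, 5.8, 8.4, 6.9, 6.2, 7.9, 6.6]

# Independent two-sample t-test on the group means.
t_stat, p_value = stats.ttest_ind(grades_experimental, grades_control)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # difference significant if p < 0.05
```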

3.6. Design Limitations

  • Lack of randomization: As the design was non-randomized, it was subject to selection bias, as other uncontrolled factors (such as prior skills or students’ motivation) might influence the results.
  • Internal and External Validity: The lack of randomization limited internal validity, by hampering the capacity to ensure that differences were due to the tool and not to other factors, and might also affect external validity, by limiting the generalization of the obtained results to other contexts or populations.

3.7. Participants

Benaguasil is a town with a per capita income close to the Spanish average. IES Benaguasil, a young public school (only 18 years old) close to a quiet residential neighborhood, had only one class of elective Chemistry and Physics students. The study environment was diverse because of the elective nature of the subject: some students chose it to pursue a scientific route towards university, while others were not following that route. 23 Caucasian students participated in the study, 6 males and 17 females. Students were assigned to the control or experimental group taking into account only their proficiency during the first term (with no AI help). That proficiency followed a Gaussian distribution, with double representation of medium (4–7/10) over low (0–4/10) or high (7–10/10) proficiency. Accordingly, both the control and the experimental group were created following the same distribution of proficiency (25, 50 and 25% of students displaying low, medium and high proficiency, respectively; a minimal sketch of this balanced split is given below). It is important to note that students displaying low proficiency tended to present particular circumstances (they were repeating the year, faced family problems, etc.), so they also required special attention.
First, the real performance of ChatGPT in the field of chemistry and physics for K-12 students (precisely, 15 to 16-year-old students) was systematically evaluated by the authors. The AI-powered chatbot answered a test specifically designed for real K-12 students, including a set of 52 selected theoretical questions and problems summarizing the knowledge and problem-solving skills to be acquired during a complete academic course, in a similar way to previous studies [48,59,60], always keeping in mind that this technology is not purposely designed for education, despite its great potential. No difficult or impossible questions were removed from the set, as other studies previously did (i.e., questions requesting drawings as outputs, or analyzing images as inputs) [86], in order to obtain a fair and accurate picture of ChatGPT’s performance within this particular field, covering all types of knowledge and skills required of 15 to 16-year-old students learning chemistry and physics. Eleven teachers, including chemists, physicists, and engineers, evaluated the answers. The AI’s replies to theoretical questions were assessed for clarity, accuracy and soundness, while more applied questions such as problems were evaluated not only on the accuracy of the final result, but also on the validity and clarity of the procedure to reach it (related to problem-solving skills), paying special attention to those resources enabling a stronger and longer-term consolidation of knowledge in a pedagogical manner. Once the theoretical performance of the chatbot was assessed, the authors judged the experimental capacity of this tool to assist teachers in the task of mentoring real 15 to 16-year-old students learning chemistry and physics when educators were unavailable. The focus was set on solving theoretical doubts and correcting homework assignments (including problem-solving questions) in real time and without time restrictions.
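As referenced above, a minimal sketch (hypothetical student identifiers and grades, not the study’s real data) of the proficiency-balanced, non-randomized split into control and experimental groups:

```python
# Proficiency-balanced split sketch (hypothetical data, illustrative only).

def proficiency_band(grade: float) -> str:
    """Map a first-term grade (0-10 scale) onto the study's three bands."""
    if grade < 4.0:
        return "low"
    if grade < 7.0:
        return "medium"
    return "high"

# Hypothetical first-term grades (the real cohort comprised 23 students).
students = {"S01": 3.5, "S02": 5.0, "S03": 6.2, "S04": 8.1,
            "S05": 2.8, "S06": 5.5, "S07": 6.8, "S08": 9.0}

groups = {"control": [], "experimental": []}
seen = {"low": 0, "medium": 0, "high": 0}
for sid, grade in sorted(students.items(), key=lambda kv: kv[1]):
    band = proficiency_band(grade)
    # Alternating within each band reproduces the same low/medium/high
    # shape (25/50/25%) in both groups.
    target = "control" if seen[band] % 2 == 0 else "experimental"
    groups[target].append(sid)
    seen[band] += 1

print(groups)
```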
Therefore, this study empirically assessed the impact of providing students with a meaningful interaction with the chatbot through which they could experience completely personalized learning, improving their knowledge and skills while boosting their engagement. The knowledge improvement after the intervention was monitored through two indicators: students’ grades (taking into account both proficiency and problem-solving skills) and their perception of AI as a useful educational tool.
Finally, among the different AI-powered chatbots (both free and paid), ChatGPT was selected for two main reasons: ChatGPT was totally free at the time (which could contribute to reducing inequities in the field of education), and it exploits the original OpenAI GPT technology, which benefits from more training and constant updates, thus ensuring the use of the latest and most powerful version, which is less prone to hallucination. Indeed, the study started with ChatGPT powered by GPT-3.5 (which showed frequent hallucinations when performing mathematical operations and a lack of chemical reasoning), and ended up employing the GPT-4 model, released in March 2023. Other large language models such as Bard (Google), LlaMa-2 (Meta) and AWS services (Amazon) were also released, but their capacities were not comparable to those of ChatGPT at the time [71]. Finally, the next GPT model (GPT-5) is expected to be released soon, and it has reportedly been announced as able to reach Artificial General Intelligence (AGI): an AI able to pass the Turing test [87], that is, an AI so advanced that it might be indistinguishable from human intelligence.
  • Assessment of ChatGPT’s performance in the field of chemistry and physics for K-12 students
A set of 52 theoretical questions and problems was carefully selected to systematically ascertain the real competence of ChatGPT in the field of interest, covering the main knowledge and problem-solving skills to be acquired by 15 to 16-year-old students during a complete academic course. Gathering both theoretical questions and problems allowed us to analyze not only ChatGPT’s current strengths (textual output) but also its potential weaknesses (handling images). Thus, the study aimed to explore the capacity of AI to deal with problem-solving (combining text recognition with mathematical calculation) and also to verify its capacity to deal with inputs and outputs other than text (i.e., requesting it to draw the Lewis structures of some molecules, as this is a fundamental part of the knowledge to be reached by chemistry students). The whole set of questions is available within the Supporting Information. The objective of this part of the study was not to verify whether ChatGPT fails (it does), but to systematically assess the number of mistakes displayed within a real test and grade it on a human scale. This allows us to verify whether ChatGPT is a trustworthy tool in K-12 science education. Finally, other parameters concerning the quality of the answers were also taken into consideration (clarity, insight, systematicity, simplicity, etc.).
  • Assessment of ChatGPT’s impact on real K-12 (15 to 16-year-old) students learning chemistry
In order to evaluate the impact of ChatGPT on K-12 students learning chemistry and physics without their teachers, students in the experimental group were requested to employ this tool in four sessions, in which they had to correct their specifically designed homework and solve theoretical or problem-solving doubts, after a prior demonstration by the teacher in the classroom. Then, two key performance indicators (KPIs) were proposed to monitor the influence of ChatGPT use on students’ proficiency: students’ grades (in comparison with previous-term grades) and students’ perception of AI as an educational tool. Thus, both quantitative and qualitative data were collected in order to monitor the outcomes before and after the intervention. Qualitative data, such as the students’ perception of AI as an educational tool, were gathered through a set of questions formulated for students after each session, applying the Likert scale to grade the answers in the survey [88]. Specifically, a typical five-level Likert scale design was used to measure variations in agreement, whose levels were:
  • Strongly disagree.
  • Disagree.
  • Neither agree nor disagree.
  • Agree.
  • Strongly agree.
The evaluation of students’ proficiency was conducted through quantitative data analysis: the direct comparison of students’ grades before and after the intervention (through a paired-samples t-test) was performed on both the control and the experimental groups, after the normal distribution of the data was verified, as previously described [89,90,91] (a minimal sketch of this analysis follows).
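A minimal sketch of this per-group analysis, assuming hypothetical before/after grade vectors (SciPy’s Shapiro-Wilk test stands in here for whichever normality check the authors applied):

```python
# Paired-samples analysis sketch (hypothetical data, illustrative only).
from scipy import stats

grades_term1 = [5.0, 6.5, 4.2, 7.8, 5.5, 6.0, 8.1, 4.9, 6.7, 7.2]  # before
grades_term2 = [6.1, 7.0, 5.5, 8.4, 6.2, 6.8, 8.9, 5.7, 7.5, 7.9]  # after

# The paired t-test assumes normally distributed differences.
diffs = [after - before for before, after in zip(grades_term1, grades_term2)]
w_stat, p_norm = stats.shapiro(diffs)
t_stat, p_pair = stats.ttest_rel(grades_term2, grades_term1)

print(f"Shapiro-Wilk p = {p_norm:.3f} (normality not rejected if p > 0.05)")
print(f"paired t-test: t = {t_stat:.2f}, p = {p_pair:.3f} (significant if p < 0.05)")
```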
The homework was divided into four sessions focused on the “Chemical Reactions Unit”, and more specifically on a topic that is usually hard to understand for the general profile of 4th ESO students: the fundamental chemical quantity measuring the “amount of substance”, whose unit is the mole. Mole calculations at this level relate to: (a) the number of particles and atoms in a specific substance using Avogadro’s number, (b) the macroscopic mass of the substance (including the molecular mass, or more precisely, relative molecular mass), and (c) the number of moles in a gas sample related to the system conditions, including pressure (atmospheres), temperature (Kelvin), and volume (litres). Within this frame, two profiles of sessions were designed:
(a)
Chemical calculations (gas laws). Sheets 1 and 2 presented a particular case of the gas equation of state, in the final form of the ideal gas law, relating it to the mole content of the gas sample. The exercises on sheets 1 and 2 request the direct, single calculation of moles, volume, or pressure from an exercise statement including the complete dataset. Each sheet includes six exercises.
(b)
Gas or volume to mole relationships, as more advanced learning. Sheets 3 and 4 introduced Avogadro’s law. Each sheet presents a set of six exercises considering the calculation of a single parameter (n (moles) or V (volume)) for the reactant and/or product species of a particular chemical reaction. Pressure (in atmospheres), temperature (absolute, in Kelvin, K), and stoichiometric factors were provided in each exercise to reduce complexity. The document also included a brief theoretical exercise requesting a particular reformulation of Avogadro’s law (mole to volume ratio).
The complete set of exercises and questions within the different sessions is included in the Supporting Information; a worked example of the two calculation types is sketched below.
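For reference, a worked example of each session type, with illustrative values chosen by us (using R = 0.082 atm·L·mol⁻¹·K⁻¹, the constant typically used at this level). Sheets 1 and 2 reduce to solving the ideal gas law for one unknown, while sheets 3 and 4 rest on Avogadro’s law, by which gas volumes at equal pressure and temperature are in the same ratio as their mole amounts:

$$n = \frac{PV}{RT} = \frac{(1\,\mathrm{atm})(22.4\,\mathrm{L})}{(0.082\,\mathrm{atm\,L\,mol^{-1}\,K^{-1}})(273\,\mathrm{K})} \approx 1.0\ \mathrm{mol}, \qquad \frac{V_1}{n_1} = \frac{V_2}{n_2} \quad (\text{constant } P, T)$$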
At the end of each session, the students were requested to answer a brief questionnaire to record both their assignment performance and their impressions of ChatGPT’s utility:
1. How long did it take you to complete the session?
2. What aspects of the session have you found most difficult?
3. Rate your level of agreement (1: Strongly disagree, 2: Disagree, 3: Neither agree nor disagree, 4: Agree, 5: Strongly agree) with the following statements:
   3.1 You have understood the theoretical concepts.
   3.2 You know how to apply the theoretical concepts.
Question 4 was only to be answered by students who had employed ChatGPT.
4. Rate your level of agreement (1–5) with the following statements:
   4.1 The approach offered by ChatGPT to solve the exercise is correct.
   4.2 The numerical result of the exercise provided by ChatGPT is correct.
   4.3 ChatGPT is useful as a complementary educational tool (for solving theoretical doubts or correcting problems) in the absence of a teacher.

4. Results

4.1. Assessment of ChatGPT’s Performance in the Field of Chemistry and Physics for K-12 Students

The performance of ChatGPT in the field of physics and chemistry for 15 to 16-year-old students was assessed through careful evaluation of the AI’s answers to the set of 52 questions previously mentioned, over the time of the study. The score for each question relied exclusively on two parameters: the accuracy of the final result (0.5/1) and the procedure to reach that result (0.5/1). For questions comprising several sections, the total score was the same, with each section contributing proportionally to the final score. Finally, those resources enabling a stronger and longer-term consolidation of knowledge in a pedagogical manner (clarity, brevity, simplicity, use of examples, etc.) were valued positively beyond ChatGPT’s raw performance, contributing to the chatbot’s validity as an educational tool from a pedagogical point of view.
Preliminary tests were conducted to identify the best prompts for ChatGPT. Language was obviously not a problem for the tool (being a large language model): the same question was posed in Spanish and English, and the only difference was the language used to answer it (Figures S7 and S8). In addition, a plain question to the AI yielded a relatively concise answer, while a more specific prompt (“Acting as a chemistry/physics teacher, please explain…”) provided more detailed but still clear answers, including accurate and illustrative examples (a sketch of issuing such a prompt programmatically is shown below). Therefore, the 52 questions were evaluated using the English language and the specific prompt already mentioned. The results are summarized in Table 1 (2023) and Table 2 (2024) and will be discussed chronologically, in order to provide a clear comparison over time.
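The study used the ChatGPT web interface directly; purely as a hypothetical illustration, the same role-based prompt could be issued programmatically through the OpenAI Python client (the model name and example question below are assumptions, not the study’s materials):

```python
# Hypothetical illustration of the role-based prompt; the study itself
# used the ChatGPT web interface, not the API.
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

response = client.chat.completions.create(
    model="gpt-4",  # the model employed in the later phase of the study
    messages=[
        {"role": "system",
         "content": "Acting as a chemistry/physics teacher, please explain "
                    "your answers in detail."},
        {"role": "user",
         "content": "How many moles are there in 44 g of CO2?"},
    ],
)
print(response.choices[0].message.content)
```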
The 52 questions within the test were quite balanced by discipline: 27 of them related to chemistry, while the other 25 dealt with physics (Figure 1). By nature, almost 60% were theoretical queries, while 30% were experimental problems (Figure 1). Even if the two types were not balanced, there was at least a significant number of experimental questions to probe ChatGPT’s competence in problem-solving tasks.
While Figure 1 describes the distribution of questions by discipline and/or nature, Figure 2 displays the assessment of ChatGPT’s performance in the field of K-12 chemistry and physics in 2023.
  • ChatGPT performance in 2023
Among the answers to those 52 questions, 46 were completely correct and carefully explained in 2023, that is, 88% of the total, half of them related to chemistry and the other half to physics. Additionally, among the six questions that were not answered correctly, only two were completely wrong and scored 0 (one each within the chemistry and physics syllabus). Four questions were partially correct (two scoring 0.5/1 and two 0.67/1), raising the final score from 8.8/10 to 9.3/10. Thus, ChatGPT obtained a final grade of A, demonstrating a quite reliable performance within the 15 to 16-year-old students’ chemistry and physics syllabus, independent of the questions’ nature (theoretical or problem-solving queries).
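Explicitly, applying the rubric above (1 point per fully correct answer), the 2023 score is

$$\text{score}_{2023} = \frac{46(1) + 2(0.5) + 2(0.67) + 2(0)}{52} \times 10 = \frac{48.34}{52} \times 10 \approx 9.3/10,$$

whereas counting only the 46 fully correct answers gives (46/52) × 10 ≈ 8.8/10.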
Some of the chatbot strengths that might be deduced from the results obtained in 2023 were:
  • ChatGPT, being a language model, handled understanding questions and providing answers in different languages perfectly (Figures S7 and S8).
  • The AI, being a language model, elaborated the answers according to the literal request of information.
  • The chatbot provided an answer in real-time, but it was written word by word, probably in an attempt to resemble a human, which contributed to a closer and meaningful experience for the user.
  • ChatGPT was responsive to different prompts (write, act, create, list, translate, summarize, define, analyze, etc.). In this case, the prompt “act” was exploited to request the AI to behave as a science teacher able to explain in detail the solutions to the different questions (Figure S4).
  • The chatbot took into account the context of the conversation, which might improve its comprehension of the subject being debated and allow it to make reference to a concept or idea that was previously discussed (Figure S5).
  • ChatGPT furnished information that was sensitive to operators such as “TRUE/FALSE” (question 8, Figures S22 and S23).
  • The AI could handle not only theoretical doubts but also problem-solving tasks. In the latter case, the chatbot perfectly recognized and applied the values, units and, more importantly, the concept being requested.
  • ChatGPT could return answers to several questions formulated at the same time. However, it usually provided more detailed answers when questions were divided.
According to these results and the answers included within the Supporting Information, ChatGPT displayed not only a great ability to provide correct answers to a considerable number of theoretical questions and applied problems (9.3/10, 93%), but also clear, detailed and human-like explanations for theoretical queries and problem-solving duties that might help students to better understand the two disciplines of study. This implies that AI could exhibit great competence in guiding real students towards better knowledge consolidation, by correcting their homework and solving their particular doubts or mistakes in real time through a positive, human-like and meaningful interaction, within an immersive and safe environment (far from teacher or peer judgement [92]), also promoting students’ confidence and self-regulation.
Considering now the six incorrect answers in 2023, they were balanced by discipline, with three related to chemistry and three to physics. However, there was no balance by nature: in chemistry there were two issues with practical problems and only one with theoretical questions, while the opposite occurred in physics. In any case, no trends concerning the theoretical or practical nature of the incorrect answers could be extracted.
Furthermore, the main problems encountered by ChatGPT in 2023 centered on its inability to recognize or produce images (question 38, Figure S64, and questions 9 and 21, Figures S24, S42 and S43, respectively); despite this, an accurate textual description was provided instead. Additionally, the AI displayed mathematical hallucinations (question 50, Figures S81 and S82), even if the theoretical procedure and the substitution of numerical values into the equations were correct. Finally, the AI also had some trouble predicting periodic properties of elements (a direct consequence of electronic configurations) when ordering elements according to certain properties, such as radius and reactivity (question 5, Figure S17), as well as discussing the type of energy (kinetic or potential) exploited in several energy sources, specifically tidal energy (which might exploit both kinds of energy, although the chatbot was forced to choose one of them). In summary, the AI’s difficulty in processing images as inputs or outputs accounted for three of the six incorrect questions in 2023. While this issue might only find a solution through an application redesign, the remaining difficulties might be solved with better training of the GPT model, enabling improved and more accurate answers.
The chatbot weaknesses in 2023 that might be extracted from the analysis of the mistaken answers were:
  • Being a language model, ChatGPT could not recognize images as proper inputs.
  • When the inputs required to understand the question (such as subscripts and superscripts) could not be typed normally, the user was forced to develop an alternative code to introduce the missing data (as in question 1, involving a customized notation created in real time to make the AI recognize the atomic mass and atomic number of some isotopes provided within the question, Figure S5).
  • The chatbot could not create images as outputs even with the first version of GPT-4 (i.e., question 21), though the textual description that was offered instead was very clear, illustrative and correct.
  • The AI, being a language model, encountered some problems with mathematical calculations (question 50, Figures S81 and S82). Even if they were more frequent in the GPT-3.5 model, occasional mathematical hallucinations persisted in the GPT-4 model.
  • The chatbot did not correctly handle all the periodic properties of elements.
  • ChatGPT performance in 2024
The chatbot was further evaluated in 2024, and it proved even more competent than before. Among the 52 questions, 49 (94%) were answered completely correctly (Figure 3), and there were no completely wrong answers (score of 0).
The three partially correct questions increased the final grade from 94 to 97%, still scoring A but displaying a significant performance improvement in 2024. All the theoretical questions were correctly answered by ChatGPT in 2024, including the one regarding the properties of elements according to their position within the periodic table (question 5, Figures S17 and S17b) and the one concerning the kinetic and also potential nature of tidal energy (question 48, Figures S77–S79 and S79b). The problem of handling images as inputs was also solved (question 38, Figures S64, S65 and S65b), with the AI now recognizing the vectorial character of a force within an image provided by the user.
Despite the verified improvement with training, the AI still exhibited some difficulties with three practical questions, specifically with handling images as outputs (questions 9 and 21, Figures S24, S42 and S43, respectively) and with some mathematical calculations (question 50, Figures S81, S82 and S82b). Some Lewis structures were clearly improved, and the textual description was perfect, but the final image was still confusing (Figure S24b). The same held for the energy diagram requested in question 21 (Figure S43b): even though the scheme indicated in parentheses that the energy of the reaction products was lower, the drawing placed the energy of the reactants below that of the products, which could confuse students. In addition, reactants and products sat at the same level along the x axis (reaction progress), so the student might not clearly appreciate the variation of energy during the reaction process, which was another aim of that part of the question.
Regardless of these minor problems (many of which were duly addressed by training in 2024), the consistent, exhaustive, and positive results obtained within the test demonstrated the remarkable performance of ChatGPT in answering both theoretical and problem-solving questions (9.7/10, 97%), scoring A and thus proving trustworthy for K-12 physics and chemistry students. Furthermore, the first part of the study unveiled the positive resources (beyond performance) that enhance students’ learning process: clarity and a high level of detail and organization in answers provided in a time-independent manner, as well as a human-like, meaningful response, giving students complete freedom to exploit this tool to find real-time answers to their particular doubts while studying or doing homework. Therefore, this educational approach supports students in a way no other educational tool would allow under these circumstances (when teachers are unavailable). These advantages pave the way for the potential use of ChatGPT in assisting teachers in their task of mentoring real students.

4.2. Assessment of ChatGPT’s Impact on Real K-12 (15 to 16-Year-Old) Students Learning Chemistry

The impact of using ChatGPT as a virtual mentor on real 15 to 16-year-old students learning chemistry when teachers are unavailable has been assessed through an empirical interventional study performed in a real school, monitoring two KPIs: the users’ perception of AI as an educational tool (evaluated by a set of questions formulated to students after each session, see Supporting Information, applying a typical five-level Likert scale), and students’ proficiency (by comparing students’ grades before and after the intervention, that is the first and second term, respectively).
The study comprised the analysis of several exercise sheets covering the main calculations concerning chemical reactions, as presented in the former section, and the subsequent use of ChatGPT to check the correctness of the homework assignment and to resolve any mistakes or doubts. Before the release of the homework sheets, the teacher corrected one problem with the help of ChatGPT in the classroom, so the students had an initial guide to using this tool autonomously (prompts, possible mathematical hallucinations, trying to promote students’ critical thinking, etc.). The AI followed a general procedure to solve problems that basically consisted of: (1) identifying the data (including units) and the unknown among pressure (P), volume (V), amount of substance (n), and temperature (T); (2) determining the formula required to solve the problem; and (3) substituting the real values into the formula (sheets 1 and 2), or first identifying the species in the chemical reaction, associating the data with them, and then substituting the real values into the formula (sheets 3 and 4). Already in the preparatory class, the use of ChatGPT 3.5 revealed some minimal, basic calculation errors that could be resolved with the human factor (the help of a teacher and careful supervision of students), boosting students’ critical thinking but conditioning, to a certain extent, their initial opinion of the tool. The text of the solution improved slightly from trial to trial, and this was the limit of ChatGPT 3.5’s capability. Upon completion of the study, some questions were tested again with ChatGPT 4.0, which reduced the issues with mathematical calculations and improved the chemical ability to solve problems and teach students, as the AI was then able to establish relations among the chemical species in the reaction (i.e., assigning a given formula to a reactant or a product) and determine not only their stoichiometric coefficients but also other important conditions such as limiting and excess reactants. In fact, this represents a huge step forward in solving chemical reaction exercises in comparison with the previous version, ensuring a stable and reliable pathway to the correct solution.
Once the data sheets from the four sessions were collected, the students’ tasks were graded. The students’ performance questionnaires were recorded for evaluation after the sessions’ completion. The results follow:

4.2.1. How Long Did It Take You to Complete the Session?

Considering question 1, Figure 4 represents the time spent by students solving the homework (without AI), displaying a roughly Gaussian distribution centered on 24 min for sheets 1 and 2, and 47 min for sheets 3 and 4, which is reasonable as the latter entailed a much higher calculation load.

4.2.2. What Aspects of the Session Have You Found Most Difficult?

The most common response to this question concerned the effort needed to understand chemical concepts and, consequently, how to apply them to a real problem.

4.2.3. Rate Your Level of Agreement (1: Strongly Disagree, 2: Disagree, 3: Neither Agree nor Disagree, 4: Agree, 5: Strongly Agree) with the Following Statements

  • You have understood the theoretical concepts.
  • You know how to apply the theoretical concepts.
As expected, the results for question 3 (Figure 5) supported the students’ most common answer to question 2 (“What aspects of the session have you found most difficult?”). They showed that the majority of students (69%) did not agree that they had understood the theoretical concepts, and that 75% of them could not apply those concepts within a problem-solving task involving calculations without the teacher’s or AI’s help. Because of this, the use of ChatGPT to solve doubts and correct the homework assignments arose as a potential solution for students facing problems when completing the sheet exercises, improving their understanding of theoretical concepts in real time and guiding them to apply those concepts within real problems in a very clear and detailed manner.
Finally, question 4 was posed only to students who had employed ChatGPT.

4.2.4. Rate Your Level of Agreement (1–5) with the Following Statements

  • The approach offered by ChatGPT to solve the exercise is correct.
  • The numerical result of the exercise provided by ChatGPT is correct.
  • ChatGPT is useful as a complementary educational tool (for solving theoretical doubts or correcting problems) in the absence of a teacher.
Students who used the chatbot then evaluated the AI’s ability to solve chemical problems. As expected, Figure 6a points out the positive perception of students towards ChatGPT’s competence (even though the GPT-3.5 model was used first) to define the theoretical approach to solving the exercises, displaying a distribution of values centered at 4 (agree).
In a similar manner, the results for question 4.2 (Figure 6b) showed a high degree of acceptance of the statement; that is, students appraised the AI’s capacity to provide correct numerical results most of the time, even if there were some mathematical hallucinations, more frequent with the GPT-3.5 model (and still present, to a lower extent, in the GPT-4 model). Those errors of the GPT-3.5 model might be the reason for the wider distribution in Figure 6b, tending towards lower values because some students (13%) disagreed that ChatGPT offered correct numerical results in general, although the average was still centered at 4 (agree). The mathematical errors may easily be detected by a human user, and this fact can be positively exploited by both students and teachers. Concerning students, the potential presence of hallucinations might improve their attention as well as their critical-thinking ability. Regarding teachers, the risk of finding a mathematical mistake discourages negative behaviors such as directly copying results, promoting a good use of AI based on following ChatGPT’s guidance to the solution through a perfectly detailed theoretical approach. In conclusion, ChatGPT might be perceived as a patient and wise mentor, correcting homework and solving students’ particular doubts in real time, with occasional mathematical distractions that can easily be detected by students, all of which provides a meaningful and positive interaction with K-12 students, improving their learning process. However, students should not blindly trust the AI’s answers.
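As a simple illustration of such a check (the exercise values here are hypothetical, not taken from the study’s sheets), a student or teacher can recompute the AI’s arithmetic independently to flag a hallucinated numerical result:

```python
# Hypothetical check of a ChatGPT answer to an ideal-gas exercise:
# P = nRT/V with n = 2.0 mol, T = 300 K, V = 10.0 L, R = 0.082 atm L/(mol K).
R = 0.082
n, T, V = 2.0, 300.0, 10.0

expected = n * R * T / V   # independent recomputation: ~4.92 atm
chatgpt_answer = 5.9       # hypothetical (hallucinated) value returned by the AI

# Flag the answer if it deviates from the recomputed value by more than 5%.
if abs(chatgpt_answer - expected) > 0.05 * expected:
    print(f"Possible hallucination: AI said {chatgpt_answer}, "
          f"expected approx. {expected:.2f} atm")
else:
    print("Numerical result consistent with the theoretical approach")
```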
The questionnaire concludes with question 4.3 (Figure 6c), which provides an overall impression of students’ perception of AI as an educational tool. Only 37% of students appreciated ChatGPT’s ability to guide them along the right pathway to solve/correct a problem (agreeing (4) or strongly agreeing (5) with the utility of AI as an educational tool). Thus, a considerable number of students were not sure about ChatGPT’s usefulness (31%) or strongly disagreed with its utility (32%), which could be explained by the fact that they started the study testing the GPT-3.5 model. These students probably focused more on the problem of mathematical hallucinations than on the AI’s capacity to detail a perfect theoretical approach to solve any chemical exercise. Furthermore, additional underlying issues might have conditioned this result: (1) the limited time of use and the evaluation of the AI soon afterwards (right after the completion of each homework session), and (2) the lack of objective indicators for students to measure the improvement in their learning process (such as the increase in their proficiency in chemistry). Thus, the same survey was repeated after a whole term of evaluation, in order to look for any remarkable difference in students’ perception of AI. In addition, this KPI was not the only one employed to monitor ChatGPT’s impact on 15 to 16-year-old students learning chemistry.
Students’ proficiency before and after the intervention was also assessed by directly comparing students’ grades after one term of AI use with those obtained the previous term (with no AI assistance). A paired-sample t-test was performed before and after the intervention on both the control and the experimental groups, after verifying the normal distribution of the data. The results of the data analysis are summarized in Table 3 and Figure 7.
The overall marks of these students improved with the utilization of the AI tool, although grades are indeed a complex factor. Within the control group, even if the average marks slightly improved in the second term (one point out of ten, Figure 7), the data analysis revealed no statistically significant differences between the students’ grades before and after the intervention, with 95% confidence (which was set for the study), as the p-value associated with the paired-sample Student’s t-test statistic (p = 0.29) was greater than 0.05 (5%). This was to be expected, as the control group did not employ ChatGPT as a virtual mentor, so these students could not exploit the advantages of the proposed educational approach.
On the contrary, the experimental group displayed statistically significant differences between the students’ grades before and after the intervention, with 95% confidence, as the p-value associated with the paired-sample Student’s t-test statistic (p = 1.67 × 10−6) was smaller than 0.05 (5%). The students’ mean scores in the experimental group improved by almost three points out of ten (Figure 7) after one complete term using ChatGPT as a virtual mentor. Besides being statistically significant, this remarkable improvement in students’ proficiency might be due, to a large extent, to the use of ChatGPT as a virtual mentor in the absence of their teacher, because the improvement (1) was almost three times higher than that of the control group, and (2) was manifested by 90% of students in the experimental group, independently of their level of proficiency.
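As a minimal sketch of this statistical analysis (the grade vectors below are hypothetical placeholders, not the study’s data, whose actual results are those reported in Table 3), the normality check and the paired-sample Student’s t-test could be reproduced in Python as follows:

```python
import numpy as np
from scipy import stats

# Hypothetical placeholder grades (0-10): first-term marks (before the
# intervention) and second-term marks (after) for the same eight students.
before = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 6.3, 5.0])
after = np.array([6.9, 7.4, 6.1, 8.8, 8.0, 7.2, 9.1, 7.5])

# Check the (approximate) normality of the paired differences,
# a prerequisite of the paired-sample Student's t-test.
differences = after - before
_, p_normality = stats.shapiro(differences)
print(f"Shapiro-Wilk p-value: {p_normality:.3f}")  # > 0.05: normality not rejected

# Paired-sample Student's t-test on the two terms' grades.
t_statistic, p_value = stats.ttest_rel(after, before)
print(f"t = {t_statistic:.2f}, p = {p_value:.2e}")

# With the study's 95% confidence level, the improvement is significant
# when the p-value falls below 0.05.
print("significant" if p_value < 0.05 else "not significant")
```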
The students with a high level of proficiency improved their marks significantly, but the main differences were observed for students with medium and low levels of proficiency, whose grades improved remarkably. Students who failed the first term (marks between 3 and 5) obtained second-term grades between 6 and 9.55, and the student with the lowest mark in the first term (1.61, showing great difficulties in following the class level) improved their knowledge during the second term using AI, finally obtaining a 4.78 (and a slightly higher final grade, 5.52, with additional contributions from other tasks of no interest for this study), thus reaching the minimum acceptable level of knowledge for a student in Spain (5/10). In conclusion, the average level of proficiency of the class climbed from low-medium to medium-high, with the sole aid of this educational tool.
Finally, before the end of the final term, the students’ perception of ChatGPT’s utility as an educational tool was assessed again, in order to compare it with the previous results (Figure 8). By that moment, the students had used the AI for a much longer time (two terms), and they also had several objective criteria to assess whether their proficiency in chemistry had increased or not: their own knowledge, which could still be somewhat subjective, and their grades, which were objectively assigned by the teacher.
The overall perception of students regarding the use of ChatGPT as an educational tool after two terms of use was significantly more positive than the previous results: 70% of students agreed or strongly agreed that the AI was a useful educational tool, compared with the initial 37%. This means that almost twice as many students declared a positive perception of ChatGPT as a useful educational tool. In line with the previous discussion, a longer time of use and objective indicators of students’ improvement (their real proficiency and their grades) might account for this change in students’ perception of AI, as they realized that ChatGPT had been an efficient educational tool to boost their proficiency in chemistry in a short time.
In addition to the data analysis, the survey allowed us to gather some students’ opinions that went beyond a positive perception: they agreed the homework sheets and the complementary use of ChatGPT had been crucial for the improvement of their learning process, for an additional reason. When students faced the homework assignments, teachers were unavailable and, regretfully, many of them could not count on parents or any other tutor able to help them with their doubts in real time (a common problem for teen students, who are supposedly more independent). Those students found themselves suffering from a lack of surveillance, advice and/or support from a meaningful human being around them, a gap that no one could fill. However, that feeling disappeared after one term using AI at home, because students could now count on ChatGPT to provide them with the real-time support (both technical and emotional) they needed. Thus, the study empirically demonstrated the remarkable abilities of ChatGPT to mentor 15 to 16-year-old students through a human-like, meaningful and personalized interaction, which not only ensured a remarkable improvement of students’ proficiency, but also contributed to promoting a feeling of care and support that some students lacked [64].

5. Discussion

The results obtained within the present exploratory study allowed us to address the original research questions in the frame of K-12 science education:
RQ1: Does ChatGPT provide a trustworthy time-independent learning experience to K-12 students, when teachers are unavailable?
Previous studies of ChatGPT performance unveiled quite diverse results, even within the same field of application and target population [58]. For instance, ChatGPT returned neither reliable nor accurate information concerning anatomy for university studies [60], displayed poor results (45% correct answers) on the MSE (the main specialization exam for medicine in Türkiye) [93], and yet provided a high level of concordance, accuracy, and insight on the United States Medical Licensing Exam (USMLE) [48]. Several factors, besides the knowledge level and the AI model (which are the same in these cases), might account for this variability across the literature: systematicity, the type of questions, and the level of encoding. The first study attempted a quick assessment of the AI’s performance within the field of anatomy by means of a few questions, while the others were more exhaustive and systematic, i.e., evaluating AI performance through three tests of around 300 multiple-choice questions. The second study provided the AI with multiple-choice questions, while the third study used not only multiple-choice but also open-ended questions. In addition, the latter study paid special attention to encoding, including multiple-choice single answers both with and without forced justification prompting. Surprisingly, it was the most exhaustive and systematic study, which included several types of questions beyond multiple-choice ones and paid particular attention to encoding, that concluded the AI performed with a high level of concordance, accuracy, and insight on the USMLE. This suggests that any assessment of current AI capabilities within a given field of knowledge should take (at least) all those variables into account: knowledge level, AI model, systematicity, types of questions (not only multiple-choice but also open-ended), and encoding.
In the field of science education, there are some publications exploring the ability of generative AI to answer a few theoretical questions [68,70,71], but more systematic studies paying attention to the previously described variables would deepen the knowledge in the field.
In this study, the lower level of knowledge (restricted to K-12 science education) and the more advanced AI models planned to be used during the experiment (GPT-3.5 at first, and then GPT-4) led the authors to foresee a positive performance. Furthermore, special attention was dedicated to performing an exhaustive and systematic assessment, including more than 50 open-ended questions of K-12 level chemistry and physics, and adequate encoding. No demographic, cultural, linguistic, temporal, or ideological/political biases were detected within the answers, probably because the questions belonged to the scientific field, leaving aside those sources of potential bias in the language model. The exhaustive and systematic study revealed that ChatGPT provided answers with remarkable accuracy and insight into theoretical questions, and a clear, detailed and well-organized procedure to find the solution to problem-solving questions. The consistent and positive results obtained through this performance test (9.3/10 in 2023 with the GPT-3.5 model, and 9.7/10 in 2024 with the GPT-4 model) demonstrated that ChatGPT provided trustworthy answers (97%) in the field of K-12 science education, despite some limitations that were considered and included within the test design (half of which have been duly addressed by OpenAI through a GPT model update and further model training). Thus, a realistic view of current ChatGPT performance within the field of interest has been provided to the community, in line with previous publications considering similar variables within different fields of education [48,72]. These results suggested ChatGPT might be a reliable tool to help K-12 students learning science to reinforce theoretical knowledge and enhance their problem-solving skills with no time or location restrictions, when teachers are unavailable.
RQ2: Can ChatGPT create meaningful interactions with K-12 students?
Even if there is still a limited understanding of what meaningful interactions are from a holistic perspective, some authors recently summarized the main factors contributing to meaningfulness across different cultures, taking into account today’s media landscape [94]. Those aspects included the partner and what happens before, during, and after the interaction [95], the number of interactors [96], the activity during the interplay [97], as well as the communication medium [98], synchronicity [99,100], and motivation/engagement. According to this study, factors regarding the interaction characteristics had more influence over meaningfulness than the communication channel [94]. Furthermore, among the different communication media, text, instant messaging, and social media networks had a stronger influence than calls, video calls and even face-to-face interactions (which displayed the lowest coefficient in the linear regression analysis) [94].
If the interaction of users with ChatGPT could be examined as an analog of the interaction between two human beings, then, from a theoretical perspective, ChatGPT might be benefitting, to a certain extent, from some of those factors promoting meaningful interactions with users. First, the interplay involves only one interactor (and some authors suggest people might find more meaning in small groups [96]). In addition, the communication medium is an instantaneous text message conversation, which is the best option among the different media and one of the activities providing maximum meaning to the interaction [94]. Furthermore, the aim of the conversation is studying, another activity driving an interaction of maximum meaning. Finally, synchronicity is another strength of ChatGPT, which might enhance the meaningfulness of the interaction through “amplification effects”, related to more vivid memories, motivation and engagement [94,99,100]. Thus, the AI puts special effort into interacting with the user as a human being would, not only displaying answers in real time, but typing them word by word, sentence by sentence, contributing to a closer and more realistic interaction. In conclusion, from a theoretical point of view, ChatGPT possesses the resources required to create meaningful interactions with the user.
During the experiments to assess ChatGPT’s performance in the field of chemistry and physics for K-12 students, and those aimed at evaluating ChatGPT’s impact on real K-12 (15 to 16-year-old) students learning chemistry when their teachers were unavailable, both the authors and the students verified the AI’s remarkable ability to explain scientific theoretical concepts and to guide students in solving problems as a real human would, following conventional, well-organized procedures in a clear way, and providing real-life examples promoting a deeper and longer-term understanding of the subject. Both authors and students agreed that the interaction with ChatGPT resembled interacting with a real person with deep knowledge in the field (K-12 physics and chemistry). Many aspects of ChatGPT were responsible for this resemblance: the real-time, instant-messaging-like communication medium (contributing to synchronicity, motivation and engagement) with only one interactor; typing the answer as a human would; the ability of the AI to take the context and previous discussion into account; very open and versatile prompts enabling the user to request that ChatGPT answer within a particular framework (i.e., acting as a secondary school teacher); and its ability to provide rich answers adapted to those different requested behaviors. All these experimental aspects of ChatGPT promoted meaningful interactions with the user.
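As an illustration of this kind of role framing (the study itself used the standard ChatGPT web interface, so the programmatic form below, including the model identifier and the prompt wording, is only an assumed equivalent, not the study’s setup), the same behavior could be requested through the OpenAI Python client:

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Frame the assistant as a secondary-school teacher before asking a doubt,
# mirroring the role-framing prompts the students were shown in class.
response = client.chat.completions.create(
    model="gpt-4",  # assumed model identifier; the study used GPT-3.5 and later GPT-4
    messages=[
        {"role": "system",
         "content": "Act as a secondary school physics and chemistry teacher. "
                    "Explain step by step, state the units, and check the result."},
        {"role": "user",
         "content": "What volume does 0.5 mol of an ideal gas occupy "
                    "at 1 atm and 273 K?"},
    ],
)
print(response.choices[0].message.content)
```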
Besides the functional advantages of AI enabling a meaningful interaction with students, a different facet enriching the meaningfulness of the interaction might also be considered: the emotional support the AI might provide to students. Several authors recently suggested ChatGPT cannot offer emotional support or enhance critical thinking/problem-solving skills in science learning [68], despite its impressive capabilities (assessing, grading, guiding, and recommending students), but that discussion remained within the theoretical plane, mainly referring to the fact that ChatGPT is not able to substitute for the role of teachers (which is true).
However, the present study proposed that AI might assist teachers in their role of mentoring students (not substitute them), and it experimentally verified that ChatGPT displayed not only functional but also social features. In fact, the AI was able to satisfy individual needs in real time (K-12 scientific doubts), within a safe environment (far from peers’, parents’ or teachers’ judgement [92]) and without time or location restrictions, providing both fast answers (at least within the field of K-12 science) to users’ particular questions and the interactive and personalized support desirable from the perfect assistant [64]. Thus, the AI−human interaction promoted users’ autonomy and productivity, which, in agreement with Maslow [101] and Relationships Motivation Theory [102], contributed to satisfying their physiological, emotional, security, dignity and self-actualization requirements, even promoting a feeling of care, support and social camaraderie [64]. Emotional support is usually provided by family, significant others, friends, colleagues, counselors, therapists, clergy, and support groups, but also by online groups or even social networks. In this case, the meaningful interaction established between the AI and the human user, in combination with the functional support offered by the AI, enabled ChatGPT to provide students with a kind of emotional support that, according to the literature [103], might bring students reassurance, acceptance, encouragement, and a sense of being cared for.
This was verified not only by students’ positive perception of the AI after two terms of evaluation, but also through the opinions gathered within the survey from students who did not have a parent or tutor to help them with their doubts while solving the homework assignments. They highlighted the importance, for their learning process, of being able to count on the AI to solve their doubts in real time. The meaningful interaction established between ChatGPT and those students in a safe environment, in combination with the remarkable technical support provided by the AI in a time- and location-independent manner, promoted the students’ sense of care and reassurance over time (two terms), which made them finally recognize the emotional support they felt knowing they could count on ChatGPT to help them.
In conclusion, ChatGPT was able to create meaningful interactions with K-12 students (15 to 16-year-olds), and the emotional support provided by this singular human−AI interaction reinforced its ability to assist teachers, when they were unavailable, in the virtual mentor role proposed in the study.
RQ3: What is the real impact of using ChatGPT as a virtual mentor on K-12 students learning science when teachers are unavailable?
The objective of using ChatGPT with K-12 students learning science when teachers are unavailable is to help students correct homework assignments, solve doubts, and guide them towards a better understanding of the lesson and a stronger, longer-term consolidation of knowledge (technical support), while improving students’ sense of care and reassurance (emotional support). Thus, the impact of this approach might be monitored through the increase in students’ knowledge and skills, and the evolution of students’ perception of ChatGPT as a useful educational tool.
The increase in students’ knowledge and skills was experimentally assessed through a unifactorial quasi-experimental analysis [104]. Even if a fully experimental analysis would have provided stronger causal evidence, randomization within a small group of students (only one class) might have created groups with unbalanced levels of proficiency, and was therefore avoided. Future studies will soon be carried out with a higher number of students, considering several classes and schools from different regions/countries, and an experimental analysis will then be chosen to develop a large-scale assessment.
Once the remarkable performance of the chatbot in K-12 chemistry and physics was demonstrated, and the meaningful human−AI interaction could be verified, the effectiveness of the proposed educational approach was monitored through two different outcomes: the evolution of students’ grades before and after the intervention and students’ perception of the AI as an educational tool.
According to the literature, grades offer a limited capacity to evaluate students’ level of proficiency (due to a generalized lack of standardization across institutions or even nations). Despite this, grades are still one of the most frequently used indicators, attempting to systematically take into account both theoretical knowledge and applied skills and competencies through achievement tests [105]. Furthermore, recent studies demonstrated that high-school grades might be a stronger predictor of college grades than standardized tests (because they are thought to capture both students’ academic and noncognitive factors that play a role in academic success, such as perseverance and a positive mindset) [106]. In conclusion, grades might be considered not only reliable indicators, but also good predictors of future performance.
On the one hand, the present study demonstrated that students’ average grades in the experimental group improved by 30% after one term of using ChatGPT to correct their homework assignments and to solve theoretical and problem-solving doubts (with respect to their grades in the previous term). Besides being statistically significant, the improvement in students’ proficiency was almost three times higher than that of the control group and was manifested by 90% of students in the experimental group, independently of their level of proficiency. Looking at individual students within the experimental group, those with lower levels of proficiency (displaying great difficulties in following the lessons) were able to pass their exams (some of them reaching good marks), while the students with a higher level of proficiency still increased their grades. To sum up, the class’s grade average improved from a low-medium to a medium-high level of proficiency, with the only additional help of ChatGPT as an educational tool assisting students when teachers were unavailable.
On the other hand, students’ perception of AI varied over the course of the study. The first reason is that they handled evolving versions of ChatGPT: they started using GPT-3.5, which displayed more limitations and hallucinations, and ended up employing GPT-4, which addressed most of those problems. Furthermore, time allowed students to form a more considered perception of AI. After one term of using ChatGPT, students had not only perspective, but also quantitative inputs to verify whether the AI had been a useful educational tool for them: their own proficiency, and their grades. Both reasons made students’ perception of AI as an educational tool evolve from an average of 3 (neither agree nor disagree), with 33% of students strongly disagreeing (1), towards an average of 4.05 (agree), with most students agreeing or strongly agreeing (70%).
Both indicators in this quasi-experimental analysis clearly demonstrated that the intervention was successful: ChatGPT was capable of providing students with the technical and emotional support they required when teachers were unavailable, so their grades and perception of AI usefulness as an educational tool increased significantly after only one term.
Similar approaches have already been proposed in the field of education [56,73,107], but no experimental analysis was performed, and the lack of empirical results was highlighted. A recent work assessed students’ perception of ChatGPT [108] and, despite the differences in experimental design, the conclusions were quite aligned. That study’s intervention took place during the second of three examinations in a semester of college students learning physics, and the aim was to verify the students’ authentic perception of AI, without preconceived ideas about it. Its results revealed that half of the students strongly trusted ChatGPT’s answers regardless of their accuracy, and many of them suffered from typical misconceptions regarding AI, such as believing AI knows everything, and anthropomorphism. In contrast, the present study developed over the whole academic year (the first term as a control term, and the rest of the terms as both control and experimental terms), focusing on 15 to 16-year-old students learning physics and chemistry, and the aim was to assess ChatGPT after training the students in its use and potential limitations (verified by the authors within the first term, and re-evaluated over time until the present). Therefore, students knew about the AI’s limitations before facing the chatbot’s answers, so they would not trust those answers so strongly. Eighty percent of students recognized ChatGPT was a useful educational tool, highlighting its ability to provide correct answers to theoretical questions and a correct theoretical approach to solve problems, in spite of minor mathematical hallucinations. However, 65% of students still thought the numerical result was correct (when it sometimes was not), totally in line with previous results suggesting students strongly trust AI answers, believing misconceptions such as AI’s super-intelligence and its ability to know everything [108].
Concerning students’ proficiency, our findings support previous conclusions that AI chatbots enhance learning performance, verified in other educational areas such as language learning [109,110,111] and even through a meta-analysis [61] claiming the positive impact of AI chatbots on several learning outcomes.
Previous publications [107] described potential limitations of ChatGPT, including: (1) an effectiveness that has not been fully tested, (2) the quality of the data used to train the AI, and (3) the complexity of the tasks to be performed by ChatGPT. The present study demonstrates that ChatGPT might overcome those limitations in certain contexts. With the knowledge level restricted to well-established science, such as chemistry and physics for K-12 students, the difficulty of the tasks performed by the AI was moderate to low, and the quality of the well-known data used to train the AI in this context was high (the few mistakes of the AI during the performance assessment were related to a mathematical hallucination and to the handling of image inputs/outputs, which should be improved but are not related to the quality of the data). Those reasons, and the powerful AI model ultimately employed (GPT-4), might explain the remarkable theoretical performance obtained (97%) within the specified context. Once ChatGPT was applied to mentor students while teachers were unavailable, such an educational tool promoted a 30% increase in the grades of students within the experimental group. Surprisingly, this finding is not aligned with the conclusions of previous studies assessing the effect of AI chatbots across different educational levels. According to Garzón and Acevedo [112], the support from AI chatbots significantly improved university students’ learning outcomes, but there was no significant effect on primary school and secondary school students’ outcomes. Some studies [113] verified that the lower language competency of primary school students might hamper effective interaction with AI chatbots. However, the present study demonstrated not only that secondary school students are able to create a meaningful and effective interaction with ChatGPT (15 to 16-year-old students have sufficiently high language competencies), but also that the impact of AI on secondary school students was remarkable. The improvements in AI training over the course of the present study (2023 and 2024), together with the AI’s ability to interact meaningfully with the user, may account for this positive change of trend.
In order to evaluate these results, it is important to describe the educational context. The Spanish education law (LOMCE at the moment, evolving towards LOMLOE) established a curricular system for K-12 students which, particularly for the fourth year of compulsory secondary education (hereinafter fourth ESO), introduced basic concepts such as molar concentration, the ideal gas law, stoichiometry, and simple calculations regarding the mass conservation law or the limiting reagent. The fourth ESO is the last course of compulsory education in Spain, and even if the “physics and chemistry” subject is optional at this level, it is frequently chosen among the different options by a high number of students. However, a significant number of students will not pursue further baccalaureate or university studies, so the teacher must do their best to keep motivation at a high level. Therefore, blended approaches involving AI, like the one proposed in this study, might contribute to motivating and engaging students in an efficient manner. Besides this advantage, there is another important issue to consider in the temporal educational context: the academic years 2019/2020, 2020/2021, and 2021/2022 involved major educational changes due to the COVID-19 pandemic, and despite the employment of new ICT resources, most students still showed gaps in their STEM learning process. Thus, ChatGPT might not only raise students’ motivation and engagement, but also assist teachers in their tasks of guiding and supporting students (when teachers were unavailable), solving their particular doubts and even bridging those educational gaps in a time-independent manner, allowing all students to catch up and gain confidence. Therefore, the obtained results suggested AI might become an educational tool able to democratize a higher level of knowledge acquisition, without the need for parents’/tutors’/private teachers’ help, thus promoting students’ autonomy and security.
This idea aligns with the conclusions of previous studies emphasizing the intertwined evolution of society, education, and technology [114]. New opportunities for a more inclusive, accessible and effective education might arise when using ChatGPT as an assistive technology that automates communication in this field. While the use of AI could respond to societal needs (such as providing students with assistance in a time- and location-independent manner), ChatGPT might also have an impact on society and education, and this impact might promote the development of more responsible technological advancements, at the same time enabling newer learning opportunities and an ever closer AI−human interaction [115].
Many advantages of the use of ChatGPT as an educational tool might explain the obtained results, among them: the rise in students’ confidence (solving in real time all the doubts a student has, even those from their background knowledge, to understand theoretical concepts and problem-solving exercises, correcting their homework and locating potential mistakes); the increase in their motivation and engagement (testing a disruptive educational tool which implies a meaningful interaction with a virtual entity, with a certain degree of gamification); and the benefits of virtual mentoring, such as completely personalized learning (students ask exactly what they need), location- and time-independent learning (students might use AI at any place, with any medium, e.g., PC or mobile phone, and at any time they want to learn), and the meaningful interaction with the chatbot, ensuring long-term knowledge acquisition (functional support) and improving students’ sense of care and reassurance (emotional support). Again, these conclusions align with previous research providing reasons why students assisted by AI could increase their learning performance: increased confidence, motivation, self-efficacy, etc. [61].
This study suggested some social and pedagogical implications regarding the utilization of ChatGPT as an educational tool in K-12 physics and chemistry classes. The first implication came from the findings suggesting ChatGPT improved students’ proficiency, as this might help students with diverse learning difficulties (e.g., social or familial problems, lack of attention, lack of problem-solving abilities (a common barrier for many students), or educational gaps from COVID or other sources), provided those difficulties do not hamper the human−AI interaction. Furthermore, the results demonstrated ChatGPT could increase the classroom’s average proficiency from low-medium to medium-high, with only the help of the complementary use of ChatGPT at home. Therefore, the educational system would obtain better results at no additional cost for the state, and teachers might face a lower number of students’ doubts in the classroom (enabling more dynamic and higher-quality classes and reducing teachers’ stress, thus improving teachers’ health and motivation and contributing to a lower number of sick leaves, with both educational and economic benefits for students and the educational system, respectively). Furthermore, ChatGPT being free (at least in its open version), the proposed educational approach might contribute to reducing disparities in education by providing universal access to learning resources, thus democratizing a higher level of knowledge acquisition.
The study also suggested ChatGPT provided not only functional but also emotional support, which might also promote students’ autonomy, security and self-efficacy. This might help address a common problem in current society, where both parents work and do not have much time and/or knowledge to help their children solve their doubts at home, when the teacher is unavailable, at least for K-12 students learning science.
Despite improving students’ proficiency, ChatGPT still experienced some hallucinations, and students demonstrated a tendency to blindly trust the AI’s answers. This implies that teaching AI literacy to students and training them in prompt engineering are required, both to develop students’ critical thinking and to obtain the best results with the proposed educational approach. Again, it is important to highlight the role of teachers, as they must direct the educational approach even if the complementary use of ChatGPT takes place when they are unavailable (i.e., at home).
Lastly, the complementary use of ChatGPT might be a reliable pedagogic tool for the development of soft skills and key competences such as autonomy, critical thinking, communication, problem solving, time management, etc. ChatGPT’s intuitive interface and sophisticated language model provide students with clear educational information, which can be a big advantage, especially in mandatory courses. Because ChatGPT is a language model capable of performing simple calculations, it can provide information related to well-established scientific concepts and applications, including non-sophisticated equations or formulas (sufficient for K-12 students). This gives students confidence and develops a positive perception of ChatGPT based on critical thinking. These soft skills might be relevant for students to pass their “physics and chemistry” subject in compulsory education, but also helpful in other areas of life.
The main limitation of this study lay in the limited number of students participating in the interventional non-randomized experiment (23 students within one science class in a single school in Benaguasil, Valencia, Spain). However, this study aimed to provide the science education community with the first proof of the positive impact of using ChatGPT as a virtual mentor of K-12 students learning science when teachers were unavailable, so permission to conduct such a new interventional study was restricted to only one class. After the positive results obtained within this study, highlighting a 30% improvement in students’ grades after the intervention, and with the aim of overcoming the existing limitations already discussed, future experimental (randomized) analyses including a higher number of students, a broader diversity (considering age, gender, nationality, ethnicity, ability, religion, socioeconomic conditions, experience, sexual orientation or geographical diversity) and a broader scope (different levels and even different subjects) will soon be conducted to gain a more complete understanding of the AI impact on education. New studies aimed at evaluating the practical implications of the use of ChatGPT following the proposed methodological approach are also foreseen, in order to provide education stakeholders with a broader insight into this area: whether the approach is reproducible and/or useful in different contexts (levels and subjects), whether it might also improve the learning outcomes of students with learning difficulties, studies assessing the impact on teachers, and studies measuring the impact on students’ autonomy, security and self-efficacy, as well as on students’ soft skills and key competences.
Finally, two potential biases might have influenced the results: the Hawthorne effect (the modification of human behavior when the individual being observed is aware of being studied) and the John Henry effect (the change in behavior of individuals belonging to a control group, who try to compensate for their apparent disadvantage). In this case, no solutions to those biases are foreseen for future studies, as the individuals participating in such experiments must use, or avoid using, AI, so they are evidently aware of being studied, and also conscious of belonging to the experimental or control group.
Despite the limitations and potential biases, the study suggested ChatGPT might be a useful educational tool able to provide K-12 students learning science with the functional and emotional support they might need, also reducing disparities in education by providing universal access to learning resources, democratizing a higher level of knowledge acquisition with no additional help from parents/tutors, and promoting students’ autonomy, security and self-efficacy. The results prove ChatGPT’s experimental capacity (and huge potential) to assist teachers in their mentoring tasks when they are unavailable, paving the way for future studies that will allow for a more realistic perception of AI’s impact on education.
As a final recommendation, students and teachers should be exhaustively trained in order to unleash the vast potential predicted for AI in the field of education. This would allow them to better exploit the benefits of AI technologies over time, and also to gain insight into their potential risks, biases and limitations. Extending learning experiences such as the one proposed within the present study to other classes/subjects/schools could benefit both teachers and students, as they would gain experience training (with) AI, exploring its strengths and weaknesses. Teachers could complete new performance tests to assess the capacity of different AIs to provide correct answers within specific fields and knowledge levels, analyzing the real possibilities to be exploited by students in designed learning experiences. For their part, students might benefit from those learning experiences led by their teachers by becoming familiar with a powerful educational tool that might present some limitations (remaining always alert to detect potential biases and shortcomings) but might also offer them both functional and emotional support, promoting their knowledge acquisition, exercising their critical thinking, and improving their confidence.

6. Conclusions

Current reviews on the impact of AI on education conclude that its main limitation is the lack of empirical studies assessing the effect of using AI on students’ learning outcomes, a gap that is even more relevant within the field of science education.
As a consequence, this study aimed to evaluate the real impact of using ChatGPT as a virtual mentor on K-12 students learning science, within the frame of a blended learning educational strategy, complementing constructivist/connectivist presential learning with student-centered, self-regulated, location- and time-independent cybergogy. More specifically, AI was meant to assist teachers when they were unavailable, by virtually addressing students’ doubts and homework correction in real time, within a safe environment, through a personalized, meaningful and flexible learning experience independent of time and location, providing students with the support, advice and surveillance they might require.
Firstly, the real competence of ChatGPT within K-12 chemistry and physics was systematically verified through a test designed for human students, paying special attention to encoding and the use of open-ended questions. ChatGPT provided answers with remarkable accuracy and insight into theoretical questions, and a clear, detailed and well-organized procedure to find the solution to problem-solving questions. The consistent and noticeable results obtained through this performance test (9.3/10 in 2023 with the GPT-3.5 model, and 9.7/10 in 2024 with the GPT-4 model) demonstrated ChatGPT provided trustworthy answers (97%) in the field of K-12 science education, despite some minor limitations that were duly discussed. These results suggested ChatGPT might be a reliable tool to help K-12 students learning science to reinforce theoretical knowledge and enhance their problem-solving skills with no time or location restrictions, when teachers are unavailable.
Furthermore, several aspects of the use of ChatGPT within the proposed pedagogical approach that promote meaningful interactions with students were discussed from a theoretical perspective, considering the current media landscape.
Then, the real impact of using ChatGPT as a virtual mentor on K-12 students learning science, when teachers were unavailable, was assessed through a quasi-experimental analysis. The learning outcomes monitored before and after the intervention were students’ proficiency and students’ perception of AI as a useful educational tool.
On the one hand, the grades of students belonging to the experimental group increased by 30% after the intervention, almost three times the improvement of the control group, and this gain was manifested by 90% of students in the experimental group, independently of their level of proficiency. The class’s average grade improved from a low-medium to a medium-high level of proficiency, with only the additional help of ChatGPT as an educational tool assisting students when teachers were unavailable, verifying the functional support AI might offer to students.
On the other hand, students’ perception of AI as a useful educational tool was measured through a Likert scale, reaching an average of 4.05 (agree), with most students (70%) agreeing or strongly agreeing that ChatGPT was a useful educational tool. Furthermore, the study also revealed that students with no parent/tutor able to help them with their particular doubts when teachers were unavailable felt reassured by being able to count on ChatGPT, verifying that AI provided not only functional but also social/emotional support.
After a thorough discussion, the study concluded that ChatGPT might be a useful educational tool able to furnish K-12 students learning science with the functional and social/emotional support they might require, democratizing a higher level of knowledge acquisition without parent/tutor help, and promoting students’ autonomy, security and self-efficacy. These results prove ChatGPT’s outstanding capacity (and vast potential) to assist teachers in their mentoring tasks when they are unavailable, laying the foundations of virtual mentoring and paving the way for more discussion and future empirical studies, extending the research to other areas and levels and allowing us to obtain a more realistic perception of AI’s impact on education.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/knowledge4040031/s1. The document contains the study to assess ChatGPT’s performance in the field of chemistry and physics for K-12 students, and the material used to evaluate ChatGPT’s impact on real K-12 (15 to 16-year-old) students learning chemistry.

Author Contributions

Conceptualization, D.O.d.Z. and R.C.; methodology, R.C., A.M.-G.-A. and D.O.d.Z.; software, V.J.G.; validation, R.C. and D.O.d.Z.; formal analysis, D.O.d.Z., L.M. and R.C.; investigation, D.O.d.Z., R.C., V.J.G., J.N.A., F.J.D.-F., M.S.L., M.G., E.P.-C., T.M., Á.B. and L.M.; resources, F.J.D.-F., Á.B. and T.M.; data curation, R.C., M.S.L. and J.N.A.; writing—original draft preparation, D.O.d.Z. and R.C.; writing—review and editing, R.C., V.J.G., T.M., L.M., F.J.D.-F., M.S.L., J.N.A., M.G., E.P.-C., Á.B., A.M.-G.-A. and D.O.d.Z.; visualization, M.G. and E.P.-C.; supervision, R.C. and D.O.d.Z.; project administration, R.C. and D.O.d.Z.; funding acquisition, D.O.d.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the 1964 Declaration of Helsinki and its later amendments, the 1978 Belmont report, the EU Charter of Fundamental Rights (26 October 2012), the national and European ethical standards (European Network of Research Ethics Committees), and the EU General Data Protection Regulation (2016/679).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are unavailable because of privacy preservation reasons. Requests to access the data will require a certificate of approval by the I.E.S. Benaguasil, and the participants’ consent.

Acknowledgments

The authors acknowledge students from I.E.S. Benaguasil participating within the study, and the I.E.S. for supporting the research. A.B. gratefully acknowledges financial support from the Spanish national project No. PID2022-137857NA-I00. A.B. thanks MICINN for the Ramon y Cajal Fellowship (grant No. RYC2021-030880-I). F.J.D.-F. acknowledges the Next Generation EU program, Spanish National Research Council (Ayuda Margarita Salas), and Universitat Politècnica de València (PAID-06-23). E.P.-C acknowledges funding from Generalitat Valenciana (Grant No. SEJIGENT/2021/039) and AGENCIA ESTATAL DE INVESTIGACIÓN of Ministerio de Ciencia e Innovacion (PID2021-128442NA-I00).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Schwab, K. The Fourth Industrial Revolution. Foreign Affairs, 12 December 2015. [Google Scholar]
  2. Wang, Y.; Ma, H.S.; Yang, J.H.; Wang, K.S. Industry 4.0: A way from mass customization to mass personalization production. Adv. Manuf. 2017, 5, 311–320. [Google Scholar] [CrossRef]
  3. Schwab, K. The Fourth Industrial Revolution: What It Means, How to Respond. World Economic Forum, 2016. Available online: https://www.weforum.org/agenda/2016/01/the-fourth-industrial-revolution-what-it-means-and-how-to-respond/ (accessed on 19 February 2023).
  4. Hilbert, M.; López, P. The World’s Technological Capacity to Store, Communicate, and Compute Information. Science 2011, 332, 60–65. [Google Scholar] [CrossRef]
  5. Esposito, M. World Economic Forum White Paper: Driving the Sustainability of Production Systems with Fourth Industrial Revolution Innovation. World Economic Forum, 2018. Available online: https://www.researchgate.net/publication/322071988_World_Economic_Forum_White_Paper_Driving_the_Sustainability_of_Production_Systems_with_Fourth_Industrial_Revolution_Innovation (accessed on 20 February 2023).
  6. Bondyopadhyay, P.K. In the beginning [junction transistor]. Proc. IEEE 1998, 86, 63–77. [Google Scholar] [CrossRef]
  7. What Are Industry 4.0, The Fourth Industrial Revolution, and 4IR? McKinsey, 17 August 2022. Available online: https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-are-industry-4-0-the-fourth-industrial-revolution-and-4ir (accessed on 20 February 2023).
  8. Bai, C.; Dallasega, P.; Orzes, G.; Sarkis, J. Industry 4.0 technologies assessment: A sustainability perspective. Int. J. Prod. Econ. 2020, 229, 107776. [Google Scholar] [CrossRef]
  9. Marr, B. Why Everyone Must Get Ready for the 4th Industrial Revolution. Forbes, 2016. Available online: https://www.forbes.com/sites/bernardmarr/2016/04/05/why-everyone-must-get-ready-for-4th-industrial-revolution/?sh=366e89503f90 (accessed on 22 February 2023).
  10. Mudzar, N.M.B.M.; Chew, K.W. Change in Labour Force Skillset for the Fourth Industrial Revolution: A Literature Review. Int. J. Technol. 2022, 13, 969–978. [Google Scholar] [CrossRef]
  11. Goldin, T.; Rauch, E.; Pacher, C.; Woschank, M. Reference Architecture for an Integrated and Synergetic Use of Digital Tools in Education 4.0. Procedia Comput. Sci. 2022, 200, 407–417. [Google Scholar] [CrossRef]
  12. Cónego, L.; Pinto, R.; Gonçalves, G. Education 4.0 and the Smart Manufacturing Paradigm: A Conceptual Gateway for Learning Factories. In Smart and Sustainable Collaborative Networks 4.0; Camarinha-Matos, L.M., Boucher, X., Afsarmanesh, H., Eds.; PRO-VE 2021. IFIP Advances in Information and Communication Technology; Springer: Cham, Switzerland, 2021; Volume 629. [Google Scholar]
  13. Costan, E.; Gonzales, G.; Gonzales, R.; Enriquez, L.; Costan, F.; Suladay, D.; Atibing, N.M.; Aro, J.L.; Evangelista, S.S.; Maturan, F.; et al. Education 4.0 in Developing Economies: A Systematic Literature Review of Implementation Barriers and Future Research Agenda. Sustainability 2021, 13, 12763. [Google Scholar] [CrossRef]
  14. González-Pérez, L.I.; Ramírez-Montoya, M.S. Components of Education 4.0 in 21st Century Skills Frameworks: Systematic Review. Sustainability 2022, 14, 1493. [Google Scholar] [CrossRef]
  15. Bonfield, C.A.; Salter, M.; Longmuir, A.; Benson, M.; Adachi, C. Transformation or evolution?: Education 4.0, teaching and learning in the digital age. High. Educ. Pedagog. 4th Ind. Revolut. 2020, 5, 223–246. [Google Scholar] [CrossRef]
  16. Miranda, J.; Navarrete, C.; Noguez, J.; Molina-Espinosa, J.M.; Ramírez-Montoya, M.S.; Navarro-Tuch, S.A.; Bustamante-Bello, M.R.; Rosas-Fernández, J.B.; Molina, A. The core components of education 4.0 in higher education: Three case studies in engineering education. Comput. Electr. Eng. 2021, 93, 107278. [Google Scholar] [CrossRef]
  17. Chiu, W.-K. Pedagogy of Emerging Technologies in Chemical Education during the Era of Digitalization and Artificial Intelligence: A Systematic Review. Educ. Sci. 2021, 11, 709. [Google Scholar] [CrossRef]
  18. Mhlanga, D.; Moloi, T. COVID-19 and the digital transformation of education: What are we learning on 4IR in South Africa? Educ. Sci. 2020, 10, 180. [Google Scholar] [CrossRef]
  19. Peterson, L.; Scharber, C.; Thuesen, A.; Baskin, K. A rapid response to COVID-19: One district’s pivot from technology integration to distance learning. Inf. Learn. Sci. 2020, 121, 461–469. [Google Scholar] [CrossRef]
  20. Guo, Y.J.; Chen, L.; Guo, Y.; Chen, L. An Investigation on Online Learning for K12 in Rural Areas in China during COVID-19 Pandemic. In Proceedings of the Ninth International Conference of Educational Innovation through Technology (EITT), Porto, Portugal, 13–17 December 2020; pp. 13–18. [Google Scholar]
  21. Mogos, R.; Bodea, C.N.; Dascalu, M.; Lazarou, E.; Trifan, L.; Safonkina, O.; Nemoianu, I. Technology enhanced learning for industry 4.0 engineering education. Rev. Roum. Des Sci. Tech.—Ser. Électrotechnique Énergétique 2018, 63, 429–435. [Google Scholar]
  22. Moraes, E.B.; Kipper, L.M.; Hackenhaar Kellermann, A.C.; Austria, L.; Leivas, P.; Moraes, J.A.R.; Witczak, M. Integration of Industry 4.0 technologies with Education 4.0: Advantages for improvements in learning. Interact. Technol. Smart Educ. 2022, 20, 271–287. [Google Scholar] [CrossRef]
  23. Ciolacu, M.I.; Tehrani, A.F.; Binder, L.; Svasta, P. Education 4.0—Artificial Intelligence Assisted Higher Education: Early recognition System with Machine Learning to support Students’ Success. In Proceedings of the IEEE 24th International Symposium for Design and Technology in Electronic Packaging (SIITME), Iași, Romania, 25–28 October 2018; pp. 23–30. [Google Scholar]
  24. Chen, Z.; Zhang, J.; Jiang, X.; Hu, Z.; Han, X.; Xu, M.; Savitha; Vivekananda, G.N. Education 4.0 using artificial intelligence for students performance analysis. Intel. Artif. 2020, 23, 124–137. [Google Scholar]
  25. Tahiru, F. AI in Education: A Systematic Literature Review. J. Cases Inf. Technol. 2021, 23, 1–20. [Google Scholar] [CrossRef]
  26. Miao, F.; Holmes, W.; Huang, R.; Zhang, H. AI and Education Guidance for Policy-Makers; UNESCO Publishing: Paris, France, 2021; Available online: https://unesdoc.unesco.org/ark:/48223/pf0000376709 (accessed on 8 March 2023).
  27. Carbonell, J.R. AI in CAI: An artificial-intelligence approach to computer-assisted instruction. IEEE Trans. Man-Mach. Syst. 1970, 11, 190–202. [Google Scholar] [CrossRef]
  28. Psotka, J.; Massey, L.D.; Mutter, S.A. (Eds.) Intelligent Tutoring Systems: Lessons Learned; Lawrence Erlbaum Associates, Inc.: Mahwah, NJ, USA, 1988. [Google Scholar]
  29. Piramuthu, S. Knowledge-based web-enabled agents and intelligent tutoring systems. IEEE Trans. Educ. 2005, 48, 750–756. [Google Scholar] [CrossRef]
  30. Mousavinasab, E.; Zarifsanaiey, N.; Kalhori, S.R.N.; Rakhshan, M.; Keikha, L.; Saeedi, M.G. Intelligent tutoring systems: A systematic review of characteristics, applications, and evaluation methods. Interact. Learn. Environ. 2021, 29, 142–163. [Google Scholar] [CrossRef]
  31. Alrakhawi, H.; Jamiat, N.; Abu-Naser, S. Intelligent tutoring systems in education: A systematic review of usage, tools, effects and evaluation. J. Theor. Appl. Inf. Technol. 2023, 101, 1205–1226. [Google Scholar]
  32. Song, D.; Oh, E.Y.; Rice, M. Interacting with a conversational agent system for educational purposes in online courses. In Proceedings of the 10th International Conference on Human System Interactions (HSI), Ulsan, Republic of Korea, 17–19 July 2017; pp. 78–82. [Google Scholar]
  33. Shute, V.J.; Psotka, J. Intelligent Tutoring Systems: Past, Present, and Future. Human Resources Directorate Manpower and Personnel Research Division. 1994, pp. 2–52. Available online: https://myweb.fsu.edu/vshute/pdf/shute%201996_d.pdf (accessed on 19 February 2023).
  34. VanLehn, K. The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educ. Psychol. 2011, 46, 197–221. [Google Scholar] [CrossRef]
  35. Fernoaga, P.V.; Sandu, F.; Stelea, G.A.; Gavrila, C. Intelligent Education Assistant Powered by Chatbots. In Proceedings of the 14th International Scientific Conference of eLearning and Software for Education (eLSE), Bucharest, Romania, 19–20 April 2018; pp. 376–383. [Google Scholar]
  36. Hamam, D. The New Teacher Assistant: A Review of Chatbots’ Use in Higher Education. In HCI International 2021—Posters. HCII 2021. Communications in Computer and Information Science; Stephanidis, C., Antona, M., Ntoa, S., Eds.; Springer: Cham, Switzerland, 2021; Volume 1421. [Google Scholar]
  37. Satu, M.S.; Parvez, M.H.; Al-Mamun, S. Review of integrated applications with AIML based chatbot. In Proceedings of the International Conference on Computer and Information Engineering (ICCIE), Rajshahi, Bangladesh, 26–27 November 2015; pp. 87–90. [Google Scholar]
  38. The State of AI in 2022—And a Half Decade in Review. Available online: https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2022-and-a-half-decade-in-review (accessed on 19 February 2023).
  39. The State of AI in 2023: Generative AI’s Breakout Year McKinsey AI Global Survey 2023. Available online: https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-AIs-breakout-year#/ (accessed on 18 April 2024).
  40. Maslej, N.; Fattorini, L.; Perrault, R.; Parli, V.; Reuel, A.; Brynjolfsson, E.; Etchemendy, J.; Ligett, K.; Lyons, T.; Manyika, J.; et al. The AI Index 2024 Annual Report; AI Index Steering Committee, Institute for Human-Centered AI: Stanford, CA, USA, 2024. [Google Scholar]
  41. Lam, R.; Sanchez-Gonzalez, A.; Willson, M.; Wirnsberger, P.; Fortunato, M.; Alet, F.; Ravuri, S.; Ewalds, T.; Eaton-Rosen, Z.; Hu, W.; et al. Learning skillful medium-range global weather forecasting. Science 2023, 382, 1416–1421. [Google Scholar] [CrossRef]
  42. Merchant, A.; Batzner, S.; Schoenholz, S.S.; Aykol, M.; Cheon, G.; Cubuk, E.D. Scaling deep learning for materials discovery. Nature 2023, 624, 80–85. [Google Scholar] [CrossRef]
  43. Boiko, D.A.; MacKnight, R.; Kline, B.; Gomes, G. Autonomous chemical research with large language models. Nature 2023, 624, 570–578. [Google Scholar] [CrossRef] [PubMed]
  44. Rudolph, J.; Tan, S.; Tan, S. ChatGPT: Bullshit spewer or the end of traditional assessments in higher education? J. Appl. Learn. Teach. 2023, 6, 1. [Google Scholar]
  45. Castelvecchi, D. Are ChatGPT and AlphaCode going to replace programmers? Nature 2022. [Google Scholar] [CrossRef]
  46. Tung, L. ChatGPT Can Write Code. Now Researchers Say It’s Good at Fixing Bugs, Too. ZDNET, 2023. Available online: https://www.zdnet.com/article/chatgpt-can-write-code-now-researchers-say-its-good-at-fixing-bugs-too/ (accessed on 13 March 2023).
  47. Stokel-Walker, C. AI bot ChatGPT writes smart essays—Should professors worry? Nature 2022. Available online: https://www.nature.com/articles/d41586-022-04397-7 (accessed on 13 March 2023). [CrossRef]
  48. Kung, T.H.; Cheatham, M.; Medenilla, A.; Sillos, C.; De Leon, L.; Elepaño, C.; Madriaga, M.; Aggabao, R.; Diaz-Candido, G.; Maningo, J.; et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit. Health 2023, 2, e0000198. [Google Scholar] [CrossRef]
  49. Koe, C. ChatGPT Shows Us How to Make Music with ChatGPT. Published Online 27 January 2023. Available online: https://musictech.com/news/gear/ways-to-use-chatgpt-for-music-making/ (accessed on 13 March 2023).
  50. Zheng, Z.; Zhang, O.; Borgs, C.; Chayes, J.T.; Yaghi, O.M. ChatGPT Chemistry Assistant for Text Mining and the Prediction of MOF Synthesis. J. Am. Chem. Soc. 2023, 145, 18048–18062. [Google Scholar] [CrossRef]
  51. Pradhan, T.; Gupta, O.; Chawla, G. The Future of ChatGPT in Medicinal Chemistry: Harnessing AI for Accelerated Drug Discovery. ChemistrySelect 2024, 9, e202304359. [Google Scholar] [CrossRef]
  52. Zhang, W.; Wang, Q.; Kong, X.; Xiong, J.; Ni, S.; Cao, D.; Niu, B.; Chen, M.; Li, Y.; Zhang, R.; et al. Fine-tuning Large Language Models for Chemical Text Mining. Chem. Sci. 2024, 15, 10600–10611. [Google Scholar] [CrossRef] [PubMed]
  53. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. Available online: https://arxiv.org/pdf/2303.08774.pdf (accessed on 21 March 2023).
  54. Roose, K. The Brilliance and Weirdness of ChatGPT. The New York Times, 5 December 2022. Available online: https://www.nytimes.com/2022/12/05/technology/chatgpt-ai-twitter.html (accessed on 14 March 2023).
  55. Sanders, N.E.; Schneier, B. Opinion|How ChatGPT Hijacks Democracy. The New York Times, 15 January 2023. Available online: https://archive.is/Cyaac (accessed on 14 March 2023).
  56. García-Peñalvo, F.J. La percepción de la Inteligencia Artificial en contextos educativos tras el lanzamiento de ChatGPT: Disrupción o pánico. Educ. Knowl. Soc. (EKS) 2023, 24, e31279. [Google Scholar] [CrossRef]
  57. Chomsky, N.; Roberts, I.; Watumull, J. Opinion|Noam Chomsky: The False Promise of ChatGPT. The New York Times, 12 March 2023. Available online: https://archive.is/SM77M (accessed on 14 March 2023).
  58. Lo, C.K. What Is the Impact of ChatGPT on Education? A Rapid Review of the Literature. Educ. Sci. 2023, 13, 410. [Google Scholar] [CrossRef]
  59. Terwiesch, C. Would ChatGPT Get a Wharton MBA? A Prediction Based on Its Performance in the Operations Management Course; Mack Institute for Innovation Management at the Wharton School, University of Pennsylvania: Philadelphia, PA, USA, 2023; Available online: https://mackinstitute.wharton.upenn.edu/wp-content/uploads/2023/01/Christian-Terwiesch-Chat-GTP.pdf (accessed on 13 March 2023).
  60. Mogali, S.R. Initial impressions of ChatGPT for anatomy education. Anat. Sci. Educ. 2023, 17, 444–447. [Google Scholar] [CrossRef]
  61. Wu, R.; Yu, Z. Do AI chatbots improve students’ learning outcomes? Evidence from a meta-analysis. Br. J. Educ. Technol. 2023, 55, 10–33. [Google Scholar] [CrossRef]
  62. Rospigliosi, P. Artificial intelligence in teaching and learning: What questions should we ask of ChatGPT? Interact. Learn. Environ. 2023, 31, 1–3. [Google Scholar] [CrossRef]
  63. Pavlik, J.V. Collaborating With ChatGPT: Considering the Implications of Generative Artificial Intelligence for Journalism and Media Education. J. Mass Commun. Educ. 2023, 78, 84–93. [Google Scholar] [CrossRef]
  64. Jeon, J.; Lee, S. Large language models in education: A focus on the complementary relationship between human teachers and ChatGPT. Educ. Inf. Technol. 2023, 28, 15873–15892. [Google Scholar] [CrossRef]
  65. Luan, L.; Lin, X.; Li, W. Exploring the Cognitive Dynamics of Artificial Intelligence in the Post-COVID-19 and Learning 3.0 Era: A Case Study of ChatGPT. arXiv 2023, arXiv:2302.04818. [Google Scholar]
  66. Rahman, M.M.; Watanobe, Y. ChatGPT for Education and Research: Opportunities, Threats, and Strategies. Appl. Sci. 2023, 13, 5783. [Google Scholar] [CrossRef]
  67. Malinka, K.; Peresíni, M.; Firc, A.; Hujnák, O.; Janus, F. On the Educational Impact of ChatGPT: Is Artificial Intelligence Ready to Obtain a University Degree? In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education (ITiCSE 2023), V. 1, Turku, Finland, 10–12 July 2023; pp. 47–53. [Google Scholar]
  68. Zhai, X. ChatGPT for Next Generation Science Learning (20 January 2023). Available online: https://ssrn.com/abstract=4331313 (accessed on 13 March 2023). [CrossRef]
  69. Wollny, S.; Schneider, J.; Di Mitri, D.; Weidlich, J.; Rittberger, M.; Drachsler, H. Are We There Yet?—A Systematic Literature Review on Chatbots in Education. Front. Artif. Intell. 2021, 4, 654924. [Google Scholar]
  70. Cooper, G. Examining Science Education in ChatGPT: An Exploratory Study of Generative Artificial Intelligence. J. Sci. Educ. Technol. 2023, 32, 444–452. [Google Scholar] [CrossRef]
  71. Dos Santos, R.P. Enhancing Chemistry Learning with ChatGPT, Bing Chat, Bard, and Claude as Agents-to-Think-With: A Comparative Case Study. arXiv 2023, arXiv:2311.00709. [Google Scholar] [CrossRef]
  72. Schulze Balhorn, L.; Weber, J.M.; Buijsman, S.; Hildebrandt, J.R.; Ziefle, M.; Schweidtmann, A.M. Empirical assessment of ChatGPT’s answering capabilities in natural science and engineering. Sci. Rep. 2024, 14, 4998. [Google Scholar] [CrossRef]
  73. Su, J.; Yang, W. Unlocking the Power of ChatGPT: A Framework for Applying Generative AI in Education. ECNU Rev. Educ. 2023, 6, 355–366. [Google Scholar]
  74. Mercadé, L.; Díaz-Fernández, F.J.; Lozano, M.S.; Navarro-Arenas, J.; Gómez, M.; Pinilla-Cienfuegos, E.; de Zárate, D.O.; Hernández, V.J.G.; Díaz-Rubio, A. Research mapping in the teaching environment: Tools based on network visualizations for a dynamic literature review. In Proceedings of the INTED2023 Proceedings, Valencia, Spain, 6–8 March 2023; pp. 3916–3922. [Google Scholar]
  75. Mercadé, L.; de Zárate, D.O.; Barreda, A.; Pinilla-Cienfuegos, E. Leveraging artificial intelligence and problem-based learning to foster critical analysis and scientific communication in graduate students. In Proceedings of the INTED2023 Proceedings, Valencia, Spain, 6–8 March 2023; pp. 6175–6179. [Google Scholar]
  76. Barreda, A.; García-Cámara, B.; de Zárate Díaz, D.O.; Pinilla-Cienfuegos, E.; Mercadé, L. Utilizing artificial intelligence as a tool to enhance student participation in the classroom through the effective evaluation of research works’ quality. In Proceedings of the INTED2023 Proceedings, Valencia, Spain, 6–8 March 2023; pp. 2547–2554. [Google Scholar]
  77. Bizami, N.A.; Tasir, Z.; Kew, S.N. Innovative pedagogical principles and technological tools capabilities for immersive blended learning: A systematic literature review. Educ. Inf. Technol. 2023, 28, 1373–1425. [Google Scholar] [CrossRef]
  78. Chen, C.K.; Huang, N.T.N.; Hwang, G.J. Findings and implications of flipped science learning research: A review of journal publications. Interact. Learn. Environ. 2022, 30, 949–966. [Google Scholar] [CrossRef]
  79. Stahl, B.C.; Eke, D. The ethics of ChatGPT—Exploring the ethical issues of an emerging technology. Int. J. Inf. Manag. 2024, 74, 102700. [Google Scholar] [CrossRef]
  80. Wu, X.; Duan, R.; Ni, J. Unveiling security, privacy, and ethical concerns of ChatGPT. J. Inf. Intell. 2024, 2, 102–115. [Google Scholar] [CrossRef]
  81. Peng, L.; Zhao, B. Navigating the ethical landscape behind ChatGPT. Big Data Soc. 2024, 11, 20539517241237488. [Google Scholar] [CrossRef]
  82. Zhou, J.; Müller, H.; Holzinger, A.; Chen, F. Ethical ChatGPT: Concerns, Challenges, and Commandments. arXiv 2023, arXiv:2305.10646. [Google Scholar]
  83. Frieder, S.; Pinchetti, L.; Griffiths, R.R.; Salvatori, T.; Lukasiewicz, T.; Petersen, P.C.; Chevalier, A.; Berner, J. Mathematical Capabilities of ChatGPT. arXiv 2023, arXiv:2301.13867. [Google Scholar]
  84. Ferrara, E. Should ChatGPT be Biased? Challenges and Risks of Bias in Large Language Models. arXiv 2023, arXiv:2304.03738. [Google Scholar]
  85. Jenkinson, J. Measuring the Effectiveness of Educational Technology: What are we Attempting to Measure? Electron. J. e-Learn. 2009, 7, 273–280. [Google Scholar]
  86. Gilson, A.; Safranek, C.W.; Huang, T.; Socrates, V.; Chi, L.; Taylor, R.A.; Chartash, D. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med. Educ. 2023, 9, e45312. [Google Scholar] [CrossRef]
  87. Turing, A.M. Computing Machinery and Intelligence. Mind 1950, LIX, 433–460. [Google Scholar]
  88. Boone, H.N.; Boone, D.A. Analyzing Likert Data. J. Ext. 2012, 50, 1–5. [Google Scholar]
  89. Wu, S.; Wang, F. Artificial intelligence-based simulation research on the flipped classroom mode of listening and speaking teaching for English majors. Mob. Inf. Syst. 2021, 4344244. [Google Scholar] [CrossRef]
  90. Klos, M.C.; Escoredo, M.; Joerin, A.; Lemos, V.N.; Rauws, M.; Bunge, E.L. Artificial Intelligence-Based Chatbot for Anxiety and Depression in University Students: Pilot Randomized Controlled Trial. JMIR Form. Res. 2021, 5, e20678. [Google Scholar] [CrossRef]
  91. Mishra, P.; Singh, U.; Pandey, C.M.; Mishra, P.; Pandey, G. Application of Student’s t-test, analysis of variance, and covariance. Ann. Card. Anaesth. 2019, 22, 407–411. [Google Scholar] [CrossRef]
  92. Hsu, M.H.; Chen, P.S.; Yu, C.S. Proposing a task-oriented chatbot system for EFL learners speaking practice. Interact. Learn. Environ. 2021, 31, 4297–4308. [Google Scholar] [CrossRef]
  93. Ilgaz, H.B.; Çelik, Z. The Significance of Artificial Intelligence Platforms in Anatomy Education: An Experience with ChatGPT and Google Bard. Cureus 2023, 15, e45301. [Google Scholar] [CrossRef]
  94. Litt, E.; Zhao, S.; Kraut, R.; Burke, M. What Are Meaningful Social Interactions in Today’s Media Landscape? A Cross-Cultural Survey. Soc. Media + Soc. 2020, 6, 2056305120942888. [Google Scholar] [CrossRef]
  95. Cooper, H.; Okamura, L.; Gurka, V. Social activity and subjective well-being. Personal. Individ. Differ. 1992, 13, 573–583. [Google Scholar] [CrossRef]
  96. Hilvert-Bruce, Z.; Neill, J.T.; Sjöblom, M.; Hamari, J. Social motivations of live-streaming viewer engagement on Twitch. Comput. Hum. Behav. 2018, 84, 58–67. [Google Scholar] [CrossRef]
  97. Offer, S. Family time activities and adolescents’ emotional well-being. J. Marriage Fam. 2013, 75, 26–41. [Google Scholar] [CrossRef]
  98. Gonzales, A.L. Text-based communication influences self-esteem more than face-to-face or cellphone communication. Comput. Hum. Behav. 2014, 39, 197–203. [Google Scholar] [CrossRef]
  99. Brennan, S.E. The grounding problem in conversations with and through computers. In Social and Cognitive Approaches to Interpersonal Communication; Fussell, S.R., Kreuz, R.J., Eds.; Lawrence Erlbaum: Hillsdale, NJ, USA, 1998; pp. 201–225. [Google Scholar]
  100. Boothby, E.J.; Clark, M.S.; Bargh, J.A. Shared experiences are amplified. Psychol. Sci. 2014, 25, 2209–2216. [Google Scholar] [CrossRef]
  101. Maslow, A.H. Preface to motivation theory. Psychosom. Med. 1943, 5, 85–92. [Google Scholar] [CrossRef]
  102. Deci, E.L.; Ryan, R.M. Autonomy and need satisfaction in close relationships: Relationships motivation theory. In Human Motivation and Interpersonal Relationships: Theory, Research, and Applications; Springer: Berlin/Heidelberg, Germany, 2014; pp. 53–73. [Google Scholar]
  103. Burleson, B.R. The experience and effects of emotional support: What the study of cultural and gender differences can tell us about close relationships, emotion and interpersonal communication. Pers. Relatsh. 2003, 10, 1–23. [Google Scholar] [CrossRef]
  104. Cook, T.D.; Campbell, D.T. Quasi-Experimentation: Design & Analysis Issues for Field Settings, 1st ed.; Rand McNally: Chicago, IL, USA, 1979. [Google Scholar]
  105. Caspersen, J.; Smeby, J.C.; Aamodt, P.O. Measuring learning outcomes. Eur. J. Educ. 2017, 52, 20–30. [Google Scholar] [CrossRef]
  106. The Importance of Grades. Urban Education Institute. University of Chicago. 2017. Available online: https://uei.uchicago.edu/sites/default/files/documents/UEI%202017%20New%20Knowledge%20-%20The%20Importance%20of%20Grades.pdf (accessed on 13 July 2024).
  107. Moon, J.; Yang, R.; Cha, S.; Kim, S.B. ChatGPT vs. Mentor: Programming Language Learning Assistance System for Beginners. In Proceedings of the 2023 IEEE 8th International Conference on Software Engineering and Computer Systems (ICSECS), Penang, Malaysia, 25–27 August 2023; pp. 106–110. [Google Scholar]
  108. Ding, L.; Li, T.; Jiang, S.; Gapud, A. Students’ perceptions of using ChatGPT in a physics class as a virtual tutor. Int. J. Educ. Technol. High. Educ. 2023, 20, 63. [Google Scholar] [CrossRef]
  109. Kim, N.Y. A study on the use of artificial intelligence chatbots for improving English grammar skills. J. Digit. Converg. 2019, 17, 37–46. [Google Scholar]
  110. Mageira, K.; Pittou, D.; Papasalouros, A.; Kotis, K.; Zangogianni, P.; Daradoumis, A. Educational AI chatbots for content and language integrated learning. Appl. Sci. 2022, 12, 3239. [Google Scholar] [CrossRef]
  111. Hwang, W.Y.; Guo, B.C.; Hoang, A.; Chang, C.C.; Wu, N.T. Facilitating authentic contextual EFL speaking and conversation with smart mechanisms and investigating its influence on learning achievements. Comput. Assist. Lang. Learn. 2024, 37, 2095406. [Google Scholar] [CrossRef]
  112. Garzón, J.; Acevedo, J. Meta-analysis of the impact of augmented reality on students’ learning gains. Educ. Res. Rev. 2019, 27, 244–260. [Google Scholar] [CrossRef]
  113. Jeon, J. Exploring AI chatbot affordances in the EFL classroom: Young learners’ experiences and perspectives. Comput. Assist. Lang. Learn. 2022, 37, 1–26. [Google Scholar] [CrossRef]
  114. Watson, S.; Romic, J. ChatGPT and the entangled evolution of society, education, and technology: A systems theory perspective. Eur. Educ. Res. J. 2024, 14749041231221266. [Google Scholar] [CrossRef]
  115. Anderson, P.W. More is different. Science 1972, 177, 393–396. [Google Scholar] [CrossRef]
Figure 1. Types of questions posed to ChatGPT: (a) by discipline; (b) by nature; (c) by discipline and nature.
Figure 2. Assessment of ChatGPT’s performance in the field of chemistry and physics for 15- to 16-year-old students in 2023: (a) final score including only fully correct answers; (b) final score including partially correct answers; (c) final score including partially correct answers, by discipline.
Figure 3. Assessment of ChatGPT’s performance in the field of chemistry and physics for 15- to 16-year-old students in 2024: (a) final score including only fully correct answers; (b) final score including partially correct answers; (c) final score including partially correct answers, by discipline.
Figure 4. Question 1, concerning the working time devoted to completing sessions (a) 1 and 2; (b) 3 and 4.
Figure 5. Question 3, regarding students’ perception of their understanding of (a) theoretical concepts and (b) the application of those theoretical concepts.
Figure 6. Question 4, concerning the correctness of (a) the approach used to solve the exercise and (b) the numerical result, as well as (c) the usefulness of ChatGPT as an educational tool.
Figure 7. Second KPI: students’ grades in the first and second terms (before and after the intervention) for (a) the control group and (b) the experimental group.
Figure 8. Question 4, concerning the usefulness of ChatGPT as an educational tool, after two terms of using the AI.
Table 1. Results obtained by ChatGPT on the 52-question test assessing ChatGPT’s performance in the field of chemistry and physics for 15- to 16-year-old students in 2023.
Question   Score     Question   Score     Question   Score
1          1         19         1         37         1
2          1         20         1         38         0
3          1         21         0.67      39         1
4          1         22         1         40         1
5          0         23         1         41         1
6          1         24         1         42         1
7          1         25         1         43         1
8          1         26         1         44         1
9          0.50      27         1         45         1
10         1         28         1         46         1
11         1         29         1         47         1
12         1         30         1         48         0.67
13         1         31         1         49         1
14         1         32         1         50         0.50
15         1         33         1         51         1
16         1         34         1         52         1
17         1         35         1
18         1         36         1         Final Score: 9.3/10
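For reference, the 9.3/10 in Table 1 is consistent with simply averaging the 52 per-question scores and rescaling to a 10-point scale. That averaging rule is our reading of the table, not a procedure the article states; the minimal Python sketch below reproduces the 2023 figure under that assumption.

    # Assumption: final score = mean per-question score, rescaled to 10.
    # The non-perfect scores below are those listed in Table 1 (2023).
    scores = [1.0] * 52  # full credit for every question by default
    for question, score in {5: 0.0, 9: 0.50, 21: 0.67, 38: 0.0, 48: 0.67, 50: 0.50}.items():
        scores[question - 1] = score  # overwrite the partially correct / incorrect answers

    final_score = 10 * sum(scores) / len(scores)
    print(f"{final_score:.1f}/10")  # -> 9.3/10, matching Table 1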
Table 2. Results obtained by ChatGPT on the 52-question test assessing ChatGPT’s performance in the field of chemistry and physics for 15- to 16-year-old students in 2024.
Question   Score     Question   Score     Question   Score
1          1         19         1         37         1
2          1         20         1         38         1
3          1         21         0.67      39         1
4          1         22         1         40         1
5          0         23         1         41         1
6          1         24         1         42         1
7          1         25         1         43         1
8          1         26         1         44         1
9          0.50      27         1         45         1
10         1         28         1         46         1
11         1         29         1         47         1
12         1         30         1         48         1
13         1         31         1         49         1
14         1         32         1         50         0.50
15         1         33         1         51         1
16         1         34         1         52         1
17         1         35         1
18         1         36         1         Final Score: 9.7/10
Table 3. Assessment of students’ proficiency by comparing students’ grades before and after one term of using ChatGPT as a virtual mentor, through paired-sample t-tests applied to the control and experimental groups.
                                     Control Group            Experimental Group
                                     Before      After        Before      After
Mean                                 5.62        6.69         4.37        7.11
Variance                             6.8225      5.2588       2.5190      4.3867
Observations                         4           4            19          19
Pearson correlation coefficient      0.7697                   0.5951
Hypothesized difference of means     0                        0
Degrees of freedom                   3                        18
t statistic                          −1.2654                  −6.9602
P(T ≤ t), one-tailed                 0.1475                   8.3829 × 10⁻⁷
t critical value, one-tailed         2.3533                   1.7341
P(T ≤ t), two-tailed                 0.2951                   1.6766 × 10⁻⁶
t critical value, two-tailed         3.1824                   2.1009
