Article

Exploring the Scientific Validity of ChatGPT’s Responses in Elementary Science for Sustainable Education

by
Yoonsung Choi
Department of Science Education, Cheongju National University of Education, Cheongju-si 28690, Republic of Korea
Sustainability 2025, 17(7), 2962; https://doi.org/10.3390/su17072962
Submission received: 2 February 2025 / Revised: 23 March 2025 / Accepted: 24 March 2025 / Published: 27 March 2025

Abstract

As AI integration in education increases, it is crucial to evaluate its effectiveness in elementary science learning, particularly in promoting sustainable education through equitable access to knowledge. This study assesses the validity and applicability of responses generated by ChatGPT 3.5 (free version) in elementary Earth and Space science. A document analysis of 1200 AI-generated responses was conducted to evaluate scientific validity, explanatory clarity, and pedagogical relevance. The analysis employed quantitative methods to assess accuracy and alignment with elementary curricula, while qualitative insights identified linguistic and conceptual challenges. The findings indicate that 94.2% of responses were scientifically valid and 70.6% were clear, but only 12.8% aligned with elementary curricula. While ChatGPT provides accurate information, many responses included complex terminology unsuitable for young learners, and the 87.2% of responses that were misaligned with the curriculum pose challenges for effective classroom integration. Despite these limitations, ChatGPT shows potential for simplifying scientific concepts and expanding access to educational resources. Refining AI-generated content through curriculum-based filtering, adaptive language processing, and teacher mediation is necessary. Strengthening AI-driven educational strategies with a sustainability focus can ensure long-term improvements in equitable science learning. This study highlights the need for further research on optimizing AI tools for elementary education.

1. Introduction

The advancement of generative artificial intelligence (GenAI) has introduced transformative opportunities in education, particularly through large language models such as ChatGPT [1,2]. These AI-powered tools offer instant responses to scientific inquiries, supporting students’ comprehension of scientific concepts and promoting inquiry-based learning [1,2]. In this context, ChatGPT has the potential to bridge educational disparities by enhancing accessibility and reducing cognitive barriers, aligning with Sustainable Development Goal 4 (SDG 4), which emphasizes inclusive and equitable quality education [3,4]. However, the increased adoption of AI in classrooms necessitates rigorous evaluation of its reliability, particularly in ensuring that ChatGPT-generated responses maintain scientific validity, explanatory clarity, and pedagogical appropriateness [5].
In elementary science education, particularly in Earth and Space science, AI-generated responses should be not only scientifically accurate but also developmentally appropriate for young learners [6]. The discipline of Earth science encompasses multiple subfields, including geology, meteorology, oceanography, and astronomy, requiring an interdisciplinary approach to scientific inquiry [7]. Unlike higher education settings, elementary science education does not focus merely on knowledge acquisition; it must also consider students’ cognitive development, conceptual understanding, and the transition from everyday language to scientific discourse [8]. Rather than merely delivering scientific terms and facts, elementary science instruction requires an approach that supports students as they gradually move from intuitive, everyday concepts to formal scientific language [9]. In this context, an essential question emerges: Can ChatGPT provide scientifically valid, clear, and pedagogically appropriate responses that align with the developmental needs of elementary students and facilitate their conceptual understanding?
Prior research has explored the general role of AI in education, highlighting its potential to reduce teachers’ workloads, personalize learning experiences, and foster critical thinking [10]. However, few studies have closely examined the quality and educational appropriateness of AI-generated responses within the context of elementary science education, particularly in terms of scientific accuracy, clarity of explanation, and developmental suitability for young learners. Although recent empirical studies have begun to evaluate ChatGPT’s response quality in scientific domains—for instance, in natural science and engineering education contexts [11]—similar investigations in elementary science education remain scarce. To date, little is known about how AI-generated explanations align with the cognitive abilities and pedagogical needs of students in K–6 science classrooms. This gap is especially significant considering that elementary students are in a transitional phase where they shift from intuitive, everyday concepts to structured scientific reasoning. If ChatGPT provides responses that are scientifically inaccurate or overly complex, it may inadvertently reinforce misconceptions rather than support learning, thereby challenging the long-term viability of AI integration in education [12].
Ensuring the reliability of AI responses is not merely a technical challenge but a foundational requirement for sustainable learning environments [3]. The long-term integration of AI in education depends on its ability to provide trustworthy and developmentally appropriate support [13]. Inaccurate or inconsistent AI-generated content can erode students’ trust in digital learning tools and negatively impact their literacy [14]. Conversely, refining AI models to deliver more precise and pedagogically sound explanations can enhance AI’s role as a sustainable educational resource [15]. Therefore, evaluating ChatGPT’s ability to meet these educational standards is a critical step toward integrating AI into sustainable teaching and learning frameworks [8].

1.1. The Role of ChatGPT in Sustainable Science Education

ChatGPT has emerged as a transformative tool in science education, contributing to both instructional efficiency and student engagement. One of its most significant roles is supporting sustainable education by enhancing accessibility, personalization, and equity in learning. By automating lesson planning and resource generation, ChatGPT reduces the workload on teachers, allowing them to allocate more time to fostering deeper learning experiences and interactive discussions. This efficiency is crucial in creating a sustainable teaching model where AI assists educators without replacing the critical human element of instruction [15,16,17].
Another essential contribution of ChatGPT to sustainable education is its ability to generate differentiated learning materials [18,19]. Teachers can use ChatGPT to develop inquiry-based activities, scaffolded learning experiences, and adaptive assessments that align with students’ individual needs [20]. This level of customization ensures that students of diverse learning backgrounds receive equitable learning opportunities, thereby supporting the goal of inclusive and sustainable education [21,22]. By offering instant feedback and explanations, ChatGPT enables students to progress at their own pace, fostering self-directed learning that promotes long-term educational sustainability.
ChatGPT supports scientific inquiry by encouraging students to explore complex scientific phenomena interactively. The ability to ask AI-driven questions and receive immediate, tailored explanations enhances students’ engagement and scientific reasoning. For instance, when learning about climate change, students can ask ChatGPT about carbon cycles, renewable energy, or environmental policies, receiving explanations that connect scientific theory to real-world applications [12,23]. This process not only deepens understanding but also cultivates a mindset geared towards sustainable problem-solving and responsible scientific inquiry [22,24].
By integrating AI-powered learning tools such as ChatGPT into science education, schools can create more sustainable learning environments. These tools reduce educational disparities, enhance inquiry-based learning, and foster long-term engagement in science. However, ensuring their effectiveness requires continuous refinement of AI-generated content to align with educational goals and sustainability principles. The next section will explore the challenges and limitations associated with integrating ChatGPT into sustainable science education.

1.2. Addressing the Challenges of ChatGPT for Sustainable Science Education

While ChatGPT presents significant advantages in promoting personalized learning and scientific inquiry, its integration into science education also poses challenges that must be addressed to ensure its alignment with sustainable education principles [25]. The reliability, ethical implications, and pedagogical appropriateness of AI-generated responses play a crucial role in determining whether ChatGPT can be effectively integrated into sustainable science education frameworks.
One of the primary concerns regarding ChatGPT is the reliability of its responses. AI-generated content can sometimes include factual inaccuracies or oversimplifications, particularly in scientific contexts where precision is critical [16]. In cases where ChatGPT provides misleading or incorrect explanations, students may develop misconceptions, which can hinder their long-term scientific literacy. Sustainable education necessitates accuracy and consistency in information dissemination, emphasizing the need for rigorous verification mechanisms in AI-driven learning environments. Ensuring that ChatGPT provides scientifically valid and clear explanations is essential for maintaining trust in digital learning tools and fostering an environment conducive to long-term knowledge retention [5,15,26].
Ethical concerns also present significant challenges to the sustainable integration of ChatGPT into education [27]. Issues such as data privacy, algorithmic bias, and over-reliance on AI tools require the establishment of robust ethical frameworks to guide AI use in classrooms. Algorithmic bias, for instance, can lead to inequities in the representation of scientific knowledge, potentially reinforcing existing disparities in science education [3,18,23]. Additionally, excessive dependence on AI for information retrieval and content generation may reduce students’ engagement in critical thinking and independent problem-solving, contradicting the objectives of sustainable science education [28].
Another challenge is the preparedness of educators to incorporate AI tools effectively into their teaching practices. Many teachers lack formal training in AI literacy and may struggle to critically evaluate AI-generated content or integrate ChatGPT into their pedagogical strategies [10,16,29]. Sustainable science education requires a balanced approach where AI serves as a supplementary tool rather than a replacement for traditional instructional methods. Professional development programs focusing on AI literacy can equip teachers with the skills necessary to assess AI-generated content and leverage its strengths while mitigating its limitations [30].
To maximize the potential of ChatGPT in sustainable science education, strategic measures must be implemented. These include the continuous refinement of AI-generated content to align with scientific accuracy, the development of ethical guidelines to ensure fairness and inclusivity, and the provision of targeted AI literacy training for educators. Additionally, fostering a culture of critical evaluation among students can help them engage with AI-generated responses more discerningly, reinforcing their ability to analyze information critically and develop scientific reasoning skills [3,12,21]. By addressing these challenges, ChatGPT can contribute to the creation of a more sustainable and effective educational landscape, supporting equitable access to quality science education while ensuring the responsible use of AI-powered learning tools.

1.3. Research Purpose and Questions

The purpose of this study is to evaluate the validity of ChatGPT’s scientific responses in the Earth and Space domain of elementary science education and to explore its strengths and limitations as an educational tool. By assessing the reliability and pedagogical effectiveness of AI-generated content, this study aims to contribute to the development of a sustainable AI-assisted learning framework that ensures educational equity and scientific accuracy.
By examining its utility in facilitating student learning and supporting science instruction, this study provides insights into the effective integration of ChatGPT in elementary science education. Specifically, the study addresses the following research questions.
  • How valid are the scientific responses provided by ChatGPT in the Earth and Space domains of elementary science education?
  • What are the main types and characteristics of inappropriate scientific responses provided by ChatGPT in the Earth and Space domains of elementary science education?
These research questions not only assess the scientific validity of ChatGPT’s responses but also explore its potential role in supporting sustainable science education by identifying key areas for improvement. The findings of this study seek to offer actionable recommendations for the responsible and effective use of generative AI in elementary classrooms, ensuring that AI serves as a tool that enhances rather than compromises educational quality and sustainability.

2. Materials and Methods

In this study, document analysis was conducted to evaluate the scientific validity of ChatGPT’s responses in the Earth and Space domains of elementary science education in Korea. The term “document” refers to the responses generated by ChatGPT 3.5 when preservice elementary teachers posed questions within these domains. Each participant conducted five question-and-answer exchanges for each subunit of the curriculum, focusing on key topics in Earth and Space science. The responses generated by ChatGPT 3.5 were analyzed for their scientific validity, with individual sentences serving as the basic unit of analysis. Table 1 illustrates the structured steps followed by preservice teachers when formulating questions for ChatGPT 3.5. These steps align with the questioning framework presented in prior research on evaluating scientific inquiry using AI tools [31].

2.1. Elementary Science Education in the Earth and Space Domain Under the 2015 Revised Curriculum in Korea

This study focuses on the Earth and Space science domains in elementary science education, specifically aligning with the 2015 Revised Science Curriculum. The curriculum consists of eight units, with four units designated for grades 3–4 and four units for grades 5–6. Preservice elementary teachers participated in generating scientific questions based on these units. Each preservice teacher formulated five question-and-answer exchanges for each unit, ensuring a structured and balanced dataset.
The units covered in the study include “The Appearance of the Earth”, “Changes in the Earth’s Surface”, “Strata and Fossils”, and “Volcanoes and Earthquakes” for grades 3–4, and “The Solar System and Stars”, “Weather and Our Daily Lives”, “Earth’s and Moon’s Motions”, and “Seasonal Changes” for grades 5–6. These units were selected based on the general and specific guidelines of the 2015 Revised Science Curriculum, ensuring that the study aligns with the intended learning objectives in elementary science education. By incorporating these fundamental Earth and Space science topics, the study aimed to evaluate how ChatGPT responds to questions that elementary students are likely to encounter in real classroom settings.

2.2. Participants

A total of 30 preservice elementary teachers participated in the study, contributing to the formulation of scientific questions for ChatGPT. These participants were third-year students enrolled at A National University of Education in Korea. The participants were typically aged between 21 and 23 years old. While we did not collect specific data on gender distribution, the participants included both male and female students. They had completed coursework related to natural science, covering key concepts in physics, chemistry, biology, and Earth science relevant to elementary education.
Furthermore, all participants had prior experience in elementary science instructional material research, having taken courses that explored the development and application of science teaching resources. They had also studied scientific content specific to elementary science education, ensuring that their questions were well informed and pedagogically meaningful.
Each preservice teacher was given autonomy to independently generate five questions per unit within the Earth and Space science domains, resulting in a total of 40 question sets per participant. These questions were designed to reflect real classroom scenarios, where students might pose similar inquiries during science lessons. The structured nature of this process ensured that the dataset comprehensively covered the fundamental topics in elementary Earth and Space science education. The participants utilized the free version of ChatGPT to generate question sets over approximately 100 days from September to mid-December 2024.

2.3. Question Design Framework

The questions generated by participants were designed using a structured three-step framework specifically developed to evaluate AI-generated responses in science education. This framework, based on previous research [30], was designed to systematically construct scientifically relevant and pedagogically meaningful questions. It ensured consistency across the dataset and enabled a more rigorous analysis of ChatGPT’s responses.
This framework follows a Question–Reason–Certainty (Q–R–C) structure, in which participants pose a question, probe the model’s reasoning, and assess its confidence in the response. Every question formulated by the preservice teachers adhered to this structured approach. The following example, translated from Korean into English, illustrates the Q–R–C structure.
First, the ‘Question’ section is: “Does the amount of sodium chloride vary between different seas?”.
The response to this question is as follows: “Yes, the amount of sodium chloride (salt) varies among different seas. The salinity of seawater depends on factors such as regional climate, precipitation, evaporation rates, river inflows, and ocean currents. For example: The Atlantic Ocean has relatively high salinity due to high evaporation and low precipitation. Polar seas, such as the Arctic Ocean, have lower salinity because of glacier melting and significant freshwater inflows. The Red Sea has the highest salinity in the world due to extremely high evaporation rates”.
Next, the ‘Reason’ section is: “Why do salinity levels and the amount of sodium chloride differ depending on regional characteristics?”.
ChatGPT’s response to this question is as follows. “The variation in salinity levels and sodium chloride content across different regions is due to water circulation and environmental factors. The key factors include Evaporation: In regions with strong solar radiation, seawater evaporates significantly, reducing the amount of water and increasing salinity. (e.g., the Red Sea, the Mediterranean Sea), Precipitation and River Inflows: When there is heavy rainfall or significant river inflow, seawater is diluted, leading to lower salinity. (e.g., equatorial regions and the Arctic Ocean), Glacier Melting: In polar regions, the melting of glaciers introduces freshwater into the ocean, decreasing salinity. Ocean Currents: Currents mix high-salinity and low-salinity waters, helping regulate salinity levels. However, in specific regions where currents are weak, salinity levels tend to remain constant. Geographical Characteristics: Enclosed seas or bays with high evaporation and low river inflows tend to have higher salinity. (e.g., the Dead Sea has the highest salinity in the world.) In conclusion, interactions among evaporation, freshwater inflows, climate conditions, and geographical features contribute to the differences in salinity levels and sodium chloride concentrations in various seas”.
Finally, the ‘Certainty’ section states: “Are you confident in your answer?”.
ChatGPT provided the following response: “Yes, this explanation is based on scientifically established facts. Numerous studies have demonstrated that factors such as evaporation, precipitation, river inflows, ocean currents, and glacier melting lead to regional variations in ocean salinity. If more specific data or numerical values regarding the salinity of a particular sea are needed, additional verification can be provided”.
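To make the procedure concrete, the following Python sketch shows how one Q–R–C exchange could be captured for analysis. It is illustrative only: participants in this study used the free ChatGPT web interface, and the `ask_chatgpt` function is a hypothetical stand-in for whatever querying mechanism is available.

```python
# Illustrative sketch of one Question-Reason-Certainty (Q-R-C) exchange.
# `ask_chatgpt` is a hypothetical stand-in for the model interface;
# in the study, the three prompts shared one conversation context in
# the free ChatGPT web interface.

from typing import Callable

def run_qrc_exchange(question: str, reason_prompt: str,
                     ask_chatgpt: Callable[[str], str]) -> dict:
    """Collect one Q-R-C record for later sentence-level analysis."""
    record = {"question": question}
    # Step 1 (Question): pose the scientifically relevant question.
    record["answer"] = ask_chatgpt(question)
    # Step 2 (Reason): prompt the model to justify its answer.
    record["reasoning"] = ask_chatgpt(reason_prompt)
    # Step 3 (Certainty): ask the model to rate its own confidence.
    record["certainty"] = ask_chatgpt("Are you confident in your answer?")
    return record
```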

2.3.1. Step 1: Question Formulation

In the first step, participants were required to construct scientifically relevant questions that addressed specific topics within the Earth and Space domains. These questions were aligned with the elementary science curriculum and included key concepts such as geological processes, weather systems, and celestial phenomena.
Additionally, efforts were made to reflect the types of inquiries students might encounter or pose in classroom discussions. For example, participants generated questions such as “What causes earthquakes, and how do they impact the Earth’s surface?” and “How does the water cycle contribute to weather patterns?”. These questions were designed to be clear, precise, and directly aligned with the learning objectives of elementary science education. This step emphasized the importance of constructing meaningful inquiries that enable ChatGPT to generate scientifically accurate and pedagogically valuable responses.

2.3.2. Step 2: Reasoning Articulation

The second step was designed not to promote blind acceptance of ChatGPT’s responses but to support evaluation of their validity and reasoning. In this stage, preservice teachers were required to prompt ChatGPT to provide a logical rationale for its answers. This process helped contextualize the question, connect it to fundamental scientific principles, and align it with educational objectives.
The primary goal of this step was to determine whether ChatGPT’s response was scientifically grounded or speculative. This process of reasoning articulation ensured that each question was clearly formulated within a scientific context and served an explicit educational purpose. It also helped participants evaluate how relevant the question was to elementary science education and how it could foster meaningful learning experiences.
Through this step, participants were encouraged to critically reflect on their understanding of the subject and assess whether their questions were scientifically sound and pedagogically meaningful. This deliberate and structured approach contributed to a more rigorous and systematic question design process. For example, if a participant asked, “Why does the ocean appear blue?”, they would need to prompt ChatGPT with “Can you provide a scientific explanation for why the ocean appears blue?” to ensure that the response was based on light absorption and scattering principles rather than speculation. Similarly, if a participant asked, “How do volcanic eruptions affect the atmosphere?”, they would refine their prompt by requesting, “Please explain how volcanic eruptions influence the Earth’s atmosphere based on scientific principles”. This would encourage ChatGPT to provide reasoning grounded in concepts such as gas emissions, particulate matter, and climate impact. By incorporating this structured prompting strategy, participants ensured that their questions were framed within a scientific context and served a clear educational purpose.

2.3.3. Step 3: Certainty Assessment

In the final step, participants asked ChatGPT to assess its confidence in the responses it provided. This process involved prompting ChatGPT with questions such as “Are you confident in your response?” or “How certain are you about this answer?” to verify its certainty level. ChatGPT’s responses were then categorized into two levels by the researchers: “Highly Certain” and “Uncertain”.
The final stage encouraged self-reflection and critical evaluation, allowing participants to review the accuracy and appropriateness of their questions. Additionally, by explicitly asking ChatGPT about its confidence level, participants were able to gain insights into the reliability of AI-generated responses.
This process not only helped participants and researchers critically assess the validity of ChatGPT’s responses but also allowed them to identify misconceptions or areas where ChatGPT might lack confidence.
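As a rough illustration of the two-level scheme, a keyword heuristic like the sketch below could flag hedged replies. In this study, however, the categorization into “Highly Certain” and “Uncertain” was performed by the researchers, not automatically, so the marker list here is an assumption for demonstration only.

```python
# Illustrative only: the researchers categorized certainty statements
# manually. This heuristic merely demonstrates the two-level scheme.

HEDGES = ("not sure", "uncertain", "cannot confirm", "may be inaccurate")

def categorize_certainty(reply: str) -> str:
    text = reply.lower()
    return "Uncertain" if any(h in text for h in HEDGES) else "Highly Certain"
```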

2.3.4. Justification of the Framework

This three-step framework was selected because it aligns with best practices in inquiry-based learning and provides a systematic approach for evaluating AI-generated content [30]. By supporting participants in constructing meaningful questions, this framework also enabled a structured analysis of ChatGPT’s responses. Moreover, requiring participants to engage in reflective reasoning and confidence assessment fostered a deeper understanding of scientific concepts and allowed exploration of how AI can complement human understanding.

2.4. Data Collection

In this study, we analyzed 1200 questions generated by preservice elementary teachers, with each participant creating 40 questions related to the Earth and Space domain of elementary science education. The questions were evenly distributed across eight Earth science units in the elementary science curriculum (grades 3–6), with each preservice teacher formulating five question sets per unit. A total of 30 preservice teachers participated, resulting in the collection of 1200 question–answer sets, ensuring a relatively balanced representation of key Earth science concepts. This structured distribution method was designed to minimize bias toward specific topics and to ensure that ChatGPT’s responses could be evaluated across various elementary science concepts.
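The composition of the dataset follows directly from this sampling design, as the brief check below confirms (counts as reported in this section).

```python
# Dataset composition as reported: 30 preservice teachers x 8 units
# (grades 3-6) x 5 question sets per unit.
participants, units, sets_per_unit = 30, 8, 5
assert units * sets_per_unit == 40                    # sets per participant
assert participants * units * sets_per_unit == 1200   # total Q-A sets
```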
To maintain consistency in question design, all questions followed a three-step framework consisting of question formulation, reasoning articulation, and certainty assessment. Participants were provided with guidance to construct scientifically valid and pedagogically meaningful questions covering diverse subtopics in the Korean elementary science curriculum, such as the shape of the Earth, surface changes, earthquakes and volcanoes, the water cycle, and Earth’s rotation. This approach allowed the dataset to reflect a broad and relatively comprehensive range of elementary science content.
All questions created by the preservice teachers were submitted through an online platform, ensuring a streamlined data collection process and maintaining consistency in response formats. This standardized approach facilitated the systematic organization of data, enhancing the efficiency of subsequent analyses.
Additionally, the generated questions were designed to align as closely as possible with the types of inquiries elementary students might naturally pose during science lessons. For instance, questions such as “How do earthquakes affect the Earth’s surface?” and “Why does the moon appear to change shape?” were developed based on topics that students are likely to explore in elementary science classrooms. This approach provided valuable insights into the potential role of ChatGPT as a supportive tool for elementary science education.
All 1200 collected questions were preserved in their original form for analysis, with careful attention given to maintaining data integrity. Each question served as an individual data point for evaluating the scientific validity and educational applicability of ChatGPT’s responses. The structured nature of the dataset allowed for a more systematic examination of the characteristics of AI-generated content.
The data collection process in this study was designed with the broader objective of exploring the potential applications of ChatGPT in elementary science education. By distributing the questions across all eight Earth science units, we aimed to minimize topic bias and ensure that ChatGPT’s responses could be assessed in relation to a wide range of elementary science concepts. This approach provided opportunities for a comprehensive evaluation of ChatGPT’s applicability in elementary science education.

2.5. Data Analysis

This study employed a systematic analysis process to evaluate the scientific validity and appropriateness of ChatGPT’s responses in the Earth and Space domains of elementary science education and to explore the types and characteristics of inappropriate responses. The analysis was guided by an evaluation framework specifically adapted from prior research, including [31,32]. This framework consisted of three criteria: scientific validity, explanatory clarity, and pedagogical relevance. These criteria were refined through an iterative process to align with the educational needs and cognitive levels of elementary students.
The process of developing the framework involved adapting the original criteria to the context of elementary science education. The authors of [31] proposed scientific accuracy and explanatory aspects as primary evaluation criteria for AI-generated content, while [32] emphasized five dimensions: contextual relevance, factual accuracy, completeness, logical coherence, and grammatical accuracy. To address the unique characteristics of ChatGPT’s responses and their suitability for elementary education, these criteria were restructured into three key categories. Scientific validity assessed the alignment of responses with established scientific principles, integrating the focus on factual accuracy from both studies. Explanatory clarity was defined as the degree to which ChatGPT’s responses were logically structured, grammatically accurate, and comprehensible to elementary students, incorporating the explanatory aspects from [31] and the contextual coherence and accuracy dimensions from [32]. Pedagogical relevance evaluated whether the responses supported the learning objectives and inquiry-based approaches central to elementary science education, adapting the completeness and contextual relevance dimensions from [32] (Table 1).
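For illustration, the three-criterion framework can be represented as a simple record per evaluated response; the field names in the sketch below are assumptions for demonstration, not part of the published framework.

```python
# A minimal record structure (field names are illustrative assumptions)
# for scoring one ChatGPT response against the three criteria.
from dataclasses import dataclass

@dataclass
class EvaluatedResponse:
    question: str
    response: str
    scientifically_valid: bool    # consistent with established science
    explanatorily_clear: bool     # logical, grammatical, age-comprehensible
    pedagogically_relevant: bool  # supports elementary learning objectives
    notes: str = ""               # qualitative observations (e.g., jargon)
```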
This framework was applied to analyze a dataset of question–response pairs generated by ChatGPT in response to questions crafted by preservice elementary teachers. The dataset encompassed topics such as volcanoes, the water cycle, atmospheric phenomena, and celestial mechanics. Each response was evaluated independently against the three criteria to ensure consistency and objectivity in the analysis.
Scientific validity was analyzed by comparing ChatGPT’s responses to authoritative scientific sources and elementary science textbooks. For instance, the response “Volcanic eruptions occur due to the increased pressure of magma caused by tectonic plate movements” was considered valid, while a response attributing eruptions to atmospheric pressure was deemed invalid. This process ensured the factual accuracy of ChatGPT’s outputs in scientifically sensitive topics.
Explanatory clarity was assessed by examining grammatical accuracy, logical structure, and comprehensibility. For example, the question “Where does water go after it evaporates?” was evaluated as clear when ChatGPT responded, “Water evaporates into the atmosphere and forms clouds”. However, responses with overly complex terminology or ambiguous phrasing were flagged as lacking clarity.
Pedagogical relevance was evaluated by determining whether ChatGPT’s responses aligned with educational objectives and supported inquiry-based learning. For example, a response to the question “Why do the phases of the moon change?” was considered relevant if it accurately described the positions of the Sun, Moon, and Earth while fostering further exploration or discussion. Responses that lacked real-world connections or failed to align with learning goals were categorized as less relevant.
The analysis also addressed the second research question regarding the types and characteristics of inappropriate responses. Responses lacking scientific validity included factual inaccuracies, such as attributing volcanic eruptions to increased atmospheric oxygen levels. Responses lacking explanatory clarity often contained overly technical language or vague explanations, such as “Water vapor forms thermodynamic equilibrium during evaporation”. Pedagogical irrelevance was observed in responses that failed to support educational objectives, such as describing the phases of the Moon as “daily changes in the Moon’s surface structure”.
To uncover recurring patterns of inadequacy, a thematic analysis was conducted [33]. This highlighted strengths, such as ChatGPT’s ability to explain straightforward scientific concepts, and weaknesses, particularly in addressing complex or interdisciplinary topics. Qualitative observations provided deeper insight into these limitations and informed potential improvements in integrating ChatGPT into elementary science education.
The reliability and validity of the analysis were ensured through a rigorous review process. The primary researcher conducted an initial evaluation of all responses, applying the framework consistently. The findings were then reviewed by two science education experts and a professor from a university’s department of science education, whose feedback was incorporated through iterative discussions and revisions. This collaborative process enhanced the robustness and applicability of the framework, ensuring a comprehensive evaluation of ChatGPT’s capabilities and limitations within the context of elementary science education.

3. Results

To address the first research question, a set of 1200 scientific question–response pairs was analyzed. This analysis examined the appropriateness of ChatGPT’s responses against three criteria: scientific validity, explanatory clarity, and pedagogical relevance.

3.1. Statistical Results of 1200 Questions

This study analyzed 1200 responses generated by ChatGPT in the context of elementary science education. The responses were evaluated based on three key criteria: scientific validity, explanatory clarity, and pedagogical relevance. The results indicate that scientific validity was exceptionally high at 94.2% (N = 1130), explanatory clarity was moderate at 70.6% (N = 849), and pedagogical relevance was notably low at 12.8% (N = 153). These findings suggest that while ChatGPT provides factually accurate responses, many of them contain complex language or advanced concepts that may not align well with elementary-level curricula.
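The reported proportions can be recomputed directly from the underlying counts, as in the short sketch below.

```python
# Recomputing shares from the reported counts (N = 1200 responses).
TOTAL = 1200
counts = {
    "scientific validity": 1130,   # reported as 94.2%
    "explanatory clarity": 849,    # reported as 70.6%
    "pedagogical relevance": 153,  # reported as 12.8%
}
for criterion, n in counts.items():
    print(f"{criterion}: {n}/{TOTAL} = {n / TOTAL:.1%}")
```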
The scientific validity of ChatGPT’s responses was overwhelmingly high, with 94.2% (N = 1130) of the responses accurately reflecting well-established scientific principles. These responses correctly explained various natural phenomena, adhered to fundamental scientific laws, and avoided common misconceptions. Examples of scientifically valid responses include explanations such as “Earthquakes occur due to the sudden release of energy caused by the movement of tectonic plates. The energy propagates as seismic waves, which cause the ground to shake”. This response effectively communicates the core concept of seismic activity and is factually correct. Similarly, the response “The phases of the Moon change because sunlight illuminates different portions of the Moon’s surface as it orbits Earth. The changing angles create the appearance of different phases” accurately explains how the Moon’s orbit influences its visible shape. However, a small proportion of responses, approximately 5.8% (N = 70), contained minor scientific inaccuracies. For instance, one response incorrectly stated that “Earthquakes happen when the ground suddenly moves because of underground air pressure”, which misrepresents the geological mechanisms behind seismic activity. Although these inaccuracies were infrequent, they highlight the need for continued refinement in AI-generated educational content.
While the scientific validity was strong, the explanatory clarity of the responses was lower, with 70.6% (N = 849) of responses being well structured, grammatically correct, and logically coherent. These responses provided clear explanations in a way that was accessible to learners. For example, “Rain forms when water vapor in the air cools down and turns into tiny droplets. These droplets come together to form clouds, and when they become too heavy, they fall as rain” effectively explains the concept of precipitation in a simple and structured manner. Similarly, “The Sun appears to move across the sky because Earth is rotating. As Earth turns, we see the Sun rise in the east and set in the west” helps students understand daily celestial motion in an intuitive way. However, 29.4% (N = 351) of responses lacked explanatory clarity due to excessive technical terminology, unnecessarily complex sentence structures, or overly detailed explanations. An example of a response with low explanatory clarity is: “Precipitation occurs when atmospheric water vapor undergoes condensation, coalesces into droplets, and subsequently reaches the ground due to gravitational pull”. This response, while scientifically accurate, includes complex vocabulary such as “coalesces” and “gravitational pull”, making it difficult for young learners to grasp.
The most significant challenge identified in the study was the low level of pedagogical relevance. Only 12.8% (N = 153) of the responses were fully aligned with elementary-level curricula, meaning they were presented in a manner appropriate for young students and reinforced key educational concepts effectively. Pedagogically relevant responses included explanations such as “A shadow forms when an object blocks light. The shape of the shadow depends on the object’s shape and the direction of the light”, which aligns well with elementary science standards and presents the concept in a clear and engaging way. Similarly, “The Moon does not make its own light. We can see it because it reflects sunlight” is a direct and simple way of explaining an important astronomical concept. However, the majority of responses, 87.2% (N = 1047), did not meet pedagogical relevance criteria. These responses often included unnecessary details, complex terminology, or advanced scientific concepts that are beyond the scope of elementary education. An example of a response that lacked pedagogical relevance is: “Seismic waves propagate in different forms: P-waves, which are compressional waves traveling faster through solids and liquids, and S-waves, which are shear waves that only move through solids”. While this explanation is scientifically valid, it introduces technical terms such as “compressional waves” and “shear waves”, which are typically covered in secondary education rather than elementary school. Another example is: “The human eye perceives colors based on the wavelength of light reflected off objects. Shorter wavelengths appear blue, while longer wavelengths appear red due to differential absorption and scattering”. This explanation, although accurate, requires a foundational understanding of physics that exceeds what is expected at the elementary level. Table 2 presents a summary of the overall quantitative findings and examples.
The overall findings suggest that while ChatGPT generates scientifically valid responses, a significant proportion of these responses are not well adapted for elementary-level learners. The primary issue affecting explanatory clarity was the use of technical terms and complex sentence structures, which made explanations difficult for young students to comprehend. The major challenge in pedagogical relevance was the frequent inclusion of advanced scientific concepts that surpass the elementary science curriculum.
These findings imply that AI-generated educational content, while factually reliable, requires further adjustments to improve readability and educational alignment for young learners. Future developments in AI-driven educational tools should focus on three key areas: (1) simplifying language and sentence structure to enhance explanatory clarity, (2) aligning responses with curriculum standards to improve pedagogical relevance, and (3) incorporating interactive and inquiry-based elements to make explanations more engaging and effective for elementary students.

3.2. Analyzing Inadequacies in AI-Generated Responses: Insights from Scientific Validity, Explanatory Clarity, and Pedagogical Relevance

The analysis of 1200 AI-generated responses in elementary science education revealed notable inadequacies in three key areas: scientific validity, explanatory clarity, and pedagogical relevance. While ChatGPT demonstrated a high level of scientific accuracy, its effectiveness was compromised by issues in explanatory clarity and pedagogical relevance. This section explores the specific inadequacies observed in each criterion and their implications for AI-assisted education.

3.2.1. Strengths and Weaknesses in Scientific Validity

The first major area of analysis focused on the scientific validity of AI-generated responses. While 94.2% (N = 1130) of responses were scientifically accurate, a small portion, 5.8% (N = 70), contained minor inaccuracies that could potentially lead to misconceptions. These inaccuracies were primarily categorized into two key issues.
First, oversimplification of scientific concepts was a frequent issue in responses that attempted to condense complex ideas into overly simplistic explanations. While simplification is necessary for younger learners, excessive reduction of content often resulted in the omission of critical scientific details. For instance, some responses described the phases of the Moon by stating, “The Moon changes shape because of Earth’s shadow”. This explanation, although understandable to students, misrepresents the actual cause, as the Moon’s phases are a result of its orbit around Earth and the way sunlight illuminates its surface.
Second, incorrect causal explanations were observed in a subset of responses that misrepresented fundamental scientific principles. While most inaccuracies were subtle, they had the potential to introduce misunderstandings if students interpreted them literally. For example, one response incorrectly claimed, “Earthquakes occur due to underground air pressure”. This explanation fails to acknowledge the role of tectonic plate movement as the primary cause of seismic activity. While students might recognize that earthquakes involve underground forces, this explanation attributes the phenomenon to an incorrect cause, leading to potential misconceptions.
These findings emphasize that even minor inaccuracies in AI-generated content can impact students’ understanding of scientific concepts. While the overall accuracy of responses was high, ensuring that all AI-generated explanations adhere strictly to scientific principles is essential. Future AI models designed for educational purposes should integrate enhanced verification mechanisms to identify and correct subtle inaccuracies before presenting information to learners. Additionally, AI systems should prioritize conceptual accuracy while maintaining accessibility, striking a balance between simplification and precision to prevent misinformation from being unintentionally introduced into elementary science education.

3.2.2. Challenge in Explanatory Clarity

The second major inadequacy observed in the AI-generated responses was the issue of explanatory clarity, affecting 29.4% of the responses. Many responses contained structural or linguistic complexities that made comprehension difficult for elementary learners. There were two primary factors contributing to this challenge.
First, the overuse of technical terminology made responses less accessible to young learners. Instead of providing simplified explanations, many responses included specialized scientific terms without offering additional context. For example, rather than stating that clouds form when water vapor cools and turns into droplets, some responses used unnecessarily complex phrasing, such as “Precipitation results from atmospheric condensation and coalescing water molecules”. This type of explanation assumes a level of prior knowledge that elementary students may not yet possess.
Second, lengthy and dense explanations further reduced clarity. Many responses attempted to include multiple scientific processes within a single explanation, leading to cognitive overload for young students. This issue was particularly evident in explanations of weather phenomena, where excessive detail made it difficult for learners to extract the essential information. An example of this can be seen in descriptions of precipitation, where some responses included references to gravitational pull and molecular interactions instead of focusing on the more straightforward process of rain formation.
These findings indicate that while AI-generated responses may be scientifically valid, their effectiveness is often hindered by the way information is presented. To improve clarity, AI systems should prioritize linguistic simplicity, concise explanations, and the avoidance of unnecessarily complex vocabulary. Ensuring that responses are structured in a way that aligns with elementary-level comprehension is crucial for enhancing the educational utility of AI-generated content.

3.2.3. The Disconnect Between AI-Generated Responses and Pedagogical Relevance

The third major inadequacy observed in AI-generated responses was the significant lack of pedagogical relevance, with only 12.8% (N = 153) of responses aligning appropriately with elementary science curricula. This indicates that the majority of responses (87.2%, N = 1047) contained information that, while scientifically valid, was not suitable for young learners. The primary factors contributing to this disconnect can be categorized into two key issues.
First, the use of advanced scientific concepts beyond elementary-level comprehension was a recurring issue. Many responses introduced technical terminology and explanations that were beyond what elementary students are expected to learn. For instance, rather than explaining that Earth’s surface is made of different types of rocks, some responses stated, “The lithosphere consists of igneous, sedimentary, and metamorphic rocks formed through diagenetic and magmatic processes”. While this explanation is scientifically accurate, it introduces unnecessary complexity and specialized terminology that exceeds the cognitive abilities of elementary students.
Second, a lack of alignment with elementary science curricula was evident in many responses. Some responses included information that is typically covered in secondary education, making them less effective for younger learners. For example, an AI-generated response describing seismic waves (P-waves and S-waves) as indicators of earthquake activity is technically correct but does not align with the elementary-level curriculum, which typically introduces earthquakes in a more general manner. Similarly, a response stating that “The Coriolis effect influences wind patterns by causing deflections based on Earth’s rotation” assumes prior knowledge of atmospheric dynamics that elementary students have not yet acquired.
These findings suggest that scientific accuracy alone does not ensure educational effectiveness. While AI-generated responses provide factually correct explanations, they often fail to consider the learning progression of elementary students. For AI to be effectively utilized in classrooms, it must be capable of adapting explanations to match students’ cognitive development and curriculum standards. This requires filtering out excessive scientific details and presenting information in a way that is both engaging and age-appropriate. Future improvements in AI-driven educational content should focus on curriculum-based filtering mechanisms to ensure that responses are not only scientifically valid but also pedagogically useful.
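One way to prototype such curriculum-based filtering is a vocabulary screen against a grade-banded term list, as sketched below. The term list and threshold are illustrative assumptions for this sketch, not a validated filter.

```python
# Illustrative vocabulary screen for curriculum-based filtering.
# The term list and threshold are assumptions for this sketch only.

ADVANCED_TERMS = {"lithosphere", "p-waves", "s-waves", "coriolis",
                  "diagenetic", "metamorphic", "coalesces"}

def exceeds_elementary_level(response: str, max_hits: int = 0) -> bool:
    """Flag responses that use more advanced terms than allowed."""
    tokens = (w.strip(".,;:()") for w in response.lower().split())
    hits = sum(1 for t in tokens if t in ADVANCED_TERMS)
    return hits > max_hits

# Example from Section 3.2.3: flagged as beyond elementary level.
print(exceeds_elementary_level(
    "The lithosphere consists of igneous, sedimentary, and metamorphic "
    "rocks formed through diagenetic and magmatic processes."))  # True
```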

4. Discussion and Conclusions

This study explored the potential and limitations of ChatGPT as an educational tool in elementary science, particularly in the domains of Earth and Space science. The findings indicate that while ChatGPT demonstrates strong scientific validity, there are challenges related to explanatory clarity and pedagogical alignment that may affect its effectiveness in elementary education. These results suggest that AI-based learning tools could be further refined to better accommodate the learning needs of young students and align more closely with curriculum standards. Additionally, as AI continues to be integrated into educational settings, it is important to consider how it may contribute to broader educational equity and accessibility goals, such as those outlined in SDG 4: Quality Education [25,34,35].
First, ChatGPT’s responses demonstrated a high level of scientific validity, aligning with established scientific principles. This suggests that AI-generated educational content has the potential to serve as a useful resource for reinforcing scientific concepts. In contexts where access to quality educational materials is limited, AI-based tools may help facilitate exposure to accurate scientific information [4]. However, some responses contained factual inaccuracies, highlighting the need for ongoing verification processes to maintain the reliability and consistency of AI-generated content. Ensuring the accuracy of information remains a key factor in determining how AI can support evidence-based science education.
Second, explanatory clarity was found to be moderate. While many responses were considered understandable, a portion contained complex terminology or sentence structures that may be challenging for elementary learners. This suggests that AI-generated content could be refined to enhance accessibility for students with varying levels of language proficiency. Presenting educational content in a way that is both accurate and easy to understand may contribute to a more inclusive learning experience. Future AI models may benefit from adaptive language processing capabilities that adjust response complexity based on the learner’s comprehension level [18,36].
Third, pedagogical relevance emerged as the most significant area for improvement, as only a small fraction of responses effectively aligned with elementary curricula. While many responses were scientifically accurate, a significant portion included advanced terminology or explanations beyond the expected knowledge level of elementary students. For instance, rather than simply stating that rocks form naturally over time, some responses provided technical descriptions involving complex geological processes that are typically introduced at higher educational levels. While such responses contain valid scientific information, they may not be appropriate for younger learners. This underscores the potential benefits of curriculum-based content filtering to ensure that AI-generated responses are both educationally relevant and developmentally appropriate [19].
Building upon these conclusions, the following discussion examines key implications for AI-assisted science education and potential areas for future improvement. First, scientific accuracy alone is not sufficient for effective AI-assisted learning. Although ChatGPT demonstrated a high level of factual reliability, even minor inaccuracies may lead to misunderstandings. Integrating more robust fact-checking mechanisms could help maintain the credibility of AI-generated content. Additionally, periodic expert reviews could be beneficial in ensuring that AI models continue to generate scientifically sound educational materials [4]. These efforts align with SDG 4’s emphasis on improving educational quality through verified, evidence-based content [37].
Second, enhancing the explanatory clarity of AI-generated responses is essential for making educational content more accessible to young learners. Given that 29.4% of responses contained complex wording, AI models could be refined to automatically adjust language complexity based on the student’s proficiency level. Developing adaptive language processing tools that modify sentence structures dynamically may help ensure that explanations remain both accurate and accessible [38]. This aligns with SDG 4.5, which focuses on promoting inclusive learning environments by minimizing language barriers.
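As a proxy for such automatic complexity adjustment, a readability-grade gate could route overly dense responses back for simplification. The sketch below uses the third-party textstat package and an illustrative grade threshold; readability formulas are English-oriented and only approximate elementary comprehensibility.

```python
# Readability gate as a rough proxy for language complexity.
# Requires the third-party `textstat` package (pip install textstat);
# the grade threshold of 6.0 is an illustrative assumption.
import textstat

def needs_simplification(response: str, max_grade: float = 6.0) -> bool:
    return textstat.flesch_kincaid_grade(response) > max_grade

simple = "Rain forms when water vapor in the air cools into tiny droplets."
dense = ("Precipitation occurs when atmospheric water vapor undergoes "
         "condensation and subsequently reaches the ground.")
print(needs_simplification(simple))  # expected: False
print(needs_simplification(dense))   # expected: True
```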
Third, ensuring stronger alignment between AI-generated content and educational curricula is critical. Since only 12.8% of responses were pedagogically relevant, there is an opportunity to implement curriculum-based filtering systems that tailor AI-generated explanations to specific grade levels. By structuring content in a way that aligns with students’ expected learning progressions, AI can be used to enhance inquiry-based science education and foster critical thinking [39].
Fourth, AI should be positioned as a complementary learning tool rather than a replacement for traditional instruction. The findings suggest that AI-generated responses could be more effective when integrated into structured learning environments alongside teacher-led instruction. Providing professional development opportunities for educators to learn how to evaluate and incorporate AI-generated content into their teaching could help maximize its educational impact. These efforts support SDG 4, which emphasizes strengthening teacher capacity to improve educational outcomes [19].
Fifth, the practical implementation of AI in classrooms must be considered. This entails examining how AI can be effectively applied and how it may impact student learning, particularly in alignment with SDG 4. Ensuring that AI enhances educational accessibility, supports inquiry-based learning, and promotes equity in science education directly contributes to the broader goals of SDG 4. By integrating AI thoughtfully, classrooms can foster inclusive, high-quality learning experiences that empower students with critical thinking skills and technological literacy.
Finally, ethical considerations and policy discussions surrounding AI-driven education should be explored further. While AI has the potential to expand access to educational resources, challenges related to bias, misinformation, and equitable access need to be addressed. Collaboration among educators, AI developers, and policymakers could help establish best practices and ethical guidelines for the responsible implementation of AI in education. By ensuring that AI-based learning remains inclusive, transparent, and aligned with educational objectives, its integration into classrooms can better support lifelong learning and educational equity, as outlined in SDG 4 [4].
In conclusion, AI-powered learning tools have the potential to enhance elementary science education, but their effectiveness depends on how well they balance scientific accuracy, explanatory clarity, and pedagogical alignment. Future advancements should focus on developing adaptive AI models, improving curriculum integration, and providing teacher support systems to ensure that AI serves as a reliable and effective educational tool. Aligning AI-assisted education with the objectives of SDG 4 may help expand access to quality learning, promote inquiry-based science education, and support educators in delivering more impactful instruction.

5. Limitations and Recommendations

This study provides insights into the strengths and challenges of using ChatGPT in elementary science education. While ChatGPT demonstrated strong scientific validity, the analysis also identified issues related to explanatory clarity and pedagogical alignment, suggesting that further refinements are needed for AI-based learning tools to be more effective. Additionally, as AI-assisted education continues to evolve, its potential contributions to the broader objectives of sustainable and equitable education (SDG 4: Quality Education) warrant further exploration [25,34]. However, before drawing definitive conclusions, several limitations must be acknowledged, followed by recommendations for future improvements.
One key limitation of this study is the scope of AI-generated responses analyzed. The study focused on ChatGPT’s performance in elementary Earth and Space science education, and findings may not be fully applicable to other scientific domains or different educational levels. Future studies could explore a wider range of scientific disciplines and age groups to determine whether similar patterns in AI-generated content emerge across different learning contexts. Expanding research beyond a single domain may help clarify how AI tools can be effectively tailored to different subjects and student needs, aligning with SDG 4’s emphasis on inclusive and equitable education [35].
Another limitation concerns the AI model used in this study. All AI-generated responses were collected in 2024 using the free version of ChatGPT (GPT-3.5), which may not reflect the performance of premium models or future AI iterations. Since AI systems are continuously updated and refined, the explanatory clarity, accuracy, and pedagogical relevance of responses may change over time. This study therefore provides a snapshot of ChatGPT's educational performance at a particular point in time, and ongoing research is needed to track how AI-generated educational content evolves. Longitudinal studies could assess whether AI models become more effective at adapting to different educational needs over time.
A further limitation is the absence of student interaction data. This study evaluated AI-generated responses independently, without analyzing how students engage with AI-generated explanations or how AI-driven learning tools impact comprehension and retention. Future research could incorporate classroom-based studies, student engagement tracking, or experimental learning trials to assess how AI-based instruction influences actual learning outcomes. Understanding how students interpret, apply, and respond to AI-generated explanations would provide a more comprehensive perspective on AI’s role in improving learning accessibility (SDG 4.1) and personalized instruction (SDG 4.7).
Additionally, this study does not fully address the integration of AI-generated content into real-world teaching practices. While AI can serve as a supplementary educational tool, its effectiveness depends on how teachers incorporate AI responses into their instructional strategies. Future research should examine teacher perspectives, best practices for AI integration, and professional development needs, as well as how AI might be used to support differentiated instruction in diverse classroom settings. Strengthening teacher preparedness and AI literacy could contribute to more effective implementation of AI-based learning [39].
Despite these limitations, this study offers several recommendations for improving AI-driven educational tools and ensuring that they align more effectively with sustainable education goals.
First, enhancing AI’s ability to generate age-appropriate responses is essential. Given that only 12.8% of responses were pedagogically relevant, AI models should incorporate curriculum-based filtering mechanisms to ensure that content is appropriately aligned with student learning levels. Developing AI systems that can recognize different stages of learning and adjust explanations accordingly would help make AI-assisted education more relevant and developmentally appropriate, in line with SDG 4’s focus on improving educational quality.
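To illustrate what such a filtering mechanism might look like in practice, the Python sketch below screens a generated response against a curated grade-level vocabulary and flags terms beyond the target level. The vocabulary sets, threshold, and function names are illustrative assumptions for this sketch, not components of ChatGPT or of the coding scheme used in this study.

```python
# Minimal sketch of curriculum-based vocabulary filtering (illustrative).
# GRADE_LEVEL_VOCAB is a placeholder for a curated, curriculum-derived list.
import re

GRADE_LEVEL_VOCAB = {
    3: {"rock", "soil", "rain", "cloud", "moon", "sunlight", "layers"},
    6: {"sediment", "condensation", "erosion", "atmosphere", "orbit"},
}

def out_of_level_terms(response: str, grade: int) -> set[str]:
    """Return longer content words not covered by vocabulary up to `grade`."""
    allowed = set()
    for g, vocab in GRADE_LEVEL_VOCAB.items():
        if g <= grade:
            allowed |= vocab
    words = set(re.findall(r"[a-z]+", response.lower()))
    # A deployed system would lemmatize and exclude stopwords before checking.
    return {w for w in words if len(w) > 6 and w not in allowed}

flagged = out_of_level_terms("Sedimentary layers reflect diagenetic processes.", grade=3)
if flagged:
    print("Review or simplify before classroom use:", sorted(flagged))
```

A production filter would draw its word lists from the national curriculum rather than hand-typed sets, but the gating logic would be similar.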
Second, improving explanatory clarity in AI-generated responses could enhance learning accessibility. Since 29.4% of responses contained overly technical phrasing, future AI models should integrate linguistic simplification algorithms while maintaining scientific accuracy. AI-driven content should be designed to accommodate diverse learning needs and literacy levels, ensuring that students from varied educational backgrounds can effectively engage with AI-based instruction (SDG 4.5).
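As a rough illustration of such a gate, the sketch below estimates the reading grade of a response before it is shown to students. The syllable heuristic and the grade-5 threshold are assumptions for demonstration; a production system would rely on validated readability measures.

```python
# Illustrative readability gate using a Flesch-Kincaid-style grade estimate.
import re

def count_syllables(word: str) -> int:
    # Approximate syllables as runs of consecutive vowels (rough heuristic).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def estimated_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

response = "Precipitation occurs due to atmospheric condensation of water vapor."
if estimated_grade(response) > 5.0:  # assumed upper-elementary target
    print("Ask the model to reword the answer in simpler language.")
```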
Third, fact-checking mechanisms should be strengthened to minimize inaccuracies in AI-generated content. While ChatGPT demonstrated a high level of scientific validity (94.2%), occasional factual errors highlight the need for real-time verification processes to maintain credibility. AI models could incorporate external reference databases and dynamic review systems to improve content accuracy. Ensuring that AI-generated educational materials remain consistent, evidence-based, and scientifically sound aligns with SDG 4.7, which promotes knowledge acquisition through critical and inquiry-based learning.
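One lightweight form of such verification is to pair each generated answer with a vetted reference statement for comparison. In the sketch below, VERIFIED_FACTS stands in for an external, expert-reviewed database, and the pairing supports human review rather than an automated truth verdict.

```python
# Sketch of pairing generated answers with vetted reference statements.
# VERIFIED_FACTS is a stand-in for an external, curated fact database.
VERIFIED_FACTS = {
    "moon light": "The Moon reflects sunlight rather than producing its own light.",
    "rain formation": "Rain forms when water vapor cools and condenses into droplets.",
}

def attach_reference(topic: str, generated: str) -> dict:
    """Bundle a generated answer with its reference, or mark it unverified."""
    return {
        "generated": generated,
        "reference": VERIFIED_FACTS.get(topic, "UNVERIFIED: route to expert review"),
    }

check = attach_reference("moon light", "The Moon produces its own light, similar to a star.")
print(check["generated"])
print(check["reference"])  # shown side by side so a reviewer can spot the contradiction
```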
Fourth, integrating AI into student-centered, inquiry-based learning frameworks could enhance engagement. AI-generated content should not only provide factual explanations but also encourage students to ask follow-up questions, explore concepts in depth, and engage in problem-solving activities. Designing AI-driven tools that promote active participation rather than passive knowledge absorption could make AI-based learning more meaningful and aligned with modern educational methodologies that support scientific inquiry and creative thinking (SDG 4.7).
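A simple way to operationalize this is prompt scaffolding that asks the model to close each answer with an investigable question. The template below is an illustrative assumption about how such a scaffold could be phrased, not a prompt used in this study.

```python
# Illustrative prompt scaffold nudging answers toward inquiry (assumed wording).
def inquiry_prompt(question: str, grade: int) -> str:
    return (
        f"You are helping a grade-{grade} student learn Earth and Space science.\n"
        f"Student question: {question}\n"
        "Answer in two or three short sentences using everyday words. "
        "Then end with one open-ended follow-up question the student "
        "could investigate through observation or a simple experiment."
    )

print(inquiry_prompt("Why does the Moon shine at night?", grade=3))
```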
Fifth, teacher training and professional development programs should be expanded to ensure effective GenAI integration into classrooms. Educators play a crucial role in mediating AI-generated content, guiding student discussions, and ensuring that GenAI serves as a complement to traditional instruction rather than a replacement. Professional development initiatives could focus on helping teachers critically evaluate AI-generated content, adapt it to their instructional goals, and use AI to support differentiated instruction.
Finally, a collaborative, ethical, and policy-driven approach is necessary for GenAI's responsible implementation in education. While AI holds potential for expanding access to knowledge, challenges related to bias, misinformation, and equitable access must be addressed. Policymakers, educators, and AI developers should work together to establish guidelines that ensure AI-driven education remains inclusive, unbiased, and aligned with educational standards. Promoting AI literacy for both teachers and students could support informed decision-making about AI usage in educational environments while contributing to SDG 4's overarching goal of fostering equitable and sustainable learning opportunities.
In conclusion, while ChatGPT and other AI-driven educational tools have the potential to transform elementary science education, their effectiveness will depend on how well they balance scientific accuracy, explanatory clarity, and curriculum alignment. Continued improvements in adaptive AI models, ethical content filtering, and teacher support systems will be essential in ensuring that AI serves as an equitable, reliable, and effective educational partner. Aligning AI-assisted education with SDG 4’s objectives may help expand access to quality learning, foster student engagement in inquiry-based science education, and support educators in delivering impactful instruction.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Cooper, G. Examining science education in ChatGPT: An exploratory study of generative artificial intelligence. J. Sci. Educ. Technol. 2023, 32, 444–452.
2. Jia, X.; Sun, D.; Looi, C. AI in science education: 2013 to 2023 research trends. J. Sci. Educ. Technol. 2023, 33, 94–117.
3. Airaj, M. Ethical artificial intelligence for teaching-learning in higher education. Educ. Inf. Technol. 2024, 29, 17145–17167.
4. UNESCO. Beijing consensus on artificial intelligence and education. In Proceedings of the International Conference on Artificial Intelligence and Education, Planning Education in the AI Era: Lead the Leap, Beijing, China, 16–18 May 2019.
5. Wang, K.; Burkholder, E.; Wieman, C.; Salehi, S.; Haber, N. Examining the potential and pitfalls of ChatGPT in science and engineering problem-solving. Front. Educ. 2024, 8, 1330486.
6. Choi, Y.; Chung, J. Exploring Pre-service Elementary Teachers' Perceptions of the Use of Generative Artificial Intelligence in Elementary Science Education. J. Educ. Stud. 2024, 55, 309–331.
7. National Research Council. A Framework for K-12 Science Education: Practices, Crosscutting Concepts, and Core Ideas; National Academies Press: Washington, DC, USA, 2012.
8. Alsorman, S. The readiness to use AI in teaching science: Science teachers' perspective. J. Balt. Sci. Educ. 2024, 23, 432–448.
9. Lee, M.; Yi, T.; Der, C.; Song, H.; David, H. Exploring interactions between learners and ChatGPT from a learner agency perspective: A multiple case study on historical inquiry. Educ. Inf. Technol. 2024, 39, 1102–1125.
10. Zhai, X.; Nyaaba, M.; Ma, W. Can generative AI and ChatGPT outperform humans on cognitive-demanding problem-solving tasks in science? Sci. Educ. 2024, 27, 1–12.
11. Schulze, L.; Weber, J.; Buijsman, S.; Hildebrandt, J.; Ziefe, M.; Schweidtmann, A. Empirical assessment of ChatGPT's answering capabilities in natural science and engineering. Sci. Rep. 2024, 14, 4998.
12. Saude, S.; Barros, J.; Almeida, I. Impacts of generative artificial intelligence in higher education. Soc. Sci. 2024, 29, 45–60.
13. Murshidi, G.; Shulgina, G.; Kapuza, A.; Costley, J. How understanding the limitations and risks of using ChatGPT can contribute to willingness to use. Smart Learn. Environ. 2024, 15, 77–95.
14. Chiu, T. Student AI Literacy and Competency. In Empowering K-12 Education with AI; Taylor & Francis: Abingdon, UK, 2025; pp. 31–54.
15. Ng, D.; Tan, C.; Leung, J. Empowering student self-regulated learning and science education through ChatGPT: A pioneering pilot study. Br. J. Educ. Technol. 2024, 55, 1328–1353.
16. Imran, M.; Almusharraf, N. Google Gemini as a next generation AI educational tool: A review of emerging educational technology. Smart Learn. Environ. 2024, 11, 22.
17. Almasri, F. Exploring the impact of artificial intelligence in teaching and learning of science: A systematic review of empirical research. Res. Sci. Educ. 2024, 54, 977–997.
18. ElSayary, A. An investigation of teachers' perceptions of using ChatGPT as a supporting tool for teaching and learning in the digital era. J. Comput. Assist. Learn. 2024, 40, 931–945.
19. Mondal, H.; Marndi, G.; Behera, J.K.; Mondal, S. ChatGPT for teachers: Practical examples for utilizing artificial intelligence for educational purposes. Indian J. Vasc. Endovasc. Surg. 2023, 10, 200–205.
20. Zhai, X. Transforming Teachers' Roles and Agencies in the Era of Generative AI: Perceptions, Acceptance, Knowledge, and Practices. J. Sci. Educ. Technol. 2024, 1–11.
21. Tang, R.; Cooper, G. The role of materiality in an era of generative AI. Sci. Educ. 2024, 39, 445–463.
22. Yue, M.; Jong, M.; Ng, D. Understanding K–12 teachers' technological pedagogical content knowledge readiness and attitudes toward artificial intelligence education. Educ. Inf. Technol. 2024, 29, 19505–19536.
23. O'Dea, X.; Ng, D.; O'Dea, M.; Shkuratskyy, V. Factors affecting university students' generative AI literacy: Evidence and evaluation in the UK and Hong Kong contexts. Policy Futures Educ. 2024.
24. Agathokleous, E.; Saitanis, C.; Fang, C.; Yu, Z. Use of ChatGPT: What does it mean for biology and environmental science? Sci. Total Environ. 2024, 888, 164154.
25. Uğraş, H.; Uğraş, M.; Papadakis, S.; Kalogiannakis, M. ChatGPT-supported education in primary schools: The potential of ChatGPT for sustainable practices. Sustainability 2024, 16, 9855.
26. Kotsis, K. ChatGPT as teacher assistant for physics teaching. J. Eff. Teach. Methods 2024, 2, 18–27.
27. Adel, A.; Ahsan, A.; Davison, C. ChatGPT promises and challenges in education: Computational and ethical perspectives. Educ. Sci. 2024, 14, 814.
28. Daher, W.; Diab, H.; Rayan, A. Artificial Intelligence Generative Tools and Conceptual Knowledge in Problem Solving in Chemistry. Information 2023, 14, 409.
29. Su, J.; Ng, D.; Chu, S. Artificial intelligence (AI) literacy in early childhood education: The challenges and opportunities. Comput. Educ. Artif. Intell. 2023, 4, 100124.
30. Ng, D.; Su, J.; Chu, S. Fostering secondary school students' AI literacy through making AI-driven recycling bins. Educ. Inf. Technol. 2024, 29, 9715–9746.
31. Elmas, R.; Adiguzel-Ulutas, M.; Yılmaz, M. Examining ChatGPT's validity as a source for scientific inquiry and its misconceptions regarding cell energy metabolism. Educ. Inf. Technol. 2024, 29, 25427–25456.
32. Jauhiainen, J.; Guerra, A. Generative AI in education: ChatGPT-4 in evaluating students' written responses. Innov. Educ. Teach. Int. 2024, 1–18.
33. Braun, V.; Clarke, V. Using thematic analysis in psychology. Qual. Res. Psychol. 2006, 3, 77–101.
34. Adams, T.; Jameel, S.M.; Goggins, J. Education for Sustainable Development: Mapping the SDGs to University Curricula. Sustainability 2023, 15, 8340.
35. Raman, R.; Lathabai, H.; Mandal, S.; Das, P.; Kaur, T.; Nedungadi, P. ChatGPT: Literate or intelligent about UN sustainable development goals? PLoS ONE 2024, 19, e0297521.
36. Chen, J.; Lin, J. Artificial intelligence as a double-edged sword: Wielding the POWER principles to maximize its positive effects and minimize its negative effects. Contemp. Issues Early Child. 2024, 25, 146–153.
37. Rane, N. Enhancing Mathematical Capabilities Through ChatGPT and Similar Generative Artificial Intelligence: Roles and Challenges in Solving Mathematical Problems. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4603237 (accessed on 23 March 2025).
38. Leelavathi, R.; Surendhranatha, R. ChatGPT in the classroom: Navigating the generative AI wave in management education. J. Res. Innov. Teach. Learn. 2024.
39. Daskalaki, E.; Psaroudaki, K.; Fragopoulou, P. Navigating the Future of Education: Educators' Insights on AI Integration and Challenges in Greece, Hungary, Latvia, Ireland and Armenia. arXiv 2024, arXiv:2408.15686.
Table 1. Theoretical framework in this study.

| Original Criteria [31] | Original Criteria [32] | Revised Criteria for This Study | Integration and Justification |
|---|---|---|---|
| Scientific accuracy | Factual accuracy | Scientific validity | Both criteria emphasize correctness and reliability of scientific information. Combined to ensure factual accuracy and alignment with scientific principles. |
| Explanatory aspects | Contextual relevance; logical coherence | Explanatory clarity | Explanatory aspects focus on logical and structured explanations [31], while contextual relevance and coherence [32] enhance readability and accessibility. |
| | Grammatical accuracy | Explanatory clarity | Grammatical accuracy ensures that responses are linguistically clear for elementary students. |
| Explanatory aspects | Completeness | Pedagogical relevance | Completeness [32] and explanatory aspects [31] were adapted to focus on aligning content with elementary education goals and inquiry-based learning. |
| | Contextual relevance | Pedagogical relevance | Adapted to assess relevance to real-world applications and practical educational contexts. |
Table 2. Summary of ChatGPT response analysis based on scientific validity, explanatory clarity, and pedagogical relevance.

| Criteria | Number of Responses (N) | Percentage (%) | Findings | Examples (Appropriate Response) | Examples (Inappropriate Response) |
|---|---|---|---|---|---|
| Scientific validity | 1130 | 94.2 | Most responses were factually accurate and aligned with scientific principles. However, a small percentage contained minor inaccuracies. | The Moon reflects sunlight, which is why we see it shining at night. | The Moon produces its own light, similar to a star. |
| Explanatory clarity | 849 | 70.6 | While most responses were clear, 29.4% (N = 351) contained complex vocabulary or lengthy explanations that reduced readability. | Rain forms when water vapor cools, turns into droplets, and falls from clouds. | Precipitation occurs due to atmospheric condensation, coalescing water molecules, and gravitational descent. |
| Pedagogical relevance | 153 | 12.8 | The majority of responses (87.2%, N = 1047) included terminology or scientific details too advanced for elementary students. | Sedimentary layers form when sand and soil pile up over time and harden into rock. | From a stratigraphic perspective, sedimentary layers represent depositional records influenced by diagenetic geochemical processes and paleoenvironmental shifts. |