Exploring the Scientific Validity of ChatGPT’s Responses in Elementary Science for Sustainable Education
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The aim of this study was to evaluate ChatGPT’s effectiveness in elementary science learning, particularly in promoting sustainable education through equitable access to knowledge. To this end, the study set out to assess the validity and applicability of ChatGPT’s responses in elementary Earth and Space science.
The contextualization of the study is adequate, as is its main purpose. However, it is suggested that the variables addressed in the assessment be developed in more detail and with greater precision: what exactly is meant by the terms “validity of ChatGPT’s responses” and “pedagogical effectiveness”?
In lines 181-182, it is noted that Table 1 shows the structure of the stages used in formulating the questions; however, Table 1, which appears only after line 292, shows something else.
It is suggested that the description and justification for Table 1 be developed before it appears in the text, to increase the clarity of the above. On the other hand, the description of the criteria selected for the evaluation needs to be clearer. It may not be necessary to include the first two columns, but the indicators of each criterion should be specified more precisely. This would also improve the clarity of the discussion of the results obtained.
In the results section, it is suggested that the articulation of the text be reviewed, since the percentage findings for each of the three evaluated criteria are reported repeatedly.
The conclusions, limitations and recommendations are pertinent and consistent with the purpose of the study and constitute a good contribution to this line of research.
Author Response
The response letter has been prepared based on the provided template and submitted along with the attached file.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The article is another attempt to answer the question: can we trust ChatGPT for use in the educational process? To answer this question, a rather interesting experiment was conducted with the participation of future teachers, who generated 1200 questions on natural science topics. The author of the article, studying ChatGPT’s answers to these questions, came to the expected result that in a significant number of cases these answers cannot be used in the educational process. What is new in the article is the identification of different grounds on which answers are unusable. Usually, only incorrect answers are identified. The author identified two further grounds on which answers may be unusable: clarity and pedagogical significance. Moreover, while incorrect answers are relatively rare, answers that are unusable for pedagogical reasons amount to almost 90%. This high rate is explained, first of all, by the fact that the experiment was conducted on material for the junior grades, where it is necessary to take into account both cognitive immaturity and limited knowledge. The article is written in clear and distinct language, the goals and results of the article are achieved, and there are no comments on the text (although, in my opinion, the author describes the stage of preparing the questions in too much detail). Despite its not very high scientific significance, the article can still be published: firstly, because of the interesting experiment and, secondly, because of the above-mentioned novelty.
Author Response
The response letter has been prepared based on the provided template and submitted along with the attached file.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Authors,
This is a very interesting and well-written paper with a very appropriate educational/theoretical embedding. I would like to add a few points that could help improve the paper. Just for the sake of clarity: when I ask questions below, I mean that the reader should be informed about these points in the manuscript.
(0) The work is very redundant in parts, starting with the methods section. The results section already contains interpretations and discussions that are repeated in the discussion section, including the results (%). Some information can be found in the methods section, the results section and the discussion section. It would therefore be desirable to reduce the redundancy and make the thread of the work clearer (the methods section contains all the necessary information about the methods, the results section is only for the results, and the discussion is for the discussion).
(1) Page 6, from line 251 on. You mention eight Earth and space science units. Please list them all. For orientation, it would be informative to include the time frames for traditional teaching, e.g., the number of teaching hours and the weekly teaching time.
You also mention that your process ensures a balanced presentation of all important Earth science topics. That may be true, but how do the questions cover the eight topics? So please provide more information on the questions. For example:
I can't believe that all 1200 questions cover different ideas. How many of the 150 questions per topic are more or less the same?
How well are the topics covered?
How long are the questions on average?
How long are the answers on average?
It is not clear whether the questions and justifications are good questions and justifications, even if they were created and formulated following a rigorous process and with the help of experts (preservice teachers). Has the quality of the input to ChatGPT been evaluated in this regard?
What about the variance in the difficulty of questions? Does it make sense to talk about the difficulty of questions?
Please tell us more about the pre-service teachers. Were they all male, 20 years old and in their first semester? How long did they work on their 40 questions?
(2) Page 6, line 266. You report on using an online platform to send queries to ChatGPT and receive the answer. Could you provide more information about the platform and the process behind it? For example:
What does it look like when the questions, rationales, and confidence scores are prompted? Please also provide examples of the ChatGPT prompts as used on the platform.
What was the exact input for ChatGPT? Was anything added or changed when it was sent to ChatGPT?
How were the answers presented on this platform? Please also provide specific examples of answers.
What was the exact output of ChatGPT? Was anything added or changed when it was sent to the presentation platform?
Please also provide some meaningful screenshots.
(3) Page 15, line 644. You mention the free version of ChatGPT (3.5) in the discussion. The ChatGPT version used, with a brief description, should already be part of the “Methods” section.
(4) Table 1. In this table, there is an entry “contextual relevance, logical relevance” sensu [31] with a text on integration and justification. This is not clear and, in my opinion, contradicts your information on page 8, lines 309-312. Please check the entire table for discrepancies.
(5) Page 8, lines 317-338. It would be good to present good and not-so-good answers in a table. This would provide information about the evaluation process in relation to the ChatGPT answers.
How were the answers rated according to the criteria? Only with yes/no (clarity, scientific validity, pedagogical relevance)? Confidence was rated on a scale from “very certain” to “uncertain”; how many points did this scale have? Perhaps it would be interesting to relate these ratings to the answer criteria (reliability, validity, relevance).
(6) Page 9, lines 352-359. Was the review process carried out for all 1200 questions?
(7) For reasons of practical relevance, I would expect you to also discuss the students' perspective. What role does students' skill in asking questions play? They are the ones asking the questions when ChatGPT is additionally used in school settings. Are they asking accurate questions? Should they tell ChatGPT how confident they are about the question? I don't think it is within their competence to name the educational goals. How might ChatGPT's behavior change with prompts of different quality?
In any case, how does the quality of the answers depend on the prompts? Perhaps the aspect of pedagogical relevance is simply a question of the prompts.
(8) I would prefer an extended results section that includes, for example, information about the confidence ratings and the evaluation criteria for the ChatGPT answers. This could provide some information about the relationships between the quality of the questions and the quality of the answers. An analysis by topic could also be interesting, and then a comparison between topics. In this context, the coverage of topics and the overlaps of questions in the question pool are interesting.
Kind regards,
Reviewer
Author Response
The response letter has been prepared based on the provided template and submitted along with the attached file.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Thank you very much for revising your article and incorporating the suggestions made on the first version. I believe that the adjustments made have significantly improved the clarity and quality of the work.
Author Response
We are submitting our response along with the attached file.
Author Response File: Author Response.docx
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Author,
Thank you for working on your article. I think it has become more comprehensible and transparent, but in my opinion, this has given rise to new challenges, and some old ones remain. I will explain them below.
(1) Participants: I don't think you collected any data about the gender and age of the participants who created the ChatGPT questions. Perhaps you can at least tell the reader whether the participants were of different genders and what age they are usually at when they are in their third year of university.
(2) Figure 1 is unreadable. If it is presented, it would be good if the reader could see screenshots with English text (translation).
(3) Step 2: Reasoning articulation: While you provide examples of prompts for steps 1 and 3, you do not provide an example of a prompt for step 2. Please do so.
(4) Question design framework: It is not entirely clear whether steps 2 and 3 also served to review and improve the original questions of the pre-service teachers. I think that this is not the case, since the participants – I suspect – underwent an inquiry-based learning phase, but I am not sure. Was that the case? Was the process of question generation used as part of the university's lessons? Please explain these aspects explicitly.
(5) Data Collection: I think the reader is confused by the third paragraph. I will list my questions about it below.
- Who submitted the questions via the online platform? I believe it was the pre-service teachers, wasn't it?
- Were these the questions that were created in step 1 of the question design framework, or were the questions the result of the entire inquiry-based learning scenario?
- Where does the detailed rationale explaining the scientific relevance come from? Was it created by the student teachers? Does this have something to do with step 2 of the question design framework (ChatGPT gives its rationale)?
- Where does the confidence rating by the student teachers come from? Does this have something to do with step 3 of the question design framework?
It is not clear how the question design framework and the data collection are intertwined and connected. Please make this more transparent.
Figure 1 shows – I am speculating because I cannot read the figure – that the prompts for asking, justifying, and trusting were given one after the other and not together. This could be clearer because in the third paragraph of the “Data Collection” section, you write “each submission.” This could be misinterpreted, in my opinion, and cast doubt on the similarity of asking to real classroom scenarios (fourth paragraph).
(6) Please comment on the answers provided by ChatGPT (see also one of my points in my first review). I don't think ChatGPT only responds with one sentence, but your examples give that impression. When I type “What causes earthquakes, and how do they impact the Earth's surface?”, I get a 193-word answer from ChatGPT (free version). When I ask the question, “Why do the phases of the moon change?”, I get a 497-word answer from ChatGPT (free version). So I wonder how you managed to practically evaluate scientific validity, explanatory clarity, and pedagogical appropriateness. I think the reader needs more information about your document analysis and the documents. Perhaps a primary school student is overwhelmed by the answers of ChatGPT anyway. I wonder whether ChatGPT can clearly explain a response for primary school students at all, or whether ChatGPT simply fails here. In this context, it would be interesting to discuss point (7) of my earlier review (= simulation of real teaching scenarios).
(7) Please add the state of research. I'm sorry that I didn't mention it in my first review, but I think you haven't presented the state of research on the validity, clarity, and appropriateness of ChatGPT's answers. Is there no research on this? Also, you should relate your results to the state of research in the discussion section. Yes, you have references to challenges, etc., but you don't have more detailed information about the quality of ChatGPT's responses, and that's your research.
(8) The version of ChatGPT 3.5 should be mentioned in the abstract and at its first relevant occurrence, not just in the discussion section.
(9) The article contains some repetition, but I think the reader can live with that. I refer to my comment in my first review.
Sincerely,
Reviewer
Author Response
We are submitting our response along with the attached file.
Author Response File: Author Response.pdf
Round 3
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Author,
Thank you for revising the article. I think it is very inspiring, especially because it is very well contextualized in terms of the school system. I wish you every success with your further studies, which promise to be equally exciting.
As you wrote, you could not find any relevant studies regarding the validity, clarity, and appropriateness of ChatGPT responses. Please mention this in the text, so that it is clear that you have checked this; otherwise, one could get the idea that you simply did not do so. An explicit mention would speak for your competence.
Kind regards
Reviewer
Author Response
Comment 1
Thank you for revising the article. I think it is very inspiring, especially because it is very well contextualized in terms of the school system. I wish you every success with your further studies, which promise to be equally exciting.
Response 1
Thank you very much. Your detailed and thoughtful feedback has greatly contributed to the improvement and overall quality of this manuscript. I would like to take this opportunity to sincerely express my gratitude once again through this message.
Comment 2
As you wrote, you could not find any relevant studies regarding the validity, clarity and appropriateness of ChatGPT responses. Please mention this in the text. Then it will be clear that you have clarified this. On the other hand, one could get the idea that you simply did not do this. An explicit mention would be a point for your competence.
Response 2
I have revised the introduction by incorporating additional literature relevant to this matter. Specifically, I added the following reference: Schulze Balhorn, L.; Weber, J.; Buijsman, S.; Hildebrandt, J.; Ziefle, M.; Schweidtmann, A. (2024). Empirical assessment of ChatGPT’s answering capabilities in natural science and engineering. Scientific Reports, 14, 1–11.
This study was cited to reflect prior research on the evaluation of ChatGPT’s responses. However, I clarified that this work was conducted in the context of natural science and engineering, rather than elementary science education. The relevant section has been revised accordingly, and this change is reflected on page 2, lines 58 to 70 of the revised manuscript.
I would like to sincerely express my gratitude for your valuable support in helping to improve the quality of this manuscript until the very end.