From Play to Understanding: Large Language Models in Logic and Spatial Reasoning Coloring Activities for Children
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Perhaps it would be easier to number the research questions (RQ1, RQ2, etc.) and refer directly to them in the article. Nevertheless, the research questions are interesting and worth answering, and the topic of the article has important scientific and economic value in light of the ongoing development of human-AI interfaces and the need for adaptation.
Figure 1, Figure 5, Figure 6, Figure 7, Figure 9, Figure 10, Figure 11 should have larger lettering.
Author Response
Thank you very much for your time and dedication in reviewing the manuscript. We have included additional references, numbered the research questions as suggested, and enlarged the lettering in the figures. We have added two new references in the introduction (line 60), changed “for example” to “but” to provide sufficient background, and highlighted the change from GPT-4 to GPT-4o:
- Arkoudas, K. (2023) GPT-4 Can’t Reason. https://arxiv.org/pdf/2308.03762, which argues that while GPT-4 has some successes in reasoning problems, it is still utterly incapable of reasoning.
- Kraaijveld, K., Jiang, Y., Ma, K., Ilievski, F. (2024) COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes. https://doi.org/10.48550/arXiv.2409.04053, which reviews visual reasoning problems and concludes that, for GPT-4o and other recent LLMs, there is still a gap in visual lateral thinking.
We added these clarifications and references in the manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
The paper "From Play to Understanding: LLMs in Logic and Spatial Reasoning Coloring Activities for Children" by Sebastián Tapia and Roberto Araya presents a novel application of Large Language Models (LLMs), specifically GPT-4o, in the educational context of teaching logic and spatial reasoning through coloring activities for children. While the study provides interesting insights into the potential of LLMs to aid in education by automating the correction process and providing real-time feedback, it has several critical shortcomings that need to be addressed.
Firstly, the paper lacks sufficient detail about the methodology used in the study. There is inadequate information regarding the experimental setup, including how the coloring activities were designed, how they were presented to GPT-4o, and the criteria used to select the specific prompting techniques. This lack of transparency makes it difficult to assess the validity and reproducibility of the findings. A more detailed description of the methodology would help in understanding how the experiments were conducted and how the conclusions were drawn.
Secondly, the generalizability of the results is not adequately discussed. The study focuses exclusively on a specific type of activity—coloring—and a single LLM model, GPT-4o. This narrow focus limits the applicability of the findings to other educational contexts or age groups. To strengthen the paper, the authors should consider testing their approach across a wider range of educational activities and with different LLM models. This would help to determine whether the observed benefits of using LLMs for immediate feedback and correction are applicable beyond the specific scenarios tested in this study.
Furthermore, the paper does not sufficiently address the ethical and pedagogical implications of using LLMs in the classroom. There is no discussion of potential issues related to data privacy, the risk of over-reliance on AI by educators, or the impact on students' learning experiences and critical thinking development. These are significant concerns that should be explored in detail, especially given the involvement of young children in the study. A more comprehensive analysis of these ethical and pedagogical challenges would provide a more balanced view of the potential benefits and risks of integrating LLMs into educational settings.
Additionally, the paper fails to critically examine the limitations and challenges associated with the implementation of LLMs in education. The authors do not discuss the potential biases inherent in AI models, the technical difficulties that might arise in deploying such systems in schools, or the need for adequate teacher training to effectively use these tools. Addressing these limitations and challenges is crucial for a realistic assessment of the feasibility and effectiveness of using LLMs in education.
Lastly, the study’s heavy reliance on a single LLM model, GPT-4o, without considering other models or alternative AI technologies, limits the depth of the analysis. Exploring the performance of different models would provide a more comprehensive understanding of the capabilities and limitations of LLMs in educational applications. This would also help in determining whether GPT-4o is uniquely suited for the specific tasks described or if similar results could be achieved with other models.
In conclusion, while Tapia and Araya’s paper contributes to the emerging field of AI in education by highlighting the potential of LLMs to enhance learning through real-time feedback and automated correction, it falls short in several critical areas. To improve the robustness and applicability of their findings, the authors need to provide more detailed methodological information, consider a broader range of educational contexts and LLM models, address ethical and pedagogical concerns, and critically evaluate the limitations and challenges of using AI in educational settings.
Comments on the Quality of English Language
The text is generally understandable, but there are areas where sentence structure, clarity, and academic tone could be improved to enhance readability and precision.
Author Response
Comment 1:
The paper "From Play to Understanding: LLMs in Logic and Spatial Reasoning Coloring Activities for Children" by Sebastián Tapia and Roberto Araya presents a novel application of Large Language Models (LLMs), specifically GPT-4o, in the educational context of teaching logic and spatial reasoning through coloring activities for children. While the study provides interesting insights into the potential of LLMs to aid in education by automating the correction process and providing real-time feedback, it has several critical shortcomings that need to be addressed.
Firstly, the paper lacks sufficient detail about the methodology used in the study. There is inadequate information regarding the experimental setup, including how the coloring activities were designed, how they were presented to GPT-4o, and the criteria used to select the specific prompting techniques. This lack of transparency makes it difficult to assess the validity and reproducibility of the findings. A more detailed description of the methodology would help in understanding how the experiments were conducted and how the conclusions were drawn.
Response 1:
Thank you very much for your time and dedication to carefully reviewing the manuscript and for the many suggestions for improvement.
Regarding your review of the methodology, we added in line 43 that reasoning problems are very important in the new mathematics curricula and that they must integrate computational thinking, along with the references
- Araya, R. (2021) What Mathematical Thinking Skills will our Citizens Need in 20 More Years to Function Effectively in a Super Smart Society? In Inprasitha, M., Changsri, N., & Boonsena, N. (Eds). (2021). Proceedings of the 44th Conference of the International Group for the Psychology of Mathematics Education (Vol.1) Khon Kaen, Thailand: PME. ISBN 978-616-93830-0-0
- Lockwood, J., Mooney, A.: Computational Thinking in Education: Where does it fit? A systematic literary review. (2017) https://doi.org/10.48550/arXiv.1703.07659
Problems with logical quantifiers and array handling are central to computational thinking in education.
Doing so with coloring books is a widely used strategy in elementary education. For example:
- Inharjanto, A.; Lisnani (2019) Developing Coloring Books to Enhance Reading Comprehension Competence and Creativity. Advances in Social Science, Education and Humanities Research, vol. 394.
- NYPost. https://nypost.com/2015/12/13/hottest-trend-in-publishing-is-adult-coloring-books
- Kaufmann, R.: A FORTRAN Coloring Book. MIT Press, Boston (1978)
- Sandor, G.: A Fortran coloring book. Comput. Struct. 10(6), 931–932 (1979)
We added this reference and the paragraph in line 36:
Coloring books have become popular (NYPost) not only for their artistic appeal but also for their educational benefits. They can enhance reading comprehension (Inharjanto, A., Lisnani, 2019) and have a long history of helping learning in STEM subjects (Kaufmann, 1978; Sandor, 1979). Children and adults can better understand complex ideas and develop critical thinking skills by engaging with visual representations of concepts.
The methodology was to select five types of problems involving connectives, quantifiers, and logical and spatial reasoning, and to determine the performance of GPT-4o, the most powerful LLM to date. Since context-free and zero-shot prompting did not perform well for correction, we included two specialized prompts, one for general Chain-of-Thought (CoT) reasoning and the other for visual-spatial Visualization-of-Thought (VoT) reasoning.
We added these clarifications and references in line 246 with the paragraph
We incorporated three specialized prompts: Chain-of-Thought (CoT) for logical reasoning, Visualization-of-Thought (VoT) to further enhance the AI's ability to assess and provide better feedback on the coloring activities, and Self-Consistency to mitigate the variability of probabilistic outputs.
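To illustrate how the Self-Consistency step combines several sampled answers, here is a minimal, hypothetical sketch in Python. The sampled grades below are invented for illustration and are not outputs from the paper's experiments; in the real pipeline, each sample would be a CoT- or VoT-prompted GPT-4o response:

```python
from collections import Counter

def self_consistency_vote(sampled_answers):
    """Return the most frequent answer among several sampled model outputs.

    Self-Consistency mitigates the variability of probabilistic LLM outputs:
    the same prompt is sampled several times (e.g., with temperature > 0)
    and the majority answer is kept.
    """
    counts = Counter(sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical example: five samples grading the same colored sheet.
samples = ["correct", "correct", "incorrect", "correct", "correct"]
print(self_consistency_vote(samples))  # -> correct
```

This voting step is independent of the underlying model, which is one reason Self-Consistency composes naturally with both the CoT and VoT prompts.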
Comment 2:
Secondly, the generalizability of the results is not adequately discussed. The study focuses exclusively on a specific type of activity—coloring—and a single LLM model, GPT-4o. This narrow focus limits the applicability of the findings to other educational contexts or age groups. To strengthen the paper, the authors should consider testing their approach across a wider range of educational activities and with different LLM models. This would help to determine whether the observed benefits of using LLMs for immediate feedback and correction are applicable beyond the specific scenarios tested in this study.
Response 2:
Thank you very much for the suggestion about generalizability.
Coloring activities are a strategy for building much richer items than multiple-choice ones. Any multiple-choice question can be posed as coloring the correct option. However, as described in the text, the proposed activities have millions of correct solutions among billions of possible colorings. For this reason, they have great generalizability.
In addition, the activities with boxes and balls are another way of specifying problems of handling arrays or matrices, but in a concrete language suitable for elementary school students. This type of array problem is central to computational thinking, ubiquitous, and, therefore, very general.
On the other hand, we used GPT-4o because, at the time of this work, it was considered the most powerful LLM for reasoning problems. For example, it is the LLM with the best performance on spatial reasoning tasks in the recent paper
- Kraaijveld, K., Jiang, Y., Ma, K., Ilievski, F. (2024) COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes. https://doi.org/10.48550/arXiv.2409.04053, which reviews various visual reasoning problems.
We have added this reference in line 60.
Comment 3:
Furthermore, the paper does not sufficiently address the ethical and pedagogical implications of using LLMs in the classroom. There is no discussion of potential issues related to data privacy, the risk of over-reliance on AI by educators, or the impact on students' learning experiences and critical thinking development. These are significant concerns that should be explored in detail, especially given the involvement of young children in the study. A more comprehensive analysis of these ethical and pedagogical challenges would provide a more balanced view of the potential benefits and risks of integrating LLMs into educational settings.
Response 3:
Thank you very much for highlighting these critical issues. Regarding ethics, we have added in line 143 the clarification that there are no ethical issues, since students only color on sheets of paper; they do not use any device. We have developed an LLM-based app that teachers use on their smartphones (Araya, 2024). Teachers take pictures of students' sheets and receive immediate feedback that helps them assess the students' work. It is, therefore, a cost-effective solution that we have been testing in underrepresented and low-income areas.
- Araya, R. (2024) AI as a Co-Teacher: Enhancing Creative Thinking in Underserved Areas. In Kashihara, A. et al. (Eds.) (2024). Proceedings of the 32nd International Conference on Computers in Education. Asia-Pacific Society for Computers in Education.
This paper describes the advantages for teachers: the app allows them to carry out much more creative activities with millions of possible correct solutions. Such activities are very complex to evaluate in the classroom because having a list of solutions or an adequate rubric is impossible.
Comment 4:
Additionally, the paper fails to critically examine the limitations and challenges associated with the implementation of LLMs in education. The authors do not discuss the potential biases inherent in AI models, the technical difficulties that might arise in deploying such systems in schools, or the need for adequate teacher training to effectively use these tools. Addressing these limitations and challenges is crucial for a realistic assessment of the feasibility and effectiveness of using LLMs in education.
Response 4:
While it is not the aim of the paper to study the general feasibility of using LLMs in education, we have added a paragraph in line 58 clarifying that the strategy is to support the teacher in creating richer, more creative, and more logically complex activities in the classroom. These activities have been tested in schools in underrepresented regions of Chile and Peru, and in countries such as Indonesia and Thailand, members of the Southeast Asia Ministers of Education Organization (SEAMEO).
Comment 5:
Lastly, the study’s heavy reliance on a single LLM model, GPT-4o, without considering other models or alternative AI technologies, limits the depth of the analysis. Exploring the performance of different models would provide a more comprehensive understanding of the capabilities and limitations of LLMs in educational applications. This would also help in determining whether GPT-4o is uniquely suited for the specific tasks described or if similar results could be achieved with other models.
Response 5:
As explained above, we chose GPT-4o because, at the time of this work (August 2024), this LLM was considered to have the best performance on reasoning tasks. Following this observation, we have added a paragraph clarifying this decision and the reference in line 60.
Comment 6:
In conclusion, while Tapia and Araya’s paper contributes to the emerging field of AI in education by highlighting the potential of LLMs to enhance learning through real-time feedback and automated correction, it falls short in several critical areas. To improve the robustness and applicability of their findings, the authors need to provide more detailed methodological information, consider a broader range of educational contexts and LLM models, address ethical and pedagogical concerns, and critically evaluate the limitations and challenges of using AI in educational settings.
Comments on the Quality of English Language
The text is generally understandable, but there are areas where sentence structure, clarity, and academic tone could be improved to enhance readability and precision.
Response 6:
We used Grammarly to review the English.
Reviewer 3 Report
Comments and Suggestions for Authors
In this paper, the authors explored large language models for assessing coloring activities completed by elementary school students. A range of different prompting techniques was used to improve the model performance.
[General comment]
The paper is a pleasure to read. The research questions and methods are well explained. The limitations and future work are thoroughly discussed. I recommend publishing the paper in its current form.
[Editorial suggestions]
Line 33: A brief introduction to gamification would be helpful.
Table 2: For very low probabilities, scientific notation can be used instead of decimal numbers to avoid excessive zeros.
Author Response
Comments and Suggestions for Authors
Comment 1:
In this paper, the authors explored large language models for assessing coloring activities completed by elementary school students. A range of different prompting techniques was used to improve the model performance.
[General comment]
The paper is a pleasure to read. The research questions and methods are well explained. The limitations and future work are thoroughly discussed. I recommend publishing the paper in its current form.
Response 1:
Thank you very much for your time in reviewing our manuscript. We greatly appreciate your positive evaluation and your highlighting the wide range of prompting techniques we use to measure performance.
Comment 2:
[Editorial suggestions]
Line 33: A brief introduction to gamification would be helpful.
Response 2:
We have added the phrase in line 33: Gamification, the process of integrating game elements into learning activities (Lee and Hammer, 2011)
- Lee, J.; Hammer, J. (2011) Gamification in Education: What, How, Why Bother? Academic Exchange Quarterly, 15(2), 1–5.
Comment 3:
Table 2: For very low probabilities, scientific notation can be used instead of decimal numbers to avoid excessive zeros.
Response 3:
We have changed the notation.
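For instance, a very low probability can be rendered in scientific notation with a standard format specifier; the value below is hypothetical, not a number from Table 2:

```python
p = 0.0000000431  # hypothetical probability with many leading zeros
formatted = f"{p:.2e}"  # two decimals of precision in scientific notation
print(formatted)  # -> 4.31e-08
```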
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Comments applied.