Article

Interactive Environment-Aware Planning System and Dialogue for Social Robots in Early Childhood Education

Department of Electronics Engineering, Chosun University, Gwangju 61452, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(20), 11107; https://doi.org/10.3390/app152011107
Submission received: 9 September 2025 / Revised: 14 October 2025 / Accepted: 14 October 2025 / Published: 16 October 2025
(This article belongs to the Special Issue Robotics and Intelligent Systems: Technologies and Applications)

Abstract

In this study, we propose an interactive environment-aware dialogue and planning system for social robots in early childhood education, aimed at supporting the learning and social interaction of young children. The proposed architecture consists of three core modules. First, semantic simultaneous localization and mapping (SLAM) accurately perceives the environment by constructing a semantic scene representation that includes object attributes such as position, size, color, purpose, and material, as well as positional relationships between objects. Second, the automated planning system enables stable task execution even in changing environments through planning domain definition language (PDDL)-based planning and replanning. Third, the visual question answering (VQA) module leverages scene graphs and the conversion of natural language queries into SPARQL to answer children’s questions and engage in context-based conversation. An experiment conducted in a real kindergarten classroom with children aged 6 to 7 years validated the object recognition and attribute extraction accuracy of semantic SLAM, the task success rate of the automated planning system, and the natural language question answering performance of the VQA module. The experimental results confirmed the proposed system’s potential to support natural social interaction with children and its applicability as an educational tool.

1. Introduction

Cognitive development in early childhood is organically promoted through three main factors: physical activity, social interaction, and environmental exploration [1,2]. These developmental processes can be maximized through interaction with the surrounding environment and voluntary participation, rather than one-way delivery of standardized educational content [3,4]. Recently, social robots and digital interfaces have been introduced as supplementary educational tools in early childhood education settings, showing a certain level of effectiveness in stimulating children’s interest and participation [5]. Specifically, social interaction between humans and robots leads to robots being perceived not just as machines but as social beings, and it has a positive impact on promoting emotional engagement and learning participation in young children [6]. Social interaction-based kindergarten robots can be used as effective educational tools because they help children participate in class and can also induce emotional stability and a positive learning attitude during the process of forming relationships with peers and teachers [7]. Furthermore, interaction between young children and robots functions not merely as a source of interest but also as a medium that simultaneously fosters cognitive and emotional development [8]. Through robots, children experience a combination of play and learning, which contributes to strengthening their voluntary exploration and problem-solving abilities [9]. In this context, social robots can be positioned as effective tools in early childhood education by supporting the creation of participation-centered learning environments and fostering positive learning attitudes [10].
However, most social robots currently used in early childhood education are highly dependent on predefined conversation scenarios and limited interaction rules [11]. Conti et al. conducted an experiment with 81 kindergarten children in which a robot told two types of fairy tales, comparing the memory effects under static and expressive speaker conditions [12]. Keren et al. played classical music to develop children’s spatial cognitive abilities and guided their learning by having them press buttons attached to a robot according to given questions [13]. De Wit et al. improved children’s second-language vocabulary acquisition by having the robot respond with appropriate words and gestures depending on whether the children’s answers were correct or incorrect [14]. While these approaches can be effective in encouraging short-term learning engagement, they have limitations in dynamic and unpredictable environments such as real classrooms because they do not fully account for children’s spontaneous reactions or diverse social situations.
The cognitive development process of young children cannot be explained solely by simple robot–child interactions; environmental contexts such as the arrangement of classroom space, play tools, and collaborative activities with peers have a crucial influence [15,16]. Considering these points, educational robots should go beyond simply providing conversational responses and be able to comprehensively understand and reflect the environment and the child’s activities. To achieve this, robots must have the ability to move autonomously within a specific space. Additionally, they should be able to recognize the semantic information of the objects that make up the environment and, based on this, reconfigure and perform tasks in accordance with the changing classroom situation. Furthermore, this environmental understanding and planning ability needs to be connected to social interaction through conversations with children. This allows robots to play a richer and more adaptive educational role within the learning context.
In recent robotics, spatial perception technology is moving beyond simply mapping geometric structures. Semantic simultaneous localization and mapping (SLAM), which semantically recognizes objects in the environment while simultaneously estimating the robot’s position and constructing maps for use in real-world tasks, is becoming central [17,18,19]. These technologies allow robots to go beyond simply obtaining spatial coordinates for obstacle avoidance; they can now distinguish objects such as desks, teaching aids, and toys in a classroom and understand their functional meaning. However, in research on educational robots, the application of semantic SLAM is still at an early stage, with most systems limited to simple localization or basic object detection. In this study, we propose a method by which a robot autonomously navigates while perceiving the classroom environment, focusing on objects related to interaction with children by tailoring existing semantic SLAM technology to the educational context.
Robots must not only comprehend the meaning of their environment but also develop and execute contextually appropriate task plans based on their perception. Task planning methods are broadly divided into learning-based and logic-based approaches. The learning-based approach primarily leverages reinforcement learning or deep neural networks to learn optimal action policies through experience [20,21,22]. This approach can exhibit robust performance in environments with high uncertainty and variability, but it requires large amounts of training data and long training times. By contrast, the logic-based approach leverages formal languages such as the planning domain definition language (PDDL) or the Stanford Research Institute Problem Solver (STRIPS) to define target states and derive reasonable plans under given constraints [23,24]. This approach is highly interpretable because it relies on explicit rules and reasoning processes, and it can ensure the reliability and consistency of plans in environments that are structured but contain many variables, such as classrooms. However, the conventional logic-based approach has limitations in adapting immediately to environmental changes. Therefore, in this study we propose an automated task planning system using PDDL that aims to overcome these limitations while retaining the advantages of the logic-based approach. This allows a robot to autonomously move within a changing classroom environment and flexibly plan and execute interaction behaviors with children.
Visual question answering (VQA), which has advanced through the fusion of computer vision and natural language processing technologies, is gaining attention as a technology that combines visual input and linguistic queries to generate meaningful answers [25,26,27]. Early VQA primarily relied on learning-based approaches trained on large datasets, which extracted image features, combined them with the question, and then statistically derived the most suitable answer [28,29]. However, this approach focused on simple pattern recognition and showed limitations in complex logical reasoning and contextual understanding. To address this problem, neuro-symbolic approaches have recently been actively researched. Neuro-symbolic VQA combines the representation learning capabilities of deep learning with the interpretability of symbolic reasoning, offering the advantage of logically interpreting queries about complex visual scenes and generating more explainable answers [30,31,32]. In this study, we propose an educational robot interaction model that incorporates this development trend into the educational field, using VQA based on the neuro-symbolic approach. The proposed approach structurally represents the semantic relationships between objects and attributes within the classroom and reflects them in the question-answering process, enabling the generation of contextual and explainable answers to children’s questions, going beyond simple visual factual responses.
Hence, we propose a new framework that integrates an environment-aware dialogue and planning system to enable social robots to effectively perform an educational role in early childhood education environments. The proposed robot recognizes objects and positional information within the classroom through semantic SLAM and autonomously plans and executes tasks in response to changing situations using PDDL-based automated task planning. Additionally, the robot integrates a neuro-symbolic VQA module to provide explainable answers to children’s questions, going beyond simple factual responses. Figure 1 illustrates an example of a practical application scenario for the proposed framework. The robot recognizes the color of the traffic light installed at the intersection and supports safe behavior by giving situationally appropriate verbal responses to children’s questions, such as “The light is red” or “Green light.” Meanwhile, by posing questions to the child such as “What color is the traffic light?” or “What color can we cross on?”, the robot guides children to recognize and answer questions about traffic light colors and rules of behavior on their own. This study validated the proposed framework through a student-centered scenario focused on dialogue with children. Through this interaction process, the robot functions as both a respondent answering children’s questions and a questioner posing new ones, expanding the conversation with children from one-way transmission to a two-way question-and-answer exchange. Furthermore, this process demonstrates an example in which the semantic SLAM, task planning, and VQA modules work together to achieve educational interaction. The proposed system is designed with a bidirectional structure in which the robot not only responds to children’s questions but also initiates its own questions and reacts to new inquiries from the children. This structure allows children to participate not as passive recipients of information but as active thinkers and explorers, thereby enabling child-centered learning.
First, we validated the performance of each component of the proposed framework individually. The semantic SLAM module was confirmed to reliably detect, classify, and estimate the positions of various objects in a classroom environment. The PDDL-based task planning module demonstrated autonomous execution capability by deriving reasonable and consistent task plans in response to dynamic changes in a real navigation environment. Additionally, the accuracy and explainability of the VQA module’s responses were evaluated using a question dataset constructed to match the developmental level of kindergarten children. Furthermore, the entire proposed framework, including semantic SLAM-based autonomous navigation and object recognition, was demonstrated to operate as an integrated whole through experiments conducted with children and university students in a real kindergarten classroom. Therefore, we demonstrate that robots can precisely reflect the physical environment of a classroom to promote children’s exploration and learning engagement. This study is significant in that it lays a foundation for the development of early childhood education robots.

2. Related Work

In this section, we review prior research on educational social robots, environment-aware task planning, and VQA. First, we examine existing research on the social role and effectiveness of educational robots. Next, we investigate environment-aware task planning approaches that robots require to operate autonomously in complex environments such as classrooms. Finally, we discuss the development of VQA systems for question-and-answer interactions with children and attempts to integrate them into robot conversations.

2.1. Social Robots for Education

Educational social robots can function as social beings with the potential to enhance children’s emotional empathy, cognitive engagement, and learning persistence. Lampropoulos [10] comprehensively presented the current state and future research directions for how social robots can enhance learning, emotional, and cognitive outcomes by acting as intelligent tutors or peer learners. Mifsud et al. showed that robots can play a significant social mirror role in the formation of self-concept in early childhood by analyzing the impact of the NAO6 robot on the identity formation and literacy class participation of 5–6-year-old children in a Maltese kindergarten [33]. Johnson et al. analyzed whether children’s curiosity increased when a social robot asked STEM-related questions and found that a significant number of children interacted effectively with the robot [34]. These studies show that educational robots can have a positive impact on the learning process. However, most of these systems rely on limited conversational patterns and scenarios, so they do not fully reflect the unstructured interactions found in real classrooms. Considering these limitations, we propose a method for designing social robots that can reflect classroom contexts in real time.

2.2. Environment-Aware Task Planning System

To reliably perform tasks in complex environments such as classrooms, robots must carry out task planning that considers the semantic context of the environment, rather than simple pathfinding. Previous research has largely developed data-driven and rule-based approaches to achieve this. First, the data-driven approach primarily involves robots learning optimal behavior policies through experience, often leveraging reinforcement learning or deep neural networks. Mei et al. presented a framework that integrates zero-shot learning and hierarchical reinforcement learning. This method was designed to enable robots to understand context and generalize behavior even in environments or with objects they have never seen before. It has demonstrated improved collision-free path planning performance in simulation-based experiments [35]. Zhao et al. proposed the multi-actor critic deep deterministic policy gradient algorithm, achieving high stability and learning speed in task trajectory planning for collaborative robots [36]. Zhang et al. introduced a new approach called gated attention prioritized experience replay soft actor–critic. It combines a gated attention structure, prioritized experience replay, and dynamic heuristic rewards with soft actor–critic-based reinforcement learning, enabling faster and more robust path planning for mobile robots even in complex environments [37]. These data-driven techniques have the advantage of being able to flexibly handle unstructured situations, but they also have the practical constraint of requiring a vast amount of data and time for the training process.
In contrast, the rule-based approach derives plans based on explicitly defined target states and constraints, which offers the advantage of ensuring plan interpretability and stability [38]. However, this method has the limitation of being unable to adapt immediately to unexpected environmental changes. To address this, various attempts have been made recently to reflect dynamic situations [39]. For example, Heuss et al. proposed a method to automatically adapt abstract planning domains in PDDL-based automated planning systems to real-world application environments, reducing the modeling burden on experts and increasing planning flexibility [40]. Additionally, Dong et al. demonstrated that by separating high-level task planning and low-level action execution through a hierarchical online planning approach, they could ensure the continuity of plan execution in structured environments while also being able to respond to a certain degree of environmental change [41]. In this study, we propose a task planning system that leverages the advantages of rule-based approaches while also reflecting the contextual needs of the educational setting.

2.3. Visual Question Answering

VQA is a technology that combines visual information and natural language questions to generate meaningful answers. Recently, it has been developing toward greater explainability by fusing deep learning and symbolic reasoning. Mascharka et al. demonstrated the visualization and interpretability of the reasoning steps in the question-answering process by combining neural network-based representation learning with symbolic execution paths [42]. Moon et al. proposed a neuro-symbolic VQA technique that represents environmental scenes as symmetric graphs and combines this representation with SPARQL-based reasoning, successfully performing explainable question answering on the CLEVR dataset [43]. In other words, VQA is expanding beyond simple image recognition toward technology that can understand a child’s intention behind a question and generate inferential responses by considering the situational context.
VQA is being used in various fields. Peña-Narvaez et al. used VQA to enable robots to actively ask questions about environmental information and enhance their own map information through the responses; VQA was proposed as a means of acquiring not only the geometric representation of a robot’s map but also semantic information [44]. Luo et al. proposed a transformer-based VQA model that achieved improved accuracy in both robot navigation and question answering through visual–linguistic information alignment [45]. Xiao et al. developed a VQA system specifically for educational settings and empirically demonstrated its potential applications in classroom interaction, learning content analysis, and interactive learning support [46]. Building on this trend, we propose an interactive model that applies neuro-symbolic VQA to a classroom environment, enabling it to provide contextual and explainable answers to children’s questions, going beyond simple factual responses.

3. Methods

The overall architecture proposed in this study is presented in Figure 2. In the process of interacting with children, the robot simultaneously processes verbal input and the constantly changing environmental information within the classroom. To achieve this, the robot acquires semantic information about space and objects through semantic SLAM and object recognition and plans autonomous navigation and interaction behaviors using a PDDL-based automated planning module. The information collected and structured in this way is then passed to a neuro-symbolic VQA module, which is used to respond to children’s questions or generate contextually appropriate conversations. In other words, the proposed system operates in a cyclical structure of environment perception, planning, question answering, and interaction, with each component interconnected in a complementary manner, enabling child-centered learning support within the classroom environment. The details of the implementation are described step-by-step in the following subsections.
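As an illustration of this cycle, the following minimal sketch shows one way the three modules could be orchestrated in software. The class and method names (ClassroomRobot, update_semantic_scene, and so on) are illustrative placeholders rather than the actual implementation.

# Illustrative orchestration of the perceive-plan-answer cycle in Figure 2.
# All class and method names are hypothetical placeholders.
class ClassroomRobot:
    def __init__(self, slam, planner, vqa, speech):
        self.slam = slam          # semantic SLAM module (Section 3.1)
        self.planner = planner    # PDDL-based automated planner (Section 3.2)
        self.vqa = vqa            # neuro-symbolic VQA module (Section 3.3)
        self.speech = speech      # STT/TTS front end

    def run_once(self):
        # 1. Perceive: update the semantic scene (objects, attributes, relations).
        scene = self.slam.update_semantic_scene()

        # 2. Plan: regenerate the PDDL problem from the scene and (re)plan if needed.
        plan = self.planner.plan(scene)

        # 3. Act / interact: execute each step; question steps are routed through VQA.
        for action in plan:
            if action.kind == "ask_or_answer":
                utterance = self.speech.listen()
                reply = self.vqa.answer(scene, utterance)
                self.speech.speak(reply)
            else:
                action.execute()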

3.1. Semantic SLAM

For a robot to act autonomously while interacting with children, simple mechanical movement is not enough; it must accurately perceive its environment and perform path planning and behavioral decisions based on this perception. We address this by leveraging LiDAR-based 3D SLAM to perform map creation and localization in a classroom environment [47]. SLAM, which allows a robot to localize and map simultaneously using only its onboard sensors without the aid of external devices, provides the foundation for this system. Furthermore, going beyond the generation of purely geometric maps, we combined SLAM with object recognition to implement semantic SLAM and constructed a semantic scene that includes various kinds of semantic information such as the location, type, size, and color of objects. This allows the robot to move beyond a purely structural perception of the classroom environment and acquire higher-level contextual information that supports appropriate actions and conversations in various situations. The resulting semantic scene is used as input for the automated planning system and the VQA module, enabling responses to children’s questions and context-based social conversations. In this way, semantic SLAM plays a crucial role in connecting environmental perception with interaction planning, providing an intelligent foundation for social robots to support child-centered learning.
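The following minimal sketch illustrates the kind of per-object record that makes up the semantic scene. The field names and example values are illustrative assumptions and do not reproduce the exact schema of the implementation.

from dataclasses import dataclass, field

# A possible per-object record for the semantic scene described above.
@dataclass
class SceneObject:
    name: str                        # instance identifier, e.g., "traffic_light_1"
    category: str                    # object type, e.g., "traffic light"
    position: tuple                  # (x, y, z) in the map frame from LiDAR SLAM
    size: str                        # coarse label such as "small" or "large"
    color: str                       # dominant color from the camera, e.g., "red"
    material: str                    # e.g., "metal"
    purpose: str                     # functional meaning, e.g., "traffic safety education"
    relations: dict = field(default_factory=dict)  # e.g., {"next_to": ["crosswalk_1"]}

# Example instance consistent with the classroom objects used in the experiments
traffic_light = SceneObject(
    name="traffic_light_1", category="traffic light",
    position=(2.4, 0.8, 1.6), size="small", color="black",
    material="metal", purpose="traffic safety education",
    relations={"next_to": ["crosswalk_1"]},
)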

3.2. Automated Planning System

The semantic environmental information obtained through semantic SLAM is used as a key element that enables automated task planning. Object and spatial information in the environment is structured and represented as a graph. Based on this, we designed a system that automatically generates and executes plan files. In other words, robots can automatically generate the necessary task procedures to achieve a given goal, and they can dynamically respond by reconfiguring new plans if environmental information changes during the execution of the plan. Figure 3 shows the automated planning system.
Task planning is based on PDDL. A PDDL specification consists of a domain file and a problem file: the domain defines the possible actions with their preconditions and effects, while the problem describes the initial and target states. In the proposed system, the problem file is automatically generated by reflecting the semantic environmental information extracted from semantic SLAM. The planner then calculates a sequence of actions that leads from the initial state to the target state, following the rules defined in the domain, and the robot performs the movements according to that plan. This structure goes beyond simple goal achievement, allowing real-time plan reconfiguration when environmental changes occur and ensuring stable and adaptive robot behavior even in complex and dynamic classroom environments.
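The following minimal sketch illustrates how a problem file of this kind can be assembled from the semantic scene. The predicate names (robot_at, object_at) and the goal shown here are illustrative assumptions; only the domain/problem split follows the description above.

# Sketch of auto-generating a PDDL problem file from the semantic scene.
def scene_to_pddl_problem(objects, robot_at, goal_at):
    """objects: list of (name, location) pairs extracted from the semantic scene."""
    locations = sorted({loc for _, loc in objects} | {robot_at, goal_at})
    init_facts = [f"(robot_at {robot_at})"]
    init_facts += [f"(object_at {name} {loc})" for name, loc in objects]
    return (
        "(define (problem classroom-task)\n"
        "  (:domain kindergarten)\n"
        f"  (:objects {' '.join(n for n, _ in objects)} - object\n"
        f"            {' '.join(locations)} - location)\n"
        f"  (:init {' '.join(init_facts)})\n"
        f"  (:goal (robot_at {goal_at})))\n"
    )

# The resulting string is written to a .pddl file and passed, together with the
# hand-written domain file, to an off-the-shelf PDDL planner; when the semantic
# scene changes, the problem is regenerated and the planner is invoked again.
problem_pddl = scene_to_pddl_problem(
    objects=[("traffic_light_1", "intersection"), ("crosswalk_1", "intersection")],
    robot_at="classroom_center",
    goal_at="intersection",
)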

3.3. Visual Question Answering

VQA is a technology that takes an image and a related question as input and produces an answer. In this study, VQA is applied as a core module for human–robot social interaction. The proposed VQA integrates semantic information obtained through semantic SLAM and object attribute information obtained through object recognition to construct a high-level scene graph. This process is illustrated in Figure 4. When a child asks a natural language question, it is converted into a SPARQL query and graph-based reasoning is performed. Conversely, when the robot’s role in a given step is to act rather than simply answer, it expands the social interaction by internally selecting a scene-related question, passing it through the same SPARQL conversion and reasoning, and then presenting it to the child. The VQA module is broadly divided into two stages: scene perception and question understanding. It enables a two-way interactive framework in which the robot not only provides answers to children’s questions but also initiates its own inquiries. Through this interactive exchange, children engage as active participants in meaning-making, using language to reason, question, and construct understanding. Such dynamic communication fosters an environment that supports child-centered learning, encouraging curiosity-driven exploration rather than passive information reception.
In scene perception, attributes such as object names, colors, materials, uses, sizes, existence, and positional relationships are extracted from the semantic scene. The extracted information is stored in JSON format and then converted to TTL format to be represented as a graph structure. During this process, nodes representing object classes and instances are created, and the relationships between the nodes are connected using the objectProperty type. In particular, the positional relationships between objects are mapped to appropriate relationship nodes through the relationships item. The scene graph constructed in this way is then used as the core knowledge base in the subsequent question reasoning process. Figure 5 shows how natural language interaction with children is performed through STT/TTS, the VQA system, and a graph database.
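The sketch below illustrates this JSON-to-graph conversion using the rdflib library. The namespace URI and property names (hasColor, hasMaterial) are illustrative assumptions rather than the exact vocabulary of the implementation.

from rdflib import Graph, Literal, Namespace, RDF

# Sketch of the scene-perception step: per-object JSON records are converted into
# an RDF graph that can be serialized as TTL.
EX = Namespace("http://example.org/classroom#")

def scene_json_to_graph(scene):
    g = Graph()
    g.bind("ex", EX)
    for obj in scene["objects"]:
        node = EX[obj["name"]]
        g.add((node, RDF.type, EX[obj["category"].replace(" ", "_")]))
        g.add((node, EX.hasColor, Literal(obj["color"])))
        g.add((node, EX.hasMaterial, Literal(obj["material"])))
        # positional relationships become object properties between instance nodes
        for relation, targets in obj.get("relationships", {}).items():
            for target in targets:
                g.add((node, EX[relation], EX[target]))
    return g

scene = {"objects": [
    {"name": "traffic_light_1", "category": "traffic light",
     "color": "red", "material": "metal",
     "relationships": {"next_to": ["crosswalk_1"]}},
    {"name": "crosswalk_1", "category": "crosswalk",
     "color": "white", "material": "paint", "relationships": {}},
]}
graph = scene_json_to_graph(scene)
print(graph.serialize(format="turtle"))  # TTL representation of the scene graph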
In question understanding, the system takes a child’s natural language question as input and converts it into a query that conforms to SPARQL syntax using the BERT-small model [48]. In this process, we trained the model using the CLEVR dataset [49], which provides question–answer pairs along with metadata including object attributes and spatial relationships between objects. Furthermore, the program information within the dataset consists of logical operations such as location comparison, object count verification, and attribute comparison, which can be directly mapped to SPARQL expressions. The learned transformation model converts the input question into SPARQL and queries the scene graph constructed during the scene perception stage to produce the final answer. Thus, our VQA module supports natural responses to children’s questions and context-dependent social interactions by integrating visual environment perception and question understanding. This allows robots to go beyond simply relaying object recognition results and function as intelligent, interactive partners that support children’s cognitive development.
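The following standalone example illustrates the final step of question understanding: a SPARQL query of the kind the conversion model might emit for “What color is the traffic light?” is executed against a small scene graph. The query is hand-written here as an illustrative stand-in for the BERT-small output.

from rdflib import Graph

# Two-triple scene graph in TTL and the query that answers the color question.
TTL = """
@prefix ex: <http://example.org/classroom#> .
ex:traffic_light_1 a ex:traffic_light ;
    ex:hasColor "red" .
"""
SPARQL_QUERY = """
PREFIX ex: <http://example.org/classroom#>
SELECT ?color WHERE { ?obj a ex:traffic_light ; ex:hasColor ?color . }
"""

g = Graph()
g.parse(data=TTL, format="turtle")
for row in g.query(SPARQL_QUERY):
    print("A:", row.color)   # expected output: "red"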

4. Experiment Results

To verify the performance of the framework proposed in this study, experiments were conducted in a real kindergarten environment. The participants were children aged 6 to 7 years (international age). As shown in Figure 6, the classroom was designed as a space with various learning tools and interactive elements, such as crosswalks, traffic lights, and vehicle models. The robot hardware was based on the kindergarten robot developed by REDONE Technologies, and the experiments were conducted by integrating the architecture designed in this study into the robot. The experiments covered two scenarios: traffic safety education and a recycling activity. A total of 12 children and 10 university students participated. The child participants were enrolled at the Suncheon National University kindergarten and had engaged in English play-based learning sessions twice a week for 20 min per session prior to the experiment. The university participants were students from various majors, aged between 18 and 24 years. As this study represents an initial stage of technical validation, no control group was included. Instead, the focus was placed on verifying whether the proposed framework could operate stably in a real kindergarten environment and enable real-time interaction with children.
The experiment was conducted by verifying the performance of each major module of the proposed framework. First, the semantic SLAM module was evaluated to determine whether it could accurately acquire semantic information such as position, type, size, and color of objects. Next, in the automated planning system, we verified whether the robot could adaptively replan its path and successfully complete the given task in a changing environment. Finally, in the VQA module, we confirmed whether the system could appropriately respond to questions through natural language-based interaction with children. Through this experimental design, we aimed to demonstrate that the proposed framework can reliably support child-centered learning and social interaction in a real kindergarten environment.

4.1. Semantic SLAM

To verify the performance of the semantic SLAM module, semantic information about various objects was extracted in a real kindergarten environment. The experiment was conducted in the classroom environment shown in Figure 6, where the robot used camera and LiDAR sensors to perceive the surrounding scene and perform object detection and attribute extraction. As a result, the robot was able to effectively acquire not only the presence of objects but also various semantic information such as their names, colors, sizes, materials, uses, and positional coordinates. For example, the semantic information extracted for the traffic light and car objects in the environment of Figure 6 is summarized in Table 1. For the traffic light objects, attributes such as color, light, and purpose were recorded together, while for the car objects, detailed recognition results were obtained, including information on sub-components such as wheels. In addition, the ROS2 Humble nav2 1.0.0 package was employed to perform localization and collision avoidance, enabling the robot to achieve stable navigation and perception even in dynamic classroom environments [50]. These results demonstrate that the proposed semantic SLAM module can generate semantically rich scene representations in real classroom environments, providing high-level environmental information that can be leveraged by the subsequent automated planning system and VQA module.
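A minimal navigation call of this kind could look as follows, assuming the nav2 simple commander Python interface. Since only the use of the ROS2 Humble nav2 package is specified above, the specific client API and the coordinates shown here are illustrative.

import rclpy
from geometry_msgs.msg import PoseStamped
from nav2_simple_commander.robot_navigator import BasicNavigator, TaskResult

# Sketch: send a single goal pose to nav2, which handles localization,
# path planning, and collision avoidance.
rclpy.init()
navigator = BasicNavigator()
navigator.waitUntilNav2Active()          # wait until localization and planners are up

goal = PoseStamped()
goal.header.frame_id = "map"
goal.header.stamp = navigator.get_clock().now().to_msg()
goal.pose.position.x = 2.4               # e.g., the intersection in front of the traffic light
goal.pose.position.y = 0.8
goal.pose.orientation.w = 1.0

navigator.goToPose(goal)
while not navigator.isTaskComplete():
    pass                                 # feedback could be inspected here

if navigator.getResult() == TaskResult.SUCCEEDED:
    print("Reached the waypoint; control returns to the task planner.")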

4.2. Automated Planning System

To verify the performance of the proposed automated planning system, we conducted experiments in which plans were automatically generated based on environmental information obtained from semantic SLAM and then executed by a real robot. The purpose of the experiment was to evaluate whether the generated plan presents a reasonable sequence of actions and whether the robot can reliably perform the task based on that plan. First, the plans defined in this system are expressed in PDDL format, and various actions are described in the domain. Table 2 shows two example actions (robot_traffic_question, recycle_location_check), each composed of parameters, preconditions, and effects. Based on these action definitions, given a problem, the planner generates a sequence of actions to reach the target state from the initial state. Table 3 shows an example of a generated plan, presenting the sequence of actions the robot must perform, such as moving, asking a question, responding, and entering a crosswalk.
Figure 7 illustrates the process in which the robot interacts with children and moves according to a predefined sequence of actions in a real kindergarten classroom environment. The classroom was equipped with various learning tools such as crosswalks, vehicle models, and traffic lights, and the children participated in traffic rule and safety education activities together with the robot. The robot sequentially executed planned actions such as moving, asking questions, and responding, thereby interacting naturally with the children. Through this process, it was confirmed that the proposed framework can operate stably even in an actual educational environment. The specific sequence of actions performed during this interaction is presented in Table 3. As shown in the table, the robot executed a series of actions, including moving to specific locations, asking questions, checking traffic conditions, conducting question–answer exchanges with children, and crossing the crosswalk. This action sequence was generated based on a predefined PDDL domain, demonstrating that the robot can reliably execute the planned procedures in a real-world environment.
The results of applying the generated plans to a real robot platform and performing the tasks are summarized in Table 4. As can be seen from the results, the success rate was 100% in simple environments, and it decreased somewhat as the number of objects in the environment increased and the situation became more complex. However, in most cases, it maintained stable performance of over 80%, demonstrating that the proposed automatic planning module effectively operates even in dynamic situations such as those taking place in a real kindergarten environment. In summary, we experimentally validated that the proposed automated planning system can establish reasonable plans based on semantic scene information and support robots in successfully executing those plans.
In addition, Figure 8 illustrates the process of dynamic replanning in a changing environment. The objective of each robot was to sequentially visit all designated waypoints (wp), but when unexpected obstacles appeared, the robot was unable to follow its original path. In such cases, the system automatically triggered a replanning process, allowing the robot to continue its mission and complete the waypoint visitation task through an alternative path. Importantly, this experiment was not limited to verifying plan execution in a static setting; rather, it focused on evaluating how quickly and reliably the robot could respond when the environment changed in real time. Through this, it was confirmed that the proposed planning system operates flexibly even in dynamic and unpredictable environments such as a real kindergarten, enabling the robot to effectively adapt to environmental changes.

4.3. Visual Question Answering

To verify the performance of the VQA module proposed in this study, a question-and-answer experiment was conducted with kindergarten children, demonstrating that the module can be employed in real English education. The sentences used in the experiment are presented in Table 5; they range from simple word-level questions to sentence-level communication questions. They were designed to help children learn English vocabulary and understand basic sentences. We designed the stepwise questions considering young children’s stages of English language development and enabled the robot to interact with children based on visual information. Specifically, in Stage 1, the robot guided basic cognitive and linguistic responses by recognizing the locations of objects and naming them. In Stage 2, it supported vocabulary expansion through questions about basic attributes such as color. In Stage 3, questions about children’s preferences and experiences were included to encourage emotional expression and the development of sentence structure. In Stages 4 and 5, questions involving grammatical elements such as quantity, comparison, and past tense were presented to promote deeper linguistic thinking.
The main questions used in this experiment focused on object colors, such as “What color is the car?”, object locations, such as “What is in front of the bus?”, and object attributes or functions, such as “When can we cross the traffic light?” and “Which bag do you throw the can into?”. These questions were designed to help children not only learn English vocabulary but also recognize and reason about the properties and relationships of objects, thereby promoting conceptual learning. In addition, accurately answering these questions requires the robot to precisely perceive and utilize visual environmental information, including the color, shape, and spatial relationships of objects in the classroom. Through this process, the VQA module serves as a key component that goes beyond simple language processing by integrating visual and linguistic information to enable meaningful educational interactions with children.
The VQA module converts the input natural language sentence into a SPARQL query and then infers the answer on the scene graph. Table 6 shows the results of example sentences converted into SPARQL syntax, allowing us to see how questions such as “What color is the motorcycle?” or “How many cars are there?” are transformed into SPARQL queries and processed on a graph database. For processing children’s speech, the Google Speech-to-Text and Text-to-Speech APIs were employed, enabling reliable conversion of spoken input into text and subsequent synthesis back into speech [51]. Figure 9 illustrates how the VQA module performs question answering in dynamically changing environments. Even when the same question is asked, the system reinterprets the scene according to the changed number of objects and provides a new answer. This demonstrates that the proposed framework can operate flexibly not only in fixed settings but also in continuously changing real classroom environments.
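The speech front end can be sketched as follows using the Google Cloud Speech-to-Text and Text-to-Speech client libraries. The audio parameters and the answer_question helper are illustrative assumptions, not the exact configuration used in the experiment.

from google.cloud import speech, texttospeech

# Sketch: transcribe the child's spoken question, pass it to the VQA module,
# and synthesize the spoken answer.
def transcribe(audio_bytes: bytes) -> str:
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    audio = speech.RecognitionAudio(content=audio_bytes)
    response = client.recognize(config=config, audio=audio)
    return response.results[0].alternatives[0].transcript if response.results else ""

def synthesize(text: str) -> bytes:
    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16
        ),
    )
    return response.audio_content

# question_text = transcribe(mic_audio)          # child's spoken question -> text
# answer_text = answer_question(question_text)   # VQA: SPARQL conversion + graph query (hypothetical helper)
# play(synthesize(answer_text))                  # spoken answer back to the child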
Additionally, the object information obtained from semantic SLAM is organized as shown in Figure 10, including attributes such as name, color, size, purpose, and material of each object, as well as their positional relationships. This information is converted into a graph structure and used in the scene perception stage of the VQA module. Figure 11, Figure 12 and Figure 13 visually represent the process of finding answers to different questions in the same environment. For example, in response to the question “What color is the traffic light?”, the system performed a series of SPARQL queries step by step and ultimately produced the answer “Red.” In response to the question “What do you see in front of the bus?”, it produced the results “Car, Motorcycle, Crosswalk.” Finally, for the question, “What is the name of the black metal object?” the correct answer, “Traffic light,” was inferred. Additional question-answering examples are presented in Figure 14, Figure 15, Figure 16 and Figure 17, which demonstrate the generality and scalability of the proposed VQA module. These results demonstrate that the proposed VQA module can semantically interpret children’s questions and reliably generate answers by leveraging a semantic SLAM-based scene graph. Furthermore, the results indicate that the module can be effectively used not only for simple object recognition but also for English education and social interaction through children’s question-and-answer sessions.
In summary, the semantic SLAM, automated planning system, and VQA modules proposed in this study were found to operate stably through interaction with children in a real kindergarten environment. Semantic SLAM effectively constructed a semantic scene representation including various object attributes and positional relationships, while the automated planning system demonstrated that the robot could successfully plan and execute tasks even in a changing environment. Furthermore, the VQA module has shown the potential to be used in educational interactions by accurately interpreting and responding to children’s natural language questions. These results demonstrate that the architecture proposed in this study has practical potential to effectively implement child-centered learning support and social interaction in real educational settings.

5. Discussion

The proposed interactive environment-aware dialogue and planning system is designed to be developmentally appropriate for kindergarten children. First, early childhood is a period in which linguistic expression and cognitive inquiry abilities expand rapidly, and bidirectional dialogue experiences maximize learning effectiveness. The VQA module developed in this study allows the robot to function not merely as a passive respondent to a child’s questions but as an active interactive partner that can both generate its own questions and receive new ones from the child. This approach overcomes the limitations of conventional educational robot systems, which have typically relied on pre-defined scenarios or one-way feedback structures, enabling the robot to respond to a child’s linguistic reactions and curiosity while expanding the dialogue. This structure encourages children to explore concepts through language use and to develop higher-order thinking skills.
Second, the automated planning system adjusts the robot’s behavior according to the child’s responses and environmental changes, thereby supporting experiences in which children actively construct their own knowledge. Unlike conventional systems that operate based on fixed action sequences, the proposed planning module incorporates a PDDL-based dynamic replanning mechanism, allowing the robot to reflect real-time changes in the learning environment. Through this capability, the robot can adapt effectively to classroom situations and maintain continuous interaction with children.
Third, the semantic SLAM-based environmental perception enables the robot to semantically understand the objects, spaces, and relationships within a real classroom, allowing it to generate contextually appropriate questions, answers, and feedback in real time. Whereas previous educational robots were limited to physical localization or basic object detection, the proposed system achieves structural innovation by integrating semantic, scene-level perception with linguistic interaction and behavioral planning. Through this integration, children experience a comprehensive form of linguistic, cognitive, and social interaction within an authentic educational environment.
Consequently, the proposed framework demonstrates that the robot can serve not merely as an information provider but as a learning companion that engages in reciprocal question–answer exchanges with children, overcoming the technical and interactional limitations of existing educational robots and promoting children’s cognitive, linguistic, and social development in a developmentally appropriate manner.
The approach proposed in this study represents a novel structural integration that combines semantic SLAM, automated planning, and VQA modules into a single coherent framework, distinguishing it from previous studies that have explored these technologies independently. Unlike prior research that primarily focused on improving the performance of individual functions, the proposed system aims to expand the educational applicability of social robots through the integrated connection of environmental perception, planning, and dialogue. Therefore, the key contribution of this study lies not merely in the implementation of each component but in the presentation of an integrated educational robot architecture designed to support child-centered learning.

6. Limitations

Although the proposed system demonstrated technical feasibility and stable interaction within real kindergarten environments, several limitations remain that should be addressed in future work. First, this study did not quantitatively evaluate the educational effectiveness of the proposed framework. Future research will employ this architecture to systematically assess its impact on children’s learning outcomes, engagement, and cognitive development in real educational settings. Second, the speech-to-text and text-to-speech modules used for interaction were implemented using existing commercial APIs and therefore did not fully reflect the speech characteristics of children, such as pronunciation variability and spontaneous verbal expressions. To achieve more natural and responsive communication, future work will focus on developing a child-adaptive speech processing module that better captures the linguistic and acoustic features of young learners. Third, each module within the proposed framework (semantic SLAM, automated planning, and VQA) still has room for improvement in terms of performance and integration. Future research will aim to enhance perception, reasoning, and dialogue generation capabilities, thereby improving the robot’s adaptability and educational interactivity. These limitations indicate the next steps toward advancing the proposed system into a more comprehensive and child-centered educational robotics platform.

7. Conclusions

In this study, we propose an architecture that combines environmental perception and interaction to support preschool children’s learning using social robots. The proposed system includes three modules: semantic SLAM, automated planning system, and VQA. Each module works in a complementary manner to enable the robot to engage in meaningful interactions with children in a real classroom environment. The experimental results indicate that semantic SLAM provides precise semantic scene information, including various object attributes and positional relationships, and the automated planning system successfully completes tasks with a high success rate even in changing environments. Furthermore, the VQA module demonstrated its potential for social interaction and educational use by accurately interpreting children’s natural language questions and generating appropriate responses.
The proposed system supports child-centered learning by enabling the robot to semantically understand objects and spatial relationships within the classroom, adjust its behavior according to children’s responses and environmental changes, and engage in active interaction by generating questions and answers. Through this process, the robot is demonstrated to function not merely as an information provider but as a learning companion that facilitates children’s linguistic exploration and concept formation. However, this study did not quantitatively verify the educational effectiveness of the system, and the speech recognition module did not fully reflect the characteristics of children’s speech. Future research will focus on improving these technical aspects and conducting long-term experiments to strengthen the educational applicability and child-adaptive interaction capabilities, thereby advancing the proposed system into a more complete child-centered educational robotics platform.

Author Contributions

Conceptualization, J.M.; methodology, J.M.; software, S.M.S.; validation, J.M. and S.M.S.; investigation, S.M.S.; writing—original draft preparation, J.M. and S.M.S.; writing—review and editing, J.M.; visualization, J.M.; supervision, J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the research fund from Chosun University, 2022.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SLAM  Simultaneous Localization and Mapping
PDDL  Planning Domain Definition Language
STRIPS  Stanford Research Institute Problem Solver
VQA  Visual Question Answering
WP  Waypoint

References

  1. Bidzan-Bluma, I.; Lipowska, M. Physical activity and cognitive functioning of children: A systematic review. Int. J. Environ. Res. Public Health 2018, 15, 800.
  2. van Liempd, I.H.; Oudgenoeg-Paz, O.; Leseman, P.P. Object exploration is facilitated by the physical and social environment in center-based child care. Child Dev. 2025, 96, 161–175.
  3. Rakesh, D.; McLaughlin, K.A.; Sheridan, M.; Humphreys, K.L.; Rosen, M.L. Environmental contributions to cognitive development: The role of cognitive stimulation. Dev. Rev. 2024, 73, 101135.
  4. Grava, J.; Pole, V. The promotion of self-directed learning in pre-school: Reflection on teachers’ professional practice. Cypriot J. Educ. Sci. 2021, 16, 2336–2352.
  5. Dore, R.A.; Dynia, J.M. Technology and media use in preschool classrooms: Prevalence, purposes, and contexts. Front. Educ. 2020, 5, 600305.
  6. Breazeal, C. Social interactions in HRI: The robot view. IEEE Trans. Syst. Man Cybern. C Appl. Rev. 2004, 34, 181–186.
  7. Kanda, T.; Shimada, M.; Koizumi, S. Children learning with a social robot. In Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, Boston, MA, USA, 5–8 March 2012; pp. 351–358.
  8. Neumann, M.M.; Koch, L.-C.; Zagami, J.; Reilly, D.; Neumann, D.L. Preschool children’s engagement with a social robot compared to a human instructor. Early Child. Res. Q. 2023, 65, 332–341.
  9. Woo, H.; LeTendre, G.K.; Pham-Shouse, T.; Xiong, Y. The use of social robots in classrooms: A review of field-based studies. Educ. Res. Rev. 2021, 33, 100388.
  10. Lampropoulos, G. Social robots in education: Current trends and future perspectives. Information 2025, 16, 29.
  11. Studhalter, U.T.; Jossen, P.; Seeli, M.; Tettenborn, A. Tablet computers in early science education: Enriching teacher–child interactions. Early Child. Educ. J. 2024, 53, 2531–2545.
  12. Conti, D.; Cirasa, C.; Di Nuovo, S.; Di Nuovo, A. “Robot, tell me a tale!” A social robot as tool for teachers in kindergarten. Interact. Stud. 2020, 21, 220–242.
  13. Keren, G.; Ben-David, A.; Fridin, M. Kindergarten assistive robotics (KAR) as a tool for spatial cognition development in pre-school education. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura, Portugal, 7–12 October 2012; pp. 1084–1089.
  14. De Wit, J.; Schodde, T.; Willemsen, B.; Bergmann, K.; De Haas, M.; Kopp, S.; Krahmer, E.; Vogt, P. The effect of a robot’s gestures and adaptive tutoring on children’s acquisition of second language vocabularies. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, Chicago, IL, USA, 5–8 March 2018; pp. 50–58.
  15. Wu, X.E.; Ko, J. Peer interactions during storybook reading on children’s knowledge construction: An experimental study on K2 and K3 children. Front. Educ. 2024, 9, 1253782.
  16. Cankaya, O.; Rohatyn-Martin, N.; Leach, J.; Taylor, K.; Bulut, O. Preschool children’s loose parts play and the relationship to cognitive development: A review of the literature. J. Intell. 2023, 11, 151.
  17. Alqobali, R.; Alnasser, R.; Rashidi, A.; Alshmrani, M.; Alhmiedat, T. A real-time semantic map production system for indoor robot navigation. Sensors 2024, 24, 6691.
  18. Zheng, C.; Zhang, P.; Li, Y. Semantic SLAM system for mobile robots based on large visual model in complex environments. Sci. Rep. 2025, 15, 8450.
  19. Jiang, Y.; Wu, Y.; Zhao, B. Enhancing SLAM algorithm with Top-K optimization and semantic descriptors. Sci. Rep. 2025, 15, 8280.
  20. Zheng, D.; Yan, J.; Xue, T.; Liu, Y. A knowledge-based task planning approach for robot multi-task manipulation. Complex Intell. Syst. 2024, 10, 193–206.
  21. Golluccio, G.; Di Vito, D.; Marino, A.; Bria, A.; Antonelli, G. Task-motion planning via tree-based Q-learning approach for robotic object displacement in cluttered spaces. In Proceedings of the 18th International Conference on Informatics in Control, Automation and Robotics (ICINCO), Online, 6–8 July 2021; pp. 130–137.
  22. Chalvatzaki, G.; Younes, A.; Nandha, D.; Le, A.T.; Ribeiro, L.F.R.; Gurevych, I. Learning to reason over scene graphs: A case study of finetuning GPT-2 into a robot language model for grounded task plan. Front. Robot. 2023, 10, 1221739.
  23. Förster, J.; Ott, L.; Nieto, J.; Lawrance, N.; Siegwart, R.; Chung, J.J. Automatic extension of a symbolic mobile manipulation skill set. Robot. Auton. Syst. 2023, 165, 104428.
  24. Liu, R.; Wan, G.; Jiang, M.; Chen, H.; Zeng, P. Autonomous robot task execution in flexible manufacturing: Integrating PDDL and behavior trees in ARIAC 2023. Biomimetics 2024, 9, 612.
  25. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2425–2433.
  26. Wu, Q.; Teney, D.; Wang, P.; Shen, C.; Dick, A.; Van Den Hengel, A. Visual question answering: A survey of methods and datasets. Comput. Vis. Image Underst. 2017, 163, 21–40.
  27. Kafle, K.; Kanan, C. An analysis of visual question answering algorithms. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1965–1973.
  28. Ren, M.; Kiros, R.; Zemel, R. Exploring models and data for image question answering. Adv. Neural Inf. Process. Syst. 2015, 28, 1–9.
  29. Malinowski, M.; Rohrbach, M.; Fritz, M. Ask your neurons: A deep learning approach to visual question answering. Int. J. Comput. Vis. 2017, 125, 110–135.
  30. Yi, K.; Wu, J.; Gan, C.; Torralba, A.; Kohli, P.; Tenenbaum, J. Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. Adv. Neural Inf. Process. Syst. 2018, 31, 1–11.
  31. Eiter, T.; Higuera, N.; Oetsch, J.; Pritz, M. A neuro-symbolic ASP pipeline for visual question answering. Theory Pract. Log. Program. 2022, 22, 739–754.
  32. Amizadeh, S.; Palangi, H.; Polozov, A.; Huang, Y.; Koishida, K. Neuro-symbolic visual reasoning: Disentangling. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 12–18 July 2020; pp. 279–290.
  33. Mifsud, C.L.; Bonello, C.; Kucirkova, N.I. Exploring the multifaceted roles of social robots in early childhood literacy lessons: Insights from a Maltese classroom. Int. J. Soc. Robot. 2025, 17, 1235–1249.
  34. Johnson, A.; Martin, A.; Quintero, M.; Bailey, A.; Alwan, A. Can social robots effectively elicit curiosity in STEM topics from K-1 students during oral assessments? In Proceedings of the 2022 IEEE Global Engineering Education Conference (EDUCON), Tunis, Tunisia, 28–31 March 2022; pp. 1264–1268.
  35. Mei, L.; Xu, P. Path planning for robots combined with zero-shot and hierarchical reinforcement learning in novel environments. Actuators 2024, 13, 458.
  36. Zhao, B.; Wu, Y.; Wu, C.; Sun, R. Deep reinforcement learning trajectory planning for robotic manipulator based on simulation-efficient training. Sci. Rep. 2025, 15, 8286.
  37. Zhang, Z.; Fu, H.; Yang, J.; Lin, Y. Deep reinforcement learning for path planning of autonomous mobile robots in complicated environments. Complex Intell. Syst. 2025, 11, 277.
  38. Cashmore, M.; Fox, M.; Long, D.; Magazzeni, D.; Ridder, B.; Carrera, A.; Palomeras, N.; Hurtos, N.; Carreras, M. ROSPlan: Planning in the robot operating system. In Proceedings of the 25th International Conference on Automated Planning and Scheduling (ICAPS), Jerusalem, Israel, 7–11 June 2015; pp. 333–341.
  39. Moon, J.; Lee, B.-H. PDDL planning with natural language-based scene understanding for UAV-UGV cooperation. Appl. Sci. 2019, 9, 3789.
  40. Heuss, L.; Gebauer, D.; Reinhart, G. Concept for the automated adaption of abstract planning domains for specific application cases in skills-based industrial robotics. J. Intell. Manuf. 2024, 35, 4233–4258.
  41. Dong, X.; Wan, G.; Zeng, P.; Song, C.; Cui, S.; Liu, Y. Hierarchical online automated planning for a flexible manufacturing system. Robot. Comput.-Integr. Manuf. 2024, 90, 102807.
  42. Mascharka, D.; Tran, P.; Soklaski, R.; Majumdar, A. Transparency by design: Closing the gap between performance and interpretability in visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4942–4950.
  43. Moon, J. Symmetric graph-based visual question answering using neuro-symbolic approach. Symmetry 2023, 15, 1713.
  44. Peña-Narvaez, J.D.; Martín, F.; Guerrero, J.M.; Pérez-Rodríguez, R. A visual questioning answering approach to enhance robot localization in indoor environments. Front. Neurorobot. 2023, 17, 1290584.
  45. Luo, H.; Guo, Z.; Wu, Z.; Teng, F.; Li, T. Transformer-based vision-language alignment for robot navigation and question answering. Inf. Fusion 2024, 108, 102351.
  46. Xiao, J.; Zhang, Z. EduVQA: A multimodal visual question answering framework for smart education. Alex. Eng. J. 2025, 122, 615–624.
  47. Sankalprajan, P.; Sharma, T.; Perur, H.D.; Pagala, P.S. Comparative analysis of ROS based 2D and 3D SLAM algorithms for Autonomous Ground Vehicles. In Proceedings of the 2020 International Conference for Emerging Technology (INCET), Belgaum, India, 5–7 June 2020; pp. 1–6.
  48. Tsai, H.; Riesa, J.; Johnson, M.; Arivazhagan, N.; Li, X.; Archer, A. Small and practical BERT models for sequence labeling. arXiv 2019, arXiv:1909.00100.
  49. Johnson, J.; Hariharan, B.; Van Der Maaten, L.; Fei-Fei, L.; Zitnick, C.L.; Girshick, R. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2901–2910.
  50. MJ, A.K.; Babu, A.V.; Damodaran, S.; James, R.K.; Murshid, M.; Warrier, T.S. ROS2-powered autonomous navigation for TurtleBot3: Integrating Nav2 stack in Gazebo, RViz and real-world environments. In Proceedings of the 2024 IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems (SPICES), Kollam, India, 24–26 January 2024; pp. 1–6.
  51. Shakhovska, N.; Basystiuk, O.; Shakhovska, K. Development of the speech-to-text chatbot interface based on Google API. In Proceedings of the International Conference on Modern Machine Learning Technologies (MoMLeT), Lviv, Ukraine, 31 May–1 June 2019; pp. 212–221.
Figure 1. Bidirectional question–answer interaction between a social robot and children at a crosswalk: the robot recognizes the traffic light color, responds to children’s questions (“The light is red”, “Green light!”), and poses questions to guide safe crossing, demonstrating the integration of VQA, Semantic SLAM, and task planning. (a) When the light is red, the children wait and do not cross. (b) When the light turns green, they safely cross the crosswalk together with the robot.
Figure 2. The overall architecture of the Interactive Environment-Aware Dialogue and Planning System.
Figure 3. Proposed workflow of the automated planning system.
Figure 4. Example of VQA process.
Figure 5. Overall architecture of the proposed VQA module.
Figure 6. Experimental environment in a kindergarten classroom, consisting of crosswalks, traffic lights, and toy vehicles for interactive learning with the social robot: (a) front view of the classroom setup, (b) side view with interactive elements, (c) kindergarten robot used in the experiment.
Figure 7. Experimental scenarios in dynamically changing environments: (a) kindergarten children interacting with robots while performing traffic rule and recycling learning tasks in a changing classroom setting; (b) university students engaging in the same interactive tasks under similar dynamic conditions.
Figure 8. Illustration of dynamic replanning in a changing environment. Each robot’s objective is to sequentially visit all designated waypoints. When an unexpected obstacle appears, the robot is unable to follow the original path. The system triggers a replanning process, allowing the robot to continue its mission and complete the waypoint visitation task despite environmental changes.
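To make the replanning behaviour shown in Figure 8 concrete, the following minimal sketch illustrates the control flow only; execute(), blocked(), and replan() are hypothetical placeholders for the navigation, obstacle-detection, and PDDL planning interfaces, not the system's actual API.

# Minimal sketch of the replanning loop illustrated in Figure 8.
# execute(), blocked(), and replan() are hypothetical placeholders for the
# navigation, obstacle-detection, and PDDL-planning interfaces.
def visit_waypoints(waypoints, execute, blocked, replan):
    pending = list(waypoints)
    while pending:
        goal = pending[0]
        if blocked(goal):
            # An unexpected obstacle invalidates the current plan, so a new
            # plan covering the remaining waypoints is requested.
            pending = replan(pending)
            continue
        execute(goal)   # drive to the next waypoint
        pending.pop(0)  # waypoint visited, move on to the next one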
Figure 9. Visual question answering in dynamically changing environments: (a) in an environment with two bananas, the system interprets the current scene and answers the question “How many bananas are there?” with “2”; (b) in an environment where only one banana is present, the system reinterprets the changed environment and answers the same question with “1.”
Figure 10. Extracted semantic information of objects in the classroom environment for VQA.
Figure 11. Example of VQA process for the question: What color is the traffic light?
Figure 12. Example of VQA process for the question: What do you see in front of the bus?
Figure 13. Example of VQA process for the question: What is the name of the metal black object?
Figure 14. Extracted semantic information of objects in the classroom environment for VQA.
Figure 15. Example of VQA process for the question: How many recycling bags are there?
Figure 16. Example of VQA process for the question: What do you see in front of the pink bag?
Figure 17. Example of VQA process for the question: Which bag do you throw the can into?
Table 1. Examples of extracted semantic information for objects in the classroom environment: (a) traffic light (b) car.
(a) Traffic light

"objects": [ {
    "name": "trafficlight",
    "color": ["black"],
    "size": "Medium",
    "material": "metal",
    "light": "green",
    "use": "traffic",
    "existence": "exist",
    "rotation": 0,
    "3d_coords": [50, 30, 0],
    "pixel_coords": [200, 2, 0]
} ]

(b) Car

"objects": [ {
    "name": "car",
    "color": ["red"],
    "size": "Large",
    "material": "metal",
    "subcar": ["wheel", "wheel"],
    "use": "vehicle",
    "existence": "exist",
    "rotation": 90,
    "3d_coords": [0, 0, 0],
    "pixel_coords": [0, 0, 0]
} ]
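The object descriptions in Table 1 can be consumed directly as JSON. The snippet below is a minimal sketch, assuming the semantic SLAM output is stored in a file with the structure shown above (the file name scene.json is hypothetical); it lists the metal objects that currently exist in the scene.

import json

# Load the semantic scene description; "scene.json" is a hypothetical file
# containing {"objects": [ ... ]} in the Table 1 format.
with open("scene.json", encoding="utf-8") as f:
    scene = json.load(f)

# Keep only the objects that are made of metal and currently present.
metal_objects = [
    obj for obj in scene["objects"]
    if obj.get("material") == "metal" and obj.get("existence") == "exist"
]

for obj in metal_objects:
    x, y, z = obj["3d_coords"]
    print(f"{obj['name']}: color={obj['color']}, use={obj['use']}, position=({x}, {y}, {z})")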
Table 2. Examples of defined actions in PDDL domain.
Action: robot_traffic_question

(:action robot_traffic_question
  :parameters (?location - location ?robot - robot)
  :precondition (and (= (count1) (count2)) (= (count1) (count3))
    (traffic_location_ok ?location) (at ?robot ?location))
  :effect (and (robot_traffic_question_gogo) (kid_traffic_answer)
    (increase (count1) 1)))

Action: recycle_location_check

(:action recycle_location_check
  :parameters (?near_location - location ?now_location - location
    ?robot - robot ?trashcan - trashcan)
  :precondition (and (question_location_ok ?now_location)
    (at ?robot ?now_location) (connected ?now_location ?near_location)
    (at ?trashcan ?near_location))
  :effect (and (recycle_location_ok ?now_location)))
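How a domain with these actions is turned into a concrete plan depends on the planner used. The snippet below is a minimal sketch that assumes a command-line PDDL planner invoked as planner domain.pddl problem.pddl, where the executable name and both file names are placeholders; it simply collects the grounded actions printed by the planner.

import subprocess

# Hypothetical invocation of an external PDDL planner on a domain containing
# the actions of Table 2; "planner" and both file names are placeholders.
result = subprocess.run(
    ["planner", "kindergarten_domain.pddl", "kindergarten_problem.pddl"],
    capture_output=True, text=True, check=True,
)

# Planner output formats vary; this assumes one grounded action per line,
# e.g. "(move robot1 wpx3y5 wpx2y5)" as in Table 3.
plan_steps = [line.strip() for line in result.stdout.splitlines()
              if line.strip().startswith("(")]
print(plan_steps)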
Table 3. Example of generated action sequence (plan) for task execution.
No. | Plan
1 | (move robot1 wpx3y5 wpx2y5)
2 | (question_location_check robot1 wpx2y5)
3 | (traffic_location_check wpx1y4 wpx2y5 robot1 car1)
4 | (robot_traffic_question wpx2y5 robot1 car1)
5 | (kid_traffic_question wpx2y5 robot1)
6 | (robot_traffic_answer wpx2y5 robot1)
7 | (kid_traffic_answer wpx2y5 robot1)
8 | (robot_traffic_answer wpx2y5 robot1)
9 | (kid_traffic_question wpx2y5 robot1)
10 | (traffic_clear wpx2y5 robot1)
11 | (crosswalk_1 robot1 wpx2y4 wpx2y4 crosswalk1)
12 | (crosswalk_2 robot1 wpx2y4 wpx2y3 crosswalk1)
13 | (clear_crosswalk robot1 wpx2y3 wpx2y2 crosswalk1)
14 | (move robot1 wpx2y2 wpx3y1)
15 | (move robot1 wpx3y1 wpx4y1)
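A plan in this form is straightforward to parse for execution. The sketch below splits each grounded step of Table 3 into an action name and its arguments; the dispatch itself is only hinted at, since in the real system each action is mapped to a robot behaviour such as a navigation goal or a spoken question.

# Split a grounded plan step such as "(move robot1 wpx3y5 wpx2y5)" into the
# action name and its arguments so it can be dispatched to a behaviour.
def parse_step(step):
    tokens = step.strip().strip("()").split()
    return tokens[0], tokens[1:]

plan = [
    "(move robot1 wpx3y5 wpx2y5)",
    "(question_location_check robot1 wpx2y5)",
    "(robot_traffic_question wpx2y5 robot1 car1)",
]

for step in plan:
    action, args = parse_step(step)
    # In the real system this would trigger navigation, speech, and so on.
    print(f"dispatch {action} with arguments {args}")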
Table 4. Task execution success rates according to environment complexity.
No. | Planned Actions | Number of Successes | Success Rate (%)
1 | 10 | 10 | 100
2 | 10 | 10 | 100
3 | 10 | 10 | 100
4 | 10 | 10 | 100
5 | 14 | 13 | 92.85
6 | 18 | 15 | 83.33
Table 5. Examples of word- and sentence-level questions used in the VQA experiment.
No. | Type | Question | Answer
1 | word | What is this? | car
2 | word | What is this? | water bottle
3 | sentence | What color is the car? | red
4 | sentence | Do you like the red car? | Yes, I like it
5 | sentence | What is this? | pink can recycled bag
6 | sentence | What color is the trash can? | white
7 | sentence | What kind of recycled bag is a pink bag? | can
8 | sentence | How many cars are there? | two
9 | sentence | How many recycled bags are there? | three
10 | sentence | Did you have any trash can at your house? | Yes
Table 6. Examples of SPARQL queries generated from natural language questions.
Natural language question: How many cars are there?
SPARQL:
SELECT * FROM <graph> WHERE {?s ?p ?o}
SELECT * FROM <graph1> WHERE {?s :isExistence :exist . ?s ?p ?o}
SELECT * FROM <graph1> WHERE {?s :isName :car . ?s ?p ?o}
SELECT (COUNT(?s) AS ?s) FROM <graph1> WHERE {?s a :Object}

Natural language question: Look at the traffic light. What light is it?
SPARQL:
SELECT * FROM <graph> WHERE {?s ?p ?o}
SELECT * FROM <graph1> WHERE {?s :isExistence :exist . ?s ?p ?o}
SELECT * FROM <graph1> WHERE {?s :isName :trafficlight . ?s ?p ?o}
SELECT * FROM <graph1> WHERE {[] :isLight ?o}
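For reference, queries of this kind can be executed over an in-memory scene graph. The snippet below is a simplified sketch using rdflib with a made-up namespace standing in for the :isName/:isExistence/:isLight vocabulary; the FROM <graph1> clauses of Table 6 are omitted because a single graph is used here, and the chained queries are collapsed into one counting query for the first question.

from rdflib import Graph, Namespace, RDF

# Hypothetical namespace standing in for the scene-graph vocabulary of Table 6.
EX = Namespace("http://example.org/scene#")

g = Graph()
# Two car objects built from Table 1-style semantic information.
for name in ("car1", "car2"):
    obj = EX[name]
    g.add((obj, RDF.type, EX.Object))
    g.add((obj, EX.isName, EX.car))
    g.add((obj, EX.isExistence, EX.exist))

# Simplified version of "How many cars are there?": count the existing
# objects whose name is car (the named-graph clauses are dropped).
query = """
PREFIX : <http://example.org/scene#>
SELECT (COUNT(?s) AS ?n) WHERE {
    ?s a :Object ;
       :isName :car ;
       :isExistence :exist .
}
"""
for row in g.query(query):
    print(row.n)   # prints 2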