PDDL Planning with Natural Language-Based Scene Understanding for UAV-UGV Cooperation

Abstract: Natural-language-based scene understanding can enable heterogeneous robots to cooperate efficiently in large and unstructured environments. However, studies on symbolic planning rarely consider the semantic knowledge acquisition problem associated with the surrounding environments. Meanwhile, recent deep learning methods show outstanding performance in semantic scene understanding using natural language. In this paper, a cooperation framework that connects deep learning techniques and a symbolic planner for heterogeneous robots is proposed. The framework is largely composed of the scene understanding engine, planning agent, and knowledge engine. We employ neural networks for natural-language-based scene understanding to share environmental information among robots. We then generate a sequence of actions for each robot using a planning domain definition language (PDDL) planner. JENA-TDB is used for storing the acquired knowledge. The proposed method is validated using simulation results obtained from one unmanned aerial vehicle and three ground vehicles.


Introduction
Natural language-based scene understanding is a critical issue in symbolic planning for heterogeneous multi-robot cooperation. We can mitigate the knowledge acquisition problem associated with symbolic planning by sharing environmental information expressed in natural language among diverse robots. Recently, heterogeneous multi-robot systems composed of robots with different abilities have received increasing attention as they are required in a broad range of fields such as surveillance, environment exploration, and field robotics [1]. Various symbolic planning studies have been conducted to generate a sequence of actions for each robot to achieve success in a shared mission. In particular, planning domain definition language (PDDL) is used as a standardized artificial intelligence planning language [2] and provides flexibility when planning actions for robots to achieve mission goals [3]. Miranda et al. [4] embedded a symbolic task planner using PDDL in the robot operating system (ROS) for multi-robot navigation. Zhang et al. [5] presented a multi-robot symbolic planning system with an iterative interdependent algorithm to find the optimal plans that minimize overall cost. Many studies have aimed to maximize overall utility and reduce costs when identifying optimal plans for multiple robots; by contrast, sharing environmental information between robots, which can mitigate the environmental knowledge acquisition problem, remains insufficiently studied. Robots can gather information about previously unmodeled objects, extract meaningful information from them, and share it to solve various mission planning problems, particularly in problems such as finding survivors in wildfire areas or spotting

Related Work
This paper is related to studies of heterogeneous multi-robot cooperation planning and natural language-based semantic scene understanding, the idea being to connect symbolic planning and deep learning.

Heterogeneous Multi-Robot Cooperation Planning
A multi-robot system has the advantage that, through cooperation, it can perform complex tasks that cannot be accomplished by a single powerful robot with many capabilities [20]. For example, a large building can be cleaned with one robot, but doing so is time-consuming and unrealistic. Thus, a multi-robot system that divides the overall mission into smaller sub-problems and dispatches them to individual robots is necessary. Rosa et al. [1] proposed a cooperative control scheme for a heterogeneous ground-air robot team. Wurm et al. [21] integrated a temporal planning approach with a PDDL planner for heterogeneous teams of robots. Jang et al. [22] solved the decision-making issues of aerial robots using an integrated decision-making framework. Kingry et al. [23] represented the environment in a scalar field and created a time-optimized mission plan for UGVs using a cascaded heuristic optimization algorithm. However, most studies of heterogeneous multi-robot systems focus on algorithms that achieve shared goals effectively, with minimum time and cost, rather than on acquiring knowledge of the environment using multiple robots.
Some researchers have attempted to solve the environmental knowledge acquisition problem through data sharing among the robots. Reis et al. [24] used an adaptive transmission method for efficient distributed information sharing. Jiang and Lu [25] proposed a shared information integration method for cooperative environmental data gathering. Foerster et al. [26] introduced two approaches that could learn how information may be shared: reinforced inter-agent learning and differentiable inter-agent learning. These studies shared raw sensor information, from which semantic meaning can hardly be inferred without additional algorithms, and such algorithms had to be designed to suit the individual robots in a heterogeneous multi-robot team. Unlike conventional studies, sharing information embedded with semantic meaning in the form of natural language can enable heterogeneous robots to easily understand and communicate with each other. Moreover, we can mitigate the quality-of-service problem, which is often observed in field robotics, by transmitting a compact representation of the environmental information. We introduce a method that acquires environmental knowledge in the form of natural language and applies it to multi-robot cooperation planning.

Natural Language-Based Scene Understanding
Many studies on robotics have proposed graph-based SLAM using semantic scene understanding and various sensors. Himri et al. [27] performed object recognition using range data and feature-based semantic SLAM with a UAV. Li et al. [28] proposed a dense 3D SLAM system composed of stereo-ORB-SLAM and a CNN for a traffic environment. Mao et al. [29] combined a mature SLAM system named RTAB-Map with a CNN to utilize depth image information. However, they rarely considered the natural language inference problem, which is important in multi-robot communication.
On the other hand, semantic scene understanding using natural language, such as image captioning, visual question answering (VQA), and scene graph generation, is widely studied in the field of computer vision. Lu et al. [30] generated image captions using an attention-based neural encoder-decoder framework. Lu et al. [31] utilized a co-attention model in a hierarchical fashion to perform VQA. Dai et al. [32] proposed a deep relational network that can exploit the statistical dependencies of detected objects and their relationships. Since these approaches use images as inputs, graph maps, which are widely used as an environment representation by robots, are rarely utilized. This paper proposes an architecture that includes natural language descriptions and scene graphs generated using a graph map in multi-robot planning.

Connecting Symbolic Planning and Deep Learning
Many studies of robotics involving mission planning with symbolic planners have been conducted. Srivastava et al. [33] demonstrated off-the-shelf task implementation with a PDDL planner. Dornhege et al. [34] applied geometric reasoning to symbolic planning and conducted real-world mobile manipulation experiments. Manso et al. [35] utilized graph models and graph rewriting rules with a symbolic planner for human-robot interaction. However, symbolic planning is hardly applicable to new, unforeseen, and dynamic environments, because the environments must be modeled directly by a human or via a compiler. In contrast, deep learning, which is a data-driven approach, has shown outstanding performance in environmental cognition [36][37][38]. To take advantage of both fields, Zhang and Sornette [39] introduced a deep symbolic network to represent any knowledge as a symbol. Liao and Poggio [40] converted objects into symbols using an object-oriented deep learning algorithm. They focused on generating symbols using deep learning, rather than setting the overall architecture for planning. In this study, we propose a method to bridge the gap between symbolic planning and deep learning techniques, and verify it using heterogeneous multi-robot cooperation planning.

Architecture
This section explains the framework devised to connect deep learning techniques and the symbolic planner for cooperation among heterogeneous agents. Unlike conventional planning systems for robots [19], our framework includes natural language-based cognition and a knowledge engine for multiple agents. The general overview of the framework is shown in Figure 2. It is composed of perception, cognition, planning, coordination, execution, and memory storage. In perception, sensor information obtained from the environment is continuously passed to cognition. In cognition, a natural language-based understanding of the scene is created by generating a language description and a scene graph using deep learning techniques. Then, the generated semantic information is passed on to the knowledge engine while raw sensor data are sent to episodic memory storage. Using the episodic memory and knowledge collected from multiple robots, the PDDL planning agent builds a sequence of actions for each agent. Then, the robots complete the required actions through coordination and execution. The details are illustrated in Figures 3 and 4.

Natural Language-Based Cognition
The cognition part of the scene understanding engine is largely composed of semantic graph generation, language description, and scene graph generation, as shown in Figure 3. To understand the surrounding environment in natural language, we generate a natural language description and a scene graph. In this study, we assume that the robots use a graph map (for motion planning) generated using semantic SLAM, which is a widely used environment representation method in robotics [7]. To utilize the graph map $G = (V, E)$, which contains the features and positions of the detected objects as nodes $v_i \in V$ and their relationships as edges $e_{ij} = (v_i, v_j) \in E$, we closely follow Moon and Lee [41] for generating the language description and the graph inference phase of Xu et al. [42] for the scene graph generation. However, the edge information of the graph map is either binary, indicating only whether a connection exists, or a weighted value that indicates relations such as the Euclidean distance between objects, so it is difficult to find the semantic meaning. Therefore, we additionally extract features of the union region of two objects for edge information. For each $v_i$ and $e_{ij}$, features are extracted using VGGNet [43]: $f_{v_i}$ is the feature vector of $v_i$, $f_{e_{ij}}$ is the feature vector of $e_{ij}$, and $p_i$ is the position vector of $v_i$.
In the language description part, a GCN extracts features from the graph map. The extracted graph feature is concatenated with a word and fed into the RNN as input. Then, the RNN generates a sentence with attention over the graph. In the scene graph generation part, two different message pooling methods are performed. Node message pooling uses the inbound and outbound edge states of a node. Edge message pooling uses the states of the objects connected by an edge. This process is repeated to precisely predict the natural language words corresponding to the nodes and edges of the graph.
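The union region used for the edge features can be illustrated with a short sketch. It assumes axis-aligned bounding boxes given as (x_min, y_min, x_max, y_max), a convention the text does not specify; the function name `union_box` and the example coordinates are hypothetical.

```python
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def union_box(a: Box, b: Box) -> Box:
    """Return the smallest axis-aligned box that encloses both boxes."""
    return (
        min(a[0], b[0]),
        min(a[1], b[1]),
        max(a[2], b[2]),
        max(a[3], b[3]),
    )


# Hypothetical detections for two nodes v_i and v_j.
human = (40.0, 60.0, 80.0, 160.0)
tree = (70.0, 10.0, 150.0, 170.0)

# Region that would be cropped before extracting the edge feature.
edge_region = union_box(human, tree)
```

Cropping the image to `edge_region` and passing the crop through the same feature extractor as the nodes would give one plausible way to obtain an edge feature that reflects how the two objects are arranged.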
A GCN with graph convolution layers defined by spectral graph theory and fully connected layers is utilized to extract features from irregular and non-Euclidean graphs. Then, an RNN is used to generate a language description over the graph. The RNN takes the encoded graph features concatenated with a word vector and predicts the probability distribution of the target word vector. Because the gradients are also back-propagated through the GCN when training the RNN, we can expect that graph features that fit the generated sentence will be extracted. The generated description can be used to understand the surrounding environment when an unexpected situation occurs.
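As an illustration of graph feature extraction, the sketch below implements one graph convolution layer in NumPy. It assumes the widely used first-order approximation of spectral graph convolution with a renormalized adjacency matrix; the layer actually used in [41] may differ. The weights are random and untrained, and the mean pooling used to obtain a single graph feature is likewise an assumption.

```python
import numpy as np


def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected graph."""
    a_hat = adj + np.eye(adj.shape[0])  # add self-loops
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_norm: np.ndarray, h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One propagation step: ReLU(A_norm @ H @ W)."""
    return np.maximum(a_norm @ h @ w, 0.0)


rng = np.random.default_rng(0)

# Hypothetical 3-node graph map: node 1 is connected to nodes 0 and 2.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
features = rng.normal(size=(3, 4))  # stand-in for per-node features
weights = rng.normal(size=(4, 2))   # untrained layer weights

a_norm = normalize_adjacency(adj)
hidden = gcn_layer(a_norm, features, weights)

# Assumed readout: average node states into one graph-level feature.
graph_feature = hidden.mean(axis=0)
```

In a full pipeline, a vector such as `graph_feature` would be concatenated with the current word embedding before being fed to the RNN.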
Scene graph generation is the process of finding appropriate words corresponding to each node and edge of the graph. We denote the variables to be predicted as $g = (v_i^{class}, e_{ij} \mid i = 1 \ldots n,\ j = 1 \ldots n,\ i \neq j)$, where $C$ is a set of object classes, $R$ is a set of relationship types, $v_i^{class} \in C$, and $e_{ij} \in R$. The optimal $g^*$ is found using an iterative message pooling method based on the gated recurrent unit (GRU). Edge features and node features are fed into the edge GRU and node GRU as the initial values, respectively. After the message pooling, the edge message is fed into the edge GRU and the node message is fed into the node GRU. Repeating this iteration refines the predicted words for the nodes and edges. The scene graph can be used to gather environmental information in natural language for large and unstructured environments.
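The iterative message pooling can be sketched as follows. This is a simplified illustration, not the trained model: the GRU weights are random, the hidden size `DIM` is arbitrary, and uniform averaging replaces the learned pooling weights of Xu et al. [42]. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # hidden size, chosen arbitrarily for this sketch


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUCell:
    """Minimal GRU cell with random, untrained weights."""

    def __init__(self, dim):
        self.w = rng.normal(scale=0.1, size=(3, dim, dim))  # input weights
        self.u = rng.normal(scale=0.1, size=(3, dim, dim))  # recurrent weights

    def step(self, x, h):
        z = sigmoid(x @ self.w[0] + h @ self.u[0])  # update gate
        r = sigmoid(x @ self.w[1] + h @ self.u[1])  # reset gate
        cand = np.tanh(x @ self.w[2] + (r * h) @ self.u[2])
        return (1.0 - z) * h + z * cand


def message_passing(node_feats, edge_feats, edges, steps=2):
    """edges[k] = (i, j) means edge k points from node i to node j."""
    node_gru, edge_gru = GRUCell(DIM), GRUCell(DIM)
    # Initialise the hidden states from the extracted features.
    h_node = node_gru.step(node_feats, np.zeros_like(node_feats))
    h_edge = edge_gru.step(edge_feats, np.zeros_like(edge_feats))
    for _ in range(steps):
        # Node message: average of inbound and outbound edge states.
        node_msg = np.zeros_like(h_node)
        count = np.zeros(len(h_node))
        for k, (i, j) in enumerate(edges):
            node_msg[i] += h_edge[k]
            node_msg[j] += h_edge[k]
            count[i] += 1
            count[j] += 1
        node_msg /= np.maximum(count, 1)[:, None]
        # Edge message: average of the two connected node states.
        edge_msg = np.stack([(h_node[i] + h_node[j]) / 2 for i, j in edges])
        h_node = node_gru.step(node_msg, h_node)
        h_edge = edge_gru.step(edge_msg, h_edge)
    return h_node, h_edge


edges = [(0, 1), (1, 0), (1, 2)]
node_states, edge_states = message_passing(
    rng.normal(size=(3, DIM)), rng.normal(size=(len(edges), DIM)), edges
)
```

A classifier over the final `node_states` and `edge_states` would then assign an object class from C to each node and a relationship type from R to each edge.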

Knowledge Engine
The knowledge engine obtains semantic environmental information in XML and stores it in a triple store, which uses the resource description framework (RDF) to represent data as "subject-predicate-object" or "resource-property type-value" triples, unlike the conventional relational database that saves data in a "key-value" form. The triple store uses the SPARQL Protocol and RDF Query Language (SPARQL) to create, read, update, and delete the graph data that contain relations between objects. The triple store facilitates the reasoning process by using the relations and attributes between objects to find new relations. In this study, we utilize JENA-TDB, a type of triple store. It is an open source framework developed by Apache for the manipulation of RDF data. JENA-TDB provides persistent storage for RDF data and a web UI through the Apache Fuseki interface over HTTP.
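The triple representation and pattern-based lookup can be illustrated with a toy in-memory store. This sketch is not the JENA-TDB or Fuseki API; the class `TripleStore`, its methods, and the inverse-relation rule `inFrontOf` are hypothetical and only stand in for the storage and reasoning roles described above. The example triples follow the "human-behind-tree" case used in the experiment.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """Toy in-memory stand-in for a triple store."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self._triples.add((s, p, o))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> List[Triple]:
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )

    def infer_inverse(self, predicate: str, inverse: str) -> None:
        """Derive new triples from a hypothetical inverse-relation rule."""
        for s, _, o in self.match(p=predicate):
            self.add(o, inverse, s)


store = TripleStore()
store.add("human", "behind", "tree")
store.add("human", "hasPositionX", "100")

# Comparable in spirit to: SELECT ?p ?o WHERE { plan:human ?p ?o }
about_human = store.match(s="human")

# Simple reasoning step: derive a new relation from an existing one.
store.infer_inverse("behind", "inFrontOf")
derived = store.match(p="inFrontOf")
```

A real deployment would issue SPARQL queries to the triple store instead of calling `match`, but the wildcard pattern plays the same role as the variables in a SPARQL triple pattern.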
The XML/OWL parser in the knowledge engine parses the XML file into an OWL ontology. An ontology is a model that explicitly describes conceptual meanings by restricting the relations in the artificial intelligence field. OWL is one of the ontology expression languages. It is designed to create an environment in which machines and agents can understand and utilize resources using reasoning and formal syntax. OWL defines the classes and properties of instances, describes relations between classes and subclasses, and infers new concepts. In this study, when the knowledge engine receives the XML file containing the taxonomy of classes and subclasses of the semantic information obtained by cognition, we classify the topological and semantic relations among objects as object property relations and the attributes of the objects as data property relations. The classified relations are described in OWL by the XML/OWL parser. The generated OWL ontology is saved in JENA-TDB through Fuseki over HTTP. When JENA-TDB receives a request from the environment modeler to hand over the information required to set the initial and goal states for mission planning, SPARQL is used to gather the data.

PDDL Planning Agent
We utilize ROSPlan, the planning agent of Cashmore et al. [19], as the PDDL planning agent. ROSPlan provides planning in the robot operating system (ROS). However, because it hardly utilizes natural language information obtained from the surrounding environment, we modified it for our approach. Two nodes are added to ROSPlan: one is the language description node and the other is the scene graph generation node. Besides Mongo DB, JENA-TDB is added for semantic memory storage. The plan dispatcher is extended to cover additional environment information from the simulator. In the planning agent, problem PDDL generation, plan generation, action dispatch, and replanning are performed. From the environment modeler and Mongo DB, data related to the initial state and mission parameters are gathered and fed into problem generation. Then, the problem PDDL is automatically generated and handed to a planner together with the domain PDDL. In this paper, the POPF planner is used. Once the plan is generated, the plan dispatcher converts the PDDL actions into ROS messages for the robots to complete the overall plan. During the execution, if an action fails because of changes in the environment, the planning agent reformulates the problem PDDL and replans.
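Automatic problem generation can be sketched as assembling a PDDL problem string from the gathered initial state and mission goals. The sketch below is only an illustration and does not reproduce ROSPlan's problem generator; the domain name `patrol`, the types, and the predicates `robot_at` and `visited` are hypothetical.

```python
from typing import Dict, Iterable, List


def fact(predicate: str, *args: str) -> str:
    """Format one grounded PDDL fact, e.g. (robot_at beyond poi1)."""
    return "(" + " ".join((predicate, *args)) + ")"


def make_problem(name: str, domain: str, objects: Dict[str, List[str]],
                 init: Iterable[str], goal: Iterable[str]) -> str:
    """Assemble a typed PDDL problem from objects, initial facts, and goals."""
    lines = [
        f"(define (problem {name})",
        f"  (:domain {domain})",
        "  (:objects",
        *[f"    {' '.join(names)} - {typ}" for typ, names in objects.items()],
        "  )",
        "  (:init",
        *[f"    {f}" for f in init],
        "  )",
        "  (:goal (and",
        *[f"    {f}" for f in goal],
        "  ))",
        ")",
    ]
    return "\n".join(lines)


# Hypothetical mission: a newly added POI must be visited.
problem = make_problem(
    name="find_child",
    domain="patrol",
    objects={"robot": ["beyond", "cookie1"], "waypoint": ["poi1", "poi9", "poi16"]},
    init=[fact("robot_at", "beyond", "poi1"), fact("robot_at", "cookie1", "poi9")],
    goal=[fact("visited", "poi16")],
)
print(problem)
```

Replanning then amounts to rebuilding the `init` facts from the current state and calling the generator again before handing the new problem to the planner.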

Experiment
We demonstrate the proposed framework with a scenario of patrolling and finding a missing child using one UAV and three AGVs. The operational diagram for the proposed method is illustrated in Figure 5. It is composed of the control tower, natural language processor, simulator, and JENA-TDB. The scenario is run in simulation to verify the proposed architecture. The details are as follows.

Experiment Setting
The simulation environment was designed as an area around REDONE technologies cooperation, as shown in Figure 6. The size of the area was 110 m × 100 m. We utilized three AGVs of REDONE technologies, each named Smart Cookie, and one UAV of REDONE technologies, named Beyond. Each Smart Cookie has 2D laser sensors and an RGB-D camera. Beyond is equipped with an RGB-D camera. The laser sensors are used for navigation in the execution part, while the RGB-D cameras are used in cognition for the natural language-based scene understanding. Each robot navigated using the generated map and its sensors. The platform was set up with Ubuntu 16.04, ROS Kinetic, and Gazebo 7. JENA-TDB is used as the semantic memory and Mongo DB is used as the episodic memory. DICQ.R is the control tower. We used the TensorFlow library and Python for the natural language processing, whereas Java was used for JENA-TDB and C++ was used for the simulator. Socket communication was utilized to transfer information between processes. To train the neural networks for scene understanding, we used the COCO dataset and the Visual Genome dataset for language description and scene graph generation, respectively. Since these datasets use images for natural language processing, we manually generated graphs using the bounding boxes and trained the networks. The details of the trained networks are shown in Appendix A.

Scenario
The overall scenario outline is illustrated in Figure 7. Two missions were performed. One involved patrolling the area, and the other was concerned with finding a missing child. While the robots were visiting the points of interest (POIs) for patrolling, a mission to find a missing child was generated by the DICQ.R. Every robot was required to report its current situation to the DICQ.R, including whenever an unusual situation occurred. During the mission, we considered what would happen if a dynamic obstacle, which a robot could not approach, suddenly appeared at a POI. In this situation, the robot generates a natural language description to report the current situation to the DICQ.R. Also, we expected at least one of the robots to find the missing child. In this case, we generated scene graphs to add POIs for the other robots to check. Analogously, the natural language-based scene understanding can be applied to other planning missions.

Results
The experiment involving patrolling and finding a missing child was successful. In this study, we used 16 POIs for robot patrolling according to the assigned area. When the child went missing, we assumed that a human was present at POI 9. Then, the robots were asked to check all the POIs and find a human who was likely to be the missing child. When such a human is detected, a scene graph is generated and sent to the DICQ.R. Using the obtained semantic information, a POI is added and the closest robot is asked to go to the POI to check whether the detected human is the missing child. Tables 1 and 2 show the generated plans for the robots. In the initial plan, POI 16 is not included. After the human is detected by Beyond, a new POI (16) is generated and is checked by Smart Cookie.
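Selecting the closest robot for a newly added POI can be sketched as a nearest-neighbor choice. The robot names, positions, and POI coordinates below are hypothetical, and straight-line distance in the plane is an assumption; the planner may instead use path costs on the map.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def closest_robot(robots: Dict[str, Point], poi: Point) -> str:
    """Return the name of the robot with the smallest Euclidean distance to the POI."""
    return min(
        robots,
        key=lambda name: math.hypot(robots[name][0] - poi[0],
                                    robots[name][1] - poi[1]),
    )


# Hypothetical positions (in meters) inside the 110 m x 100 m area.
robots = {
    "beyond": (10.0, 90.0),
    "smart_cookie_1": (20.0, 20.0),
    "smart_cookie_2": (70.0, 60.0),
    "smart_cookie_3": (100.0, 10.0),
}
new_poi = (75.0, 65.0)  # stand-in for the POI derived from the scene graph

assigned = closest_robot(robots, new_poi)
```

The chosen robot would then receive the new POI as an additional goal when the problem is regenerated for replanning.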
We used the XML file structure to send the semantic graphs to DICQ.R. The XML/OWL parser located inside the knowledge engine receives triple data that contain the scene graph information. The OWL file is generated by classifying the relations into object property and data property relations. The object property relation describes the relationship between objects, and the data property relation describes the properties of these objects. According to the command from the DICQ.R, which provides the mission parameters, JENA-TDB fetches semantic information using SPARQL and sends it to the environment modeler. For example, using the received triple data of "human-behind-tree," "behind" is saved as "owl:ObjectProperty rdf:about plan:behind/." "human-hasPositionX-100" is saved as "owl:DataProperty rdf:about plan:hasPositionX." Also, objects are parsed as "owl:NamedIndividual," which is used to describe instances. The generated language descriptions and scene graphs are shown in Figure 8a,b. The language descriptions and scene graphs were successfully generated in the simulation environment. As illustrated in Figure 8c, we utilized the natural language-based scene understanding across two situations: (1) the language description in the "failed mission situation," to inform the control tower about the current situation, and (2) the scene graph in the "human detected situation," to add a POI to verify whether the detected human is the missing child. As a result, we verified that the proposed framework could successfully perform the required planning using heterogeneous multiple robots based on natural language-based scene understanding.
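The classification of received triples into object property and data property relations can be sketched as follows. The rule used here, treating a triple as a data property when its object parses as a number, is an assumption made for illustration; the actual parser may rely on the XML schema instead. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def is_numeric_literal(value: str) -> bool:
    """Assumed rule: values that parse as numbers are attribute literals."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def classify(triples: Iterable[Triple]) -> Dict[str, List[Triple]]:
    """Split triples into object property and data property relations."""
    result: Dict[str, List[Triple]] = {"object_property": [], "data_property": []}
    for s, p, o in triples:
        key = "data_property" if is_numeric_literal(o) else "object_property"
        result[key].append((s, p, o))
    return result


# The two example triples mentioned in the text.
received = [("human", "behind", "tree"), ("human", "hasPositionX", "100")]
classified = classify(received)
```

Under this rule, "behind" links two individuals and becomes an object property, while "hasPositionX" carries a value and becomes a data property, matching the example above.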

Conclusions
We proposed a new framework for heterogeneous multi-robot cooperation based on natural language-based scene understanding. While other studies only used the raw sensor data for the purposes of perception, we focused on identifying semantic meanings from the surrounding environment to efficiently share information between heterogeneous agents. The framework combines deep learning and symbolic planning. Neural networks were used for the generation of semantic graphs and language descriptions. JENA-TDB was utilized to store semantic triple data. By gathering the data appropriate for mission parameters from JENA-TDB, the PDDL planner generated the sequence of actions for each robot. Using one UAV and three AGVs, the proposed method was successfully verified via simulation involving patrolling and finding a missing child.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: