Article

DK-SMF: Domain Knowledge-Driven Semantic Modeling Framework for Service Robots

1 Department of Electrical and Computer Engineering, College of Information and Communication Engineering, Sungkyunkwan University, Suwon 16419, Republic of Korea
2 Department of Intelligent Robotics, College of Engineering, Sungkyunkwan University, Suwon 16419, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2025, 14(16), 3197; https://doi.org/10.3390/electronics14163197
Submission received: 12 July 2025 / Revised: 6 August 2025 / Accepted: 8 August 2025 / Published: 11 August 2025
(This article belongs to the Section Systems & Control Engineering)

Abstract

Modern robotic systems are evolving toward conducting missions based on semantic knowledge. For such systems, environmental modeling is essential for successful mission execution. However, manual modeling must be repeated whenever a new environment is introduced, which is inefficient; modeling that can adapt to the environment is therefore needed. In this paper, we propose an integrated framework that enables autonomous environmental modeling for service robots by fusing domain knowledge with open-vocabulary-based Vision-Language Models (VLMs). When a robot is deployed in a new environment, it builds occupancy maps through autonomous exploration and extracts semantic information about objects and places. Furthermore, we introduce human–robot collaborative modeling that goes beyond robot-only environmental modeling. The collected semantic information is stored in a structured database and utilized on demand. To verify the applicability of the proposed framework to service robots, experiments are conducted in a simulated home environment and a real-world indoor corridor. In both environments, the proposed framework achieved over 80% accuracy in semantic information extraction. Semantic information about various types of objects and places was extracted and stored in the database, demonstrating the effectiveness of DK-SMF for service robots.

1. Introduction

Robots are advancing toward modeling the environment, perceiving their surroundings, planning actions, and performing services based on knowledge and information. In complex environments where conditions continuously change and various challenges arise, it is crucial for robots to successfully carry out their designated missions. To achieve this, technologies such as semantic frameworks are continuously advancing, and modeling the environment in which robots perform services is a key factor within these frameworks [1]. For example, in the case of service robots, a mobile robot that cleans photovoltaic modules can safely perform cleaning tasks without damaging the panels by pre-modeling the semantic properties of the photovoltaic panels and the surrounding work environment [2]. In the case of industrial robotic manipulators, modeling the position and properties of objects in the workspace allows the manipulator to acquire the object-centered task knowledge necessary for stable task execution [3]. The existing Semantic Modeling Framework (SMF) in semantic frameworks [4,5] defines the Triplet Ontological Semantic Model (TOSM) to model objects, places, and robots, enabling an understanding of the environment in a manner similar to humans. However, the initial SMF approach required manual modeling, in which people directly defined the objects and places in the environment where the robot operates and manually generated the database. This approach involved inefficiencies in time and cost, as it required re-modeling whenever the environment changed significantly or a new environment was introduced.
In this paper, we propose a Domain Knowledge-driven Semantic Modeling Framework (DK-SMF), which enables robots to autonomously explore spaces and perform environmental modeling using domain knowledge [6,7]. This approach utilizes Zero-shot Object Detection (ZSOD) [8,9] and Visual Question Answering (VQA) [10] among open-vocabulary-based VLMs [11] to detect objects and automatically extract their semantic information based on domain knowledge, without the need for dataset collection and training. Based on the extracted object information, it further extracts semantic information about the place using Large Language Models (LLMs) [12]. The collected environmental semantic information is transformed into a final structured on-demand database, such as Resource Description Framework (RDF) or Web Ontology Language (OWL) [13], based on the specified TOSM properties. Furthermore, the proposed framework enables robots to autonomously perform environmental modeling while simultaneously allowing humans to directly update the semantic information of objects or places through utterances such as voice or text, or to interact with robots by utilizing semantic information. For this purpose, we propose the Semantic Modeling Robot Language (SMRL), which converts human utterances into TOSM-property representations that robots can understand. With this method, humans can directly intervene in environmental modeling through the generated database. Finally, a semantic map is generated by stacking multiple layers that incorporate the completed semantic information of robots, objects, and places.
The Symbolic, Explicit, and Implicit TOSM properties for objects, places, and robots defined in the conventional Semantic Framework present knowledge properties similar to human cognitive methods, but they do not include the semantic information directly necessary for robot service execution. In other words, using only the basic TOSM properties, robots can perform roles such as recognition and planning, but they are limited in performing high-level services in complex environments. DK-SMF additionally defines Implicit properties for robots, objects, and places that are necessary for robot service execution. Adding affordance and purpose attributes expresses the robot’s service capability, and objects and places are likewise given additional purpose attributes so that robots can utilize objects or leverage the TOSM information of places when performing services. DK-SMF enables the automatic collection of these additional purpose attributes using VLMs and LLMs when extracting environmental information about objects and places. Furthermore, by predefining domain knowledge that limits the scope of environments to be modeled during the robot’s autonomous environmental modeling, DK-SMF enables highly reliable semantic information collection and prevents the VLMs and LLMs from producing hallucinated answers.
Recent advances in foundation models [14,15,16] and open-vocabulary object detection methods [8,9,10,11] have led to a growing number of studies applying these methods to semantic SLAM [17,18] and cognitive robotic modeling frameworks [19,20]. These approaches enable zero-shot extraction of semantic information about objects in unknown environments or allow language-based navigation using LLMs. However, despite these impressive developments, there remains a lack of research on methods that enable robots to autonomously extract both explicit and implicit semantic information about objects and places in unknown environments for modeling, or that allow users to participate directly in the modeling process.
The main contributions of this paper are as follows:
  • We newly define TOSM properties with additional implicit properties suited to service robots and DK-SMF.
  • Semantic information extraction of objects and places in environmental modeling is fully automated using VLMs and LLMs based on zero-shot methods.
  • The framework leverages domain knowledge to make zero-shot-based semantic modeling robust.
  • The framework integrates the extracted semantic information into a structured semantic database.
  • Through SMRL, humans can directly participate in environmental modeling, update environmental information, and interact with robots.
  • A semantic map is built by integrating semantic information through the semantic database.
The remainder of this paper is organized as follows: In Section 2, we provide a brief overview of other frameworks related to environmental modeling and compare them. In Section 3, we present the proposed framework as a whole, followed by a detailed explanation of each component. In Section 4, we conduct experiments and discuss the results. Lastly, we present a discussion in Section 5 and conclude the paper in Section 6.

2. Related Work

Environmental modeling is a fundamental prerequisite and core technological component necessary for successful mission execution by robots in complex environments. In existing semantic frameworks, robots, objects, and places were defined for environmental modeling, and on-demand databases were manually generated. Recently, many researchers have proposed semantic environment modeling and mapping methods using foundation models [14,15,16] such as LLMs [12] and VLMs [11]. These methods can be categorized according to which foundation models were applied in constructing the semantic map: closed-set object extraction with predefined labels (Hydra [21], MapNav [22], Kimera [23]); using ZSOD [8,9] and Vision-Language Embeddings (VLE) [24] (CoWs [25], NLMap [26]); employing zero-shot-based mask generation [27] with VLE (VLMaps [28], QueSTMaps [29]); combining ZSOD and mask generation (Allu et al. [30]); combining ZSOD and mask generation with VLE (ConceptFusion [31]); and utilizing ZSOD with LLMs (SeLRoS [32]).
This paper presents a comparative analysis of research on semantic environment modeling based on several criteria: the use of autonomous exploration, methods for semantic knowledge extraction, the incorporation of domain knowledge, the types of semantic knowledge collected (robot, object, place), and whether the modeling involves human participation. Table 1, sorted by the method of semantic knowledge extraction, presents a comparative overview of semantic modeling and map construction frameworks.
In Table 1, a checkmark (✓) indicates that a given paper applies the corresponding method (e.g., autonomous exploration or domain knowledge usage), whereas a dash (–) signifies that the method is not applied. For semantic information extraction, methods that provide class information together with appearance and implicit information are marked with ◉, methods that only partially satisfy this (e.g., simple class, ID, or pose) are marked with ●, and methods that do not provide semantic information at all are marked with ○.
To demonstrate that our framework considers the components necessary for such modeling, we provide detailed descriptions of several studies and our previous research.
Allu et al. [30] proposed a framework in which a mobile robot autonomously explored a large unknown indoor environment, collected object information based on an open vocabulary, and constructed a semantic map. The semantic map they constructed was composed of a combination of a 2D grid map and a topological graph that included object information. For environmental modeling, the system employed the Dynamic Window Frontier Exploration method [33] to comprehend the geometric structure of the environment and construct a 2D grid map. Based on the constructed map, real-time zero-shot object detection [8] and segmentation were performed using Grounding DINO [9] and MobileSAM [34]. The core contribution of this system lies in its adaptive updating capability in response to environmental changes. When objects were added, removed, or relocated, the robot compared the differences and updated the map in real time. The proposed method was demonstrated by constructing a semantic map in a large-scale indoor environment measuring 93 m × 90 m.
Gadre et al. [25] proposed a unified setting called Language-driven Zero-Shot Object Navigation (L-ZSON), in which a robot explored the environment and navigated toward an object [35] based on a human-provided description. This allowed the robot to infer and locate even unseen objects through zero-shot reasoning. To adapt to the L-ZSON task without additional model training or fine-tuning, they introduced the CLIP-on Wheels (CoW) framework. When the framework received a natural language instruction from the user, it performed autonomous exploration using Frontier-Based Exploration (FBE) [33] to extract semantic information. To extract the embedding information of the object corresponding to the instruction, the framework independently evaluated three different methods using the CLIP [36], OWL-ViT [37], and MDETR [38] models. The robot navigated to locations where object features show high similarity and determined whether the target object was detected. To evaluate the effectiveness of object inference, performance was assessed using the PASTURE benchmark.
Mehan et al. [29] proposed Queryable Semantic Topological Maps (QueSTMaps) to enable hierarchical semantic understanding of complex indoor scenes composed of multiple floors and rooms by a robot. The core objective of this framework is to enable accurate retrieval of specific locations on the map through natural language queries alone by aligning language and topology. To achieve this, the constructed 3D point cloud map was projected into a 2D map, and room segmentation was performed using Mask RCNN [39]. Subsequently, the SAM model [27] was used to extract object masks without classes, and CLIP [36] was applied to generate semantic embeddings for each object. These embeddings were arranged as sequences per room and fed into a transformer encoder to generate a semantic vector representing the room. The room embedding was compared with predefined room labels using cosine similarity to assign the final room label. The semantic information of both objects and places remained in the form of embeddings, allowing location retrieval based on natural language queries.
Kim et al. [32] proposed the Semantic Layering in Room Segmentation via LLMs (SeLRoS) framework, which enables a robot to perform semantic-based room segmentation on a 2D map using LLMs, allowing the robot to semantically perceive the environment. The goal of this framework is to enhance robot indoor navigation performance by incorporating semantic information, such as object recognition and spatial relation information, into conventional approaches that focus on geometric segmentation of indoor environments. For this purpose, room segmentation was performed using a Voronoi Random Field (VRF)-based algorithm [40,41] on the already built 2D occupancy map. Camera RGB images obtained from each segmented room were input to Detic [42] to extract object classes. Simultaneously, information interpreting each room’s area, shape, and adjacency relationships was collected. The collected object and spatial information were structured as prompts and queried to LLMs to infer semantic information for each room. The prompts to the LLMs consisted of hierarchical queries at both the room level and the environment level, contributing to integrating incorrectly segmented spaces and improving semantic consistency. The proposed framework utilized the AI2-THOR framework [43] to build 30 different indoor environments within the ProcTHOR [44] simulation. Semantic information was then validated through both qualitative and quantitative analyses.
Joo et al. [4,5] proposed an SMF based on the TOSM for environment modeling within the Semantic Navigation Framework. In the initial SMF, object information was collected manually. After selecting the service environment, the robot constructed a grid map using range sensors such as LiDAR. Semantic information about essential objects and places on the map was then manually observed and registered in the semantic database [13]. Additionally, the spatial relations between objects and places were observed and added to their respective semantic knowledge [45]. The resulting on-demand database was validated through various indoor and outdoor scenarios using the Semantic Information Processing (SIP) and Semantic Autonomous Navigation (SAN) frameworks.

3. Domain Knowledge-Driven Semantic Modeling Framework

The initial SMF mimicked the human ability to perceive and represent the surrounding environment, enabling robots to understand their environment and utilize semantic knowledge through TOSM-based environmental modeling. However, the initial SMF lacked an understanding of robot affordance [46,47], making it unable to distinguish the tasks or services that the robot could perform. Additionally, it lacked sufficient information on the intended use of objects for service robots to utilize them effectively. Furthermore, the existing modeling approach required users to manually intervene and carry out environmental modeling, demanding a significant amount of time and labor for setup. In this section, we redefine the properties of TOSM to better suit service robots and the proposed framework, and we present a method for fully autonomous modeling of the robot’s operating environment based on domain knowledge and zero-shot methods using DK-SMF, as illustrated in Figure 1. As shown in Table 2, the components and functions of DK-SMF for autonomous modeling are summarized by module.

3.1. Definition of TOSM-Based Properties for Service Robots

We extend the symbolic, explicit, and implicit models of TOSM for modeling the environmental elements (Robot, Object, Place) to suit service robots [1,4]. In particular, we extend the previous TOSM definition by introducing new attributes, refining the implicit properties to better accommodate the requirements of service robots.
First, we define the datatype properties of the service robot as shown in Table 3. Among these properties, the symbolic model remains consistent with the existing robot’s TOSM. However, in the explicit model, a sensor attribute has been added to represent the types of sensors equipped on the robot, enabling the utilization of sensor-related information. A battery attribute has also been introduced to incorporate information regarding the robot’s battery. Additionally, the implicit model includes hardware-dependent functions of the robot, the affordance of services it can provide, the purpose of the robot service, the robot’s current state, and the environment in which the service is performed. The affordance refers to the robot’s capabilities within the service domain, distinguishing between tasks it can and cannot perform. For example, the affordance of a household cleaning robot is represented as “vacuum, water clean,” indicating that the robot can provide vacuuming and water cleaning services but cannot perform other services. Additionally, the purpose and environment are represented as “clean for household” and “house”, respectively, indicating that this cleaning robot is designed for household use. These implicit attributes define the robot’s service domain, making it clear that it is not suitable for outdoor cleaning.
Next, to enable the service robot to utilize object properties, we define TOSM properties of the object as shown in Table 4. Like the robot’s TOSM, the symbolic and explicit models of the object remain unchanged from the previous object TOSM definition.
However, the implicit attributes of objects must be newly defined with properties appropriate for service robots. The existing implicit model has been extended by adding the purpose, isKeyObject, and isPrimeObject attributes. By adding the purpose attribute to an implicit model, the robot can better understand the intended use of an object, thereby enhancing its comprehension of the environment where the object exists. This leads to improved decision-making and more efficient task execution. For example, if the purpose of a refrigerator is identified as a “food storage device”, the robot can deduce that a location containing a refrigerator is likely a “kitchen”. It enables the robot to utilize a refrigerator for storing food items. Additionally, the isKeyObject and isPrimeObject attributes of an object are used for spatial relation modeling utilizing place semantic information and will be explained in detail in Section 3.3. Lastly, the spatialRelation attribute of an object is defined not only to represent spatial relations derived from place semantic information, but also to account for spatial configurations inferred from object detection results. This attribute can be utilized when the robot loses its location and relies on object detection from the SIP to re-localize itself. During environmental data collection, spatial relations are formed based on the bounding boxes (Bboxes) of detected objects at observed locations. For example, if a door, a vending machine, and an emergency sign are detected at a given spot, the spatialRelation attribute for the vending machine would indicate that the door isLeftTo it and the emergency sign isRightTo it. Since spatial relations are formed from place semantic information on the grid map, in situations where the map is unavailable, the robot estimates its current position using the information perceived solely through its camera. In such cases, the spatialRelation attribute of objects can be leveraged.
Finally, as shown in Table 5, the datatype properties of the Place class in the TOSM have also been redefined to better suit service robots. The TOSM properties of place also remain the same as in previous research, but the Implicit model has been enhanced by utilizing the functional purpose attribute, similar to the TOSM of objects in service robots. The purpose is represented as a string with detailed descriptions to help the service robot understand the function of a given place. For example, a cleaning robot can interpret the purpose of a kitchen as “a place for preparing and eating meals”. This allows the robot to avoid cleaning during lunch and dinner hours or to focus more on specific areas of the kitchen after the user has finished cooking. In this way, the robot’s understanding of a place contributes to determining the appropriate services it can provide.
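To make the extended definition concrete, the following minimal Python sketch shows one way the robot, object, and place TOSM properties summarized in Tables 3–5 could be represented in code. The field names and types are illustrative assumptions for exposition, not the framework’s actual ontology schema.

```python
# Illustrative sketch of the extended TOSM datatype properties (Tables 3-5).
# Field names are assumptions for exposition, not the framework's actual schema.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RobotTOSM:
    # symbolic / explicit
    name: str
    sensors: List[str]
    battery_level: float
    # implicit (extended for service robots)
    affordance: List[str]          # e.g., ["vacuum", "water clean"]
    purpose: str                   # e.g., "clean for household"
    environment: str               # e.g., "house"
    state: str = "idle"


@dataclass
class ObjectTOSM:
    obj_id: int
    obj_class: str
    pose: List[float]              # [x, y, z] in the map frame
    color: Optional[str] = None
    purpose: Optional[str] = None  # e.g., "food storage device"
    is_movable: Optional[bool] = None
    can_be_open: Optional[bool] = None
    is_key_object: bool = False
    is_prime_object: bool = False
    spatial_relation: Dict[str, List[int]] = field(default_factory=dict)  # {"isLeftTo": [ids], ...}


@dataclass
class PlaceTOSM:
    place_id: int
    place_class: str               # e.g., "kitchen"
    purpose: str                   # e.g., "a place for preparing and eating meals"
    boundary: List[List[float]]    # polygon vertices on the grid map
    object_ids: List[int] = field(default_factory=list)
    spatial_relation: List[int] = field(default_factory=list)  # adjacent place IDs
```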

3.2. TOSM Properties-Based Object Semantic Information Extraction

In DK-SMF, the autonomous collection of semantic information for objects is carried out through the process illustrated in Figure 2. First, the robot autonomously explores the environment and builds a grid map. Second, an image is captured at a specific location, and a list of objects present at that location is extracted. Third, the generated object list is used as a prompt and fed into the ZSOD model along with the image to detect the objects. Finally, the detected object data is processed to extract semantic information according to the desired TOSM properties.

3.2.1. Autonomous Exploration

In DK-SMF, we utilize an autonomous environmental exploration method to automatically extract object and place information and to build the base map of the semantic map.
We use the frontier-based exploration method to enable the robot to build a grid map based on GMapping SLAM while simultaneously generating local and global planners for navigation [33,48,49,50]. This method detects frontiers within the robot’s current exploration area and guides the robot toward navigable regions, enabling progressive environmental mapping. As the robot explores and constructs the map, it designates waypoints along its path to capture images of specific areas. As shown in Figure 3, the robot begins its exploration from the start point and designates a series of points along its path until it reaches the finish point. At each designated point, the robot collects sensor data to extract the list of objects in the environment, detects them, and gathers relevant semantic information. The sensor data includes RGB and depth images as well as robot odometry. The interval between each point varies depending on the scale of the environment and can be configured as a parameter. In the case of a household cleaning robot, an interval of approximately 0.5 to 1 m allows sensor data to be collected without occlusion of objects. Autonomous exploration terminates when no new areas remain to be sensed or when a predefined time limit is reached. The constructed map is then saved as a 2D grid map. If the exploration is terminated at a location other than the starting point, the robot generates a planner and navigates back to the start point to perform homing.
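As a concrete illustration of the waypoint designation described above, the following sketch records a data-collection point whenever the robot has traveled a configurable interval along its exploration path. The class and callback names are hypothetical, and the interval value is only an example consistent with the 0.5–1 m guideline mentioned above.

```python
# Sketch: designate data-collection waypoints at a fixed interval along the
# robot's exploration path, based on accumulated odometry distance.
# The interval value and the callback name are illustrative assumptions.
import math


class WaypointRecorder:
    def __init__(self, interval_m: float = 1.0):
        self.interval_m = interval_m
        self.last_point = None
        self.waypoints = []        # list of (x, y, yaw) where sensor data is captured

    def on_odometry(self, x: float, y: float, yaw: float) -> bool:
        """Return True when a new waypoint is designated at the current pose."""
        if self.last_point is None:
            self.last_point = (x, y)
            self.waypoints.append((x, y, yaw))
            return True
        dx, dy = x - self.last_point[0], y - self.last_point[1]
        if math.hypot(dx, dy) >= self.interval_m:
            self.last_point = (x, y)
            self.waypoints.append((x, y, yaw))
            return True            # caller captures RGB-D images and robot pose here
        return False
```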

3.2.2. Object List Generation

While navigating through autonomous exploration, the service robot generates, for each point, a prompt containing the object list that is fed to the ZSOD model together with the image data acquired at that point for object detection in the environment. As shown in Figure 4, the prompts are generated by combining the list of objects extracted from each point’s image using the VQA model [10] with a set of candidate objects manually selected from the Domain Knowledge Database (DK-DB) based on the likelihood that they exist in the service environment.
First, the method for extracting object lists using the VQA model is as follows: for each point, the image captured by the robot and a predefined prompt are fed into the VQA model. We use the VQA model to identify objects by recognizing the most salient objects in the captured image, similar to the approach used in salient object detection [51], and then identifying the remaining objects. For example, as shown in Figure 4, although various objects such as a white box and a vending machine are present in the image, the vending machine is likely to stand out as the most visually salient object due to its red color, which contrasts with the surrounding environment. In this way, objects that are more easily identifiable based on salient object detection are recognized first, followed by the remaining ones. This approach is made possible by composing a predefined query, which is used as input to the VQA model. Rather than fine-tuning or re-training the VQA model, prompt engineering based on In-Context Learning (ICL) [52] is applied to elicit the desired answers. The VQA model should be selected based on its ability to understand ICL-based prompts and provide answers that satisfy the required level of quality. The prompt designed for input to the VQA model consists of an answer rule specification, guided examples based on ICL, and a final question. We select Qwen2-VL-2B-Instruct [53] as the VQA model for our framework, as it is the most appropriate choice considering three key factors: GPU resources, computation time, and answer quality.
We compare candidate models, including BLIP-vqa-base [54], BLIP2-OPT, LLaVA-1.5/1.6-7B [55], and Qwen2-VL [53]. Models that provide high-quality answers to complex questions generally require greater GPU resources and exhibit slower computation times, whereas models with lower GPU requirements and faster processing speeds tend to produce subpar answer quality. Accordingly, we select the Qwen2-VL model [53] with 2B parameters, as it demonstrates the most reasonable trade-off among computation speed, GPU resource usage, and answer quality when compared with the other VQA models. As shown in Figure 4, the object list is generated by having the VQA model output the class names of detectable objects in the image in a format suitable for input to the ZSOD. Furthermore, to increase the number and accuracy of object detections by the ZSOD in the environment and to enable object-centric understanding of the environment, a list of domain-relevant objects is predefined according to domain categories from the DK-DB. For example, if the service robot operates in a home environment, objects likely to be present, such as “TV, refrigerator, dining table, door, and washing machine,” are selected in advance. If the robot operates in an office environment, objects like “desk, drawer, chair, door, and computer” are predefined. In addition, objects that are irrelevant for semantic information extraction, such as “ceiling” and “floor”, are excluded from the list. The operating environment and domain categorization of the robot are determined using the environment attribute within the service robot’s TOSM properties.
At each point where the robot collects image data, the generated object list and the predefined object list from the DK-DB are combined into a single list, which is then used as a prompt and provided as input to the ZSOD for object detection.
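The list-merging step can be illustrated with a short sketch that combines the VQA-generated object list with the domain candidates from the DK-DB and formats them as a single detection prompt. The period-separated prompt format follows the convention commonly used by Grounding DINO-style detectors, and the DK-DB structure shown here is a simplifying assumption.

```python
# Sketch: merge the VQA-generated object list with domain-knowledge candidates
# and build a single text prompt for the ZSOD model. The DK-DB layout and the
# exclusion list are illustrative assumptions.
def build_zsod_prompt(vqa_objects, dk_db, domain: str, exclude=("ceiling", "floor")):
    dk_candidates = dk_db.get(domain, [])             # e.g., dk_db["house"] = ["tv", ...]
    merged = []
    for name in list(vqa_objects) + list(dk_candidates):
        name = name.strip().lower()
        if name and name not in exclude and name not in merged:
            merged.append(name)
    return " . ".join(merged) + " ."                  # "tv . refrigerator . door ."


# Example usage with hypothetical values:
dk_db = {"house": ["tv", "refrigerator", "dining table", "door", "washing machine"]}
prompt = build_zsod_prompt(["vending machine", "white box"], dk_db, "house")
```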

3.2.3. Zero-Shot-Based Object Detection

To replace the manual annotation of object information used in the initial Semantic Framework modeling, a zero-shot object detection model is first employed to automatically detect objects in the environment, enabling automated extraction and modeling of object information. To address the closed-set limitation of one- or two-stage and transformer-based object detectors, which cannot detect objects not included in the training dataset, the ZSOD model is adopted to enable object detection across diverse environments based on an open-vocabulary approach. We use Grounding DINO 1.5 Pro [9] as the ZSOD model. It achieves the third-best performance among state-of-the-art methods on MSCOCO [56], LVIS v1.0 val [57], and LVIS v1.0 minival. This model enables accurate object detection and information extraction while reducing uncertainty. As shown in Figure 5, the object list generated in Section 3.2.2 and the RGB image from the RGB-D camera are fed into the ZSOD model, which outputs the class, Bbox, and confidence of the detected objects. From this output, the object ID, pose, and spatialRelation information are estimated. All detected objects are filtered to select only those suitable for TOSM properties extraction. First, objects with a confidence score below a predefined threshold are discarded. Next, objects located beyond a certain distance, calculated using depth data, are also excluded. Through these two filtering steps, objects with high reliability are selected and used for object information extraction.
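The two-stage filtering described above might look as follows in code; the threshold values, the detection record layout, and the assumption that the depth image is metric are illustrative.

```python
# Sketch of the two-stage detection filter: discard detections below a confidence
# threshold, then discard detections whose estimated depth exceeds a maximum range.
def filter_detections(detections, depth_image, conf_thresh=0.4, max_range_m=5.0):
    """detections: list of dicts {"class": str, "bbox": (x1, y1, x2, y2), "confidence": float}."""
    kept = []
    for det in detections:
        if det["confidence"] < conf_thresh:
            continue
        x1, y1, x2, y2 = det["bbox"]
        u, v = int((x1 + x2) / 2), int((y1 + y2) / 2)   # Bbox center pixel
        depth_m = float(depth_image[v, u])               # assumes depth is given in meters
        if depth_m <= 0.0 or depth_m > max_range_m:
            continue
        det["depth_m"] = depth_m
        kept.append(det)
    return kept
```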
As part of the modeling component of the SMF, attributes such as class, ID, and pose are extracted by the ZSOD model and incorporated as part of the TOSM data. The class and ID of each selected object are extracted, and IDs are assigned in the order of selection. The 3D pose of each object on the map is computed using the depth map from the RGB-D camera. Specifically, the depth value at the center of the Bbox is used to estimate the object’s position, and its x and y coordinates are calculated relative to the robot’s camera coordinate frame and then transformed into the map coordinate system. The object pose ($P_w$) is obtained using the following equations.
$$P_w = T_r^w \cdot T_c^r \cdot P_c$$
$$P_c = \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = D(u, v) \cdot K^{-1} \cdot \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$
The center coordinate $(u, v)$ of the object’s Bbox is multiplied by the inverse of the camera’s intrinsic matrix ($K$) and the depth value ($D(u, v)$) to compute the object position in the camera coordinate frame ($P_c$). The point in the camera coordinate frame ($P_c$) is then transformed into the world coordinate frame to obtain the object pose ($P_w$) by applying the camera-to-robot transformation ($T_c^r$) followed by the robot-to-world transformation ($T_r^w$). When the same object is detected in multiple images and its poses are estimated on the map, the center point of those poses is designated as the representative pose of the object.
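A minimal sketch of the back-projection in the equations above, assuming a pinhole intrinsic matrix and 4×4 homogeneous transforms supplied by the robot’s TF chain; the matrix values themselves are placeholders.

```python
# Sketch of the object pose computation: back-project the Bbox center with the
# camera intrinsics and depth, then chain the camera-to-robot and robot-to-world
# transforms, as in the equations above.
import numpy as np


def object_pose_world(u, v, depth_m, K, T_cam_to_robot, T_robot_to_world):
    """K: 3x3 intrinsics; T_*: 4x4 homogeneous transforms; returns [x, y, z] in the map frame."""
    pixel = np.array([u, v, 1.0])
    P_c = depth_m * np.linalg.inv(K) @ pixel          # point in the camera frame
    P_c_h = np.append(P_c, 1.0)                       # homogeneous coordinates
    P_w = T_robot_to_world @ T_cam_to_robot @ P_c_h
    return P_w[:3]
```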
The spatialRelation is determined by extracting the list of objects located to the left and right of each detected object based on their Bboxes in the image. As shown in Figure 6c, for example, with respect to the detected objects, the isLeftTo of the vending machine is a white box, while the isLeftTo of a white box is the door, and its isRightTo is the vending machine. The extraction of an object’s spatialRelation is based on its front-facing orientation, which helps prevent duplication of isLeftTo and isRightTo relations when the same object is detected from opposite directions.
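The left/right relation extraction can be sketched by comparing Bbox center x-coordinates of objects detected in the same image; handling of the front-facing orientation check mentioned above is omitted here, and the detection record layout is an assumption.

```python
# Sketch: derive isLeftTo / isRightTo relations among objects detected in one image
# by comparing Bbox center x-coordinates (image frame only; the front-facing
# orientation check described in the text is not modeled here).
def spatial_relations(detections):
    """detections: list of dicts with "id" and "bbox" = (x1, y1, x2, y2); returns {id: {...}}."""
    centers = {d["id"]: (d["bbox"][0] + d["bbox"][2]) / 2.0 for d in detections}
    relations = {}
    for d in detections:
        cx = centers[d["id"]]
        relations[d["id"]] = {
            "isLeftTo":  [i for i, c in centers.items() if i != d["id"] and c < cx],
            "isRightTo": [i for i, c in centers.items() if i != d["id"] and c > cx],
        }
    return relations
```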

3.2.4. Object Semantic Information Extraction

As shown in Figure 7, the VQA model is applied to expand the detection results from the ZSOD into the final TOSM properties required by the service robot and to extract semantic information. The attributes extracted as TOSM properties include color from the explicit model, as well as purpose of usage, isMovable, and canBeOpen from the implicit model. In the case of a “door,” the isOpen attribute is also included to determine whether the door is open. The VQA model [53] used for extracting object semantic information is the same as the one used during object list generation, and the extraction process proceeds as follows.
As shown in Figure 8, the object images extracted using the Bbox from the original image are overlaid onto a black background to create a post-processed image, which, along with a predefined query prompt for semantic extraction, is then fed into the VQA model. To reduce hallucination in object information extraction by the VQA model and to maintain consistency in answering, an ICL-based prompt engineering method is applied [53]. Also, the prompt includes content that incorporates the implicit properties of objects predefined from domain knowledge [58]. The DK-DB serves as domain knowledge that contains object information specific to a given environment [6,7]. The object class detected by the ZSOD is retrieved from the DK-DB, and the matched data is presented in the prompt as domain knowledge. The prompt is constructed in the following three-part structure.
The prompt begins with a rule for answering and some examples of questions and answers, includes domain knowledge extracted from the DK-DB, and ends with the formulation of the target question. The VQA output, as illustrated in Figure 9, shows an example using a vending machine. Based on the image of the vending machine detected by the ZSOD and the generated prompt, the extracted TOSM properties include color: “Red”, purpose of use: “Provides beverages and snacks to the customers”, isMovable: “No”, and canBeOpen: “Yes” as the vending machine is equipped with a door. The extracted semantic information of the object is used to estimate place-level TOSM properties based on object-centric reasoning and is stored in the database for use by the robot during service execution.
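The three-part prompt structure described above might be assembled as in the following sketch; the wording, the ICL example, and the DK-DB lookup are illustrative assumptions rather than the exact prompts used in the framework.

```python
# Sketch of the three-part VQA prompt: answer rule + ICL example, domain knowledge
# retrieved from the DK-DB for the detected class, and the target question.
def build_vqa_prompt(object_class: str, dk_db: dict) -> str:
    rules_and_examples = (
        "Answer each attribute on one line as 'attribute: value'.\n"
        "Example -> color: White, purpose: Stores food, isMovable: No, canBeOpen: Yes\n"
    )
    domain_knowledge = dk_db.get(object_class, "")
    question = (
        f"For the {object_class} shown in the image, give its color, purpose of use, "
        f"isMovable, and canBeOpen."
        + (" Also report isOpen." if object_class == "door" else "")
    )
    return f"{rules_and_examples}\nDomain knowledge: {domain_knowledge}\n\n{question}"
```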
Since object semantic information is extracted from images captured at regular intervals, there may be duplicate objects or misrecognized objects at the same location. If the same object is extracted multiple times at the same location, the average pose is calculated to represent it as a single object. If different objects are extracted multiple times, the most frequently detected object is chosen as the representative. If multiple objects are detected with the same frequency, the one with the highest confidence score is selected as the representative.
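The de-duplication rule can be expressed as a short sketch: average the pose when one class repeats at a location, otherwise keep the most frequent class and break ties by the highest confidence. The data layout is an assumption.

```python
# Sketch of co-located detection merging: average the pose for a repeated class,
# otherwise pick the most frequent class, with confidence as the tie-breaker.
from collections import Counter
import numpy as np


def merge_colocated(detections):
    """detections: list of dicts {"class": str, "confidence": float, "pose": [x, y, z]}."""
    counts = Counter(d["class"] for d in detections)
    top_count = max(counts.values())
    candidates = [c for c, n in counts.items() if n == top_count]
    if len(candidates) == 1:
        winner = candidates[0]
    else:                                   # tie: take the class with the highest confidence
        winner = max((d for d in detections if d["class"] in candidates),
                     key=lambda d: d["confidence"])["class"]
    poses = np.array([d["pose"] for d in detections if d["class"] == winner])
    return {"class": winner, "pose": poses.mean(axis=0).tolist()}
```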

3.3. TOSM Properties-Based Place Semantic Information Extraction

Along with object semantic information, semantic information about places is also necessary for the robot to understand the environment. This information is inferred by the LLMs and collected following the predefined TOSM properties. While the previous SMF approach required manual annotation and registration of place information on the map, the method we propose automatically extracts semantic information about places, as illustrated in Figure 10. The grid map built using distance sensors during autonomous exploration is segmented into distinct places, and the extracted objects are assigned to each segmented polygon. The object list collected from each polygon is forwarded as a prompt, along with predefined domain knowledge of the environment, to the LLMs to extract the semantic information of the place.

3.3.1. Place Segmentation of the Map

To extract the semantic information of each place in the service robot’s operating environment, the free-space areas on the map must be segmented. The regions on the map corresponding to places are segmented as shown in Figure 11.
First, the 2D map built using distance sensors such as LiDAR or a depth camera is initialized and pre-processed. Pre-processing is performed using morphological methods to remove unnecessary noise from the map image. Place segmentation is performed using the superpixel segmentation method based on the SLIC algorithm [59], and the number of superpixels required for segmentation is parameterized. After obtaining a mask on the map through SLIC superpixel segmentation, only the distinct superpixel regions generated within the map’s free space are retained, while all other regions are discarded. Next, the area of each generated boundary is calculated, and any region with an area smaller than a threshold is discarded. Each segmented space, considered a place, is filled with a random color for visualization, and the centroid of each polygon is computed using the moment-based method, which calculates the geometric center. Each place polygon is then assigned as a boundary attribute of the place TOSM properties.
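A compact sketch of this segmentation step, assuming an 8-bit occupancy-grid image in which free space is near-white; the parameter values, free-space threshold, and pre-processing kernel are placeholders, and the exact pipeline in the framework may differ.

```python
# Sketch of place segmentation: denoise the map, run SLIC superpixels, keep only
# regions inside free space that exceed an area threshold, and take moment-based
# centroids. Requires scikit-image >= 0.19 for the channel_axis argument.
import cv2
import numpy as np
from skimage.segmentation import slic


def segment_places(grid_map, n_segments=40, min_area_px=400, free_thresh=250):
    # morphological opening removes speckle noise from the map image
    kernel = np.ones((3, 3), np.uint8)
    clean = cv2.morphologyEx(grid_map, cv2.MORPH_OPEN, kernel)
    free_mask = clean >= free_thresh                      # free space (near-white) only

    labels = slic(clean, n_segments=n_segments, compactness=10, channel_axis=None)
    places = []
    for lab in np.unique(labels):
        region = ((labels == lab) & free_mask).astype(np.uint8)
        if int(region.sum()) < min_area_px:               # discard small fragments
            continue
        m = cv2.moments(region, binaryImage=True)
        cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]  # geometric center of the polygon
        places.append({"mask": region, "centroid": (cx, cy)})
    return places
```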

3.3.2. Place Semantic Information Extraction

To extract the semantic information of each segmented place, the semantic information of objects collected in Section 3.2 is utilized. Each object includes its pose information, which is used to place it on the map and associate it with the corresponding segmented place. As shown in Figure 12, the semantic information of each place is extracted using the object information it contains. The object list within each polygon is used along with the LLMs to extract the class and purpose of the place. The prompt is generated and provided as input to the LLMs. This prompt follows a structure similar to that used for object information extraction in the VQA model, and incorporates domain knowledge from the DK-DB, aiming to maintain consistency in responses by restricting place information to those present within the specific environment domain.
The structure of the prompt to be input into the LLMs consists of three parts: presenting the answering rule for place information extraction, providing the object list vector of each place along with domain knowledge about places from the DK-DB, and finally composing the target question to complete the prompt. When the prompt is input into LLMs, such as GPT [60], Gemini [61], or LLaMA [62], the semantic information of each place is generated. The outputs are the class of the place and its functional purpose. In addition, to generate the final TOSM properties for each place, the previously constructed boundary information and the associated objects are included in the TOSM properties of the place. Among the TOSM attributes of a place, the spatialRelation includes the IDs of adjacent places based on the center point of the polygon representing the current place. This attribute provides essential information for task planning in the SAN.
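A minimal sketch of the place-level prompt assembly, assuming the DK-DB exposes a list of candidate place classes for the current domain; the wording is illustrative rather than the exact prompt used.

```python
# Sketch of the three-part place prompt: answering rule, object list of the polygon
# plus place domain knowledge from the DK-DB, and the target question.
def build_place_prompt(object_classes, dk_places) -> str:
    rule = ("Answer as 'class: <place class>; purpose: <one-sentence functional purpose>'. "
            "Choose the class only from the domain place list.")
    knowledge = "Domain place list: " + ", ".join(dk_places)
    question = ("Objects observed in this region: " + ", ".join(object_classes) +
                ". What is the class and functional purpose of this place?")
    return f"{rule}\n{knowledge}\n{question}"


# Example usage with hypothetical values:
prompt = build_place_prompt(["refrigerator", "oven", "dining table"],
                            ["kitchen", "living room", "bedroom", "bathroom", "corridor"])
```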
In addition, it is necessary to determine the KeyObject and PrimeObject for each place. Although these attributes belong to the implicit model of objects, they are not estimated during object information extraction. Instead, they are inferred during place information extraction, as they serve to identify whether an object plays a central role within a place. The estimation method is as follows.
$$S_{o_i} = w_s \cdot s_i + w_c \cdot c_i + w_f \cdot f_i + w_d \cdot d_i, \quad \left( w_s + w_c + w_f + w_d = 1, \; i = 1, 2, \ldots, n \right)$$
$$O_{p_i} = \begin{cases} o_i & \text{if } S_{o_i} \geq \theta_P \\ \text{reject} & \text{otherwise} \end{cases}$$
$$O_k = \begin{cases} \arg\max_{o_i} S_{o_i} & \text{if } \max_i S_{o_i} \geq \theta_K \\ O_{p_i} & \text{otherwise} \end{cases}$$
The semantic information of each object within the polygon is weighted to assign a score $S_{o_i}$ to each object ($o_i$). The weights ($w_s$, $w_c$, $w_f$, $w_d$) correspond to saliency ($s_i$), confidence ($c_i$), detection frequency ($f_i$), and duplicate number ($d_i$), respectively. If the weighted score $S_{o_i}$ exceeds the prime object threshold ($\theta_P$), the object is classified as a prime object ($O_p$). Furthermore, the object with the highest score among all objects in each place is designated as the key object ($O_k$) if its score exceeds the key object threshold ($\theta_K$). When an object assigned to the KeyObject attribute is recognized by the SIP, the place can be inferred hierarchically, as the key object is unique to that place. Similarly, when objects with the PrimeObject attribute are recognized, the place can also be inferred based on the set of those objects.
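The scoring rule in the equations above can be implemented directly, as in the following sketch; the weights, thresholds, and the assumption that all four features are normalized to [0, 1] are placeholders.

```python
# Sketch of prime/key object scoring: weighted sum of saliency, confidence,
# detection frequency, and duplicate count, thresholded per the equations above.
def score_objects(objects, weights=(0.3, 0.3, 0.2, 0.2), theta_p=0.5, theta_k=0.7):
    """objects: list of dicts with "id", "saliency", "confidence", "frequency", "duplicates"."""
    w_s, w_c, w_f, w_d = weights                          # must sum to 1
    prime_ids, scores = [], {}
    for o in objects:
        s = (w_s * o["saliency"] + w_c * o["confidence"]
             + w_f * o["frequency"] + w_d * o["duplicates"])
        scores[o["id"]] = s
        if s >= theta_p:
            prime_ids.append(o["id"])                     # prime objects of this place
    key_id = None
    if scores:
        best = max(scores, key=scores.get)
        if scores[best] >= theta_k:
            key_id = best                                 # unique key object of this place
    # if no object clears theta_k, the place keeps prime objects but no key object
    return prime_ids, key_id
```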

3.4. Semantic Modeling Robot Language

The proposed framework enables the robot to autonomously perform environmental modeling. However, since the end user of a service robot is a human, it is also necessary to allow humans to model the environment based on their own observations. Updating the semantic DB directly by hand, such as reviewing and correcting information about objects or places or adding new attributes from the environment, is inefficient for the user. Since users have powerful communication methods such as speech and text, updating the semantic DB through these modalities is the most efficient approach for them. We propose SMRL, a method for human–robot interaction in modeling, where input sentences given through human utterances such as speech or text are converted in accordance with TOSM properties and used to update the semantic DB.
To allow humans to update modeling information through communication, the process is carried out as illustrated in Figure 13. The user inputs an utterance into the SMRL module in the form of either text or voice. The user’s voice is converted into text using Speech-to-Text (STT) and then input into the module as text. The input sentence is subsequently converted to align with the TOSM properties, and the process proceeds as follows. The input sentence is converted using the LLMs. A predefined prompt is queried to the LLMs, which outputs the converted information. The contents of the prompt are defined based on rules for extracting TOSM properties of places and objects, which are derived from the DK-DB. Following these rules, more than three examples of input–output pairs are provided based on ICL for prompt engineering, guiding the model to produce the desired output. As illustrated in Figure 13a, when the user says, “There are a black TV and a brown desk in the seminar room,” the sentence is input into the module. Through SMRL, two objects, TV and desk, are generated as TOSM properties for objects, and one place, seminar room, is generated as a TOSM property for a place. In addition, as shown in Figure 13b, when a user queries the robot about which objects are present in a specific place, the question is converted into a TOSM-formatted sentence through SMRL. This sentence is then compared and matched with the data stored in the semantic DB, and an appropriate answer is returned in response to the query.
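A minimal sketch of the SMRL prompt construction, assuming a JSON output convention for the converted TOSM entries; the rule text and ICL example are illustrative, and the actual SMRL prompts may differ.

```python
# Sketch of the SMRL step: wrap the user's (speech-to-text) utterance in an
# ICL-style prompt that asks an LLM to emit TOSM-formatted entries.
def build_smrl_prompt(utterance: str) -> str:
    rules = ("Convert the sentence into TOSM entries as JSON: "
             '{"objects": [{"class", "color", "place"}], "places": [{"class"}]}. '
             "Use only attributes defined in the TOSM schema.")
    examples = (
        'Input: "There is a red sofa in the living room."\n'
        'Output: {"objects": [{"class": "sofa", "color": "red", "place": "living room"}], '
        '"places": [{"class": "living room"}]}\n'
    )
    return f"{rules}\n{examples}\nInput: \"{utterance}\"\nOutput:"
```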
Through the SMRL approach, we demonstrate that humans are not limited to delegating modeling to robots but can actively engage in the modeling process through HRI. This demonstrates the potential to more effectively teach robots the human way of thinking.

3.5. Semantic Database Generation and Map Representation

The data obtained from the ZSOD through the robot’s sensors, object information extraction from detected data, place information extraction based on object information, and semantic information derived from human utterances are integrated into a unified database. The constructed semantic DB can be accessed on demand by the SIP and SAN modules within the semantic framework whenever data is required.
The extracted object and place semantic information exists as temporary metadata in an unstructured text format and therefore cannot be directly stored or utilized in ontology-based databases such as RDF or OWL. Therefore, a pipeline is designed to convert this unstructured metadata into data compatible with ontology-based databases. First, the collected semantic data in text form is mapped to a predefined TOSM schema to be converted into a database-compatible format. This schema defines the required and optional attributes of objects, their data types, and unit systems, and establishes a one-to-one correspondence with ontology classes and properties. Static type checking is used to compare the schema specification with the actual data types to identify any inconsistencies. Normalization is then performed, including coordinate and unit conversion, as well as converting string values into integer or floating-point formats, in order to standardize the representation of data units. In addition, datatypes such as size, pose, and velocity, which are in the form of lists or arrays, undergo cardinality inspection based on semantic constraints. Data that violates the defined minimum or maximum cardinality is discarded. All attributes that pass the datatype validation process are converted into RDF triple format, serialized into either Terse RDF Triple Language (Turtle) [63] or RDF/Extensible Markup Language (XML) format, and ultimately stored as a semantic DB for the robot that is compatible with ontology protocols. The proposed pipeline for semantic DB construction ensures semantic type stability and cardinality integrity, allowing any form of modeled information to be incorporated into the ontology-based TOSM structure without information loss.
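The final serialization step can be sketched with rdflib, mapping a validated object record to RDF triples and writing Turtle output; the namespace URI and property names below are placeholders rather than the framework’s published TOSM ontology.

```python
# Minimal sketch: map validated TOSM attributes to RDF triples and serialize them
# as Turtle with rdflib. Namespace and property names are illustrative placeholders.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

TOSM = Namespace("http://example.org/tosm#")   # placeholder namespace


def object_to_turtle(obj: dict, path: str = "semantic_db.ttl") -> None:
    g = Graph()
    g.bind("tosm", TOSM)
    subject = TOSM[f"object_{obj['id']}"]
    g.add((subject, RDF.type, TOSM.Object))
    g.add((subject, TOSM.hasClass, Literal(obj["class"], datatype=XSD.string)))
    g.add((subject, TOSM.purpose, Literal(obj["purpose"], datatype=XSD.string)))
    g.add((subject, TOSM.isMovable, Literal(obj["is_movable"], datatype=XSD.boolean)))
    for axis, value in zip(("poseX", "poseY", "poseZ"), obj["pose"]):
        g.add((subject, TOSM[axis], Literal(float(value), datatype=XSD.float)))
    g.serialize(destination=path, format="turtle")
```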
The semantic DB is visually represented as a semantic map composed of four layers, as shown in Figure 14. The base layer of the semantic map is the grid map built through autonomous exploration. The next layer is the object map, which contains the extracted semantic information of objects, with each object’s pose represented as a node. Above the object map is the place map layer, which visualizes the segmented polygons of places on the grid map, with the centroid of each polygon represented as a node. Finally, the topmost layer is the topological map, which is used by the robot as a planner in the SAN. This layer represents navigation nodes centered around place and object nodes, along with edges indicating navigation paths, enabling the robot to move to specific places or the vicinity of specific objects. Nodes are automatically generated within the robot’s navigable free space by parameterizing the interval between nodes. Edges are then generated to connect the generated nodes. Among the edge attributes, an attribute called isAccessible is defined to determine whether the robot can navigate a given edge. This attribute enables real-time replanning if a path becomes blocked.
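As an illustration of the topological layer, the following sketch builds a graph of navigation nodes with isAccessible edge flags using networkx; the node sampling and the distance-based connectivity rule are simplifying assumptions.

```python
# Sketch of the topological layer: navigation nodes sampled in free space and
# edges carrying an isAccessible flag so that a blocked edge can trigger replanning.
import networkx as nx


def build_topological_layer(nodes, max_edge_len=2.0):
    """nodes: list of (node_id, (x, y)) sampled within navigable free space."""
    g = nx.Graph()
    for node_id, (x, y) in nodes:
        g.add_node(node_id, x=x, y=y)
    for i, (id_a, (xa, ya)) in enumerate(nodes):
        for id_b, (xb, yb) in nodes[i + 1:]:
            dist = ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5
            if dist <= max_edge_len:
                g.add_edge(id_a, id_b, length=dist, isAccessible=True)
    return g


# When a path segment becomes blocked, flip the flag and replan:
# g.edges[n1, n2]["isAccessible"] = False
```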

4. Experiments

In this section, we present experiments and analyses of the proposed framework in various environments. The objective is to verify whether our environmental modeling framework is suitable for service robots.

4.1. Experimental Environments and Setup

To verify whether our framework is suitable for service robot applications, we conducted experiments both in simulation and in a real-world environment.
The simulation was conducted in a home-like environment where a cleaning robot performs services. As shown in Figure 15a, we constructed a household environment using the NVIDIA Isaac Sim™ [64] simulator. The environment consists of a complex multi-room indoor layout including a living room, kitchen, bedroom, and bathroom. Objects commonly found in each of these places were arranged accordingly, allowing the robot to infer the characteristics of various places and objects within a household environment. The robot platform used in the simulation was configured to match the size and sensor setup of a real household cleaning robot, and this virtual robot was deployed within the simulated environment. The simulation was executed on a PC equipped with an Intel Core i9-14900K CPU and an NVIDIA RTX 4080 GPU.
The real-world experiment, shown in Figure 15b, was conducted in a simple environment, an indoor corridor, where a multi-purpose robot could collect environmental information to execute missions such as guidance, delivery, and serving. A corridor-like setting was intentionally chosen for its simplicity, enabling the validation of semantic information extraction in the presence of repeated and duplicated objects. In the real-world experiment, the robot platform was a four-wheeled mobile robot equipped with an RGB-D camera and a 2D LiDAR sensor. The autonomous exploration algorithm and sensor data collection were processed on a Single-Board Computer (SBC) powered by an Intel Core 11th Gen i5 CPU. The collected sensor data was transmitted to a server PC equipped with an Intel Core i9-14900K CPU and an NVIDIA RTX 4080 GPU, where semantic information about objects and places was extracted and the database was constructed.

4.2. Experimental Scenario and Design

We demonstrate the suitability of DK-SMF through common scenarios in two environments. In each scenario, the robot performs autonomous exploration and builds maps in both experimental environments, a simulated house and a real-world indoor corridor. While building the map, RGB and depth images, along with the robot’s current position, are collected by designating points at regular intervals. Once map construction is completed, object and place information is extracted from the collected data and consolidated into a semantic database. Additionally, any further environment modeling required by the user is performed using SMRL, which generates and saves the corresponding semantic information.
We first validated the proposed framework through interpretative and visual evaluations based on qualitative results regarding semantic environment modeling. At the object level, we assessed the consistency between the object list generated within the environment using domain knowledge from the DK-DB and the collected semantic information of objects and places. Furthermore, we emphasized the practical benefits gained by integrating the automatically collected semantic information into the semantic database. Therefore, we qualitatively analyzed through experiments in both simulation and real-world environments whether semantic information about unknown objects and places can be collected and structured into the database. In addition, we evaluated whether the semantic information of objects and places generated through SMRL aligns with their actual attributes and functions and assessed the completeness of the generated semantic information. This qualitative analysis highlights the practical advantages of collecting semantic information using domain knowledge, demonstrating that it is an effective and viable approach for service robots in performing their tasks.
In the experimental scenario, we also validated the proposed framework through the following quantitative results, in addition to the qualitative evaluations. We measured the object detection accuracy against ground-truth objects present in the environment and evaluated the accuracy of the semantic information extracted from the detected objects. The accuracy of object semantic information extraction was evaluated by comparing the predicted outputs with human-labeled ground-truth. We also counted the number of objects registered in the semantic database. Additionally, for the semantic information of places, we measured the accuracy of the predicted semantic information for each segmented place using the same evaluation approach as for object semantics. The number of places registered in the semantic database was also counted. The analysis of these quantitative results verified how accurately the robot extracts semantic information, thereby validating the practical applicability of the system to real-world robotic deployments.

4.3. Experimental Results

We first confirmed through simulation that a cleaning robot autonomously explores a house environment using the proposed method and generates a semantic database by utilizing the collected sensor data. As shown in Figure 16, we verified that object labels are present on the created 2D grid map, and place labels are also generated for each segmented polygon.
The blue points represent the center coordinates of each generated place, while the yellow points represent the prime objects of each place. Similarly, the red points represent the key objects, which are the most essential ones existing in each place. The object points represent the map coordinates of objects recognized by the robot. In the simulated house environment, since the types of objects are defined in a limited way with a focus on furniture, we confirmed that prominent objects were used to generate object list prompts. For object lists not generated by VQA, information from DK-DB was used to detect objects based on ZSOD. Additionally, we confirmed that the remaining semantic information, including implicit properties, was generated using VQA. Regarding the generated object semantic information, each segmented place was assigned an appropriate label and purpose. For example, the area containing a refrigerator and an oven was identified as a “kitchen”, while the area with a TV and a sofa was labeled as a “living room”. Each place was also assigned a purpose consistent with its characteristics. However, in some places, the semantic information of objects was not generated, and as a result, the semantic information for those places could not be collected, causing them to be labeled as “unknown.” This occurred due to physical limitations that restricted the robot’s movement, preventing it from completing the map or collecting object information. Moreover, even if a place is accessible, it is not possible to infer its information without sufficient object information. The area labeled as “passway”, located between the “living room” and the “kitchen”, illustrates ambiguity in boundary segmentation. In this case, the shelf served as the key object, and the label was generated based on a combination of nearby prime objects. This demonstrates that the place label was determined using predefined domain knowledge.
Based on the simulation results, we conducted experiments in a real environment. As shown in Figure 17, the experimental robot autonomously explored approximately 30 m of a corridor in an indoor environment and collected sensor data. We confirmed that the collected data was converted into a semantic database, and the semantic information for objects and places is summarized as follows. Since the environment was a corridor, hinged doors appeared repeatedly, and certain locations contained objects such as fire extinguishers, water dispensers, and vending machines. As in the simulation, key objects and prime objects were identified in most locations. In some areas with repeated instances of the same objects, only prime objects were present without a key object. The semantic information for most places was identified as “corridor”, while places where no object information was collected were labeled as “unknown”.
Additionally, we conducted an experiment using the SMRL method with a UI program to update semantic information. As shown in Figure 18, when a person inputted the voice utterance “There are a water purifier and a trash bin in the seminar room,” the SMRL system converted the input utterance into TOSM data, saved it to the semantic database, and responded with “Update Complete.” Along with the scenario for updating semantic information, we also tested a scenario to verify whether specific semantic information exists in the system. As shown in Figure 19, when asked, “Are there a water purifier and a trash bin in the corridor 112?”, the system checked the database and confirmed the presence of the refrigerator in the kitchen by responding, “Yes, there is a refrigerator in the kitchen.”
Through the three types of experiments, we demonstrated that DK-SMF is effective for enabling service robots to autonomously model unknown environments. In particular, we verified that environmental modeling can be performed not only by robots but also with active participation from humans. However, the experiments also revealed certain limitations of our framework. In some cases, objects could not be recognized in certain areas, or the segmentation of specific places was not clearly defined, resulting in those places being labeled as “unknown.” These cases suggest the need to refine domain knowledge to establish more precise criteria for place labeling. Additionally, they indicate that relying solely on object information may be insufficient, and that integrating VLM methods, such as VQA, to extract semantic information directly from place images could be necessary.
As shown in Table 6, we calculated the area of the maps created through autonomous exploration in both environments and the area of free space in which the robot can move. In the simulated home environment, the robot collected sensor data every 1.2 m, generating 23 points after map completion. In the real indoor corridor environment, the robot collected sensor data every 2.0 m, generating a total of 15 points. The free-space area was calculated by identifying the pixels representing free space in the generated grid map and converting the pixel count to physical units using the map resolution. In both the simulation and real-world environments, over 95% of the free space, defined as the navigable area from which semantic information can be extracted, was captured in the map. This result verifies that the robot’s autonomous exploration is appropriate for semantic environment modeling.
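A minimal sketch of this area computation is shown below, assuming a ROS-style occupancy grid in which 0 marks free cells, 100 occupied cells, and -1 unknown cells, with a 0.05 m resolution; these conventions are assumptions and may differ from the exact pipeline used in the experiments.

```python
# Minimal sketch (assumed grid conventions): computing mapped and free-space
# area from an occupancy grid, where each cell covers resolution^2 square meters.
import numpy as np

def map_areas(grid: np.ndarray, resolution: float) -> tuple[float, float]:
    cell_area = resolution ** 2                          # m^2 per cell
    mapped = np.count_nonzero(grid != -1) * cell_area    # free + occupied cells
    free = np.count_nonzero(grid == 0) * cell_area       # navigable cells only
    return mapped, free

grid = np.full((200, 200), -1, dtype=np.int8)  # toy 10 m x 10 m map, all unknown
grid[20:180, 20:180] = 0                       # carve out explored free space
grid[90:110, 90:110] = 100                     # insert an obstacle
mapped_area, free_area = map_areas(grid, resolution=0.05)
print(f"mapped: {mapped_area:.2f} m^2, free: {free_area:.2f} m^2")
```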
Table 7 shows the accuracy of ZSOD and semantic information extraction, as well as the number of object types and the amount of semantic information extracted in each environment. In the simulation-based house environment, a total of 15 types of objects were recognized, and 30 complete sets of object semantic information were extracted and stored in the semantic DB. The object detection accuracy was 78.1% and the object semantic information accuracy was 83.33%, as some of the predefined object types were not detected. In the real corridor environment, a total of 11 types of objects were recognized, and 24 complete sets of object semantic information were extracted and stored in the semantic DB. The object detection accuracy and semantic extraction accuracy were 81.36% and 87.51%, respectively.
In both environments, semantic information was extracted for more than 10 types of objects and over 20 individual objects. The simulated home environment yielded a larger number and greater variety of objects than the real-world corridor, which explains the difference in results. Although the corridor environment included fewer object types, the number of objects was still sufficient for the robot to utilize semantic information effectively. Detection accuracy was calculated as recall: the number of detected objects matching the prompt object list, over the 23 and 15 image scenes of each environment, respectively, divided by the number of ground-truth objects. Cases in which objects were detected but misclassified were not counted, because detecting objects from various angles in the experimental environment introduces labeling ambiguity. To reduce misdetection, situations in which objects were partially occluded or only partially visible within the sensor’s Field of View (FoV) were handled by adjusting the detection confidence threshold. The evaluation focused on whether the robot accurately detected the objects listed in the prompt. Although the overall accuracy in both environments was not particularly high, the results verify that the robot was able to detect the essential objects required for semantic environment modeling. The accuracy of semantic information extraction also reflects errors caused by incorrect inference of certain attributes, such as color, purpose, isMovable, and canBeOpen. Even when domain knowledge is used to infer object attributes through a VQA model, some objects have indistinct colors or ambiguous implicit attributes. Nevertheless, the extracted semantic information was sufficient for use in both SIP and SAN, demonstrating that our framework is suitable for real-world service robot applications.
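In formula form, the detection accuracy reported in Table 7 corresponds to

$$ \text{Recall} = \frac{\#\{\text{detected objects matching the prompt object list}\}}{\#\{\text{ground-truth objects across all image scenes}\}}, $$

with misclassified detections excluded as described above.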
The results for place semantic information are shown in Table 8. In the house environment, six place types were classified, eleven complete sets of semantic information were extracted, and an accuracy of 85.0% was achieved in semantic information extraction. In the real-world corridor environment, two place types were classified, seven complete sets of semantic information were extracted, and the semantic accuracy was 90.5%.
The accuracy of semantic information extraction for places was calculated as the number of correctly predicted semantic attributes, such as class and purpose, divided by the total number of ground-truth attributes. This result implies that some places could not be classified because no objects were present or because the object composition was insufficient for reliable place inference. Structurally simple environments such as corridors have clear place segmentation and contain fewer, more limited object types, often with repeated instances. As a result, they include fewer place types and less semantic data than home environments but achieve higher semantic extraction accuracy. In contrast, complex environments such as homes suffer from boundary ambiguity caused by furniture and other objects, which complicates place segmentation and lowers the accuracy of semantic information extraction. The relatively high accuracy of place semantic information extraction overall is due to the domain knowledge about places provided by DK-SMF, which helps the robot reliably infer the place class from the composition of detected objects.
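Correspondingly, the place-level accuracy reported in Table 8 can be written as

$$ \text{Place semantic accuracy} = \frac{\#\{\text{correctly predicted place attributes, e.g., class and purpose}\}}{\#\{\text{ground-truth place attributes}\}}. $$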
Our experimental results demonstrate that service robots can automatically perform environmental modeling for service execution using DK-SMF. We verified that robots can autonomously explore, detect numerous objects, and extract object semantic information in various environments without dataset annotation or model training. Furthermore, they were able to collect place semantic information by utilizing the extracted object semantic information.

5. Discussion

The proposed framework has limitations in modeling dynamic environments and updating changes. In dynamic environments, even if semantic information is extracted from fixed objects, the position and state of objects change frequently, and it is difficult to distinguish whether an object has disappeared or is merely occluded. Additionally, environmental updates should be performed during the service execution phase, for example within SIP, together with place and location recognition; however, achieving this in real time remains a limitation. DK-SMF is also restricted to environment modeling with a single robot. When extended to a multi-robot setting, the environment is modeled at different times, from different viewpoints, and under varying sensor conditions, which limits the consistency of the semantic information.
We plan to address these limitations in future research. Future work will focus on an environment modeling framework capable of long-term adaptation in dynamic environments. We also plan to investigate a method that uses the constructed semantic DB to infer the current place and the robot’s location through SIP, and that updates the semantic DB when an extracted object is no longer present or when a different object is found in its place.
In addition, while SMRL allows users to request environmental modeling from the robot solely through utterances, we plan to propose the Voice and Visual Semantic Modeling Language (VVSML), which enables visual environmental modeling by allowing users to directly show the robot an image of a specific place and its objects, in addition to using utterances.

6. Conclusions

In this paper, we proposed DK-SMF, a framework that leverages domain knowledge, VLMs, and LLMs to automatically collect the semantic information used by service robots, which previously had to be modeled manually. We redefined the TOSM properties of robots, objects, and places for use by service robots. Without additional model training, the framework detects objects and extracts their semantic information. The map is segmented into functional regions for place-level use, and the semantic information of each place is extracted from the collected object data. The resulting semantic data are then used to construct the semantic DB, which can be applied to semantic framework components such as SIP and SAN. In addition, we introduced SMRL, a method that allows people to participate in environmental modeling alongside the robot by converting human utterances into TOSM properties and storing them in the semantic DB. We validated the proposed framework, in which the semantic information used by the robot for service execution is collected either autonomously or with user participation, through both qualitative and quantitative analyses. These evaluations confirmed that the robot can effectively utilize the collected information to perform its services. Our future work will focus on addressing the limitations discussed above and extending the framework to more advanced automated modeling methods.

Author Contributions

Conceptualization, K.J., Y.J., S.K., M.J., and T.K.; methodology, K.J., Y.J., S.K., and M.J.; software, K.J., Y.J., M.J., and H.K.; validation, K.J., Y.J., S.K., M.J., H.K., and T.K.; formal analysis, K.J. and T.K.; investigation, K.J. and H.K.; resources, K.J., Y.J., S.K., M.J., and H.K.; data curation, K.J.; writing—original draft preparation, K.J., Y.J., and M.J.; writing—review and editing, K.J. and M.J.; visualization, K.J.; supervision, T.K.; project administration, T.K.; funding acquisition, T.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Sungkyunkwan University and the BK21 FOUR (Graduate School Innovation) funded by the Ministry of Education (MOE, Korea) and the National Research Foundation of Korea (NRF).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Deeken, H.; Wiemann, T.; Hertzberg, J. Grounding semantic maps in spatial databases. Robot. Auton. Syst. 2018, 105, 146–165. [Google Scholar] [CrossRef]
  2. Hassan, M.U.; Nawaz, M.I.; Iqbal, J. Towards autonomous cleaning of photovoltaic modules: Design and realization of a robotic cleaner. In Proceedings of the 2017 First International Conference on Latest Trends in Electrical Engineering and Computing Technologies (INTELLECT), Karachi, Pakistan, 15–16 November 2017; pp. 1–6. [Google Scholar]
  3. Islam, R.U.; Iqbal, J.; Manzoor, S.; Khalid, A.; Khan, S. An autonomous image-guided robotic system simulating industrial applications. In Proceedings of the 2012 7th International Conference on System of Systems Engineering (SoSE), Genova, Italy, 16–19 July 2012; pp. 344–349. [Google Scholar]
  4. Joo, S.-H.; Manzoor, S.; Rocha, Y.G.; Bae, S.-H.; Lee, K.-H.; Kuc, T.-Y.; Kim, M. Autonomous navigation framework for intelligent robots based on a semantic environment modeling. Appl. Sci. 2020, 10, 3219. [Google Scholar] [CrossRef]
  5. Joo, S.; Bae, S.; Choi, J.; Park, H.; Lee, S.; You, S.; Uhm, T.; Moon, J.; Kuc, T. A flexible semantic ontological model framework and its application to robotic navigation in large dynamic environments. Electronics 2022, 11, 2420. [Google Scholar] [CrossRef]
  6. Wang, Y.; Yao, R.; Zhao, K.; Wu, P.; Chen, W. Robotics Classification of Domain Knowledge Based on a Knowledge Graph for Home Service Robot Applications. Appl. Sci. 2024, 14, 11553. [Google Scholar] [CrossRef]
  7. Zhang, X.; Altaweel, Z.; Hayamizu, Y.; Ding, Y.; Amiri, S.; Yang, H.; Kaminski, A.; Esselink, C.; Zhang, S. Dkprompt: Domain knowledge prompting vision-language models for open-world planning. arXiv 2024, arXiv:2406.17659. [Google Scholar]
  8. Cao, W.; Yao, X.; Xu, Z.; Liu, Y.; Pan, Y.; Ming, Z. A Survey of Zero-Shot Object Detection. Big Data Min. Anal. 2025, 8, 726–750. [Google Scholar] [CrossRef]
  9. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In Proceedings of the European Conference on Computer Vision, Milano, Italy, 29 September–4 October 2024; pp. 38–55. [Google Scholar]
  10. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2425–2433. [Google Scholar]
  11. Zhang, J.; Huang, J.; Jin, S.; Lu, S. Vision-language models for vision tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5625–5644. [Google Scholar] [CrossRef]
  12. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  13. Antoniou, G.; Harmelen, F.v. Web ontology language: Owl. Handbook on Ontologies; Springer: Berlin/Heidelberg, Germany, 2009; pp. 91–110. [Google Scholar]
  14. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258. [Google Scholar] [CrossRef]
  15. Firoozi, R.; Tucker, J.; Tian, S.; Majumdar, A.; Sun, J.; Liu, W.; Zhu, Y.; Song, S.; Kapoor, A.; Hausman, K. Foundation models in robotics: Applications, challenges, and the future. Int. J. Robot. Res. 2025, 44, 701–739. [Google Scholar] [CrossRef]
  16. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9650–9660. [Google Scholar]
  17. Hong, J.; Choi, R.; Leonard, J.J. Learning from Feedback: Semantic Enhancement for Object SLAM Using Foundation Models. arXiv 2024, arXiv:2411.06752. [Google Scholar] [CrossRef]
  18. Li, B.; Cai, Z.; Li, Y.-F.; Reid, I.; Rezatofighi, H. Hier-SLAM: Scaling-up Semantics in SLAM with a Hierarchically Categorical Gaussian Splatting. arXiv 2024, arXiv:2409.12518. [Google Scholar]
  19. Ciria, A.; Schillaci, G.; Pezzulo, G.; Hafner, V.V.; Lara, B. Predictive processing in cognitive robotics: A review. Neural Comput. 2021, 33, 1402–1432. [Google Scholar] [CrossRef]
  20. Mon-Williams, R.; Li, G.; Long, R.; Du, W.; Lucas, C.G. Embodied large language models enable robots to complete complex tasks in unpredictable environments. Nat. Mach. Intell. 2025, 7, 592–601. [Google Scholar] [CrossRef]
  21. Hughes, N.; Chang, Y.; Carlone, L. Hydra: A real-time spatial perception system for 3D scene graph construction and optimization. arXiv 2022, arXiv:2201.13360. [Google Scholar] [CrossRef]
  22. Zhang, L.; Hao, X.; Xu, Q.; Zhang, Q.; Zhang, X.; Wang, P.; Zhang, J.; Wang, Z.; Zhang, S.; Xu, R. MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation. arXiv 2025, arXiv:2502.13451. [Google Scholar]
  23. Rosinol, A.; Violette, A.; Abate, M.; Hughes, N.; Chang, Y.; Shi, J.; Gupta, A.; Carlone, L. Kimera: From SLAM to spatial perception with 3D dynamic scene graphs. Int. J. Robot. Res. 2021, 40, 1510–1546. [Google Scholar] [CrossRef]
  24. Liu, Q.; Wen, Y.; Han, J.; Xu, C.; Xu, H.; Liang, X. Open-world semantic segmentation via contrasting and clustering vision-language embedding. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 275–292. [Google Scholar]
  25. Gadre, S.Y.; Wortsman, M.; Ilharco, G.; Schmidt, L.; Song, S. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23171–23181. [Google Scholar]
  26. Chen, B.; Xia, F.; Ichter, B.; Rao, K.; Gopalakrishnan, K.; Ryoo, M.S.; Stone, A.; Kappler, D. Open-vocabulary queryable scene representations for real world planning. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 11509–11522. [Google Scholar]
  27. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar]
  28. Huang, C.; Mees, O.; Zeng, A.; Burgard, W. Visual language maps for robot navigation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 10608–10615. [Google Scholar]
  29. Mehan, Y.; Gupta, K.; Jayanti, R.; Govil, A.; Garg, S.; Krishna, M. QueSTMaps: Queryable Semantic Topological Maps for 3D Scene Understanding. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 13311–13317. [Google Scholar]
  30. Allu, S.H.; Kadosh, I.; Summers, T.; Xiang, Y. Autonomous Exploration and Semantic Updating of Large-Scale Indoor Environments with Mobile Robots. arXiv 2024, arXiv:2409.15493. [Google Scholar] [CrossRef]
  31. Jatavallabhula, K.M.; Kuwajerwala, A.; Gu, Q.; Omama, M.; Chen, T.; Maalouf, A.; Li, S.; Iyer, G.; Saryazdi, S.; Keetha, N. Conceptfusion: Open-set multimodal 3d mapping. arXiv 2023, arXiv:2302.07241. [Google Scholar]
  32. Kim, T.; Min, B.-C. Semantic Layering in Room Segmentation via LLMs. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 9831–9838. [Google Scholar]
  33. Yamauchi, B. A frontier-based approach for autonomous exploration. In Proceedings of the 1997 IEEE International Symposium on Computational Intelligence in Robotics and Automation CIRA’97.’Towards New Computational Principles for Robotics and Automation’, Monterey, CA, USA, 10–11 June 1997; pp. 146–151. [Google Scholar]
  34. Zhang, C.; Han, D.; Qiao, Y.; Kim, J.U.; Bae, S.-H.; Lee, S.; Hong, C.S. Faster segment anything: Towards lightweight sam for mobile applications. arXiv 2023, arXiv:2306.14289. [Google Scholar] [CrossRef]
  35. Batra, D.; Gokaslan, A.; Kembhavi, A.; Maksymets, O.; Mottaghi, R.; Savva, M.; Toshev, A.; Wijmans, E. Objectnav revisited: On evaluation of embodied agents navigating to objects. arXiv 2020, arXiv:2006.13171. [Google Scholar] [CrossRef]
  36. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning(ICML), Virtual Conference, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  37. Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z. Simple open-vocabulary object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 728–755. [Google Scholar]
  38. Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; Carion, N. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1780–1790. [Google Scholar]
  39. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 580–587. [Google Scholar]
  40. Bormann, R.; Jordan, F.; Li, W.; Hampp, J.; Hägele, M. Room segmentation: Survey, implementation, and analysis. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 1019–1026. [Google Scholar]
  41. Luperto, M.; Kucner, T.P.; Tassi, A.; Magnusson, M.; Amigoni, F. Robust structure identification and room segmentation of cluttered indoor environments from occupancy grid maps. IEEE Robot. Autom. Lett. 2022, 7, 7974–7981. [Google Scholar] [CrossRef]
  42. Zhou, X.; Girdhar, R.; Joulin, A.; Krahenbuhl, P.; Misra, I. Detecting Twenty-thousand Classes using Image-level Supervision. arXiv 2022, arXiv:2201.02605. [Google Scholar]
  43. Kolve, E.; Mottaghi, R.; Han, W.; VanderBilt, E.; Weihs, L.; Herrasti, A.; Deitke, M.; Ehsani, K.; Gordon, D.; Zhu, Y.; et al. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv 2017, arXiv:1712.05474. [Google Scholar]
  44. Deitke, M.; VanderBilt, E.; Herrasti, A.; Weihs, L.; Salvador, J.; Ehsani, K.; Han, W.; Kolve, E.; Farhadi, A.; Kembhavi, A.; et al. ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. arXiv 2022, arXiv:2206.06994. [Google Scholar]
  45. Manzoor, S.; Rocha, Y.G.; Joo, S.-H.; Bae, S.-H.; Kim, E.-J.; Joo, K.-J.; Kuc, T.-Y. Ontology-Based Knowledge Representation in Robotic Systems: A Survey Oriented toward Applications. Appl. Sci. 2021, 11, 4324. [Google Scholar] [CrossRef]
  46. Ard’on, P.; Pairet, É.; Lohan, K.S.; Ramamoorthy, S.; Petrick, R.P.A. Affordances in Robotic Tasks—A Survey. arXiv 2020, arXiv:2004.07400. [Google Scholar]
  47. Ahn, M.; Brohan, A.; Brown, N.; Chebotar, Y.; Cortes, O.; David, B.; Finn, C.; Gopalakrishnan, K.; Hausman, K.; Herzog, A.; et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. In Proceedings of the Conference on Robot Learning, Auckland, New Zealand, 14–18 December 2022. [Google Scholar]
  48. Grisetti, G.; Stachniss, C.; Burgard, W. Improved Techniques for Grid Mapping With Rao-Blackwellized Particle Filters. IEEE Trans. Robot. 2007, 23, 34–46. [Google Scholar] [CrossRef]
  49. Zhang, H.-y.; Lin, W.-m.; Chen, A.-x. Path Planning for the Mobile Robot: A Review. Symmetry 2018, 10, 450. [Google Scholar] [CrossRef]
  50. Fox, D.; Burgard, W.; Thrun, S. The dynamic window approach to collision avoidance. IEEE Robot. Autom. Mag. 1997, 4, 23–33. [Google Scholar] [CrossRef]
  51. Borji, A.; Cheng, M.-M.; Jiang, H.; Li, J. Salient object detection: A survey. Comput. Vis. Media 2014, 5, 117–150. [Google Scholar] [CrossRef]
  52. Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Wu, Z.; Chang, B.; Sun, X.; Xu, J.; Li, L.; Sui, Z. A Survey on In-context Learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022. [Google Scholar]
  53. Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL Technical Report. arXiv 2025, arXiv:2502.13923. [Google Scholar] [CrossRef]
  54. Li, J.; Li, D.; Xiong, C.; Hoi, S.C.H. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of the International Conference on Machine Learning, Baltimore, Maryland, 17–23 July 2022. [Google Scholar]
  55. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. arXiv 2023, arXiv:2304.08485. [Google Scholar]
  56. Lin, T.-Y.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  57. Gupta, A.; Dollár, P.; Girshick, R.B. LVIS: A Dataset for Large Vocabulary Instance Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 5351–5359. [Google Scholar]
  58. Weller, O.; Marone, M.; Weir, N.; Lawrie, D.J.; Khashabi, D.; Durme, B.V. “According to …”: Prompting Language Models Improves Quoting from Pre-Training Data. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2–6 May 2023. [Google Scholar]
  59. Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2282. [Google Scholar] [CrossRef]
  60. Radford, A.; Narasimhan, K. Improving Language Understanding by Generative Pre-Training. 2018. [Google Scholar]
  61. Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805. [Google Scholar] [CrossRef]
  62. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  63. Consortium, W.W.W. RDF 1.1 Turtle: Terse RDF triple language. 2014. [Google Scholar]
  64. Zhou, Z.; Song, J.; Xie, X.; Shu, Z.; Ma, L.; Liu, D.; Yin, J.; See, S. Towards building AI-CPS with NVIDIA Isaac Sim: An industrial benchmark and case study for robotics manipulation. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, Lisbon, Portugal, 14–20 April 2024; pp. 263–274. [Google Scholar]
Figure 1. Domain knowledge-driven semantic modeling framework architecture. The architecture consists of autonomous exploration, zero-shot-based object detection and semantic extraction (green blocks), place semantic extraction (blue blocks), the TOSM schema-based semantic information integration and database generation (pink blocks), and semantic modeling supported by the SMRL (apricot blocks).
Figure 2. Overview of Object semantic information extraction. In order, (a) shows the mobile robot autonomously exploring the environment while collecting RGB-D data and the robot’s odometry; (b) represents the process of generating an object list used as a prompt for ZSOD by utilizing a VQA model to integrate objects detectable in the image with a predefined object list from the domain knowledge; (c) shows the process of object detection using the ZSOD model and converting the results into 3D; and (d) shows the extraction of object semantic information using the VQA model.
Figure 3. Frontier-based autonomous exploration, the mobile robot autonomously explores an unknown environment, builds a map, designates points at regular intervals, and collects sensor data at each point.
Figure 4. Overview of the method for extracting an object list present in an image. Each RGB image and a prompt composed of three stages are fed into the VQA model to extract an object list. The generated list is integrated with the predefined object list from the DK-DB to make the final object list for ZSOD.
Figure 5. The generated object list and images are fed into a zero-shot-based object detection model to detect objects. The detected objects are extracted with their 3D position on the map.
Figure 6. Zero-shot-based object detection results; (a,b) show object detection from images collected in a simulation environment, while (c) shows object detection from an image collected in a real-world environment.
Figure 7. Object semantic information extraction. Each post-processed image and generated prompt are input into the VQA model to extract object semantic information. The output object information is integrated with the data extracted from ZSOD to construct the final object semantic information.
Figure 8. Post-processing of detected objects by the ZSOD: Detected object images are post-processed to remove backgrounds, enhancing the accuracy of the VQA-based semantic information extraction.
Figure 9. Object semantic information extraction using the VQA model.
Figure 10. Overview of place semantic information extraction: (a) segmenting the generated map into regions using room segmentation, (b) identifying objects in each region, (c) extracting place semantic information using domain knowledge and the LLMs.
Figure 11. The process of place segmentation on the grid map.
Figure 12. The process of extracting place semantic information: a prompt containing domain knowledge and the object list vector identified in each place is provided to the LLMs to extract place name and purpose attributes.
Figure 13. Overview of the Semantic Modeling Robot Language: An SMRL module is placed between humans and robots through which utterances and TOSM-structured data are exchanged.
Figure 14. Multi-Layer Semantic Map: Multi-layer semantic map is generated utilizing the semantic database of extracted objects and places.
Figure 15. Experimental setup with (a) simulation and (b) real-world environment.
Figure 16. Simulation results: The robot collects objects and places semantic information for each location in the household environment, and the collected semantic information is represented along with the map.
Figure 17. Real-world experiment results: The robot collects objects and places semantic information for each location in the corridor environment, and the collected semantic information is represented along with the map.
Figure 18. Demo of the SMRL: (1) Updating the Semantic information.
Figure 19. Demo of the SMRL: (2) Semantic information confirmation question and answer.
Table 1. Comparison for semantic-based environmental modeling. The symbols indicate the following meanings: ○: semantic unsatisfied, ●: semantic partially satisfied, ◉: semantic satisfied, -: not considered, ✓: considered.
Name | Autonomous Exploration | Semantic Knowledge Extraction Method | DK Usage | Semantic Knowledge Extraction (Robot, Object, Place) | Modeling by People
Hydra [21]-closed-set labeling--
MapNav [22]closed-set labeling-
Kimera [23]-closed-set labeling--
Sai et al. [30]ZSOD, Mask generation--
CoWs [25]ZSOD, VLE--
NLMap [26]ZSOD, VLE--
ConceptFusion [31]-ZSOD, VLE, Mask generation--
VLMaps. [28]-Mask generation, VLE--
QueSTMaps [29]-Mask generation, VLE--
SeLRoS [32]-ZSOD, LLMs--
DK-SMF(our)ZSOD, VQA, LLMs
Table 2. Summary of DK-SMF: components and functions.
Type | Module/Stage | Function/Description | Input | Output
Object | Autonomous exploration | Frontier-based exploration: building a map and collecting sensor data | scan, robot pose | 2d map, sensor data, way points
Object | Object list generation | Object list prompt generation using the VQA model | image, prompt | object list
Object | Zero-shot object detection | Zero-shot object detection in images | 2d map, sensor data, object list (prompt) | class, Bbox, conf., pose, spatialRelation
Object | Object semantic information extraction | Object semantic information extraction using the VQA model | post-processed image, prompt | object semantic information
Place | Place segmentation | SLIC superpixel-based room segmentation | 2d map | polygon
Place | Place semantic information extraction | Place semantic information extraction using the LLMs | object list in each polygon | place semantic information
Etc. | SMRL | Environment modeling by user utterance | utterance (voice or text) | semantic information
Etc. | Semantic DB integration | Store all extracted semantic information | object, place semantic information | semantic DB
Table 3. Service robot’s TOSM properties.
TOSM | Datatype Properties | DataType | Example
Symbol | name | string | “Clean Robot”
Symbol | ID | int | 1
Explicit | size | floatArray [width, length, height, weight] | [0.31, 0.31, 0.1, 4.13] (1)
Explicit | pose | floatArray [x, y, z, theta] | [0.5, 0.5, 0, 0.352] (2)
Explicit | velocity | floatArray [linear, angular] | [1.2, 0] (3)
Explicit | sensor | dict., map | “lidar: spec., imu: spec., camera: spec., encoder: spec., …”
Explicit | battery | floatArray [voltage, ampere, capacity] | [24.0, 5.0, 3200] (4)
Explicit | coordinateFrame | string | “map”
Implicit | affordance | string | “vacuum, water clean”
Implicit | purpose | string | “home cleaning”
Implicit | current state | string | “move”
Implicit | environment | string | “house”
(1) Unit: [m, m, m, kg], (2) Unit: [m, m, m, rad], (3) Unit: [m/s, rad/s], (4) Unit: [V, A, mAh].
Table 4. Object’s TOSM properties for the service robot.
TOSM | Datatype Properties | DataType | Example
Symbol | name | string | “refrigerator”
Symbol | ID | int | 1
Explicit | size | floatArray [width, length, height] | [0.5, 0.5, 1.5] (1)
Explicit | pose | floatArray [x, y, z, theta] | [1.5, −12.5, 0, 0] (2)
Explicit | velocity | floatArray [x, y, z] | [0, 0, 0] (3)
Explicit | color | string | “silver”
Explicit | coordinateFrame | string | “house map”
Implicit | purpose | string | “food storage device”
Implicit | isKeyObject | boolean | “Y”
Implicit | isPrimeObject | boolean | “N”
Implicit | isMovable | boolean | “N”
Implicit | isOpen | boolean | “N”
Implicit | canBeOpen | boolean | “Y”
Implicit | spatialRelation | dict., map | “isleftto: dining desk, isrightto: chair”
(1) Unit: [m, m, m], (2) Unit: [m, m, m, rad], (3) Unit: [m, m, m].
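For illustration only, the object entry of Table 4 could be serialized as the following record; the nesting and field names mirror the table, while the concrete storage format of the semantic DB is not specified here.

```python
# Illustrative serialization of one object's TOSM properties (Table 4).
refrigerator_tosm = {
    "symbol":   {"name": "refrigerator", "ID": 1},
    "explicit": {
        "size": [0.5, 0.5, 1.5],          # [width, length, height] in m
        "pose": [1.5, -12.5, 0.0, 0.0],   # [x, y, z, theta] in m, rad
        "velocity": [0.0, 0.0, 0.0],      # static object
        "color": "silver",
        "coordinateFrame": "house map",
    },
    "implicit": {
        "purpose": "food storage device",
        "isKeyObject": True, "isPrimeObject": False,
        "isMovable": False, "isOpen": False, "canBeOpen": True,
        "spatialRelation": {"isleftto": "dining desk", "isrightto": "chair"},
    },
}
```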
Table 5. Place’s TOSM properties for the service robot.
TOSM | Datatype Properties | DataType | Example
Symbol | name | string | “kitchen”
Symbol | ID | int | 3
Explicit | boundary | polygon | [(2.0, 1.0), (1.2, −0.5), …]
Explicit | coordinateFrame | string | “house map”
Implicit | complexity | float | 1.6
Implicit | level | int (floor) | “7”
Implicit | purpose | string | “place to prepare food”
Implicit | roomNumber | int | “0”
Implicit | isInsideOf | stringArray | [“object1”, “object2”, “object3”]
Implicit | spatialRelation | dict., map | “isleftto: 2, isrightto: 4”
Table 6. Autonomous exploration mapping results.
Environment | Moving Interval (m) | Mapped Area (m²) | Free Space Area (m²) | Points
household (simulation) | 1.2 | 127.31 | 123.49 | 23
corridor (real-world) | 2.0 | 148.14 | 142.21 | 15
Table 7. Object semantic information results.
Environment | Object Detection Accuracy (%) | Object Semantic Accuracy (%) | Object Type | Object Semantic Data
household (simulation) | 78.1 | 83.33 | 15 | 30
corridor (real-world) | 81.36 | 87.51 | 11 | 24
Table 8. Place semantic information results.
Environment | Place Semantic Accuracy (%) | Place Type | Place Semantic Data
household (simulation) | 85.0 | 6 | 11
corridor (real-world) | 90.5 | 2 | 7