Deep Learning-Based Vision Systems for Robot Semantic Navigation: An Experimental Study
Abstract
1. Introduction
- It reviews object detection algorithms that have been employed in robot semantic navigation systems.
- It discusses and analyzes existing vision datasets for indoor robot semantic navigation systems.
- It validates the efficiency of several object detection algorithms along with an object detection dataset through several experiments conducted in indoor environments.
2. Object Detection Algorithms for Robot Semantic Navigation
- Indoor Robot Navigation: Object detection allows robots to perceive and recognize obstacles, furniture, or specific objects within indoor environments. This capability lets them plan optimal paths, avoid collisions, and safely navigate complex spaces [6].
- Manufacturing Quality Control: Object detection is essential in automated quality control systems within manufacturing settings. It helps identify defects, anomalies, or irregularities in products, ensuring consistent quality and minimizing errors during the production process [7].
- Autonomous Driving: Object detection is a fundamental component of autonomous driving systems. By detecting and tracking vehicles, pedestrians, traffic signs, and other objects in real-time, object detection enables self-driving cars to perceive their surroundings, make informed decisions, and respond to dynamic traffic situations [8].
2.1. Convolutional Neural Network (CNN) Object Detection Algorithms
2.1.1. Residual Network (ResNet)
2.1.2. Visual Geometry Group (VGG)
2.1.3. MobileNet
2.1.4. EfficientNet
2.1.5. YOLO
2.1.6. Faster R-CNN
2.2. Recurrent Neural Network (RNN) Object Detection Algorithms
2.3. Deep Reinforcement Learning (DRL) Object Detection Algorithms
2.3.1. Deep Q-Network (DQN)
2.3.2. Deep Deterministic Policy Gradient (DDPG)
2.3.3. Soft Actor-Critic (SAC)
2.3.4. Proximal Policy Optimization (PPO)
3. Vision Datasets for Robot Semantic Navigation
3.1. Microsoft Common Objects in Context Dataset (MS COCO)
3.2. Matterport 3D Dataset (MP3D)
3.3. Pascal VOC 2012 Dataset
3.4. KITTI Dataset
3.5. ADE20K Dataset
4. Experimental Study
4.1. Environment Development
4.2. Experimental Results
- Object detection rate (ODR): This refers to the total number of objects that were successfully detected (true positive detections) using the employed object detection model, denoted as $N_{\mathrm{det}}$ in Equation (1), compared to the total number of objects existing in the environment, referred to as $N_{\mathrm{exist}}$ in the following equation:

$$\mathrm{ODR} = \frac{N_{\mathrm{det}}}{N_{\mathrm{exist}}} \tag{1}$$
- Object detection accuracy: This refers to the classification accuracy of the object detection model. In our study, we utilized the mean Average Precision (mAP) metric to evaluate the model's performance comprehensively. The mAP aggregates the precision scores across all classes and can be described mathematically as follows. Let
  - $AP_i$ represent the Average Precision value for class $i$;
  - $N$ be the total number of classes.

The mAP is calculated by averaging the $AP_i$ values across all classes, as follows:

$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

In this equation, the $AP_i$ for each class is calculated individually, and the mean of these values gives the mAP score. The $AP_i$ for class $i$ is obtained from the class's precision–recall curve, as shown in the following equation:

$$AP_i = \sum_{k} P(r_k)\,\Delta r_k$$

That is, the $AP_i$ for a given class $i$ is calculated by summing the precision values at the various recall levels, each weighted by the corresponding change in recall. This technique aids in assessing how well the model detects objects of a certain class at varying confidence levels. The two terms, $P(r_k)$ and $\Delta r_k$, are calculated as follows:
- (a) Finding $P(r)$ at a specific recall level $r$: the precision at a specific recall level is the maximum precision value over all detections whose recall is greater than or equal to that level:

$$P(r) = \max_{\tilde{r} \ge r} P(\tilde{r})$$

- (b) Determining $\Delta r_k$ (change in recall): $\Delta r_k$ represents the change in recall from the previous recall level to the current recall level on the precision–recall curve, calculated as the difference between the two:

$$\Delta r_k = r_k - r_{k-1}$$
By computing the precision at different recall levels and measuring the change in recall from one level to the next, we can construct the precision–recall curve and subsequently calculate the $AP_i$ for a specific class in object detection tasks. Finally, to determine the precision and recall for each class, we used the following equations:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

where:
- True positives ($TP$): instances that the model correctly identifies as positive;
- False positives ($FP$): instances that the model wrongly classifies as positive;
- False negatives ($FN$): instances that are positive but that the model wrongly classifies as negative.
- Processing time for the object detection model: This refers to the total time required for the model to process the input frames obtained from the vision unit and perform the detection task. It is calculated as the difference between the end time and the start time of the processing task, formulated as follows:

$$T_{\mathrm{processing}} = T_{\mathrm{end}} - T_{\mathrm{start}}$$

A minimal code sketch of these metric computations is given below.
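To make the above concrete, the following is a minimal Python sketch of these metric computations, written for this review rather than taken from the study's implementation: the function names (odr, precision_recall, average_precision, mean_average_precision, timed_detection) and the model/frame placeholders are our own assumptions, and the AP routine assumes that the per-class precision–recall points have already been computed with non-decreasing recall values.

```python
import time
from typing import Sequence, Tuple


def odr(num_detected: int, num_existing: int) -> float:
    """Object detection rate, Equation (1): detected objects / existing objects."""
    return num_detected / num_existing


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Per-class precision and recall from TP/FP/FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(precisions: Sequence[float], recalls: Sequence[float]) -> float:
    """AP for one class: sum over recall levels of the interpolated precision
    P(r_k) multiplied by the change in recall, delta r_k."""
    ap, prev_recall = 0.0, 0.0
    for k, r in enumerate(recalls):
        p_interp = max(precisions[k:])      # max precision at any recall >= r
        ap += p_interp * (r - prev_recall)  # P(r_k) * delta r_k
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class: Sequence[float]) -> float:
    """mAP: mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


def timed_detection(model, frame):
    """Processing time: difference between the end and start times of the detection task."""
    t_start = time.perf_counter()
    detections = model(frame)  # placeholder callable standing in for the detector
    t_end = time.perf_counter()
    return detections, t_end - t_start


if __name__ == "__main__":
    print(odr(50, 87))                                          # toy counts -> ~0.575
    print(average_precision([1.0, 0.8, 0.6], [0.2, 0.5, 0.8]))  # toy PR curve -> 0.62
    print(mean_average_precision([0.62, 0.80, 0.45]))           # -> ~0.62
```

For instance, the toy precision–recall curve above yields $1.0 \times 0.2 + 0.8 \times 0.3 + 0.6 \times 0.3 = 0.62$, matching the summation form of the per-class $AP_i$.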
5. Discussion
6. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Alenzi, Z.; Alenzi, E.; Alqasir, M.; Alruwaili, M.; Alhmiedat, T.; Alia, O.M. A semantic classification approach for indoor robot navigation. Electronics 2022, 11, 2063. [Google Scholar] [CrossRef]
- Alhmiedat, T.; Marei, A.M.; Messoudi, W.; Albelwi, S.; Bushnag, A.; Bassfar, Z.; Alnajjar, F.; Elfaki, A.O. A SLAM-based localization and navigation system for social robots: The pepper robot case. Machines 2023, 11, 158. [Google Scholar] [CrossRef]
- Bhatt, D.; Patel, C.; Talsania, H.; Patel, J.; Vaghela, R.; Pandya, S.; Modi, K.; Ghayvat, H. CNN Variants for Computer Vision: History, Architecture, Application, Challenges and Future Scope. Electronics 2021, 10, 2470. [Google Scholar] [CrossRef]
- Alamri, S.; Alamri, H.; Alshehri, W.; Alshehri, S.; Alaklabi, A.; Alhmiedat, T. An autonomous maze-solving robotic system based on an enhanced wall-follower approach. Machines 2023, 11, 249. [Google Scholar] [CrossRef]
- Alqobali, R.; Alshmrani, M.; Alnasser, R.; Rashidi, A.; Alhmiedat, T.; Alia, O.M. A Survey on Robot Semantic Navigation Systems for Indoor Environments. Appl. Sci. 2023, 14, 89. [Google Scholar] [CrossRef]
- Uçar, A.; Demir, Y.; Güzeliş, C. Object recognition and detection with deep learning for autonomous driving applications. Simulation 2017, 93, 759–769. [Google Scholar] [CrossRef]
- Hernández, A.C.; Gómez, C.; Crespo, J.; Barber, R. Object Detection Applied to Indoor Environments for Mobile Robot Navigation. Sensors 2016, 16, 1180. [Google Scholar] [CrossRef] [PubMed]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 25, Lake Tahoe, NV, USA, 3–6 December 2012. [Google Scholar]
- Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 6999–7019. [Google Scholar] [CrossRef] [PubMed]
- Ni, J.; Gong, T.; Gu, Y.; Zhu, J.; Fan, X. An improved deep residual network-based semantic simultaneous localization and mapping method for monocular vision robot. Comput. Intell. Neurosci. 2020, 2020, 7490840. [Google Scholar] [CrossRef]
- Mousavian, A.; Toshev, A.; Fišer, M.; Košecká, J.; Wahid, A.; Davidson, J. Visual representations for semantic target driven navigation. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 8846–8852. [Google Scholar]
- Teso-Fz-Betoño, D.; Zulueta, E.; Sánchez-Chica, A.; Fernandez-Gamiz, U.; Saenz-Aguirre, A. Semantic segmentation to develop an indoor navigation system for an autonomous mobile robot. Mathematics 2020, 8, 855. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Dang, T.V.; Bui, N.T. Multi-scale fully convolutional network-based semantic segmentation for mobile robot navigation. Electronics 2023, 12, 533. [Google Scholar] [CrossRef]
- Kim, W.; Seok, J. Indoor semantic segmentation for robot navigating on mobile. In Proceedings of the 2018 Tenth International Conference on Ubiquitous and Future Networks (ICUFN), Prague, Czech Republic, 3–6 July 2018; pp. 22–25. [Google Scholar]
- Dang, T.V.; Tran, D.M.C.; Tan, P.X. IRDC-Net: Lightweight Semantic Segmentation Network Based on Monocular Camera for Mobile Robot Navigation. Sensors 2023, 23, 6907. [Google Scholar] [CrossRef] [PubMed]
- Wei, Y.; Wei, W.; Zhang, Y. EfferDeepNet: An Efficient Semantic Segmentation Method for Outdoor Terrain. Machines 2023, 11, 256. [Google Scholar] [CrossRef]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Lecture Notes in Computer Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar] [CrossRef]
- Bersan, D.; Martins, R.; Campos, M.; Nascimento, E.R. Semantic map augmentation for robot navigation: A learning approach based on visual and depth data. In Proceedings of the 2018 Latin American Robotic Symposium, 2018 Brazilian Symposium on Robotics (SBR) and 2018 Workshop on Robotics in Education (WRE), João Pessoa, Brazil, 6–10 November 2018; pp. 45–50. [Google Scholar]
- Martins, R.; Bersan, D.; Campos, M.F.; Nascimento, E.R. Extending maps with semantic and contextual object information for robot navigation: A learning-based framework using visual and depth cues. J. Intell. Robot. Syst. 2020, 99, 555–569. [Google Scholar] [CrossRef]
- Dos Reis, D.H.; Welfer, D.; De Souza Leite Cuadros, M.A.; Gamarra, D.F.T. Mobile robot navigation using an object recognition software with RGBD images and the YOLO algorithm. Appl. Artif. Intell. 2019, 33, 1290–1305. [Google Scholar] [CrossRef]
- Wang, L.; Li, R.; Sun, J.; Zhao, L.; Shi, H.; Seah, H.S.; Tandianus, B. Object-Aware Hybrid Map for Indoor Robot Visual Semantic Navigation. In Proceedings of the 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), Dali, China, 6–8 December 2019; pp. 1166–1172. [Google Scholar]
- Anebarassane, Y.; Kumar, D.; Chandru, A.; Adithya, P.; Sathiyamurthy, K. Enhancing ORB-SLAM3 with YOLO-based Semantic Segmentation in Robotic Navigation. In Proceedings of the 2023 IEEE World Conference on Applied Intelligence and Computing (AIC), Sonbhadra, India, 29–30 July 2023; pp. 874–879. [Google Scholar]
- Mengcong, X.; Li, M. Object semantic annotation based on visual SLAM. In Proceedings of the 2021 Asia-Pacific Conference on Communications Technology and Computer Science (ACCTCS), Shenyang, China, 22–24 January 2021; pp. 197–201. [Google Scholar]
- Miyamoto, R.; Adachi, M.; Nakamura, Y.; Nakajima, T.; Ishida, H.; Kobayashi, S. Accuracy improvement of semantic segmentation using appropriate datasets for robot navigation. In Proceedings of the 2019 6th International Conference on Control, Decision and Information Technologies (CoDIT), Paris, France, 23–26 April 2019; pp. 1610–1615. [Google Scholar]
- Henke dos Reis, D.; Welfer, D.; de Souza Leite Cuadros, M.A.; Tello Gamarra, D.F. Object Recognition Software Using RGBD Kinect Images and the YOLO Algorithm for Mobile Robot Navigation. In Intelligent Systems Design and Applications: 19th International Conference on Intelligent Systems Design and Applications (ISDA 2019) held December 3–5, 2019; Springer: Berlin/Heidelberg, Germany, 2021; pp. 255–263. [Google Scholar]
- Xia, X.; Zhang, P.; Sun, J. YOLO-Based Semantic Segmentation for Dynamic Removal in Visual-Inertial SLAM. In Proceedings of the 2023 Chinese Intelligent Systems Conference; Springer: Berlin/Heidelberg, Germany, 2023; pp. 377–389. [Google Scholar]
- Truong, P.H.; You, S.; Ji, S. Object detection-based semantic map building for a semantic visual SLAM system. In Proceedings of the 2020 20th International Conference on Control, Automation and Systems (ICCAS), Busan, Republic of Korea, 13–16 October 2020; pp. 1198–1201. [Google Scholar]
- Liu, X.; Muise, C. A Neural-Symbolic Approach for Object Navigation. In Proceedings of the 2nd Embodied AI Workshop (CVPR 2021), Virtual, 20 June 2021; pp. 19–25. [Google Scholar]
- Chaves, D.; Ruiz-Sarmiento, J.R.; Petkov, N.; Gonzalez-Jimenez, J. Integration of CNN into a robotic architecture to build semantic maps of indoor environments. In Advances in Computational Intelligence: 15th International Work-Conference on Artificial Neural Networks, IWANN 2019, Gran Canaria, Spain, June 12–14, 2019, Proceedings, Part II 15; Springer: Berlin/Heidelberg, Germany, 2019; pp. 313–324. [Google Scholar]
- Joo, S.H.; Manzoor, S.; Rocha, Y.G.; Bae, S.H.; Lee, K.H.; Kuc, T.Y.; Kim, M. Autonomous navigation framework for intelligent robots based on a semantic environment modeling. Appl. Sci. 2020, 10, 3219. [Google Scholar] [CrossRef]
- Qiu, H.; Lin, Z.; Li, J. Semantic Map Construction via Multi-sensor Fusion. In Proceedings of the 2021 36th Youth Academic Annual Conference of Chinese Association of Automation (YAC), Nanchang, China, 28–30 May 2021; pp. 495–500. [Google Scholar]
- Fernandes, J.C.d.C.S. Semantic Mapping with a Mobile Robot Using a RGB-D Camera. Master’s Thesis, Laboratório de Robótica Móvel, Instituto de Sistemas e Robótica-Universidade de Coimbra, Coimbra, Portugal, 2019. [Google Scholar]
- Xu, X.; Liu, L.I.; Xiong, R.; Jiang, L. Real-time instance-aware semantic mapping. J. Phys. Conf. Ser. 2020, 1507, 052013. [Google Scholar] [CrossRef]
- Liu, X.; Wen, S.; Pan, Z.; Xu, C.; Hu, J.; Meng, H. Vision-IMU multi-sensor fusion semantic topological map based on RatSLAM. Measurement 2023, 220, 113335. [Google Scholar] [CrossRef]
- Xie, Z.; Li, Z.; Zhang, Y.; Zhang, J.; Liu, F.; Chen, W. A multi-sensory guidance system for the visually impaired using YOLO and ORB-SLAM. Information 2022, 13, 343. [Google Scholar] [CrossRef]
- Qi, X.; Wang, W.; Liao, Z.; Zhang, X.; Yang, D.; Wei, R. Object semantic grid mapping with 2D LiDAR and RGB-D camera for domestic robot navigation. Appl. Sci. 2020, 10, 5782. [Google Scholar] [CrossRef]
- Sun, H.; Meng, Z.; Ang, M.H. Semantic mapping and semantics-boosted navigation with path creation on a mobile robot. In Proceedings of the 2017 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM), Ningbo, China, 19–21 November 2017; pp. 207–212. [Google Scholar]
- Shao, C.; Zhang, L.; Pan, W. Faster R-CNN learning-based semantic filter for geometry estimation and its application in vSLAM systems. IEEE Trans. Intell. Transp. Syst. 2021, 23, 5257–5266. [Google Scholar] [CrossRef]
- Sevugan, A.; Karthikeyan, P.; Sarveshwaran, V.; Manoharan, R. Optimized navigation of mobile robots based on Faster R-CNN in wireless sensor network. Int. J. Sens. Wirel. Commun. Control 2022, 12, 440–448. [Google Scholar] [CrossRef]
- Sun, Y.; Su, T.; Tu, Z. Faster R-CNN based autonomous navigation for vehicles in warehouse. In Proceedings of the 2017 IEEE International Conference on Advanced Intelligent Mechatronics (AIM), Munich, Germany, 3–7 July 2017; pp. 1639–1644. [Google Scholar]
- Zhang, Z.; Zhang, J.; Tang, Q. Mask R-CNN based semantic RGB-D SLAM for dynamic scenes. In Proceedings of the 2019 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Hong Kong, China, 8–12 July 2019; pp. 1151–1156. [Google Scholar]
- Sinha, R.K.; Pandey, R.; Pattnaik, R. Deep Learning For Computer Vision Tasks: A review. arXiv 2018, arXiv:1804.03928. [Google Scholar]
- Cheng, J.; Sun, Y.; Meng, M.Q.H. A dense semantic mapping system based on CRF-RNN network. In Proceedings of the 2017 18th International Conference on Advanced Robotics (ICAR), Hong Kong, China, 10–12 July 2017; pp. 589–594. [Google Scholar]
- Xiang, Y.; Fox, D. DA-RNN: Semantic mapping with data associated recurrent neural networks. arXiv 2017, arXiv:1703.03098. [Google Scholar]
- Zubair Irshad, M.; Chowdhury Mithun, N.; Seymour, Z.; Chiu, H.P.; Samarasekera, S.; Kumar, R. SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022. [Google Scholar]
- Zhang, Y.; Feng, Z. Crowd-Aware Mobile Robot Navigation Based on Improved Decentralized Structured RNN via Deep Reinforcement Learning. Sensors 2023, 23, 1810. [Google Scholar] [CrossRef]
- Ondruska, P.; Dequaire, J.; Wang, D.Z.; Posner, I. End-to-end tracking and semantic segmentation using recurrent neural networks. arXiv 2016, arXiv:1604.05091. [Google Scholar]
- Le, N.; Rathour, V.S.; Yamazaki, K.; Luu, K.; Savvides, M. Deep Reinforcement Learning in Computer Vision: A Comprehensive Survey. arXiv 2021, arXiv:2108.11510. [Google Scholar] [CrossRef]
- Zhou, S.; Liu, X.; Xu, Y.; Guo, J. A deep Q-network (DQN) based path planning method for mobile robots. In Proceedings of the 2018 IEEE International Conference on Information and Automation (ICIA), Wuyishan, China, 11–13 August 2018; pp. 366–371. [Google Scholar]
- Reddy, D.R.; Chella, C.; Teja, K.B.R.; Baby, H.R.; Kodali, P. Autonomous Vehicle Based on Deep Q-Learning and YOLOv3 with Data Augmentation. In Proceedings of the 2021 International Conference on Communication, Control and Information Sciences (ICCISc), Idukki, India, 16–18 June 2021; Volume 1, pp. 1–7. [Google Scholar]
- Zeng, F.; Wang, C.; Ge, S.S. A survey on visual navigation for artificial agents with deep reinforcement learning. IEEE Access 2020, 8, 135426–135442. [Google Scholar] [CrossRef]
- Dai, Y.; Yang, S.; Lee, K. Sensing and Navigation for Multiple Mobile Robots Based on Deep Q-Network. Remote Sens. 2023, 15, 4757. [Google Scholar] [CrossRef]
- Vuong, T.A.T.; Takada, S. Semantic Analysis for Deep Q-Network in Android GUI Testing. In Proceedings of the SEKE, Lisbon, Portugal, 10–12 July 2019; pp. 123–170. [Google Scholar]
- Kästner, L.; Marx, C.; Lambrecht, J. Deep-reinforcement-learning-based semantic navigation of mobile robots in dynamic environments. In Proceedings of the 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE), Hong Kong, China, 20–21 August 2020; pp. 1110–1115. [Google Scholar]
- Xu, J.; Zhang, H.; Qiu, J. A deep deterministic policy gradient algorithm based on averaged state-action estimation. Comput. Electr. Eng. 2022, 101, 108015. [Google Scholar] [CrossRef]
- Zhu, K.; Zhang, T. Deep reinforcement learning based mobile robot navigation: A review. Tsinghua Sci. Technol. 2021, 26, 674–691. [Google Scholar] [CrossRef]
- Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic algorithms and applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
- Sharma, S. SAC-RL: Continuous Control of Wheeled Mobile Robot for Navigation in a Dynamic Environment. Ph.D. Thesis, Indian Institute of Technology Patna, Patna, India, 2020. [Google Scholar]
- Wahid, A.; Stone, A.; Chen, K.; Ichter, B.; Toshev, A. Learning object-conditioned exploration using distributed soft actor critic. Proc. Conf. Robot. Learn. PMLR 2021, 155, 1684–1695. [Google Scholar]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312. [Google Scholar]
- Pereira, R.; Gonçalves, N.; Garrote, L.; Barros, T.; Lopes, A.; Nunes, U.J. Deep-learning based global and semantic feature fusion for indoor scene classification. In Proceedings of the 2020 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), Ponta Delgada, Portugal, 15–17 April 2020; pp. 67–73. [Google Scholar]
- Georgakis, G.; Bucher, B.; Schmeckpeper, K.; Singh, S.; Daniilidis, K. Learning to map for active semantic goal navigation. arXiv 2021, arXiv:2106.15648. [Google Scholar]
- Yu, D.; Khatri, C.; Papangelis, A.; Namazifar, M.; Madotto, A.; Zheng, H.; Tur, G. Common sense and Semantic-Guided Navigation via Language in Embodied Environments. In Proceedings of the International Conference on Learning Representations ICLR 2020, Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
- Krantz, J. Semantic Embodied Navigation: Developing Agents That Navigate from Language and Vision. Ph.D. Thesis, Oregon State University, Corvallis, OR, USA, 2023. [Google Scholar]
- Narasimhan, M.; Wijmans, E.; Chen, X.; Darrell, T.; Batra, D.; Parikh, D.; Singh, A. Seeing the un-scene: Learning amodal semantic maps for room navigation. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 513–529. [Google Scholar]
- Vicente, S.; Carreira, J.; Agapito, L.; Batista, J. Reconstructing PASCAL VOC. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
- Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9297–9307. [Google Scholar]
- Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Gall, J.; Stachniss, C. Towards 3D LiDAR-based semantic scene understanding of 3D point cloud sequences: The SemanticKITTI Dataset. Int. J. Robot. Res. 2021, 40, 959–967. [Google Scholar] [CrossRef]
- Kostavelis, I.; Gasteratos, A. Semantic mapping for mobile robotics tasks: A survey. Robot. Auton. Syst. 2015, 66, 86–103. [Google Scholar] [CrossRef]
- Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene Parsing through ADE20K Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5122–5130. [Google Scholar] [CrossRef]
- Zhang, C.; Yang, Z.; Xue, B.; Zhuo, H.; Liao, L.; Yang, X.; Zhu, Z. Perceiving like a Bat: Hierarchical 3D Geometric–Semantic Scene Understanding Inspired by a Biomimetic Mechanism. Biomimetics 2023, 8, 436. [Google Scholar] [CrossRef] [PubMed]
- Raspberry Pi 4 2024. Available online: http://www.raspberrypi.com/products/raspberry-pi-4-model-b/ (accessed on 21 August 2024).
- RPLiDAR A1 2024. Available online: http://www.slamtec.ai/product/slamtec-rplidar-a1/ (accessed on 21 August 2024).
- Logitech Webcam 2024. Available online: https://www.logitech.com/en-sa/products/webcams/c920-pro-hd-webcam.960-001055.html (accessed on 21 August 2024).
Dataset | Application | Records | Classes | Size |
---|---|---|---|---|
COCO | Indoor | 330,000 | 80 | 25 GB |
Matterport | Indoor | 10,800 | 40 | 176 GB |
Cityscapes | Outdoor | 5000 | 30 | 11 GB |
PASCAL | Outdoor | 11,530 | 20 | 2 GB |
KITTI | Outdoor | 3978 | 11 | 5.27 GB |
ADE20K | Hybrid | 20,210 | 150 | 4.38 GB |
Specification | Value |
---|---|
Processor | Raspberry Pi 4 (4 GB RAM) |
Driver control | ODrive motor controller |
Range finder | RPLiDAR A1 |
Vision unit | 2 MP Logitech webcam |
Robot dimensions | 55 × 45 cm |
Robot speed | 0.5 m/s |
Object | Total Existing | Faster R-CNN | YOLO v5 | YOLO v8 |
---|---|---|---|---|
Pepper robot | 4 | 0 | 0 | 0 |
MAC-PC | 4 | 1 | 1 | 1 |
Chair | 36 | 18 | 30 | 30 |
Person | 4 | 3 | 4 | 3 |
Fish tank | 2 | 0 | 0 | 0 |
Table | 9 | 1 | 0 | 1 |
Couch | 2 | 2 | 2 | 2 |
Boat | 1 | 0 | 1 | 0 |
Plant | 2 | 2 | 2 | 2 |
Stand | 5 | 0 | 0 | 3 |
3D printer | 1 | 0 | 0 | 0 |
TV | 2 | 1 | 2 | 2 |
Fire extinguisher | 3 | 0 | 0 | 0 |
Bottle | 10 | 0 | 8 | 8 |
Whiteboard | 2 | 0 | 0 | 0 |
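As a worked example of Equation (1), the short sketch below aggregates the per-object counts from the table above into an overall ODR for each model. The dictionary layout and the resulting percentages are simply our own aggregation of the table entries, not additional figures from the study.

```python
# Per-object counts: (total existing, Faster R-CNN, YOLO v5, YOLO v8), taken from the table above.
counts = {
    "Pepper robot": (4, 0, 0, 0), "MAC-PC": (4, 1, 1, 1), "Chair": (36, 18, 30, 30),
    "Person": (4, 3, 4, 3), "Fish tank": (2, 0, 0, 0), "Table": (9, 1, 0, 1),
    "Couch": (2, 2, 2, 2), "Boat": (1, 0, 1, 0), "Plant": (2, 2, 2, 2),
    "Stand": (5, 0, 0, 3), "3D printer": (1, 0, 0, 0), "TV": (2, 1, 2, 2),
    "Fire extinguisher": (3, 0, 0, 0), "Bottle": (10, 0, 8, 8), "Whiteboard": (2, 0, 0, 0),
}

total_existing = sum(row[0] for row in counts.values())  # 87 objects in total
for idx, model in enumerate(("Faster R-CNN", "YOLO v5", "YOLO v8"), start=1):
    detected = sum(row[idx] for row in counts.values())
    print(f"{model}: ODR = {detected}/{total_existing} = {detected / total_existing:.1%}")
# Prints 28/87 (32.2%) for Faster R-CNN, 50/87 (57.5%) for YOLO v5, and 52/87 (59.8%) for YOLO v8.
```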
Object | Faster R-CNN | YOLO v5 | YOLO v8 |
---|---|---|---|
MAC-PC | 23.25% | 47% | 61% |
Chair | 43.75% | 77% | 86% |
Person | 64.50% | 80% | 81% |
Table | 44.50% | 0% | 25% |
Couch | 84% | 84% | 75% |
Boat | 0% | 46% | 0% |
Plant | 74% | 77% | 49% |
Stand | 0% | 0% | 25% |
TV | 29.50% | 59% | 60% |
Bottle | 0% | 60% | 77% |