1. Introduction
Today, robotic systems with social characteristics are considered a keystone of household chores, healthcare services, and modern industrial production [1], and 3D visual recognition is a fundamental component of these social robots. Social robots [2] are autonomous robots that are currently being developed on a large scale for safe and secure robot interaction in human-centric environments [3]. The appearance and applications of these robotic systems vary; however, object and place recognition plays a central role in these systems for semantic understanding of the environment. This article starts with the impact of social robots and lists the key features of some recently developed social robots tailored for public, domestic, hospital, and industrial use.
These robots are designed to interact with humans and exhibit social behaviors with broad human-like capabilities, integrating visual recognition, knowledge representation, task planning, localization, and navigation. Among these, we focus on a systematic review of the approaches that address the most essential robotic capability, visual recognition. In this direction, we present data representation methods based on sensor modalities for 3D recognition using deep learning (DL) and examine approaches for both 3D object recognition (3DOR) and 3D place recognition (3DPR).
Visual recognition is a vital component of robotic systems that operate in human environments. Methods for visual recognition tasks generally fall into two categories: traditional machine-learning approaches, which first require handcrafted feature definition, e.g., the scale-invariant feature transform [
4], histogram of oriented gradients [
5], followed by a classification technique such as the support vector machine [6], or DL-based approaches that perform the recognition task using convolutional neural networks (CNNs) [7] without explicitly defining features.
Autonomous robotic systems deal with large amounts of real-world data; therefore, the manually designed models of traditional machine learning algorithms are not feasible [
8] for real-world robotics applications. On the other hand, the flexibility of DL-based models and their better performance as the scale of data increases make them well suited for use in robotics applications. Over the last few years, CNN-based DL models, starting in 2D space using two-stage [
9,
10] and one-stage object detectors [
11,
12,
13,
14,
15,
16], have achieved state-of-the-art object recognition results with the output of 2D bounding boxes (BBoxes).
Typically, two-stage detectors, such as R-CNN [
17], Fast R-CNN [
18], and Faster R-CNN [
9], exploit region proposal networks in a first step to propose regions of interest (RoIs). Afterward, they send the region proposals to the network pipeline for object prediction by computing features over each RoI. Trading some accuracy for run time, one-stage detectors, such as YOLOv3 [
15], YOLOv4 [
19], Scaled-YOLOv4 [
20], and single shot multibox detector [
12] do not involve region proposal.
Researchers [
12,
15] have handled object detection as a regression problem and directly learned class probabilities to detect objects with bounding box coordinates. One-stage detectors are faster and capable of real-time performance; however, their accuracy is lower than that of two-stage detectors [
21]. The task of place recognition is similar to object retrieval [
22] and has been performed using dynamic object detection [
23] or by constructing object maps that contain object information about a place [24]. Although extensive research has been conducted on 2D recognition, it provides no explicit depth information and is therefore inherently limited compared with 3D recognition.
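To make the output of such 2D detectors concrete, the following minimal sketch (our own illustration, not code from any of the surveyed works, and assuming torchvision >= 0.13 is available) runs a pretrained two-stage detector on a single image and prints its 2D BBoxes; the 0.5 confidence threshold is an arbitrary choice.

```python
# Minimal sketch: 2D BBoxes from a pretrained two-stage detector (Faster R-CNN).
# Assumes torchvision >= 0.13; the 0.5 score threshold is an arbitrary choice.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()  # inference mode

image = torch.rand(3, 480, 640)  # dummy RGB image in [0, 1]; replace with a real camera frame

with torch.no_grad():
    pred = model([image])[0]  # dict with "boxes", "labels", "scores"

keep = pred["scores"] > 0.5
for box, label, score in zip(pred["boxes"][keep], pred["labels"][keep], pred["scores"][keep]):
    x1, y1, x2, y2 = box.tolist()  # 2D BBox corners in pixel coordinates
    print(f"class {label.item()}  score {score.item():.2f}  box ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```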
With the recent monumental innovations in sensor technology, a wide variety of DL-based 3D object [
25,
26,
27,
28] and place recognition approaches [
29,
30,
31] have been developed for different types of sensors. LiDAR and camera are two frequently used and increasingly popular sensors [
32] that have been employed for object and place recognition in robotic systems. 3D object recognition predicts 3D information about objects, such as pose, volume, and shape, in the form of 3D BBoxes with class labels, and plays an important role in the intelligent perception of robotic systems.
In contrast to 2D object detection, it requires richer input data and efficient algorithms to estimate six degrees of freedom (DoF) poses [
33] and the dimensions of oriented 3D BBoxes [34,35] with high precision. 3D place recognition involves determining whether two observations correspond to the same place based on their sensor information [
36]. Different approaches are used for place recognition, such as matching feature maps between images, learning representative features [
37], and calculating the pixel-wise distance between camera images.
LiDAR-based methods for place recognition concentrate on developing local [
38] and global [
39] descriptors from structural information, segmenting [
40] 3D LiDAR point cloud (PC) data, and utilizing CNN techniques on 3D LiDAR PCs by projecting range measurements onto 2D images [
41]. However, the synchronization of camera and LiDAR sensors [
42] is essential for capturing detailed object information and for large-scale place recognition.
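As an illustration of the range-image projection mentioned above, the sketch below (our own simplified example assuming a generic spinning LiDAR with a ±15° vertical field of view, not the exact projection of any reviewed method) converts a raw point cloud into a 2D range image that a CNN can process.

```python
# Minimal sketch: spherical projection of a LiDAR point cloud to a 2D range image.
# The sensor parameters (64 x 1024 grid, +/-15 deg vertical FOV) are illustrative.
import numpy as np

def pointcloud_to_range_image(points, h=64, w=1024, fov_up=15.0, fov_down=-15.0):
    """points: (N, 3) array of x, y, z coordinates in the sensor frame."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8        # range of each point
    yaw = np.arctan2(y, x)                           # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)                         # elevation angle

    up, down = np.radians(fov_up), np.radians(fov_down)
    u = ((yaw + np.pi) / (2.0 * np.pi) * w).astype(int) % w
    v = np.clip((up - pitch) / (up - down) * h, 0, h - 1).astype(int)

    image = np.zeros((h, w), dtype=np.float32)       # 0 means "no return"
    image[v, u] = r                                  # keep one range value per cell
    return image

scan = np.random.uniform(-20, 20, size=(10000, 3))   # placeholder for a real scan
print(pointcloud_to_range_image(scan).shape)         # (64, 1024)
```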
1.1. Contributions
During the last decade, there has been rapid progress in the domain of social robots, including autonomous vehicles. Part of this success relies on the implementation of both 3D object and place visual recognition tasks.
Previous reviews, shown in
Table 1, concentrated only on 3D object recognition and did not address 3D place recognition methods. In contrast to previous studies, this article reviews and analyzes sensor-based data representation methods for both 3D object and place recognition (3DOPR) using state-of-the-art DL-based approaches. We also discuss recently developed social robots.
This review concentrates on 3D visual recognition approaches that have applications in the domain of robotics, while approaches for smart environments are beyond the scope of the current survey. We aim to help novice researchers and experts overcome the challenging task of determining and utilizing the most suitable visual recognition approach for their intended robotic system, as this review enables a quick exploration of recent research progress.
Compared to the existing survey papers, shown in
Table 1, the present review differs in the following respects, to the best of our knowledge:
We discuss representative social robots that have been developed recently (
Section 2).
The present study is, to the best of our knowledge, the first article to provide a combined, comprehensive review of two robotic capabilities: 3D object recognition and 3D place recognition. It presents data representation modalities based on camera and LiDAR for both 3D recognition tasks using DL-based approaches (
Section 3).
It reviews 14 datasets used for 3D object and place recognition.
The current survey compares existing published results to evaluate the performance of the reviewed methods on these datasets.
It provides an analysis of selected approaches from the domain of robotics, delineates their advantages, summarizes the main current research trends, discusses the limitations, and outlines possible future directions.
Compared to earlier surveys, this study is more concerned with the most recent work. Therefore, it provides the reader with an opportunity to advance their understanding of state-of-the-art robotic 3D recognition methods.
1.2. Survey Structure
The survey has been organized in a top-down manner. The overall structure of the survey with corresponding topics and subsections is diagrammatically illustrated in
Figure 1. In
Section 2, the aim is to provide readers with fresh insight into recently developed social robots, covering their impact on society, use cases, sensors, tasks (i.e., recognition), and semantic functions (i.e., assisting) in public places (
Section 2.1), domestic (
Section 2.2), hospitals (
Section 2.3), and industrial environments (
Section 2.4).
In
Section 3, inspired by the recognition capabilities of social robots, as described in
Section 2, the article examines the sensor-based (camera and LiDAR) data representation approaches used for 3D object (
Section 3.1) and place (
Section 3.2) recognition using DL-based models. In addition, it gives a brief overview of the datasets (
Section 4) that have been used for the evaluation of 3D recognition methods and compares the published results on these datasets (Section 5). Subsequently, in
Section 6, the article discusses current research challenges and future research directions, and finally we conclude the survey with a summary in
Section 7.
1.3. Inclusion and Exclusion Criteria
The inclusion and exclusion criteria are mainly focused on
Section 3 for 3DOR and 3DPR methods.
Section 2 does not involve a comparison (instead, it highlights the importance of the visual recognition capability through examples of recently developed robots from different sectors); therefore, it is not restricted to the same time span as
Section 3. In contrast,
Section 3 performs a literature analysis of 3DOR and 3DPR methods; therefore, all studies in
Section 3 are restricted to a specific time span based on the inclusion and exclusion criteria. For 3DOR (
Section 3.1) and 3DPR (
Section 3.2), the inclusion criteria are as follows:
The research publications must have been published between 2014 and 2021.
Their application domain must be robotic systems.
They must be either journal or conference publications.
They must address 3DOR or 3DPR methods using deep-learning approaches based on camera and LiDAR sensor modalities.
Table 2 presents the inclusion and exclusion criteria that were applied during paper selection; the results of the systematic paper filtering process are described below.
Results of the Paper Selection Process
We conducted a systematic literature review for
Section 3 to determine which DL-based models are being used for 3D object and place recognition based on sensor modalities. We used four search strings (“Camera” AND “3D” AND “Object Recognition”, “LiDAR” AND “3D” AND “Object Recognition”, “Camera” AND “3D” AND “Place Recognition”, and “LiDAR” AND “3D” AND “Place Recognition”) to extract research articles from two key digital databases of academic literature, IEEE Xplore and the ACM Digital Library. The paper selection process of this article consists of four steps, as shown in
Figure 2 and
Figure 3.
First, the relevant articles were collected from the digital libraries using search strings corresponding to the sensor type (camera and LiDAR) and the 3D recognition category (object and place). In the second step, applying the time-period filter yielded 329 articles from IEEE Xplore and 593 articles from the ACM Digital Library. The third step refined these to 93 articles from IEEE Xplore and 144 articles from the ACM Digital Library that belonged to the robotics category. We used the MS Access database management software to find duplicates among these articles: running an SQL query on the database table revealed 35 duplicate articles in ACM and 21 in IEEE Xplore.
After removing the duplicates, the fourth step selected the articles that used deep-learning-based approaches, resulting in 23 articles from IEEE Xplore and 51 articles from the ACM Digital Library that met the inclusion and exclusion criteria. Lastly, the selected articles were arranged into 3DOR and 3DPR categories according to their sensor data representation methods: 17 articles from IEEE Xplore and 44 from the ACM Digital Library relate to the 3DOR task, while five articles from IEEE Xplore and seven from the ACM Digital Library relate to the 3DPR task.
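For illustration only, a minimal Python sketch of such a duplicate-removal step is given below; the original filtering was performed with an SQL query in MS Access, and the file name and column names used here ("search_results.csv", "title", "source") are hypothetical.

```python
# Hypothetical sketch of duplicate removal on exported library records; the
# actual step in this survey used an SQL query in MS Access.
import pandas as pd

records = pd.read_csv("search_results.csv")  # assumed export with "title" and "source" columns

# Normalize titles so that small formatting differences do not hide duplicates.
records["title_key"] = (records["title"].str.lower()
                        .str.replace(r"[^a-z0-9 ]", "", regex=True)
                        .str.strip())

duplicates = records[records.duplicated(subset="title_key", keep="first")]
print(duplicates.groupby("source").size())   # duplicate count per digital library

records.drop_duplicates(subset="title_key", keep="first").to_csv(
    "deduplicated_results.csv", index=False)
```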
4. Datasets
Many public and new datasets have been developed for training DL-based models. This section presents the 3D datasets used by the studies reviewed in
Section 3.1 and
Section 3.2 for the 3D object and place recognition tasks. We list the datasets used by each study in
Table 20.
Several methods discussed in this survey illustrate that the KITTI dataset [172], published in 2012 by [173], is the most frequently used dataset for 3DOR tasks: 19 of the 23 reviewed 3DOR studies used it. The dataset has been updated many times since its first release.
The current review shows that the Oxford RobotCar dataset [174], published in 2017 by [175], has gained attention from several autonomous driving vehicle (ADV) studies for 3DPR tasks; 7 out of the 12 reviewed 3DPR studies used it. The dataset contains over 1000 km of recorded driving along a consistent route with over 100 repetitions, comprising almost 20 million images from six cameras together with LiDAR and GPS data.
A series of recent studies also indicates that many research institutes have designed their own datasets, such as the Waymo Open, HKUST, KAIST, and NYUD2 datasets.
Waymo is an open dataset [
176] released recently by [
177] for autonomous driving vehicles. It is a large dataset consisting of 1150 scenes, each spanning 20 s, and it is well synchronized, with 3D BBoxes in the LiDAR data and 2D BBoxes in the camera images. In this review, one study [62] used the Waymo dataset to train a one-stage detector for recognizing objects in outdoor environments.
The HKUST dataset was captured by [82] for the 3DPR task in their study; each shot contains a grayscale image and a point cloud. The KAIST dataset [
178] was proposed by [
179] to provide LiDAR and stereo images of complex urban scenes. One of the reviewed studies [88] used the KAIST dataset to perform 3DPR tasks. NYUD2 is a Kinect dataset [
180] that was used by one 3DPR study [
78] in this survey. It was introduced by [
181] with 1449 RGB-D images and 26 scene classes from commercial and residential buildings.
Some networks (three 3DPR studies [
80,
83,
86]) have used the in-house dataset, which includes a university sector, a residential area, and a business district. This dataset was created by [80] using LiDAR sensors mounted on a car driven through four regions along routes of 10, 10, 8, and 5 km.
The SUN RGB-D dataset [
182] used by one 3DPR [
78] and two 3DOR [
70,
71] studies was presented by [
183]. It contains 10,355 RGB-D scene images as a training set and 2860 images as a testing set for 3D object detection, which is fundamental for scene understanding. The ISIA RGB-D dataset was proposed by [78] for use in their own study for the 3DPR task. It is a video dataset for evaluating RGB scene recognition and contains more than five hours of footage of indoor environments in 278 videos. It reuses 58 categories of the MIT indoor scene database [
184].
The Multi Vehicle Stereo Event Camera (MVSEC) dataset [185] is a collection of 3D perception data that was presented by [186] for event-based cameras. It has been used by the model in [79] to perform the 3D place recognition task. Its stereo event data were collected from a car, a bike, a handheld rig, and a hexacopter in both indoor and outdoor environments.
The DDD17 dataset, used in one 3DPR study [79], was introduced by [187]. It contains annotated recordings from dynamic and active-pixel vision sensors, consisting of over 12 h of city-driving video at night, daytime, and evening under different weather conditions and vehicle speeds. The ScanNet dataset was reported in [
188]. It has been used by two 3DOR [
70,
71] and one 3DPR [
81] studies in the current survey. It is an RGB-D video dataset containing 1513 scenes that are annotated with 3D camera poses. The research community has used this dataset for 3D scene understanding and semantic voxel labeling tasks.
The NCLT dataset [189], used by one 3DPR study [84] in this review, was documented in [190]. It is a long-term autonomy dataset for robotic research, collected with a Segway robot equipped with 3D LiDAR, planar LiDAR, GPS, and proprioceptive sensors. The Argoverse dataset [191] was introduced by [192] to support machine learning tasks for object detection in outdoor environments. A recent study [65] in this survey used this dataset for the 3DOR task. It is mainly designed for 3D tracking and motion forecasting: its 3D tracking dataset contains 360° images taken from seven cameras with 3D point clouds from LiDAR, while its motion forecasting dataset contains 300,000 tracked scenarios. It also includes 290 km of HD maps.
Summary
Section 4 presented 14 datasets that have been used by 35 studies. The SUN RGB-D, KITTI, and ScanNet datasets have been used for both 3DOR and 3DPR tasks. KITTI is the most frequently used dataset for 3DOR tasks (used by 20/23 studies), while the Oxford RobotCar dataset is widely used for scene understanding in 3DPR tasks (7/12 studies) in autonomous driving vehicles.
5. Performance Evaluation
Section 5 analyzes and compares the existing results in the context of different datasets (discussed in
Section 4) to present the performance of the methods that were reviewed in
Section 3.1 and
Section 3.2 for the 3DOR and 3DPR tasks. The evaluation metrics used for the KITTI dataset include the average precision (AP) at a given Intersection over Union (IoU) threshold for both the bird's eye view (AP_BEV) and 3D object detection (AP_3D), along with the average orientation similarity (AOS) [173] and the average localization precision (ALP). The AP, AOS, and ALP metrics are divided into easy, moderate, and hard according to the difficulty levels of 3D object detection, which are determined by height, occlusion, and truncation, for all three categories: cars, pedestrians, and cyclists. Recall @ 1%, AUC, and accuracy (%) are the metrics used to compare the performance of 3DPR tasks on different 3D datasets.
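To make these metrics concrete, the sketch below gives a simplified illustration (our own, not the official KITTI development kit): an IoU for axis-aligned BEV boxes (the benchmark actually uses rotated boxes) and an N-point interpolated AP over a precision-recall curve.

```python
# Simplified sketch of the metrics: axis-aligned BEV IoU and N-point interpolated AP.
# Illustrative only; the official KITTI evaluation uses rotated boxes.
import numpy as np

def bev_iou(a, b):
    """Boxes as (x_min, z_min, x_max, z_max) on the ground plane."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iz = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iz
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def interpolated_ap(recalls, precisions, n_points=40):
    """Average of the interpolated precision at n_points equally spaced recall levels."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, n_points):
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / n_points

print(bev_iou((0, 0, 4, 2), (1, 0, 5, 2)))                                    # 0.6
print(interpolated_ap(np.array([0.2, 0.5, 0.8]), np.array([0.9, 0.8, 0.6])))  # toy PR curve
```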
For performance evaluation based on the KITTI dataset, MonoPair [55] uses the 40-point interpolated average precision metric (AP_40), which is evaluated for both the bird's eye view (AP_BEV) and the 3D bounding box (AP_3D). It reports AP with an intersection over union (IoU) threshold of 0.7 for car, pedestrian, and cyclist detection.
Table 21 and Table 22 show the performance of the one-stage anchor-free detector of [55] on the KITTI validation and test sets for the car category, while the performance for pedestrians and cyclists on the KITTI test set is shown in Table 23 and Table 24, respectively. It can also perform inference in real time, at 57 ms per image, which is higher than [106].
GS3D [
56] evaluated the framework on the KITTI object detection benchmark and followed [193] to use two train/validation (val) splits. Its experiments were mainly focused on the car category. Table 21 and Table 22 show the evaluation results of 3D detection accuracy on KITTI for the car category using the AP_3D metric on the two validation sets, val1 and val2. The performance is higher than that of [
102] for 3D object detection in autonomous driving. In [
56], researchers used the metric of Average Localization Precision (ALP) and outperformed [
193].
Table 21 presents the results of [56] for the car category evaluated using the ALP metric, with results on the two validation sets val1/val2.
SS3D [
57] evaluated its proposed methods primarily on the KITTI object detection benchmark. It focused on three categories (car, pedestrian, and cyclist), which are the most relevant for autonomous vehicle applications. The metric used for the evaluation in [57] is the average precision (AP), where a detection is considered valid if the IoU is at least 0.7, in bird's-eye view and in 3D, respectively.
The researchers in [57] used the same validation splits, called split-1 [194] and split-2 [195], which divide the training data almost in half, and performed training on all three categories simultaneously.
Table 21 shows the AP with the 3D IoU detection criterion on the validation set for the car class, with a clear performance ranking Method 1 ≺ Method 2 ≺ Method 3. It also reports the results using the ALP metric. Jörgensen et al. [57] ran inference on the KITTI test set; the evaluation results on test data for cars are given in Table 22, while those for the pedestrian and cyclist classes in bird's-eye view (AP_BEV) and in 3D (AP_3D) are presented in Table 24.
M3DSSD [
58] evaluated the proposed framework on the challenging KITTI benchmark for 3D object detection, covering three main object categories: cars, pedestrians, and cyclists. AP scores for cars on the validation and test sets for 3D object detection and bird's eye view are shown in
Table 21 and
Table 22, while the 3D detection performance for pedestrians and cyclists on the test set at a 0.5 IoU threshold is reported in
Table 24.
SRCNN [
59] evaluated the proposed model using the average precision for bird's eye view (AP_BEV) and the 3D box (AP_3D) on the KITTI car validation and test sets; the results are reported in
Table 21 and
Table 22, respectively. It outperforms state-of-the-art monocular-based methods [34,196] and the stereo method [197] by large margins. Specifically, for the easy and moderate sets, it outperforms 3DOP [197] by over 30% for both AP_BEV and AP_3D, while for the hard set, it achieves ∼25% improvements.
CenterNet [
60] used ResNet-18 [108] and DLA-34 [105] as the backbones of its three methods and showed that they are superior to previous monocular-based methods. The performance in terms of AP_BEV and AP_3D for car 3D localization and detection on the KITTI validation set is shown in
Table 21.
RT3D [
61] evaluated the proposed method on KITTI for autonomous driving and divided the samples into training and validation sets exactly as in [194]. The results of both the 3D localization and 3D detection evaluations are obtained using average precision and reported in Table 21 and Table 22, respectively. It is 2.5× faster than [114]. Its detection time of 0.089 s allows deployment in real-time systems, and it achieves at least 13% higher accuracy compared to [102,194,198].
AFDet [
62] evaluated the results using the average precision (AP) metric, as shown in Table 21, where the IoU threshold was 0.7 for the car class. It does not require a complex post-processing procedure or non-maximum suppression (NMS) to filter the results.
SegV Net [
63] evaluated the 3D vehicle detection results on the KITTI test dataset using the AP_BEV and AP_3D metrics, as shown in Table 22, while the results on the validation dataset with the AP metric and average orientation similarity (AOS) are reported in
Table 21. It outperformed LiDAR-only single-stage methods [
111,
113] in 3D vehicle detection.
SECONDX [
64] supports the car, pedestrian, and cyclist categories with a single model and outperforms other methods in AP for all three classes. Its evaluation results on the KITTI validation set are given in Table 21 and Table 23. It runs in real time without increasing memory usage or inference time compared with [
120].
IPOD [
66] follows the AP metric for all three classes, where the IoU threshold is 0.7 for the car class and 0.5 for the pedestrian and cyclist classes. For evaluation on the test set, the model used train/val sets at a ratio of 4:1. The performance of the method is listed in
Table 21,
Table 22,
Table 23 and
Table 24. Yang et al. [
66] showed that compared to [
199], the detection accuracy of IPOD on the hard set improved by 2.52% and 4.14% in BEV and 3D, respectively. Similarly, compared to [73,120], it performs better in pedestrian prediction by 6.12%, 1.87%, and 1.51% on the easy, moderate, and hard levels, respectively.
FVNet [
67] presents the performance for the car category at 0.7 IoU using AP_BEV and AP_3D, and for the pedestrian and cyclist categories at 0.5 IoU, on the KITTI test dataset, as shown in Table 22 and Table 24. It achieved significantly better results despite using raw point clouds, and its inference time was 12 ms. Compared to [73], which employs both front-view and bird's-eye view, it performs best on all three categories except car detection in the easy setting.
In DPointNet [68], the dataset includes three categories: car, pedestrian, and cyclist. However, only the car class is evaluated because of its richer data.
Table 21 and
Table 22 show its performance on the KITTI validation and test sets, respectively, using the average precision (AP) of the car class with a 0.7 IoU threshold. Li et al. [68] demonstrated that the proposed DPointNet improves performance on the KITTI validation set by 0.4% to 0.6%, with only about 60% of the running time.
Point-GCNN [
69] used the KITTI benchmark to evaluate the average precision (AP) of three types of objects: car, pedestrian and cyclist. Following [
111,
114,
200], it handles scale differences by training one network for the car class and another network for both the pedestrian and cyclist classes. The AP results of 3D and BEV object detection on the KITTI test set for all three categories are shown in
Table 22 and
Table 24. It achieved good results for car detection on the easy and moderate levels and for cyclist detection on the moderate and hard levels, where it surpasses previous approaches by 3.45. The reason for the lower pedestrian detection performance compared to the car and cyclist classes is that the vertices are not dense enough to obtain more accurate BBoxes.
S-AT GCN [
72] evaluated the 3D detection results using the 3D and BEV average precision at a 0.7 IoU threshold for the car class and a 0.5 IoU threshold for the pedestrian and cyclist classes. The results on the KITTI validation data are reported in
Table 21 and
Table 23. Its method 1 gives the results of self-attention (AT) without dimension reduction, while method 2 gives the results of self-attention with dimension reduction (ATRD). Compared to method 1, the second method performs better for car detection on all three difficulty levels, for pedestrians at the hard difficulty level, and for cyclists at the moderate and hard difficulty levels. Wang et al. [72] described that adding a feature enhancement layer with self-attention can bring an extra 1% and 2–3% improvement for pedestrian and cyclist detection, respectively.
MV3D [
73] followed [
194] to split the training and validation sets, each containing about half of the whole dataset. It focused only on the car category and performed the evaluation on three difficulty regimes: easy, moderate, and hard. The results using AP_BEV and AP_3D at IoU = 0.7 on the validation set are shown in Table 21. Chen et al. [73] showed that the proposed method performed better than [41] in AP under an IoU threshold of 0.7, achieving ∼45% higher AP across the easy, moderate, and hard regimes. Similarly, it obtained ∼30% higher AP over [41] at IoU = 0.7, reaching 71.29% AP on the easy level.
BEVLFVC [
74] evaluated the pedestrian detection results using the 3D detection average precision (AP_3D) on the KITTI validation dataset, as shown in Table 23. Wang et al. described that the highest performance on the validation set is achieved by fusing [
114,
126] with the proposed sparse non-homogeneous pooling layer and one-stage detection network.
D3PD [
75] trained the model using different hyperparameters and evaluated the validation split using the AP metric for pedestrian detection, as shown in Table 23. Roth et al. [75] illustrated that the highest performance is obtained using concatenation-based feature combination in the detection network and showed that the deep fusion scheme performs slightly better than the early fusion scheme.
MVX-Net [
76] splits the training data into train and validation sets without including samples from the same sequences in both sets [73]. It evaluated the 3D car detection performance using the AP metric in 3D and bird's eye view on the validation and test sets, as shown in Table 21 and Table 22. The experimental results show that [76] with point fusion significantly improves the mean average precision.
SharedNet [
77] achieves competitive results compared with other state-of-the-art methods. The results on the KITTI validation and test datasets for three classes (cars, pedestrians, and cyclists) were evaluated using the mean average precision metric. The results for the car validation and test sets are given in
Table 21 and
Table 22, respectively, while those for the pedestrian and cyclist categories on the validation set are listed in Table 23. Wen et al. [77] illustrate that the proposed model competes with [199,201] in overall performance. For the cyclist class, it outperforms [201], while in the car class it is 2× faster than [201].
SDes-Net [
86] trains and tests different descriptor extraction models on real-world data from the KITTI dataset. It evaluates their performance for 3DPR tasks to determine matching and non-matching pairs of segments and to obtain correct candidate matches. First, it compares the general accuracy of different descriptors using positive and negative pairs of segments from the test set. The experimental results show that the Siamese network [167] achieves the best overall classification accuracy, about 80%, as listed in
Table 25.
The second comparison among descriptors was conducted to find the best descriptor for generating candidate matches based on the closest neighbor in the Euclidean descriptor space. The experimental results demonstrate that the group-based classifier and the feature extraction network trained using the contrastive loss function [165] performed best, with around 50% positive matches, while the Siamese network [
167] had only around 30% positive matches.
OREOS [
84] demonstrates the place recognition performance on the NCLT and KITTI datasets for an increasing number of nearest place candidates retrieved from the map, achieving a recall of 96.7% on the KITTI dataset and 98.2% on the NCLT dataset, as shown in
Table 25.
CLFD-Net [
88] uses the KITTI and KAIST datasets for the place recognition task. The KITTI dataset supplies 11 scenes containing accurate odometry ground truth; these scenes are used in the experiments and referred to as KITTI 00, ..., KITTI 10. The method has the potential to be applied in autonomous driving or robotic systems. Measured by recall @ 1%, its performance on the KITTI 00 scene is 98.1, which is 1.7% higher than [80] and 2.5% higher than [108]. The performance on the KAIST3two scene is 95.2, which is 8.5% higher than [80] and 6.9% higher than [108]. The overall performance of the model [88] on the KITTI dataset in terms of average recall @ 1% is higher than on the KAIST dataset, as shown in Table 25.
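Since recall @ 1% appears throughout these 3DPR comparisons, the following sketch (our own illustration, not the evaluation code of any reviewed method) shows how it is typically computed from global descriptors: a query is counted as correct if at least one of its top-N database matches is a true neighbour, with N set to 1% of the database size.

```python
# Minimal sketch of recall @ 1% for place recognition retrieval (illustrative only).
import numpy as np

def recall_at_1_percent(query_desc, db_desc, ground_truth):
    """query_desc: (Q, D); db_desc: (M, D); ground_truth[i]: set of correct db indices."""
    n_top = max(int(round(0.01 * len(db_desc))), 1)
    hits = 0
    for i, q in enumerate(query_desc):
        dists = np.linalg.norm(db_desc - q, axis=1)   # Euclidean distance in descriptor space
        top_n = np.argsort(dists)[:n_top]             # closest candidate places
        if ground_truth[i] & set(top_n.tolist()):
            hits += 1
    return 100.0 * hits / len(query_desc)

rng = np.random.default_rng(0)
db = rng.normal(size=(500, 256))                       # toy global descriptors of mapped places
queries = db[:10] + 0.01 * rng.normal(size=(10, 256))  # perturbed revisits of known places
gt = {i: {i} for i in range(10)}
print(recall_at_1_percent(queries, db, gt))            # close to 100 for this toy case
```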
Table 26 illustrates the performance of the network proposed in [65] for vehicle and pedestrian detection using the standard average precision for 3D detection (AP_3D) and the bird's eye view (AP_BEV). The AP scores are measured at an IoU threshold of 0.7 for the car class and 0.5 for the pedestrian class, with a reasonable inference speed (30 FPS).
RGNet [
70] and HGNet [
71] used the ScanNet and SUN RGB-D datasets to perform 3DOR tasks, while [81] used the ScanNet dataset for the 3DPR task. In [70], the network model performs better on 15/18 classes for the 3D object detection task (i.e., chair, table, bed, etc.) using the ScanNet dataset and evaluates the performance using mean average precision, given in Table 27, with an accuracy of 48.5 in terms of mAP @ 0.25. Its 3D object detection in point clouds on the SUN RGB-D dataset showed an overall performance of 59.2 mAP @ 0.25, performing better on 6/10 object classes.
In [
71], 3D object detection results of 61.3% mAP @ 0.25 were achieved on the ScanNet dataset and 61.6% on the SUN RGB-D dataset for the ten most common object categories (such as bed, sofa, chair, table, etc.). The results are listed in Table 27.
RGBD-Net [
78] evaluated the scene recognition results on the NYUD2, SUN RGB-D, and ISIA RGB-D datasets for the 3DPR task. It follows the split of [181], which groups the 27 indoor categories of the NYUD2 dataset into 10 categories; there are 40 scene categories in the SUN RGB-D dataset and eight in the ISIA RGB-D video database. For each category, 60% of the data are used for training and 40% for testing. Following [183], it uses the mean class accuracy for the evaluation and comparison of results, which are shown in
Table 27.
ISR-Net [
81] uses the ScanNet benchmark to present the scene classification results for place recognition (library, bedroom, kitchen, etc.) and achieves an average recall of 0.70, as shown in Table 27. It performs better on 11/13 scenes, reaching 70.0% recall, compared to [
202], which has an average recall of at most 49.8%.
In PointNetVLAD [80], the performance in terms of average recall at 1% is evaluated using the Oxford dataset and three in-house datasets. It achieved reasonable results of 80.31, 72.63, 60.27, and 65.3 for the Oxford, U.S., R.A., and B.D. datasets, respectively, as shown in
Table 28.
MinkLoc3D [
87] reported experimental results on the Oxford dataset and three in-house datasets that were acquired using LiDARs with different characteristics. The place recognition model achieves 97.9 average recall at 1% on the Oxford RobotCar dataset, which is higher than [83]. When the model of [87] is evaluated on the three in-house datasets, its performance compared to [83] is 1.0 and 0.6 p.p. lower for the U.S. and B.D. sets (95.0 and 88.5, respectively), while it is 0.7 p.p. higher for the R.A. set. The results are listed in
Table 28.
The experimental results of PIC-Net [
89] show that the performance of its optimal configuration is 98.23% in terms of average recall @ 1%, as shown in Table 28, which is about 0.52% better than direct concatenation.
LPD-Net [83] evaluated the network model on the three in-house datasets and achieved 96.00, 90.46, and 89.14 average recall @ 1% for the U.S., R.A., and B.D. sets, as shown in Table 28. It was trained only on the Oxford RobotCar dataset and tested directly on the in-house datasets.
SDM-Net [
85] considers ten place recognition cases and uses the area under the precision-recall curve (AUC) to evaluate sequence pairs for representative cases. The results for all of them are reported in Table 28. It outperforms [152] in six out of ten cases.
In Event-VPR [79], the performance of the proposed method is evaluated on the MVSEC and Oxford RobotCar datasets, and the results are listed in Table 28. On the MVSEC dataset, two daytime and three nighttime sequences are trained together, and each of them is then tested separately. The recall @ 1% of the model on the night sequences reaches 97.05% on average and is almost the same on the daytime sequences. On the Oxford RobotCar dataset, the model's place recognition performance is shown under various weather conditions and seasons; night sequences are used for training, and testing is performed on both day and night sequences. Its recall @ 1% on the Oxford RobotCar dataset is about 26.02% higher than [
203] but about 7.86% lower than [
152].
Summary
Section 5 analyzes the performance of the 3DOR and 3DPR methods by comparing the published results based on three evaluation metrics (AP, AOS, and ALP) for 3DOR tasks and three evaluation metrics (recall, accuracy, and AUC) for 3DPR tasks. The results are grouped for comparison according to the datasets used by each method.
Performance comparison on the KITTI car validation and test sets is presented in
Table 21 and
Table 22, respectively. The analysis for the KITTI pedestrian and cyclist validation set is given in Table 23 and for the test set in
Table 24.
Table 21 shows that the performance of [77] on the easy level and of [72] on the moderate and hard difficulty levels is better for AP_BEV (IoU @ 0.7); [63] on the easy level and [68] on the moderate and hard levels perform better than the other methods for AP_3D (IoU @ 0.7); and the val1 set of [61] surpasses all models for ALP on all three levels.
Table 22 shows that [77] outperforms on the easy set while [69] performs better on the moderate and hard sets for AP_BEV (IoU @ 0.7); the performance of [69] is higher than that of the other methods on all three levels for AP_3D (IoU @ 0.7); and the model of [63] exceeds [67] for AOS on all three levels.
In Table 23, the performance analysis of the pedestrian category illustrates that [66] outperforms on all three levels for AP_BEV (IoU @ 0.5), while [77] on the easy and moderate levels and [66] on the hard level perform better for AP_3D (IoU @ 0.5). The comparison of the cyclist category shows that the first method of [72] on the easy level and its second method on the moderate and hard levels give better results using AP_BEV (IoU @ 0.5), while the first method of [72] on the moderate level and its second method on the easy and hard levels outperform for AP_3D (IoU @ 0.5).
Table 24 shows that, for the pedestrian category, the results of [55,66] outperform the other methods on all three levels for AP_BEV and AP_3D at IoU @ 0.7 and IoU @ 0.5, respectively. For the cyclist category, the results of [69] and of the third method of [57] show higher performance on all three levels when compared using AP_BEV and AP_3D at IoU @ 0.5 and IoU @ 0.7, respectively.
For 3DPR tasks,
Table 25 presents that [
88] has higher recall than [
84] on the KITTI dataset, while the contrastive and group-based methods have equally higher accuracy in [88]. The performance comparison for the 3DOR task on the ScanNet and SUN RGB-D datasets shows that [
71] has a higher mAP @ 0.25 compared to [
70] in
Table 27.
Table 28 presents that [
89] on the Oxford RobotCar dataset and [83] on the in-house datasets outperform [80,89] when evaluated with average recall @ 1% for the 3DPR task.
6. Discussion and Future Research Directions
This section summarizes the most relevant findings from the review of representative social robots (Section 2) and of camera- and LiDAR-based data representations for 3D recognition (Section 3), covering both object (Section 3.1) and place (Section 3.2) recognition.
This article first highlighted the value-centric role of social robots in society by presenting recently developed robots. These social robots perform front-line tasks and take on complex roles in public, domestic, hospital, and industrial settings. The semantic understanding of the environment varies depending on the domain and application scenarios of the robots. For instance, the semantic understanding task for a robot working in a factory alongside a human co-worker differs from that of robots working at home due to their different objectives. Usually, these robots are equipped with a variety of sensors, such as cameras and LiDAR, to perform human-like recognition tasks.
Focusing on the recognition capability of social robots, the article then explored camera- and LiDAR-based 3D data representation methods using deep learning models for object and place recognition. Both sensors are affected by changes in scene lighting conditions as well as other weather factors [204]. In addition, both object and place recognition (OPR) tasks rely on different methods of semantic understanding, which help to detect small and occluded objects in cluttered environments or objects in occluded scenes.
Examining the existing literature on 3D recognition reveals that there are relatively few studies on 3D place recognition compared to 3D object recognition. Moreover, a stable model for 3D recognition has not yet been established. In the real world, a robot's behavior strongly depends on its surrounding conditions, and it needs to recognize its environment from the input scenery. However, the literature search shows that, up to now, little attention has been paid to LiDAR-based 3D recognition in indoor environments using DL-based approaches, in contrast to outdoor recognition.
A monocular camera is a low-cost alternative for 3DOR, where depth information is estimated with the aid of semantic properties obtained from segmentation. Monocular 3D object detection can be improved by establishing pairwise spatial relationships or by regressing 3D representations for 3D boxes in indoor environments, while in outdoor environments visual features of visible surfaces can be exploited to extract 3D structural information. Compared with a monocular camera, more precise depth information can be obtained with a stereo camera by utilizing semantic and geometric information, and region-based alignment methods can be used for 3D object localization; this can further be extended to general object detection by learning 3D object shapes.
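As a brief aside on why a stereo camera yields more precise depth than a monocular one (a standard pinhole-stereo relation, not a result of any reviewed method): for a rectified stereo pair with focal length $f$, baseline $b$, and pixel disparity $d$ between matched points, the depth of a point is $Z = \frac{f\,b}{d}$; because the disparity shrinks with distance, the depth error grows roughly quadratically with range.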
At present, most 3DOR methods heavily depend on LiDAR data for accurate and precise depth information. However, LiDAR is expensive, and its perception range is relatively short. The article categorized the LiDAR-based 3DOR methods into structured, unstructured, and graph-based representations. Some 2D image grid-based methods use pre-RoI pooling convolutions and pose-sensitive feature maps for accurate orientation and size estimation, which could be enhanced with a more advanced encoding scheme that preserves height information.
We reviewed 3D voxel grid-based methods that incorporate semantic information by exploiting BEV semantic masks and a depth-aware head and by providing multi-class support for 3D recognition. 3D object detection from raw and sparse point cloud data has been far less explored to date using DL models compared with its 2D counterpart.
3D LiDAR PC-based object detection can yield improved performance through context information and precise PC coordinates, as well as by generating feature maps through cylindrical projection and by combining proposal generation and parameter estimation networks. However, little research has looked into encoding PCs using graph neural networks (GNNs) for highly accurate 3DOR. The joint learning of pseudo centers and direction vectors for utilizing multi-graphs was explored with supervised graph strategies for improving performance. Point clouds do not capture semantic (e.g., shape) information well; however, the hierarchical graph network (HGNet) approach handles this problem effectively with multi-level semantics for 3DOR.
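To make the voxel-grid representation discussed above concrete, the sketch below (our own illustration with arbitrary grid parameters, not the encoding of any specific reviewed detector) converts a raw point cloud into a sparse occupancy grid; real detectors typically add learned per-voxel feature encoders on top of such a structure.

```python
# Minimal sketch: sparse voxelization of a LiDAR point cloud (illustrative parameters).
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.2),
             range_min=(-40.0, -40.0, -3.0), range_max=(40.0, 40.0, 1.0)):
    """points: (N, 3) LiDAR points; returns occupied voxel indices and point counts."""
    pts = np.asarray(points, dtype=np.float32)
    lo, hi = np.array(range_min), np.array(range_max)
    inside = np.all((pts >= lo) & (pts < hi), axis=1)   # crop to the detection range
    idx = ((pts[inside] - lo) / np.array(voxel_size)).astype(np.int32)
    voxels, counts = np.unique(idx, axis=0, return_counts=True)
    return voxels, counts                               # sparse occupancy + counts per voxel

scan = np.random.uniform(-50, 50, size=(20000, 3))      # placeholder for a real scan
voxels, counts = voxelize(scan)
print(voxels.shape, int(counts.max()))                  # (V, 3) occupied voxels
```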
Sensor fusion methods based on camera and LiDAR for 3DOR using deep fusion schemes have gained attention. These methods rely on combining multi-view region-wise features, constructing a sparse non-homogeneous pooling layer for feature transformation between the two views that allows fusion of these features, extracting point cloud features using a voxel feature encoder and utilizing anchor proposals, or integrating point and voxel fusions. In this direction, future research needs to develop deep multi-class detection networks.
Unlike 3DOR, the 3DPR task based on LiDAR and camera-LiDAR fusion methods that leverage the recent success of deep networks has remained a less explored problem. LiDAR PC-based 3DPR methods depend on metric learning and inference to extract global descriptors from 3D PCs, the extraction of local structures and the spatial distribution of local features, the representation of scenes based on semi-dense point clouds, the use of data-driven descriptors for nearby place candidates, and the estimation of the yaw angle for oriented recognition. Camera-LiDAR sensor fusion methods that extract fused global descriptors for 3DPR via DL approaches depend on applying a trimmed strategy to the global feature aggregation of the PC or on using attention-based fusion methods to distinguish discriminative features, which can be improved by color normalization.