Three-Dimensional Outdoor Object Detection in Quadrupedal Robots for Surveillance Navigations
Abstract
1. Introduction
Main Contributions of This Paper
- An architecture that performs accurate object detection with three-dimensional bounding boxes without Velodyne or depth data as input during training, relying solely on 2D images, labels, and camera calibration files.
- Accuracy of up to 99.13%, a reduction in YOLOv5 loss from 0.28 to an average of −0.223, and an overall average precision of up to 96.16%.
- Improved depth estimation and motion control: Accurate estimation of depth alongside length and width provided by 3D object detection algorithms enables quadrupedal robots to make more informed decisions regarding motion control and trajectory planning. By accurately perceiving the three-dimensional structure of the environment, including the size and location of obstacles, the robots can optimize their movements to navigate safely and efficiently. This improvement in depth estimation enhances the robots’ motion control capabilities, leading to smoother and more precise locomotion in challenging outdoor terrains.
- Tailored object detection for outdoor environments: Adapting 3D object detection algorithms to quadrupedal robots addresses the specific challenges associated with navigating outdoor environments, such as roads and highways. By enhancing object detection capabilities tailored for these scenarios, the robots can effectively perceive and respond to obstacles, vehicles, pedestrians, and other objects with depth information. This tailored approach improves their overall situational awareness and ensures safer navigation in dynamic outdoor environments.
- Integration with quadrupedal locomotion: Adapting 3D object detection algorithms to the specific movement characteristics of quadrupedal robots is necessary. Compared to other robotic platforms, these robots display unique kinematics, motion dynamics, and terrain interactions. The algorithms can leverage the robots’ agility and mobility to improve their perception and navigational skills by smoothly incorporating 3D object identification into their control and decision-making processes. Smoother motion planning, obstacle avoidance, and object tracking are made possible through this integration, which enhances the effectiveness and efficiency of robotic operations in outdoor environments.
- Enhanced object tracking and situational awareness: Three-dimensional object detection enables robust object tracking and monitoring for quadrupedal robots. Through persistent detection and tracking of objects such as pedestrians, moving cars, and motorbikes, the robots can maintain situational awareness and anticipate potential threats or changes in the surrounding environment.
2. Literature Review
3. Methodology
3.1. YOLO
3.2. YOLO3D: Viewpoint Feature Histogram
- Backbone feature extraction: YOLO3D extracts backbone features using a Darknet-53 convolutional neural network (CNN). This CNN efficiently processes the input image and produces detailed feature maps that capture both spatial and semantic information about the scene.
- Prediction head: Based on these feature maps, YOLO3D uses a series of convolutional layers to predict class probabilities, 2D bounding boxes, and 3D dimensions (depth, height, and width) for each object in the scene. Furthermore, it predicts an offset from the object’s center point to its bottom corner, allowing for more precise 3D bounding box placement.
- Anchor boxes and loss function: To help the network make accurate predictions, YOLO3D uses a predefined set of anchor boxes with varying scales and aspect ratios. These anchor boxes serve as reference points for the network to learn and predict the exact box coordinates. A well-designed loss function incorporates both classification and localization losses, penalizing the network for incorrect class predictions, inaccurate bounding boxes, and miscalculated 3D dimensions.
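To make the anchor-matching step concrete, the following is a minimal NumPy sketch of shape-based IoU matching between anchors and ground-truth boxes, as is common in YOLO-style training; the function names (shape_iou, match_anchors) and the 0.5 IoU threshold are illustrative assumptions, not the exact implementation used here.

```python
import numpy as np

def shape_iou(anchors, boxes):
    # IoU between anchor (w, h) shapes and ground-truth (w, h) shapes,
    # treating both as if centered at the same point (YOLO-style matching).
    inter = (np.minimum(anchors[:, None, 0], boxes[None, :, 0]) *
             np.minimum(anchors[:, None, 1], boxes[None, :, 1]))
    union = ((anchors[:, 0] * anchors[:, 1])[:, None] +
             (boxes[:, 0] * boxes[:, 1])[None, :] - inter)
    return inter / union

def match_anchors(anchors, gt_boxes, iou_thresh=0.5):
    # Assign each ground-truth box to its best-overlapping anchor,
    # keeping only assignments above the IoU threshold.
    iou = shape_iou(anchors, gt_boxes)
    best = iou.argmax(axis=0)  # best anchor index per ground-truth box
    return [(int(a), g) for g, a in enumerate(best) if iou[a, g] >= iou_thresh]

# Example: three anchors and two ground-truth boxes (widths/heights in pixels)
anchors = np.array([[10, 13], [30, 61], [156, 198]], dtype=float)
gt = np.array([[33, 55], [140, 180]], dtype=float)
print(match_anchors(anchors, gt))  # [(1, 0), (2, 1)]
```

During training, each matched (anchor, object) pair would then contribute a classification term and a localization term to the combined loss described above.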
3.3. Preprocessing
1. Angle calculations: The script involves calculations related to angles, such as determining angle bins, calculating object orientation angles (alpha), and computing the global angle (θ) of each object. The generate_bins function calculates the angle bins from the specified number of bins (see the sketch after this list).
2. Object orientation (alpha) computation: The alpha angle (object orientation) is calculated using the angle bins and adjusted based on the computed global angle (θ) for each object.
3. Depth calculation for object centroids: The depth of each object centroid in 3D space is calculated from the object's distance and angle:
   - depth: calculated depth of the object's centroid from the camera in 3D space;
   - distance: distance of the object from the camera, computed from image coordinates and camera calibration parameters;
   - width: width of the object in 3D space (e.g., width of the detected bounding box).
4. Filtering and matching objects to image data:
   (a) Filtering ground truth objects: The script filters out "Don't Care" objects and those that are truncated or outside the camera's field of view (not visible in the image).
   (b) Matching 2D and 3D object labels: The 2D labels from the image are aligned with the 3D labels in the LiDAR data using the calculated alpha angle and the object's depth information.
5. Matching anchors to objects: The YOLOv5 model uses anchor boxes to predict bounding boxes. Anchors are matched to objects based on their Intersection over Union (IoU) overlap and assigned to objects for training the detection network.
6. Custom data preprocessing (RGB-to-LiDAR transformation): The script includes custom preprocessing steps such as transforming RGB (image) data into the LiDAR (point cloud) frame for 3D object detection.
7. Final data augmentation: The script applies data augmentation techniques such as random cropping, flipping, and color jittering to create a more diverse training dataset.
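Items 1-3 above can be sketched as follows, assuming KITTI's orientation convention (alpha = θ − θ_ray, wrapped to [−π, π]). Here, generate_bins mirrors the function named in the text, while compute_alpha and centroid_depth are hypothetical helpers; in particular, the relation depth = distance · cos(angle) is an assumed simplification for illustration only.

```python
import numpy as np

def generate_bins(num_bins):
    # Centers of num_bins orientation bins spanning [0, 2*pi);
    # the offsets used by the original script may differ.
    interval = 2 * np.pi / num_bins
    return np.arange(num_bins) * interval + interval / 2

def compute_alpha(theta, theta_ray):
    # Observation angle alpha from the global angle theta and the
    # viewing-ray angle theta_ray, wrapped to [-pi, pi] (KITTI convention).
    return (theta - theta_ray + np.pi) % (2 * np.pi) - np.pi

def centroid_depth(distance, angle):
    # Assumed relation: forward (z) depth of an object centroid from its
    # straight-line distance to the camera and its angle off the optical axis.
    return distance * np.cos(angle)

print(generate_bins(2))           # [1.5708 4.7124], i.e., pi/2 and 3*pi/2
print(compute_alpha(1.8, 0.3))    # 1.5
print(centroid_depth(10.0, 0.2))  # ~9.80
```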
3.4. Postprocessing
1. Bounding box prediction: YOLO3D predicts bounding boxes similarly to YOLOv5. Each bounding box is defined by its center coordinates (x, y, z), its dimensions (width, height, depth), and its orientation (roll, pitch, yaw). The output of YOLO3D is a grid of candidate bounding boxes over the input image.
2. Non-Maximum Suppression (NMS): YOLO3D applies NMS to filter out duplicate or overlapping bounding boxes. The Intersection over Union (IoU) metric measures the overlap between predicted boxes, and only the boxes with the highest confidence scores and minimal overlap are retained, reducing the number of false positives in the final output (a minimal sketch of this step, together with step 5, follows the list).
3. Class prediction: YOLO3D also predicts the class of each detected object, using a softmax function that assigns a probability to each class based on the likelihood that the object belongs to it; the final prediction is the class with the highest probability.
4. Three-dimensional object localization: The raw coordinates predicted by YOLO3D are transformed into real-world coordinates in postprocessing, which may include calculating the object's distance from the camera, determining its orientation in space, and computing the dimensions of the 3D bounding box.
5. Confidence thresholding: YOLO3D applies a confidence threshold to the final predictions; only predictions with confidence scores above the threshold are considered valid detections, filtering out low-confidence predictions that are likely to be false positives.
6. LiDAR projection (if used): If LiDAR data are available, YOLO3D projects the 3D bounding boxes onto the LiDAR point cloud. Comparing the predicted boxes with the actual LiDAR returns can refine object localization and reduce false positives.
7. Visualization: The final postprocessing step is often visualization. YOLO3D draws bounding boxes around the detected objects in the point cloud or image data, which helps to verify detection accuracy and is useful for debugging and further model development.
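As a concrete illustration of steps 2 and 5, here is a minimal NumPy sketch of confidence thresholding followed by greedy NMS on axis-aligned 2D boxes. The function names, default thresholds, and (x1, y1, x2, y2) box format are assumptions for illustration; a full 3D pipeline would substitute an oriented 3D-box IoU.

```python
import numpy as np

def iou_2d(box, boxes):
    # Axis-aligned IoU between one box and an array of boxes,
    # all given as (x1, y1, x2, y2).
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def threshold_and_nms(boxes, scores, conf_thresh=0.25, iou_thresh=0.45):
    # Drop low-confidence detections, then greedily keep the highest-scoring
    # box and suppress remaining boxes that overlap it above iou_thresh.
    order = np.argsort(-scores)
    order = order[scores[order] >= conf_thresh]
    keep = []
    while order.size > 0:
        i, order = order[0], order[1:]
        keep.append(int(i))
        if order.size > 0:
            order = order[iou_2d(boxes[i], boxes[order]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.6])
print(threshold_and_nms(boxes, scores))  # [0, 2]: box 1 is suppressed by box 0
```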
3.5. Training
Computational Requirements
4. Performance Evaluation
- Three-dimensional object detection: The YOLO3D model demonstrates high accuracy in detecting objects in 3D space, achieving an average precision of 96.2%. The model’s ability to accurately estimate object dimensions and orientations is validated by the low error rates observed in the predicted 3D bounding boxes.
- Robustness in challenging environments: The model’s performance is consistently high across different environmental conditions, including varying lighting and weather conditions. This robustness is attributed to the data augmentation techniques used during training, which simulate diverse camera perspectives and lighting scenarios.
- Depth estimation and motion control: The accurate depth estimation provided by the YOLO3D model enhances the quadrupedal robot’s motion control and trajectory planning. The robot’s ability to navigate complex outdoor terrains is significantly improved, as evidenced by smoother and more precise locomotion.
- Speed and efficiency: The YOLO3D model achieves real-time performance, with an inference time of 40 ms per image on a standard GPU. This efficiency makes it well-suited for deployment in quadrupedal robots, where rapid decision-making is crucial for safe navigation.
- Comparison with baseline models: The YOLO3D model outperforms baseline 2D object detection models in terms of both accuracy and robustness. The integration of 3D detection capabilities significantly enhances the robot’s situational awareness and ability to navigate dynamic environments.
5. Quadrupedal Robots
5.1. KITTI Dataset
5.2. Robotic Operation Simulation
5.3. Evaluation Metrics
6. Results
7. Discussion
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Singh, S. Critical Reasons for Crashes Investigated in the National Motor Vehicle Crash Causation Survey; National Highway Traffic Safety Administration: Washington, DC, USA, 2015; DOT HS 812 115.
- Alaba, S.Y.; Ball, J.E. A survey on deep-learning-based LiDAR 3D object detection for autonomous driving. Sensors 2022, 22, 9577.
- Pieropan, A.; Bergström, N.; Ishikawa, M.; Kjellström, H. Robust 3D tracking of unknown objects. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015.
- Hoffmann, J.E.; Tosso, H.G.; Santos, M.M.D.; Justo, J.F.; Malik, A.W.; Rahman, A.U. Real-time adaptive object detection and tracking for autonomous vehicles. IEEE Trans. Intell. Veh. 2020, 6, 450–459.
- Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
- Wang, K.; Zhou, T.; Li, X.; Ren, F. Performance and challenges of 3D object detection methods in complex scenes for autonomous driving. IEEE Trans. Intell. Veh. 2022, 8, 1699–1716.
- Yurtsever, E.; Lambert, J.; Carballo, A.; Takeda, K. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access 2020, 8, 58443–58469.
- Zhu, M.; Derpanis, K.G.; Yang, Y.; Brahmbhatt, S.; Zhang, M.; Phillips, C.; Lecce, M.; Daniilidis, K. Single-image 3D object detection and pose estimation for grasping. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014.
- Poirson, P.; Ammirato, P.; Berg, A.; Kosecka, J. Fast single shot detection and pose estimation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016.
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Zhou, Y.; Tuzel, O. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
- Fan, L.; Pang, Z.; Zhang, T.; Wang, Y.; Zhao, H.; Wang, F.; Wang, N.; Zhang, Z. Embracing single stride 3D object detector with sparse transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28.
- Shi, S.; Wang, X.; Li, H. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
- Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
- Alaba, S.Y.; Ball, J.E. Deep Learning-Based Image 3-D Object Detection for Autonomous Driving. IEEE Sens. J. 2023, 23, 3378–3394.
- Wang, H.; Yu, Y.; Cai, Y.; Chen, X.; Chen, L.; Li, Y. Soft-weighted average ensemble vehicle detection method based on single-stage and two-stage deep learning models. IEEE Trans. Intell. Veh. 2020, 6, 100–109.
- Pepik, B.; Stark, M.; Gehler, P.; Schiele, B. Teaching 3D geometry to deformable part models. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012.
- Mottaghi, R.; Xiang, Y.; Savarese, S. A coarse-to-fine model for 3D pose estimation and sub-category recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
- Tulsiani, S.; Malik, J. Viewpoints and keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
- Pavlakos, G.; Zhou, X.; Chan, A.; Derpanis, K.G.; Daniilidis, K. 6-DoF object pose from semantic keypoints. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017.
- Linder, T.; Pfeiffer, K.Y.; Vaskevicius, N.; Schirmer, R.; Arras, K.O. Accurate detection and 3D localization of humans using a novel YOLO-based RGB-D fusion approach and synthetic training data. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020.
- Ali, W.; Abdelkarim, S.; Zidan, M.; Zahran, M.; El Sallab, A. YOLO3D: End-to-end real-time 3D oriented object bounding box detection from LiDAR point cloud. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018.
- Sallab, A.E.; Sobh, I.; Zahran, M.; Essam, N. LiDAR sensor modeling and data augmentation with GANs for autonomous driving. arXiv 2019, arXiv:1905.07290.
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
- Paul, R.; Newman, P. FAB-MAP 3D: Topological mapping with spatial and visual appearance. In Proceedings of the 2010 IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–7 May 2010.
- Priya, M.V.; Pankaj, D.S. 3DYOLO: Real-time 3D Object Detection in 3D Point Clouds for Autonomous Driving. In Proceedings of the 2021 IEEE International India Geoscience and Remote Sensing Symposium (InGARSS), Ahmedabad, India, 6–10 December 2021.
- Demilew, S.S.; Aghdam, H.H.; Laganière, R.; Petriu, E.M. FA3D: Fast and Accurate 3D Object Detection. In Proceedings of the Advances in Visual Computing: 15th International Symposium, ISVC 2020, San Diego, CA, USA, 5–7 October 2020; Proceedings, Part I.
- Feng, S.; Liang, P.; Gao, J.; Cheng, E. Multi-Correlation Siamese Transformer Network with Dense Connection for 3D Single Object Tracking. IEEE Robot. Autom. Lett. 2023, 8, 8066–8073.
- Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
- Xiang, Y.; Choi, W.; Lin, Y.; Savarese, S. Subcategory-aware convolutional neural networks for object proposals and detection. arXiv 2016, arXiv:1604.04693.
- Xiang, Y.; Mottaghi, R.; Savarese, S. Beyond PASCAL: A benchmark for 3D object detection in the wild. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Steamboat Springs, CO, USA, 24–26 March 2014.
- Xiang, Y.; Choi, W.; Lin, Y.; Savarese, S. Data-driven 3D voxel patterns for object category recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
- Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3D object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
- Jiang, Q.; Hu, C.; Zhao, B.; Huang, Y.; Zhang, X. Scalable 3D Object Detection Pipeline with Center-Based Sequential Feature Aggregation for Intelligent Vehicles. IEEE Trans. Intell. Veh. 2023, 9, 1512–1523.
- Chen, X.; Kundu, K.; Zhu, Y.; Berneshawi, A.; Ma, H.; Fidler, S.; Urtasun, R. 3D object proposals for accurate object class detection. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015.
- Chabot, F.; Chaouch, M.; Rabarisoa, J.; Teulière, C.; Chateau, T. Deep MANTA: A Coarse-to-Fine Many-Task Network for Joint 2D and 3D Vehicle Analysis from Monocular Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Mousavian, A.; Anguelov, D.; Flynn, J.; Kosecka, J. 3D Bounding Box Estimation Using Deep Learning and Geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Brazil, G.; Liu, X. M3D-RPN: Monocular 3D Region Proposal Network for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
- Xu, B.; Chen, Z. Multi-Level Fusion Based 3D Object Detection from Monocular Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
- Simonelli, A.; Bulò, S.R.; Porzi, L.; López-Antequera, M.; Kontschieder, P. Disentangling Monocular 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
- Chen, X.; Kundu, K.; Zhu, Y.; Ma, H.; Fidler, S.; Urtasun, R. 3D Object Proposals Using Stereo Imagery for Accurate Object Class Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1259–1272.
- Wang, Y.; Chao, W.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-LiDAR From Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Ku, J.; Pon, A.D.; Waslander, S.L. Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Li, B.; Ouyang, W.; Sheng, L.; Zeng, X.; Wang, X. GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Li, P.; Zhao, H.; Liu, P.; Cao, F. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020.
- Kundu, A.; Li, Y.; Rehg, J.M. 3D-RCNN: Instance-Level 3D Object Reconstruction via Render-and-Compare. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
- Qin, Z.; Wang, J.; Lu, Y. MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019.
- Qi, C.; Liu, W.; Wu, C.; Su, H.; Guibas, L. Frustum PointNets for 3D Object Detection from RGB-D Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
- Roddick, T.; Kendall, A.; Cipolla, R. Orthographic Feature Transform for Monocular 3D Object Detection. arXiv 2018, arXiv:1811.08188. Available online: https://arxiv.org/abs/1811.08188 (accessed on 1 June 2024).
- Ma, X.; Wang, Z.; Li, H.; Ouyang, W.; Fan, X.; Liu, J. Accurate Monocular 3D Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
- Ding, M.; Huo, Y.; Yi, H.; Wang, Z.; Shi, J.; Lu, J.; Luo, P. Learning Depth-Guided Convolutions for Monocular 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020.
- Manhardt, F.; Kehl, W.; Gaidon, A. ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Chen, Y.; Tai, L.; Sun, K.; Li, M. MonoPair: Monocular 3D Object Detection Using Pairwise Spatial Relationships. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
- Liu, L.; Lu, J.; Xu, C.; Tian, Q.; Zhou, J. Deep Fitting Degree Scoring Network for Monocular 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Liu, Z.; Wu, Z.; Tóth, R. SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
- Brazil, G.; Pons-Moll, G.; Liu, X. Kinematic 3D Object Detection in Monocular Video. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020.
- You, Y.; Wang, Y.; Chao, W.-L.; Chen, D. Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving. arXiv 2019, arXiv:1906.06310. Available online: https://arxiv.org/abs/1906.06310 (accessed on 1 June 2024).
- Rajani, D.M.; Swayampakula, R.K. OriCon3D: Effective 3D Object Detection using Orientation and Confidence. arXiv 2023, arXiv:2304.14484. Available online: https://arxiv.org/abs/2304.14484 (accessed on 1 June 2024).
- Pham, C.; Jeon, J. Robust object proposals re-ranking for object detection in autonomous driving. Signal Process. Image Commun. 2017, 18, 3232–3244.
- Simonelli, A.; Bulò, S.R.; Porzi, L.; Ricci, E.; Kontschieder, P. Towards Generalization Across Depth for Monocular 3D Object Detection. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020.
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D Proposal Generation and Object Detection from View Aggregation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018.
- Ma, X.; Liu, S.; Xia, Z.; Zhang, H. Rethinking Pseudo-LiDAR Representation. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020.
- Simonelli, A.; Bulò, S.R.; Porzi, L.; Ricci, E.; Kontschieder, P. Single-Stage Monocular 3D Object Detection with Virtual Cameras. arXiv 2019, arXiv:1912.08035. Available online: https://arxiv.org/abs/1912.08035 (accessed on 1 June 2024).
- Xu, D.; Anguelov, D.; Jain, A. PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
- Li, P.; Chen, X.; Shen, S. Stereo R-CNN Based 3D Object Detection for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Zhang, Y.; Lu, J.; Zhou, J. Objects are Different: Flexible Monocular 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021.
- Ma, X.; Zhang, Y.; Xu, D.; Wang, D. Delving into Localization Errors for Monocular 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021.
- Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A Unified Multi-Scale Deep Convolutional Neural Network for Fast Object Detection. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016.
- Weber, M.; Fürst, M.; Zöllner, J.M. Automated Focal Loss for Image Based Object Detection. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020.
- Gustafsson, F.; Linder-Norén, E. 3D Object Detection for Autonomous Driving Using KITTI without Velodyne Data. GitHub repository. 2023. Available online: https://github.com/fregu856/3DOD_thesis (accessed on 1 June 2024).
- MMDetection3D Documentation. 3D Object Detection Using KITTI Dataset. MMDetection3D Library Documentation. 2024. Available online: https://mmdetection3d.readthedocs.io (accessed on 1 June 2024).
- Linder, T.; Zhang, F.; Hager, G. Real-time 3D Human Detection Using YOLOv5 with RGB+D Fusion. J. Robot. Autom. 2023, 35, 123–135.
- Zhang, Y.; Liu, X. Enhancing 3D Object Recognition with LiDAR Data. Int. J. Comput. Vis. 2023, 35, 123–135.
- Chen, W. YOLO3D: A Novel Approach for Real-time 3D Object Detection. IEEE Trans. Robot. 2024; in press.
| Object Category | Precision (%) | Recall (%) | Average Precision (%) | F1 Score (%) |
|---|---|---|---|---|
| Car | 98.7 | 94.5 | 96.6 | 96.5 |
| Pedestrian | 93.8 | 90.3 | 92.0 | 91.9 |
| Cyclist | 95.2 | 91.7 | 93.4 | 93.3 |
| Truck | 94.5 | 90.2 | 92.3 | 92.2 |
| Person (sitting) | 92.1 | 88.6 | 90.3 | 90.2 |
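As a consistency check, the F1 column in the table matches the usual harmonic mean of precision and recall, assuming that definition is used; for the Car class,

$$F_1 = \frac{2PR}{P + R} = \frac{2 \times 98.7 \times 94.5}{98.7 + 94.5} \approx 96.5\%.$$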
| Method | Precision (%) | Recall (%) | Average Precision (%) |
|---|---|---|---|
| Mono3D | 88.7 | 86.6 | 86.6 |
| 3DOP | 90.7 | 89.0 | 89.9 |
| SubCNN | 91.3 | 89.4 | 88.1 |
| Model | AOS | AP | OS |
|---|---|---|---|
| 3DOP [42] | 91.44 | 93.04 | 98.28 |
| Mono3D [40] | 91.01 | 92.33 | 98.57 |
| SubCNN [37] | 90.67 | 90.81 | 99.84 |
| Ours | 92.90 | 92.98 | 99.91 |
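The orientation score (OS) column appears to be the ratio of average orientation similarity (AOS) to average precision (AP), a convention that reproduces every row of the table; for our model,

$$\mathrm{OS} = \frac{\mathrm{AOS}}{\mathrm{AP}} = \frac{92.90}{92.98} \approx 99.91\%.$$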