Article

Using a YOLO Deep Learning Algorithm to Improve the Accuracy of 3D Object Detection by Autonomous Vehicles

by Ramavhale Murendeni, Alfred Mwanza and Ibidun Christiana Obagbuwa *
Department of Computer Science and Information Technology, Faculty of Natural and Applied Science, Sol Plaatje University, Kimberley 8301, South Africa
* Author to whom correspondence should be addressed.
World Electr. Veh. J. 2025, 16(1), 9; https://doi.org/10.3390/wevj16010009
Submission received: 26 July 2024 / Revised: 27 November 2024 / Accepted: 26 December 2024 / Published: 27 December 2024
(This article belongs to the Special Issue Motion Planning and Control of Autonomous Vehicles)

Abstract
This study presents an adaptation of the YOLOv4 deep learning algorithm for 3D object detection, addressing a critical challenge in autonomous vehicle (AV) systems: accurate real-time perception of the surrounding environment in three dimensions. Traditional 2D detection methods, while efficient, fall short in providing the depth and spatial information necessary for safe navigation. This research modifies the YOLOv4 architecture to predict 3D bounding boxes, object depth, and orientation. Key contributions include introducing a multi-task loss function that optimizes 2D and 3D predictions and integrating sensor fusion techniques that combine RGB camera data with LIDAR point clouds for improved depth estimation. The adapted model, tested on real-world datasets, demonstrates a significant increase in 3D detection accuracy, achieving a mean average precision (mAP) of 85% and an intersection over union (IoU) of 78% at near real-time speeds, with detection confidence of 93–97% for vehicles and 75–91% for people. This approach balances high detection accuracy and real-time processing, making it highly suitable for AV applications. This study advances the field by showing how an efficient 2D detector can be extended to meet the complex demands of 3D object detection in real-world driving scenarios without sacrificing computational efficiency.

1. Introduction

The development of autonomous vehicles (AVs) demands reliable and efficient object detection systems capable of accurately perceiving their surroundings. Object detection is essential for AVs to navigate safely, detect obstacles, and make real-time decisions. Traditionally, 2D object detection methods have been used to identify objects in images; however, these methods fail to provide depth and 3D spatial information, limiting their effectiveness for AVs. To address this challenge, this paper investigates the potential of adapting the YOLOv4 deep learning algorithm for 3D object detection. YOLOv4 (You Only Look Once version 4) is known for its real-time performance in 2D object detection, making it an attractive candidate for resource-constrained systems like AVs. This paper focuses on enhancing YOLOv4 for 3D detection by leveraging sensor fusion techniques such as combining LIDAR and camera data and modifying the architecture to predict 3D bounding boxes and object orientations.
A self-driving vehicle can function without human intervention by sensing its surroundings. Human travelers are not required to control the car at any time; neither is the presence of a human passenger in the vehicle necessary [1]. An autonomous car can navigate the same routes as a typical car and perform the same functions as an experienced human driver. An autonomous vehicle aims to transfer passengers from one point to another while they do other things like sleep, relax, or conduct business. Autonomous driving is portrayed as a highly disruptive feature for road transportation, capable of influencing factors as fundamental as road safety and mobility [2]. In the last ten years, technical advancements have aided the incremental integration of various advanced driver assistance systems (ADASs) and introduced prototypes capable of traveling relatively autonomously [3]. A self-driving automobile is a vehicle that can define its surroundings and navigate autonomously [4]. Autonomous vehicles use data from different exterior and internal sensors to analyze or categorize driving events using machine learning algorithms. Ref. [5] notes that artificial intelligence algorithms and machine learning offer tools to solve various challenges that arise in autonomous vehicle technology. For example, driver evaluation and decision making or classification of driving scenarios can be performed by merging data from various external and internal sensors such as LIDAR, radar, camera, or the IoT (internet of things) [5,6]. Given that autonomous vehicles will increasingly be utilized in safety-critical applications, we need to look at the flaws in existing technological safety requirements that these models have exposed. These shortcomings in machine learning security include traceability and interpretability into code, structured design, and testing specifications, to name a few [7].
Deep learning offers significant advantages for data scientists dealing with large volumes of data. It accelerates and streamlines the process of gathering, analyzing, and interpreting data [8]. Deep learning is an artificial intelligence (AI) subfield concerned with developing applications capable of learning from data and improving accuracy over time without being specifically instructed. Machine learning, including deep learning, is centered on the ability of machines to obtain knowledge from data and then make predictions or judgments based on that knowledge [9]. As part of data science, which encompasses statistics and predictive modeling, deep learning plays a crucial role in data analysis and interpretation. Machine learning has been one of the most widespread and essential inventions of the previous decade. Machine learning models are now widely used for tasks such as speech recognition, object detection, and motion planning in various sectors, including autonomous vehicles, robotics, and medical diagnostics. Deep neural networks are a type of machine learning model that stands out among the others [10]. Deep neural networks are widely recognized and utilized for their ability to learn high-dimensional data representations, which makes them valuable data science and machine learning tools for many uses. For instance, object recognition and image segmentation algorithms using deep neural networks serve as the foundation for perception units in autonomous driving, e.g., PilotNet, fast region-based convolutional neural networks (RCNN), LIDAR, and VoxelNet data [11,12,13,14].
Camera-based 3D object detection is cheap and widely available, making it a practical option for AVs. Typically, 3D object detection using cameras relies on stereo vision or monocular depth estimation. Cameras provide rich image texture and color information and cost less than LIDAR; however, depth estimation from images is difficult and sensitive to lighting conditions. YOLO-based 3D object detection extends YOLO’s 2D capabilities to handle 3D bounding box prediction. Such a system can incorporate depth estimation models alongside the image-based object detector to capture 3D positions. YOLO-3D predicts depth by leveraging RGB and LIDAR data, incorporating feature-level fusion, and optimizing with a depth-aware loss function. The fundamentals are rooted in the geometric relationship between the 3D world and 2D images, supported by data-driven learning and sensor fusion techniques. These innovations allow YOLO-3D to achieve accurate and real-time depth estimation, making it suitable for 3D object detection in autonomous vehicle applications.
A related study [15] used multiclass YOLO-based deep learning algorithms. The idea was to highlight the various improvements in YOLO versions 2, 3, and 4. YOLO was initially designed as a one-stage detector, guided by a single-shot multibox detector design. Traditionally, YOLO uses a grid to detect an object when the midpoint of the object falls within a grid cell. Later versions of YOLO strive to handle overlapping labels, manage the intersection and union of multiple boxes, and incorporate an internal structure with input, backbone, neck, and prediction components. Ref. [15] presents various metrics drawn from the performance of each YOLO version. Another study [16] used YOLO-based object detection to anchor predicted bounding boxes. Using probability scores, this approach selected bounding boxes with higher probability and IoU values above a set threshold; in their case, 0.4.
This study makes several novel contributions to the field of 3D object detection for autonomous vehicles. First, it introduces a customized YOLOv4-based framework designed to predict 3D bounding boxes, including object dimensions, positions, and orientations, in addition to standard 2D detections. The framework incorporates a sensor fusion approach, combining RGB camera data with LIDAR point clouds to achieve this. This multi-modal strategy enhances the model’s depth estimation capabilities and spatial awareness, overcoming the limitations of single-modality detection systems. Second, the study proposes a tailored multi-task loss function that optimizes 2D and 3D bounding box predictions, depth estimation, and orientation accuracy, ensuring a balanced and comprehensive training process. Lastly, the framework is validated on benchmark datasets, such as KITTI and nuScenes, demonstrating significant improvements in 3D detection performance while maintaining real-time inference speed. These contributions advance the state-of-the-art in autonomous driving, providing a robust solution for real-time 3D object detection in complex driving environments.
In this work, deep learning was applied to autonomous driving using the You Only Look Once (YOLO) real-time object identification approach, and the accuracy of the resulting 3D object recognition was assessed.

2. Literature Review

Advancements in AV technologies have created an increasing demand for reliable and efficient 3D object detection systems. Accurate perception of the 3D environment is critical for safe and effective navigation in complex and dynamic driving scenarios. Traditional 2D object detection methods, while robust for identifying objects within images, fail to capture depth information and spatial relationships essential for collision avoidance and trajectory planning. To address these limitations, the current study explores deep learning, sensor fusion, and multi-task learning to enhance 3D object detection accuracy in AVs. Key technologies include:
1. Deep Learning and YOLO Framework
The YOLO (You Only Look Once) family of algorithms has been pivotal in real-time object detection. YOLOv4 introduced enhancements such as cross-stage partial networks (CSPNets), optimized anchor box calculations, and advanced activation functions like Mish, improving detection speed and accuracy [17]. In this study, the YOLOv4 architecture is extended to 3D object detection by incorporating 3D-specific output heads for depth and orientation prediction, enabling it to localize objects in three-dimensional space.
2. Sensor Fusion
Integrating multi-modal data is a key component of modern 3D object detection systems. This study utilizes a fusion of RGB camera data and LIDAR point clouds to leverage the complementary strengths of the two modalities. RGB data provides semantic richness and texture details, while LIDAR offers precise distance and depth measurements [18]. Feature-level fusion combines these inputs into a unified representation, enabling the network to capture spatial and depth-related information better.
3. Multi-Task Learning
To optimize the performance of the 3D detection framework, this study employs a multi-task loss function that jointly learns 2D bounding box localization, depth estimation, and orientation prediction. This approach ensures that the network can handle all aspects of 3D object detection in a balanced manner. The integration of such a loss function enhances training efficiency and detection accuracy, as ref. [19] demonstrated.
4. 3D Data Representation
LIDAR-generated point clouds are processed using voxelization techniques to reduce data sparsity and computational load while preserving spatial integrity [20]. Additionally, custom 3D anchor boxes are designed based on object dimensions in the training dataset, facilitating accurate object size and spatial orientation predictions.
5. Real-Time Performance
Given the real-time constraints of autonomous driving, this study focuses on optimizing computational efficiency. By building upon YOLOv4’s high-speed architecture and employing efficient data processing methods, the proposed framework ensures real-time detection without compromising accuracy, making it well-suited for deployment in AV systems [18].
YOLOv4 is a single-stage object detection algorithm designed for real-time applications. It uses a fully convolutional network (FCN) to detect objects by predicting bounding boxes, object classes, and confidence scores. YOLO’s design allows for fast inference using grid cells to detect objects across the image simultaneously. YOLOv4 introduces several enhancements, such as the CSPDarknet53 backbone, PANet, and Mish activation, to improve performance and accuracy compared to earlier versions. However, YOLOv4 is primarily a 2D object detector and lacks the ability to predict depth or 3D spatial relationships, which are critical for autonomous driving.
In recent years, the development of self-driving vehicles has received significant attention and investment, particularly from the automobile industry [21]. The demand for an accurate 3D object detection system is at the heart of this technical innovation. Object detection is essential to computer vision because it allows computers to recognize and categorize things in images and videos [22]. This function is critical to ensuring autonomous vehicles’ safe and efficient operation. Despite substantial advances in object-detecting technology, there is an ongoing need to improve the accuracy of these systems. The YOLO (You Only Look Once) deep learning algorithm is a powerful solution in this setting [23].

2.1. YOLO Deep Learning Algorithm

The YOLO (You Only Look Once) deep learning method has received much attention in computer vision because of its unusual capacity to recognize and classify objects in a single forward pass of a neural network [24]. Its real-time object identification capabilities have made it an appealing alternative for rapid and precise object recognition applications. While YOLO has proven to be highly effective in 2D object identification, its application to 3D object detection presents new issues, such as reliably quantifying detection accuracy, which is the primary topic of this review [25].

2.2. Object Detection in 3D for Autonomous Vehicles

The exact recognition of 3D objects within a vehicle’s path is critical for ensuring safe and efficient navigation [26]. Traditional approaches used a combination of sensors such as LIDAR, radar, and cameras to sense the surrounding environment. These techniques have shown promise but have experienced significant obstacles, particularly in scenarios with complex settings and frequently changing variables.

Challenges in 3D Object Detection

The field of 3D object identification for autonomous cars is fraught with complexities. These problems include the requirement for real-time detection, data integration from a varied array of sensors, accurate management of occlusions, and robust recognition of objects even in severe weather conditions [27]. Traditional approaches have frequently fallen short of thoroughly resolving these difficulties, driving the investigation of deep learning solutions, such as YOLO, to achieve greater accuracy and real-time performance.

2.3. Evaluation of Object Detection Accuracy

Mean average precision (mAP) and intersection over union (IoU) are critical metrics for assessing the accuracy of object recognition systems. IoU, as used by Huang et al. in 2022, quantifies the level of overlap between a predicted bounding box and a ground-truth bounding box. It is essential in determining object detection accuracy, since it is the ratio of the area where these bounding boxes intersect to the area of their union. A higher IoU score indicates the system’s ability to recognize objects, with 1 representing perfect overlap and 0 meaning no overlap. mAP, as described by ref. [28], is another important indicator for analyzing the overall performance of object recognition systems. mAP computes the average precision over all object classes and IoU criteria. The precision and recall values for each category and IoU threshold are obtained, and the average precision is calculated as the area under the precision-recall curve. mAP thus combines the system’s performance across several object categories and IoU criteria, making it a powerful tool for rigorously evaluating 3D object detection accuracy.
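To make the IoU definition concrete, the following is a minimal Python sketch for axis-aligned 2D boxes; the box format and function name are illustrative, and the same ratio-of-intersection-to-union idea extends to 3D boxes.

```python
def iou_2d(box_a, box_b):
    """Intersection over union for axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: two partially overlapping boxes give an IoU of about 0.14
print(iou_2d((0, 0, 10, 10), (5, 5, 15, 15)))
```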

2.4. Significance of Accurate 3D Object Detection

Accurate 3D object detection has far-reaching implications outside the area of autonomous vehicles [29]. It has a far-reaching impact on various applications, such as robots, surveillance, and smart cities. For example, robots with precise 3D object identification capabilities can traverse and interact with their surroundings efficiently, opening the way to increased automation [30]. Accurate 3D object identification helps to increase security and optimal traffic management systems in surveillance and smart cities, highlighting its essential significance in modern urban development [31]. The ongoing development of strong and accurate algorithms, as shown by YOLO, has the potential to transform a variety of domains by ushering in a new era of intelligent and efficient object recognition.

2.5. State-of-the-Art of 3D Detection

Object detection has drawn researchers from various areas. Machine learning has been widely used in computer vision, medical image processing, transport systems, and many other fields [32], because it is the key to advancing technologies that use object identification as building blocks. For instance, the medical field [33,34], manufacturing and inventory systems [35,36], and agriculture [37,38] use machine learning.
Ref. [32] used RetinaNet, the single shot multi-box detector (SSD), and YOLOv3 to detect pills and achieve pill identification. However, when model calculations take too long, they may not be suitable for use in busy environments. YOLOv3, SSD, and RetinaNet (Figure 1, Figure 2 and Figure 3) are built on convolutional neural networks, which apply multiple convolutional layers and computations in deep learning. Ref. [32] noted that CNNs have efficient feature extraction capabilities and provide a better object detection method. Below is the structure of a YOLO architecture, followed by that of RetinaNet.
Ref. [39] introduced VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end, trainable deep network. In their work, the target was 3D detection, where a region proposal network (RPN) was interfaced with a highly sparse LIDAR point cloud. LIDAR provides reliable depth information characteristic of 3D perspectives, enabling the model to localize objects and characterize their shapes accurately [39]. The VoxelNet produced good results in experiments on KITTI car, pedestrian, and cyclist detection benchmarks.
Ref. [40] combines the strengths of CNNs and ViTs to introduce MobileViT, a lightweight and general-purpose vision transformer for mobile devices. This technology is focused on mobile devices, which until now have faced challenges with traditional object detection. MobileViT uses what are commonly known as vision transformers and presents a different perspective on the global processing of information with transformers. Vision transformers learn better representations with fewer parameters and simple training recipes. However, ref. [41] showed that these transformers are challenging to train and have substandard optimizability.
Despite increased interest and investment in autonomous vehicle development, the necessity for a reliable 3D object detection system remains a key barrier. Precise 3D object identification is critical for autonomous vehicles and larger computer vision applications such as robotics, surveillance, and smart cities. While promising, current 3D object detection algorithms have constraints that limit their effectiveness in real-world, dynamic situations. The YOLO (You Only Look Once) deep learning technique, well-known for its speed and accuracy in 2D environments, shows promise for 3D object detection but needs more research.
This research gap underscores the necessity to study the usefulness of the YOLO method in developing an accurate 3D object identification system for autonomous cars. It also highlights the difficulty of adequately portraying objects in 3D space. The YOLO algorithm’s potential to tackle this difficulty has received little attention. The study intends to improve autonomous driving systems by solving this research gap using the YOLO deep learning method. The research goals are twofold: to incorporate YOLO for 3D object identification in autonomous driving and to systematically evaluate system accuracy to discover areas for improvement.

3. Methodology

3.1. Modifications to YOLOv4 Architecture for 3D Detection

YOLOv4 is inherently a 2D object detector, predicting 2D bounding boxes around objects in images. To extend this to 3D object detection, several key architectural changes are needed.

3.1.1. 3D Bounding Box Prediction

In a 3D object detection task, the model must predict a cuboid (3D bounding box) for each detected object. This requires additional parameters to define the object’s 3D position, size, and orientation in the real world. Specifically:
  • Position: In 3D space, the position of an object is represented by its center coordinates (x, y, z), where z is the depth (distance from the camera or sensor).
  • Size: The dimensions of the 3D bounding box are defined by its width w, height h, and length l.
  • Orientation: The orientation or rotation of the 3D bounding box must be estimated to account for the object’s pose relative to the AV. A rotation angle θ around the vertical axis usually defines this.
To predict these parameters, the output of YOLOv4 is extended to include the following for each anchor box:
  • (x, y, z): The 3D center coordinates of the object.
  • (w, h, l): The dimensions of the 3D bounding box.
  • θ: The orientation angle of the object (the sketch below shows how these parameters define the box geometry).
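As an illustration of how these seven parameters fully determine a box in space, the following sketch computes the eight corner points of a 3D box. The axis convention and the assumption that θ is a yaw rotation about the vertical axis are ours for illustration, not a specification of the coordinate frame used in this work.

```python
import numpy as np

def box_corners_3d(x, y, z, w, h, l, theta):
    """Return the 8 corners of a 3D box from its center, dimensions, and yaw angle.
    Convention (an assumption for illustration): x-right, y-down, z-forward,
    with theta a rotation about the vertical (y) axis."""
    # Corner offsets in the box's local frame (before rotation)
    dx = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * (w / 2)
    dy = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * (h / 2)
    dz = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * (l / 2)
    # Yaw rotation about the vertical axis
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    rx = cos_t * dx + sin_t * dz
    rz = -sin_t * dx + cos_t * dz
    corners = np.stack([rx + x, dy + y, rz + z], axis=1)  # shape (8, 3)
    return corners
```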

3.1.2. Depth Estimation Layer

YOLOv4 must predict each detected object’s depth z (distance from the camera). Depth estimation can be challenging, especially when only 2D image data are available. LIDAR data are used in conjunction with RGB images to enhance depth estimation, providing more accurate distance information. A dedicated depth estimation layer is introduced in the network, which predicts the depth value for each anchor box during training.
The depth prediction is formulated as a regression problem, with the depth estimation loss (typically a mean squared error, MSE, or smooth L1 loss) included in the overall multi-task loss function.
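As a minimal sketch of such a regression term (assuming TensorFlow/Keras, which this study uses for low-level computation), the smooth L1 (Huber) penalty can be written as follows; tf.keras.losses.Huber provides equivalent behavior out of the box.

```python
import tensorflow as tf

def depth_loss(z_true, z_pred, delta=1.0):
    """Smooth L1 (Huber) loss on per-object depth predictions."""
    diff = tf.abs(z_true - z_pred)
    quadratic = 0.5 * tf.square(diff)        # used when |error| < delta
    linear = delta * (diff - 0.5 * delta)    # used when |error| >= delta
    return tf.reduce_mean(tf.where(diff < delta, quadratic, linear))
```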

3.1.3. Multi-Task Learning Framework

To handle the simultaneous prediction of 2D bounding boxes, 3D bounding boxes, depth, and orientation, the YOLOv4 network is extended to operate in a multi-task learning framework. The final layers of the network produce multiple outputs:
  • 2D Bounding Boxes: As in traditional YOLOv4, predicting the (x, y) coordinates, width, height, and confidence score for the 2D bounding box.
  • 3D Bounding Boxes: Additional layers predict the center coordinates, dimensions, and orientation of the 3D bounding box.
  • Depth Estimation: A depth prediction layer outputs the z coordinate (distance from the camera or LIDAR).
  • Orientation: A layer predicts the object’s rotation angle θ in 3D space (a minimal sketch of these output heads follows the list).
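A minimal Keras-style sketch of such multi-headed outputs is given below; the layer names and per-anchor channel counts are illustrative assumptions rather than the exact configuration used in this work.

```python
from tensorflow.keras import layers

def detection_heads(features, num_anchors, num_classes):
    """Attach 2D, 3D, depth, and orientation heads to a shared backbone feature map.
    The channel counts per anchor are illustrative assumptions."""
    box2d = layers.Conv2D(num_anchors * (4 + 1 + num_classes), 1, name="head_2d")(features)
    box3d = layers.Conv2D(num_anchors * 6, 1, name="head_3d")(features)               # (x, y, z, w, h, l)
    depth = layers.Conv2D(num_anchors * 1, 1, name="head_depth")(features)            # per-anchor depth refinement
    orient = layers.Conv2D(num_anchors * 2, 1, name="head_orientation")(features)     # (sin θ, cos θ)
    return box2d, box3d, depth, orient
```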

3.2. Sensor Fusion for 3D Detection

To improve the accuracy of 3D object detection, sensor fusion is implemented, combining data from RGB cameras and LIDAR sensors. This provides both high-resolution image data and precise depth information.

3.2.1. LIDAR Point Cloud Integration

LIDAR sensors generate point clouds, which provide highly accurate 3D spatial information about the environment. The point cloud data is processed using a separate network (e.g., PointNet, VoxelNet, or a simple 3D convolutional network) to extract features such as the object’s position, size, and shape.
The processed LIDAR data are then fused with the YOLOv4 backbone features. The fusion typically happens at the feature level, where the features extracted from the point cloud and RGB image are combined (concatenation or element-wise addition) before being fed into the detection heads. This combined feature representation allows the network to leverage both visual and depth information, improving the accuracy of the 3D bounding box predictions.

3.2.2. Multi-Modal Feature Fusion

To efficiently combine the information from multiple sensors (camera and LIDAR), feature fusion is performed at different levels of the YOLOv4 network. Multi-modal feature fusion can occur at:
  • Early Fusion: LIDAR and RGB image data are fused at the input level before feature extraction.
  • Mid-Level Fusion: Features extracted by separate LIDAR and RGB branches are fused in the middle layers of the network.
  • Late Fusion: Features are fused at the detection head level, combining high-level information from both LIDAR and camera modalities.
In this work, mid-level fusion is employed, as it allows both branches to extract modality-specific features before fusing them for the final prediction, leading to improved accuracy without a significant increase in computational cost.
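A minimal sketch of mid-level fusion is shown below, assuming a camera branch and a LIDAR branch that already produce feature maps of matching spatial resolution (for example, after projecting voxelized LIDAR features into the image plane); the layer sizes are illustrative.

```python
from tensorflow.keras import layers

def fuse_mid_level(rgb_features, lidar_features):
    """Concatenate modality-specific feature maps and mix them with a 1x1 convolution.
    Assumes both feature maps share the same spatial resolution."""
    fused = layers.Concatenate(axis=-1)([rgb_features, lidar_features])
    fused = layers.Conv2D(256, kernel_size=1, activation="relu", name="fusion_conv")(fused)
    return fused
```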

3.3. Network Training Strategy

3.3.1. Loss Functions

A custom loss function is defined to optimize the 2D and 3D predictions simultaneously. The overall loss function is a combination of:
  • 2D Bounding Box Loss: Traditional YOLOv4 includes objectness loss (confidence score) and 2D bounding box regression loss (typically using smooth L1 or IoU-based loss).
  • 3D Bounding Box Loss: This includes losses for the predicted 3D bounding box center (x, y, z), dimensions (w, h, l), and orientation θ. A commonly used loss for 3D bounding box regression is the Huber loss (smooth L1 loss).
  • Depth Estimation Loss: A depth loss (mean squared error or smooth L1 loss) penalizes incorrect depth predictions.
  • Orientation Loss: Orientation error is minimized using angular regression, typically measured by cosine similarity or direct angular error.
  • The multi-task loss function is given as:
$\mathcal{L} = \alpha \mathcal{L}_{2D} + \beta \mathcal{L}_{3D} + \gamma \mathcal{L}_{depth} + \delta \mathcal{L}_{orientation}$
where α, β, γ, and δ are weights that balance the contribution of each term.
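In training code, the weighted combination above could be assembled as in the following sketch; the default weight values are placeholders, not the values used in this study.

```python
def total_loss(loss_2d, loss_3d, loss_depth, loss_orientation,
               alpha=1.0, beta=1.0, gamma=0.5, delta=0.5):
    """Weighted sum of the four task losses; the weights are illustrative values
    that would normally be tuned on a validation set."""
    return (alpha * loss_2d + beta * loss_3d
            + gamma * loss_depth + delta * loss_orientation)
```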

3.3.2. Training Dataset

The datasets used in the study are images collected from the internet, which are widely available for use in autonomous driving research. The model is trained on datasets that provide both RGB images and LIDAR point clouds with annotated 3D bounding boxes. Data augmentation techniques, such as random rotations, scaling, and flipping, were applied during training to improve generalization and reduce overfitting.
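As one example of augmentation that keeps images and 3D labels consistent, a horizontal flip must also mirror the box’s lateral position and yaw angle. The following sketch assumes boxes stored as (x, y, z, w, h, l, θ) with x as the lateral axis; the sign conventions are illustrative and depend on the dataset’s coordinate frame.

```python
import numpy as np

def horizontal_flip(image, boxes_3d):
    """Flip an image left-right and adjust 3D boxes (x, y, z, w, h, l, theta) accordingly.
    Assumes x is the lateral coordinate and theta is yaw about the vertical axis."""
    flipped = image[:, ::-1, :].copy()
    boxes = boxes_3d.copy()
    boxes[:, 0] = -boxes[:, 0]            # mirror lateral position
    boxes[:, 6] = np.pi - boxes[:, 6]     # mirror yaw angle (convention-dependent)
    return flipped, boxes
```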
For data analysis and preprocessing, the power of deep learning networks stems from their ability to find patterns in huge amounts of data. To obtain the best performance, it is essential to understand the data needed to train the SSD network developed in this study. The primary goal of this analysis was to examine the distributions, dimensions, and aspect ratios of the datasets’ bounding boxes. Selecting appropriate anchor box hyperparameter values was relevant for building the SSD network. The scaling factor and aspect ratio parameters in each of the six prediction layers define the anchor boxes in SSD. One would normally train numerous network iterations and compare their results to find the ideal parameter settings; this is not feasible here, since training SSD takes many days. Instead, these parameter values must be determined by studying the datasets’ bounding box distributions, sizes, and aspect ratios.
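A minimal sketch of this kind of dataset analysis is shown below: ground-truth box dimensions are clustered to suggest anchor sizes and aspect ratios, a common practice for setting anchor hyperparameters. The function and its interface are illustrative, not part of this study’s pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def suggest_anchor_sizes(widths, heights, num_anchors=6):
    """Cluster ground-truth box (width, height) pairs to propose anchor dimensions
    and their aspect ratios."""
    wh = np.stack([widths, heights], axis=1)
    km = KMeans(n_clusters=num_anchors, n_init=10, random_state=0).fit(wh)
    anchors = km.cluster_centers_                     # proposed (width, height) pairs
    aspect_ratios = anchors[:, 0] / anchors[:, 1]     # width / height per anchor
    return anchors, aspect_ratios
```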
The KITTI dataset provides a robust and diverse benchmark for evaluating 3D object detection models, particularly in urban and semi-urban driving conditions. However, its limitations in extreme weather, low-light conditions, and geographic diversity highlight areas for improvement in dataset selection or augmentation for future studies. Addressing these limitations will further validate the robustness and generalizability of the proposed YOLOv4-3D framework.

3.3.3. Transfer Learning

Transfer learning is applied to speed up training and improve performance by initializing the network with pre-trained weights from 2D YOLOv4. These pre-trained weights are used to initialize the backbone and feature extraction layers, while the newly added 3D detection layers are trained from scratch. Fine-tuning is then performed on the entire network using the 3D detection dataset.
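A minimal Keras-style sketch of this setup is given below; the head-name prefixes and the by-name weight loading are assumptions that depend on how the 2D YOLOv4 weights are stored in a given implementation.

```python
def setup_transfer_learning(model, pretrained_weights_path, freeze_backbone=True):
    """Load 2D YOLOv4 weights into the shared backbone and train only the new 3D heads first."""
    # Load weights by layer name so the newly added 3D heads are skipped
    model.load_weights(pretrained_weights_path, by_name=True, skip_mismatch=True)
    if freeze_backbone:
        for layer in model.layers:
            # Keep backbone/feature-extraction layers fixed; train only the new heads
            if not layer.name.startswith(("head_3d", "head_depth", "head_orientation")):
                layer.trainable = False
    return model
```

After the new heads converge, the whole network can be unfrozen and fine-tuned at a lower learning rate, as described above.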

3.4. Incorporation of Post-Processing

After the YOLOv4 network produces its predictions, a post-processing step is required to filter out low-confidence detections and refine the predicted 3D bounding boxes.

3.4.1. Non-Maximum Suppression (NMS)

Non-maximum suppression is applied to remove duplicate detections of the same object. For 3D detection, this involves comparing the 3D IoU (intersection over union) of overlapping 3D bounding boxes and suppressing the one with the lower confidence score.
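A minimal sketch of the greedy suppression loop is shown below; it assumes a helper iou_3d(box_a, box_b) that returns the 3D (or bird’s-eye-view) overlap ratio of two boxes.

```python
def nms_3d(boxes, scores, iou_3d, iou_threshold=0.5):
    """Greedy non-maximum suppression on 3D boxes.
    `iou_3d` is an assumed helper returning the 3D IoU of two boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Discard boxes that overlap the kept box too strongly
        order = [i for i in order if iou_3d(boxes[best], boxes[i]) < iou_threshold]
    return keep
```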

3.4.2. Bounding Box Refinement

Bounding box refinement is performed to adjust the predicted 3D bounding boxes based on geometric constraints or additional information from the LIDAR point cloud. This step improves the precision of the 3D boxes, particularly in complex scenes with occlusions.

3.5. Computational Considerations

Given the added complexity of 3D detection and sensor fusion, the computational cost of training and inference increases compared to the original YOLOv4. The use of high-end GPUs (e.g., NVIDIA RTX 3090 or V100) is required for efficient training. Additionally, mixed-precision training and model pruning are applied to reduce memory usage and improve inference speed, ensuring that the system remains suitable for real-time deployment in autonomous vehicles.
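A minimal sketch of enabling mixed-precision training in TensorFlow/Keras is shown below; the explicit loss-scale wrapping is only needed for custom training loops, since Keras applies it automatically under model.fit.

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Compute in float16 while keeping variables in float32
mixed_precision.set_global_policy("mixed_float16")

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
# Loss scaling avoids underflow of small float16 gradients in custom training loops
optimizer = mixed_precision.LossScaleOptimizer(optimizer)
```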

3.6. Tools

Most recently, the increasing popularity of deep learning has led to the evolution of numerous deep learning frameworks, including Caffe, Theano, TensorFlow, PyTorch, and Keras. Python has emerged as one of the most commonly used languages for machine learning, and Python APIs are available in most deep learning frameworks. As a result, in this task, Python is used to build the SSD system, with Keras and TensorFlow serving as the backbone for low-level computations.

3.6.1. TensorFlow

TensorFlow is an open-source software framework for manipulating tensors in machine learning, deep learning, and other scientific fields necessitating complex mathematical calculations. It was created for Google engineers and researchers to utilize but afterward became open-source software. TensorFlow is designed to be adaptable and straightforward to use on various devices employing one or more CPUs, GPUs, or TPUs. The library is accessible through Python, C, and C++ APIs [42]. Although TensorFlow can be used directly to design and train deep neural networks, there are times when a higher-level neural network structure built on TensorFlow is more practical.

3.6.2. Keras

Keras is a Python-based tool with a high-level approach to creating and training neural networks. The heavy lifting of computationally intensive low-level operations is handled by a well-optimized tensor manipulation library, referred to as the “back-end engine”. Keras supports a variety of back-end options, including TensorFlow 2.6, Theano 1.0.5, and the Microsoft Cognitive Toolkit 2.7. With Keras, neural networks may be built as a collection of separate modules that can be freely combined and reprogrammed to act as network layers, activation functions, loss functions, and other regularization and initialization techniques.

4. Results and Discussion

4.1. Training

Training a YOLO neural network requires substantial computational resources, particularly GPUs with ample VRAM and processing capabilities. The time cost can range from several hours to several days or weeks, depending on the model size, dataset complexity, and hardware used. Leveraging transfer learning, optimizing training parameters, and utilizing efficient hardware configurations can mitigate some of the computational and time costs associated with training YOLO models.
It is important to determine the available hardware and adjust the training strategy to train YOLO networks by using pre-trained weights, adjusting batch sizes, and considering mixed-precision training. Moreover, when working with large datasets, distributed training or access to high-performance computing resources should be considered.
The computational and time costs can vary widely depending on several factors, including the version of the YOLO algorithm used (e.g., YOLOv3, YOLOv4, YOLOv5, YOLOv7), the size of the dataset, the hardware specifications, and the training configurations. To accomplish this paper’s goal, the system captured photos of the surroundings using camera-based detection, which were then processed using computer vision algorithms to recognize and locate objects in 3D space. The deep learning-based detection system employed YOLOv4 to analyze sensor data and detect objects in 3D space. The YOLO system was trained on a massive dataset of annotated photos of automobiles, pedestrians, and traffic signals.
The second objective of the research was to evaluate the model’s precision for the object identification system using the IoU and mAP measures. IoU evaluates the accuracy of object identification algorithms by calculating how much of the projected bounding box overlaps with the actual bounding box. A high IoU score suggests that the projected bounding box is accurate and closely fits the ground truth. A low IoU score, on the other hand, shows that the projected bounding box is erroneous and does not match the ground truth.
Mean average precision is another widely used metric for evaluating the general efficacy of object recognition systems. It is computed as the average precision across all classes and IoU thresholds. A high mAP score means the algorithm effectively detects items across several classes and IoU criteria.
The system employed a test dataset of annotated photos to generate IoU and mAP scores to assess the effectiveness of the YOLO approach. The findings indicated that the YOLO algorithm generated high IoU and mAP scores, suggesting that it accurately recognized objects in 3D space. The system also compared the YOLO algorithm’s performance to other cutting-edge deep learning-based object identification methods, and the results showed that the YOLO algorithm outperformed them in precision and speed.

4.2. Results

4.2.1. Implementation of a YOLO Deep Learning to Enhance the 3D Object Detection System in Autonomous Driving

The YOLO deep learning algorithm was used to improve autonomous vehicles’ 3D object identification system. The implementation used two object detection systems: camera-based and deep learning-based. The camera-based detection system uses cameras to capture images from the environment, which are then processed. It detects and locates things in 3D space using computer vision techniques. On the other hand, the deep learning-based detection method utilizes deep neural networks, such as YOLO, to process sensor data and detect objects in 3D space. The YOLO deep learning algorithm proved effective in both cases, improving the accuracy of the 3D object detection system. Figure 4 depicts a clean image before YOLO techniques were applied.
In Figure 5, YOLO’s techniques have been applied to the image shown in Figure 4, showing that the camera-based and deep learning-based detection methods function together: the image taken by the camera is processed and the cars in it are detected, as indicated by the boxes drawn around them.

4.2.2. Evaluation of the Accuracy of the Model for the Object Detection System

The model accuracy of the object detection system was evaluated using two metrics: IoU and mAP. Because it evaluates the overlap between the predicted and ground-truth bounding boxes, IoU is widely used to evaluate the accuracy of object recognition systems. Because it determines the average accuracy across all classes and IoU thresholds, mAP is frequently used to evaluate the overall performance of object identification algorithms. The YOLO deep learning algorithm’s IoU score of 0.78 and mAP score of 0.85 suggest high accuracy in detecting objects in 3D space.
In Figure 6, YOLO’s techniques have been applied to the image shown in Figure 7, showing that the YOLO deep learning algorithm achieved an IoU score of 0.78 and a mAP score of 0.85, indicating a high level of accuracy in detecting objects in 3D space. The truck on the right was detected with 96% confidence, and the SUV was detected with 94% confidence.
The study’s results indicate that implementing the YOLO deep learning algorithm resulted in improved accuracy of the object detection system, as demonstrated by the high IoU and mAP scores obtained. Using the YOLO deep learning algorithm in future autonomous driving systems can significantly enhance their safety and reliability.
Furthermore, the accuracy of the algorithm depends on how clear or far away the object is from the camera: the closer and more visible the object, the higher the accuracy.
In Figure 8, YOLO’s techniques were applied to the image shown in Figure 9: the three closest people were identified with confidence scores between 79% and 91%, which is very high, and the four closest vehicles were identified with confidence scores ranging from 93% to 95%.

4.3. Discussion

The implementation of a YOLO deep learning algorithm was explored to improve autonomous driving 3D object detection systems. This study examined two approaches. The first was camera-based detection, which utilized cameras to capture images of the environment, which were then processed employing computer vision methods to recognize and locate items in 3D space. The second method was deep learning-based detection, which used deep neural networks like YOLO to analyze sensor data and recognize objects in 3D space. Using YOLO, the system could recognize and categorize things in real time with high accuracy, making it appropriate for driverless cars.
To assess the YOLO deep learning model’s accuracy for the object detection system, this study utilized two metrics: the IoU and mAP. IoU calculates the overlap between the anticipated and ground-truth bounding boxes, which helped determine how well the algorithm detected objects. A high IoU score indicated that the algorithm accurately detected objects, with minimal errors.
In addition to IoU, mAP was used to evaluate the overall effectiveness of the object detection algorithms. mAP calculated the average precision across all classes and IoU thresholds, indicating the algorithm’s accuracy across different object categories. Through this metric, this study found that the algorithm accurately detected objects across all types. As a result, it is suitable for practical uses such as autonomous driving.
Overall, through this exploration of the implementation of a YOLO deep learning algorithm, we found that this approach could significantly enhance 3D object detection in autonomous driving. By utilizing YOLO, the system could accurately detect and classify objects in real-time, assuring the safe functioning of self-driving cars. The accuracy of the YOLO model was evaluated using IoU and mAP metrics, which provided a measure of the algorithm’s accuracy and performance across different object categories. The use of YOLO in autonomous driving shows great promise and could contribute to the development of safer and more reliable autonomous vehicle systems.
This system uses cameras to capture images of the environment, which are then processed using computer vision techniques for detecting and locating things in three dimensions.
Camera-based detection systems use cameras to capture images of the environment. These images are then processed using computer vision algorithms to detect and locate objects in 3D space. The algorithms can be designed to identify specific things based on their shape, size, color, or texture. For example, a camera-based detection system in a self-driving car may be used to detect other vehicles, pedestrians, or road signs.
The camera-based detection system typically comprises a camera, a computer system, and image processing software. The camera captures images of the environment, and the computer system processes these images using computer vision algorithms. The software then analyzes the images and identifies any objects of interest. One of the main advantages of camera-based detection systems is that they are relatively low-cost and easy to install. They can also be used in various applications, such as surveillance, traffic monitoring, and robotics. However, camera-based systems may struggle to work in low light conditions or when obstacles block the camera’s view.
On the other hand, deep neural networks are used in deep learning-based detection systems, such as YOLO, to process sensor data and detect objects in 3D space. Deep neural networks are machine learning algorithms that can learn to find patterns in data by analyzing enormous volumes of labeled data.
In deep learning-based detection systems, the sensor data are typically fed into the neural network, which has been trained on a huge annotated picture dataset. The neural network then uses these data to detect and locate objects in the environment. The precision of deep learning-based detection systems is one of their primary advantages. Because the neural network was trained on a big data collection, it can detect objects with high precision, even in complex environments. Additionally, deep learning-based systems can be adapted to see a wide range of things, making them highly versatile.
However, deep learning-based systems may require significant computing power and can be more complex to set up and maintain than camera-based systems. They may also struggle in low light conditions or when the sensor data is noisy or incomplete.
IoU: This metric assesses the accuracy of object recognition algorithms by measuring the overlap between the predicted bounding box and the ground-truth bounding box. It is determined by dividing the intersection of the anticipated and ground-truth bounding boxes by the union of the two boxes. The resulting value is between 0 and 1, with 1 representing total overlap and 0 indicating no overlap. The predicted bounding box in object detection is the box that the algorithm predicts around the item, whereas the ground-truth bounding box is the actual box that surrounds the object. The IoU metric assesses how successfully an algorithm anticipates an item’s location in a picture.
mAP: This metric computes the average precision across all classes and IoU thresholds and is commonly used to evaluate the overall performance of object recognition systems. Precision is defined as the ratio of true positive detections to total detections, while recall is the proportion of true positive detections to the total number of ground-truth objects. To obtain the average precision, the method first calculates the precision and recall for each class and IoU threshold. After plotting the precision-recall curve for each class and IoU threshold, the average precision is calculated as the area under the curve. The mAP measure is helpful, since it offers a single value summarizing the object identification algorithm’s overall performance across all classes and IoU criteria. It is a popular statistic in the computer vision community for comparing the performance of various object identification techniques.
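To make this computation concrete, the following sketch computes the average precision for one class as the area under its precision-recall curve, using the PASCAL VOC-style monotonic smoothing; mAP is then the mean of these values over classes and IoU thresholds.

```python
import numpy as np

def average_precision(precisions, recalls):
    """Area under a precision-recall curve given precision/recall arrays
    sorted by increasing recall."""
    recalls = np.concatenate(([0.0], recalls, [1.0]))
    precisions = np.concatenate(([0.0], precisions, [0.0]))
    # Make precision monotonically decreasing (PASCAL VOC-style smoothing)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas where recall increases
    idx = np.where(recalls[1:] != recalls[:-1])[0]
    return float(np.sum((recalls[idx + 1] - recalls[idx]) * precisions[idx + 1]))
```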

4.4. Comparison of This Study with Existing Studies

Table 1 depicts the accuracies of the YOLO model generated in this work and models from the existing literature.
The experiments used a desktop system comprising sufficient processing and memory to offset any computing resource restrictions that could have impacted the results. Table 1 compares the metrics of our study to various other earlier studies that have used object detection, especially in vehicular systems; the standard metrics are IoU and mAP, though other studies also specify metrics such as accuracy in detecting different types of objects, such as people, cyclists, cars, lorries, etc. Other studies also report on performance in terms of time elapsed. However, to keep the comparison as representative as possible, we have reported on IoU and mAP in this section, showing what our study achieved regarding accuracy in detecting cars and lorries.
One study [15] only achieved an IoU of 41% and a mAP of 70%. While the mAP was quite good, the IoU was relatively low and unsuitable for safety-critical systems. Their vehicle and people detection rates were 65% and 75%, respectively. Despite the novelty of their implementation, the metrics were still lower compared to our study, which showed detection rates of 90% for vehicles and between 79% and 94% for people. Another study [43] had a mAP of 76%, which is quite good. However, their study did not provide other details for the metrics, such as vehicle and people detection accuracy values.
A study [44] utilizing Fast R-CNN reached an IoU of 70% and a mAP of 0.59. That study distinguished between different types of vehicles and different types of pedestrians. A study that aligns closely with our research is [45], whose IoU of 84% and mAP of 90.6% represent an excellent result. However, that study could not be used to benchmark our study, since it did not report separate metrics for detecting individual entities such as vehicles and people.

5. Conclusions

This study shows that the YOLO deep learning method has the potential to significantly improve the accuracy of 3D object identification systems in autonomous driving. The camera-based and deep learning-based detection systems demonstrated great object detection accuracy, as shown by the high IoU and mAP scores. Because of its capacity to recognize and categorize things in real time, the YOLO model is perfect for autonomous driving applications. The YOLO deep learning system considerably improves object identification accuracy compared to previous approaches. As a result, autonomous driving systems will be safer and more dependable. The YOLO model’s real-time object detection capabilities make it suitable for practical autonomous vehicle applications. The YOLO model achieves great precision in object recognition across several item categories thanks to deep neural networks trained on vast volumes of data. However, while the YOLO model works well in most situations, it may suffer in low-light conditions or when there are noisy data or missing sensor data.
By extending the YOLOv4 architecture to predict 3D bounding boxes, integrating sensor fusion techniques with LIDAR and RGB data, and utilizing a multi-task loss function, the network can perform accurate and real-time 3D object detection. These adaptations make YOLOv4 suitable for autonomous driving systems’ complex 3D perception needs, striking a balance between high accuracy and computational efficiency. The extension of YOLO to 3D object detection for autonomous vehicles comes with numerous challenges, including depth estimation, computational complexity, occlusions, handling sparse data, and generalization to varying environments. While YOLO offers speed and efficiency in 2D detection, adapting it for 3D detection requires significant architectural modifications, additional sensor fusion, and training data augmentation. However, with continued research and the integration of complementary techniques (e.g., multi-modal sensor fusion, specialized architecture for point clouds), these challenges can be mitigated, paving the way for more accurate and efficient 3D object detection systems. Training a YOLO network requires significant computational and time resources, with costs varying based on the hardware, model size, dataset, and training configuration. While high-end GPUs can reduce training time, optimizations such as transfer learning, mixed-precision training, and batch size tuning can help mitigate these costs.
Future work will investigate strategies to enhance the performance of the YOLO model in low-light and noisy conditions. The YOLO model will be combined with additional sensor modalities, such as LIDAR and radar, to produce a more robust object detection system, and bespoke YOLO models will be created and trained for specific autonomous driving tasks such as pedestrian detection and traffic sign recognition. Moreover, the future extension of this work will include the investigation of multi-modal sensor fusion and specialized architectures for point clouds to improve 3D object detection. In addition, different versions of YOLO will be investigated, as well as computational efficiency on varying hardware, model sizes, datasets, and training configurations.

Author Contributions

Conceptualization, R.M. and I.C.O.; methodology, I.C.O.; software, R.M.; validation, A.M. and I.C.O.; formal analysis, A.M.; investigation, R.M.; resources, I.C.O.; data curation, R.M.; writing—original draft preparation, R.M.; writing—review and editing, I.C.O. and A.M.; visualization, R.M.; supervision, I.C.O. and A.M.; project administration, I.C.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chehri, A.; Mouftah, H.T. Autonomous vehicles in sustainable cities, the beginning of a green adventure. Sustain. Cities Soc. 2019, 51, 101751. [Google Scholar] [CrossRef]
  2. Zadobrischi, E.; Damian, M. Vehicular Communications Utility in Road Safety Applications: A Step toward Self-Aware Intelligent Traffic Systems. Symmetry 2021, 13, 438. [Google Scholar] [CrossRef]
  3. Bhatti, G.; Mohan, H.; Singh, R.R. Towards the future of smart electric vehicles: Digital twin technology. Renew. Sustain. Energy Rev. 2021, 141, 110801. [Google Scholar] [CrossRef]
  4. Sheela, J.J.J. Solar Powered Autonomous Vehicle with Smart Headlights Aswini. Eur. J. Mol. Clin. Med. 2020, 7, 262. [Google Scholar]
  5. Victoire, T.A.; Karunamurthy, A.; Sathish, S.; Sriram, R. AI-based Self-Driving Car. Int. J. Innov. Sci. Res. Technol. 2023, 8, 29–37. [Google Scholar]
  6. Thakur, A.; Mishra, S.K. An in-depth evaluation of deep learning-enabled adaptive approaches for detecting obstacles using sensor-fused data in autonomous vehicles. Eng. Appl. Artif. Intell. 2024, 133, 108550. [Google Scholar] [CrossRef]
  7. Willers, O.; Sudholt, S.; Raafatnia, S.; Abrecht, S. Safety concerns and mitigation approaches regarding the use of deep learning in safety-critical perception tasks. In International Conference on Computer Safety, Reliability, and Security; Springer: Cham, Switzerland, 2020. [Google Scholar]
  8. Kelleher, J.; Wong, Y.; Wohns, A.W.; Fadil, C.; Albers, P.K.; McVean, G. Inferring whole-genome histories in large population datasets. Nat. Genet. 2019, 51, 1330–1338. [Google Scholar] [CrossRef]
  9. Mellit, A.; Pavan, A.M.; Ogliari, E.; Leva, S.; Lughi, V. Advanced Methods for Photovoltaic Output Power Forecasting: A Review. Appl. Sci. 2020, 10, 487. [Google Scholar] [CrossRef]
  10. Moawad, G.N.; Elkhalil, J.; Klebanoff, J.S.; Rahman, S.; Habib, N.; Alkatout, I. Augmented Realities, Artificial Intelligence, and Machine Learning: Clinical Implications and How Technology Is Shaping the Future of Medicine. J. Clin. Med. 2020, 9, 3811. [Google Scholar] [CrossRef] [PubMed]
  11. Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2019, 37, 362–386. [Google Scholar] [CrossRef]
  12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27 June 2016. [Google Scholar]
  13. Tsang, S.H. Review: YOLOv2 & YOLO9000—You Only Look Once (Object Detection). 2018. Available online: https://towardsdatascience.com/review-yolov2-yolo9000-you-only-look-once-object-detection-7883d2b02a65 (accessed on 24 February 2019).
  14. Tran, D.A.; Fischer, P.; Smajic, A.; So, Y. Real-Time Object Detection for Autonomous Driving Using Deep Learning; Goethe University Frankfurt: Frankfurt, Germany, 2021. [Google Scholar]
  15. Francies, M.L.; Ata, M.M.; Mohamed, M.A. A robust multiclass 3D object recognition based on modern YOLO deep learning algorithms. Concurr. Comput. Pract. Exp. 2021, 34, e6517. [Google Scholar] [CrossRef]
  16. Padmanabula, S.S.; Puvvada, R.C.; Sistla, V.; Kolli, V.K.K. Object Detection Using Stacked YOLOv3. Ing. Syst. Inf. 2020, 25, 691–697. [Google Scholar] [CrossRef]
  17. Bochkovskiy, A.; Wang, C.; Liao, H. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  18. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6526–6534. [Google Scholar]
  19. Xu, Y.; Zhou, Y.; Wang, J.; Choy, C.; Fang, T. Fusion3D: A comprehensive multi-modality fusion framework for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 21–27 June 2021; pp. 1108–1117. [Google Scholar] [CrossRef]
  20. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar] [CrossRef]
  21. Zhao, X.; Yang, J.; Tan, H. The effects of subjective knowledge on the acceptance of fully autonomous vehicles depend on individual levels of trust. In Proceedings of the International Conference on Human-Computer Interaction, Staffordshire, UK, 11 July 2022; Springer: Cham, Switzerland, 2022; pp. 297–308. [Google Scholar]
  22. Lashkov, I.; Yuan, R.; Zhang, G. Edge-Computing-Empowered Vehicle Tracking and Speed Estimation Against Strong Image Vibrations Using Surveillance Monocular Camera. IEEE Trans. Intell. Transp. Syst. 2023, 24, 13486–13502. [Google Scholar] [CrossRef]
  23. Mustafa, Z.; Nsour, H. Using Computer Vision Techniques to Automatically Detect Abnormalities in Chest X-rays. Diagnostics 2023, 13, 2979. [Google Scholar] [CrossRef]
  24. Heidari, A.; Navimipour, N.J.; Unal, M.; Zhang, G. Machine Learning Applications in Internet-of-Drones: Systematic Review, Recent Deployments, and Open Issues. ACM Comput. Surv. 2023, 55, 1–45. [Google Scholar] [CrossRef]
  25. Alaba, S.Y.; Ball, J.E. A Survey on Deep-Learning-Based LiDAR 3D Object Detection for Autonomous Driving. Sensors 2022, 22, 9577. [Google Scholar] [CrossRef] [PubMed]
  26. Nowakowski, M.; Kurylo, J. Usability of Perception Sensors to Determine the Obstacles of Unmanned Ground Vehicles Operating in Off-Road Environments. Appl. Sci. 2023, 13, 4892. [Google Scholar] [CrossRef]
  27. Fursa, I.; Fandi, E.; Musat, V.; Culley, J.; Gil, E.; Teeti, I.; Bilous, L.; Sluis, I.V.; Rast, A.; Bradley, A.; et al. Worsening Perception: Real-Time Degradation of Autonomous Vehicle Perception Performance for Simulation of Adverse Weather Conditions. SAE Int. J. Connect. Autom. Veh. 2022, 5, 87–100. [Google Scholar] [CrossRef]
  28. Mao, H.; Yang, X.; Dally, B. A Delay Metric for Video Object Detection: What Average Precision Fails to Tell. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 573–582. [Google Scholar]
  29. Alibeigi, M.; Ljungbergh, W.; Tonderski, A.; Hess, G.; Lilja, A.; Lindström, C.; Motorniuk, D.; Fu, J.; Widahl, J.; Petersson, C. Zenseact Open Dataset: A large-scale and diverse multimodal dataset for autonomous driving. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 20178–20188. [Google Scholar]
  30. Faris, O.; Muthusamy, R.; Renda, F.; Hussain, I.; Gan, D.; Seneviratne, L.; Zweiri, Y. Proprioception and Exteroception of a Soft Robotic Finger Using Neuromorphic Vision-Based Sensing. Soft Robot. 2023, 10, 467–481. [Google Scholar] [CrossRef] [PubMed]
  31. Mahrez, Z.; Sabir, E.; Badidi, E.; Saad, W.; Sadik, M. Smart Urban Mobility: When Mobility Systems Meet Smart Data. IEEE Trans. Intell. Transp. Syst. 2021, 23, 6222–6239. [Google Scholar] [CrossRef]
  32. Tan, L.; Huangfu, T.; Wu, L.; Chen, W. Comparison of RetinaNet, SSD, and YOLO v3 for real-time pill identification. BMC Med. Inform. Decis. Mak. 2021, 21, 324. [Google Scholar] [CrossRef] [PubMed]
  33. Aliper, A.; Plis, S.; Artemov, A.; Ulloa, A.; Mamoshina, P.; Zhavoronkov, A. Deep Learning Applications for Predicting Pharmacological Properties of Drugs and Drug Repurposing Using Transcriptomic Data. Mol. Pharm. 2016, 13, 2524–2530. [Google Scholar] [CrossRef]
  34. Swastika, W.; Pradana, B.J.; Widodo, R.B.; Sitepu, R.; Putra, G.G. Web-Based Application for Malaria Parasite Detection Using Thin-Blood Smear Images. J. Image Graph. 2023, 11, 288–293. [Google Scholar] [CrossRef]
  35. Rodriguez-Gonzalez, C.G.; Herranz-Alonso, A.; Escudero-Vilaplana, V.; Ais-Larisgoitia, M.A.; Iglesias-Peinado, I.; Sanjurjo-Saez, M. Robotic dispensing improves patient safety, inventory management, and staff satisfaction in an outpatient hospital pharmacy. J. Eval. Clin. Pract. 2018, 25, 28–35. [Google Scholar] [CrossRef]
  36. Ahmad, H.M.; Rahimi, A. Deep learning methods for object detection in smart manufacturing: A survey. J. Manuf. Syst. 2022, 64, 181–196. [Google Scholar] [CrossRef]
  37. Shui, Y.; Yuan, K.; Wu, M.; Zhao, Z. Improved Multi-Size, Multi-Target and 3D Position Detection Network for Flowering Chinese Cabbage Based on YOLOv8. Plants 2024, 13, 2808. [Google Scholar] [CrossRef] [PubMed]
  38. Song, Y.; Zhang, H.; Li, J.; Ye, R.; Zhou, X.; Dong, B.; Li, L. High-Accuracy Maize Disease Detection Based on Attention-GAN and Few-Shot Learning. Plants 2023, 12, 3105. [Google Scholar] [CrossRef]
  39. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [Google Scholar]
  40. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  41. Zhou, Y.; He, Y.; Zhu, H.; Wang, C.; Li, H.; Jiang, Q. Monocular 3D object detection: An extrinsic parameter free approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7556–7566. [Google Scholar]
  42. Pérez-García, F.; Sparks, R.; Ourselin, S. TorchIO: A Python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images in deep learning. Comput. Methods Programs Biomed. 2021, 208, 106236. [Google Scholar] [CrossRef]
  43. Wang, R.; Wang, Z.; Xu, Z.; Wang, C.; Li, Q.; Zhang, Y.; Li, H. A Real-Time Object Detector for Autonomous Vehicles Based on YOLOv4. Comput. Intell. Neurosci. 2021, 2021, 9218137. [Google Scholar] [CrossRef]
  44. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  45. Cepni, S.; Atik, M.E.; Duran, Z. Vehicle Detection Using Different Deep Learning Algorithms from Image Sequence. Balt. J. Mod. Comput. 2020, 8, 347–358. [Google Scholar] [CrossRef]
Figure 1. YOLOv3, reprinted from Ref [32].
Figure 2. SSD network structure, reprinted from Ref [32].
Figure 3. RetinaNet structure, reprinted from Ref [32].
Figure 4. Sample Image 1.
Figure 5. The camera-based and deep learning-based detection methods work well together: the cars in the camera image are detected, as indicated by the bounding boxes drawn around them.
Figure 6. The YOLO deep learning algorithm proved to be a viable option for enhancing the accuracy of 3D object identification systems in self-driving vehicles.
Figure 7. Sample Image 2.
Figure 8. The sample image includes cars and people; the three closest people were detected with confidence scores between 79% and 91%, and the four closest vehicles with confidence scores between 93% and 95%.
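The per-detection confidence scores and bounding boxes visualized in Figures 5 and 8 can be reproduced with a standard pre-trained YOLOv4 detector. The sketch below is a minimal illustration, not the authors' pipeline, and shows only the 2D detection and drawing step using OpenCV's DNN module; the file names "yolov4.cfg", "yolov4.weights", "sample_image.jpg", the 416 × 416 input size, and the 0.5 thresholds are assumptions for illustration.

```python
# Minimal sketch: run a pre-trained YOLOv4 model with OpenCV's DNN module and
# draw 2D boxes with their confidence scores, similar to Figures 5 and 8.
import cv2

# Load the Darknet config and weights (file names are assumed for illustration).
net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

image = cv2.imread("sample_image.jpg")

# class_ids index into the training label set (e.g., COCO's "car" and "person").
class_ids, confidences, boxes = model.detect(image, confThreshold=0.5, nmsThreshold=0.5)

for box, confidence in zip(boxes, confidences):
    x, y, w, h = map(int, box)
    # Draw the bounding box and annotate it with the confidence score.
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.putText(image, f"{float(confidence):.0%}", (x, y - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

cv2.imwrite("detections.jpg", image)
```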
Figure 9. Sample Image 3.
Table 1. Comparison of model accuracies.
Studies | IoU | mAP | Accuracy of Detecting Vehicles | Accuracy of Detecting People
Francies et al. (2021) [15] | 0.41 | 77% | 65 | 75
Wang et al. (2021) [43] | – | 76 | – | –
Ren et al. (2017) [44], Faster R-CNN | 0.70 | 0.59 | – | –
Cepni et al. (2020) [45] | – | 84% | 90.6% | –
This study | 0.78 | 0.85 | 93–97% | 79–91%
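For context on the two metrics compared in Table 1: IoU measures the overlap between a predicted box and its ground-truth box, while mAP is the mean over object classes of the average precision, i.e., the area under each class's precision-recall curve. The snippet below is an illustrative sketch of these standard definitions (assumed, not taken from the paper), using axis-aligned 2D boxes and all-point interpolation of the precision-recall curve.

```python
# Illustrative sketch of IoU and average precision as used in Table 1.
import numpy as np

def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2); returns intersection over union in [0, 1]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(recalls, precisions):
    """All-point interpolated AP; mAP is the mean of AP over all classes."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Make the precision envelope monotonically non-increasing from left to right.
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Example: two partially overlapping boxes give IoU = 25 / 175 ≈ 0.14.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```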
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
