Electronics
  • Article
  • Open Access

4 May 2025

Implementing High-Speed Object Detection and Steering Angle Prediction for Self-Driving Control

1 Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 81148, Taiwan
2 Department of Fragrance and Cosmetic Science, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Object Detection in Autonomous Driving

Abstract

In the previous work, we proposed LWGSE-YOLOv4-tiny and LWDSG-ResNet18, leveraging depthwise separable and Ghost Convolutions for fast self-driving control while achieving a detection speed of 24.9 FPS. However, the system fell short of Level 4 autonomous driving safety requirements. That is, the control response speed of object detection integrated with steering angle prediction must exceed 39.2 FPS. This study enhances YOLOv11n with dual convolution and RepGhost bottleneck, forming DuCRG-YOLOv11n, significantly improving the object detection speed while maintaining accuracy. Similarly, DuC-ResNet18 improves steering angle prediction speed and accuracy. Our approach achieves 50.7 FPS, meeting Level 4 safety standards. Compared to previous work, DuCRG-YOLOv11n boosts feature extraction speed by 912.97%, while DuC-ResNet18 enhances prediction speed by 45.37% and accuracy by 12.26%.

1. Introduction

Automakers and tech companies are working together to advance fully autonomous driving technology. Current Level 1 and Level 2 vehicles offer Advanced Driver Assistance Systems (ADAS) that rely on onboard cameras, Camera-Monitor Systems (CMS), and Head-Up Displays (HUDs) to enhance situational awareness and driving safety. Further improvements, such as long-wave infrared cameras, help CMS operate in low-light conditions, reducing glare and improving night-time driving safety. AR HUDs project driving information within the driver’s field of vision, providing assistance without distraction.
In Q4 2023, the Mercedes-Benz Drive Pilot became one of the few Level 3 systems compliant with the EU UN R157 regulation, receiving sales approval in Germany, Nevada, and California. In 2024, BMW vehicles equipped with Personal Pilot were also approved in Germany; unlike the Mercedes-Benz system, Personal Pilot can also operate in the dark. Meanwhile, Tesla unveiled its Level 4 autonomous taxi, the Cybercab, together with other high-tech vehicles in October 2024. Tech giants such as Qualcomm and NVIDIA are racing to develop high-performance automotive chips to support the increased computational demands of Level 4 autonomous driving. Consequently, self-driving cars require ultra-fast object detection and image recognition to achieve L4 safety.
Real-time fusion of object detection and steering angle prediction is essential for precise control responses. YOLOv4-tiny [1] enables fast detection of vehicles and traffic signs, while ResNet18 [2] accurately predicts steering angles at intersections or on multi-lane roads. Our previous work [3] introduced the lightweight models LW-YOLOv4-tiny and LW-ResNet18, which improved object detection and steering angle prediction for faster autonomous driving responses. However, speed limitations still caused delayed steering predictions, increasing the risk of dangerous accidents. Further acceleration of LW-YOLOv4-tiny and LW-ResNet18 is therefore necessary, because for control response, speed matters even more than accuracy.
Previous work [4] incorporated GhostBottleneck [5] and SELayer [6] to replace CSP_Block, resulting in the lightweight LWGSE-YOLOv4-tiny model. Meanwhile, LWDSG-ResNet18 used depthwise separable convolution [7] to enhance computing speed and pattern recognition accuracy. Despite these improvements, which achieved a detection speed of 24.9 FPS, the system failed to meet L4 safety standards, which require a control response speed exceeding 39.2 FPS. To close this gap, we sought a high-speed visual detection method capable of both rapid response and high accuracy.
This study introduces a new high-speed autonomous driving method to replace LWGSE-YOLOv4-tiny and LWDSG-ResNet18. By integrating the RepGhost bottleneck [8] with the SELayer [6] in YOLOv11n, we replace the C3k2 bottleneck and introduce Dual Conv [9] (depthwise separable convolution plus pointwise convolution) in place of standard Conv layers, resulting in DuCRG-YOLOv11n. These modifications enable faster inference while maintaining high detection accuracy. The RepGhost bottleneck keeps feature map dimensions consistent, allowing richer feature extraction. For ResNet18, we replace standard Conv layers with Dual Conv, forming DuC-ResNet18, which achieves high-speed and high-accuracy predictions. Our approach significantly accelerates inference, reducing autonomous driving control latency and meeting L4 safety requirements [10]. Specifically, our method achieves a control response speed of 50.7 FPS, a key contribution of this study. Following previous work [4], a model car, the NVIDIA JetRacer [11], was used to simulate autonomous driving in urban environments. This study also evaluates various YOLO-based models for object detection and ResNet-based models for steering angle prediction to identify the most efficient solution for real-time self-driving control.

3. Methods

3.1. High-Speed Response Using DuCRG-YOLOv11n and DuC-ResNet18 Models

Although YOLOv11n and ResNet18 are already relatively fast models, when they are deployed for environmental object detection and image recognition in real-world self-driving, the vehicle control response time still needs to be reduced. Therefore, the inference performance of the models must be improved. We aim to lower the computational load, shorten the response time while self-driving control is active, and reduce erroneous critical control decisions. The improved models can also decrease power consumption by reducing the number of convolutions, making them more energy-efficient. Based on the RepGhost [8] and DualConv [9] studies, this study proposes an improved YOLOv11n architecture to build a robust object detection model, denoted DuCRG-YOLOv11n, as shown in Figure 10. Furthermore, this study applies the DualConv method to construct an improved steering angle prediction model, denoted DuC-ResNet18, as shown in Figure 11.
Figure 10. DuCRG-YOLOv11n architecture.
Figure 11. DuC-ResNet18 architecture. The number in red indicates the number of filters in convolutions.
The proposed DuCRG-YOLOv11n replaces the bottleneck module in the C3k2 block of YOLOv11n with the RepGhost bottleneck module [8], as shown in Figure 10. The improvement in the RepGhost block lies in using the addition operation to replace the concatenation operation in Ghost Conv [5]. This change reduces memory usage during computation, and since the addition operation does not alter the shape of the feature maps, it helps lighten the computational burden, leading to faster results. The RepGhost block retains the linear transformation from Ghost, preserving the key characteristics of Ghost features. As a result, even though the algorithm simplifies traditional convolutions, it retains more features, enhancing the model’s computation speed.
Additionally, the RepGhost block incorporates the reparameterization technique from Rep, which adds the trained batch normalization (BN) weights to the Ghost Conv weights. This integrated weight allows the deep model to maintain the same precision as traditional convolutions while being faster and more lightweight during inference, thereby improving computation speed. Algorithm 1 provides a detailed statement of the execution flow of the RepGhost block. The proposed DuC-ResNet18 and DuCRG-YOLOv11n also use the DuConv module [9] to replace the conventional Conv module, as shown in Figure 10 and Figure 11. By combining depthwise convolution (DW Conv) with pointwise convolution (PW Conv), this module significantly reduces the computational load of traditional convolutions. Moreover, combining these two convolution types enables feature extraction at two scales, helping maintain the model’s accuracy.
Algorithm 1 RepGhost Block (RG Block) [8]
Input: Image $X$, ghost modules with linear transformation functions $\varphi_i$
Output: $Out\_fm$
1. Compute the output of the pointwise convolution:
   $Ins\_fm_i = \mu_{h,i}(fm_h) = fm_h \odot pf_{i,P}$,
   $h = 1, 2, \ldots, N$, $i = 1, 2, \ldots, M$, $P = \{1, 2, \ldots, N\}$,
   where $\mu_{h,i}$ represents a pointwise convolution producing the intrinsic feature maps $Ins\_fm_i$, $\odot$ stands for the point-to-point product of the feature map $fm_h$ and the pointwise filter $pf_{i,P}$, $N$ is the number of input images, and $M$ denotes the number of intrinsic feature maps.
2. Compute the output of the Ghost block:
   $gfm_i = \varphi_i(Ins\_fm_i) = (W_i \otimes Ins\_fm_i) \oplus Bghost_i$,
   $i = 1, 2, \ldots, M$,
   where $\varphi_i$ represents a simplified convolution operation, $W_i$ stands for the weight matrix of the ghost module, $Bghost_i$ indicates the bias matrix of the ghost module, $M$ is the number of intrinsic feature maps, $gfm_i$ denotes a Ghost feature map, $\otimes$ describes the pixel-wise product of two matrices, and $\oplus$ denotes the pixel-wise sum of two matrices.
3. Compute the output of the Ghost feature map with batch normalization added:
   $fm_i = gfm_i \oplus BN(Ins\_fm_i)$,
   $i = 1, 2, \ldots, M$,
   where $BN$ represents a batch-normalization operation, $M$ stands for the number of intrinsic feature maps, $gfm_i$ indicates a Ghost feature map, $fm_i$ denotes an intermediate feature map, and $\oplus$ denotes the pixel-wise sum of two matrices.
4. Compute the output of the RepGhost block:
   $Out\_fm_i = A_i(fm_i)$,
   $i = 1, 2, \ldots, M$,
   where $A_i$ is a SiLU activation function, $Out\_fm_i$ stands for a RepGhost feature map, and $M$ represents the number of RepGhost feature maps.
5. After the training phase (steps 1–4), the batch normalization weights are folded into the Ghost module weights to form a composite weight for the RepGhost block. In the inference phase, the RepGhost block executes only steps 1, 2, and 4 to obtain the output feature maps.
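To make the data flow of Algorithm 1 concrete, the following is a minimal PyTorch sketch of a RepGhost-style block. It assumes an equal number of input and output channels; the class, layer, and variable names are illustrative and are not the authors' implementation.

```python
# Minimal PyTorch sketch of a RepGhost-style block following Algorithm 1.
# Names are illustrative (not the authors' code); channel counts are assumed equal.
import torch
import torch.nn as nn

class RepGhostBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Step 1: pointwise convolution producing the intrinsic feature maps.
        self.primary = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        # Step 2: cheap linear transformation (depthwise 3x3) producing ghost features.
        self.cheap = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                               groups=channels, bias=False)
        # Step 3: BN branch fused with the ghost branch by addition (not concatenation),
        # so the feature-map shape is unchanged.
        self.bn_branch = nn.BatchNorm2d(channels)
        # Step 4: SiLU activation.
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ins = self.primary(x)                 # intrinsic feature maps
        ghost = self.cheap(ins)               # ghost feature maps
        fused = ghost + self.bn_branch(ins)   # addition replaces Ghost Conv's concat
        return self.act(fused)                # RepGhost output feature maps

x = torch.randn(1, 64, 40, 40)
print(RepGhostBlock(64)(x).shape)             # torch.Size([1, 64, 40, 40])
```

Because the addition keeps the feature-map shape, the batch normalization branch can be folded into the preceding convolution weights after training, which is the reparameterization step that makes inference lighter.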

3.2. RepGhost Bottleneck

The RepGhost block can be used to construct the RepGhost bottleneck module, as illustrated in Figure 12. The RepGhost bottleneck integrates multiple sublayers for feature extraction, including the initial convolutional layers and, in parallel, a branch in which the original feature maps pass through batch normalization, addition, and activation layers. During feature extraction, the RepGhost bottleneck module requires fewer parameters than the standard bottleneck module. It extracts features in a lower-dimensional space before projecting the results back into the original high-dimensional space. This input–output equivalence approach effectively reduces computational cost while preserving key features. Furthermore, Equation (3) defines the Sigmoid Linear Unit (SiLU) activation function [26], which replaces the traditional ReLU activation, where $p_j^i$ represents the input, $q_j^i$ represents the output, and $e$ denotes the natural constant. Based on Equation (3), we substitute the ReLU activation in the RepGhost bottleneck module with SiLU to further assist feature learning and improve gradient flow. Unlike ReLU, which outputs positive values for positive inputs but zero for negative inputs, SiLU adjusts its activation smoothly with the input value, avoiding the discontinuity inherent in ReLU. This property helps mitigate the vanishing gradient problem and accelerates model convergence. Equation (4) defines the Derivative of the Sigmoid Linear Unit (dSiLU) [26], which replaces Sigmoid, where $p_j^i$ represents the input, $q_j^i$ denotes the SiLU-transformed output, and $r_j^i$ represents the final output. Based on Equations (3) and (4), this study replaces ReLU with SiLU and substitutes Sigmoid with dSiLU in the Squeeze-and-Excitation (SE) layer to enhance its ability to focus on salient features. The conventional Sigmoid function has a relatively small gradient range, which can slow down learning; in contrast, dSiLU offers a broader gradient range, enabling more efficient learning. Although this substitution slightly increases computational complexity, the larger gradient magnitude makes it better suited for deep learning and effectively mitigates the vanishing gradient problem.
$q_j^i = p_j^i \cdot \dfrac{1}{1 + e^{-p_j^i}}$   (3)
$r_j^i = q_j^i \cdot \left(1 + p_j^i\left(1 - q_j^i\right)\right)$   (4)
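As a quick numerical check of the two activations, the sketch below evaluates SiLU and dSiLU in PyTorch. We write dSiLU in its standard closed form from [26], σ(p)(1 + p(1 − σ(p))); expressing Equation (4) this way is an assumption on our part, since the paper states it in its own notation.

```python
# SiLU as in Equation (3), and dSiLU as the standard derivative of SiLU.
import torch

def silu(p: torch.Tensor) -> torch.Tensor:
    return p * torch.sigmoid(p)              # q = p * 1 / (1 + e^{-p})

def dsilu(p: torch.Tensor) -> torch.Tensor:
    s = torch.sigmoid(p)
    return s * (1 + p * (1 - s))             # derivative of SiLU, replaces Sigmoid

p = torch.linspace(-4.0, 4.0, 9)
print(silu(p))
print(dsilu(p))
```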
Figure 12. RepGhost bottleneck architecture.
In Figure 12, the authors of RepGhost incorporated the Squeeze-and-Excitation (SE) layer [6] into the RepGhost bottleneck module. This integration enables dynamic feature map adjustment based on the input data, enhancing feature representation in convolutional neural networks. In this study, we drew inspiration from MobileNetV3 [7] and modified the SE layer by replacing its two fully connected layers with pointwise convolutions (1 × 1 Convs). This lightweight architecture, called MoSElayer, improves computational efficiency while marginally enhancing accuracy. We also replaced the SE layer’s activation function, changing the traditional ReLU to SiLU. These changes allow the model to better handle non-positive inputs and help alleviate the vanishing gradient problem, improving feature learning and training efficiency. Moreover, the Derivative of the Sigmoid Linear Unit (dSiLU) replaces the final Sigmoid function in the SE layer; this replacement incurs a slight computational overhead but contributes to higher accuracy within the architecture. The modified SE layer, termed dMoSElayer, is designed to automatically focus on essential features while suppressing less relevant ones, ultimately enhancing overall inference performance, as depicted in Figure 13. During the squeeze phase, dMoSElayer employs global average pooling to reduce the spatial dimensions of the feature maps, generating a global feature descriptor that emphasizes critical regions within the feature maps. In the excitation phase, dMoSElayer utilizes a small gating network to selectively amplify or suppress different feature channels. This excitation lets the model assess how important each channel is, improving its ability to capture key features, similar to a self-attention mechanism [28]. In short, dMoSElayer dynamically adjusts feature weights, allowing the model to focus more precisely on significant features while minimizing interference from irrelevant information. Finally, Figure 12 illustrates the feature extraction process in the neck and prediction stages, where the symbol “DS” denotes the down-sampling operation.
Figure 13. Derivative of Mobile Squeeze-and-Excitation Layer (dMoSE) architecture.
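The following PyTorch sketch illustrates the dMoSE layer described above: global average pooling for the squeeze, two 1 × 1 convolutions in place of fully connected layers, SiLU in place of ReLU, and dSiLU in place of the final Sigmoid gate. The class name and the reduction ratio of 4 are our own assumptions.

```python
# Illustrative sketch of a dMoSE-style channel-attention layer (names are ours).
import torch
import torch.nn as nn

def dsilu(p: torch.Tensor) -> torch.Tensor:
    s = torch.sigmoid(p)
    return s * (1 + p * (1 - s))

class DMoSELayer(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                       # squeeze phase
        self.fc1 = nn.Conv2d(channels, channels // reduction, 1)  # 1x1 conv replaces FC
        self.act = nn.SiLU()
        self.fc2 = nn.Conv2d(channels // reduction, channels, 1)  # 1x1 conv replaces FC

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.pool(x)                     # global feature descriptor
        w = self.act(self.fc1(w))
        w = dsilu(self.fc2(w))               # excitation gate with dSiLU
        return x * w                         # channel-wise reweighting

x = torch.randn(1, 64, 20, 20)
print(DMoSELayer(64)(x).shape)               # torch.Size([1, 64, 20, 20])
```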

3.3. Dual Convolution

In Figure 14, a high-efficiency convolutional computing approach called dual convolution (Du Conv) [9] replaces some of the basic convolutional units in the proposed DuCRG-YOLOv11n model. Similarly, the proposed DuC-ResNet18 model adopts Du Conv to replace traditional convolution layers, as shown in Figure 11. The main goal of Du Conv is to reduce computational cost while maintaining the model’s performance. As Figure 14 shows, Du Conv comprises depthwise and pointwise convolutions. In the depthwise convolution, each input feature map is processed independently with its corresponding filter (an independent 3 × 3 kernel), effectively capturing spatial detail in the image; this convolution focuses on feature interactions within a single channel. In the pointwise convolution, a linear combination of the previous feature maps (using independent 1 × 1 kernels) generates new output feature maps. Figure 14 shows how the intermediate feature maps are created as input for the subsequent layers. Finally, adding the results of these two convolutions forms the final output. The Du Conv process reduces computation cost because these two steps require fewer parameters than traditional convolution. Moreover, since two different kernel sizes are used, the model can capture features at different scales, preserving accuracy. Thus, this study applies Du Conv to construct the DuC-ResNet18 and DuCRG-YOLOv11n models. Algorithm 2 describes the detailed execution process of Du Conv, and a code sketch follows Figure 14.
Algorithm 2 Dual Convolution (Du Conv) [9]
Input: Image $X$, pointwise convolution functions $\mu_h$
Output: Feature map $Out\_fm_{h,g}$
1. Compute the output of the group convolution:
   Suppose that $N$ represents the number of input images, $X = \{x_i,\ i = 1, 2, \ldots, N\}$ indicates the input images, $F = \{f_i,\ i = 1, 2, \ldots, N\}$ denotes the depthwise filters, and $G$ stands for the number of groups. Divide the input images $X$ into $G$ groups and the depthwise filters $F$ into $G$ groups, respectively. Execute a depthwise convolution of each input image $x_h$ once, with its corresponding $\kappa \times \kappa$ filter $f_h$, independently, to attain the respective feature map $fm_{h,g}$, completing a single-group convolution $fm_g = \{fm_{h,g},\ h = 1, 2, \ldots, N/G,\ g = 1, 2, \ldots, G\}$, where $N/G$ is the number of input images and respective depthwise filters in each group, $X_{N/G} = \{x_h,\ h = 1, 2, \ldots, N/G\}$ denotes the input images of a group, and $F_{N/G} = \{f_h,\ h = 1, 2, \ldots, N/G\}$ denotes the depthwise filters of a group.
2. Compute the output of the pointwise convolution:
   $Ins\_fm_g = \mu_{h,i,g}(fm_h) = fm_h \odot pf_i$,
   $h = 1, 2, \ldots, N/G$, $i = 1, 2, \ldots, N/G$, $g = 1, 2, \ldots, G$,
   where $\mu_{h,i,g}$ represents a pointwise convolution producing the intrinsic feature maps $Ins\_fm_g$, $\odot$ stands for the point-to-point product of the feature map $fm_h$ and the pointwise filter $pf_i$, $N$ indicates the number of input feature maps, $G$ denotes the number of groups, and $N/G$ is the number of input feature maps and respective pointwise filters in each group.
3. Compute the output:
   $Out\_fm_{h,g} = fm_{h,g} \oplus Ins\_fm_g$,
   $h = 1, 2, \ldots, N/G$, $g = 1, 2, \ldots, G$,
   where $Ins\_fm_g$ indicates the pointwise convolution feature maps, $fm_{h,g}$ indicates the depthwise (group) convolution feature maps, $Out\_fm_{h,g}$ denotes the output feature maps of a dual convolution layer, and $\oplus$ denotes the pixel-wise sum of two matrices.
Figure 14. Dual convolution (DuConv) architecture.
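The code sketch below mirrors Algorithm 2 in PyTorch: a 3 × 3 group convolution and a 1 × 1 pointwise convolution are applied to the same input and their outputs are summed pixel-wise. The class name, the group count of 4, and the channel sizes are illustrative assumptions, not the authors' settings.

```python
# Minimal PyTorch sketch of the dual convolution (Du Conv) idea in Algorithm 2.
import torch
import torch.nn as nn

class DualConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, groups: int = 4, stride: int = 1):
        super().__init__()
        # 3x3 group convolution captures spatial detail within each channel group.
        self.group_conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride,
                                    padding=1, groups=groups, bias=False)
        # 1x1 pointwise convolution mixes information across all channels.
        self.point_conv = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride,
                                    bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pixel-wise sum of the two branches forms the dual-convolution output.
        return self.group_conv(x) + self.point_conv(x)

x = torch.randn(1, 64, 56, 56)
print(DualConv(64, 128)(x).shape)            # torch.Size([1, 128, 56, 56])
```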
Technically speaking, our previous work [4], which adopted Ghost Conv [5], performed traditional convolution operations with specific filters to obtain the intrinsic feature maps, followed by a simple linear transformation to attain the ghost feature maps. This approach generates the complete feature map while saving the time required by traditional convolution computing. Regarding time complexity, traditional convolution, the RepGhost block, and dual convolution require $O\!\left(\sum_{l=1}^{u} r_l^2 \cdot s_l^2 \cdot c_l \cdot h_l\right)$, $O\!\left(\sum_{l=1}^{u} \left(q_l^2 \cdot w_l^2 \cdot c_l \cdot z_l + r_l^2 \cdot s_l^2 \cdot z_l + t_l^2 \cdot z_l\right)\right)$, and $O\!\left(\sum_{l=1}^{u} \left(r_l^2 \cdot s_l^2 \cdot c_l + q_l^2 \cdot w_l^2 \cdot c_l \cdot h_l\right)\right)$, respectively.
We define the notation for time complexity as follows:
  • q : The side length of the input feature map.
  • r : The side length of the output feature map/the side length of the output feature map after depthwise convolution.
  • s : The side length of the filter/the side length of the depthwise filter.
  • c : The number of channels of the input feature map.
  • h : The number of channels of the output feature map.
  • v : The number of channels of a set of filters of the output feature map through a traditional convolution operation.
  • l : An index of the convolutional layer.
  • u : The number of convolutional layers.
  • g : The number of groups in each layer.
  • t : The side length of the output feature map after pointwise convolution.
  • w : The side length of the pointwise filter.
  • z : The number of channels of a set of filters of the output feature map through a pointwise convolution operation.
The RepGhost block and Dual Conv approaches can effectively speed up the convolution computation compared to traditional convolution and Ghost Conv. In addition, our proposed approaches can slightly increase prediction accuracy by eliminating redundant convolutional operations. By reducing the computational load and simplifying the process, these approaches let the model focus on the most relevant features, leading to faster computation and better accuracy.
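As a rough illustration of the saving described above, the snippet below compares per-layer parameter counts for a standard convolution and a dual convolution. The 3 × 3 kernel, 64 input channels, 128 output channels, and 4 groups are illustrative assumptions, not values from the models in this study.

```python
# Back-of-the-envelope parameter counts for one convolutional layer.
def params_traditional(c, h, s=3):
    # standard convolution: s^2 * c * h weights
    return s * s * c * h

def params_dual(c, h, s=3, g=4):
    group_conv = s * s * (c // g) * h        # 3x3 group convolution
    pointwise = c * h                        # 1x1 pointwise convolution
    return group_conv + pointwise

c, h = 64, 128
print(params_traditional(c, h))              # 73728
print(params_dual(c, h))                     # 26624, roughly a 2.8x reduction
```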

3.4. Scenarios of Detecting Objects and Predicting Steering Angles

In Figure 15a, we installed dual cameras on the front and rear panels of a small model car, the JetRacer, to capture real-time video streams, and the Jetson Nano runs DuCRG-YOLOv11n to detect and classify moving objects. Our previous work [4] used a visual odometer to measure the distance between detected objects and the vehicle. In Figure 15b, DuC-ResNet18 predicts the real-time steering angle along the route while the car is in motion. Because raw predictions can change sharply from frame to frame, the steering angle would swing from side to side and the vehicle would oscillate during driving; our previous work [4] therefore added a PID controller to mitigate this oscillation, as sketched after Figure 15.
Figure 15. Live self-driving. (a) DuCRG-YOLOv11n detects objects on the road promptly, and (b) DuC-ResNet18 instantly predicts a steering angle (–34.377) with a green visual indicator on the ground.
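The following is a hedged sketch of PID-style smoothing of the predicted steering angle, in the spirit of the controller mentioned above. The gains, time step, and update scheme are illustrative assumptions and not the settings of the previous work [4].

```python
# Generic PID smoothing of raw steering predictions (illustrative gains).
class PIDController:
    def __init__(self, kp: float = 0.5, ki: float = 0.02, kd: float = 0.1):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, target: float, current: float, dt: float = 1.0) -> float:
        # Correction that moves the current command toward the predicted target.
        error = target - current
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PIDController()
command = 0.0
for predicted in [0.4, -0.3, 0.5, -0.2]:     # oscillating raw predictions
    command += pid.step(predicted, command)  # smoothed steering command
    print(round(command, 3))
```

With the proportional gain below one, the command follows each new prediction only partially, which reduces frame-to-frame swings.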

4. Experiment Results and Discussion

This experiment initially ran several well-known object detection models: YOLOv8n [20], RG-YOLOv8n, YOLOv11n [20], RG-YOLOv11n, and DuCRG-YOLOv11n. Next, we trialed the steering angle prediction models VGG16 [19], ResNet18 [2], RG-ResNet18, and DuC-ResNet18. The experiment first trained the DuCRG-YOLOv11n model for object detection and the DuC-ResNet18 model for steering angle prediction. Subsequently, we tested the five object detection models and the four steering angle prediction models separately. Finally, this study combined object detection and steering angle prediction on the Jetson Nano to evaluate the self-driving performance of the JetRacer.

4.1. Experimental Settings

Table 1 describes the hardware specification used in the experiment. Table 2 lists the software packages applied to the experiment. Table 3 shows the hardware component of the embedded platform, Jetson Nano, operating in this experiment. We used PyTorch to program the execution of deep neural networks, and ultimately, the embedded platform Jetson Nano used TensorRT to accelerate the inference.
Table 1. Hardware specification.
Table 2. Recipe of packages.
Table 3. Components of NVIDIA Jetson Nano.

4.2. Model Training, Inference, and Capacity

In the first stage, 1476 images were collected for training and 366 images for testing, each with a size of 320 × 320, to train the object detection models. The training, validation, and test data ratio was 65%, 16%, and 16%, respectively. This experiment trained all models for 300 epochs on a GPU workstation. For the test images, Equation (5) evaluates the object detection inference time, where $i$ is the $i$th model, $I$ represents the number of models, $x$ stands for the $x$th image, $X$ is the number of images, $IT_i$ denotes the total inference time (IT) of the $i$th model, and $EIT_{i,x}$ indicates its inference time on each image.
$IT_i = \sum_{x=1}^{X} EIT_{i,x}, \quad i = 1, 2, \ldots, I, \quad x = 1, 2, \ldots, X$   (5)
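An illustrative sketch of the timing in Equation (5) follows: the total inference time of a model is the sum of its per-image inference times. The stand-in model and the image count below are placeholders, not the experimental setup.

```python
# Sum per-image inference times to obtain IT_i as in Equation (5).
import time
import torch

def total_inference_time(model, images):
    model.eval()
    per_image_times = []                              # EIT values
    with torch.no_grad():
        for x in images:
            start = time.perf_counter()
            _ = model(x.unsqueeze(0))                 # single-image inference
            per_image_times.append(time.perf_counter() - start)
    return sum(per_image_times)                       # IT = sum of EIT

images = [torch.randn(3, 320, 320) for _ in range(8)]
model = torch.nn.Conv2d(3, 8, kernel_size=3)          # stand-in for a detector
print(f"IT = {total_inference_time(model, images):.4f} s")
```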
In Table 4, the first row lists the training time of each model, and the second row shows the inference time. In short, Table 4 demonstrates that DuCRG-YOLOv11n outperforms the others in both training and inference efficiency.
Table 4. Time required to train and infer detecting objects for models.
In the second stage, the training dataset collected 14,710 images, and the test dataset collected 1000 images, with a size of 224 × 224 for each image, to train the steering angle prediction models. The data size ratio among training, validation, and test sets is the same percentage mentioned above. Similarly, Equation (5) computes the time required to predict the steering angles in the test images.
In Table 5, the first row lists the training time of each model, and the second row lists the inference time. In short, Table 5 demonstrates that DuC-ResNet18 took slightly longer to train than VGG16 and ResNet18, but its inference was faster than that of the other models.
Table 5. Time to train and infer predicting steering angles for models.
Regarding the model capacity, Table 6 lists the parameters for the object detection models, while Table 7 provides the parameters for the steering angle prediction models. YOLOv8n has the most parameters, while DuCRG-YOLOv11n has the fewest. VGG16 has the most parameters for predicting steering angles, whereas DuC-ResNet18 has the fewest parameters.
Table 6. Parameters of detecting object models.
Table 7. Parameters of predicting steering angle models.

4.3. Training and Validation Losses

In the experiment, visualization tools monitored model training, and callback functions recorded the best-performing model’s parameters. For the detection models YOLOv8n, RG-YOLOv8n, YOLOv11n, RG-YOLOv11n, and DuCRG-YOLOv11n, Figure 16, Figure 17, Figure 18, Figure 19, and Figure 20, respectively, display the training loss in the first row and the validation loss in the second row; the first, second, and third columns show the localization loss, the confidence loss, and the loss of matching predicted boxes to ground-truth boxes. After 300 training epochs, the best-performing object detection model, DuCRG-YOLOv11n (Figure 20), achieved the lowest loss and performs better than our previous work [4].
Figure 16. Training and validation losses for YOLOv8n. (a) Loss in box training. (b) Loss in classification training. (c) Loss in detection training. (d) Loss in box validation. (e) Loss in classification validation. (f) Loss in detection validation. In each plot, the x-axis represents the error value, and the y-axis represents the number of epochs.
Figure 17. Training and validation losses for RG-YOLOv8n. (a) Loss in box training. (b) Loss in classification training. (c) Loss in detection training. (d) Loss in box validation. (e) Loss in classification validation. (f) Loss in detection validation. In each plot, the x-axis represents the error value, and the y-axis represents the number of epochs.
Figure 18. Training and validation losses for YOLOv11n. (a) Loss in box training. (b) Loss in classification training. (c) Loss in detection training. (d) Loss in box validation. (e) Loss in classification validation. (f) Loss in detection validation. In each plot, the x-axis represents the error value, and the y-axis represents the number of epochs.
Figure 19. Training and validation losses for RG-YOLOv11n. (a) Loss in box training. (b) Loss in classification training. (c) Loss in detection training. (d) Loss in box validation. (e) Loss in classification validation. (f) Loss in detection validation. In each plot, the x-axis represents the error value, and the y-axis represents the number of epochs.
Figure 20. Training and validation losses for DuCRG-YOLOv11n. (a) Loss in box training. (b) Loss in classification training. (c) Loss in detection training. (d) Loss in box validation. (e) Loss in classification validation. (f) Loss in detection validation. In each plot, the x-axis represents the error value, and the y-axis represents the number of epochs.
After 200 training epochs, Figure 21 shows the loss curves for the best-performing steering angle prediction model, DuC-ResNet18, which performs better than our previous work [4]. Figure 21 displays the training and validation losses in blue and red, respectively. DuC-ResNet18 reduced the validation loss to 0.0513.
Figure 21. Losses in training and validation for DuC-ResNet18.

4.4. Model Testing

Equation (6) evaluates how many frames are processed per second (frames per second, FPS), where $J$ is the number of applied models, $FPS_j$ is the frame rate of the $j$th model, and $IRAIT_j$ denotes the time taken by the $j$th model to instantly detect an object.
$FPS_j = \dfrac{1}{IRAIT_j}, \quad j = 1, 2, \ldots, J$   (6)
A given model’s detection precision is reported as the mean average precision (mAP), obtained by averaging every class’s average precision over all classes. Equation (7) evaluates the detection precision $mAP_l$ of each model, where $L$ is the number of applied models, $k_l$ is a designated class in the $l$th model, $C_l$ is the number of designated classes in the $l$th model, $mAP_l$ indicates the detection precision of the $l$th model, and $AP_{k_l}$ denotes the average precision of a designated class in the $l$th model.
$mAP_l = \dfrac{\sum_{k_l=1}^{C_l} AP_{k_l}}{C_l}, \quad k_l = 1, 2, \ldots, C_l, \quad l = 1, 2, \ldots, L$   (7)
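A minimal sketch of Equation (7) follows: mAP is simply the mean of the per-class average precisions. The class names and AP values below are placeholders, not measured results.

```python
# mAP as the mean of per-class average precisions (Equation (7)).
def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)

ap = {"car": 0.95, "stop_sign": 0.97, "speed_limit": 0.93}   # hypothetical classes
print(round(mean_average_precision(list(ap.values())), 4))   # 0.95
```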
The experiment detected objects on 366 test images and estimated the speed and accuracy of each object detection model. Figure 22 shows the precision–recall curve and indicates each dot’s specific recall and precision. In Figure 22, the x-axis implies recall, and the y-axis denotes precision. The experiment tested the results of the following models: YOLOv8n, RG-YOLOv8n, YOLOv11n, RG-YOLOv11n, and DuCRG-YOLOv11n. Equation (6) calculates the FPS, and Equation (7) calculates the mAP, as listed in Table 8. To summarize this test, DuCRG-YOLOv11n achieved the best results, while YOLOv8n performed the worst.
Figure 22. Precision–recall curve for DuCRG-YOLOv11n.
Table 8. Speed and precision of detecting objects.
Likewise, Equation (6) computes how many steering angle predictions are processed per second (FPS). Equation (8) computes the mean square error ($MSE$) of the predicted steering angles, where $MSE$ measures the accuracy of a prediction model, $N$ denotes the number of input images, $k$ indexes the $k$th image, $y_k$ is the actual value, and $\hat{y}_k$ is the predicted value. The smallest MSE represents the best accuracy in predicting the steering angle.
$MSE = \dfrac{\sum_{k=1}^{N} (y_k - \hat{y}_k)^2}{N}, \quad k = 1, 2, \ldots, N$   (8)
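The short sketch below evaluates Equation (8) for a handful of illustrative steering angles on the [−1, 1] scale; the values are placeholders, not experimental data.

```python
# Mean square error of predicted steering angles (Equation (8)).
def mse(actual, predicted):
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n

actual_angles    = [0.0, -0.5, 1.0, 0.2]
predicted_angles = [0.1, -0.4, 0.9, 0.3]
print(mse(actual_angles, predicted_angles))   # approximately 0.01
```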
Predicting the steering angles on 1000 test images allows us to estimate the prediction speed and then evaluate the angle prediction accuracy of each model. In Figure 23, the curves show the predicted versus actual values for the steering angle prediction model DuC-ResNet18, where the steering angle scale ranges from −1 to 1: “−1” indicates a right turn, “1” a left turn, and “0” going straight. The proposed model performs better than our previous work [4]. Equation (6) calculates the FPS and Equation (8) the MSE, as shown in Table 9. This test indicates that RG-ResNet18 achieved the smallest MSE, while VGG16 performed the worst.
Figure 23. The MSE of predicted and actual steering angles using DuC-ResNet18.
Table 9. Speed and loss of predicting steering angles.

4.5. Performance Evaluation

The Jetson Nano executes the self-driving control, with frames per second (FPS) and mean square error (MSE) as the evaluation metrics. As the JetRacer travels along the route map, it instantly detects objects and simultaneously predicts the steering angle. If self-driving takes too long to detect objects or lacks the time needed to predict steering angles, the car’s safety may be endangered in unexpected situations; therefore, the frame rate is the most critical risk consideration. The Jetson Nano accelerates object detection using TensorRT, ensuring a faster and more efficient self-driving system.
Equation (6) calculates the frame rate for different combinations when the video resolution is 320 × 320, as shown in Table 10. In Table 10, DuCRG-YOLOv11n consistently achieves the best speed across various combinations. The combination of DuCRG-YOLOv11n and DuC-ResNet18 achieves the best FPS, while YOLOv8n and VGG16 result in the lowest FPS.
Table 10. FPS of model combination.
Subsequently, the experiment evaluated the accuracy of detecting objects and then the accuracy of predicting steering angles for each combination, with the video resolution set to 320 × 320. Equation (7) computes the accuracy of object detection, and Equation (8) calculates the accuracy of steering angle prediction. In Table 11, RG-YOLOv11n achieves the best detection accuracy, while RG-ResNet18 has the lowest MSE for predicting steering angles.
Table 11. Accuracy and precision of model combination.

4.6. Discussion

Carrasco et al. [13] proposed YOLOv5_SM_x2, which achieves a mean average precision (mAP) of 0.998 at 42.82 frames per second (FPS). While this model demonstrates impressive precision, it fails to meet the Level 4 (L4) safety requirements for self-driving when applied in this study. Wei et al. [27] combined millimeter-wave radar and visual detection, but this information fusion can waste a significant amount of time, slowing down object detection. Y. Cai et al. [29] conducted experiments using an NVIDIA RTX 2080 Ti as the computing engine, detecting objects at 66 FPS; however, since the Jetson Nano’s graphics processing speed is significantly slower than that of the RTX 2080 Ti, such an approach cannot achieve safe self-driving performance on our platform. In previous work, Chang et al. [4] proposed LWGSE-YOLOv4-tiny for fast object detection and LWDSG-ResNet18 for quick steering angle prediction, achieving 56.1 FPS in object detection and a mean square error of around 0.0683 for steering angle prediction. However, after data fusion, the FPS dropped to 24.9, and the slower processing speed could lead to serious accidents at high speeds. In contrast, the self-driving vision algorithms proposed in this study speed up computing on the Jetson Nano, achieving 147 FPS for rapid object detection. Our proposed approaches can simultaneously detect objects and accurately predict steering angles.
Delayed steering angle predictions can significantly increase the number of dangerous unexpected events; in such scenarios, fast detection and response matter even more than precision. Two models, YOLOv11n and ResNet18, were improved into DuCRG-YOLOv11n and DuC-ResNet18, accelerating detection speed while also increasing image recognition accuracy. As a result, compared to YOLOv11n, DuCRG-YOLOv11n increased the feature extraction frame rate by 14.84% and slightly improved object detection accuracy by 0.1%. Compared with ResNet18, DuC-ResNet18 achieved a 111.59% increase in prediction speed and a 23.18% improvement in image recognition accuracy. Therefore, the proposed methods fulfill the main objective of this study.
Nevertheless, the experiment revealed shortcomings in this scenario owing to the Jetson Nano’s hardware limitations: it cannot handle real-time, high-resolution video streams while the model car is moving. The Jetson Orin Nano would be a good alternative for fast capture and complete processing of higher-resolution video streams. Unfortunately, although the Jetson Orin Nano consumes less power than the Jetson AGX Xavier, it still cannot provide sufficient battery life for the JetRacer. Therefore, an energy-efficient model vehicle is required to address this limitation.

5. Conclusions

This study contributes to the realization of a high-speed control response for object detection integrated with steering angle prediction, meeting the requirements for L4 safety in self-driving. DuCRG-YOLOv11n significantly increases the speed of object detection, while DuC-ResNet18 substantially increases the speed of steering angle prediction. As a result, the proposed approaches reduce the overall reaction time of self-driving responses. Technically, the proposed method enables high-speed execution of self-driving control and effectively achieves the goal of L4-safe self-driving. The proposed approaches significantly outperform our previous work [4], increasing the overall speed of self-driving control by 2.04 times. According to the performance comparison, the proposed approaches also outperform the other alternative methods.
In the future, we will continue to pursue ways to improve object detection and predict steering angles to meet the requirements for L5 safety in self-driving. Furthermore, integrating high-speed vision algorithms into in-vehicle supercomputers is a key part of the future trend in self-driving. Therefore, combining the Robot Operating System (ROS) with high-speed vision algorithm inference to control self-driving cars will also be a significant issue moving forward. We aim to find low-power, high-performance embedded platforms to run these high-speed vision algorithms. This solution enables a cost-effective and efficient edge computing architecture for the safe and smooth operation of self-driving cars.

Author Contributions

B.R.C. and J.-S.S. conceived and designed the experiments; H.-F.T. collected the dataset and proofread the manuscript; B.R.C. wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

The National Science and Technology Council fully supports this work in Taiwan, the Republic of China, under grant numbers NSTC 113-2221-E-390-015 and NSTC 113-2622-E-390-003.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s). In addition, the Sample Programs have been used to support the findings of this study. https://drive.google.com/file/d/12peew-KnePeCVEOLy8q-Ff7JCNa-jAII/view?usp=sharing (accessed on 19 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. Scaled-YOLOv4: Scaling Cross Stage Partial Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13029–13038. [Google Scholar]
  2. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  3. Chang, B.R.; Tsai, H.F.; Chou, H.L. Accelerating the response of self-driving control by using rapid object detection and steering angle prediction. Electronics 2023, 12, 2161. [Google Scholar] [CrossRef]
  4. Chang, B.R.; Tsai, H.F.; Chang, F.Y. Boosting the response of object detection and steering angle prediction for self-driving control. Electronics 2023, 12, 4281. [Google Scholar] [CrossRef]
  5. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–20 June 2020; pp. 1580–1589. [Google Scholar]
  6. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  7. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar] [CrossRef]
  8. Chen, C.; Guo, Z.; Zeng, H.; Xiong, P.; Dong, J. RepGhost: A hardware-efficient ghost module via reparameterization. arXiv 2022, arXiv:2211.06088. [Google Scholar] [CrossRef]
  9. Zhong, J.; Chen, J.; Mian, A. DualConv: Dual convolutional kernels for lightweight deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 1904–1918. [Google Scholar] [CrossRef] [PubMed]
  10. Li, S.; Zhang, Y.; Blythe, P.; Edwards, S.; Ji, Y. Remote driving as the Failsafe: Qualitative investigation of Users’ perceptions and requirements towards the 5G-enabled Level 4 automated vehicles. Transp. Res. F: Traffic Psychol. Behav. 2024, 100, 211–230. [Google Scholar] [CrossRef]
  11. JetRacer AI Kit. Waveshare Wiki. 2023. Available online: https://www.waveshare.com/wiki/JetRacer_AI_Kit (accessed on 1 March 2025).
  12. Dong, X.; Yan, S.; Duan, C. A lightweight vehicle detection network model based on YOLOv5. Eng. Appl. Artif. Intell. 2022, 113, 104914. [Google Scholar] [CrossRef]
  13. Carrasco, D.P.; Sotelo, M.A.; Rodríguez, F.J. T-YOLO: Tiny vehicle detection based on YOLO and multi-scale convolutional neural networks. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4906–4915. [Google Scholar]
  14. Momin, M.A.; Junos, M.H.; Khairuddin, A.S.M.; Talip, M.S.A. Lightweight CNN model: Automated vehicle detection in aerial images. Signal Image Video Process 2023, 17, 1209–1217. [Google Scholar] [CrossRef]
  15. Cai, D.; Li, K.; Hou, B. YOLOv8-RepGhostEMA: An efficient underwater trash detection model. J. Phys. Conf. Ser. 2024, 2906, 012019. [Google Scholar] [CrossRef]
  16. Wang, H.; Liu, C.; Cai, Y.; Chen, L.; Li, Y. YOLOv8-QSD: An improved small object detection algorithm for autonomous vehicles based on YOLOv8. IEEE Trans. Instrum. Meas. 2024, 73, 1–16. [Google Scholar] [CrossRef]
  17. Rambus. Driving Automation Levels. Rambus Blog. 2023. Available online: https://www.rambus.com/blogs/driving-automation-levels/ (accessed on 8 January 2025).
  18. Australian Parliament House. Inquiry into The Impact of New and Emerging Technologies on The Australian Economy. Parliament of Australia. 2023. Available online: https://www.aph.gov.au/DocumentStore.ashx?id=bb8568ee-b1c1-4f47-adb2-dc2b73eaa6c4&subId=304019 (accessed on 16 January 2025).
  19. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
  20. Ultralytics. GitHub. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 20 November 2024).
  21. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar] [CrossRef]
  22. Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. Available online: https://openaccess.thecvf.com/content_cvpr_2017/html/Lin_Feature_Pyramid_Networks_CVPR_2017_paper.html (accessed on 30 November 2024).
  23. Sapkota, R.; Meng, Z.; Churuvija, M.; Du, X.; Ma, Z.; Karkee, M. Comprehensive performance evaluation of YOLOv12, YOLO11, YOLOv10, YOLOv9, and YOLOv8 on detecting and counting fruitlet in complex orchard environments. arXiv 2024, arXiv:2407.12040. [Google Scholar] [CrossRef]
  24. Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized Self-Attention: Towards High-Quality Pixel-Wise Regression. arXiv 2021, arXiv:2107.00782. [Google Scholar] [CrossRef]
  25. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  26. Klambauer, G.; Unterthiner, T.; Mayr, A.; Hochreiter, S. Self-Normalizing Neural Networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 972–981. [Google Scholar]
  27. Wei, Z.; Zhang, F.; Chang, S.; Liu, Y.; Wu, H.; Feng, Z. mmWave Radar and Vision Fusion for Object Detection in Autonomous Driving: A Review. Sensors 2022, 22, 2542. [Google Scholar] [CrossRef]
  28. Liu, Y.; Ma, L.; Liu, Y.; Zhang, Y.; Yang, M. SA-YOLOv3: An efficient and accurate object detector using self-attention mechanism for autonomous driving. IEEE Trans. Intell. Transp. Syst. 2020, 23, 2326–2330. [Google Scholar]
  29. Cai, Y.; Luan, T.; Gao, H.; Wang, H.; Chen, L.; Li, Y.; Sotelo, M.A.; Li, Z. YOLOv4-5D: An Effective and Efficient Object Detector for Autonomous Driving. IEEE Trans. Instrum. Meas. 2021, 70, 4503613. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
