Electronics
  • Article
  • Open Access

9 May 2023

Accelerating the Response of Self-Driving Control by Using Rapid Object Detection and Steering Angle Prediction

1
Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 81148, Taiwan
2
Department of Fragrance and Cosmetic Science, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Convolutional Neural Networks for Visual Detection, Recognition and Segmentation in Images and Videos

Abstract

A vision-based autonomous driving system usually fuses information from object detection and steering angle prediction to drive safely through real-time recognition of the environment around the car. If an autonomous driving system cannot respond to driving control quickly and appropriately, it risks severe self-driving accidents. Therefore, this study introduced GhostConv into the YOLOv4-tiny model for rapid object detection, denoted LW-YOLOv4-tiny, and into the ResNet18 model for rapid steering angle prediction, denoted LW-ResNet18. In the experiments, LW-YOLOv4-tiny achieved the highest execution speed at 56.1 frames per second, and LW-ResNet18 obtained the lowest prediction loss with a mean-square error of 0.0683. Compared with other integrations, the proposed approach achieved the best performance indicator, 2.4658, showing the fastest response to driving control in self-driving.

1. Introduction

In recent years, self-driving systems have become a worldwide trend. This study imitates the advanced driver assistance system developed by Tesla to build a vision-based autonomous driving system on a model car. It therefore focuses on real-time object detection and image recognition through image sensors to realize vision-based autonomous driving while the car moves on the road, which is a highly complex task.
Applying AI visual algorithms to an autonomous driving system allows it to fuse rapid object detection and steering angle prediction information, respond immediately, and drive safely. Regarding object detection, the AI visual algorithm YOLOv4-tiny [1], a lightweight version of YOLOv4 [2], can precisely detect objects in front of a self-driving car, including vehicles and traffic signs. Similarly, another AI visual algorithm, ResNet18 [3], a variant of the ResNet models [4], can precisely predict the steering angle for the road ahead, including multi-lane sections and intersections. If an autonomous driving system cannot respond to driving control quickly and appropriately, it risks severe self-driving accidents. However, the YOLOv4-tiny and ResNet18 models cannot run fast enough to respond quickly for driving control. Therefore, this study seeks a lightweight version of YOLOv4-tiny that can run on GPUs in real time to achieve rapid object detection, and a lightweight version of ResNet18 to achieve rapid steering angle prediction. With these improved models, the autonomous driving system integrates rapid object detection and steering angle prediction to control driving quickly and safely.
To imitate a self-driving car running on an urban road, a planar road map simulates real roads so that a model car, the Nvidia JetRacer [5], can drive autonomously on it. The model car is fitted with seven surrounding cameras and an embedded Nvidia Jetson Nano [6] that executes the AI visual computing. The model car self-drives on the road map, which includes two-way lanes and intersections, where we placed vital traffic signs beside the lanes, such as various speed limits, traffic lights, direction signs, and stop signs, to simulate an autonomous driving scenario resembling a natural road mesh. We collected a large amount of image data from this experimental environment for supervised learning of the AI visual models and used the TensorRT [7] inference engine to fuse real-time rapid object detection and steering angle prediction simultaneously. This study evaluates a variety of integration systems and compares their performance. The main contribution of this study is an approach that achieves the best performance and shows the fastest response of driving control in self-driving.

3. Method

3.1. Model Car and Planar Road Map

The boxed battery located at the base of JetRacer supplies power to the Jetson Nano and the motor inside JetRacer. In addition, the mobile power bank attached to JetRacer supplies power to the other Jetson Nanos. Servo and DC gear motors control the rear wheels, which drive JetRacer. Tie rods fix the front wheels, and the other parts control the steering of the car. We installed seven onboard cameras around JetRacer, forming a 360-degree panoramic view through which to look around and perceive environmental information, as shown in Figure 5. The autonomous driving system calls the drive motor and steering programs through the JetRacer library to drive the car; the motor speed and steering degree both range from −1 to 1. The autonomous driving system performs object detection and steering angle prediction and then sends the results back to JetRacer. In this way, JetRacer can decide whether to move forward or backward and turn left or right.
Figure 5. Onboard cameras surrounding a JetRacer. (a) Three front cameras indicated #1, #2, and #3. (b) Left camera indicated #4. (c) Two rear cameras indicated #5 and #6. (d) Right camera indicated #7.
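As a rough illustration of this control interface, the following minimal sketch sends throttle and steering commands in the −1 to 1 range through the standard jetracer Python package; the NvidiaRacecar class and its attributes follow that package and are not necessarily the exact control code used in this study.

```python
# A minimal sketch of issuing drive commands to JetRacer; the NvidiaRacecar
# class and its throttle/steering attributes follow the standard jetracer
# package and may differ from the control code used in this study.
from jetracer.nvidia_racecar import NvidiaRacecar

car = NvidiaRacecar()

def apply_control(throttle: float, steering: float) -> None:
    """Clamp commands to the [-1, 1] range accepted by the motor and servo."""
    car.throttle = max(-1.0, min(1.0, throttle))
    car.steering = max(-1.0, min(1.0, steering))

# Move forward slowly while turning slightly left.
apply_control(throttle=0.3, steering=0.2)
```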
We designed a scenario that lets JetRacers self-drive on a planar road map simulating real roads for the simulation and test of autonomous driving, as shown in Figure 6. Furthermore, this study also produced small-size traffic signs such as speed limit signs, turn signs, stop signs, and traffic lights. JetRacer can drive on the planar road map and follow the traffic signal rules, as shown in Figure 7.
Figure 6. Planar road map, where yellow lines distinguish lanes in different directions and white lines indicate the road edge.
Figure 7. Planar road map with traffic signs.

3.2. Data Collection

The controller operated JetRacer to move on the road map in small increments for data collection. By collecting images from different angles at each point on the path, the autonomous driving system learns how to correct its position within the lane. The wide-angle camera mounted on the front of JetRacer has a 140-degree viewing range, and data collected from this camera help JetRacer move along the route more accurately, as shown in Figure 8. This study collected two datasets: the first contains various objects, such as vehicles and different traffic signs, and the second contains the routes of the planar road map. To implement the object detection task, we manually labeled the first dataset, as shown in Figure 9.
Figure 8. Data collection using the front-end camera and handle controller. (a) Front-end camera; (b) handle controller.
Figure 9. Traffic signs and roads in the training phase. (a) Speed limit sign. (b) Stop sign. (c) Traffic light.

3.3. Rapid Response Time from Using LW-YOLOv4-Tiny and LW-ResNet18 Models

Even though YOLOv4-tiny and ResNet18 perform precisely in object detection and steering angle prediction, we still intend to modify their network architectures to speed up execution. The new approach can then shorten the response time of self-driving control and minimize driving judgment errors. Moreover, the models will also consume less power, realizing energy-efficient object detection and steering angle prediction. Therefore, this study proposes lightweight versions of the YOLOv4-tiny and ResNet18 architectures that replace traditional convolution with a fast convolution inside the models; the resulting rapid models, abbreviated LW-YOLOv4-tiny and LW-ResNet18, are shown in Figure 10 and Figure 11, respectively. The proposed approach significantly reduces the number of visual computations due to the lightweight network architecture.
Figure 10. The architecture of the LW-YOLOv4-tiny model. (Different colors indicate different functions performed in a single block.)
Figure 11. The architecture of the LW-ResNet18 model. (Different colors indicate different functions performed in a single block.)
The LW-YOLOv4-tiny model comprises input, backbone, neck, and prediction parts, as shown in Figure 10, in which traditional convolutions are replaced with ghost convolutions [28]. The LW-YOLOv4-tiny model thus differs from the YOLOv4-tiny model in that it uses ghost convolutions instead of traditional ones, shortening the response time of inference. This approach reduces the number of visual computations and enhances feature extraction, improving the inference speed without sacrificing recognition precision. The LW-ResNet18 model comprises input, backbone, and prediction parts, as shown in Figure 11. Like the LW-YOLOv4-tiny model, it improves the inference speed to shorten the response time of inference without sacrificing recognition precision.
When a traditional convolutional layer performs feature extraction, the CNN-related model uses a set of filters, typically 3 × 3 and 5 × 5, to convolve the input image, and redundant information will also exist in the resulting feature maps. Ghost convolution uses a set of ghost modules consisting of multiple network parameters ($\varphi_{i,j}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$) to perform feature extraction. This set of ghost modules executes simple linear transformations on the intrinsic feature maps through the network parameters $\varphi_{i,j}$ and can generate ghost feature maps without complicated convolution operations, as shown in Figure 12. Algorithm 1 shows the ghost convolution algorithm. In Algorithm 1, the intrinsic feature maps are the feature maps calculated by traditional convolution. The ghost modules then generate more ghost feature maps from the intrinsic feature maps through a series of simple linear transformations. The ghost feature maps can fully reveal the feature information hidden in the intrinsic feature maps.
Algorithm 1: Ghost Convolution
Input: Image $X$, ghost modules with linear transformation functions $\varphi_{i,j}$
Output: $Outfm$
  • Obtain intrinsic feature maps $Insfm_i$ from a traditional convolution of image $X$.
  • Compute the output of the ghost module:
    $Ghostfm_{i,j} = \varphi_{i,j}(Insfm_i) = W^{ghost}_{i,j} \otimes Insfm_i \oplus B^{ghost}_{i,j}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$
    where $\varphi_{i,j}$ is a simplified convolution operation, $W^{ghost}_{i,j}$ is a weight matrix of the ghost module, $B^{ghost}_{i,j}$ is a bias matrix of the ghost module, $n$ represents the number of intrinsic feature maps, $m$ stands for the number of ghost modules, the symbol $\otimes$ denotes the pixel-wise product of two matrices, and the symbol $\oplus$ indicates the pixel-wise sum of two matrices.
  • Transform the intrinsic feature maps $Insfm_i$ into the ghost feature maps $Ghostfm_{i,j}$ with multiple ghost modules:
    $Outfm = \{Outfm_1, Outfm_2, \ldots, Outfm_n\} = \{Outfm_i\}$, $i = 1, 2, \ldots, n$
    $Outfm_i = \{Insfm_i, Ghostfm_{i,1}, Ghostfm_{i,2}, \ldots, Ghostfm_{i,m}\} = \{Insfm_i, Ghostfm_{i,j}\}$, $j = 1, 2, \ldots, m$
    where $Outfm_i$ represents the output feature maps, including the corresponding intrinsic feature map and ghost feature maps, and $Outfm$ stands for the whole set of output feature maps in a convolution layer.
Figure 12. Execution flow of ghost convolution.
Generally, the CNN-related model uses many filters in traditional convolution operations, and the number of filters must be consistent with the number of output channels. Technically speaking, in Figure 12, ghost convolution obtains intrinsic feature maps after performing traditional convolution operations with a small set of filters and then directly generates more ghost feature maps through simple linear transformations. This approach avoids a substantial amount of time-consuming traditional convolution operations. Assume that $r$ represents the side length of the output feature map, $s$ stands for the side length of the filter, $c$ indicates the number of channels of the input feature map, $h$ denotes the number of channels of the output feature map, $v$ implies the number of channels of the set of filters used for the traditional convolution operation, $l$ is an index of the convolutional layer, and $u$ is the number of convolutional layers. According to the discussion in [29], the time complexity of the traditional convolutional layers is $O\left(\sum_{l=1}^{u} r_l^2 \cdot s_l^2 \cdot c_l \cdot h_l\right)$, while the time complexity of ghost convolution is $O\left(\sum_{l=1}^{u} \left(r_l^2 \cdot s_l^2 \cdot c_l \cdot v_l + r_l^2 \cdot h_l \cdot v_l\right)\right)$, which is much less than the former. Therefore, compared with the traditional convolution layer, the time complexity of ghost convolution is much smaller, so it significantly reduces the cost of convolution calculation. This approach also yields new (ghost) feature maps without redundant information, thus slightly improving the prediction accuracy.
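To make the structure of Algorithm 1 concrete, the following is a minimal PyTorch sketch of a ghost convolution block: a small traditional convolution produces the intrinsic feature maps, and cheap depthwise transformations generate the ghost feature maps, which are concatenated. Channel counts, kernel sizes, and the activation are illustrative assumptions, not the exact configuration of LW-YOLOv4-tiny or LW-ResNet18.

```python
# A minimal PyTorch sketch of a ghost convolution block in the spirit of
# Algorithm 1; hyperparameters here are illustrative only.
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, ratio=2, cheap_kernel=3):
        super().__init__()
        init_ch = out_ch // ratio      # channels of intrinsic feature maps
        cheap_ch = out_ch - init_ch    # channels of ghost feature maps
        # Traditional convolution -> intrinsic feature maps (Insfm_i)
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.LeakyReLU(0.1, inplace=True),
        )
        # Cheap depthwise (linear) transformation -> ghost feature maps (Ghostfm_{i,j})
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, cheap_ch, cheap_kernel, padding=cheap_kernel // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        intrinsic = self.primary(x)
        ghost = self.cheap(intrinsic)
        return torch.cat([intrinsic, ghost], dim=1)   # Outfm = {Insfm, Ghostfm}

# Example: replace a 3x3 convolution producing 128 channels from 64 channels.
feats = GhostConv(64, 128)(torch.randn(1, 64, 56, 56))
print(feats.shape)  # torch.Size([1, 128, 56, 56])
```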
In the training phase, the CNN-related model uses the steepest gradient descent algorithm to continuously update the weight matrix $W^{ghost}_{i,j}$ and the bias matrix $B^{ghost}_{i,j}$ toward their best values. The weight matrix $W^{ghost}_{i,j}$ and the bias matrix $B^{ghost}_{i,j}$ obtained from a single round of network training are not necessarily reliable, so this study proposes pixel-wise average pooling to obtain more reliable $W^{ghost}_{i,j}$ and $B^{ghost}_{i,j}$, as shown in Figure 13. Assume that there are $p$ rounds of training and each round runs for $q$ epochs. When a round runs out of epochs, it yields the final weight matrix $W^{ghost}_{i,j,k}$ ($i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$, $k = 1, 2, \ldots, p$) and the final bias matrix $B^{ghost}_{i,j,k}$ ($i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$, $k = 1, 2, \ldots, p$). After $p$ rounds of network training, the training process performs pixel-wise average pooling on the weight and bias matrices: averaging the corresponding element positions of the $p$ weight matrices and $p$ bias matrices yields more reliable $W^{ghost}_{i,j}$ and $B^{ghost}_{i,j}$. The linear transformation used by such ghost convolution can extract features more precisely and obtain higher prediction accuracy for subsequent inferences.
Figure 13. Pixel-wise average pooling.
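A minimal sketch of this pixel-wise average pooling is shown below: the ghost-module weight and bias matrices collected from $p$ training rounds are averaged element-wise to form the final matrices. The tensor shapes are illustrative; the actual matrices would come from the trained models.

```python
# A minimal sketch of pixel-wise average pooling over p rounds of training;
# shapes and the number of rounds are illustrative placeholders.
import torch

def pixelwise_average(matrices):
    """Element-wise mean over p matrices of identical shape."""
    return torch.stack(matrices, dim=0).mean(dim=0)

# e.g., W_ghost[i][j] and B_ghost[i][j] collected from p = 3 training rounds
rounds_W = [torch.randn(3, 3) for _ in range(3)]
rounds_B = [torch.randn(3, 3) for _ in range(3)]
W_avg = pixelwise_average(rounds_W)
B_avg = pixelwise_average(rounds_B)
```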

3.4. Distance Measurement and PID Control

In Figure 14 and Figure 15, LW-YOLOv4-tiny performs object detection on the real-time video streams captured from the dual cameras located in the front and rear panels of JetRacer. This study uses visual odometry to measure the distance between the vehicle and the target object, as shown in Figure 16. The triangle theorem and the principle of parallax estimate the distance between the detected object and the center point of the horizontal line spanning the dual cameras, as shown in Figure 17; a small worked sketch follows the figures below. However, there are some restrictions on the use of dual cameras: the distance between them must be fixed, and they must be mounted on the same horizontal line. Otherwise, the accuracy of the measured distance will be significantly affected.
Figure 14. Real-time object detection by using LW-YOLOv4-tiny. (a) Real-time image of model car one on the outer lane. (b) Real-time object detection of model car one on the outer lane. (c) Real-time image of model car two on the inner lane. (d) Real-time object detection of model car two on the inner lane.
Figure 15. Real-time vehicle detection by using LW-YOLOv4-tiny. (a) Left camera module. (b) Right camera module.
Figure 16. Measuring distance between the vehicle and the target object.
Figure 17. Measuring the distance between two JetRacers by using dual cameras. (a) Left camera. (b) Right camera.
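The sketch below illustrates parallax-based distance estimation with similar triangles: depth is proportional to the focal length times the baseline, divided by the horizontal disparity of the detected object between the two images. The focal length, baseline, and pixel coordinates are illustrative values, not calibration data from this study.

```python
# A minimal sketch of parallax-based distance estimation with dual cameras:
# Z = f * B / d, with f the focal length in pixels, B the fixed baseline,
# and d the disparity of the object's center between left and right images.
def estimate_distance(x_left: float, x_right: float,
                      focal_px: float, baseline_m: float) -> float:
    disparity = abs(x_left - x_right)          # pixels
    if disparity < 1e-6:
        return float("inf")                    # object effectively at infinity
    return focal_px * baseline_m / disparity   # meters

# Object center at x=340 px (left image) and x=310 px (right image),
# with an assumed 700 px focal length and 6 cm baseline.
print(round(estimate_distance(340, 310, focal_px=700, baseline_m=0.06), 3))  # ~1.4 m
```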
In Figure 18, LW-ResNet18 predicts the steering angle along the route while JetRacer moves forward. Sharp changes in the predicted steering angle may cause the model car to shake from side to side while moving, making JetRacer swing significantly along the route. Therefore, as shown in Figure 19, we add a PID controller to the autonomous driving system, which includes proportional control with the parameter $K_p$, integral control with the parameter $K_i$, and differential control with the parameter $K_d$. The purpose of proportional control is that the more JetRacer deviates from the lane, the greater the degree of correction back to the original lane. Integral control sums up all error values and applies a reverse correction toward the direction with the larger accumulated deviation. Differential control corrects the offset in the opposite direction to avoid the excessive correction caused by the parameter $K_p$ alone. Table 1 gives the default settings of the three PID parameters, and a minimal controller sketch follows the table below.
Figure 18. Steering angle prediction by using LW-ResNet18.
Figure 19. PID controller. (a) The architecture of the PID controller; (b) going straight; (c) taking a right turn.
Table 1. Parameter setting of PID controller.
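The following is a minimal discrete PID sketch for this lane-keeping correction: the error is the deviation of the steering value from the lane center, and the output is the corrected steering command. The gains and the loop rate are placeholders, not the values of Table 1.

```python
# A minimal discrete PID sketch; gains are placeholders, not Table 1 values.
class PID:
    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error: float, dt: float) -> float:
        self.integral += error * dt                                  # K_i term input
        derivative = (error - self.prev_error) / dt if dt > 0 else 0.0
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=0.8, ki=0.05, kd=0.2)
steering_correction = pid.step(error=0.15, dt=1 / 30)  # assumed 30 FPS control loop
```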

3.5. Information Fusion and Visualized Steering Assistance

This study uses the integration of the YOLOv4-tiny and ResNet18 models to achieve autonomous driving functions, which have been successfully implemented for object detection and image recognition on the embedded Jetson Nano platforms of a JetRacer. Image recognition allows JetRacer to learn how to drive correctly on the routes of the planar road map. Object detection identifies the traffic signs and other vehicles on the road in real time, measuring the approximate distance between the vehicle and the target object through dual cameras to avoid a collision. Figure 20 shows the system diagram of the autonomous driving system proposed in this study.
Figure 20. System diagram of the autonomous driving system.
Typically, steering angle prediction and object detection run continuously across the planar road map, so an autonomous driving system needs a mechanism for information fusion between the two processes while the model car is driving. First, when object detection recognizes a red light or a stop sign in front of the model car, JetRacer brakes accordingly. If the steering angle prediction decides to turn left or right at that moment and the driving system does not ignore this action, JetRacer will remain stationary while its front wheels move left and right. Next, when object detection recognizes a speed limit sign, the system adjusts the driving speed of JetRacer according to the speed limit. Then, for a decision to turn left or right, JetRacer has to maintain a constant speed to avoid entering the corner with insufficient speed. Finally, for a decision to go straight, the steering value should be fixed at 0 so that JetRacer can go straight stably on a straight road. Therefore, the autonomous driving system uses information fusion to resolve the contradictions caused by inconsistent driving decision-making. Figure 21 shows the decision-making performed by JetRacer in various scenarios to avoid unreasonable driving behavior; a minimal sketch of this fusion logic follows the figure below.
Figure 21. Information fusion of self-driving.
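The sketch below expresses the fusion rules just described: braking decisions from object detection override steering, speed limits rescale the throttle, cornering keeps a constant speed, and going straight pins the steering at 0. The detection labels, thresholds, and dictionary fields are illustrative assumptions, not the exact rules of the system.

```python
# A minimal sketch of the information fusion logic in Figure 21; labels,
# thresholds, and field names are illustrative assumptions.
def fuse(detections, predicted_steering, base_throttle):
    throttle, steering = base_throttle, predicted_steering
    labels = {d["label"] for d in detections}

    if "red_light" in labels or "stop_sign" in labels:
        return 0.0, 0.0                 # brake; ignore steering while stationary
    for d in detections:
        if d["label"].startswith("speed_limit"):
            throttle = min(throttle, d["limit_scaled"])   # obey the speed limit
    if abs(predicted_steering) > 0.3:
        throttle = 0.4                  # keep a constant speed through corners
    else:
        steering = 0.0                  # go straight stably
    return throttle, steering

throttle, steering = fuse(
    [{"label": "speed_limit_40", "limit_scaled": 0.5}],
    predicted_steering=0.1, base_throttle=0.7)
```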
According to the steering range on the real-time image from JetRacer, the autonomous driving system draws a green dotted line as visualized steering assistance for guiding the imminent turning direction of JetRacer, as shown in Figure 22. The purpose of visualized steering assistance is to give the driver a forecast of the imminent turning direction of JetRacer. When the autonomous driving system deviates and JetRacer is about to leave the lane, the driver can take over manually to bring JetRacer back to the original lane, which ensures a safe drive.
Figure 22. A green dotted line as visualized steering assistance.

4. Experiment Results and Discussion

This study tested the following object detection algorithms: LW-YOLOv4-tiny, YOLOv4-tiny [1], YOLOv5s [24], YOLOv5n [30], YOLOv7 [26], and YOLOv7-tiny [26]. Regarding the steering angle prediction algorithms, this study tested Nvidia-CNN [8], a traditional convolutional neural network (CNN) [31], ResNet18 [3], and LW-ResNet18. The Nvidia-CNN mentioned in [8] refers to a specific convolutional neural network architecture developed by NVIDIA Corporation for image recognition and steering angle prediction in self-driving cars. Compared with CNN, the architecture of Nvidia-CNN is relatively simple, including only convolutional layers and fully connected layers, and uses smaller filters, which is suitable for tasks such as image classification. The experiments trained the six object detection models and four steering angle prediction models in different combinations, executed on Jetson Nano and implemented in JetRacer for autonomous driving.

4.1. Experiment Setting

The hardware used in the experiments comprises a GPU workstation and the embedded platform Jetson Nano; their specifications are listed in Table 2. Table 3 lists the recipe of packages used in the experiments.
Table 2. Hardware specifications.
Table 3. Recipe of packages.

4.2. Model Training, Inference, and Capability

The input image source is a set of 1476 images as the training dataset and 366 images as the test dataset. We collected approximately the same number of images for each class to be identified, split into approximately 65% training set, 16% validation set, and 16% test set, which is relatively close to the commonly used 60%/20%/20% split. Several object detection models were run on the GPU workstation, as listed in Table 4. For the same group of training data, the experiment recorded the time spent on training on the GPU workstation, and Equation (1) calculates the total inference time required for every model to detect the objects in the test images, where $IT_i$ denotes the inference time (IT) of the $i$th object detection model, $I$ stands for the total number of object detection models, $x$ indexes the $x$th test image, $X$ shows the total number of test images, and $EIT_{i,x}$ indicates the time taken to complete the inference of the $x$th test image with the $i$th model.
$IT_i = \sum_{x=1}^{X} EIT_{i,x}$, where $i = 1, 2, \ldots, I$, $x = 1, 2, \ldots, X$ (1)
Table 4. Training and inference time of object detection models (unit: s).
The test image size is 224 × 224, and the number of iterations is 50. In Table 4, the first row shows the time to train every object detection model with the same parameter settings, and the second row gives the time to run inference on the 366 test images. The experimental results show the training and inference time of every object detection model, and LW-YOLOv4-tiny outperforms the others.
The input image source is a training dataset of 14,710 images, and the steering angle prediction models were run on the GPU workstation, as listed in Table 5. For the same group of training data, the experiment recorded the training time on the GPU workstation, and Equation (1) calculates the total inference time required for every steering angle prediction model to predict the steering angle of 1000 test images, where $i$ here indexes the steering angle prediction model used for inference.
Table 5. Training and inference time of steering angle prediction models (unit: s).
The test image size is 224 × 74, and the number of iterations is 30. In Table 5, the first row shows the time to train every steering angle prediction model with the same parameter settings, and the second row gives the time to run inference on the 1000 test images. Although the training time of the ResNet18 steering angle prediction model is much longer than that of the others, its inference time is slightly better.
Table 6 lists the number of parameters used by every object detection model. In Table 6, the YOLOv7 model has the largest number of parameters, and the YOLOv5n model has the fewest. In contrast, Table 7 lists the parameters used by every steering angle prediction model. In Table 7, the ResNet18 model has the largest number of parameters, and the CNN model has the fewest.
Table 6. Parameters of object detection models.
Table 7. Parameters of steering angle prediction models.

4.3. Training and Validation Losses

After 50 training epochs, we used the visualization tool to observe the training process and the callback function to save the best-performing model. Every object detection model has six loss plots, as shown in Figure 23, Figure 24, Figure 25, Figure 26, Figure 27 and Figure 28. In these figures, the upper row shows the training losses and the lower row the validation losses, where the first column is the box (positioning) loss, the second is the objectness (confidence) loss, and the third is the classification loss. The validation curves show that every object detection model stably converges to its minimum loss.
Figure 23. Training and validation losses of the LW-YOLOv4-tiny model. (a) Box training loss. (b) Objectness training loss. (c) Classification training loss. (d) Box validation loss. (e) Objectness validation loss. (f) Classification validation loss.
Figure 24. Training and validation losses of the YOLOv4-tiny model. (a) Box training loss. (b) Objectness training loss. (c) Classification training loss. (d) Box validation loss. (e) Objectness validation loss. (f) Classification validation loss.
Figure 25. Training and validation losses of the YOLOv5s model. (a) Box training loss. (b) Objectness training loss. (c) Classification training loss. (d) Box validation loss. (e) Objectness validation loss. (f) Classification validation loss.
Figure 26. Training and validation losses of the YOLOv5n model. (a) Box training loss. (b) Objectness training loss. (c) Classification training loss. (d) Box validation loss. (e) Objectness validation loss. (f) Classification validation loss.
Figure 27. Training and validation losses of the YOLOv7 model. (a) Box training loss. (b) Objectness training loss. (c) Classification training loss. (d) Box validation loss. (e) Objectness validation loss. (f) Classification validation loss.
Figure 28. Training and validation losses of the YOLOv7-tiny model. (a) Box training loss. (b) Objectness training loss. (c) Classification training loss. (d) Box validation loss. (e) Objectness validation loss. (f) Classification validation loss.
After 30 training epochs, we used the visualization tool to observe the training process and the callback function to save the best-performing model. Figure 29 shows the loss plots of the steering angle prediction models, where the blue line represents the training loss and the green line the validation loss. The LW-ResNet18 model reduces the validation loss to 0.034, the ResNet18 model to 0.037, the CNN model to 0.048, and the Nvidia-CNN model to 0.067.
Figure 29. Training and validation losses of steering angle prediction model. (a) LW-ResNet18. (b) ResNet18. (c) CNN. (d) Nvidia-CNN.

4.4. Model Testing

Equation (2) evaluates the execution speed of object detection in frames per second (FPS), where $FPS_j$ is the FPS of the $j$th object detection model, $J$ represents the number of object detection models, and $IRAIT_j$ stands for the time required to process each image in real-time object detection using the $j$th object detection model.
$FPS_j = \frac{1}{IRAIT_j}$, where $j = 1, 2, \ldots, J$ (2)
The mean average precision (mAP) over all categories evaluates the precision of object detection; it is the mean of the average precisions of the individual categories. Equation (3) calculates the precision $mAP_l$ of each object detection model, where $L$ represents the number of object detection models, $mAP_l$ stands for the mean average precision of the $l$th object detection model, $C_l$ is the number of identified categories in the $l$th model, $k_l$ denotes a specific category in the $l$th model, and $AP_{k_l}$ indicates the precision of a specific category in the $l$th model.
$mAP_l = \frac{\sum_{k_l=1}^{C_l} AP_{k_l}}{C_l}$, where $k_l = 1, 2, \ldots, C_l$, $l = 1, 2, \ldots, L$ (3)
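As a small illustration of Equation (3), the sketch below averages per-category AP values into a single mAP; the AP values are placeholders, not results from Table 8.

```python
# A minimal sketch of Equation (3): mAP is the mean of per-category APs.
def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)

print(mean_average_precision([0.99, 0.97, 0.95, 0.98]))  # 0.9725
```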
Here, we evaluate the execution speed and precision of every object detection model. After the models were trained with the same parameter settings, the experiment tested them with 366 images and plotted the PR curves, as shown in Figure 30. In the test, Equation (2) computes the execution speed in FPS, and Equation (3) evaluates the precision mAP, as listed in Table 8. In Figure 30, the PR curve takes recall as the x-axis and precision as the y-axis, and each point represents a combination of recall and precision. The AP is obtained by averaging all precisions retrieved from these combinations, and the mAP is the sum of the APs of all categories in the PR curves divided by the total number of categories. In Table 8, YOLOv5s achieves the best precision, and YOLOv7-tiny the lowest.
Figure 30. The precision–recall curve for the object detection model. (a) LW-YOLOv4-tiny; (b) YOLOv4-tiny; (c) YOLOv5s; (d) YOLOv5n; (e) YOLOv7; (f) YOLOv7-tiny.
Table 8. Speed and precision of object detection models.
Equation (2) also evaluates the execution speed of steering angle prediction in frames per second (FPS), where $j$ now indexes the steering angle prediction model. Equation (4) evaluates the accuracy of steering angle prediction by the mean square error ($MSE$), where $MSE$ is the mean-square error of an angle prediction model, $N$ represents the total number of images the model needs to identify, $k$ stands for the $k$th test image, $y_k$ denotes the actual value, and $\hat{y}_k$ indicates the predicted value. The smaller the value of $MSE$, the better the prediction accuracy of the steering angle prediction model.
$MSE = \frac{\sum_{k=1}^{N} (y_k - \hat{y}_k)^2}{N}$, where $k = 1, 2, \ldots, N$ (4)
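The sketch below computes Equation (4) directly from paired actual and predicted steering values; the sample values are illustrative only.

```python
# A minimal sketch of Equation (4): mean-square error over N test images.
def mse(actual, predicted):
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)

print(round(mse([0.0, 0.5, -1.0], [0.1, 0.4, -0.8]), 4))  # 0.02
```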
Here, we evaluate the execution speed and accuracy of every steering angle prediction model. After training the models with the same parameter settings, the experiment tested them with the same 1000 images and plotted the predicted and actual values, as shown in Figure 31. In the test, Equation (2) computes the execution speed in FPS, and Equation (4) evaluates the prediction accuracy by MSE, as listed in Table 9. In Figure 31, the red line represents the actual steering value, and the blue line the predicted steering value. We limited the steering value to between −1 and 1, where −1 means turn right, 1 means turn left, and 0 means go straight. In Table 9, the LW-ResNet18 model obtains the smallest MSE, and the Nvidia-CNN model the largest.
Figure 31. Predicted and actual values of steering angle prediction model. (a) LW-ResNet18; (b) ResNet18; (c) CNN; (d) Nvidia-CNN.
Table 9. Speed and loss of steering angle prediction models.

4.5. Self-Driving System Assessment

The autonomous driving system adopts the execution speed and the prediction accuracy achieved on Jetson Nano as its evaluation indicators. When JetRacer starts self-driving on the road map, the autonomous driving system must implement real-time object detection and steering angle prediction simultaneously. The execution speed of object detection is therefore probably the most critical consideration, because taking longer to detect objects endangers self-driving by leaving insufficient time to complete steering angle prediction. The execution speed is the frames per second (FPS) of object detection for a specific model. The object detection model can use TensorRT to significantly accelerate the inference speed of the deep learning model.
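One possible way to apply TensorRT acceleration on a Jetson device is sketched below with the torch2trt converter; whether this study used torch2trt or the native TensorRT API is not stated, and the checkpoint name and input size are hypothetical.

```python
# A minimal sketch of TensorRT acceleration via torch2trt on a Jetson device;
# the checkpoint name and input size are assumptions, and the checkpoint is
# assumed to contain a full serialized model rather than a state_dict.
import torch
from torch2trt import torch2trt

model = torch.load("lw_resnet18.pth").cuda().eval()   # hypothetical checkpoint
dummy = torch.randn(1, 3, 224, 224).cuda()            # illustrative input size
model_trt = torch2trt(model, [dummy], fp16_mode=True) # FP16 boosts FPS on Jetson Nano

with torch.no_grad():
    steering = model_trt(dummy)                        # TensorRT-accelerated inference
```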
First, based on Equation (2), Table 10 lists the frame rates of the integration systems, for which Jetson Nano operates at a resolution of 224 × 224 per frame. As a result, the LW-YOLOv4-tiny model achieves the best execution speed on average among the integration systems. Integrating the LW-YOLOv4-tiny and CNN models obtains the best execution speed, and integrating YOLOv7 and ResNet18 the lowest.
Table 10. FPS of integrated models.
Next, we focused on both the precision of object detection and the accuracy of steering angle prediction for the models running on Jetson Nano. Based on Equations (3) and (4), we can calculate the precision of object detection and the accuracy of steering angle prediction at a resolution of 224 × 224 per frame, as listed in Table 11. In Table 11, the YOLOv5s model achieves the best object detection precision, and the LW-ResNet18 model obtains the lowest loss of steering angle prediction in the autonomous driving system.
Table 11. The accuracy of integrated models.

4.6. Performance Indicator

While maintaining high object detection accuracy in the autonomous driving system, an integrated model that achieves a higher frame rate (FPS) obtains a better performance indicator. The baseline is the integrated model with the lowest frame rate. Equation (5) calculates the FPS ratio (FR) of an integrated model out of the various combinations, where $m$ is the $m$th integrated model, $M$ indicates the number of integrated models, $FPS_m$ represents the FPS of the $m$th integrated model, $LS$ stands for the lowest FPS among the integrated models, and $FR_m$ denotes the FR of the $m$th integrated model.
$FR_m = \frac{FPS_m}{LS}$, where $m = 1, 2, \ldots, M$ (5)
Table 12 lists the FR of the various integrated models. Based on Equation (5), the integration of the YOLOv7 and ResNet18 models has the lowest FR, which equals 1. In Table 12, regarding the execution FPS ratio, the object detection model LW-YOLOv4-tiny achieves the best average FR, and YOLOv7 the lowest.
Table 12. FR of Integrated Models.
The above analysis considers only the FR of the various integrated models; the prediction accuracy of object detection and steering angle prediction is also vital for the autonomous driving system. Equation (6) calculates the precision ratio of object detection (ODPR), where $n$ is the $n$th object detection model, $N$ indicates the number of object detection models, $mAP_n$ represents the mAP of the $n$th object detection model, $LmAP$ stands for the lowest mAP among the object detection models, and $ODPR_n$ denotes the precision ratio of the $n$th object detection model.
$ODPR_n = \frac{mAP_n}{LmAP}$, where $n = 1, 2, \ldots, N$ (6)
Equation (7) calculates the loss ratio of steering angle prediction (SAPLR), where $o$ is the $o$th steering angle prediction model, $O$ indicates the number of steering angle prediction models, $MSE_o$ represents the MSE of the $o$th steering angle prediction model, $LMSE$ stands for the lowest loss among the steering angle prediction models, and $SAPLR_o$ denotes the SAPLR of the $o$th steering angle prediction model.
$SAPLR_o = \frac{LMSE}{MSE_o}$, where $o = 1, 2, \ldots, O$ (7)
Equation (8) calculates the precision ratio (PR) of an integrated model out of the various combinations, where $p$ is the $p$th integrated model, $P$ indicates the number of integrated models, $ODPR_p$ represents the ODPR of the $p$th integrated model, $SAPLR_p$ stands for the SAPLR of the $p$th integrated model, and $PR_p$ denotes the PR of the $p$th integrated model.
$PR_p = ODPR_p \times SAPLR_p$, where $p = 1, 2, \ldots, P$ (8)
Table 13 lists the PR of the various integrated models. Based on Equation (8), the integration of the YOLOv7-tiny and Nvidia-CNN models has the lowest PR, which equals 1. In Table 13, regarding the predictive precision ratio, the steering angle prediction model LW-ResNet18 achieves the best average PR, and Nvidia-CNN the lowest.
Table 13. PR of Integrated Models.
Furthermore, Equation (9) calculates the primitive performance indicator (PPI), that is, FR multiplied by PR, where $q$ indexes the $q$th integration system, $m$ the $m$th integration system, and $p$ the $p$th integration system; $Q$, $M$, and $P$ are the numbers of integration systems; $FR_m$ represents the FR of the $m$th integration system; $PR_p$ stands for the PR of the $p$th integration system; and $PPI_q$ denotes the PPI of the $q$th integration system.
$PPI_q = FR_m \times PR_p$, where $q = 1, 2, \ldots, Q$, $m = 1, 2, \ldots, M$, $p = 1, 2, \ldots, P$ (9)
Table 14 lists the PPI of the various integrated models. Based on Equation (9), the integration of the YOLOv7 and Nvidia-CNN models has the lowest PPI, and the integration of the LW-YOLOv4-tiny and LW-ResNet18 models has the best PPI. In Table 14, the object detection model LW-YOLOv4-tiny achieves the best average PPI, and the steering angle prediction model LW-ResNet18 obtains the highest PPI on average.
Table 14. PPI of Integrated Models.
Finally, Equation (10) calculates the performance indicator (PI), which is the ratio of the PPI to the lowest PPI among the integrated models, where $r$ is the $r$th integrated model, $R$ indicates the number of integrated models, $PPI_r$ represents the PPI of the $r$th integrated model, $LPPI$ stands for the lowest PPI among the integrated models, and $PI_r$ denotes the PI of the $r$th integrated model.
$PI_r = \frac{PPI_r}{LPPI}$, where $r = 1, 2, \ldots, R$ (10)
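The sketch below ties Equations (5) through (10) together for one integrated model: FR from the FPS ratio, ODPR from the mAP ratio, SAPLR from the MSE ratio, PR as their product, PPI as FR times PR, and PI normalized by the lowest PPI. All numeric inputs are placeholders, not the values reported in Tables 10 to 15.

```python
# A minimal sketch of Equations (5)-(10) for a single integrated model;
# all numbers are placeholders.
def performance_indicator(fps, lowest_fps, mAP, lowest_mAP,
                          mse, lowest_mse, lowest_ppi):
    fr = fps / lowest_fps              # Equation (5)
    odpr = mAP / lowest_mAP            # Equation (6)
    saplr = lowest_mse / mse           # Equation (7)
    pr = odpr * saplr                  # Equation (8)
    ppi = fr * pr                      # Equation (9)
    return ppi / lowest_ppi            # Equation (10)

pi = performance_indicator(fps=50.0, lowest_fps=20.0, mAP=0.98, lowest_mAP=0.95,
                           mse=0.07, lowest_mse=0.07, lowest_ppi=1.0)
```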
Table 15 lists the PI of the various integrated models. Based on Equation (10), the integration of the YOLOv7 and Nvidia-CNN models has the lowest PI, and the integration of the LW-YOLOv4-tiny and LW-ResNet18 models has the best PI. In Table 15, the object detection model LW-YOLOv4-tiny achieves the best average PI, and the steering angle prediction model LW-ResNet18 obtains the highest PI on average.
Table 15. PI of Integrated Models.

4.7. Discussion

For the vision-based autonomous driving system, the proposed approach is compared with the method introduced by Bojarski et al. [8], which proposed the other two classical convolutional neural networks. This study tested several models for steering angle prediction: Nvidia-CNN, a traditional CNN, ResNet18, and LW-ResNet18. As per the results, LW-ResNet18 achieves the best precision of steering angle prediction, whereas the traditional CNN achieves a better inference speed. Duong et al. [9] simulated autonomous driving on the virtual UDACITY platform.
In contrast, the proposed autonomous driving system is implemented on a small-size model car. Several studies [10,11,12] used Raspberry Pi as the computing platform to run their autonomous driving systems, whereas this study applied the more powerful Jetson Nano to implement the autonomous driving system. In addition, we proposed an integration system that performs object detection and steering angle prediction simultaneously.
For the object detection task, we tested several different models in this study, namely, LW-YOLOv4-tiny, YOLOv4-tiny, YOLOv5s, YOLOv5n, YOLOv7, and YOLOv7-tiny. Although the YOLOv5s and YOLOv7 models achieve higher accuracies of 99.1% and 98.9%, respectively, their inference speed in FPS is much slower than that of the LW-YOLOv4-tiny model in object detection. Both of the latter models lengthen the response time of object detection due to their lower FPS and thus delay the steering angle prediction for self-driving. As a consequence, they can induce unsafe driving in autonomous driving systems. Therefore, object detection accuracy is not the most critical factor affecting safe driving in autonomous driving. Instead, the most critical factor for safe driving is seeking the highest inference speed in FPS without sacrificing recognition precision and integrating a high-precision steering angle prediction model with the lowest response time. As a result, integrating the object detection model LW-YOLOv4-tiny and the steering angle prediction model LW-ResNet18 obtains the best performance index in the experiments.
Due to the limited hardware performance of Jetson Nano, the performance of the autonomous driving system in JetRacer is characterized by a trade-off between the FPS and the accuracy of object detection. We could replace the embedded Jetson Nano with a Jetson Xavier NX to raise the FPS, but the Jetson Xavier NX requires a larger power supply to run the autonomous driving system, and the model car JetRacer cannot provide enough battery power for it. In other words, the model car JetRacer can only adopt Jetson Nano as its embedded platform, which is another limitation.
Moreover, the cameras installed on JetRacer do not have high-resolution image quality, so the video captured from the cameras is unclear while JetRacer is self-driving. In the future, we will seek high-resolution cameras to capture video with better image quality.

5. Conclusions

If an autonomous driving system cannot respond to driving control quickly and appropriately, it risks severe self-driving accidents. This study proposes the vision-based integration of the LW-YOLOv4-tiny and LW-ResNet18 models to fuse, in real time, the information from rapid object detection and rapid steering angle prediction for safe self-driving. The proposed approach uses ghost convolutions instead of traditional ones, thus shortening the response time of inference without sacrificing recognition precision. The performance evaluation shows that the proposed approach outperforms the other alternatives.
In future work, we will introduce a LiDAR sensor for the autonomous driving model car, allowing real-time self-positioning in an open space. LiDAR sensors can establish a complete high-precision point cloud map to achieve environmental awareness. With high-precision positioning on the map, the autonomous driving model car can perform path planning to implement navigation on the road. Furthermore, we will adopt the ROS system, combining LiDAR sensors and vision algorithms, to run the tasks of the autonomous driving system effectively in real time. In other words, the ROS system uses a LiDAR sensor to locate traffic signs and visual algorithms to detect and recognize traffic signs and lane lines quickly and simultaneously. In this way, incorporating LiDAR and ROS into the autonomous driving system can significantly enhance safe driving. In addition, enlarging the dataset will reduce the gap between the training loss curve and the validation loss curve; the smaller the gap, the better the generalization of the model.

Author Contributions

B.R.C. and C.-W.H. conceived and designed the experiments; H.-F.T. collected the dataset and proofread the paper; B.R.C. wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work was fully supported by the Ministry of Science and Technology, Taiwan, Republic of China, under grant numbers MOST 111-2622-E-390-001 and MOST 111-2221-E-390-012.

Data Availability Statement

The sample programs (Sample Program.zip) used to support the findings of this study are available at: https://drive.google.com/file/d/1-wjUMuolISVcTwWoM46BpvR1L4PWOc-9/view?usp=sharing (accessed on 13 April 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. Scaled-YOLOv4: Scaling Cross Stage Partial Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13029–13038.
  2. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
  3. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  4. Zhang, G.; Geng, L.; Chen, X. Sound Source Localization Method Based on Densely Connected Convolutional Neural Network. In Proceedings of the 2022 5th International Conference on Information Communication and Signal Processing (ICICSP), Shenzhen, China, 26–28 November 2022; pp. 743–747.
  5. Waveshare Wiki. JetRacer AI Kit. Available online: https://www.waveshare.com/wiki/JetRacer_AI_Kit (accessed on 1 May 2023).
  6. NVIDIA. Jetson Nano Developer Kit. Available online: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-nano/ (accessed on 1 May 2023).
  7. NVIDIA. TensorRT. 2021. Available online: https://developer.nvidia.com/tensorrt (accessed on 1 May 2023).
  8. Bojarski, M.; Testa, D.W.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L.D.; Monfort, M.; Muller, U.; Zhang, J.; et al. End to End Learning for Self-Driving Cars. arXiv 2016, arXiv:1604.07316.
  9. Duong, M.T.; Do, T.D.; Le, M.H. Navigating Self-Driving Vehicles Using Convolutional Neural Network. In Proceedings of the 4th IEEE International Conference on Green Technology and Sustainable Development, Ho Chi Minh City, Vietnam, 23–24 November 2018; pp. 607–610.
  10. Do, T.D.; Duong, M.; Dang, T.Q.V.; Le, M.H. Real-Time Self-Driving Car Navigation Using Deep Neural Network. In Proceedings of the 4th IEEE International Conference on Green Technology and Sustainable Development, Ho Chi Minh City, Vietnam, 23–24 November 2018; pp. 7–12.
  11. Jain, A.K. Working Model of Self-Driving Car Using Convolutional Neural Network, Raspberry Pi, and Arduino. In Proceedings of the Second IEEE International Conference on Electronics, Communication and Aerospace Technology, Coimbatore, India, 29–31 March 2018; pp. 1630–1635.
  12. Seth, A.; James, A.; Mukhopadhyay, S.C. 1/10th Scale Autonomous Vehicle Based on Convolutional Neural Network. Int. J. Smart Sens. Intell. Syst. 2020, 13, 1–17.
  13. Omrane, H.; Masmoudi, M.S.; Masmoudi, M. Neural Controller of Autonomous Driving Mobile Robot by an Embedded Camera. In Proceedings of the 4th IEEE International Conference on Advanced Technologies for Signal and Image Processing, Sousse, Tunisia, 21–24 March 2018; pp. 1–5.
  14. Simmons, B.; Adwani, P.; Pham, H.; Alhuthaifi, Y.; Wolek, A. Training a Remote-Control Car to Autonomously Lane-Follow Using End-to-End Neural Networks. In Proceedings of the 3rd IEEE Annual Conference on Information Sciences and Systems, Baltimore, MD, USA, 20–22 March 2019; pp. 1–6.
  15. Karni, U.; Ramachandran, S.S.; Sivaraman, K.; Veeraraghavan, A.K. Development of Autonomous Downscaled Model Car Using Neural Networks and Machine Learning. In Proceedings of the 3rd IEEE International Conference on Computing Methodologies and Communication, Erode, India, 27–29 March 2019; pp. 1089–1094.
  16. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
  17. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6023–6032.
  18. Ghiasi, G.; Lin, T.Y.; Le, Q.V. DropBlock: A Regularization Method for Convolutional Networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 3 December 2018; pp. 10750–10760.
  19. Müller, R.; Kornblith, S.; Hinton, G.E. When Does Label Smoothing Help? arXiv 2019, arXiv:1906.02629.
  20. Misra, D. Mish: A Self Regularized Non-Monotonic Neural Activation Function. arXiv 2019, arXiv:1908.08681.
  21. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–20 June 2020; pp. 390–391.
  22. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586.
  23. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19.
  24. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Ingham, F.; Poznanski, J.; Fang, J.; Yu, L. ultralytics/yolov5: v3.1-Bug Fixes and Performance Improvements. 2020. Available online: https://zenodo.org/record/4154370#.ZFjQp3ZByUk (accessed on 1 May 2023).
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
  26. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. arXiv 2022, arXiv:2207.02696.
  27. Neupane, D.; Kim, Y.; Seok, J. Bearing Fault Detection Using Scalogram and Switchable Normalization-Based CNN (SN-CNN). IEEE Access 2021, 9, 88151–88166.
  28. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features from Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–20 June 2020; pp. 1580–1589.
  29. He, K.; Sun, J. Convolutional Neural Networks at Constrained Time Cost. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; p. 5354.
  30. Jocher, G.; Stoken, A.; Chaurasia, A.; Borovec, J.; Kwon, Y.; Michael, K.; Skalski, S.P. ultralytics/yolov5: v6.0-YOLOv5n ‘Nano’ Models, Roboflow Integration, TensorFlow Export, OpenCV DNN Support. 2021. Available online: https://zenodo.org/record/5563715#.ZFn7fc5ByUk (accessed on 1 May 2023).
  31. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput. 1989, 1, 541–551.
