The Design of Preventive Automated Driving Systems Based on Convolutional Neural Network

Abstract: As automated vehicles are considered one of the most important trends in intelligent transportation systems, various research is being conducted to enhance their safety. In particular, the importance of technologies for the design of preventive automated driving systems, such as the detection of surrounding objects and the estimation of distance between vehicles, is increasing. Object detection is mainly performed through cameras and LiDAR, but due to the cost and the limited recognition distance of LiDAR, the need to improve camera recognition techniques, which are relatively convenient for commercialization, is growing. In this study, to improve the recognition techniques of vehicle-mounted monocular cameras for the design of preventive automated driving systems, we trained the convolutional neural network (CNN)-based Faster Regions with CNN (Faster R-CNN) and You Only Look Once (YOLO) V2 models, recognized surrounding vehicles in black box highway driving videos, and estimated the distance to surrounding vehicles with the model more suitable for automated driving systems. For model comparison, we trained both models on the PASCAL Visual Object Classes (VOC) dataset. Faster R-CNN, with a mean average precision (mAP) of 76.4, showed accuracy similar to YOLO V2 with an mAP of 78.6, but its processing speed of 5 frames per second (FPS) was slower than the 40 FPS of YOLO V2, and Faster R-CNN also had difficulty detecting vehicles. As a result, YOLO V2, which showed better performance in both accuracy and processing speed, was determined to be the more suitable model for automated driving systems and was used for the subsequent estimation of the distance between vehicles. For distance estimation, we converted coordinate values through camera calibration and perspective transform, set the detection threshold to 0.7, and performed object detection and distance estimation, achieving more than 80% accuracy for near-distance vehicles.
Through this study, it is believed that the results can help prevent accidents in automated vehicles, and it is expected that additional research will provide various accident-prevention alternatives, such as calculating and securing appropriate safety distances depending on the vehicle type.


Introduction
Automated vehicles have been regarded as one of the most important trends in intelligent transportation systems, with rapid developments in recent years, and are expected to enhance vehicular traffic through increased highway capacity, improved traffic flow, and fewer accidents thanks to collision prevention systems [1,2]. Currently, automated vehicles are undergoing various research and development at GM, Waymo, Ford, etc., driven by the convergence of ICT, with a focus on commercialization and production [3]. In particular, as many patented technologies such as collision prevention (Automatic Distance Control, ADC) and sensing and tracking (Automatic Exposure Control, AEC) have been applied, they are contributing greatly to safety and convenience, and the smart car market is expected to grow at an even faster pace in the future [4]. The main contributions of this study are as follows:
1. It contributes to the prevention of accidents by designing preventive automated driving systems through improving camera recognition techniques, which are more suitable for commercialization than LiDAR in terms of cost and recognition distance.
2. It applies a better model for automated driving systems through performance comparisons of CNN methods that have recently made significant advances in object detection.
3. Because it is difficult to obtain driving data from automated vehicles, black box videos, whose camera position is the most similar to that of automated vehicles and whose data are relatively easy to collect, were collected and used for training.
4. It can serve as basic material for calculating the appropriate safety distance between vehicles in the future by estimating distances according to image coordinates.
This study is organized as follows: in Section 2, we establish the differentiation of this study by reviewing related studies on object detection and distance estimation for automated vehicles and on CNN, and, in Section 3, we explain the methodology benchmarked in the study. Next, after setting up the learning environment in Section 4, we compare and analyze the learning results of each model and estimate the distance to surrounding vehicles with the more suitable model. Finally, in Section 5, we summarize the learning results and suggest implications and future research.

Related Works
Object detection and distance estimation for automated vehicles have recently made significant progress, with research on the single use of cameras, LiDAR, and Radar sensors [10,21–24] and on the fusion of cameras and sensors [25–27], as well as on AI-based machine learning and deep learning [1,11,28].
Kehtarnavaz et al. [22] compared the mono-vision and stereo-vision camera systems used in automated vehicles. In their study, the stereo camera was found to be the more efficient vision system, based on the difference that a stereo system detects objects and estimates distance through two cameras, while a mono system uses only one camera to detect and classify objects. Radar and LiDAR, other sensors used in automated vehicles, detect and range using radio waves and light, respectively, and various studies have been conducted to apply them [23]. Nabati et al. [24] proposed the radar region proposal network (RRPN), a real-time region proposal algorithm based on radar sensors, to compensate for the slow processing speed of region proposal algorithms, which perform object detection by hypothesizing object locations. The proposed method was over 100 times faster and more accurate than the existing selective search algorithm and showed good performance even as a single-sensor method for object detection. Zaarane et al. [10] proposed a method of calculating the distance between vehicles using the vehicles' locations, geometric derivations, and specific angles (such as the camera's field-of-view angles) with a stereo camera mounted on the host vehicle, but noted that stereo cameras have difficulty measuring real three-dimensional coordinates.
While cameras are effective in detecting and classifying objects, Radar and LiDAR sensors are suitable for detecting objects and obtaining information such as range or geometric structure but have limitations in classifying objects [24]. Accordingly, various methods of fusing cameras with these sensors are being studied. Zhao et al. [26] reduced the average processing time to 66.79 ms/frame by generating fewer but more accurate object-region proposals through the fusion of 3D LiDAR and vision camera information, and the average identification accuracy for vehicles also showed excellent performance at 89.04%. Rashed et al. [27] proposed another fusion method, FuseMODNet, a real-time CNN architecture for moving object detection (MOD) in autonomous driving under low-light conditions that captures motion information from both cameras and LiDAR sensors; it demonstrated a 4.25% improvement using a new 'Dark-KITTI' dataset built for low-light environments based on the standard 'KITTI' dataset, and it was also shown to be applicable in real time at 18 fps.
The data collected through cameras, Radar, and LiDAR have been combined with AI-based machine learning and deep learning to enable object detection and distance estimation with higher accuracy and faster processing speeds, and various analysis models are continuously being developed. Masmoudi et al. [1] described and investigated image-based object detection models applicable to automated vehicles, focusing on the machine learning-based support vector machine (SVM) and the deep learning-based YOLO and single shot multibox detector (SSD), and compared their performance through simulations. The analysis confirmed that SVM is not suitable for real-time analysis due to its slow processing speed, and that YOLO is suitable for real-time processing but has lower accuracy than the multi-scale SSD, so the models should be used differently depending on the application. Stereo R-CNN [11] was proposed as a method for detecting and localizing 3D objects, outperforming state-of-the-art methods by approximately 30% in accuracy and confirming that it can be used for multi-object detection and tracking in the future as well as for general object detection.
The 1-stage detector is suitable for real-time processing because it detects and classifies in a single pass, and has recently also shown excellent accuracy. Conversely, the 2-stage detector is a traditional object detection model that has mainly been used for the analysis of already collected data due to its relatively slow processing speed but high accuracy. Given these contrasting characteristics, studies comparing the two detector types have been conducted alongside studies of each model. Maity et al. [38] comprehensively reviewed Faster R-CNN- and YOLO-based vehicle detection and tracking methods. Benjdira et al. [36] compared the performance of the CNN-based Faster R-CNN and YOLO V3 for vehicle detection in aerial images using five metrics, including precision, recall, and processing speed. Although both models showed high accuracy, they confirmed that YOLO V3 processes faster than Faster R-CNN and therefore performed better. Rani and Jamiya [43] proposed the LittleYOLO-SPP algorithm based on the YOLO v3-tiny network for real-time vehicle detection, achieving 77.44% mAP on the PASCAL VOC dataset. Avola et al. [41] introduced a multi-stream Fast R-CNN that performs multi-scale image analysis for UAV tracking and confirmed more accurate detection than existing R-CNN and Faster R-CNN models.
CNN-based object detection has been studied in various ways, such as modifying parameters or combining various analytical methods to improve accuracy and processing speed. Hu et al. [29] proposed a cascade vehicle detection method that combined CNN with methods such as LBP, Haar-like, and HOG to improve the accuracy of camera-based vehicle detection in complex weather conditions; its 97.32% recall in complex driving environments indicates that the algorithm is robust. Another fused CNN method, the hybrid CRNN-based network intrusion detection system (HCRN-NIDS) [45], is a detection method used in the field of information security that combines an RNN with a CNN, showing excellent performance in detecting both local and temporal features. Sanchez-Castro et al. [44] proposed a lean CNN, which reduces the parameters of the existing CNN, and compared a total of six models on a dataset of vehicle types. As a result, all models showed more than 80% accuracy, and the optimal model considering accuracy and processing speed was also identified. Molina-Cabello et al. [33] proposed five CNN-based object detection and vehicle type classification models for traffic surveillance; the proposed models consisted of three steps (object detection, tracking, and classification) and used five resizing region proposals. In particular, the centered scale method showed an accuracy of 87% and was found to be the best classification model. AI-based object detection and distance estimation have also focused on pedestrian protection, serving as key techniques for computer vision-based pedestrian detection (PD) and distance estimation (DE) [34,35]. Dai et al. [40] proposed a novel multi-task Fast R-CNN that simultaneously performs distance estimation and pedestrian detection using an improved ResNet-50 architecture, with a processing speed of 7 FPS and a pedestrian detection accuracy of more than 80%. In addition, Strbac et al. [42] performed camera-based stereoscopic distance measurement that completely excluded the use of a LiDAR sensor by applying YOLO V3, showed that it is useful for estimating distances within 20 m, and noted that it can be applied to many advanced driver assistance systems (ADAS) such as automatic parking.
Although object detection and distance estimation studies for automated vehicles have been conducted through cameras and LiDAR, improving camera recognition techniques has become important due to the cost and recognition distance limitations of LiDAR. Furthermore, existing research has mainly been conducted with stereo cameras, which fail to accurately measure three-dimensional coordinates. As a result, various research is needed, such as estimating three-dimensional coordinates through monocular cameras and improving monocular camera recognition techniques by applying AI. Moreover, CNN shows good performance in image processing, and various methods have been proposed depending on the application. In particular, as various models such as the 1-stage detector YOLO and the 2-stage R-CNN family are developed for object detection and distance estimation, selecting a model suitable for automated driving systems based on an accurate comparison between models will be necessary.
Accordingly, in this study, to explore the application of deep learning-based CNN as one of the alternatives for the preventive design of automated driving systems, we aimed to select the model more suitable for the detection and classification of surrounding vehicles through a comparative analysis of existing CNN models, and to contribute to the prevention of accidents in automated vehicles through the estimation of the distance between vehicles.

Methodology
The problem definition of object detection is to localize and classify an object. Traditional object detection methods are divided into region selection, feature extraction, and classification; they require engineers to manually design the feature extraction and are limited in handling complex and numerous images [46]. The development of deep learning complemented the limitations of existing methods, enabling deeper learning and improved performance. General object detection methods are divided into region proposal-based 2-stage detectors, which follow the traditional pipeline of localization and then classification, and 1-stage detectors, which perform detection and classification at once based on regression [47]. The 2-stage detectors include R-CNN [16], spatial pyramid pooling in deep convolutional networks for recognition (SPP-Net) [48], Faster R-CNN [18], etc., and the 1-stage detectors include grid CNN (G-CNN) [49], YOLO [19], SSD [50], etc. The two types of object detection models have different processing pipelines and show differences in processing speed and accuracy depending on the data used. Table 1 compares the performance of each method on the PASCAL VOC and Microsoft Common Objects in Context (MS COCO) datasets [51,52]. Although various state-of-the-art 2-stage models, such as Mask R-CNN [53], have been developed, the purpose of Mask R-CNN is image segmentation rather than object detection, so in this study we used Faster R-CNN as the comparative 2-stage detector. Among the 1-stage detectors, which also include a variety of models such as SSD, YOLO was selected as the comparative model in this study because, although SSD's computation process is easy, its accuracy is similar to that of Fast R-CNN and its processing speed is slower than that of YOLO.
Therefore, in this study, we aimed to detect and classify vehicles in road driving videos collected through black boxes using Faster R-CNN, a 2-stage detector, and YOLO, a 1-stage detector, among CNN-based state-of-the-art methods, and to estimate the distance between vehicles using the model that is more suitable for automated driving systems. In this Section, the basic structures and principles of Faster R-CNN and YOLO V2 used in the study are described.

R-CNN
R-CNN, a network model leveraging CNN for object detection, emerged as CNN showed superior performance in image classification. The flowchart of R-CNN, proposed by Girshick et al. [16], is shown in Figure 1. R-CNN detects objects through region proposal, which creates segmentations on the image using selective search and infers locations by drawing boxes where objects are likely to be. Then, feature vectors are extracted from the images through a pre-trained CNN and classified via class-specific SVM classifiers. However, R-CNN has the disadvantages that it learns in multiple steps, involves a large amount of computation, and has a slow processing speed, as all region proposals must pass through the CNN. To compensate for these disadvantages, SPP-Net [48], which extracts features by inputting the entire image into a pre-trained CNN, and Fast R-CNN [17], an end-to-end model that extracts a fixed-size feature vector through region of interest (ROI) pooling and then executes the remaining steps in a single pipeline, have been proposed.


Faster R-CNN
Faster R-CNN, an improved version of Fast R-CNN, is a model that removes selective search and performs the region proposal process through a region proposal network (RPN). Faster R-CNN performs fewer operations than the existing selective search through the RPN and improves processing speed and accuracy, as it enables the use of the GPU instead of the CPU. In addition, Faster R-CNN simplifies the entire process as an end-to-end model and has become one of the representative 2-stage detectors, in which region proposal and classification are performed sequentially. The structure of Faster R-CNN proposed by Girshick et al. [18] is shown in Figure 2.
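To illustrate the anchor mechanism used by the RPN, the following is a minimal NumPy sketch of how anchors can be generated at a single feature-map location. The base size, scales, and aspect ratios follow the defaults reported for Faster R-CNN; the function name and box convention are our own choices for illustration.

```python
import numpy as np

def generate_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Generate RPN-style anchors (x1, y1, x2, y2) centered on one feature-map cell.

    Each anchor keeps an area of (base_size * scale)^2 while its height and
    width follow the given aspect ratio (ratio = h / w).
    """
    cx = cy = base_size / 2.0
    anchors = []
    for scale in scales:
        area = float(base_size * scale) ** 2
        for ratio in ratios:
            w = np.sqrt(area / ratio)   # ratio = h / w  ->  w = sqrt(area / ratio)
            h = w * ratio
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = generate_anchors()
print(anchors.shape)  # 3 scales x 3 ratios = 9 anchors per location
```

At every sliding-window position the RPN scores each of these 9 anchors as object vs. background and regresses its coordinates.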

In addition, the proposed loss function formulation of Faster R-CNN is as follows. Faster R-CNN is optimized for a multi-task loss function, which combines the losses of classification and bounding box regression:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\,L_{reg}(t_i, t_i^*) \qquad (1)$$

According to the authors, $i$ is the anchor index in a mini-batch, $p_i$ is the predicted probability that anchor $i$ is an object, and $t_i$ is the vector representing the 4 parameterized coordinates of the predicted bounding box. The asterisk ($*$) denotes the ground-truth label, and the ratio of the region where the ground truth intersects the prediction can be calculated to determine the performance of the prediction through the intersection over union (IoU) formula. The classification loss $L_{cls}$ is the log loss over two classes (object vs. not object) and, for the regression loss, they used $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$, where $R$ is the robust loss function. The two terms are normalized by $N_{cls}$ and $N_{reg}$ and weighted by a balancing parameter $\lambda$ (default: 10).
Furthermore, to address the class imbalance problem, in which backgrounds are detected more frequently than objects, the authors assign a negative label to non-positive anchors whose IoU is lower than 0.3 for all ground-truth boxes, and randomly sample the anchors used in each mini-batch. The layer parameters of the ResNet-101 architecture, the backbone network of Faster R-CNN, are shown in Table 2 [54].
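The IoU criterion used for anchor labeling can be computed directly from box coordinates. The following is a small self-contained sketch; the function name and the (x1, y1, x2, y2) box format are our own choices.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# This pair overlaps by 1 unit area out of a union of 7, i.e. IoU = 1/7,
# so an anchor like this would receive a negative label (IoU < 0.3).
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```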

YOLO
In contrast to Faster R-CNN, which performs region proposal and classification sequentially, YOLO is a 1-stage detector model that performs both processes simultaneously. YOLO predicts bounding boxes (B) and their confidence for each grid cell after dividing the input image into S × S grid cells. Here, confidence is defined as the product of Pr(object), the probability that an object is present, and the IoU, the ratio of the intersection area between the predicted bounding box and the ground truth. Each grid cell predicts one set of class (C) probabilities regardless of the number of bounding boxes and multiplies the individual box confidence by the conditional class probability. Finally, the resulting class-specific confidences are encoded as an S × S × (B × 5 + C) tensor to check how well each prediction fits the object, and this builds the network's output. The processing pipeline of YOLO proposed by Redmon et al. [19] is shown in Figure 3. In addition, the proposed loss function formulation of YOLO is as follows. YOLO finds the bounding box responsible for the final prediction and uses the sum-squared error between the predictions and the ground truth to calculate the loss, which is composed of localization loss, confidence loss, and classification loss:

$$\begin{aligned} L ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\ &+ \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\ &+ \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2 \\ &+ \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\ &+ \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2 \qquad (2) \end{aligned}$$

The localization loss, the first and second lines of (2), measures the errors in the locations and sizes of the predicted bounding boxes, counting only the box responsible for detecting the object.
In the formula, $\mathbb{1}_{ij}^{obj}$ equals 1 when the $j$th bounding box of cell $i$ is responsible for the final prediction of the object, and 0 otherwise, and $x$, $y$, $w$, $h$ refer to the x and y coordinates, width, and height of the bounding box, respectively. The authors predicted the square roots of the width and height of the bounding box instead of the raw width and height, so that equal absolute errors weigh less for large boxes than for small boxes, and also multiplied the loss by $\lambda_{coord}$ (default: 5) to further emphasize bounding box accuracy; $\lambda_{coord}$ increases the weight of the bounding box coordinate loss. The values of $x$ and $y$ are compared by simple differences, but since $w$ and $h$ are ratios, the differences are computed on their square roots. The confidence loss, the third and fourth lines of (2), measures the objectness of the box. Since most boxes do not contain objects, there is a class imbalance problem in which backgrounds are detected more frequently than objects; to address this, the no-object loss is weighted down by a factor $\lambda_{noobj}$ with a default value of 0.5. In the formula, $C$ is the box confidence score. The classification loss, the last line of (2), is the squared error of the conditional class probabilities for each class when an object is detected; the differences between the predicted and actual values over all classes are summed over every cell $i$ in which an object is judged to be present. In the formula, $p(c)$ denotes the conditional class probability for class $c$.
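The S × S × (B × 5 + C) output encoding and the class-specific confidence described above can be illustrated with a short NumPy sketch. The memory layout assumed here (B boxes of (x, y, w, h, conf) followed by C class probabilities per cell) is one common convention, chosen purely for illustration.

```python
import numpy as np

S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes (YOLO v1 defaults)
rng = np.random.default_rng(0)
pred = rng.random((S, S, B * 5 + C))     # assumed layout: B * (x, y, w, h, conf), then C class probs

box_conf = pred[..., 4:B * 5:5]          # per-box confidence Pr(object) * IoU -> (S, S, B)
class_prob = pred[..., B * 5:]           # conditional class probabilities     -> (S, S, C)

# Class-specific confidence = box confidence * conditional class probability.
class_conf = box_conf[..., :, None] * class_prob[..., None, :]
print(class_conf.shape)  # (7, 7, 2, 20)
```

Thresholding these class-specific confidences is exactly the filtering step applied later in the study.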
YOLO has difficulty predicting small objects that appear in groups, because each grid cell can assign only one class to its predicted bounding boxes, and errors occur due to inaccurate localization. However, YOLO has driven many advances in real-time object detection by showing faster processing speeds than existing models. Subsequently, YOLO V2, which improves accuracy and enables the detection and classification of more objects by modifying the network and using fine-tuning and anchor boxes, was proposed [55]. Afterward, YOLO V3, which improves the backbone network structure used for pre-training, and YOLO V4, which combines various deep learning techniques, were released, continuing the development of improved models. Since the overall structure does not change significantly up to YOLO V3, YOLO V2 was used in this study, and the layer parameters of the DarkNet-19 architecture, the backbone network of YOLO V2, are shown in Table 3 [55].

Vehicle Detection and Distance Estimation
In this Section, we performed vehicle detection and classification using Faster R-CNN and YOLO V2, and the learning environment is shown in Table 4.

Data Collecting and Pre-Training
Before the main experiment, accuracy and processing speed were compared by training on the PASCAL VOC dataset for model performance comparison. There was no significant difference in accuracy between the training results, but YOLO V2 showed a faster processing speed; the results for real-time object detection on the VOC 2007 dataset are shown in Table 5 [51]. Driving videos of automated vehicles are the most natural data with which to detect front objects from various angles, but since such driving data are difficult to obtain, black box videos, whose data are relatively easy to collect and whose camera position is the most similar to that of automated vehicles, were obtained and used for training. The black box videos used for learning were collected on weekday daytimes under sunny weather conditions, consisting of 30 frames/s at 1920 × 1080 px, and extracted examples are shown in Figure 4. Before the model comparison, we used the VOC dataset for pre-training and, to handle the class imbalance of the dataset, which consists of 20 classes and about 20 K images, removed all classes except car, bus, and truck, which may be present on the highway.
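The class-filtering step above can be sketched as follows. The annotation records are hypothetical stand-ins: real PASCAL VOC labels are parsed from per-image XML files.

```python
# Hypothetical annotation records; actual PASCAL VOC labels come from XML files.
annotations = [
    {"image": "000001.jpg", "label": "car"},
    {"image": "000001.jpg", "label": "person"},
    {"image": "000002.jpg", "label": "bus"},
    {"image": "000003.jpg", "label": "dog"},
    {"image": "000003.jpg", "label": "truck"},
]

# Keep only the classes that can appear on a highway.
HIGHWAY_CLASSES = {"car", "bus", "truck"}
filtered = [a for a in annotations if a["label"] in HIGHWAY_CLASSES]
print([a["label"] for a in filtered])  # ['car', 'bus', 'truck']
```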

Vehicle Detection and Classification
We input the collected data into the pre-trained Faster R-CNN and YOLO V2 models and set the threshold to 0.5 and 0.9, respectively, to proceed with comparative learning. The threshold is the critical value for the class-specific confidence of the bounding boxes and anchor boxes detected during the learning process, and only bounding boxes above the threshold were output. As shown in Figures 5 and 6, both models did not detect the vehicles accurately when the threshold was set to 0.5, and when the threshold was set to 0.9, no vehicles except nearby ones could be detected. Therefore, based on these results, we finally set the threshold to the median value of 0.7 for the subsequent learning.
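The effect of the confidence threshold can be sketched with a toy example; the detection records below are hypothetical.

```python
def filter_detections(detections, threshold):
    """Keep only detections whose class-specific confidence meets the threshold."""
    return [d for d in detections if d["confidence"] >= threshold]

# Hypothetical detections from one frame.
detections = [
    {"class": "car",   "confidence": 0.92},
    {"class": "bus",   "confidence": 0.71},
    {"class": "truck", "confidence": 0.55},
]

for t in (0.5, 0.7, 0.9):
    kept = filter_detections(detections, t)
    print(t, [d["class"] for d in kept])
# A loose threshold (0.5) keeps noisy boxes, a strict one (0.9) keeps only
# near, high-confidence vehicles; 0.7 is the compromise used in the study.
```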

As a result of the model training, Faster R-CNN showed high accuracy in classifying the detected vehicles but had difficulty detecting vehicles, and its processing speed was approximately ten times slower than that of YOLO V2. YOLO V2 detected most of the vehicles in the frame and showed more than 90% accuracy for nearby vehicles, higher than Faster R-CNN. In particular, YOLO V2 is determined to be more suitable for detecting objects in real time and classifying vehicles in automated driving systems, with a processing speed of 5.5 FPS despite the low hardware performance, and thus we proceeded further with distance estimation using YOLO V2. The results of the comparative learning of the two models are shown in Figure 7.

Distance Estimation
To implement the distance estimation model using the YOLO V2 model selected through the classification learning, we used the previously collected black box videos and the video data directly, without image extraction, to implement real-time distance estimation. After feeding the images into the pre-trained YOLO V2 model, we computed the object probabilities over 80 classes using five anchor boxes and then output the classes and accuracies for objects with a confidence of 0.7 or higher. To estimate the distance to a detected object, we used camera calibration [56] to calculate the parameter values of the 2D image captured by the camera, as shown in Figure 8, and re-projected the 2D coordinate values to 3D coordinate values using the calculated parameters and a perspective transform. Next, we estimated the distance between the camera and the extracted 3D coordinate values through image warping and a pixels-per-meter scale [57].

Distance Estimation
To implement the distance estimation model using YOLO V2, selected through the classification learning above, we used the previously collected black box videos and worked on the video data directly, without image extraction, to implement real-time distance estimation. After entering the frames into the pre-trained YOLO v2 model, we computed class probabilities over 80 classes using five anchor boxes and then output the class and confidence for objects scoring at or above a threshold of 0.7. To estimate the distance to a detected object, we used camera calibration [56] to calculate the parameter values of the 2D image captured by the camera, as shown in Figure 8, and re-projected the 2D coordinate values into 3D coordinate values using the calculated parameters and a perspective transform. Next, we estimated the distance between the camera and the extracted 3D coordinate values through image warping and pixels per meter [57]. The procedure is as follows:
(i) After extracting the bounding box coordinates of the detected object, calculate the camera parameters using camera calibration.
(ii) Reshape the coordinate values to (1, 1, 2) for the matrix operation, transform them through the perspective transform, and then reshape the result to (2, 1).
(iii) Estimate and output the distance between the detected object and the camera through image warping and pixels per meter.
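The steps above can be sketched in a minimal, self-contained form. This is an illustration under stated assumptions, not the study's implementation: the 3x3 homography matrix, the pixels-per-meter scale, and the warped-view height below are hypothetical placeholder values that would, in practice, come from the camera calibration and perspective transform described in the text.

```python
import numpy as np

# Hypothetical 3x3 homography mapping road-plane image coordinates to a
# top-down (bird's-eye) view; in the study this mapping would come from
# camera calibration and the perspective transform, not from these values.
H = np.array([[1.0, -0.6,   250.0],
              [0.0,  2.5,  -900.0],
              [0.0, -0.001,   1.0]])

PPM_Y = 20.0      # assumed pixels per meter along the road in the warped view
BOTTOM_Y = 720.0  # assumed y coordinate of the ego vehicle in the warped view

def warp_point(pt, homography):
    """Apply a homography to one 2D point (the per-point operation that
    a perspective transform performs on arrays reshaped to (1, 1, 2))."""
    x, y = pt
    v = homography @ np.array([x, y, 1.0])
    return v[:2] / v[2]  # divide out the homogeneous coordinate

def estimate_distance(box):
    """Estimate the forward distance to a detected vehicle.

    box: (x1, y1, x2, y2) bounding box in image coordinates.
    The bottom-center of the box is taken as the vehicle's road contact
    point, warped to the top-down view, and converted to meters via the
    pixels-per-meter scale.
    """
    x1, y1, x2, y2 = box
    wx, wy = warp_point(((x1 + x2) / 2.0, y2), H)
    return (BOTTOM_Y - wy) / PPM_Y
```

With these placeholder values, a box lower in the frame (closer to the camera) yields a smaller estimated distance than one higher in the frame, matching the intended geometry.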
As a result, objects at a distance were difficult to detect with a threshold of 0.7, but detection and classification were performed with more than 80% accuracy for objects at relatively close range, and distance estimation was accordingly performed accurately for the detected objects. The final analysis results are shown in Figure 9.
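The confidence thresholding described here is a simple filter over the detector's output. The sketch below illustrates it with invented detection tuples (class, confidence, bounding box); these are not model outputs from the study.

```python
# Hypothetical detections in the form (class_name, confidence, box).
# In the study these would come from the pre-trained YOLO v2 model's
# 80-class output; the values here are made up for illustration.
detections = [
    ("car",   0.94, (600, 400, 700, 500)),
    ("truck", 0.72, (300, 380, 420, 470)),
    ("car",   0.41, (880, 350, 910, 380)),  # distant vehicle, low score
]

THRESHOLD = 0.7  # confidence threshold used in the study

def filter_detections(dets, threshold=THRESHOLD):
    """Keep only detections whose confidence meets the threshold; only
    these are passed on to distance estimation."""
    return [d for d in dets if d[1] >= threshold]
```

Low-confidence detections, which in the study tended to be distant vehicles, are dropped before distance estimation.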

Conclusions
To improve the recognition technique of vehicle-mounted monocular cameras for the design of preventive automated driving systems, this study employed CNN-based Faster R-CNN and YOLO V2 to recognize surrounding vehicles in black box highway driving videos and to estimate distances to surrounding vehicles with the model better suited to automated driving systems. For the analysis, black box videos recorded while driving under sunny weekday conditions were collected, and pre-training was conducted. The analysis showed that Faster R-CNN, with a mAP of 76.4, had accuracy similar to YOLO V2, with a mAP of 78.6, but had a slower processing speed (5 FPS compared to 40 FPS for YOLO V2) and difficulty detecting vehicles. As a result, YOLO V2 was determined to be the more suitable model for real-time vehicle detection and classification and was further trained to estimate the distance between vehicles. For distance estimation, we performed coordinate conversion through camera calibration and a perspective transform, set the threshold to 0.7, and carried out object detection and distance estimation, achieving more than 80% accuracy for near-distance vehicles.
In this study, the original 20 classes were reduced to three classes (car, bus, truck) likely to appear on the highway, because a class imbalance problem caused incorrect classification when pre-training with the dataset's original classes. In the future, however, it is deemed necessary to use subdivided classes according to the road environment and to consider improved networks that incorporate additional sampling or reweighting methods (e.g., focal loss [58], the gradient harmonizing mechanism [59]) to deal with class imbalance and learn more accurately. In addition, since few open-source driving videos of automated vehicles are available, front black box videos of general vehicles, which are expected to be the most similar to the front camera of an automated vehicle, were used instead; it is expected that additional videos taken from various angles (e.g., rear and side black boxes) could be used to train the model to detect and classify vehicles, and to estimate distances, more accurately. In particular, for distance estimation, each frame is extracted from the video and the projected coordinate distance to the detected object is estimated on that stationary frame; because the frames are then merged back into the video for output, the speed of the moving vehicle is not taken into account. There is therefore a limit to the precision of real-time distance estimation, and errors may occur in the estimated distance depending on the lane of the detected vehicle. In the future, it is necessary to increase the precision of distance estimation through sensor fusion by conducting experiments on vehicles equipped with both cameras and LiDAR, and additional research is needed on coordinate projection and distance estimation methods. Future research goals include collecting data from vehicles carrying both LiDAR and a black box to compare the accuracy of distance estimation.
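As a rough illustration of the focal loss [58] suggested above for handling class imbalance, the following numpy sketch down-weights easy examples by (1 - p_t)^gamma so that minority classes (e.g., bus and truck versus the dominant car class) contribute more to the loss. The defaults gamma = 2 and alpha = 0.25 are the commonly cited values, not settings from this study.

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Mean focal loss over a batch of predicted class probabilities.

    probs:   (N, C) array of softmax probabilities.
    targets: (N,) integer class labels.
    With gamma = 0 and alpha = 1 this reduces to plain cross-entropy;
    larger gamma shrinks the loss of well-classified (easy) examples.
    """
    p_t = probs[np.arange(len(targets)), targets]  # probability of true class
    return float(np.mean(-alpha * (1.0 - p_t) ** gamma * np.log(p_t)))
```

A confidently correct prediction (p_t near 1) thus incurs a much smaller loss than a poorly classified one, which is what keeps the dominant class from overwhelming training.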
This study is significant in that vehicle detection, classification, and distance estimation were performed by applying a CNN, a deep learning network, to the most basic monocular camera data for the preventive design of automated driving systems, and it is expected to contribute to the commercialization of automated driving systems by providing basic material, differentiated from LiDAR and radar sensors, for poor weather conditions (e.g., rain, snow, fog). It is believed that this will help prevent accidents in automated vehicles, and additional research is expected to yield various accident prevention alternatives, such as calculating and securing an appropriate safe distance according to vehicle type. In addition, it is expected to contribute to smooth traffic operations by quickly handling unexpected situations, such as abnormal vehicle approaches, using cameras mounted on automated vehicles as well as CCTV or drones on the highway.