A Convolutional Neural Network-Based End-to-End Self-Driving Using LiDAR and Camera Fusion: Analysis Perspectives in a Real-World Environment

Abstract: In this paper, we develop an end-to-end autonomous driving system based on a 2D LiDAR sensor and a camera sensor that predicts the control values of the vehicle directly from the input data, instead of modeling rule-based autonomous driving. Unlike many studies that use simulated data, we built the end-to-end autonomous driving algorithm with data obtained from real driving and analyzed its performance. Based on data obtained from an actual urban driving environment, end-to-end autonomous driving was possible in unstructured situations such as a traffic signal by predicting the vehicle control values with a convolutional neural network. In addition, this paper addresses the data imbalance problem by eliminating redundant frames recorded while stopping and driving, which improves self-driving performance. Finally, we verified through the activation map how the network predicts the longitudinal and lateral control values by recognizing the traffic facilities in the driving environment. Experiments and analysis are presented to show the validity of the proposed algorithm.


Introduction
A self-driving car is a system that recognizes the driving environment, generates a path, and drives the vehicle itself by using environment-aware sensors such as camera, radar, LiDAR, and GPS. Self-driving cars generally consist of three sub-systems (recognition, decision, and control), mirroring human driving, and each sub-system replaces part of the driver's role [1].
Driving environment recognition covers the detection of dynamic and static objects, lane detection, and vehicle location estimation, based on sensors that capture information about the driving environment. The decision sub-system determines the vehicle trajectory, for example by generating routes to the destination and avoiding obstacles [2]. Longitudinal and lateral controls are then performed so that the vehicle reliably follows the target control values determined by recognition and decision [3]. A conventional self-driving system developed separately for each module is easy to debug and troubleshoot in the event of a defect or abnormal situation.
However, autonomous driving research faces development restrictions in actual complex driving environments [4,5]. Autonomous driving takes place not only on highways but also in complex urban areas, with many variables such as traffic lights, surrounding vehicles, motorcycles, pedestrians, road structures, and unpredictable conditions, and recognizing the various objects in these complex road environments is still a difficult problem [6,7,8,9]. In addition, many issues remain unresolved in developing an optimal decision algorithm that considers all these complex environments. In other words, the conventional rule-based planning approach must recognize all obstacles affecting safe driving and make accurate situational decisions, which makes it difficult to cover all possible situations on the road.
Unlike previous studies, which consist of perception, decision, and control, we proposed a convolutional neural network (CNN) that takes real-world LiDAR and camera data as inputs and provides the target longitudinal/lateral speed of the vehicle as output. Our proposed method relies entirely on a CNN to produce the output data directly from the input data. Unlike the previous camera-based end-to-end (E2E) algorithm, which performed only lateral control [10], our proposed algorithm performed longitudinal and lateral control simultaneously using a camera and a LiDAR sensor that can provide depth values. The output of our proposed network was the vehicle's longitudinal/lateral speed targets for 250, 500, 750, 1000, and 1250 ms from the present time. Additionally, unlike many previous end-to-end self-driving studies, we proposed end-to-end self-driving that complies with road traffic laws by acquiring real-world data in complex urban environments that include traffic lights. After training the proposed algorithm with the real-road database, E2E autonomous driving was confirmed in complex urban environments, such as traffic lights and intersections, and we used feature maps to check the validity of the proposed algorithm.
This paper is structured as follows. Section 2 explains related works, how we built our experimental environment, the data set, and the convolutional neural network. The results and their analysis are shown in Sections 3 and 4, respectively, and Section 5 presents the conclusion and future research.

Convolutional Neural Network
Since the invention of AlexNet in 2012, many deep learning-based approaches such as CNNs have been applied to computer vision [11]. A CNN has two parts: a feature extractor, consisting of convolution and pooling layers, and fully connected layers that perform classification or regression. Earlier machine learning-based computer vision algorithms extracted features from an image using hand-crafted descriptors such as HOG and SIFT and then obtained results by applying algorithms such as Support Vector Machines to the extracted features. In contrast, a CNN learns its features in the convolution layers and produces the results through the fully connected layers. For this reason, a CNN is also called an E2E learning technology, and it is used in various computer vision areas such as object detection, tracking, and semantic segmentation [12].

End-to-End Self-Driving
Many studies have been conducted on E2E autonomous driving, which fully uses the E2E characteristics of CNNs to calculate the final output speed from input data without the detailed algorithms normally needed to construct autonomous driving. Mariusz et al. proposed a CNN structure called PilotNet, the start of CNN-based E2E autonomous driving. PilotNet uses three camera sensors mounted on the front of the vehicle, and it performs lateral control of the vehicle [10]. However, PilotNet implemented only the end-to-end lane-keeping function using only monocular cameras, and inter-vehicle distance was maintained with a classical control method, not a learning method. Chen et al. proposed a study that uses distance information in autonomous driving to effectively learn drivers' driving patterns and produce a deep learning model that enables stable longitudinal and lateral control [13]. They released a data set that includes LiDAR sensor data, camera sensor data, and labels for longitudinal/lateral control. Based on this data set, they constructed a DNN + LSTM deep learning model that enables longitudinal control using distance information from the LiDAR sensor and image information from the camera sensor. However, to utilize 3D point cloud data, an additional deep learning network for LiDAR data, Point Cloud Mapping or PointNet, was used, which required a large number of network parameters [14]. Navarro et al. proposed sensor fusion-based E2E self-driving using real-world acquired data [15], but they did not analyze how their algorithm works in real-world situations such as an urban traffic signal. Huch et al. suggested V2X-based E2E self-driving for platooning [16]. Prashanth et al. proposed JacintoNet, which was implemented on the Texas Instruments (TI) TDA2x System on Chip for real-time operation [17]. However, it used simulation data and implemented autonomous driving only for lane keeping. Yu et al. proposed end-to-end self-driving capable of longitudinal/lateral control using a monocular camera [18]. They acquired large-scale data sets and developed end-to-end self-driving in a real-road environment, but realized self-driving only in a simple environment such as a highway. Sallab et al. proposed a reinforcement learning-based end-to-end self-driving algorithm using a monocular camera, but it was applied only in a lane-keeping system [19]. Table 1 shows the comparison of previous end-to-end self-driving approaches.
Table 1. Comparison of End-to-End self-driving approaches.

Explainable End-to-End Self-Driving System
Zhou et al. introduced the class activation map to analyze the region within an image that influenced the result when the image was classified [20]. The class activation map treats the weights of the last fully connected layer as importance values and highlights the part of the image that most influenced the result of the network.
Mariusz et al. expanded on the work of Zhou et al. Using a similar method, they explained which part of the input driving image the end-to-end self-driving model focuses on when making lateral control decisions. They showed that the network concentrates on the road environment, although a reliability issue remained because E2E autonomous driving is not accountable [21].
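For reference, the class activation map of Zhou et al. [20] can be written compactly. The symbols below follow the usual statement of that method and are not notation from this paper: $f_k(x, y)$ is the activation of feature map $k$ of the last convolution layer at spatial position $(x, y)$, $w_k^c$ is the weight connecting feature map $k$ to output $c$ after global average pooling, and $S_c$ is the score of output $c$ (the pooling normalization constant is omitted).

```latex
M_c(x, y) = \sum_k w_k^c \, f_k(x, y),
\qquad
S_c = \sum_k w_k^c \sum_{x, y} f_k(x, y) = \sum_{x, y} M_c(x, y)
```

Because the score is the sum of the map over all positions, large values of $M_c(x, y)$ mark the image regions that contributed most to the output.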

Experimental Setup
In this paper, we generated the data set using information from the environment-aware sensors and in-vehicle sensors mounted on the vehicle, and we trained the E2E self-driving network with it.
The hardware development environment used in this paper is shown in Figure 1. We used Hyundai Motor Ionic EV vehicles equipped with one camera and two LiDAR sensors, a VCU that controls information about the vehicle, and a workstation for E2E self-driving algorithms and other data-logging programs.
The SW development environment was set up as follows: CUDA 9.1 and cuDNN 7.1 on Ubuntu 16.04 LTS, and Python with Keras 2.2.1. The details of the HW setup are shown in Table 2. Table 3 gives the detailed specifications of each sensor used.

Data Set
For this study, we constructed a data set of about 150,000 frames from camera and LiDAR sensors and vehicle information by driving 300 km in Seoul and Gyeonggi-do, Korea. While analyzing the data, we found that the continuity between frames varied with speed. As shown in Figure 2, the characteristics of consecutive images varied with the vehicle's speed: the images changed little from frame to frame at low speeds and changed more at high speeds.

We summarized the amount of data for each vehicle speed section in Figure 3, which shows the number of acquired 2D LiDAR and camera frames by vehicle speed. The data were unbalanced because near-identical frames were acquired in succession when the vehicle was stationary or driving at low speed, whereas frames with significant changes between them were acquired when driving at high speed. These duplicated data needed to be eliminated.
Thus, in this paper, data imbalances with respect to vehicle speed were adjusted using down-sampling. We down-sampled to the amount of data in the 10-30-kph section, which had the fewest data. Since random sampling generally maintains the distribution of the original data, we randomly extracted 21,426 frames (the amount of data in the 10-30-kph range) from the data in each vehicle speed range. However, because the 0-10-kph data contain many overlapping frames, down-sampling them would leave too few frames representing different situations, and the amount of meaningful data for network training would be insufficient. Therefore, at speeds below 10 kph, we did not down-sample and used the original data for network training. Finally, the composition of the data used for the training, validation, and testing of the neural network model is shown in Table 4.
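The per-speed-range down-sampling described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the `(frame_id, speed_kph)` frame representation, the function names, and every speed bin other than 0-10 and 10-30 kph are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical speed bins in kph; only 0-10 and 10-30 are named in the text.
SPEED_BINS = [(0, 10), (10, 30), (30, 50), (50, float("inf"))]


def speed_bin(speed_kph):
    """Return the (low, high) bin that contains the given speed."""
    for low, high in SPEED_BINS:
        if low <= speed_kph < high:
            return (low, high)
    raise ValueError(f"speed outside all bins: {speed_kph}")


def downsample_by_speed(frames, target, keep_all_below=10.0, seed=0):
    """Randomly keep at most `target` frames per speed bin.

    `frames` is a list of (frame_id, speed_kph) pairs. Bins that lie
    entirely below `keep_all_below` are kept whole, mirroring the
    decision not to down-sample the low-speed data.
    """
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for frame in frames:
        by_bin[speed_bin(frame[1])].append(frame)

    kept = []
    for (low, high), members in sorted(by_bin.items()):
        if high <= keep_all_below or len(members) <= target:
            kept.extend(members)  # keep every frame in this bin
        else:
            kept.extend(rng.sample(members, target))  # random subset
    return kept
```

With `target=21426`, each bin above 10 kph would be reduced to the size of the 10-30-kph bin, while the 0-10-kph bin would pass through unchanged.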

Convolutional Neural Network for End-to-End Self-Driving
In this paper, instead of the 3D LiDAR used in Chen et al. [13], we proposed an E2E self-driving algorithm based on a CNN that predicts the longitudinal and lateral control values of the vehicle by training on point cloud data acquired from 2D LiDAR sensors and image data acquired from a camera. Figure 4 shows the flow chart of the proposed algorithm. The CNN that performs E2E self-driving takes camera and LiDAR data as inputs, produces vehicle speed and angle as outputs, and updates its weights and biases by comparing these outputs with human driving data.

Data Preprocessing
We used a camera and LiDAR sensors to construct an E2E self-driving model. To use the data from both sensors in our proposed algorithm, we needed to convert the original sensor data to suit the proposed network structure. Figure 5 shows the data preprocessing process. Each type of data was pre-processed into an appropriate form for the system using resizing, mapping, and so on.
First, image data from the monocular camera acquired in the driving environment were resized from 640 × 900 to 299 × 299 resolution for use in the pre-trained model, Inception v3. We used a size of 299 × 299, which is larger than the 224 × 224 size used in typical pre-trained CNN models, so that traffic information such as traffic lights, whose visibility depends on resolution, could be fully reflected in learning.
Then, we encoded the point cloud data acquired from the LiDAR sensor in the driving environment into a two-dimensional image with three channels and used it as training data for the CNN. Unlike many LiDARs, the Lux2010 LiDAR gave us 2D information instead of 3D information, and we used an ImageNet-based CNN. Equation (1) indicates how point cloud data acquired from 2D LiDAR sensors were transformed into RGB channels by distance.


$$\text{Distance} = \sqrt{P_x^2 + P_y^2}$$
$$\text{if } \text{Distance} < 20: \quad \text{Channel}_{\text{Red}}[P_x][P_y] = \min\left(255,\ \frac{255}{60} \times \text{Distance}\right) \tag{1}$$
Here, $P_x$ and $P_y$ represent the lateral and longitudinal coordinates of the point cloud data relative to the vehicle, respectively. The distance was calculated using Equation (1), the points were separated into the three RGB channels, and a value proportional to the distance was assigned within each channel.
Figure 6 shows the encoding of the point cloud according to the method used in Equation (1). It was resized to 224 × 224 and used as input data for the pre-trained network, ResNet50.
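The distance-based encoding can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: only the red-channel rule for distances below 20 and the 255/60 scaling come from the text. The green and blue distance bands, the 0.5 m grid cell, the placement of the vehicle at the bottom centre of the image, and the function name are hypothetical choices made for illustration.

```python
import math

# Illustrative values only; none of these four constants is fixed by the text
# except RED_LIMIT, which follows the "Distance < 20" condition.
IMAGE_SIZE = 224      # output image is IMAGE_SIZE x IMAGE_SIZE pixels
CELL_M = 0.5          # assumed metres covered by one pixel
RED_LIMIT = 20.0      # from the text: red channel when Distance < 20
GREEN_LIMIT = 40.0    # hypothetical upper edge of the green band


def encode_points(points):
    """Encode 2D LiDAR points into an RGB image.

    `points` holds (p_x, p_y) pairs: lateral and longitudinal offsets
    from the vehicle. The result is image[row][col] = [r, g, b].
    """
    image = [[[0, 0, 0] for _ in range(IMAGE_SIZE)] for _ in range(IMAGE_SIZE)]
    for p_x, p_y in points:
        distance = math.hypot(p_x, p_y)
        value = min(255, int(255 / 60 * distance))  # scaling from Equation (1)
        if distance < RED_LIMIT:
            channel = 0   # red (from the text)
        elif distance < GREEN_LIMIT:
            channel = 1   # green (assumed band)
        else:
            channel = 2   # blue (assumed band)
        # Assumed layout: vehicle at the bottom centre, forward is up.
        col = math.floor(p_x / CELL_M) + IMAGE_SIZE // 2
        row = IMAGE_SIZE - 1 - math.floor(p_y / CELL_M)
        if 0 <= row < IMAGE_SIZE and 0 <= col < IMAGE_SIZE:
            image[row][col][channel] = value
    return image
```

Points that fall outside the assumed grid are silently dropped, so the cell size effectively sets the maximum range that the image can represent.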

Finally, we constructed a label for training the vehicle's longitudinal/lateral control values based on data acquired from the 2D LiDAR sensors and the front camera sensor. To train the vehicle control values, which are the output of E2E self-driving, the label used the heading angle and the velocity, which represent the lateral and longitudinal variations of the vehicle, respectively. The label data were configured as a 1 × 10 vector covering 250 ms to 1250 ms at 250-ms intervals, to predict future values as well as current control values. The label values up to five frames ahead were taken from the person's actual driving values between the current frame and five frames later.
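The label construction can be illustrated with a short sketch. It assumes one logged sample every 250 ms, so that five frames ahead corresponds to 1250 ms, and it assumes the 1 × 10 vector is laid out as five velocities followed by five heading angles. Both the layout and the function name are assumptions rather than details given in the text.

```python
HORIZONS = 5  # 250, 500, 750, 1000, and 1250 ms ahead


def build_label(velocities, angles, index, horizons=HORIZONS):
    """Return a 10-element label for the frame at `index`.

    Assumes one sample per 250 ms. The layout (five future velocities,
    then five future heading angles) is an illustrative choice.
    Returns None when the log does not extend far enough ahead.
    """
    last = index + horizons
    if index < 0 or last >= len(velocities) or last >= len(angles):
        return None
    future_v = velocities[index + 1:last + 1]
    future_a = angles[index + 1:last + 1]
    return list(future_v) + list(future_a)
```

Frames near the end of a recording have no complete five-frame future, so a data loader built on this sketch would skip them.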

Proposed Network Architecture
The proposed E2E self-driving network had an input structure with two separate branches: a 299 × 299 image acquired from the camera sensor and a 224 × 224 image encoded from the LiDAR data by the preprocessing algorithm. The Inception V3 [22] model was used so that the 299 × 299 camera images could be fed to the pre-trained model without changing their size, and ResNet50 [23] was used for the 224 × 224 LiDAR images. Figure 7 shows the network architecture used for our proposed algorithm.
Features were extracted from the camera and LiDAR data using the feature extraction layers of the respective pre-trained models (Inception v3, ResNet50), and the two kinds of features were then concatenated. The combined features were passed through fully connected layers to a regression layer that predicted the desired velocity and angle. The details of the fully connected layers are shown in Table 5.

Results
We demonstrated the validity of the proposed method by applying the data sets built in Sections 2.2 and 2.3 to the E2E network architecture proposed in Section 2.4. We trained on two sets of data, the original and the down-sampled data, to demonstrate the effectiveness of down-sampling unbalanced data. Quantitative performance indicators for the predicted results were derived through Expression (2). Because the criterion for accurate driving is ambiguous, the indicator measures the difference between human driving data and the output of the proposed E2E algorithm.
Tables 6 and 7 show the prediction results of the networks trained with the original data set and the down-sampled data set, evaluated on the same test data. The velocity prediction was verified separately for low-speed (<10 kph) and high-speed (≥10 kph) sections, and the heading angle prediction was verified separately for straight (<5°) and curved (≥5°) roads.
Table 8 shows the differences in estimation performance of the E2E self-driving model in the low-speed section between the original data set and the down-sampled data set. In the low-speed section, the velocity predictions for each frame improved as a result of learning with the down-sampled data.
To confirm the stable operation of the proposed algorithm, we compared the speed at which a person drove with the output of the E2E self-driving model for one driving scenario from the test data. The comparison results are shown in Figure 8.
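The split evaluation can be sketched as below. The sketch assumes that the gap is a mean absolute difference between human and predicted speeds, which is only one plausible reading of the indicator and should not be taken as the paper's Expression (2). The function names are hypothetical, and the 10-kph threshold follows the low/high-speed split described above.

```python
def mean_abs_gap(pairs):
    """Mean absolute difference over (human, predicted) pairs, or None if empty."""
    pairs = list(pairs)
    if not pairs:
        return None
    return sum(abs(h - p) for h, p in pairs) / len(pairs)


def gap_by_speed_section(human_kph, predicted_kph, threshold_kph=10.0):
    """Report the gap separately for low- and high-speed frames.

    Frames are assigned to a section by the human-driven speed.
    """
    pairs = list(zip(human_kph, predicted_kph))
    low = [(h, p) for h, p in pairs if h < threshold_kph]
    high = [(h, p) for h, p in pairs if h >= threshold_kph]
    return {"low": mean_abs_gap(low), "high": mean_abs_gap(high)}
```

The same pattern would apply to the heading angle with a 5° threshold separating straight and curved roads.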

Discussion
As mentioned in Section 3, we defined and used the gap rate to represent the quantitative performance; however, a large gap rate does not mean that the E2E self-driving was wrong. If the value estimated by the E2E self-driving was within the permitted driving range on the actual road, it was a correct operation, even if it differed from the actual person's driving. In this paper, the gap rate was used only to verify that learning had succeeded. Therefore, we further checked the correct behavior of E2E self-driving using the activation map.
The advantage of E2E self-driving is that it performs autonomous driving using only input data, without intermediate processing, unlike conventional rule-based self-driving, which recognizes all objects and generates a driving path. To check the behavior of this E2E self-driving, we used the activation map to visualize the areas that were activated in the proposed CNN while driving. Figure 9 shows activation maps obtained from predictions of the trained end-to-end self-driving model. The left side of each figure is the activation map of the image data, and the right side is the activation map of the LiDAR data.
Activation maps are shown for a total of four situations: straight driving, driving with high curvature, stopping at an intersection without forward vehicles, and stopping at an intersection with forward vehicles. As shown in Figure 9a,b, when driving in a straight section, we confirmed that the center of the road in the camera image and the forward portion of the LiDAR data were active in the network. If there were no forward vehicles in the intersection section and only traffic lights existed, the activation map was concentrated in the area of the traffic lights, as shown in Figure 9c,d. By verifying the activation map of the CNN in this way, we checked that the proposed algorithm was effective for self-driving.
The advantage of E2E self-driving is that it performs autonomous driving without intermediate processing, using only input data, unlike recognizing all objects, and generating a driving path in conventional rule-based self-driving. To check the behavior of this E2E self-driving, we visualized the area that was activated by the proposed CNN while driving using the activation map. Figure 9 is an activation map resulting from a prediction result of the learned End-to-End self-driving model. The left side of each figure is the activation map of the image data, and the right side is the activation map of the LiDAR data.
Activation maps are shown for four situations: straight driving, driving through a high-curvature section, stopping at an intersection with no forward vehicles, and stopping at an intersection with forward vehicles. As shown in Figure 9a,b, when driving in a straight section, the center of the road in the camera image and the forward portion of the LiDAR scan were active in the network. When there were no forward vehicles at the intersection and only traffic lights were present, the activation was concentrated around the traffic lights (Figure 9c,d). From the activation maps of the CNN, we confirmed that the proposed algorithm was effective for self-driving.
In addition, as shown in Figure 10, the speed of E2E self-driving did not differ much from the speed of human driving, and the proposed algorithm drove safely in general road conditions. Interestingly, in Figure 8, frames 1975 to 2251 showed a large difference between the driver and CNN results. These frames correspond to situations in which the traffic light changed from green to orange, as shown in Figure 10. When the signal changed, the human driver continued without slowing down, whereas the proposed algorithm reduced its speed in response to the traffic light.
This does not mean that the CNN is safer; rather, it shows that the proposed algorithm operates by recognizing traffic lights. In other words, E2E autonomous driving can not only perform longitudinal/lateral control that maintains the lane and the distance from the vehicle ahead, but also operate in compliance with other traffic rules, such as traffic signals and speed limits.
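Frame ranges in which the driver and the CNN disagree, such as the one discussed above, can be located automatically from the two speed logs. The following is a minimal sketch under stated assumptions: both logs are per-frame speeds in kph, and the function name `divergent_windows`, the threshold, and the minimum run length are hypothetical choices, not values taken from the paper.

```python
import numpy as np

def divergent_windows(driver_kph, cnn_kph, threshold_kph, min_len=1):
    """Return (start, end) frame pairs, end inclusive, where |driver - cnn| > threshold."""
    driver = np.asarray(driver_kph, dtype=float)
    cnn = np.asarray(cnn_kph, dtype=float)
    mask = np.abs(driver - cnn) > threshold_kph
    # Pad with False so every run has both a rising and a falling edge.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1) - 1
    # Discard runs shorter than min_len frames.
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s + 1 >= min_len]
```

Raising `min_len` suppresses single-frame spikes, so that only sustained disagreements are reported for inspection against the recorded video.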

Conclusions
In this paper, we proposed E2E autonomous driving for general urban environments using data from 2D LiDAR and camera sensors. The proposed method drives autonomously by learning depth information about the driving environment from the 2D LiDAR sensor and by learning from camera images to recognize driving environment information such as traffic signals. Unlike previous studies, we implemented an end-to-end self-driving algorithm trained on data acquired on actual roads, so that it can drive in real road traffic. Namely, the proposed algorithm could (1) enable longitudinal/lateral self-driving with an E2E method and (2) deal with complex situations such as traffic lights in urban areas.
For quantitative performance evaluation of the proposed model, we defined a gap rate that represents the difference between the E2E self-driving data and the human driving data. The gap rate was 14.61 kph for the original data set and 5.07 kph for the down-sampled data set. Furthermore, the model predicted values 1250 ms ahead in addition to the currently required values, confirming that prediction of future situations was possible.
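As a rough illustration of these two evaluation ideas, the sketch below computes a gap rate and builds future-prediction targets. The formula for the gap rate is not restated in this section, so the code assumes a mean absolute speed difference in kph. The function names and the frame period used in the usage note are likewise hypothetical.

```python
import numpy as np

def gap_rate(driver_kph, model_kph):
    """Mean absolute speed difference in kph (assumed definition of the gap rate)."""
    driver = np.asarray(driver_kph, dtype=float)
    model = np.asarray(model_kph, dtype=float)
    if driver.shape != model.shape:
        raise ValueError("speed logs must have the same length")
    return float(np.mean(np.abs(driver - model)))

def future_targets(values, horizon_ms, frame_period_ms):
    """Pair each frame with the value recorded horizon_ms later.

    Returns (current, future) arrays of equal length.
    """
    values = np.asarray(values, dtype=float)
    # Convert the time horizon into a whole number of frames.
    shift = int(round(horizon_ms / frame_period_ms))
    if shift == 0:
        return values, values
    return values[:-shift], values[shift:]
```

For example, with a hypothetical frame period of 250 ms, `future_targets(speeds, 1250, 250)` pairs every frame with the speed recorded five frames later, which is one way to form targets for the 1250 ms horizon.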
The data used in this paper covered various driving situations, such as intersections, stops at traffic lights, and sharp curves, and we used activation maps to check the behavior of E2E autonomous driving in these environments. In the traffic light and intersection sections, the largest activation under stop-and-go conditions was observed near the traffic lights in the input image. Nevertheless, because of the black-box nature of deep learning, the proposed algorithm has the limitation that it was validated only experimentally, without mathematical analysis.
In future work, we will build a larger database and extend the model configuration to enable E2E autonomous driving in a wider variety of environments, incorporating additional sensor information such as around-view monitor cameras and map information, and we will improve the self-driving model to ensure safe driving to a destination. Additionally, when acquiring data for E2E self-driving, we will adjust the camera/LiDAR frame ratio according to the situation so that balanced data can be acquired and used for training.
