Channel-Based Network for Fast Object Detection of 3D LiDAR

Abstract: Currently, there are various LiDAR-based object detection networks. In this paper, we propose a channel-based object detection network that uses LiDAR channel information. The proposed method is a 2D convolution network with data alignment processing stages and a single-step detection stage. The network consists of a channel internal convolution network, a channel external convolution network, and a detection network. First, the channel internal convolution network divides the LiDAR data by channel to find features within each channel. Second, the channel external convolution network combines the per-channel LiDAR data to find features between the channels. Finally, the detection network finds objects using the obtained features. We evaluate the proposed network using our own 16-channel LiDAR and the popular KITTI dataset. The results confirm that the proposed method detects objects quickly while maintaining performance compared with existing networks.


Introduction
Object detection involves analyzing input data received through a sensor to identify the locations and classes (pedestrians, dogs, cats, cars, etc.) of objects. Object detection technology is used in many areas, especially in research related to autonomous vehicles. In an autonomous vehicle, object detection finds objects around the vehicle (other vehicles, pedestrians, obstacles, etc.) so that the vehicle can drive. Autonomous cars are equipped with multiple sensors for object detection; among them, the sensors most commonly used are cameras and Light Detection and Ranging (LiDAR) sensors.
For a LiDAR to detect an object, the minimum shape of the object must be visible; therefore, at least 16 channels of LiDAR data are used for object detection. In general, the more channels a LiDAR has, the greater the amount of data and the higher its density, which is advantageous for object recognition because more points are expressed on the object.
There are three ways to detect objects using LiDAR data. The first method is to discretize the LiDAR data into specific areas to create voxels. Because voxels are 3D data, less information is lost in changing dimensions; however, this method must use a 3D Convolutional Neural Network (CNN). 3D data require more arithmetic operations than 2D data, making real-time use difficult for all but a few algorithms. This method is presented in Vote3Deep [1], VoxelNet [2], Vote3D [3], ORION [4], 3D FCN [5], SECOND [6], Rist et al. [7], and PV-RCNN [8]. The second method is to transform the LiDAR data into top-view (bird's-eye-view) images. The top-view method makes transforming the data into images and locating objects easy.
Electronics 2020, 9, x FOR PEER REVIEW
However, there are disadvantages: the image size increases with the navigation range, and there are many blank spaces in the transformed image, which makes it difficult to recognize objects because of the low data density. In addition, it is challenging to recognize pedestrians, which are a priority for detection. The use of the top-view method is presented in Yu et al. [9], VeloFCN [10], BirdNet [11], Kim et al. [12], Kim et al. [13], PIXOR [14], and BirdNet+ [15]. The third method is to transform the data into polar-view (or front-view) images. The polar-view method is more complex than the top-view method, but it increases the data density and can recognize pedestrians that are difficult to detect in the top-view method. Its downside is that objects often overlap because the image is formed on an angle basis; in the case of overlaps, the data at the back are disregarded, resulting in data loss. The use of the polar-view method is presented in LiSeg [16], Kwon et al. [17], and Lee et al. [18]. In some cases, a mixture of the second and third methods is used; MV3D [19] is a typical case of such mixed use.
This paper proposes a new network method that is different from those above. (1) A LiDAR has a fixed number of channels; points on the same channel are separated by distance but are continuous when viewed per channel. Using this continuity, features within each channel are extracted. (2) Viewing the LiDAR data in the form of a matrix, the data across channels are also continuous. Therefore, to obtain inter-channel features, the results of each Channel Internal Convolution Network are combined and features are extracted through the Channel External Convolution Network. (3) The class and location of an object are identified through the detection network.
This paper makes the following contributions. First, we propose an object detection network based on LiDAR channels. Because a limited number of points must cover a wide range of distances, LiDAR data contain a large amount of empty space, which adversely affects convolution. With the channel-based method, the LiDAR data are sorted, so the empty space in the data can be eliminated. Second, although the actual LiDAR data are 3D, the channel-based method allows them to be treated like 2D data, so the speed of the network is increased by using 2D convolution instead of 3D convolution.
This paper is organized as follows: Section 2 provides an overview of the existing networks, Section 3 proposes a new network, Section 4 demonstrates the excellence of the proposed network through experiments, and Section 5 provides conclusions.

Voxel-Based Network
A voxel is a volume element, representing a value on a regular grid in three-dimensional space. When a 3D space of size W × H × D is divided into voxels of size (v_w, v_h, v_d) to detect an object, the number of grid cells along each axis is W′ = W/v_w, H′ = H/v_h, and D′ = D/v_d. Input data that go into the network are created through random sampling of the points within each voxel. Next, data features are extracted by applying 3D convolutions. Finally, the class and location of a recognized object are identified using a Region Proposal Network (RPN). Figure 1 summarizes the voxel-based network structure.
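As a concrete illustration of the grid-count relation above, the following sketch divides each axis by the voxel size (the space dimensions and voxel sizes below are illustrative assumptions, not values from this paper):

```python
import math

def voxel_grid_size(space, voxel):
    """Number of grid cells (W', H', D') when a 3D space of size
    (W, H, D) is discretised into voxels of size (vw, vh, vd)."""
    return tuple(math.ceil(s / v) for s, v in zip(space, voxel))

# A hypothetical 70 m x 80 m x 4 m space with 0.25 m x 0.25 m x 0.5 m voxels:
print(voxel_grid_size((70.0, 80.0, 4.0), (0.25, 0.25, 0.5)))  # (280, 320, 8)
```

The product of the three grid counts is the number of cells the 3D convolutions must sweep, which is why voxel resolution directly drives runtime.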

Image Transformation-Based Network
The image transformation-based network recognizes objects by transforming 3D LiDAR data into 2D images. When transforming LiDAR data into images, the top- and polar-view methods are generally used; the equations for transforming the top view [12] and polar view [17] are as follows. Equations (2) and (3) are the transformation equations for the top-view method, while Equations (4) and (5) are for the polar-view method. (k, l) are the pixel coordinates of an image of size M × N, and R_x and R_y are the maximum measurement ranges of the LiDAR along the x and y axes. Finally, α and β are the constants that determine the image size in the polar view. Figure 2 summarizes the image transformation-based network structure.
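Since Equations (2)–(5) are not reproduced here, the following is an illustrative reconstruction of the two pixel mappings under common conventions (linear scaling of x/y for the top view, angle scaling for the polar view); the exact forms in [12] and [17] may differ:

```python
import math

def top_view_pixel(x, y, M, N, Rx, Ry):
    """Map a point (x, y) to pixel (k, l) on an M x N top-view image.
    Rx and Ry are the maximum ranges along the x and y axes."""
    k = int((x + Rx) / (2 * Rx) * (M - 1))
    l = int((y + Ry) / (2 * Ry) * (N - 1))
    return k, l

def polar_view_pixel(x, y, z, alpha, beta):
    """Map a point (x, y, z) to pixel (k, l) on a polar-view image.
    alpha and beta are constants that determine the image size."""
    azimuth = math.atan2(y, x)                    # horizontal angle
    elevation = math.atan2(z, math.hypot(x, y))   # vertical angle
    return int(alpha * azimuth), int(beta * elevation)

print(top_view_pixel(0.0, 0.0, 400, 400, 40.0, 40.0))  # (199, 199)
```

The polar mapping makes the overlap problem visible: two points with the same azimuth and elevation but different ranges collapse onto the same pixel, so the farther one is lost.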

LiDAR Channel-Based Network
The structure of the network proposed in this paper is shown in Figure 3 and is summarized as follows: first, the features of each channel are extracted through the channel internal convolution network after dividing the LiDAR data by channel. Second, the data are combined to obtain the features across channels through the channel external convolution network. Third, the presence of objects is verified through the detection network. Fourth and lastly, the class of the object is obtained in the class layer, and the bounding box of the object is obtained in the box layer.

Channel Internal Convolution Network
The point cloud L = {p_i,j}, p_i,j = (x_i,j, y_i,j, z_i,j, I_i,j), which is the data given by the LiDAR sensor, is shown in Figure 4a,b when the given points are arranged by channel. We first define the "channel" of the LiDAR: for example, the Velodyne HDL-64E has 64 channels (C_1–C_64), and the points in the same channel can be displayed in one column. This is summarized in Equations (6) and (7), where L is the whole LiDAR dataset, C_i represents the set of points belonging to channel i, p_i,j represents a point, and j is an index within each channel. Each point p_i,j contains location (x, y, z) and reflection (I) values.
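As a concrete illustration of Equations (6) and (7), the grouping into channel sets C_i can be sketched as follows (the input format with an explicit channel index per point is an assumption about how the sensor driver delivers data, not a detail from the paper):

```python
def arrange_by_channel(points, num_channels):
    """Arrange raw LiDAR points into per-channel sets C_i.
    Each input point is (channel_index, x, y, z, I), with
    channel_index in [0, num_channels); each C_i keeps (x, y, z, I)."""
    channels = [[] for _ in range(num_channels)]
    for ch, x, y, z, intensity in points:
        channels[ch].append((x, y, z, intensity))
    return channels

# A tiny hypothetical 16-channel scan fragment:
scan = [(0, 1.0, 0.0, 0.2, 0.5), (1, 1.1, 0.1, 0.4, 0.6), (0, 1.2, 0.0, 0.2, 0.4)]
c = arrange_by_channel(scan, 16)
print(len(c[0]), len(c[1]))  # 2 1
```

Stacking the C_i rows yields the channel-aligned matrix with depth 4 (x, y, z, I) used in the rest of the paper.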
The advantage of sorting the LiDAR data by channel is that the density of the data increases, which is advantageous for object recognition. Because the data are sorted, they can be viewed as an image with a depth of 4 (x, y, z, I). If the data can be viewed as an image, a 2D CNN can be used instead of a 3D CNN. The 2D convolution used in a 2D CNN can be expressed as follows.
The number of operations varies depending on the pixel size (i, j) and the size of the filter, but the number of operations of the 2D convolutions used in a 2D CNN can be expressed as Equation (9), where n_2 is the number of convolutions and f_* is the size of the filter in each direction. The runtime can be reduced because a 2D convolution takes less time than a 3D convolution, and the number of convolutions also differs.
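Since Equation (9) is not reproduced here, the following back-of-envelope multiply counts (a standard estimate, not the paper's exact formula, with illustrative sizes) show the scale of the gap between a 2D convolution over the channel image and a 3D convolution over a voxel grid:

```python
def conv2d_ops(h, w, c_in, c_out, fh, fw):
    """Multiplies for one 2D convolution layer with 'same' padding."""
    return h * w * c_in * c_out * fh * fw

def conv3d_ops(d, h, w, c_in, c_out, fd, fh, fw):
    """Multiplies for one 3D convolution layer with 'same' padding."""
    return d * h * w * c_in * c_out * fd * fh * fw

# Hypothetical sizes: a 16-channel image of 1800 points with a 1 x 3 filter
# versus a modest 10 x 400 x 352 voxel grid with a 3 x 3 x 3 filter.
ops2d = conv2d_ops(16, 1800, 4, 32, 1, 3)
ops3d = conv3d_ops(10, 400, 352, 4, 32, 3, 3, 3)
print(ops3d / ops2d)  # 440.0
```

Even with a coarse voxel grid, a single 3D layer needs hundreds of times more multiplies than the channel-image layer, which is the essence of the runtime argument.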
The reason for separating the LiDAR channels is that, for a LiDAR capable of 360° scanning, the points on the same channel have many similarities. For example, when a flat surface is scanned, the distance and reflection values of the LiDAR data on the same channel are similar. However, if an object is present, the distance or reflection values differ in the middle, as shown in Figure 4a. Therefore, because the data within each channel are interrelated, the data are divided by channel, and convolutions inside the channel are performed on the divided channel data. Each channel's internal convolution network consists of 12 convolution layers and 4 pooling layers; for i channels, there are 12 × i convolution layers and 4 × i pooling layers in total. We used ReLU as the activation function after each convolution. The size of the convolution mask in each convolution layer is 1 × 3. Convolutions are applied to the x, y, z, and I values of the LiDAR data on each LiDAR channel. A set comprising three convolutions and one pooling layer is repeated four times.
The channel internal convolution network produces a feature map (F_i) for each channel.
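A minimal pure-Python sketch of one convolution + ReLU step and one pooling step of the channel internal network (the kernel weights and the 1 × 2 pooling window are illustrative assumptions; in the network they are learned and fixed by the architecture, respectively):

```python
def conv1x3_relu(row, kernel, bias=0.0):
    """'Valid' 1 x 3 convolution along one channel row, followed by ReLU."""
    return [max(0.0, sum(row[j + t] * kernel[t] for t in range(3)) + bias)
            for j in range(len(row) - 2)]

def max_pool2(row):
    """1 x 2 max pooling with stride 2."""
    return [max(row[j], row[j + 1]) for j in range(0, len(row) - 1, 2)]

# One channel row of, e.g., distance values; the kernel highlights local peaks.
row = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0]
features = conv1x3_relu(row, [-1.0, 2.0, -1.0])
print(features)  # [0.0, 3.0, 0.5, 0.0]
```

Three such convolutions followed by one pooling, repeated four times, gives the 12 convolution and 4 pooling layers described above.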

Channel External Convolution Network
Although the previous section showed that the data within each channel are important, Figure 5 shows that the data between channels are just as relevant. The purpose of the channel external convolution network is to extract both channel-specific and channel-to-channel features of the data. The channel external convolution network consists of three convolution layers, as shown in Figure 6, each using a 3 × 3 convolution mask. The input is the feature map F that combines the results of the previous section (F_1, F_2, ..., F_i). At the end of the network, a new feature map F′ is created by connecting the feature map that results from the convolutions with the input feature map F through a skip connection [20]. The output is F′, and its feature map has the same size as that of F.
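The paper does not state whether the skip connection is an addition or a concatenation; since the output is said to have the same size as F, an element-wise (residual-style) addition is assumed in this sketch (feature maps are nested lists shaped [H][W][depth]):

```python
def add_skip(f_in, f_conv):
    """Element-wise addition of the input feature map and the convolved
    feature map (a residual-style skip connection); the output has the
    same size as the input."""
    return [[[a + b for a, b in zip(f_in[i][j], f_conv[i][j])]
             for j in range(len(f_in[i]))]
            for i in range(len(f_in))]

f = [[[1.0, 2.0]]]    # 1 x 1 input feature map, depth 2
f2 = [[[0.5, -1.0]]]  # 1 x 1 convolved feature map, depth 2
print(add_skip(f, f2))  # [[[1.5, 1.0]]]
```

The skip connection lets the detection network see both the raw combined channel features and the inter-channel features the three convolutions extract.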

Detection Network
The detection network determines whether an object is present by using F′, the result of the channel external convolution network, as input. The structure of the detection network follows the header network of PIXOR [14]. The detection network has four convolution layers and uses a 3 × 3 mask. We used ReLU as the activation function after each convolution. The final data obtained after the four convolution layers are used as input for the class and box layers.
The class layer outputs a 2-channel feature map containing the class and score (c, s) of the object concerned. The box layer outputs a 5-channel feature map containing the position of the object, the size of the box, and the angle of the object (h, w, l, d, θ). The detection network is illustrated in Figure 7.
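Reading the two head outputs can be sketched as follows (the per-cell map layout and the score threshold are illustrative assumptions, not details from the paper):

```python
def collect_detections(class_map, box_map, score_thresh=0.5):
    """Scan the 2-channel class map (c, s) and the 5-channel box map
    (h, w, l, d, theta) and keep cells whose score passes the threshold."""
    detections = []
    for i, row in enumerate(class_map):
        for j, (c, s) in enumerate(row):
            if s >= score_thresh:
                h, w, l, d, theta = box_map[i][j]
                detections.append({"cell": (i, j), "class": c, "score": s,
                                   "box": (h, w, l, d, theta)})
    return detections

# A 1 x 2 toy output: one confident vehicle cell, one background cell.
class_map = [[(1, 0.9), (0, 0.1)]]
box_map = [[(1.6, 1.8, 4.2, 10.0, 0.1), (0, 0, 0, 0, 0)]]
print(len(collect_detections(class_map, box_map)))  # 1
```

In practice a non-maximum suppression step would typically follow to merge neighbouring cells that fire on the same object.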

Loss Function
The loss function defines how well a network deals with the training data; in general, the learning objective of a network is to minimize the value of the loss function. The proposed network's loss function is E_total = E_class + E_box, where E_total represents the loss function of the entire network, E_class is the loss function of the class layer, and E_box is the loss function of the box layer. Each loss function uses the cross-entropy [21], E = −Σ_k t_k log y_k, where t_k is the ground-truth (GT) value and y_k is the output of the network.
The entire network is trained end-to-end using the above loss function with Adam optimization [22].
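The cross-entropy and the combined loss above can be written out as follows (the unweighted sum E_total = E_class + E_box is an assumption; the original may weight the two terms):

```python
import math

def cross_entropy(t, y, eps=1e-12):
    """E = -sum_k t_k * log(y_k), with t the ground-truth (GT)
    distribution and y the network output; eps avoids log(0)."""
    return -sum(tk * math.log(yk + eps) for tk, yk in zip(t, y))

def total_loss(t_class, y_class, t_box, y_box):
    """E_total = E_class + E_box (unweighted sum assumed)."""
    return cross_entropy(t_class, y_class) + cross_entropy(t_box, y_box)

# A confident, correct class prediction gives a near-zero class loss:
print(round(cross_entropy([1.0, 0.0], [0.99, 0.01]), 4))  # 0.0101
```

Minimizing E_total therefore pushes the class layer toward the GT class and the box layer toward the GT box simultaneously.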

Experiment
This section evaluates the performance of the proposed algorithm. The experiment carried out is as follows: First, the existing object-detection algorithms are compared to the proposed algorithm using the 16-channel LiDAR. Second, the existing object-detection algorithms are compared to the proposed algorithm using the KITTI dataset [23] (64-channel LiDAR).

Object Detection Test Using 16-Channel LiDAR
The LiDAR used in the experiment is a Velodyne VLP-16, fitted to the vehicle shown in Figure 8. The operating system of the personal computer (PC) is Ubuntu, the experiment was conducted in the ROS Kinetic environment, the central processing unit was an i5-6600 (3.30 GHz), and the graphics processing unit was a GeForce GTX 1080 Ti. The data obtained while driving the vehicle around the campus were used for the experiment. We trained the algorithm using 2276 frames of training data and tested it on 1138 frames of test data.

Object Detection Test Using 16-Channel LiDAR
The object detection results were compared using existing algorithms (the voxel, top-view, and polar-view methods) and the proposed network. Studies [2], [12], and [17] were used for the voxel, top-view, and polar-view methods, respectively, and [19] was used as a method that mixes two or more approaches. The experiment results are shown in Table 1. We can confirm that the proposed method has a shorter runtime than the existing methods; as explained in the previous section, the number of convolution operations is reduced, so the runtime decreases. Additionally, the differences in vehicle recognition were small, but the results for pedestrians differed noticeably. Because of the low density of the LiDAR data, there are many empty spaces, and because pedestrians occupy only a small portion of the scene, we believe that few voxels contain pedestrians, leading to low performance. The relatively dense polar-view transformation method and the proposed method were found to outperform the voxel method; however, in some circumstances, the polar view was unable to detect the object. The top-view transformation method is excluded because it cannot detect pedestrians. Figure 9 illustrates the object detection results for some of the networks used in the experiment.

Object Detection Test Using 64-Channel LiDAR
The KITTI datasets [23] were used for the experiment of the 64-channel LiDAR. The LiDAR sensor used in the KITTI datasets is Velodyne HDL-64E. In the experiment results, two classes (vehicle, pedestrian) represent the results of the KITTI benchmark. The runtime test results used the PC used in Section 4.1.1.

Table 2 shows the KITTI benchmark results. "Type" is the method used by each network. The KITTI benchmark experiment was conducted on two classes: vehicle and pedestrian. First, in the case of the vehicle class, the proposed network shows average performance among the networks in Table 2.
In detail, it performs better than the top-view method and worse than the voxel method. Next, in the case of pedestrians, the proposed network also showed average performance.

The runtime experiment was conducted with the PC in Section 4.1.1. As expected, the data transformation methods take longer because of the transformation process, but we can confirm that the proposed method has the shortest runtime.

Conclusions
In this paper, we proposed a new object detection network based on LiDAR channels. The LiDAR data were aligned in a matrix to eliminate the empty space in the data. Features were extracted from each channel of the LiDAR using the Channel Internal Convolution Network, and the Channel External Convolution Network was used to extract features between the channel data. Finally, the class and box of each object were detected by the detection network using the two features previously obtained.
Instead of 3D convolution, which requires a large amount of computation, 2D convolution was used to reduce computation. The experimental results showed that the proposed network has medium performance in terms of classification accuracy but higher performance in terms of runtime.