A Preliminary Study of Deep Learning Sensor Fusion for Pedestrian Detection

Most pedestrian detection methods focus on bounding boxes produced by fusing RGB images with lidar. These outputs do not relate to how the human eye perceives objects in the real world. Furthermore, lidar and vision can have difficulty detecting pedestrians in cluttered environments, a limitation that radar can help overcome. The motivation of this work is therefore to explore, as a preliminary step, the feasibility of fusing lidar, radar, and RGB for pedestrian detection, potentially for use in autonomous driving, using a fully connected convolutional neural network architecture for multimodal sensors. The core of the network is based on SegNet, a pixel-wise semantic segmentation network. Lidar and radar were incorporated by transforming their 3D pointclouds into 2D gray images with 16-bit depth, and RGB images were incorporated with three channels. The proposed architecture uses a single SegNet for each sensor reading; the outputs are then fed to a fully connected neural network that fuses the three sensor modalities, and an upsampling network recovers the fused data. Additionally, a custom dataset of 60 images was used to train the architecture, with an additional 10 images for evaluation and 10 for testing, for a total of 80 images. The experimental results show a training mean pixel accuracy of 99.7% and a training mean intersection over union (IoU) of 99.5%, while the testing mean IoU was 94.4% and the testing pixel accuracy was 96.2%. These metrics demonstrate the effectiveness of semantic segmentation for pedestrian detection with three sensor modalities. Despite some overfitting during experimentation, the model performed well in detecting people in test mode. It is worth emphasizing that the focus of this work is to show that the method is feasible, as it works regardless of the size of the dataset, although a bigger dataset would be necessary for more appropriate training. The method offers the advantage of detecting pedestrians as the human eye does, resulting in less ambiguity. Finally, this work also proposes an extrinsic calibration method for sensor alignment between radar and lidar based on singular value decomposition.


Introduction
Deep learning (DL) has experienced rapid growth in recent years due to its universal learning approach, robustness, generalization, and scalability [1]. In the area of DL, recurrent neural networks (RNN) and convolutional neural networks (CNN) are the most commonly used, with CNN being the most prevalent in many applications, including image classification, speech recognition, obstacle detection, computer vision, and sensor fusion, among others.
Moreover, autonomous driving cars (ADC) constitute a constantly progressing research field, and the direction towards fully automated public cars will likely continue. One earlier approach did not rely on extracting features from a large dataset using a CNN; instead, it used a histogram of oriented gradients (HOG). Object detection in that work was done using bounding boxes (Gao, 2020).
So far, the methods mentioned previously have relied on placing bounding boxes around detected and classified objects. However, detecting pedestrians at the pixel level gives a better understanding of the surroundings, which is especially important in the process of building autonomous cars [20]. Semantic segmentation refers to classifying the objects in an image at the pixel level, or pixel-wise: each pixel is classified individually and assigned to the class that best represents it. This provides a more detailed understanding of the imagery than image classification or object detection, which can be crucial for detecting people in autonomous driving cars and in other fields such as robotics or image search engines [20,21].
Sensor fusion approaches based on semantic segmentation have been explored in [22], where a semantic segmentation algorithm was used to effectively fuse 3D pointclouds with images. Another approach [21] used a network consisting of three CNN encoder-decoder sub-networks to fuse RGB images, lidar, and radar for road detection. Additionally, Khaled et al. [23] used two networks, SqueezeSeg and PointSeg, for semantic segmentation and applied feature-level fusion to combine a 3D lidar with an RGB camera for pedestrian detection.
Semantic segmentation is a well-developed technique for image data [23], and this work leverages the recent sensor fusion approach of [21] to handle multi-modal fusion. In that work, an asymmetrical CNN architecture was used that consisted of three encoder sub-networks, where each sub-network was assigned to a particular sensor stream: camera, lidar, and radar. The stream data was then fused by a fully connected layer whose outputs were upsampled by a decoder layer that recovered the fused data by performing pixel-wise classification. The upsampling was done in order to recover the lost spatial information and generate a dense pixel-wise segmentation map that accurately captured the fine-grained details of the input image [24]. The stream layers were designed to be as compact as possible, based on the complexity of the incoming data from each sensor. The convolutional neural network presented in [21] was used to detect roads in severe weather conditions. An attempt to apply the same network to pedestrian detection was made in this work, but without success; hence, the strategy was changed to use a similar methodology.
Thus, the main contribution of this paper is to explore the feasibility of fusing RGB images with lidar and radar pointclouds for pedestrian detection using a small dataset. It is widely recognized that larger datasets lead to better training, while small datasets can result in overfitting [25]. Nevertheless, studies [26,27] have shown that data size is not an obstacle to high-performing models. To address the previous issue, a novel and practical fully connected deep learning CNN architecture for semantic pixel-wise segmentation called SegNet [24] has been proposed. The network consists of three SegNet sub-networks that downsample the inputs of each sensor, a fully connected (fc) neural network (NN) that fuses the sensor data, and a decoder network that upsamples the data. Therefore, the proposed method focuses on detecting people at a pixel level. The task of identifying pedestrians is referred to as semantic segmentation and involves producing pixel-level classifications based on a dataset that has been labeled at the pixel level. Typically, there is only one class of interest, namely pedestrians. Moreover, the inclusion of radar in the fusion process gives the advantage of being able to detect pedestrians in severe weather conditions. Additionally, an extrinsic calibration method for radar with lidar and an RGB camera, based on the work done in [28,29], has been presented.
The remainder of this article is organized as follows. Section 2 deals with the sensor calibration between lidar and radar, where an extrinsic calibration matrix (T) was found based on the singular value decomposition (SVD). Section 3 shows the deep learning model architecture used in this work. Section 4 presents the results of the calibration method and the results after training the network. Finally, Section 5 summarizes the performance of the calibration method and the network, and conclusions are drawn, together with future work.
A GitHub ROS repository is available at [30].

Sensor Calibration
The proposed method of this work consisted of finding the extrinsic parameters between a multi-modal L3CAM lidar-RGB sensor from the company Beamagine [31] and a UMRR-96 Type 153 radar from the company Smartmicro [32]. Figure 1 illustrates the system. The L3CAM was already calibrated, so the focus of this method was to find the rotation matrix R from the radar frame R_f to the lidar frame L_f and the translation vector t from the origin O_R of frame R_f to the origin O_L of frame L_f. The three reference frames, as well as the rotation and translation, can be seen in Figure 2.

A schematic overview of the three frames: lidar, radar, and camera. A rotation matrix R and a translation vector t from the radar to the lidar frame are also shown.

Calibration Board
A calibration board, proposed in [28], was chosen. It consisted of three rectangular styrofoam pieces stacked one on top of another. The middle layer contained four circles that served as edge detectors for the lidar and camera sensors. Furthermore, a trihedral corner reflector was placed at the back of the calibration board, at the intersection of the four circles. It is worth mentioning that the styrofoam did not affect the detection of the radar signal reflected by the corner reflector.
The layout of the calibration board is illustrated in Figure 3, where two pieces of styrofoam were used, and the centers of the circles served as four point descriptors whose intersection gave the (x, y) coordinate of the location that matched the position of the trihedral corner reflector. The dashed black line shows the place of the corner reflector. Moreover, a single copper-plated trihedral corner reflector was made and is shown in Figure 4. The dimensions were chosen according to [33] such that the single areas of the corner reflector were large compared to the radar wavelength. Thus, the side length of the edges of the three isosceles triangles (a) was chosen to be 14 cm, and the base of the triangles L = a√2 was 19.7 cm. According to [33], the radar cross section (RCS), which measures a target's ability to reflect radar signals back in the direction of the radar receiver, can be calculated with Equations (1) and (2):

A_eff = a² / √3    (1)

σ = 4π A_eff² / λ²    (2)
where a is the length of the side edges of the three isosceles triangles, λ is the wavelength of the radar, A_eff is the effective area, and σ is the radar cross section.
In accordance with the datasheet of the UMRR-96 Type 153, the radar frequency lies between 77 and 81 GHz; taking the average of these frequencies, 79 GHz, yielded a wavelength λ = 3.79484 mm, an A_eff = 0.011316 m², and a σ = 111.74 m².
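As a sanity check, these numbers can be reproduced with a short script, assuming the standard formulas for a triangular trihedral corner reflector, A_eff = a²/√3 and σ = 4πA_eff²/λ²:

```python
import math

# Radar parameters from the text: 77-81 GHz band, averaged to 79 GHz.
c = 299_792_458.0            # speed of light, m/s
f = 79e9                     # center frequency, Hz
lam = c / f                  # wavelength, ~3.79484 mm

# Corner reflector: side length of the three isosceles triangles.
a = 0.14                     # m

# Effective area and RCS of a triangular trihedral corner reflector.
A_eff = a**2 / math.sqrt(3)              # Equation (1)
sigma = 4 * math.pi * A_eff**2 / lam**2  # Equation (2)

print(f"lambda = {lam * 1000:.5f} mm")   # ~3.79484 mm
print(f"A_eff  = {A_eff:.6f} m^2")       # ~0.011316 m^2
print(f"sigma  = {sigma:.2f} m^2")       # ~111.74 m^2
```

All three values match those quoted above, which suggests this is indeed the formula pair the paper's Equations (1) and (2) refer to.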

Location and Calibration
In order to find the extrinsic parameters between the lidar and the radar, it is necessary to set two reference points: one for the lidar and one for the radar. To this end, the lidar reference point is the crossing point at the center of the four circles, and the corner reflector is the radar reference point.

Since the L3CAM had already been calibrated, the rotation matrix and translation vector from the camera to the lidar were available. Three steps were followed to obtain the lidar-radar target calibration points. First, the middle point of the intersection of the circles was calculated in the RGB image. Second, this point was transferred into the lidar frame. Third, the extrinsic parameters from the radar to the lidar frame, i.e., R and t, were calculated. The radar driver provided the location of the radar point.

Step 1: Image Point Segmentation

Figure 5 shows the calibration board and the corner reflector, as well as the lidar and radar frames. Figure 6 depicts the four circles' centers (x_i, y_i), the middle point (x, y), and the two lines (l_1, l_2). Based on classical analytical geometry [34], two equations corresponding to l_1 and l_2 were obtained, Equations (3) and (4). Then, by solving these equations, the crossing point (x, y) in pixels was found, as shown by Equations (5) and (6).
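Since Equations (3)-(6) are not reproduced here, the following sketch illustrates the computation under the assumption that l_1 and l_2 are the diagonals connecting opposite circle centers:

```python
def line_through(p, q):
    """Return coefficients (A, B, C) of the line Ax + By = C through points p and q."""
    (x1, y1), (x2, y2) = p, q
    return (y2 - y1, x1 - x2, (y2 - y1) * x1 + (x1 - x2) * y1)

def crossing_point(c1, c2, c3, c4):
    """Intersection (x, y) of the diagonals c1-c4 and c2-c3, in pixel coordinates."""
    A1, B1, C1 = line_through(c1, c4)
    A2, B2, C2 = line_through(c2, c3)
    det = A1 * B2 - A2 * B1
    if det == 0:
        raise ValueError("diagonals are parallel")
    # Cramer's rule for the 2x2 linear system.
    return ((C1 * B2 - C2 * B1) / det, (A1 * C2 - A2 * C1) / det)

# Four circle centers of the board (hypothetical pixel values):
print(crossing_point((100, 100), (300, 100), (100, 300), (300, 300)))  # (200.0, 200.0)
```

The pairing of centers into diagonals is an assumption made for illustration; the paper's exact line definitions are given by its Equations (3) and (4).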

Step 2: Lidar Point
Algorithm 1 shows the process by which the center point (x, y) in the RGB image is transformed into the lidar frame, i.e., [x_l, y_l, z_l]^T. First, the OpenCV undistortPoints function is used to remove the distortion from the image point (x, y); it takes the distorted point, the camera matrix, and the distortion coefficients as inputs and outputs the undistorted point (u_u, v_u). Then, the directional vector [u_d, v_d, 1]^T from the camera frame to the middle point (x, y) is computed by multiplying the undistorted point by the inverse intrinsic camera matrix. Next, this vector is rotated into the lidar frame by multiplying it by l_c R, the rotation matrix from the camera to the lidar frame, giving a vector [u_l, v_l, w_l]^T that points at the center point (x, y). The final step is to calculate the crossing point of this ray with the lidar plane that corresponds to the center of the board. The function PlaneLineCrossingPoint(u_l, v_l, w_l, PC_l) receives the lidar pointcloud PC_l and the direction (u_l, v_l, w_l) and outputs the 3D crossing point (x_l, y_l, z_l).

Algorithm 1: Lidar point coordinates (x_l, y_l, z_l)
1. (u_u, v_u) ← UndistortPoint(x, y) (undistorted point in pixel coordinates)
2. [u_d, v_d, 1]^T ← DirectionalVector(u_u, v_u) (directional vector in the camera frame)
3. [u_l, v_l, w_l]^T ← l_c R · [u_d, v_d, 1]^T (directional vector in the lidar frame)
4. (x_l, y_l, z_l) ← PlaneLineCrossingPoint(u_l, v_l, w_l, PC_l), where:
   Ax + By + Cz + D = 0 ← board's parallel plane model fitted to PC_l
   n ← [A, B, C] (plane normal vector)
   p_0 ← l_c t (translation vector from camera to lidar, a point on the ray)
   s ← −(n · p_0 + D) / (n · w) (signed distance along the ray to the plane)
   return p_0 + s · [u_l, v_l, w_l]^T (intersection point between the line and the plane)

Step 3: Extrinsic Calibration Matrix

The extrinsic calibration matrix T is composed of a rotation matrix R and a translation vector t that best align two sets of points, as shown in Equation (7):

T = ( R  t ; 0 0 0 1 )    (7)

The proposed alignment method for 3D data points is based on [29,35].
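A minimal sketch of the plane-line crossing step, assuming the standard ray-plane intersection (function and variable names are illustrative, not the authors' exact implementation):

```python
import numpy as np

def plane_line_crossing_point(w_l, plane, p0):
    """Intersect the ray p(s) = p0 + s * w_l with the plane Ax + By + Cz + D = 0.

    w_l   : directional vector in the lidar frame (toward the board center)
    plane : (A, B, C, D) of the board's parallel plane model
    p0    : ray origin, e.g. the camera-to-lidar translation t
    """
    A, B, C, D = plane
    n = np.array([A, B, C], dtype=float)   # plane normal vector
    w = np.asarray(w_l, dtype=float)
    denom = n @ w
    if abs(denom) < 1e-12:
        raise ValueError("ray is parallel to the plane")
    s = -(n @ p0 + D) / denom              # signed distance along the ray
    return p0 + s * w                      # 3D crossing point (x_l, y_l, z_l)

# Ray from the origin along +x toward the plane x = 2 (1*x + 0*y + 0*z - 2 = 0):
print(plane_line_crossing_point([1, 0, 0], (1, 0, 0, -2), np.zeros(3)))  # [2. 0. 0.]
```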
The alignment problem can be formulated as follows: given two sets of points Y = {y_1, ..., y_n} and X = {x_1, ..., x_n} with known correspondences, the task is to find the rigid transformation that minimizes the sum of squared errors, as seen in Equation (8):

min_{R, t} Σ_{i=1}^{n} ‖y_i − (R x_i + t)‖²    (8)
Algorithm 2 shows the procedure for obtaining T using singular value decomposition (SVD). The function AlignmentSVD(X, Y) takes the two point sets X and Y as inputs. First, the centroids x_0 = (1/n) Σ_{i=1}^{n} x_i and y_0 = (1/n) Σ_{i=1}^{n} y_i are calculated. Then, the cross-covariance matrix H of the mean-reduced coordinates is computed, and SVD is applied to H. Finally, the function returns the extrinsic parameter matrix T. Moreover, the matrix T can be used to transform the two pointclouds into the same reference frame (RF): for instance, with Y in a certain RF, X can be brought to Y's RF by Equation (9), applying T to the points of X in homogeneous coordinates:

x'_i = T x_i    (9)
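The procedure can be sketched in a few lines of numpy as a standard SVD-based rigid alignment; the reflection guard on the determinant is a common addition not spelled out in the text:

```python
import numpy as np

def alignment_svd(X, Y):
    """Least-squares rigid alignment of point set X onto Y (known correspondences).

    Returns the 4x4 extrinsic matrix T = [R | t; 0 0 0 1] minimizing
    sum_i || y_i - (R x_i + t) ||^2, via SVD of the cross-covariance matrix.
    """
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    x0, y0 = X.mean(axis=0), Y.mean(axis=0)             # centroids
    H = (X - x0).T @ (Y - y0)                           # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # guard against reflections
    R = Vt.T @ S @ U.T
    t = y0 - R @ x0
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# Sanity check: recover a known rotation and translation.
rng = np.random.default_rng(0)
X = rng.normal(size=(11, 3))                            # e.g. the 11 calibration points
Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1.0]])     # 90 degrees about z
Y = X @ Rz.T + np.array([0.5, -0.2, 1.0])
T = alignment_svd(X, Y)
print(np.allclose(T[:3, :3], Rz))                       # True
```

The 11-point set mirrors the size of the calibration dataset collected later in the paper; any corresponding point sets work.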

Proposed Architecture
The proposed CNN to handle the sensor fusion approach is based on [21,24]. Figure 7 illustrates the SegNet architecture used in this work. It consists of an encoder and a decoder network, with 26 convolutional layers, 5 pooling layers, and 5 upsampling layers. The function of the encoder network (EN) is to downsample the inputs and extract features. On the other hand, the decoder network (DN) upsamples the data and reconstructs the image. The fully connected layer between the EN and the DN is discarded in order to retain higher-resolution feature maps at the deepest encoder output.
Inspired by the work in [21], a convolutional network was proposed to carry out a multi-modal sensor fusion approach to detect people. Figure 8 presents a convolutional neural network that consists of three ENs as inputs, a fully connected neural network (FCNN) that fuses the data, and a DN that upsamples and recovers the fused data. Table 1 shows the model architecture for the encoder in detail, Table 2 illustrates the input and output of the fully connected layer, and Table 3 shows the details of the decoder network model.
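To make the data flow concrete, the following numpy sketch mimics the pipeline's shapes only: three encoders downsample their inputs, a fully connected layer fuses the flattened features, and a decoder upsamples back to the input resolution. The pooling, fusion width, and upsampling are toy stand-ins, not the actual layers of Tables 1-3:

```python
import numpy as np

def encoder(x, pools=5):
    """Stand-in for a SegNet encoder: halve the spatial size `pools` times."""
    for _ in range(pools):
        h, w = x.shape
        x = x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))  # 2x2 max pooling
    return x

def fuse_and_decode(rgb, lidar, radar, pools=5):
    feats = [encoder(m, pools) for m in (rgb, lidar, radar)]  # three 8x8 maps
    fused_in = np.concatenate([f.ravel() for f in feats])     # 3 * 8 * 8 = 192 values
    W = np.ones((64, fused_in.size)) / fused_in.size          # toy fc fusion layer
    fused = (W @ fused_in).reshape(8, 8)
    # Toy decoder: upsample back to the 256x256 input resolution.
    return np.kron(fused, np.ones((2**pools, 2**pools)))

out = fuse_and_decode(*(np.zeros((256, 256)) for _ in range(3)))
print(out.shape)  # (256, 256)
```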

Results
The system used to handle the simulations was composed of an L3CAM lidar, a UMRR-96 Type 153 radar, and a GE66 Raider with an Intel® Core™ i9-10980HK CPU and an NVIDIA GeForce RTX 3070 8 GB GPU. The Robot Operating System (ROS1) Noetic on Ubuntu 20.04.5 LTS was used to collect sensor data, compute the extrinsic parameters, and align the sensors. Moreover, the CNN was simulated in a Jupyter Notebook using Python 3 and a conda environment. More specifically, the simulations were divided into two parts: the extrinsic parameters matrix and the network simulations.

Extrinsic Parameters Matrix
A ROS1 wrapper was developed to collect lidar data, and the ROS1 SmartMicro driver was used to gather radar data. As a result, a dataset of 11 lidar-RGB-radar (LRR) images was taken, 8 of which were taken indoors and 3 outdoors. Figure 9 shows the RGB image of the calibration board, and Figure 10 illustrates the lidar pointcloud in white and the sparser radar pointcloud as colored cubes. It is worth mentioning that the radar was used in long-range mode to reduce noise and better detect the corner reflector, which is shown by the purple cube in Figure 10, placed almost in the middle of the four circles.
Then, Algorithm 1 was applied to each lidar-RGB image to obtain the lidar point coordinates (x_l, y_l, z_l) of the middle point of the four circles, while the ROS1 radar driver gave the coordinates of the corner reflector placed behind the middle point. Figure 11 depicts the position of the lidar point as a blue sphere and the radar coordinate as a brown cube. The colored pointcloud is the plane Ax + By + Cz + D = 0 parallel to the board. The two point datasets, i.e., the lidar and radar point sets, are shown in Figure 12. It can be seen that the two sets of points were not aligned; therefore, alignment was necessary. Algorithm 2 was applied to the two datasets to find the extrinsic matrix T, and Equation (9) was then applied to the radar dataset to align it and transfer it to the lidar frame L_f. Figure 13 shows the radar set after correction. The norm was calculated for both the lidar and radar datasets before and after correction by Equation (9), as shown in Figure 14. The first eight images were taken inside a laboratory, so their distances fluctuated between 1.5 m and 2.0 m; the last three images were taken outside in a parking lot, where distances fluctuated around 14.0 m. The blue line shows that the norm differences were reduced, and the figure also shows the mean values of the two norms: as expected, the mean norm after correction was smaller than before correction. The transformation matrix T was applied to the radar coordinates given by the corner reflector, as can be seen by the blue sphere in Figure 11. Moreover, an outdoor LRR image was taken in a parking lot with people, as illustrated in Figure 15, where the colored spheres represent the radar dataset before correction and the white spheres represent the radar dataset after correction by Equation (9).

Network
The CNN was trained using a custom dataset taken in a parking lot with people walking in front of the system. The files were stamped in the format yyyyMMdd_hhmmss_zzz, where yyyy denotes the year, MM the month, dd the day, hh the hour, mm the minute, ss the second, and zzz the milliseconds. This format was chosen because the lidar, RGB, and radar frequencies are 6 Hz, 10 Hz, and 18 Hz, respectively. For instance, a typical filename is 20221017_131347_782.pcd, where hh = 13, mm = 13, and ss = 47. Within that second, ss = 47 is repeated 6 times for the lidar, 10 times for the RGB, and 18 times for the radar, while the value zzz = 782 differs for each reading, making it easier to synchronize them.
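As an illustration of this synchronization scheme, the following sketch (with hypothetical helper names) parses such a filename and matches frames across sensors by their millisecond stamps:

```python
from datetime import datetime

def parse_stamp(name):
    """Parse 'yyyyMMdd_hhmmss_zzz' from a filename such as '20221017_131347_782.pcd'."""
    stem = name.rsplit(".", 1)[0]
    date, clock, ms = stem.split("_")
    return datetime.strptime(date + clock, "%Y%m%d%H%M%S").timestamp() + int(ms) / 1000.0

def nearest(stamp, candidates):
    """Return the candidate filename whose timestamp is closest to `stamp`."""
    return min(candidates, key=lambda c: abs(parse_stamp(c) - stamp))

# Match one lidar frame against a few (hypothetical) radar frames:
lidar = "20221017_131347_782.pcd"
radar = ["20221017_131347_760.pcd", "20221017_131347_810.pcd", "20221017_131348_020.pcd"]
print(nearest(parse_stamp(lidar), radar))  # 20221017_131347_760.pcd (22 ms apart)
```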
Afterward, the dataset was synchronized, producing a total of 224 LRR images. Then, 30 LRR images were selected and expanded to 60 by flipping them and adding the flipped copies to the original set. The program 'labelme' was used to label the RGB images with a single 'person' class: 'labelme' creates polygons around the persons and saves them as JSON, which can be exported as a PNG image with a single person class. The lidar images were saved in PCD format, extracted, and converted to 2D grayscale images with a 16-bit depth; they were interpolated to improve their quality. The radar images were likewise saved in PCD format, extracted, and converted to 2D grayscale images with a 16-bit depth, with the location of each radar point projected vertically.
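A minimal sketch of such a pointcloud-to-image conversion follows; the orthographic projection, axis convention, resolution, and range limit here are illustrative assumptions, not the authors' exact pipeline:

```python
import numpy as np

def points_to_depth_image(points, width=256, height=256, max_range=30.0):
    """Project 3D points (x forward, y left, z up) onto a 2D grid and encode
    range as 16-bit gray values (0 = no return, 65535 = max_range)."""
    img = np.zeros((height, width), dtype=np.uint16)
    pts = np.asarray(points, dtype=float)
    # Illustrative orthographic projection: y maps to columns, z to rows.
    cols = ((pts[:, 1] + max_range / 2) / max_range * (width - 1)).astype(int)
    rows = ((max_range / 2 - pts[:, 2]) / max_range * (height - 1)).astype(int)
    rng = np.linalg.norm(pts, axis=1)
    ok = (0 <= cols) & (cols < width) & (0 <= rows) & (rows < height) & (rng <= max_range)
    img[rows[ok], cols[ok]] = (rng[ok] / max_range * 65535).astype(np.uint16)
    return img

img = points_to_depth_image([[10.0, 0.0, 0.0], [5.0, 1.0, 1.0]])
print(img.dtype, img.max())  # uint16 21845 (the brightest pixel is the 10 m return)
```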
The training, evaluation, and testing modes were executed in a Jupyter Notebook with PyTorch version 1.11 and the GPU activated. The LRR images were downsampled to a size of 256 × 256. Due to the 8 GB capacity of the GPU, the batch size was set to one. The network was then trained for 300 epochs. Additionally, 10 LRR images were used for validation and 10 for testing, giving a total dataset of 80 LRR images.
The results of the training mode are shown in Figures 16-20. Figure 16 shows the RGB image applied to the EN, whereas Figure 17 illustrates the ground truth image. The lidar and radar images are shown in Figures 18 and 19, and Figure 20 depicts the resulting image fusing the lidar, radar, and RGB. The time elapsed during training was 7019.40 s, while the mean inference time of the model in testing mode was 0.010 s. The target and prediction during training were stored in a vector, and the intersection over union (IoU) was later evaluated for each frame to measure the percentage of overlap between the prediction and the target. To evaluate the results of the network in training mode, the pixel accuracy and IoU metrics for image segmentation were used, and the cross-entropy loss function was applied.

• The IoU refers to the ratio of the overlap area in pixels to the union of the target mask and the prediction mask in pixels, and is represented by Equation (10).
• The pixel accuracy refers to the ratio of the correctly identified positives and negatives to the size of the image, and is represented by Equation (11).

Both Equations (10) and (11) are defined in terms of the confusion counts:

IoU = TP / (TP + FP + FN)    (10)

Pixel Accuracy = (TP + TN) / (TP + TN + FP + FN)    (11)

where the true positive (TP) count is the area of intersection between the ground truth (GT) and the segmentation mask (S), the false positive (FP) count is the predicted area outside the ground truth, and the false negative (FN) count is the number of pixels in the ground truth area that the model failed to predict. These counts can be obtained using Equation (12):

confusion vector = prediction image ⊙ label image    (12)

where ⊙ denotes the element-wise product. If an element of the confusion vector is equal to 1, there is an overlap, and the TP count is computed by adding all the ones; the same procedure, with the masks complemented accordingly, yields the FP, TN, and FN counts.

Figure 21 shows the IoU versus the pixel accuracy over the 60 LRR fused images. The average loss over the 300 training epochs is shown in Figure 22. The results of the evaluation mode are shown in Figure 23, where the blue line indicates the average loss of the training mode and the orange line indicates the average evaluation loss. It can be seen that the model started overfitting around epoch 50, with a lowest loss value of 1.411; at this point, the model was saved for use in the testing mode. In addition, the IoU and the pixel accuracy for the testing mode are shown in Figure 24, where the mean IoU was 94.4% and the mean pixel accuracy was 96.2%. The results of the testing mode are displayed in Figure 25, where the ground truth and the model's output are overlaid. The model's detection of pedestrians is shown in red, which represents the overlapping area with the ground truth. The white spots near and between the pedestrians are part of the model's output but do not overlap with the ground truth. Light blue depicts weakly detected pedestrians, a result of the model's overfitting and the limited size of the training data; this is confirmed by the loss of 1.413 between the output and the ground truth.
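A short numpy sketch of these metrics, assuming binary masks (1 = pedestrian, 0 = background):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """IoU and pixel accuracy for binary segmentation masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    tp = np.sum(pred & gt)        # overlap: both masks are 1
    fp = np.sum(pred & ~gt)       # predicted pedestrian outside the ground truth
    fn = np.sum(~pred & gt)       # ground-truth pixels the model missed
    tn = np.sum(~pred & ~gt)      # correctly predicted background
    iou = tp / (tp + fp + fn)                 # Equation (10)
    pixel_acc = (tp + tn) / pred.size         # Equation (11)
    return iou, pixel_acc

# Tiny worked example: one TP, one FP, one FN, one TN.
pred = np.array([[1, 1, 0, 0]])
gt   = np.array([[1, 0, 1, 0]])
iou, acc = segmentation_metrics(pred, gt)
print(f"IoU={iou:.3f}, acc={acc:.3f}")  # IoU=0.333, acc=0.500
```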

Conclusions
In conclusion, this article explored, as a preliminary or pilot step, the feasibility of fusing lidar and radar pointclouds with RGB images for pedestrian detection using a sensor fusion, pixel-wise semantic segmentation CNN based on SegNet. The proposed method has the advantage of detecting pedestrians at the pixel level, which is not possible with conventional methods that use bounding boxes.
Experimental results using a custom dataset showed that the proposed architecture achieved high accuracy and performance, making it suitable for systems with limited computational resources. However, since this is a preliminary study, the results must be interpreted with caution.
An extrinsic calibration method based on SVD for lidar and radar in a multimodal sensor fusion architecture that includes an RGB camera was also proposed and was demonstrated to improve the mean norm between the lidar and radar points by 38.354%.
For future work, we look to investigate the behavior of the proposed architecture with a larger dataset and to test the model in different weather conditions. Additionally, evaluating the architecture in a real scenario by mounting the sensors on a driving car shall be investigated. Overall, this preliminary study provides a promising starting point for further research and development in the field of sensor fusion for pedestrian detection.

Funding: (Tecniospring INDUSTRY). Also, the project was supported by funding from projects PID2020-119484RB-I00 and PDC2021-121038-I00 and by MCIN/AEI/10.13039/501100011033, as well as the European Union's NextGeneration EU/PRTR and the Government of Catalonia's Agency for Business Competitiveness (ACCIÓ).

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: