As no existing dataset is available for 3D pedestrian detection by fusion of heterogeneous RGB and FIR images, the new FieldSafePedestrian dataset is proposed, which is built on the data of FieldSafe [5]. The FieldSafe dataset comprises approximately 2 h of data sequences of LiDAR, RGB camera, FIR camera, and GPS recorded with the Robot Operating System (ROS). The sensors are mounted at external positions of a tractor driving in a grass-mowing scenario. The tractor changes its direction multiple times during driving, so the dataset contains images captured from different perspectives. Moreover, it includes images with varied backgrounds, such as different vegetation and houses at changing distances from the tractor. Additionally, scenes with moving objects are captured, especially pedestrians of different sizes, poses, locations, and occlusion degrees. The dataset considers pedestrians and harvesters as foreground objects but does not provide 3D object labels. To build the novel 3D subdataset, i.e., the FieldSafePedestrian, monocular RGB images, FIR images, and LiDAR points are extracted from the raw data. In total, there are 48,638 monocular RGB images, 167,409 FIR images, and 110,084 frames of LiDAR point clouds. All extracted data are synchronized by the timestamps recorded with ROS, and the spatial alignment is based on the intrinsic parameters of the cameras and the extrinsic parameters among the different sensors provided by FieldSafe.
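For reference, the spatial alignment described above amounts to the standard pinhole projection of LiDAR points into the camera image plane. The following is a minimal sketch assuming a 3x3 intrinsic matrix K and a 4x4 LiDAR-to-camera extrinsic transform; the function and parameter names are illustrative and not taken from FieldSafe.

```python
import numpy as np

def project_lidar_to_image(points_lidar, K, T_cam_from_lidar):
    """Project Nx3 LiDAR points into pixel coordinates.

    points_lidar:     (N, 3) array in the LiDAR frame.
    K:                (3, 3) camera intrinsic matrix.
    T_cam_from_lidar: (4, 4) extrinsic transform (LiDAR -> camera).
    Returns (M, 2) pixel coordinates of the points in front of the camera.
    """
    # Homogeneous coordinates, transformed into the camera frame.
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Keep only points with positive depth (in front of the camera).
    pts_cam = pts_cam[pts_cam[:, 2] > 0]

    # Pinhole projection and normalization by depth.
    uv = (K @ pts_cam.T).T
    return uv[:, :2] / uv[:, 2:3]
```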
The original FieldSafe dataset unfortunately does not provide 3D pedestrian labels, which has limited its use for agricultural perception tasks. In order to exploit heterogeneous information fusion, 3D cylinder labels are generated for the pedestrians in each frame, forming the new FieldSafePedestrian dataset. The label generation mainly includes two steps: semi-automatic annotation of 2D pedestrian bounding boxes and 3D cylinder generation by separating and clustering the LiDAR point clouds. Beforehand, invalid data are eliminated and the dataset is augmented with low-light images generated by deep learning-based methods. The details of each step are given in the following subsections.
3.1. Data Preparation and Augmentation
The 3D pedestrian labels for each image are generated with the help of the LiDAR point cloud. In the original dataset, however, the sensor platform moving on unstructured roads produces misaligned data pairs due to jolting, and the differing frame rates of the sensors further increase the spatial alignment error of the multi-sensor data. A novel method is therefore proposed to filter out poorly aligned frames. Firstly, one accurately aligned pair of RGB image and LiDAR point cloud frame is selected as the template. The template RGB image is denoted as $I_t$, and the template point cloud is projected onto the image plane to build a pseudo image $P_t$. Secondly, for any other image $I_i$ of the $i$-th frame, the associated feature point sets of $I_t$ and $I_i$ are obtained using the SIFT descriptor [38] and denoted as $S_t$ and $S_i$. Analogously, the associated feature point sets of the pseudo image $P_t$ and $I_i$ are denoted as $\hat{S}_t$ and $\hat{S}_i$ (shown in Figure 2). Thirdly, the perspective transform matrix $H_1$ between the RGB image and the pseudo image is calculated according to the associated point coordinates using Equation (1),

$$\begin{bmatrix} u' & v' & 1 \end{bmatrix}^{T} \sim H_1 \begin{bmatrix} u & v & 1 \end{bmatrix}^{T}, \qquad (1)$$

where $(u, v)$ and $(u', v')$ denote the point coordinates of the associated feature point sets. Analogously, the perspective transform matrix $H_2$ between the RGB image and the pseudo image can be obtained. Fourthly, the matrices $H_1$ and $H_2$ are reformulated as vectors $h_1$ and $h_2$. By calculating the similarity of $h_1$ and $h_2$ with Equation (2), the alignment degree between the RGB image and the LiDAR points of frame $i$ is obtained. Frames with a low alignment degree are regarded as invalid data and discarded.
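To make the filtering step concrete, the following is a minimal sketch of how such an alignment check could be implemented with OpenCV. The cosine similarity used here for Equation (2), the RANSAC reprojection threshold, and the helper names are assumptions for illustration, not taken from the original implementation.

```python
import cv2
import numpy as np

def perspective_transform(img_a, img_b, min_matches=10):
    """Estimate the perspective transform between two images via SIFT matches."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)

    # Match descriptors and keep the best correspondences.
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)
    if len(matches) < min_matches:
        return None

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # Equation (1): the associated point coordinates determine the 3x3 matrix.
    H, _ = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, 5.0)
    return H

def alignment_degree(H1, H2):
    """Cosine similarity of the vectorized transform matrices (assumed measure)."""
    h1, h2 = H1.ravel(), H2.ravel()
    return float(np.dot(h1, h2) / (np.linalg.norm(h1) * np.linalg.norm(h2)))
```

In this sketch, the two matrices would be estimated for a given frame and compared; frames whose similarity falls below a chosen threshold would be treated as misaligned and dropped.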
In this work, two methods are chosen to augment the dataset with low-light RGB images. The first is Cycle-GAN [6], which introduces cycle consistency losses to achieve style transfer with unpaired training data. This learning-based network is used to generate virtual nighttime images in the experiments. For better fidelity, both daytime and nighttime images from the large-scale BDD100K dataset [39] are used to train the network, as shown in Figure 3.
Additionally, RGB images are converted into the HSV color space and the brightness (value) channel is adjusted to generate dim-light images. This method avoids the unrealistic light spots that the Cycle-GAN learns from the BDD100K dataset [39], but the generated images show lower similarity to real nighttime scenes. Samples of the generated low-light images are shown in Figure 4, where HSV-25 means that the brightness of the image is reduced by 25%, and the same notation applies to HSV-50 and HSV-75.
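As an illustration, the HSV-based darkening can be implemented in a few lines with OpenCV; the function name and the exact scaling procedure below are assumptions that mirror the HSV-25/50/75 naming.

```python
import cv2
import numpy as np

def darken_hsv(image_bgr, reduction=0.25):
    """Reduce image brightness by scaling the V channel in HSV space.

    reduction=0.25 corresponds to HSV-25, 0.5 to HSV-50, 0.75 to HSV-75.
    """
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 2] *= (1.0 - reduction)               # scale the brightness channel
    hsv[..., 2] = np.clip(hsv[..., 2], 0, 255)     # keep values in the valid range
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```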
3.2. Generation of 3D Pedestrian Labels
Annotation of multi-sensor data is an extremely tedious task. It requires experienced human experts to accurately identify all the LiDAR points that correspond to objects in the image. In this work, a semi-automatic annotation method is provided for the 3D perception task. First, a small number of images are manually annotated with 2D bounding box labels and used to fine-tune a 2D detection network, which then annotates the remaining images. The few detection errors are revised by human experts. Second, the LiDAR points are projected onto the image plane and the portion covered by the 2D bounding boxes is cropped. For the cropped points, the ground points are removed and clustering based on Euclidean distance is conducted to find the points on the pedestrian. Third, a 3D cylinder is generated according to the clustered LiDAR points. In comparison with a 3D bounding box, which is represented by its center, width, length, height, and yaw angle, the 3D cylinder is identified only by its center, radius, and height. Thus, the 3D cylinder label has fewer parameters than a cuboid, which eases the regression task of the network. In the above annotation procedure, the 2D box provides a coarse lateral position for the object, which is key to the 3D label generation. By employing the semi-automation of the 2D labels, the entire annotation process is accelerated and the annotation work is kept at a low labor and time cost.
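For clarity, the two label parameterizations (7 regression targets for a 3D box versus 5 for a 3D cylinder) compare as follows; the field names are illustrative and not taken from the dataset's annotation format.

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """3D bounding box: 7 regression targets."""
    x: float          # center coordinates
    y: float
    z: float
    width: float
    length: float
    height: float
    yaw: float        # orientation around the vertical axis

@dataclass
class Cylinder3D:
    """3D cylinder label: 5 regression targets, no orientation needed."""
    x: float          # center coordinates
    y: float
    z: float
    radius: float
    height: float
```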
As aforementioned, a small number of filtered images are first annotated with the open-source tool LabelImg [40], a graphical annotation tool for labeling images with 2D bounding boxes and saving the annotation files in PASCAL VOC, YOLO, or CreateML format. These images are divided into subsets of 2000 for training, 250 for validation, and 250 for testing. The fine-tuned network is an improved Cascade-RCNN [41], which employs HRNet [42] as a substitute for the default backbone and adapts the anchor sizes by K-Means clustering [43] according to the distribution of the 2D labels. The network is implemented in PyTorch and trained for 24 epochs; other training details follow Cascade-RCNN [41]. The ablation experiments on 2D detection demonstrate that the adopted network outperforms both Faster-RCNN [44] and the original Cascade-RCNN [41] in average precision (AP), as shown in Table 1 and Figure 5. It is therefore more suitable as an automatic annotation tool.
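The anchor adaptation step can be sketched as follows: K-Means is run on the width/height pairs of the annotated 2D boxes, and the cluster centers are used as anchor sizes. This is a minimal illustration using scikit-learn and standard Euclidean K-Means; the paper does not specify the exact distance metric or library.

```python
import numpy as np
from sklearn.cluster import KMeans

def anchor_sizes_from_boxes(boxes_xyxy, num_anchors=9, seed=0):
    """Cluster 2D box dimensions to obtain anchor (width, height) pairs.

    boxes_xyxy: (N, 4) array of [x1, y1, x2, y2] ground-truth boxes.
    Returns (num_anchors, 2) anchor sizes sorted by area.
    """
    wh = np.stack([boxes_xyxy[:, 2] - boxes_xyxy[:, 0],
                   boxes_xyxy[:, 3] - boxes_xyxy[:, 1]], axis=1)
    km = KMeans(n_clusters=num_anchors, n_init=10, random_state=seed).fit(wh)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
```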
In the next step, the 3D cylinder labels of each image are generated according to the LiDAR points and the obtained 2D labels. The ground points are first filtered out by the RANSAC algorithm [45]. Thereafter, the remaining points are projected into the camera coordinate system, and those enclosed by the 2D box are kept. The enclosed point cloud is further clustered by the DBSCAN algorithm [46] to remove noise points from the background. Since the LiDAR points on a pedestrian are relatively sparse in this dataset, they may not completely cover the body surface. Therefore, the size of the corresponding point cluster may not be consistent with the spatial size of the pedestrian.
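The ground removal and clustering can be sketched with Open3D and scikit-learn as follows; the thresholds and the choice of libraries are illustrative assumptions rather than the values and tools used for the dataset.

```python
import numpy as np
import open3d as o3d
from sklearn.cluster import DBSCAN

def pedestrian_points(points, eps=0.5, min_samples=10, ground_thresh=0.1):
    """Remove the ground plane with RANSAC and keep the largest DBSCAN cluster.

    points: (N, 3) LiDAR points already cropped to a 2D detection box.
    """
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)

    # RANSAC plane fit: inlier indices belong to the ground plane.
    _, ground_idx = pcd.segment_plane(distance_threshold=ground_thresh,
                                      ransac_n=3, num_iterations=200)
    mask = np.ones(len(points), dtype=bool)
    mask[ground_idx] = False
    remaining = points[mask]

    # Euclidean-distance clustering; label -1 marks noise points.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(remaining)
    valid = labels[labels >= 0]
    if valid.size == 0:
        return remaining[:0]
    largest = np.bincount(valid).argmax()
    return remaining[labels == largest]
```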
Hence, an additional size correction procedure is implemented. First, the average depth of the LiDAR points and the 2D box center are used to jointly determine the location of the 3D cylinder in the camera coordinate system. The cylinder is initialized with the smallest pedestrian size in the dataset (similar to a child). After that, the center coordinates, the height, and the radius of the cylinder are adjusted to gradually increase its size until the projected shape of the cylinder is inscribed in the corresponding 2D bounding box. The final cylinder is the 3D label of the pedestrian, with samples shown in Figure 6.
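A simplified version of this size-correction loop is sketched below: starting from a child-sized prior at the location given by the average LiDAR depth and the 2D box center, the cylinder is grown until its image projection reaches the 2D box. The projection helper, the camera-axis convention, and the growth step are assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np

def project(points_cam, K):
    """Project (N, 3) camera-frame points to pixel coordinates."""
    uv = (K @ points_cam.T).T
    return uv[:, :2] / uv[:, 2:3]

def fit_cylinder(center, box_2d, K, r0=0.15, h0=1.0, step=0.02, max_iter=200):
    """Grow a cylinder (center, radius, height) until its projection
    reaches the 2D box [x1, y1, x2, y2].

    center: (3,) initial cylinder center in the camera frame (x right,
            y down, z forward), e.g. from the average LiDAR depth and
            the back-projected 2D box center.
    """
    x1, y1, x2, y2 = box_2d
    radius, height = r0, h0
    for _ in range(max_iter):
        # Sample the top and bottom circles of the cylinder (axis along y).
        angles = np.linspace(0.0, 2.0 * np.pi, 32, endpoint=False)
        ring = np.stack([np.cos(angles) * radius,
                         np.zeros_like(angles),
                         np.sin(angles) * radius], axis=1)
        top = ring + center + np.array([0.0, -height / 2.0, 0.0])
        bottom = ring + center + np.array([0.0, height / 2.0, 0.0])
        uv = project(np.vstack([top, bottom]), K)

        # Stop once the projected silhouette touches any side of the 2D box.
        u_min, v_min = uv.min(axis=0)
        u_max, v_max = uv.max(axis=0)
        if u_min <= x1 or v_min <= y1 or u_max >= x2 or v_max >= y2:
            break
        radius += step
        height += 2.0 * step
    return center, radius, height
```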
After data preparation, the FieldSafePedestrian dataset consists of 48,120 data pairs of synchronized RGB image, FIR image, and LiDAR point cloud frame in total. Among them, 17,090 pairs contain positive images and have corresponding annotation files, and a total of 30,336 pedestrians in sitting, lying, and standing postures are marked with 3D cylinder labels. Most pedestrians occupy a height of 20–80 pixels and lie within a range of 5–40 m from the camera. Finally, the dataset is augmented with the aforementioned Cycle-GAN, HSV-25, HSV-50, and HSV-75 methods, which enlarges the dataset by adding the corresponding low-light images.