Exploring 3D Object Detection for Autonomous Factory Driving: Advanced Research on Handling Limited Annotations with Ground Truth Sampling Augmentation

Autonomously driving vehicles in car factories and parking spaces can represent a competitive advantage in the logistics industry. However, the real-world application is challenging in many ways. First of all, there are no publicly available datasets for this specific task. Therefore, we equipped two industrial production sites with up to 11 LiDAR sensors to collect and annotate our own data for infrastructural 3D object detection. These data form the basis for extensive experiments. Because the amount of labeled data is still limited, the commonly used ground truth sampling augmentation is the core of the research in this work. Several variations of this augmentation method are explored, revealing that in our case, the most commonly used variant is not necessarily the best. We show that an easy-to-create polygon can noticeably improve the detection results in this application scenario. By using these augmentation methods, it is even possible to achieve moderate detection results when only empty frames without any objects and a database with only a few labeled objects are used.


Introduction
The vision of autonomous driving is becoming more and more of a reality, not only in individual transport but also in industrial applications, where it can automate processes, simplify workflows, and increase safety. For example, the transportation of goods through warehouses can be handled by autonomous vehicles. Newly built vehicles can even drive through parts of production facilities without the help of human drivers. This application, autonomous factory driving, is investigated in this work. In our use case, the vehicles are to drive autonomously at low speed for part of the final production route through the plant premises. The entire system required to solve such a complex task can be divided into several parts. In this paper, we focus on the 3D object detection (3D OD) task, which is an essential part of this pipeline because many other functions build on its output.
Most of the publicly available datasets and benchmarks for 3D OD on LiDAR data are recorded from the point of view of a single ego vehicle [1][2][3][4][5][6]. Unfortunately, these are unsuitable for the stated problem, as an overall view of the production site that is as occlusion-free as possible is required to maneuver several vehicles safely at the same time. Therefore, an infrastructural LiDAR sensor setup is to be used. Training on ego-vehicle-based data would introduce a large domain shift with respect to this setup. Although there are public datasets with infrastructural sensors [7][8][9][10], these are rather limited in availability, size, and quality [11,12]. Furthermore, due to the domain shift, it is not possible to use these for training and direct inference. Therefore, data had to be recorded and annotated accordingly. During the process of creating a suitable dataset, we encountered many sources of delay.
As a result, only a fraction of the pre-validated data could be labeled and prepared. This scarce data basis was addressed by using suitable methods such as heavy data augmentation. The most potent augmentation method for our case is ground truth (GT) sampling [13], as it borders on simulation. In GT sampling, a database of objects is created. During training, objects are randomly selected from this database and inserted into the current frame. This work aims not only to provide a solution for the specific application but also to investigate the influence of GT sampling further. Several experiments show its effectiveness even with very little data available. In addition, this work explores the idea that GT sampling can be better utilized in a fixed environment. For this scenario, a more sophisticated GT sampling method utilizing an easy-to-create polygon is proposed to compensate for the lack of data. Unfortunately, for legal reasons, the data can only be shown in parts within this publication and cannot be published in its entirety.
The rest of this paper is structured as follows: First, the related work that is relevant to this task is reviewed. Afterwards, the specific sensor setup and the resulting dataset, the networks used for the experiments, the GT sampling augmentation variations, and the evaluation metric are discussed. The results of the experiments are then presented. Finally, a conclusion and an outlook on future extensions of the proposed methods are given.

Related Work
In the following, the current state of the art is discussed with regard to the main topics of this paper. Note that some arXiv preprints that have not been formally published are also considered, as they provide additional insights.
Unlike ego-vehicle-based detection, where a selection of datasets has become the standard [1][2][3][4][5][6], many different smaller datasets exist for the task of infrastructural detection. The authors of [35] use CARLA [38] to create a synthetic dataset of a T-junction and a roundabout. They conduct experiments on the number of LiDAR sensors in the simulated setups and on the stage at which the different point clouds are fused. It is shown that early fusion of overlapping sensors improves the detection results. In [30], mainly simulation data is likewise used to perform 3D OD, and it is again shown that an early fusion of the point clouds improves the detection results. The authors of [32] also work with simulation data from CARLA. They experiment with the placement of different LiDAR sensors in an infrastructure setup for the task of 3D OD and with varying fusion schemes. Their experiments show that a LiDAR setup that leads to higher uniformity and coverage of the objects of interest is beneficial for 3D OD. The authors of [33] perform 3D OD on the IPS300+ [8] and the A9 dataset [7] as well as on semi-synthetic data. An early fusion of the point clouds is also performed here. In [34], a follow-up to [33] that uses synthetic data is presented. In [12], a semi-automated annotation pipeline for infrastructure LiDAR data is introduced. Their dataset features only one LiDAR sensor, so no fusion is required. The authors announce the release of their dataset, named FLORIDA, but at the time of writing, it had not yet been published. The authors of [31] again use the A9 dataset. They utilize the three LiDAR sensors as well as the roadside cameras for 3D OD by fusing the results of conventional methods and deep learning approaches. In addition to the already mentioned datasets with real recordings for infrastructural LiDAR object detection, IPS300+ [8] and A9 [7], there are other small datasets that are publicly available, such as the BAAI-VANJEE dataset [9], in which an intersection in Beijing was captured with two LiDAR sensors, or the datasets of intersections in Germany proposed in [39] and [40]. Infrastructural LiDAR data can further be extended with point clouds recorded by vehicles. There are several papers on this extended task [11,41,42], which work with simulated data or the DAIR-V2X dataset [10]. Further works in this direction have been published, but since this deviates from the task of this work, we limit ourselves to this selection.

Data Augmentation
There is a variety of augmentation methods for the task of 3D OD on LiDAR data. In [43,44], the most common methods for 3D OD were evaluated with different parameter sets, networks, and datasets. These are simple transformations of the entire point cloud and the GT boxes, such as rotation, scaling, and translation. These transformations can also be applied at object level, in which case only the GT boxes and their inner points are transformed. Other augmentation methods, such as frustum-based deletion and noise of points [45], shifting different parts of one object [46], or mixup [47], also exist but are applied much less frequently. Another very common method included in the standard catalog of augmentation methods is GT sampling, which was first introduced in [13]. GT boxes and their inner points are collected in a database, and during training, objects are drawn from this database and inserted into the current point cloud. This GT sampling method was further developed in several ways. The placement of the objects, which were originally inserted at the position where they were cut out, was a key area of research. The placement of the objects on the previously estimated ground plane has established itself as the standard in the community [44][48][49][50]. In [51,52], the semantic segmentation of corresponding camera images was used to find semantically meaningful positions, such as the placement of cars on the road and pedestrians on the sidewalk. The authors of [48] introduce a ValidMap for position generation. A grid is created from the number of points within a cell and their height information in relation to the ground. Additionally, the objects are inserted in an occlusion-aware manner, so that inserted objects cast a shadow in the point cloud. In [50], an estimation of roads and sidewalks is performed for better placement of objects. Occlusion handling is also used here. The authors of [26] propose a pattern-aware downsampling of objects from the database so that they can be placed realistically at greater distances.
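The database step of GT sampling described above can be illustrated with a minimal sketch. It is not taken from [13] or from any particular framework; the box layout (center, dimensions, heading angle) and the names `points_in_box` and `build_database` are assumptions made for illustration only.

```python
import math

def points_in_box(points, box):
    """Return the points inside a yaw-rotated 3D box.

    box = (cx, cy, cz, length, width, height, yaw); this layout is an
    assumption for this sketch.
    """
    cx, cy, cz, length, width, height, yaw = box
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    inside = []
    for x, y, z in points:
        dx, dy = x - cx, y - cy
        # Rotate the offset into the box frame (inverse yaw rotation).
        local_x = cos_y * dx + sin_y * dy
        local_y = -sin_y * dx + cos_y * dy
        if (abs(local_x) <= length / 2 and abs(local_y) <= width / 2
                and abs(z - cz) <= height / 2):
            inside.append((x, y, z))
    return inside

def build_database(frames):
    """Collect every labeled box together with its inner points.

    frames: iterable of (points, [(label, box), ...]).
    Boxes that contain no points are skipped.
    """
    database = []
    for points, labeled_boxes in frames:
        for label, box in labeled_boxes:
            object_points = points_in_box(points, box)
            if object_points:
                database.append(
                    {"label": label, "box": box, "points": object_points})
    return database
```

During training, entries of such a database would be drawn at random and their points appended to the current point cloud, subject to the placement rules of the respective GT sampling variant.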

Methods
The datasets for infrastructural LiDAR 3D OD presented in the previous section do not meet the requirements for the task in this paper. The datasets IPS300+ [8] and LUMPI [40] come closest to the requirements. However, the first could not be downloaded due to a region lock, while no labels were available for the second at the time of writing. Furthermore, the public datasets are not released for commercial use. For this reason, we recorded our own data at the relevant sites. Following related work, single-stage detectors were used and the point clouds were fused at an early stage. In contrast to previous work in this area, this paper focuses on compensating for the limited data available, mainly by using the GT sampling augmentation method and applying it to the specific case of a fixed environment. In the following subsections, the acquired dataset, the object detectors, the augmentation methods, and the evaluation metrics for the experiments are explained.

Sensor Setup & Data
For data collection, we equipped two factory sites with infrastructural LiDAR sensors and cameras so that all regions of interest are clearly visible. A mix of automotive-grade fisheye (1 MPix, 190° FoV) and pinhole cameras (1 MPix, 120° FoV) covers the near and long range, respectively. At the time of writing, two different facilities are supported, referred to as K and G. Site K has three HESAI XT32 LiDAR sensors, nine fisheye cameras, and four pinhole cameras. The fused point cloud of all LiDAR sensors can be seen in Figure 1. Site G is larger and more complex than site K. Therefore, eleven LiDAR sensors, nineteen fisheye cameras, and fifteen pinhole cameras were required to cover the area. The bounding boxes are labeled on the fused point cloud of all available LiDAR sensors. Spatial registration was done by extrinsic calibration, whereby all point clouds were transferred to a world coordinate system. Temporal alignment was done by PTP synchronization of all sensors to GPS time and subsequent matching of the point clouds based on minimal timestamp difference. The box parameters are the center position in 3D, the dimensions, and the heading angle. Although more classes are labeled, cars and pedestrians are most relevant for the task. Therefore, only these two are considered in the following experiments. A total of 175,932 cars and 17,377 pedestrians were labeled. For site K, there are 39 sequences and 1411 frames. For site G, there are 90 sequences and 3119 frames. For the experiments, these two were considered as one dataset and split into training (60%), validation (20%), and test (20%). For this purpose, all recorded sequences were divided into subsets, whereby care was taken to ensure that the number of objects belonging to the car and pedestrian classes roughly corresponded to the defined ratios. The result was a training set with 95,270 cars and 10,426 pedestrians, a validation split with 39,743 cars and 3455 pedestrians, and a test split with 40,917 cars and 3494 pedestrians.
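The early fusion described above, matching scans by minimal timestamp difference and transferring them to the world system, can be sketched as follows. This is a minimal illustration under stated assumptions: each sensor is represented by a hypothetical dictionary with sorted timestamps, one scan per timestamp, and an extrinsic calibration given as a 3x3 rotation matrix and a translation vector. It does not reproduce the actual recording pipeline.

```python
from bisect import bisect_left

def nearest_index(sorted_stamps, t):
    """Index of the timestamp closest to t (ties go to the earlier scan)."""
    if not sorted_stamps:
        raise ValueError("sensor has no scans")
    i = bisect_left(sorted_stamps, t)
    if i == 0:
        return 0
    if i == len(sorted_stamps):
        return len(sorted_stamps) - 1
    before, after = sorted_stamps[i - 1], sorted_stamps[i]
    return i if after - t < t - before else i - 1

def to_world(points, rotation, translation):
    """Apply p_world = R * p_sensor + t to every point."""
    return [
        tuple(sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
              for r in range(3))
        for p in points
    ]

def fuse_frame(ref_time, sensors):
    """Fuse, for one reference time, the closest scan of every sensor.

    sensors: list of dicts with the hypothetical keys
    'stamps', 'scans', 'rotation', and 'translation'.
    """
    fused = []
    for sensor in sensors:
        idx = nearest_index(sensor["stamps"], ref_time)
        fused.extend(to_world(sensor["scans"][idx],
                              sensor["rotation"], sensor["translation"]))
    return fused
```

The sketch assumes that the sensor clocks are already synchronized, as is the case with PTP, so that timestamps from different sensors are directly comparable.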

3D Object Detectors
For our experiments, two different 3D object detectors were chosen according to their usability for the described task and the aforementioned state of the art. The implementations of both networks are based on the OpenPCDet framework [53]. Consequently, most of the hyperparameters are taken from the configuration files provided by OpenPCDet, unless otherwise stated. The two networks are briefly presented below.
PointPillars is a lightweight one-stage detector [21]. The input point cloud is converted into a voxel grid in which the voxels have an infinite height and thus form the namesake pillars. A feature vector is calculated for each pillar, and the resulting pseudo-image of features is further processed by 2D convolutions. An anchor-based detection head generates the final box predictions. Since PointPillars is an older model, the updated version proposed in [54] is used, as it offers high performance with comparatively low memory consumption and low inference time, which is crucial for a real-time application.
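The pillar construction can be illustrated by a small sketch that groups points into vertical columns on the ground plane. It only shows the grouping step, not the learned feature encoder. The function name, the grid parameters, and the simple truncation to `max_points` are assumptions for illustration; the original method samples points randomly when a pillar is over-full.

```python
import math
from collections import defaultdict

def pillarize(points, x_range, y_range, pillar_size, max_points=32):
    """Group points into pillars, i.e., grid cells without a height limit.

    Returns a dict that maps (ix, iy) grid indices to lists of points.
    Points outside the x/y range are dropped; z is never checked.
    """
    pillars = defaultdict(list)
    for x, y, z in points:
        if not (x_range[0] <= x < x_range[1]
                and y_range[0] <= y < y_range[1]):
            continue
        ix = math.floor((x - x_range[0]) / pillar_size)
        iy = math.floor((y - y_range[0]) / pillar_size)
        # Simplification: keep the first max_points points per pillar.
        if len(pillars[(ix, iy)]) < max_points:
            pillars[(ix, iy)].append((x, y, z))
    return dict(pillars)
```

Scattering one feature vector per occupied pillar back to its (ix, iy) cell yields the pseudo-image that the 2D convolutions operate on.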
CenterPoint is another fast voxelization approach [27]. The input point cloud is first converted to a voxel grid, and again a pseudo-image feature map is created and further processed by 2D convolutions. Unlike PointPillars, the predictions are not made via an anchor head but are based only on the prediction of object centers. The authors of [27] also propose a second-stage extension for CenterPoint, which refines the box predictions. The one-stage variant is used for the experiments.

Ground Truth Sampling
GT sampling is one of the most commonly used augmentation methods for 3D OD on LiDAR data. Objects from the training data are gathered in a database and inserted into the current LiDAR frame during training. Usually, the insertion is done at the same position and orientation as the original GT object. A common addition is to adjust the insertion height to the previously estimated ground plane of the current frame to prevent objects from floating above or sinking below the ground. Before insertion, a collision check is carried out based on the bounding boxes to prevent possible overlaps with other objects. However, this can still lead to unrealistic placement of objects within unlabeled point clusters such as walls. Furthermore, the original position of an object in the extraction frame is not necessarily within the region of interest in the current frame. To counteract this behavior, the authors of [48] propose a ValidMap based on the number of points and the height of the points in relation to the estimated ground plane. Objects are only inserted at valid positions on the map. Inspired by this approach, we also limit the insertion area. Unlike for ego-vehicle-based data, the environment for our task is fixed. This makes it possible to determine beforehand the regions in which objects are to be placed. For both sites K and G, a polygon is drawn around the areas where objects should be inserted. The polygons can be seen in Figure 2. Various methods for inserting the objects are also examined. A sketch of these variations can be seen in Figure 3.
This means that not only the position and orientation of the original GT object are used. A random selection of position and orientation from a uniform distribution is also experimented with. In addition, a polar-coordinate-based placement is considered, where the polar distance of the object is kept within a margin of two meters around the original distance. For the orientation, the relative angle to the world origin is used as a third option: based on the selected position, the orientation is calculated in such a way that the relative angle to the world origin is always the same as for the original object. From a human perspective, this should increase the realism of the augmentation method in the case of a single sensor. In our multi-sensor setup, the effects of these different insertion methods need to be investigated. The combination of all these methods results in eight different variants of GT sampling, since the GT orientation and the relative orientation are identical for GT positioning.
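The eight variants and the polygon constraint can be sketched as follows. This is a minimal illustration under several assumptions that the text does not specify: the polar distance is measured from the world origin, the azimuth for polar placement is drawn uniformly, random positions are drawn from the bounding rectangle of the polygon, and a placement outside the polygon is retried a fixed number of times before the object is discarded. All function names are hypothetical.

```python
import math
import random

POSITIONS = ("gt", "random", "polar")
ORIENTATIONS = ("gt", "random", "relative")
# GT position with relative orientation equals GT position with GT
# orientation, which leaves eight distinct variants.
VARIANTS = [(p, o) for p in POSITIONS for o in ORIENTATIONS
            if not (p == "gt" and o == "relative")]

def wrap(angle):
    """Wrap an angle to the interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def in_polygon(x, y, polygon):
    """Even-odd ray casting test for a simple 2D polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def sample_pose(gt_pose, position, orientation, extent, rng):
    """Draw one candidate pose (x, y, yaw) for the chosen variant."""
    gt_x, gt_y, gt_yaw = gt_pose
    if position == "gt":
        x, y = gt_x, gt_y
    elif position == "random":
        x = rng.uniform(extent[0], extent[1])
        y = rng.uniform(extent[2], extent[3])
    elif position == "polar":
        # Keep the distance to the origin within two meters of the original.
        r = max(0.0, math.hypot(gt_x, gt_y) + rng.uniform(-2.0, 2.0))
        phi = rng.uniform(-math.pi, math.pi)
        x, y = r * math.cos(phi), r * math.sin(phi)
    else:
        raise ValueError(position)
    if orientation == "gt":
        yaw = gt_yaw
    elif orientation == "random":
        yaw = rng.uniform(-math.pi, math.pi)
    elif orientation == "relative":
        # Preserve the heading relative to the line of sight from the origin.
        yaw = wrap(gt_yaw + math.atan2(y, x) - math.atan2(gt_y, gt_x))
    else:
        raise ValueError(orientation)
    return x, y, yaw

def place_in_polygon(gt_pose, position, orientation, polygon, rng,
                     max_tries=50):
    """Return a pose inside the polygon, or None if the object is discarded."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    extent = (min(xs), max(xs), min(ys), max(ys))
    tries = 1 if position == "gt" else max_tries
    for _ in range(tries):
        x, y, yaw = sample_pose(gt_pose, position, orientation, extent, rng)
        if in_polygon(x, y, polygon):
            return x, y, yaw
    return None
```

With GT positioning, an object whose original position lies outside the polygon cannot be moved and is therefore discarded in this sketch, which reduces the number of objects that are actually inserted.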

Evaluation Metrics
We use the common mean average precision (mAP) metric for evaluation. The method used is similar to that in COCO [55] and is a non-interpolated version of the mAP as opposed to, for example, KITTI [3]. The implementation is based on the one in MMDET [56] and has been adapted to support IoU matching thresholds per class. The IoU thresholds used are the same as in KITTI: 0.7 for cars and 0.5 for pedestrians. Two filters were applied to the GT during training and evaluation. First, only boxes containing at least five points were used, and second, all boxes outside of the defined grid range were discarded.
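A non-interpolated average precision can be computed as the area under the raw precision-recall curve. The following sketch is an illustration only and does not reproduce the MMDET-based implementation: it assumes that the matching of detections to GT boxes at the class-specific IoU threshold has already been done, so that each detection only carries its score and a true-positive flag.

```python
# Per-class IoU matching thresholds as stated in the text.
IOU_THRESHOLDS = {"car": 0.7, "pedestrian": 0.5}

def average_precision(detections, num_gt):
    """Non-interpolated AP for one class.

    detections: list of (score, is_true_positive) after IoU matching.
    num_gt: number of GT boxes of this class after filtering.
    """
    if num_gt == 0:
        return 0.0
    ap, tp, fp, prev_recall = 0.0, 0, 0, 0.0
    for _, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        recall = tp / num_gt
        precision = tp / (tp + fp)
        # Recall only grows at true positives, so false positives add zero.
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

def mean_average_precision(per_class):
    """per_class: dict mapping a label to (detections, num_gt)."""
    aps = [average_precision(dets, n) for dets, n in per_class.values()]
    return sum(aps) / len(aps)
```

For example, two GT cars with detections scored 0.9 (hit), 0.8 (miss), and 0.7 (hit) give an AP of 0.5 · 1 + 0.5 · 2/3 ≈ 0.83.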

Experiments
In this section, the experimental results for a total of three waves of experiments are reported and discussed. In the first experiment, the effect of the different GT sampling variants as well as the effect of the polygon is investigated. The second experiment investigates how well GT sampling is suited for training with little available data. In a third experiment, this is examined for the special case that only one empty frame without any object occurrences is available for each site. Here, the usable training data is generated only by GT sampling. Unless otherwise noted, all experiments are conducted using the dataset presented previously for the car and pedestrian classes, and results are reported for the test split. The validation split was used for tuning hyperparameters such as learning rates, number of epochs, and augmentation parameters. For all experiments, global rotation drawn from U(−π, π), scaling drawn from U(0.95, 1.05), and flips about both ground axes, each with a probability of 0.5, were applied. If GT sampling was performed, an attempt was made to insert 10 cars and 10 pedestrians, with each object being inserted only if it did not collide with another labeled object. Ground planes are utilized for the height placement. More advanced augmentation methods were not applied in order to keep the interpretation of the experiments as simple as possible and to avoid obscuring the results. The networks were trained for 100 epochs with an Adam-One-Cycle optimizer [57] with a learning rate of 0.001 for PointPillars and 0.003 for CenterPoint. Each training was repeated six times to counteract random effects during training, and the median of the six results is reported.
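The global augmentations listed above can be sketched as one function that transforms the points and the boxes consistently. The parameter ranges are taken from the text; the order of the operations (flip, then rotation, then scaling) and the box layout (center, dimensions, heading angle) are assumptions for this illustration and are not necessarily those of OpenPCDet.

```python
import math
import random

def augment(points, boxes, rng):
    """Apply a random global flip, rotation, and scaling to points and boxes.

    boxes: list of (cx, cy, cz, length, width, height, yaw).
    """
    angle = rng.uniform(-math.pi, math.pi)
    scale = rng.uniform(0.95, 1.05)
    flip_x = rng.random() < 0.5  # mirror across the x axis (y -> -y)
    flip_y = rng.random() < 0.5  # mirror across the y axis (x -> -x)
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    def transform_xy(x, y):
        if flip_x:
            y = -y
        if flip_y:
            x = -x
        x, y = cos_a * x - sin_a * y, sin_a * x + cos_a * y
        return x * scale, y * scale

    new_points = [(*transform_xy(x, y), z * scale) for x, y, z in points]
    new_boxes = []
    for cx, cy, cz, length, width, height, yaw in boxes:
        # The heading angle has to follow the same flips and rotation.
        if flip_x:
            yaw = -yaw
        if flip_y:
            yaw = math.pi - yaw
        nx, ny = transform_xy(cx, cy)
        new_boxes.append((nx, ny, cz * scale, length * scale,
                          width * scale, height * scale, yaw + angle))
    return new_points, new_boxes
```

A simple consistency check is that a point lying one meter ahead of a box center along its heading still lies along the new heading after augmentation, at a distance equal to the scale factor.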

Ground Truth Sampling Methods
In the first experimental wave, all eight meaningful combinations of the different positioning methods (ground truth, random, polar) and orientation methods (ground truth, random, relative) are evaluated with and without using the polygons. The results can be seen in Table 1. Intuitively, before looking at the results in more detail, one would expect that using the polygon would improve the results in all cases. With all positioning methods, placement of objects outside of reasonable boundaries is possible. Such placements are prevented by restricting insertion to the polygon, allowing the network to focus more on the actual region of interest. Regarding the GT sampling methods, based on experiments with single-sensor setups, one would expect the combination of polar positioning and relative orientation to produce the best results. It is unclear whether this also applies to our multi-sensor setup. Looking first at the results for PointPillars, it can be seen that the usage of the polygon increases the median mAP in all variations of GT sampling. The highest mAP, 66.61%, is achieved using the polygon with random positioning and GT orientation. The lowest mAP is obtained without the polygon with random positioning and random orientation. The two GT positioning variations benefit the least from the polygon, with 0.23 and 0.34 percentage points for GT and random orientation, respectively. Thus, the assumptions made previously are only partially accurate. Although the polygon improves the mAP in all cases, the impact on GT positioning is relatively small. This could be due to the reduced number of inserted objects, as all objects outside the polygons are discarded. Contrary to expectations, polar positioning with relative orientation does not perform best but second best. Due to the multi-sensor setup, this variant is not necessarily the most realistic in terms of scan pattern and distribution of points. The random positioning with GT orientation, which performed best, is not realistic either. Due to the random positioning, the orientation of the objects is not correct in most cases. Consequently, the random choice of position and orientation should also give very good results. In fact, however, this variant has only the fourth-highest mAP of 65.49%. Restricting the orientation angles to existing angles in the dataset could make the difference here.
Roughly the same picture emerges for CenterPoint. The polygon again generally improves the mAP, except for the two GT positioning variations. The mAP deteriorates by 1.06 percentage points from 71.78% to 70.71%, and by 0.03 percentage points from 72.04% to 72.01%, for GT positioning with GT orientation and random orientation, respectively. The explanation attempted for PointPillars remains valid here as well: objects outside the polygons are never added during training. The best mAP for CenterPoint, 73.50%, is obtained with polar positioning and GT orientation using the polygon. The lowest mAP, 70.71%, is obtained with GT positioning and GT orientation with the polygon. The best GT sampling variation for CenterPoint differs from that for PointPillars, but the same argumentation applies to the orientation. The GT orientation narrows the possible rotation angles to those occurring in the dataset, with the corresponding distribution. Based on the results for both networks, it seems that random or polar positioning guided by the polygons is beneficial compared to GT positioning. One explanation could be the increased variety of object positions and, consequently, of the scenes created by the augmentation.
On the basis of these observations, the polygons are also used in the following experiments. Since it is not possible to use the GT information for some of the upcoming experiments, random position and random orientation are used in the following. On average, this combination has the highest mAP of the GT sampling variants.

Reduction of Dataset Size
The amount of training data is one of the most critical factors for good results of a deep learning approach. Therefore, the next wave of experiments examines the ability of the GT sampling augmentation to compensate for a lack of data. To this end, the size of the training dataset is limited in this experiment to roughly 75%, 50%, and 25% of the original size. These subdivisions of the training dataset were created in the same way as the training, validation, and test splits: subsets of all training sequences were created according to the desired object ratios. The number of iterations per epoch is set to that of the initial 100% training set in all cases to allow a fair comparison. This is done by randomly reusing samples during training until the desired number of iterations per epoch is reached. The database for the GT augmentation is adjusted to the current dataset size accordingly, such that only objects available in the current training set are present in the object database. The results can be seen in Figure 4.
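The protocol for the reduced training sets can be sketched as follows. It is a minimal illustration with hypothetical names: it assumes that every frame of the subset is used once per epoch before frames are reused at random, and that database entries carry a `frame_id` field identifying the frame they were cut from.

```python
import random

def epoch_indices(subset_size, target_size, rng):
    """Frame indices for one epoch, padded to the size of the full set.

    Every frame of the subset appears at least once; the remaining
    slots are filled by randomly reusing frames.
    """
    indices = list(range(subset_size))
    missing = max(0, target_size - subset_size)
    indices.extend(rng.randrange(subset_size) for _ in range(missing))
    rng.shuffle(indices)
    return indices

def restrict_database(database, subset_frame_ids):
    """Keep only database objects that originate from the current subset."""
    allowed = set(subset_frame_ids)
    return [obj for obj in database if obj["frame_id"] in allowed]
```

Restricting the database matters for the comparison: otherwise, objects from discarded sequences would still reach the network through GT sampling.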
Before looking at the results, two things can be expected intuitively and based on the related work. First, the GT sampling augmentation can be expected to enhance the results in all cases for both networks. Second, it can be expected that the mAP decreases the less data is used. Indeed, these two observations can be made for most of the results. The mAP decreases the less data is used, with an exception between 75% and 50% dataset size. This is the only irregularity of this kind. For PointPillars, GT sampling increases the results for all four dataset sizes. The gains with GT sampling amount to 2.92, 4.75, 5.67, and 7.83 percentage points, respectively. It can be observed that the median differences increase the less data is used for training. Thus, it can be concluded that the GT sampling augmentation is not only able to increase the quality of the training results but is also able to cushion the effect of less training data by providing more variations of the available data. To reinforce this thesis, the experiments are repeated for the CenterPoint object detector. Again, with the utilization of GT sampling augmentation, the median mAPs increase in all four cases. The differences between the medians with and without GT sampling amount to 4.15, 5.18, 5.23, and 6.58 percentage points, respectively. Therefore, the same observation as for PointPillars can be made: the effects of the smaller dataset are mitigated by GT sampling.
The question arises as to how this cushioning effect of GT sampling applies to ego-vehicle-based data. Therefore, the same experiment was performed on the KITTI dataset. The initial 100% split is the common one for KITTI. The other splits are again created by selecting whole sequences. Only CenterPoint is used for this experiment because its results are more stable and have lower variance. The GT sampling variation is set to GT position and GT orientation. The common validation split is used for evaluation, but the official KITTI benchmark evaluation is not. Instead, the evaluation described before is used here. Thus, the results are only comparable among themselves, not with other publications regarding the KITTI benchmark for 3D OD. Table 2 shows the results. The observations made on the infrastructural data can also be made for the ego-vehicle-based KITTI data. The median mAP is higher with GT sampling than without for all four sizes of the training set. The gain induced by utilizing the GT sampling method amounts to 3.73 percentage points for 100% dataset size and 10.78 percentage points for 25% dataset size. In the case of the KITTI dataset, the cushioning effect of the GT sampling method is even more substantial than for the infrastructural data used for the rest of this work. Therefore, the described effect of this augmentation method is not exclusive to a fixed environment.

Empty Case Experiments
In this experiment, the size of the training data is reduced to only one frame with no objects present for each of the sites G and K. Thus, all meaningful training data is produced by the GT sampling augmentation. The GT database is taken from the 100%, 75%, 50%, and 25% training splits, respectively, to investigate the performance for different numbers of objects. Once more, the experiments are performed with and without using the polygons to look further into their impact in this particular case. The results are depicted in Figure 5. Based on the previous experiments, it can be expected that the polygon increases the mAP. Furthermore, one could intuitively expect that the mAP decreases with a smaller GT sampling database. Nothing can be said in advance about the magnitude of the mAP, as such an experiment has not yet been carried out in a similar form. Looking at the results, one of the previous assumptions is directly refuted. Contrary to expectations, the results are comparably stable across the different sizes of the GT database. The largest difference between the four database variations is only 2.06 percentage points. Considering the number of data samples dropped, that is surprisingly small. Note that, unexpectedly, the mAP is highest for a database size of 75%. With utilization of the polygon, the mAP is higher in all cases. Here, too, the results for the different database sizes are surprisingly close together. Once more, it can be observed that the mAP is highest for 75% database size. Looking at the results for CenterPoint, the same observations can be made. The polygon increases the result in all cases by up to 4.90 percentage points, which is a higher gain than shown in Table 1. The polygon thus gains even more value in the case of empty frames.
The number of objects is not as important as originally assumed. This might be caused by a low overall variance of the objects in the data. The drop in mAP compared to the experiments regarding the dataset size suggests that the exact positions and other physical effects, such as occlusion and sampling patterns, have an even stronger influence than previously expected. The better results of the 75% database compared to the 100% database indicate that the sheer number of objects is not the most relevant factor.
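The empty-frame setup of this section can be sketched as composing each training sample from a background frame and objects drawn from the database. The sketch makes several assumptions that are not taken from the paper: database objects already carry their target pose in world coordinates, the collision check is a conservative test on circles circumscribing the boxes in bird's-eye view rather than an exact rotated-box test, and all names are hypothetical.

```python
import math
import random

def boxes_collide(box_a, box_b):
    """Conservative bird's-eye-view check using circumscribed circles.

    Boxes are (cx, cy, cz, length, width, height, yaw).
    """
    radius_a = 0.5 * math.hypot(box_a[3], box_a[4])
    radius_b = 0.5 * math.hypot(box_b[3], box_b[4])
    distance = math.hypot(box_a[0] - box_b[0], box_a[1] - box_b[1])
    return distance < radius_a + radius_b

def compose_frame(background_points, database, attempts_per_class, rng):
    """Insert sampled objects into an (empty) background frame.

    attempts_per_class: e.g. {"car": 10, "pedestrian": 10}. An object is
    skipped if it collides with an already inserted object.
    """
    points = list(background_points)
    boxes = []
    for label, attempts in attempts_per_class.items():
        pool = [obj for obj in database if obj["label"] == label]
        if not pool:
            continue
        for _ in range(attempts):
            obj = rng.choice(pool)
            if any(boxes_collide(obj["box"], placed) for _, placed in boxes):
                continue
            boxes.append((label, obj["box"]))
            points.extend(obj["points"])
    return points, boxes
```

Since the background contains no labeled objects, every GT box seen by the network in this setting originates from the database.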

Conclusions & Further Work
In this work, we investigated 3D OD on an infrastructural LiDAR setup for autonomous factory driving. By using the GT sampling method, we were able to improve performance while compensating for the lack of labeled data. Results were generally improved when a polygon was used to constrain the placement of objects. Such a polygon is easy to create for a fixed environment. Moreover, the most commonly used variant of GT sampling, where objects are inserted in their original position and orientation, does not perform best. It has been shown that the GT sampling method can also mitigate the negative effect of less labeled data. This was demonstrated not only for the infrastructural setup but also with an ego-vehicle-based dataset. Finally, the possibility of training only with a database of objects inserted into the fixed environment was investigated. Surprisingly, the size of the database was found to have a smaller impact than expected.
This last experiment could be the starting point for future work. A possible continuation is the enrichment of the database for GT sampling with objects from other datasets. This could further reduce the labeling effort. The insertion of object models through ray casting is also an interesting extension that borders on simulation. Furthermore, an additional consideration of occlusion could be implemented, or a placement of the objects based on a probability map.

Figure 1. Excerpt of a fused LiDAR point cloud from the infrastructural setup of factory site K, recorded with three LiDAR sensors. Labeled ground truth boxes are shown in pink. Point color encodes the height from blue as the lowest to yellow as the highest. Each LiDAR position is marked with small coordinate axes. Large coordinate axes mark the world coordinate origin.

Figure 2. The two polygons drawn to restrict the area into which objects are inserted by the GT sampling method. The polygon for site K is shown on the left, the polygon for site G on the right. The hatched area represents the valid space. LiDAR sensors are depicted as black squares. The axes indicate the scale in meters.

Figure 3. Sketch of all eight ground truth sampling variations. The LiDAR sensor is depicted as a black square. Boxes and points refer to an exemplary object placed with all eight variations. The lines inside the boxes indicate the orientation. The colors of the boxes represent the position variant, and the colors of their middle lines represent the orientation variant. The dashed circle shows the polar distance of the ground truth.

Figure 4. Boxplot of the mean average precision (car and pedestrian) on the test set for different sizes of the training split for PointPillars and CenterPoint without (red) and with (blue) GT sampling. The differences in the medians between without and with GT sampling for each dataset size are shown in the bottom left corner. Best seen zoomed in and in color.

Figure 5. Boxplot of the mean average precision (car and pedestrian) on the test set for different sizes of the GT database for PointPillars and CenterPoint without (red) and with (blue) polygon. The differences in the medians between without and with usage of the polygon for each database size are shown inside the box. Best seen zoomed in and in color.

Table 1. The mean average precision (car and pedestrian) for PointPillars and CenterPoint on the test split with different combinations of GT sampling methods. The left value in each cell is without polygon; the right value is with polygon. Values next to the network names are baseline results without GT sampling. Reported values are the median of six training runs each. Higher values are better. The best results are marked in bold for PointPillars and CenterPoint, respectively.

Table 2. Results for KITTI (car and pedestrian) on the validation split for CenterPoint with different training set sizes. Reported values are the median of six training runs.