1. Introduction
Autonomous driving is currently the focus of several research communities and industrial partners. Mapping the surroundings is a basic task in machine perception. Multiple modalities are used for this task, e.g., cameras, LiDARs (Light Detection and Ranging), RADARs, IMUs, GPS, etc. For these instruments to work together, their relative position and orientation must be known a priori; thus, extrinsic sensor calibration is needed. This paper introduces a new method for the extrinsic calibration of camera and LiDAR systems. It is easy to use, requires minimal user intervention, can handle multiple sensors and outperforms the state-of-the-art methods in accuracy.
Cameras offer an inexpensive solution for machine vision. They provide high-resolution color images at a relatively high frame rate. Besides that, image processing is a well-studied topic of research, and many algorithms already exist for object detection and recognition, camera movement estimation, semantic segmentation, etc. On the other hand, cameras can be used only in appropriate lighting conditions, and problems may occur with occlusion, shadows or at night.
The LiDAR (Light Detection and Ranging) technique applies infrared light to detect objects in its surroundings. 3D LiDARs can map the environment and produce a sparse 3D point cloud. These sensors use active illumination; thus, they can also be used at night. However, they retrieve only low-resolution depth information with an accuracy of 2–5 cm at a maximal refresh rate of 15 Hz. For example, the well-known Velodyne HDL-64 uses 64 vertically-stacked light beams and a rotating head to scan a horizontal field of view of 360°. A notable issue with this sensor is that, due to the rotation, the 3D scan is not captured at a single instant: successive measurements of the same beam are shifted in time. Compensating for this shift is very challenging when the vehicle moves fast.
Cameras and LiDAR sensors compensate for each other's shortcomings; thus, they can be effectively used jointly in many scenarios, robotics and autonomous driving being the most popular. However, extrinsic calibration is needed for these sensors to work together effectively, meaning that their relative pose needs to be precisely estimated. LiDARs provide sparse point clouds with only position information, while cameras provide high-resolution color images. The calibration of these different modalities is a challenging yet important topic.
For calibration, we distinguish online and offline methods. Online calibration means that the sensors are calibrated during the usage of the system, while offline calibration is performed beforehand. Online methods are used when the vehicle cannot be accessed easily for calibration purposes; however, when feasible, offline methods provide more accurate results. Cameras and LiDARs have intrinsic and extrinsic parameters. In this paper, we address the problem of extrinsic, offline calibration. That means that the intrinsic parameters of the sensors are considered to be known a priori, and only the extrinsic parameters need to be calculated, namely the relative rotation and translation of the sensors. It is shown that the calibration can be carried out using an ordinary cardboard box, outperforming the state-of-the-art methods in accuracy, while requiring only a single observation of the calibration object.
Rodriguez et al. [1] used a black circle-based planar board to avoid the large noise caused by chessboard patterns. The 3D coordinates of the center of the circle and the normal vector of the plane were estimated. Their method needed at least six positions of the calibration object. Finally, the initial guess of the LiDAR-camera rigid transformation was refined by the well-known Levenberg–Marquardt (LM) algorithm [2,3].
An automatic calibration method was published by Alismail et al. [4]. It used a planar calibration object with a black circular region and a marked center. Random Sample Consensus (RANSAC) [5] was applied for plane extraction. The center and normal of the circle were computed based on a single camera view. Finally, point-to-plane Iterative Closest Point (ICP) [6] was used with nonlinear optimization by LM to refine the extrinsic parameters.
Park et al. [7] used a white, homogeneous, planar triangle- or diamond-shaped board for calibration. Several positions of the board were needed, or at least three boards at the same time. Another drawback of their algorithm was that the spatial coordinates of the planar board were estimated rather than measured, which limited the accuracy of the calibration. The details of this method can be found in Section 6.1.
Gong et al. published a method in [8] that needed at least two scans of the same trihedron object measured by both instruments for the calibration. This produced a significant amount of data to process: in their work, it took 20 s to calibrate using nine observations. The main disadvantage of their method is that it required considerable human intervention, e.g., the separation of the trihedron points and the selection of the related planes in the images had to be performed manually.
A different type of calibration object was used by Velas et al. [9]: a planar object containing four circular holes in front of a white background. Their method was based on the work of Levinson and Thrun [10]. The holes were detected automatically in both the 3D LiDAR point cloud and the acquired image. However, this method cannot be applied to point clouds measured by the Velodyne VLP-16 LiDAR due to their sparsity.
Geiger et al. [11] introduced a method to calibrate a LiDAR-camera pair taking only one measurement by the LiDAR and a single image by the camera. The method was fully automatic; however, it needed multiple chessboards and at least two camera images from different positions. The algorithm is briefly introduced in Section 6.1 with a comparison to the proposed method.
Hassanein et al. published a method for the calibration of a stereo camera pair and a LiDAR sensor in [12]. Their method required a well-textured calibration object and a pre-calibrated stereo setup. They used Speeded Up Robust Features (SURF) [13] to reconstruct the scene in 3D and ICP [14] to match the point clouds of the LiDAR and the stereo reconstruction. The details of this algorithm are discussed in Section 6.1.
Table 1 summarizes the strengths and weaknesses of the above-mentioned methods.
The main contributions of the paper are as follows: a new LiDAR-camera calibration method is introduced, which uses ordinary cardboard boxes. It achieves high accuracy and can be used with multiple sensors. A new Bundle Adjustment (BA)-based technique is introduced to reduce the overall error of the LiDAR-camera system calibration. The method is evaluated on both synthetic and real-world data and compared against state-of-the-art techniques. Moreover, a technique is also presented for estimating the car body border, as a 2D bounding box, with respect to the calibrated sensors.
This study is an extension of our previous paper [15]. The novelties here are the application of BA, comparisons to other methods, a proof of the convergence of the proposed box fitting algorithm and the 2D bounding box calculation.
Our work has limitations, of course: (i) It is an offline calibration approach; therefore, the change in sensor setup during usage cannot be handled. In other words, if the calibration becomes inaccurate due to, e.g., mechanical resonance, the whole calibration procedure has to be repeated; (ii) The calibration needs a special setup that is not always available. However, the accuracy of the proposed calibration algorithms outperforms the state-of-the-art rival techniques, as is shown in this paper.
2. Calibration Outline
Offline calibration methods use one or more calibration objects that can be easily detected by both the LiDAR sensor and the camera. Planar calibration objects with a checkerboard pattern [11] or other rich textures are very popular in the literature. A planar board can be easily detected in spatial point clouds, and the pattern of the board is identifiable in the camera image. An example of accurate plane detection can be seen in Section 7, where a technique is presented for the 2D bounding box calculation of a car; using this technique, the accuracy of the bounding box is below two centimeters. However, texture patterns can cause heavy noise in the LiDAR point cloud, especially the checkerboard pattern [7]. Moreover, all of these methods struggle with the fact that only the plane itself, not the edges of the board, can be precisely calculated in a LiDAR point cloud. The application of a low-resolution LiDAR sensor, e.g., the Velodyne VLP-16, is even more challenging. The proposed method is developed for both low- and high-resolution LiDAR devices; thus, we decided to use a spatial object instead of a planar one.
Cardboard boxes can be found everywhere, and precise ones can be easily manufactured. They have a well-defined shape; therefore, they are a great choice for calibrating camera and LiDAR device pairs. Their three perpendicular sides can be accurately detected in a LiDAR point cloud, and the intersections of the planar sides yield the edges of the box. Given the edges and the dimensions of the box, the corners can be precisely calculated. If the projections of these corners are selected in the camera image, 3D → 2D correspondences are also known. In this case, the camera-to-LiDAR calibration is equivalent to the Perspective-n-Point (PnP) problem, which can be solved by several efficient algorithms, e.g., Efficient PnP (EPnP) [16,17].
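To make the PnP step concrete, a minimal sketch follows, using OpenCV's EPnP solver. The corner coordinates, intrinsic matrix and ground-truth pose below are invented for illustration; the actual inputs come from the box fitting and corner selection steps described in this paper.

```python
import numpy as np
import cv2

# Seven visible corners of a hypothetical 0.5 m x 0.4 m x 0.3 m box,
# expressed in the LiDAR coordinate system (metres).
corners_lidar = np.array([
    [2.0, 0.0, 0.0], [2.5, 0.0, 0.0], [2.0, 0.4, 0.0], [2.5, 0.4, 0.0],
    [2.0, 0.0, 0.3], [2.5, 0.0, 0.3], [2.0, 0.4, 0.3]])

K = np.array([[800.0, 0.0, 320.0],       # assumed pinhole intrinsics
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)                        # images assumed to be undistorted

# Synthesize the image-side observations with a made-up ground-truth pose;
# in practice these pixel coordinates are selected in the camera image.
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([-2.0, -0.2, 1.0])
corners_image, _ = cv2.projectPoints(corners_lidar, rvec_gt, tvec_gt, K, dist)

# Recover the LiDAR-to-camera extrinsics with EPnP (needs >= 4 correspondences).
ok, rvec, tvec = cv2.solvePnP(corners_lidar, corners_image, K, dist,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)                # rotation matrix form of the result
print(ok, rvec.ravel(), tvec.ravel())     # should match the ground-truth pose
```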
The calibration is carried out as follows. A cardboard box is placed in the common field of view of the LiDAR and camera sensors such that three perpendicular sides are visible from both. The required inputs of the calibration are: (i) the camera image(s); (ii) the LiDAR point cloud(s); and (iii) the lengths of the box edges.
The proposed method needs only one image per camera and one point cloud per LiDAR sensor. Other calibration methods need multiple observations of the same calibration object to achieve the desired accuracy, which makes the calibration procedure time-consuming. The input required by the proposed method is thus minimal.
Figure 1 shows the outline of the calibration procedure. First, the coordinates of the box corners are extracted from the point cloud. The rough area of the calibration box needs to be cropped manually; however, this crop does not need to be precise: the method can robustly find the planes belonging to the calibration box and eliminate other objects falling into this area. After outlier filtering, the intersections of the box planes are calculated. Finally, the corners are refined by an iterative method consisting of rotations and translations of the fitted box model; the convergence of this iteration is proven, as discussed in the Appendix. The calibration can be applied in three cases:
Camera-LiDAR calibration: Projections of the box corners need to be selected in the camera image. The spatial (LiDAR) and 2D (camera) point correspondences define a PnP problem, which can be effectively solved by, e.g., the EPnP algorithm [16].
LiDAR-LiDAR calibration: The corners of the same calibration box need to be calculated in the two point clouds separately. Then, the extrinsic parameters can be found by point registration (see the sketch after this list).
Car Body-LiDAR calibration: The last calibration step is to estimate the car body location with respect to the sensors. This step is independent of the camera-LiDAR and LiDAR-LiDAR calibrations. A single plane is required that can be placed at four different locations: to the left and right side, in front of and behind the car. This step is essential for autonomous driving, as the car's dimensions determine the free space required to avoid collisions.
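The two geometric building blocks mentioned above can be sketched in a few lines of NumPy. This is an illustrative implementation under our own naming, not the exact code of the proposed method: a box corner is recovered as the intersection of three fitted side planes, and the LiDAR-LiDAR registration of two corner sets is done here with the closed-form Kabsch solution, one common choice for point registration.

```python
import numpy as np

def corner_from_planes(normals, offsets):
    """Box corner as the intersection of three side planes n_i . x = d_i.
    normals: (3, 3) row-wise unit normals, offsets: (3,) plane offsets."""
    return np.linalg.solve(np.asarray(normals), np.asarray(offsets))

def rigid_registration(src, dst):
    """Closed-form (Kabsch) least-squares fit of dst ~ R @ src + t.
    src, dst: (N, 3) arrays of corresponding corners from the two LiDARs."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflection
    R = Vt.T @ D @ U.T
    return R, mu_d - R @ mu_s
```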
5. LiDAR-Camera System Calibration
In the case of multiple LiDAR and camera sensors, a minimization step can be applied that simultaneously minimizes the overall error of the cameras and LiDARs. The proposed method also optimizes the parameters of the spatial box(es). The minimization is achieved by two successive numerical refinement steps over two cost functions: one describing the 3D-to-3D discrepancy of the LiDAR calibration and one describing the re-projection error for the cameras.
The coordinate system of an arbitrarily-selected LiDAR device is set as the reference (origin) of the system; the poses of the cameras $(\mathbf{R}^{C_i}, \mathbf{t}^{C_i})$, the other LiDARs $(\mathbf{R}^{L_j}, \mathbf{t}^{L_j})$ and the boxes $(\mathbf{R}^{B_k}, \mathbf{t}^{B_k})$ are defined w.r.t. it, where indices $i$, $j$ and $k$ denote the $i$-th camera, $j$-th LiDAR and $k$-th box. Each rotation matrix $\mathbf{R}$ and translation vector $\mathbf{t}$ define the rigid reference-to-local coordinate system transformation. The cost functions for a LiDAR-camera pair are as follows:

$$e_{\mathrm{L}}(\mathbf{p}) = \mathbf{n}_{k,l}^{T}\left(\mathbf{R}^{B_k}\left(\mathbf{R}^{L_j}\right)^{-1}\left(\mathbf{p}-\mathbf{t}^{L_j}\right)+\mathbf{t}^{B_k}\right), \tag{8}$$

$$e_{\mathrm{C}}(\mathbf{q},\mathbf{u}) = \left\| \mathbf{u} - \pi_i\!\left( \mathbf{R}^{C_i}\left(\mathbf{R}^{B_k}\right)^{-1}\left(\mathbf{q}-\mathbf{t}^{B_k}\right) + \mathbf{t}^{C_i} \right) \right\|^2, \tag{9}$$

where $\mathbf{p}$ is a spatial point in the $j$-th LiDAR point cloud, $\mathbf{n}_{k,l}$ is the normal vector of the $l$-th side (the side on which $\mathbf{p}$ lies) of the $k$-th observed box, expressed in the box-local frame whose origin is a box corner so that the three main planes pass through it. $\mathbf{q}$ and $\mathbf{u}$ are a spatial corner point of the $k$-th observed box and the related measured image coordinate in the $i$-th camera image, respectively. $\pi_i$ is the projection function of the $i$-th camera.
In the case of Equation (8), point $\mathbf{p}$ of the $j$-th LiDAR point cloud is transformed to world coordinates using the inverse of the LiDAR pose $(\mathbf{R}^{L_j}, \mathbf{t}^{L_j})$. Then, the world coordinates are easily projected to the local coordinates of the $k$-th box using $(\mathbf{R}^{B_k}, \mathbf{t}^{B_k})$. Having the 3D point measured by the LiDAR now in the local system of the box, the distance w.r.t. the three main planes is computed using the scalar product with the three respective normals, as was introduced before in Equation (2).
In Equation (9), a virtual corner $\mathbf{q}$ of the $k$-th box is mapped to world coordinates using the inverse of $(\mathbf{R}^{B_k}, \mathbf{t}^{B_k})$. In the next step, the point goes through a world-to-image-plane transformation by first applying the world-to-camera transformation $(\mathbf{R}^{C_i}, \mathbf{t}^{C_i})$ and then projecting it to the $i$-th image plane by the projection function $\pi_i$. The computed cost is the squared norm of the difference between the measured point and the projected one.
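For concreteness, the two costs can be transcribed as follows. This is our own sketch under the conventions above: each pose pair (R, t) maps reference (world) to local coordinates, the three box side planes are assumed to pass through the box-local origin, and a simple pinhole model with intrinsic matrix K stands in for the generic projection function $\pi_i$.

```python
import numpy as np

def lidar_point_residual(p, R_L, t_L, R_B, t_B, n):
    """Eq. (8): signed distance of LiDAR point p to the side plane with normal n."""
    p_world = R_L.T @ (p - t_L)     # LiDAR frame -> reference frame (pose inverse)
    p_box = R_B @ p_world + t_B     # reference frame -> box-local frame
    return n @ p_box                # plane assumed to pass through the box origin

def reprojection_residual(q, u, R_B, t_B, R_C, t_C, K):
    """Eq. (9): measured pixel u minus the projection of box corner q."""
    q_world = R_B.T @ (q - t_B)     # box frame -> reference frame
    q_cam = R_C @ q_world + t_C     # reference frame -> camera frame
    proj = K @ q_cam
    return u - proj[:2] / proj[2]   # pinhole model as projection function pi_i
```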
The parameters of the devices (cameras and LiDARs) and those of the spatial box(es) are loosely coupled; therefore, the Jacobian of the cost function is sparse. In such cases, the BA paradigm [24] can be applied.
The inputs of this step are (i) the LiDAR point clouds, each point labeled by which box and which face it belongs to; and (ii) the 3D → 2D correspondences between box corners and image points.
In the first pass of the optimization, the LiDAR and box poses are refined numerically: the distance of the LiDAR point clouds to the boxes, each represented by three perpendicular planes, is minimized. The minimization is carried out simultaneously, in a BA-like manner, refining all LiDAR and box parameters:

$$\min_{\mathbf{R}^{L_j},\,\mathbf{t}^{L_j},\,\mathbf{R}^{B_k},\,\mathbf{t}^{B_k}} \; \sum_{(\mathbf{p},\,l,\,k,\,j)\,\in\,\mathcal{O}_{\mathrm{L}}} \rho_{\mathrm{H}}\!\left( \mathbf{n}_{k,l}^{T}\left( \mathbf{R}^{B_k}\left(\mathbf{R}^{L_j}\right)^{-1}\left(\mathbf{p}-\mathbf{t}^{L_j}\right)+\mathbf{t}^{B_k} \right) \right),$$

where $\mathcal{O}_{\mathrm{L}}$ is the set of observations through the LiDARs. Its elements $(\mathbf{p}, l, k, j)$ denote point $\mathbf{p}$ of the $l$-th side of the $k$-th cube seen by the $j$-th LiDAR. Note that the Huber loss $\rho_{\mathrm{H}}$ is applied here, assuming a noise of 10 cm, in case the labeling of the point cloud is not perfect and contains outliers.
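A minimal sketch of this pass follows, assuming SciPy's least_squares optimizer and a simple axis-angle + translation parameterization; the observation list obs and the initial poses are hypothetical placeholders.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def pack(poses):
    """List of (R, t) pairs -> flat parameter vector (axis-angle + translation)."""
    return np.concatenate([np.r_[Rotation.from_matrix(R).as_rotvec(), t]
                           for R, t in poses])

def unpack(x, n):
    """Flat parameter vector -> list of n (R, t) pairs."""
    return [(Rotation.from_rotvec(x[6*i:6*i+3]).as_matrix(), x[6*i+3:6*i+6])
            for i in range(n)]

def lidar_residuals(x, obs, n_lidars, box_normals):
    """One signed point-to-plane distance per labeled LiDAR point."""
    poses = unpack(x, n_lidars + len(box_normals))
    res = []
    for p, l, k, j in obs:                       # point, side, box id, LiDAR id
        R_L, t_L = poses[j]                      # the reference LiDAR can be
        R_B, t_B = poses[n_lidars + k]           # held fixed at the identity
        p_box = R_B @ (R_L.T @ (p - t_L)) + t_B  # LiDAR -> world -> box frame
        res.append(box_normals[k][l] @ p_box)
    return np.asarray(res)

# f_scale=0.1 makes the Huber loss switch at ~10 cm, the assumed outlier noise.
# result = least_squares(lidar_residuals, pack(initial_poses),
#                        args=(obs, n_lidars, box_normals),
#                        loss='huber', f_scale=0.1)
```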
The second pass uses the refined and now fixed boxes to refine only the camera poses, based on the following compound cost:

$$\min_{\mathbf{R}^{C_i},\,\mathbf{t}^{C_i}} \; \sum_{(\mathbf{q},\,\mathbf{u},\,k,\,i)\,\in\,\mathcal{O}_{\mathrm{C}}} \left\| \mathbf{u} - \pi_i\!\left( \mathbf{R}^{C_i}\left(\mathbf{R}^{B_k}\right)^{-1}\left(\mathbf{q}-\mathbf{t}^{B_k}\right) + \mathbf{t}^{C_i} \right) \right\|^2,$$

where $\mathcal{O}_{\mathrm{C}}$ is the set of 3D-to-2D correspondences between cube corners and camera images. An element $(\mathbf{q}, \mathbf{u}, k, i)$ of this set denotes a corner $\mathbf{q}$ of cube $k$ observed as image point $\mathbf{u}$ in the $i$-th camera.
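The second pass can be sketched analogously, reusing pack/unpack from the previous snippet; the refined box poses are now fixed inputs, and per-camera intrinsic matrices again stand in for the projection functions $\pi_i$.

```python
import numpy as np

def camera_residuals(x, corr, box_poses, K):
    cams = unpack(x, len(K))                     # one (R, t) per camera
    res = []
    for q, u, k, i in corr:                      # corner, pixel, box id, cam id
        R_B, t_B = box_poses[k]
        R_C, t_C = cams[i]
        q_cam = R_C @ (R_B.T @ (q - t_B)) + t_C  # box -> world -> camera frame
        proj = K[i] @ q_cam
        res.append(u - proj[:2] / proj[2])       # re-projection error
    return np.concatenate(res)

# result = least_squares(camera_residuals, pack(initial_cam_poses),
#                        args=(corr, box_poses, K))   # plain squared loss
```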
In the literature, the calibration objects and cameras are usually jointly calibrated and refined in advance of the LiDAR calibration [11] to achieve high-quality camera and calibration object poses. However, such an approach requires multiple cameras with overlapping fields of view and/or numerous calibration objects.
The effect of this minimization step on the overall error can be seen in Table 2: the proposed BA-like method significantly reduces the overall calibration cost. The synthetic test scene consists of two cameras and two LiDARs, and Gaussian noise is added to the synthetic LiDAR point cloud. The real-world test is done using a Velodyne HDL-64, two Velodyne VLP-16 sensors and two RGB cameras.
7. Calibration of Car Dimensions
For autonomous driving, it is important to know the size and location/orientation of the car body with respect to the sensor position(s). In this scenario, we are interested in the 2D bounding box of the car from a bird's-eye view. For this purpose, a board was used as the calibration object, held parallel to each of the four sides of the car. The plane of the board was vertical for the sake of simplicity. We show here how the dimensions of a Toyota Prius are calibrated.
However, the Velodyne HDL-64 cannot detect points that are too close to the instrument: objects closer than 1.2–1.5 m are invisible in the point cloud. This is not a problem when measuring the front and back of a Toyota Prius, but it is for the sides. During the measurement of the front and rear sides, the board was held as close as possible to the car. For the left and right sides, the board was moved approximately one meter away while keeping it aligned, and this distance was accurately measured.
The black point clouds in Figure 9 indicate the acquired LiDAR points of the four different positions of the board.
After the measurement, the planes are calculated using RANSAC [5], with the detection error of the Velodyne LiDAR set as the inlier threshold. Then, the intersection lines of adjacent planes are calculated, and finally, the bounding box is acquired by intersecting these lines with the ground plane. In Figure 9, the green lines mark the intersections of the detected planes, and the calculated bounding box is visualized by red lines.
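A sketch of this computation is given below, with a basic RANSAC plane fit written out explicitly (any robust plane estimator would do) and the board planes assumed to be vertical, so that each plane traces a 2D line on the ground plane; the measured one-meter offset of the side boards would additionally be subtracted along the corresponding line normal. All names are ours.

```python
import numpy as np

def ransac_plane(pts, thresh=0.02, iters=500):
    """Robustly fit n . x = d to pts (N, 3); thresh ~ the LiDAR detection error."""
    rng = np.random.default_rng(0)
    best_inl, best = 0, None
    for _ in range(iters):
        s = pts[rng.choice(len(pts), 3, replace=False)]
        n = np.cross(s[1] - s[0], s[2] - s[0])
        if np.linalg.norm(n) < 1e-9:
            continue                              # degenerate (collinear) sample
        n /= np.linalg.norm(n)
        d = n @ s[0]
        inl = np.count_nonzero(np.abs(pts @ n - d) < thresh)
        if inl > best_inl:
            best_inl, best = inl, (n, d)
    return best

def ground_trace(n, d):
    """2D line a . (x, y) = d traced by a vertical plane on the ground plane."""
    return n[:2], d

def line_intersection(a1, d1, a2, d2):
    """Bounding-box corner: intersection of two 2D lines a_i . (x, y) = d_i."""
    return np.linalg.solve(np.vstack([a1, a2]), np.array([d1, d2]))

# e.g., for two adjacent board scans front_pts and left_pts (hypothetical arrays):
# n_f, d_f = ransac_plane(front_pts); n_l, d_l = ransac_plane(left_pts)
# corner_xy = line_intersection(*ground_trace(n_f, d_f), *ground_trace(n_l, d_l))
```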
According to the official car specification sheet, the Toyota Prius is 1.760 m wide. The dimensions of the calculated bounding box deviate from the specified ones by less than two centimeters, which is satisfactory for autonomous driving.