Ca2Lib: Simple and Accurate LiDAR-RGB Calibration Using Small Common Markers

Modern visual perception techniques often rely on multiple heterogeneous sensors to achieve accurate and robust estimates. Knowledge of their relative positions is a mandatory prerequisite to accomplish sensor fusion. Typically, this result is obtained through a calibration procedure that correlates the sensors’ measurements. In this context, we focus on LiDAR and RGB sensors that exhibit complementary capabilities. Given the sparsity of LiDAR measurements, current state-of-the-art calibration techniques often rely on complex or large calibration targets to resolve the relative pose estimation. As such, the geometric properties of the targets may hinder the calibration procedure in those cases where an ad hoc environment cannot be guaranteed. This paper addresses the problem of LiDAR-RGB calibration using common calibration patterns (i.e., A3 chessboard) with minimal human intervention. Our approach exploits the flatness of the target to find associations between the sensors’ measurements, leading to robust features and retrieval of the solution through nonlinear optimization. The results of quantitative and comparative experiments with other state-of-the-art approaches show that our simple schema performs on par or better than existing methods that rely on complex calibration targets.


I. INTRODUCTION
The ability to fuse readings from heterogeneous sensors is beneficial in many robotics and perception applications. In particular, LiDAR and RGB sensors exhibit strong complementarity: the former captures high-precision sparse range readings, while the latter provides dense color intensity measurements.
These properties make the integration of the two sensors well suited for the task of depth estimation. Historically, stereo-based solutions leverage the known relative offset between two cameras, along with concepts from epipolar geometry, to estimate a depth value for every pixel in an image. Despite their popularity, due to their optical nature, these approaches suffer in texture-less regions and in areas where the depth exceeds a maximum value determined by the baseline of the stereo rig. While the texture-less problem has been partially solved by the use of active stereo sensors (e.g., RealSense D435, Kinect), the maximum range still poses a challenge. On the contrary, LiDARs operate using the Time-of-Flight (ToF) principle, which is proven to be robust at detecting range measurements on non-reflective surfaces with high accuracy, even at large distances.
These considerations led the community to investigate the problem of depth completion, namely, estimating a dense depth image by superimposing accurate sparse depth measurements onto an intensity RGB image. Multiple publicly available datasets like KITTI [20] and VOID [21] allowed the community to tackle this problem either by fully leveraging the sparse depth measurements (unguided) or by also fusing RGB features (guided). Furthermore, in the field of 3D reconstruction, recent findings show that coupling the two sensors may lead to a more robust and accurate trajectory estimate [2].
However, to accomplish any of these tasks, one needs to know the relative offset between the two sensors.
This work aims at solving the task of LiDAR-RGB calibration, namely, estimating the relative offset (extrinsic parameters) between the two sensors, using their raw measurements.
The core idea behind this multi-modal calibration is to find spatial correspondences between the sensors' measurements. Most approaches rely on one or more calibration patterns to establish common features between the sensors. These patterns are often complex or expensive to produce [1]. The main contribution of this paper is a versatile calibration toolbox that allows estimating the extrinsic parameters between LiDAR and RGB sensors with minimal user intervention, using a simple calibration checkerboard target. We leverage a joint non-linear formulation of the problem to achieve highly accurate results even with a minimum of three measurements. The only requirement of our method is a calibration pattern (e.g., Checkerboard, ChArUco [5]) that must be observed by both sensors during the acquisition. We exploit the planarity of the target to find a common observation used to estimate the extrinsic parameters. Moreover, we release an open-source implementation of our toolbox.

II. RELATED WORK
This section delves into LiDAR-RGB calibration and explores the two main classes of approaches: target-based and target-less. As the name suggests, target-based approaches require the user to place artificial markers that both the camera and the LiDAR can easily detect. This contrasts with target-less methods, which free the user from this task. The core idea of calibration is common to the two classes of approaches: computing common features between heterogeneous measurements and estimating the transformation that minimizes the distance between corresponding features.
First, an overview of target-less approaches is presented. Pandey et al. present an automatic data-driven approach based upon the maximization of mutual information between the sensor-measured surface intensities [13]. The authors exploit the correlation coefficient between the reflectivity and intensity values of many scan-image pairs under different calibration parameters. However, shadows of objects or colored surfaces that completely absorb infrared light might result in a weaker correlation between scan-image pairs. Yoon et al. propose a calibration method using region-based object pose estimation. Objects are segmented in both measurements; a 3D mesh is then generated from the LiDAR measurements, while images are used to reconstruct the scene using Structure from Motion (SfM). The two models are then registered together to acquire an initial guess on the relative pose. The final solution is obtained iteratively by finding correspondences between the objects reconstructed from both measurements [22]. In recent years, learning-based methods have also spread into this field: Lv et al. propose a real-time self-calibration network that predicts the extrinsic parameters by constructing a cost volume between RGB and LiDAR features [11], while Sun et al. first estimate an initial guess by solving a hand-eye calibration problem [18]. The guess is then fine-tuned by segmenting the image-cloud pair and by aligning the distances between centroids. The advantage of target-less methods is that they can be used without preparing the environment. This comes at the cost of lower accuracy and robustness when compared to their target-based counterparts.
Target-based methods estimate the relative pose using an observed known structure. Given the difference in resolution between the two sensors, it is highly unlikely that correspondences within the measurements can be established directly. For this reason, point-to-point methods tend to process the LiDAR measurements to implicitly obtain virtual points easily detectable from an RGB sensor. For instance, Park et al. utilize a specially designed polygonal planar calibration board with known lengths of adjacent sides [14]. By estimating the corresponding 3D points from the LiDAR, the vertices of the board can be determined as the meeting points of two projected sides. The vertices, along with the corresponding points detected in the color image, are used for calibration. Pusztai et al. introduce a methodology that utilizes cubic objects with predetermined side lengths [15], [16]. The corners of the cubes are estimated by initially detecting each side of the box and subsequently determining their intersection points. The corners, along with their correspondences in the RGB image, are employed to calibrate the system by solving an Iterative Corresponding Point (ICP) problem. Zhou et al. propose a single-shot calibration method requiring a checkerboard [23]. The target is detected both in the RGB image and in the LiDAR measurement, using RANSAC [4] for the latter. The four edges of the checkerboard are then estimated and aligned to compute the relative offset between the two sensors. Tóth et al. introduce a fully automatic calibration technique that leverages spheres, enabling accurate detection in both point clouds and camera images [19]. Upon successful detection, the algorithm aligns the sets of sphere centers using SVD. Beltrán et al.
present a methodology that utilizes a custom calibration target equipped with four holes and ArUco markers specifically designed for monocular detection [1]. The methodology employs a set of techniques for each sensor to estimate the center points of the holes. Subsequently, the relative offset between the sensors is determined by aligning the sets of centers obtained from each sensor. Li et al. adopt a similar approach while using a checkerboard with four holes [10]. Fan et al. propose a two-stage calibration method using an auxiliary device with distinctive geometric features [3]. The method extracts lines from images and LiDAR point clouds, providing an initial estimate of the external parameters. Nonlinear optimization is then applied to refine these parameters. In the work of Singandhupe et al., the authors first extract planar information from the RGB and LiDAR measurements; then, two grids of points are extracted from the computed planar patches and aligned using a customized ICP algorithm [17].
Although these approaches provide relatively accurate results with few measurements, care should be taken during the estimation of the virtual correspondences, as they can cause significant errors in the estimation step. Moreover, these custom targets often require precise construction or expensive manufacturing.
Another group of approaches does not directly solve the calibration problem using point-to-point correspondences, but rather exploits the planarity of the target to reduce the feasible set of solutions using plane-to-plane constraints. Mirzaei et al. address the challenge of obtaining accurate initial estimates by dividing the problem into two sub-problems and analytically solving each to obtain precise initial estimates [12]. The authors then refine these estimates through iterative minimization, and also discuss the identifiability conditions for accurate parameter estimation. Finally, in a method similar to our proposal, Kim et al. combine observed normals to first estimate the relative orientation with SVD and then iteratively estimate an initial guess of the relative translation by minimizing the pairwise planar distances between measurements [8]. The translation is finally refined by solving a non-linear optimization problem using Levenberg-Marquardt (LM). Despite its simplicity, this method decouples the estimation of orientation and translation, thus leading to potential losses in accuracy while also increasing the number of required measurements.
Compared with the state of the art, we propose:
• a formulation for joint nonlinear optimization that couples relative rotation and translation using a plane-to-plane metric;
• an extensible framework that decouples the optimization from target detection; it currently supports Checkerboard and ChArUco patterns of typical A3-A4 sizes, easily obtainable from commercial printers;
• the possibility to handle different camera models and distortion;
• an open-source C++ implementation.

III. OUR APPROACH
In this section, we provide a detailed and comprehensive description of our method. First, we describe the preliminaries required to understand our approach; then, every component of the pipeline is described, following the procedure from the acquisition of the measurements up to the computation of the relative pose between the two sensors (extrinsic parameters).
Plane Representation: Let π = (n, d) be a 3D plane, where n ∈ S² represents the unit vector orthogonal to the plane and d ∈ R is the shortest distance of the plane with respect to the origin, so that every point p on the plane satisfies n⊤p = d. Applying a transform X ∈ SE(3) to a plane π yields new coefficients π′ as follows:

    π′ = f(X, π) = (R n, d + t⊤ R n)

Here X = ⟨R; t⟩ is represented by a rotation matrix R ∈ SO(3) and the translation vector t ∈ R³. If the transformation is modified by a small local perturbation ∆X = (∆R | ∆t), then we can rewrite:

    f(∆X ⊕ X, π) = (∆R R n, d + (∆R t + ∆t)⊤ ∆R R n)

Deriving the result with respect to ∆X, evaluated at ∆X = 0, leads to the following Jacobian:

    J = ∂f(∆X ⊕ X, π)/∂∆X |_{∆X=0} = [ 0_{3×3}   −[R n]_× ]
                                      [ (R n)⊤    0⊤       ]

The distance between two planes depends both on the difference between their normals and on the signed distance of the planes from the origin. These quantities can be captured by a 4D error vector e_p expressing the plane-to-plane error metric:

    e_p(π_i, π_j) = ( n_i − n_j,  n_i⊤ p(π_i) − n_j⊤ p(π_j) )

Here p(π_k) is the point on the plane closest to the origin of the reference system, obtained by taking a point along the normal direction n_k at distance d_k, i.e., p(π_k) = d_k n_k.
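The plane transform and the plane-to-plane error above can be sketched in a few lines (a minimal numpy illustration, not the released C++ toolbox):

```python
import numpy as np

def transform_plane(R, t, n, d):
    """Map a plane (n, d) through X = <R; t>: n' = R n, d' = d + t . (R n)."""
    n_p = R @ n
    return n_p, d + t @ n_p

def plane_error(n_i, d_i, n_j, d_j):
    """4D plane-to-plane error: normal difference stacked with the
    signed-distance difference (for unit normals this equals d_i - d_j)."""
    return np.concatenate([n_i - n_j, [d_i - d_j]])

# Example: a plane with normal +z at distance 2, re-expressed through a
# transform that translates by 0.5 m along z; d grows by t . n' = 0.5
n, d = np.array([0.0, 0.0, 1.0]), 2.0
R, t = np.eye(3), np.array([0.0, 0.0, 0.5])
n2, d2 = transform_plane(R, t, n, d)   # n2 = [0, 0, 1], d2 = 2.5
```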
Pinhole Model (RGB): Let p be a point expressed in the camera frame and K be the camera matrix. Assuming any lens distortion effect has been previously corrected, the projection of p on the image plane is computed as

    u = π_c(p) = ϕ(K p)

where ϕ(v) represents the homogeneous division ϕ([x, y, z]⊤) = [x/z, y/z]⊤ and π_c(p) the pinhole projection function. For simplicity, we detail only the pinhole camera projection; however, the same principle applies to more complex camera models.
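As an illustration, the projection ϕ(Kp) can be written out directly (the intrinsic values below are hypothetical, chosen only for the example):

```python
import numpy as np

def project_pinhole(K, p):
    """Pinhole projection pi_c(p) = phi(K p): homogeneous division of K p.
    Assumes p_z > 0 and lens distortion already corrected."""
    uvw = K @ p
    return uvw[:2] / uvw[2]

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240)
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
print(project_pinhole(K, np.array([0.1, -0.2, 2.0])))  # → [345. 190.]
```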
Projection by ID (LiDAR): Let p = (p_x, p_y, p_z) be a point detected by the LiDAR and expressed in its frame. Its projection is computed as:

    π_l(p) = ( f_x atan2(p_y, p_x) + c_x,  ring(p) )

where f_x represents the azimuth resolution of the LiDAR, while c_x denotes the offset in pixels. The ring(p) function is either obtained directly from the LiDAR sensor, which augments every measured point with a number representing the beam that detected it, or, assuming the cloud is ordered, by dividing the point index by the horizontal resolution of the sensor. Compared with the classical spherical projection, the projection by ID does not preserve the geometric consistency of the scene. Still, it provides an image with no holes, which is preferred for computer-vision applications.

First, we process the incoming raw LiDAR and RGB measurements to acquire planar information. Assuming the scene remains static throughout the acquisition of a single joint measurement, the LiDAR cloud is embedded in an image using the projection by ID. Then, the system awaits user interaction to guess the position of the calibration target on the LiDAR image.
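The projection by ID described above can be sketched as follows (a minimal illustration; the 1024-column geometry and the f_x, c_x values are example assumptions, not the toolbox's exact convention):

```python
import math

def ring_from_index(point_index, horizontal_resolution):
    """Fallback ring id for an ordered cloud: row-major index over the range image."""
    return point_index // horizontal_resolution

def project_by_id(p, ring, f_x, c_x):
    """Column from the azimuth angle, row from the beam (ring) id."""
    u = int(round(f_x * math.atan2(p[1], p[0]) + c_x))
    return u, ring

# Hypothetical 1024-column LiDAR: f_x maps azimuths [-pi, pi] onto [0, 1024)
f_x, c_x = 1024 / (2 * math.pi), 512
print(project_by_id((1.0, 0.0, 0.0), 5, f_x, c_x))  # point straight ahead → (512, 5)
```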
A parametric circular patch around the user's selection is used to estimate a plane using RANSAC and, concurrently, the calibration target detection is attempted on the RGB image. Once the target is detected, the RGB plane is computed by solving a PnP problem. If the user is satisfied with both the LiDAR and RGB planes, they are stored for processing.
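The RANSAC plane estimation on the LiDAR patch can be sketched as follows (a vanilla RANSAC illustration in numpy; thresholds and iteration counts are example values, not the toolbox's):

```python
import numpy as np

def ransac_plane(points, iters=200, inlier_thresh=0.02, seed=0):
    """Fit a plane (n, d) with n . p = d to an Nx3 array of points,
    hypothesizing from random non-collinear triplets (vanilla RANSAC)."""
    rng = np.random.default_rng(seed)
    best = (None, None, -1)
    for _ in range(iters):
        a, b, c = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(b - a, c - a)
        norm = np.linalg.norm(n)
        if norm < 1e-9:          # degenerate (collinear) sample
            continue
        n /= norm
        d = n @ a
        count = int(np.sum(np.abs(points @ n - d) < inlier_thresh))
        if count > best[2]:
            best = (n, d, count)
    return best

# Noisy samples of the plane z = 1, plus a handful of gross outliers
rng = np.random.default_rng(1)
pts = np.column_stack([rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200),
                       1.0 + rng.normal(0.0, 0.005, 200)])
pts = np.vstack([pts, rng.uniform(-1, 1, (10, 3))])
n, d, inliers = ransac_plane(pts)   # n ≈ ±[0, 0, 1], |d| ≈ 1
```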
While a straightforward rank analysis of the Jacobians reveals that just 3 measurements are sufficient to constrain a solution, it is well known from estimation theory that the accuracy grows with the number of measurements.
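The conditioning issue behind this can be visualized numerically: the translation is constrained through the plane normals (rows n_i⊤ of the stacked linear system), so three nearly-parallel normals yield an ill-conditioned problem even if the rank is formally 3. A small illustration with made-up normals:

```python
import numpy as np

# Three nearly-parallel normals vs. three well-spread (orthogonal) normals
parallel = np.array([[0.0,  0.0,  1.0],
                     [0.01, 0.0,  1.0],
                     [0.0,  0.01, 1.0]])
spread = np.eye(3)

print(np.linalg.cond(parallel))  # large: poorly conditioned system
print(np.linalg.cond(spread))    # 1.0: well-conditioned system
```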
Once the set of measurements is acquired, we jointly estimate the relative orientation and translation of the LiDAR with respect to the RGB sensor, X ∈ SE(3), by solving the following nonlinear minimization problem:

    X* = argmin_X Σ_i || e_p( π_i^c, f(X, π_i^l) ) ||²    (12)

where π_i^c and π_i^l denote the i-th plane measured by the camera and by the LiDAR, respectively, and e_p represents the plane-to-plane error.
During acquisition, it may happen that the user accepts one or more wrongly estimated measurements. Due to the quadratic nature of the error terms, these outliers are often over-accounted, resulting in wrong estimates. To account for this factor, as described in [6], we employ a Huber M-estimator ρ(·) that weights measurements differently based on their error. We rewrite Eq. (12) as follows:

    X* = argmin_X Σ_i ρ( || e_p( π_i^c, f(X, π_i^l) ) ||² )    (13)

To solve Eq. (13), we employ the Gauss-Newton (GN) algorithm implemented in the srrg2 solver [6].
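The effect of the Huber kernel can be seen through its IRLS weight (a generic sketch; the exact kernel threshold used inside the srrg2 solver is not specified here):

```python
import math

def huber_weight(e_sq, k=1.0):
    """IRLS weight of the Huber M-estimator for a squared error e_sq:
    1 inside the quadratic region (|e| <= k), k/|e| in the linear region."""
    e = math.sqrt(e_sq)
    return 1.0 if e <= k else k / e

# An inlier keeps full weight; a gross outlier is down-weighted instead of
# dominating the quadratic cost
print(huber_weight(0.25))   # |e| = 0.5  → 1.0
print(huber_weight(100.0))  # |e| = 10.0 → 0.1
```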

IV. EXPERIMENTAL EVALUATION
In this section, we describe the experiments we conducted to assess the quality of our calibration toolbox. We perform quantitative experiments in the simulated environment provided by [1] to compare our estimates with the groundtruth, and we also conduct qualitative and quantitative experiments in real scenarios using our acquisition system. We directly compare our results with [8], as it is the work closest to ours. In addition, we compare to [1], which produces accurate results relying on a very complex (CNC-printed) target.

A. Synthetic Case
We conducted experiments in the Gazebo simulator [9] to evaluate the accuracy and robustness of our approach, injecting different noise figures into the sensor measurements. We also evaluate how the number of observations affects the final results. The scene setup includes a Velodyne HDL-64 LiDAR, a BlackFly-S RGB sensor, and a 6 × 8 checkerboard target with a corner size of 0.2 meters. We randomly generate and acquire 53 valid measurements.
To quantify the impact of the number of measurements on the accuracy of our approach, we run the calibration procedure with an increasing number of measurements w_s = [3 . . . 39] and at three different LiDAR noise levels σ_l (0 mm, 7 mm, and 14 mm); a valid measurement is one for which both the LiDAR and the RGB sensor are able to detect the target. For every w_s, we sample 40 sets of measurements.
From Tab. I, we observe a steady decrease of the error at every noise level, reaching an average translation error of 2.6 mm in the intermediate noise case. With 3 measurements, the high uncertainty is due to the potentially poorly conditioned system arising when the used planes have similar normals. Nonetheless, we compare our best result with 3 measurements against the best results of the methods presented in [1] and [8]; Tab. II shows the results.

B. Real Case
In this section, we describe the experiments conducted on real measurements. We perform a quantitative test on our acquisition system, shown in Fig. 2, which is equipped with an Ouster OS0-128 LiDAR with a resolution of 128 × 1024, a RealSense T-265 stereo camera, and two Manta-G145 RGB cameras arranged in a wide horizontal stereo configuration.
Since no groundtruth information is available, we take advantage of the stereo extrinsics to provide an estimate of the calibration error. The offset between multiple cameras is measured using optical calibration procedures, which typically reach subpixel precision.
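This stereo-consistency check can be sketched as follows: compose the two estimated camera-LiDAR extrinsics into a stereo transform and compare it against the optically calibrated one (an illustrative numpy snippet; the function and variable names are ours, not the toolbox's):

```python
import numpy as np

def inv_se3(T):
    """Invert a 4x4 rigid transform."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3], Ti[:3, 3] = R.T, -R.T @ t
    return Ti

def stereo_consistency(T_cam1_lidar, T_cam2_lidar, T_cam1_cam2_ref):
    """Compose the two estimated LiDAR extrinsics into a stereo transform and
    measure its deviation from the reference (optically calibrated) stereo."""
    T_est = T_cam1_lidar @ inv_se3(T_cam2_lidar)
    E = inv_se3(T_cam1_cam2_ref) @ T_est
    t_err = np.linalg.norm(E[:3, 3])                                    # meters
    r_err = np.arccos(np.clip((np.trace(E[:3, :3]) - 1.0) / 2.0, -1, 1))  # rad
    return t_err, r_err
```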
In the first experiment, we consider the LiDAR and the RealSense T-265 sensor, which provides factory-calibrated intrinsic/extrinsic parameters for both of its cameras. The goal of the experiment is to demonstrate the accuracy of the calibrator in real scenarios and to understand how the number of measurements considered affects the quality of the solution.
As in the synthetic case, we first acquire a set of 17 cloud-image LiDAR-RGB measurements for both cameras. Then, we perform 40 calibrations with w_s randomly selected measurements, with w_s ∈ {3, 15}. Finally, for every w_s, we combine the computed extrinsics of both cameras to obtain an estimated stereo transform. Assuming approximately symmetrical errors in the two cameras, Fig. 3 shows the results of this experiment. We were able to obtain, at best, an average error of 7.1 mm in translation and 0.01 rad in orientation.

The second experiment is conducted using the wide stereo setup, for which we also calibrate the intrinsics and extrinsics of the cameras, providing the expected results in a typical scenario. The acquisition procedure is the same as in the first experiment, and Fig. 4 shows the experimental results, where we obtain the best solution with an error of 4.6 mm in translation and 0.002 rad in orientation.
Moreover, Fig. 1 and Fig. 5 show the reprojection onto the right camera of the fisheye and wide-baseline RGB sensors, respectively. In the latter, the large parallax between the sensors leads to strong occlusion effects, which have been mitigated with a hidden point removal algorithm [7].
In summary, our evaluation indicates that our method is capable of generating extrinsic estimates that are comparable or superior to those obtained using other state-of-the-art approaches. Careful consideration is required when selecting a minimal number of measurements; indeed, our experiments clearly demonstrate that the accuracy of the estimates improves as the number of measurements increases.

V. CONCLUSION
In summary, our paper introduces a simple and effective method for accurately estimating the extrinsic parameters between LiDARs and RGB sensors. By leveraging the inherent planarity of standard calibration patterns, we establish common observations between these sensors, greatly simplifying the calibration procedure. Our experiments show that planar features mitigate the LiDAR noise, leading to accurate results even with common A3/A4 calibration patterns. Finally, we also release an open-source C++ implementation to benefit the community.

Fig. 1: Reprojection of a LiDAR point cloud on a fisheye RGB camera rigidly attached to the former. The offset between the sensors leads to shadows on parts of the image.

Fig. 2: Acquisition system used for the Real Case experiments.

TABLE I: Average translation error in millimeters with different noise levels and numbers of measurements.

TABLE II: Best solutions obtained by calibration using N = 3 measurements.