Automatic Calibration of Odometry and Robot Extrinsic Parameters Using Multi-Composite-Targets for a Differential-Drive Robot with a Camera

This paper simultaneously calibrates odometry parameters and the relative pose between a monocular camera and a robot automatically. Most camera pose estimation methods use natural features or artificial landmark tools. However, there are mismatches and scale ambiguity for natural features; the large-scale precision landmark tool is also challenging to make. To solve these problems, we propose an automatic process to combine multiple composite targets, select keyframes, and estimate keyframe poses. The composite target consists of an aruco marker and a checkerboard pattern. First, an analytical method is applied to obtain initial values of all calibration parameters; prior knowledge of the calibration parameters is not required. Then, two optimization steps are used to refine the calibration parameters. Planar motion constraints of the camera are introduced in these optimizations. The proposed solution is automatic; manual selection of keyframes, initial values, and robot construction within a specific trajectory are not required. The competing accuracy and stability of the proposed method under different target placements and robot paths are tested experimentally. Positive effects on calibration accuracy and stability are obtained when (1) composite targets are adopted; (2) two optimization steps are used; (3) plane motion constraints are introduced; and (4) target numbers are increased.


Introduction
Odometry and monocular camera have been widely used in indoor mobile robots owing to low cost and rich information. Fusing these two sensors for robot navigation is a popular research topic [1][2][3]. The operation of the fusion system requires odometry parameters, robot extrinsic parameters, and camera intrinsic parameters as prior knowledge. First, the odometry parameters usually refer to the structural parameters of the robot, such as wheel spacing, wheel radius, encoder line number, reduction ratio, and so on. These parameters can be roughly obtained from the design drawings of the robot or by manual measurements. However, in practical applications, the actual parameters are different from the design parameters due to manufacturing errors, non-point contact between the tire and the ground, changes in tire pressure, and load changes. Therefore, odometry calibration is required before the robot system works. Second, the robot extrinsic parameters refer to the relative pose between the camera and the robot. They are difficult to determine through the design drawings or manual measurements because of that the optical centre of the camera is a virtual point or that there are limitations of physical measurement tools. Accordingly, a robot extrinsic parameters calibration method is necessary. Third, the problem of camera intrinsic calibration has been studied extensively. Thus, this paper mainly considers the first two problems, namely simultaneously calibrating odometry and robot extrinsic parameters automatically.
Odometry errors can be categorized into two types: systematic and non-systematic errors [4]. The systematic error is mainly caused by inaccurate kinematic parameters, such as imprecise wheel radius. The non-systematic error is caused by robot environment interactions, such as wheel-slippage and uneven floors. The origination of most odometry errors has been discussed in [5]. Systematic errors are the primary source of odometry errors on smooth indoor surfaces [6]. Therefore, the odometry calibration addressed in this paper focuses on estimating robot kinematic parameters. Several studies [6][7][8][9][10][11] have considered this issue and proposed solutions. Among these, the UMBmark method [6] is widely used. In the UMBmark method, a differential-drive robot was controlled to follow a square path, where the odometry was calibrated based on the error between the final pose and the predicted pose. A generalized method was proposed accordingly for an arbitrary trajectory in [7]. Another method is recursive filtering, wherein calibration parameters are added to the system state vector, and a Bayesian filter is used to estimate parameters [8,9]. This method uses an odometry model for state propagation while updating from an external pose sensor. The third approach is the nonlinear batch optimization, which minimizes errors originating from the odometry [10,11]. These methods all assume that at least one external sensor is used to measure the actual pose of the robot.
Determining the relative pose between the camera and mobile robot is referred to in this paper as robot extrinsic parameter calibration. The purpose is to identify the best pose transformation to connect the mobile robot trajectory and camera trajectory. Estimating the robot trajectory typically resorts to wheel odometry [12,13], where the odometry parameters must be known in advance. The camera trajectory is obtained using natural features, such as points and lines in the environment [12][13][14], or by using artificial landmark tools, such as aruco markers [15] and others [16]. The natural feature-based method requires sufficient reliable features in the environment, a condition that is difficult to guarantee in some indoor settings. Furthermore, scale information cannot be obtained from a monocular camera. It is also infeasible to create a large, accurate landmark tool to ensure calibration accuracy using the artificial landmark-based method.
In fact, if robot extrinsic parameters were known, then a camera could be used as an external sensor to measure the robot pose and facilitate odometry calibration. Similarly, if the odometry parameters were known, then the robot extrinsic parameters could be easily obtained. Odometry parameter calibration and robot extrinsic parameter calibration are chicken-egg problems, which some researchers have tried to solve simultaneously. The analytical method [17,18] is one type of solution. In [17], a 3D landmark tool was applied to calibrate the camera intrinsic parameters and estimate its trajectory. The odometry and robot extrinsic parameters were determined based on the single value decomposition (SVD) method. Another technique is the filter-based method, in which the odometry parameters, robot extrinsic parameters, and robot configurations combine to form a state vector [19,20]. The odometry model is commonly employed for state propagation while updates are derived from visual observations. Another frequently used method involves solving an optimization problem to jointly minimize errors arising from integrated odometry measurements and (reprojection) errors in visual terms [15,21]. Such optimization approach offers the advantage of repeated linearization of the inherently nonlinear cost terms, which limits linearization errors. However, the filter-based method and optimization-based method each require good initial values for calibration parameters. By contrast, the analytical approach can provide calibration results without initial assumptions, but the result possesses lower accuracy. Therefore, an automatic calibration method was developed by combining the analytical method with an optimization-based method [15].
In this paper, an automatic method is proposed to calibrate the odometry and robot extrinsic parameters simultaneously. This approach automatically selects keyframes, calculates initial parameter values, and has no constraints on the robot path. First, we use multiple specially designed composite targets. Then, a new approach is introduced to combine these composite targets, select keyframes, and estimate the keyframe poses. Next, we use an analytical method to obtain initial values of the calibration parameters. Finally, two optimization steps are applied to determine the refined parameters. In addition, planar motion constraints are introduced into the optimization functions.
The remainder of this paper is organized as follows. In Section 2, we specify the problem. In Section 3, the proposed method is described in detail. A series of experiments are presented in Section 4. Finally, we offer conclusions and discussions in Section 5.

Preliminaries
The system of a differential-drive mobile robot equipped with a monocular camera is shown in Figure 1. r L and r R are the left and right wheel radiuses. b is the wheel spacing. Two coordinate frames are established, namely, the robot coordinate frame {r} that is fixed to the robot, and the camera coordinate frame {c} that is fixed to the monocular camera. The robot is assumed to move on a two-dimensional (2D) plane. The z r -axis of {r} is perpendicular to the plane. The x r -axis of {r} is pointed to the front of the robot. The origin of {r} is set at the midpoint of the wheel axis.

Odometry Model
The pose of a mobile robot moving on a 2D plane can be expressed as [x, y, θ] T , where [x, y] T represents the translation movement, and θ represents the rotation angle. The robot pose at time point k is represented by [x k , y k , θ k ], then the robot pose at the next time point k + 1 is given by the odometry model [22]: where s L k and s R k are encoder increments generated by the left encoder and the right encoder, respectively, between the time point k and k + 1; and k L and k R are the left wheel factor and right wheel factor, respectively. The wheel factors transform encoder increments in the unit tick to wheel displacements in the unit m. The purpose of the odometry calibration is to obtain the precise wheel factors k L and k R as well as the precise wheel spacing b.

Camera Model
Many different camera models have been proposed in the literature [23,24]. A universal symbol π is used to represent the camera model that projects a three-dimensional (3D) point c p in the camera coordinate frame {c} to a 2D image coordinate u = [u, v] T : In addition, the camera intrinsic parameters are assumed to be carefully pre-calibrated.

Robot Extrinsic Parameters
The robot extrinsic parameters to be calibrated represent the relative pose between the camera coordinate frame {c} and the robot coordinate frame {r}. They are expressed by a 4 × 4 homogeneous transformation matrix r c T as shown Figure 1. To facilitate subsequent derivation, the ZYZ Euler angles α 1 , α 2 , and α 3 are used to represent the rotation: r c R = Rot z (α 1 )Rot y (α 2 )Rot z (α 3 ); and t x , t y , and t z are used to represent the translation: r c t = [t x , t y , t z ] T . As the differential-drive robot moves on a 2D plane, there is no translation component in the z r -axis direction; thus, t z is unobservable [13], which has been proved by [9]. This means that t z cannot be obtained by calibration. We set it to be 0. In summary, the robot extrinsic parameters to be estimated are α 1 , α 2 , α 3 , t x and t y in this paper.
In short, eight parameters must be calibrated in total: k L , k R , b, α 1 , α 2 , α 3 , t x and t y .

Automatic Calibration Solution
The proposed automatic calibration method is discussed in this section. First, the main ideas and a calibration pipeline are introduced. Then, four key steps in the solution are explained.

System Overview
The proposed method is illustrated in Figure 2. Several specially designed composite targets are placed in the environment. The composite target consists of an aruco marker [25] and a checkerboard pattern. The checkerboard pattern provides a number of precise corners; these corners are used for camera poses and target poses estimation. The aruco marker can provide an independent ID, which is used to avoid mismatches. Commonly, multiple targets may be captured in one image. Checkerboard corners of each target should be extracted separately in the image. To address this problem, we have two conditions at hand. First, we assume that the target boundary positions relative to the aruco marker are pre-known, which can be obtained from the target design parameters. Second, the aruco marker can provide the relative pose between the camera and the aruco marker [25]. Then, the target boundaries can be projected to the image, and the image area of the target is obtained. Thus, the checkerboard corners of this target can be extracted from this image area.
Multiple composite targets are used to improve the accuracy of camera pose estimation, thereby enhancing calibration accuracy. It is difficult for natural feature-based methods to achieve high precision camera pose estimation due to mismatching, scale problems for the monocular camera, or a lack of features in the environment. Commonly used artificial landmarks provide few corner points and suffer from low precision problems, such as binary square fiducial markers. The composite target proposed in this paper can produce many accurate corner points via the checkerboard. Using an aruco ID also avoids mismatches. However, a small size target or landmark tool only covers part of the image, resulting in low camera pose estimation accuracy. Moreover, a high precision, large target or landmark tool is difficult to produce, has poor portability, and is costly. Given these drawbacks, multiple targets are employed here.
After the targets are laid out, the robot is controlled to move arbitrarily in front of them. The encoders and image data with timestamps are recorded. The layout of the targets and the robot trajectory are not limited, but for the precision considerations, the robot is recommended to move slowly to avoid wheel slippage. Additionally, as many targets as possible should be within the field of the camera.
Some key images (the third row of Figure 2) are selected carefully, rather than using all the images. The key images are also called keyframes. The camera pose corresponding to the keyframe is called the key camera pose or keyframe pose. The keyframe is used for two reasons: first, the difference between two adjacent image samples is too small and will cause a pathological state in the analytical solution as described in Section 3.3; second, using all images will result in a sharp increase in the amount of computation, which renders a solution challenging. Generally, the encoders and the camera form an asynchronous acquisition system. Typically, no encoder sampling occurs at the time of the keyframe (compare the third and the fifth rows in Figure 2) because the sampling frequency of the encoder is higher than that of the camera. Therefore, the linear interpolation method is used to produce an encoder sample corresponding to the jth keyframe with the previous encoder sample e be f ore before the jth keyframe and the next one e a f ter after the jth keyframe (see the fifth and the sixth rows in Figure 2), such as e 1 j in Figure 2: t e a f ter − t e be f ore (e a f ter − e be f ore ) + e be f ore , where t with a right subscript represents the time corresponding to the jth keyframe or encoder samplings e be f ore and e a f ter . The calibration process can be divided into the following four steps:

•
Step 1: an automatic pipeline is designed to combine the composite targets, select the keyframes and estimate the keyframe poses.

•
Step 2: an analytical method is used to solve the initial values of the calibration parameters.

•
Step 3: an optimization problem is built by minimizing odometry error terms to refine the calibration parameters.

•
Step 4: a total optimization containing all error terms is constructed to obtain the final optimized calibration parameters.
These four steps are discussed in detail in subsequent sections.

Estimation of Keyframe Poses
This step aims to select keyframes and to estimate each keyframe pose w c j T by integrating multiple targets. The recorded images are extracted and processed according to the processing flow shown in Figure 3. The flow is divided into four parts: map, system initialization, camera pose estimation, and map management.

Map
The map is used to maintain a series of keyframes and targets. Checkerboard corner points on each target build a cluster of 3D landmark points, referred to as map points. The positions of the map points relative to the target coordinate frames can be known in advance according to the structure of the targets. Each target has an aruco marker ID to avoid false matches. The 2D corners of one target in the image are called features. The collocation of a key image (see the third row of Figure 2) and extracted 2D features form a keyframe.

System Initialization
When the first image arrives, the first target observed by the image will be added to the map. The world coordinate frame {w} coincides with the coordinate frame {b 0 } of the first target.
Correspondingly, the first image is constructed as a keyframe: the feature points are extracted, and the coordinate frame relationship w c 0 T between the world coordinate fame {w} and the camera coordinate frame {c 0 } is calculated using the perspective-n-point (PnP) method [26].

Camera Pose Estimation
After initialization, the keyframe and the target exist in the map. The system then enters a normal process. When the lth image comes in, its pose will be estimated. Multiple targets may be observed in the image. The features of each observed target from the image will be extracted firstly. The corresponding targets are searched in the map using aruco IDs. After this, matches of 3D map points to 2D features are obtained. Then, the PnP method is used to get the camera pose w c l T. It needs to point out that there are multiple targets in the map in the normal process, and the coordinates of the map points of each target are unified to the world coordinate frame {w}.

Map Management
Next, the lth image is checked to determine whether new targets exist, if so, they will be added to the map. First, the PnP method is used to calculate the homogeneous transformation c l b i T between the camera coordinate frame {c l } and the new target coordinate frame {b i }. Then, the homogeneous transformation from the new target coordinate frame {b i } to the world coordinate frame {w} is achieved: The map points on the new targets can be unified to the world coordinate frame. Then, the selection of keyframes is introduced under two conditions. First, at least two targets must be observed by the image to ensure sufficient constraints between the keyframes and the targets to achieve higher accuracy in subsequent optimization steps. Second, the angle change and distance between the current image and the last keyframe should be greater than thresholds θ th and d th , where: The rand[] refers to the uniform sampling. The purpose of using random thresholds is to ensure that the angles and the displacements between keyframes are different to guarantee the correct solution for the analytical solve step (Section 3.3). When both above conditions are satisfied, the lth image is constructed as the jth keyframe and added to the map.
When a target is observed in a keyframe, a line in Figure 4a is drawn to connect the keyframe with the target. More connections create a more stable network with more accurately estimated keyframe poses. Once all images in the recorded data have been processed, a global optimization is carried out: where κ denotes the set of all targets, Θ denotes the set of all keyframes. The optimization equation (Equation (6)) is composed of three parts. The first is the projection error from the map points to the features. For one connection between the ith target and the jth keyframe, the error term is defined as: where Q is the number of corners in the target, and j u q is the qth 2D feature in the target extracted from the jth keyframe, b i p q is the qth 3D map point in the target expressed in the target coordinate frame {b i }, and Σ q is the covariance matrix. The second and the third parts are constraints generated by the planar motion of the camera. The second part considers the keyframe positions. The error term is defined as the square of the distance from the keyframe position w c j t : [ w c j t x , w c j t y , w c j t z ] to the motion plane Ax + By + Cz + D = 0. For the jth keyframe: The initial values of A, B, C, and D are obtained via data fitting using all keyframe positions. w b 0 T is not optimized because the coordinate frame {b 0 } of the first target coincides the world coordinate frame {w}.
Good initial values are necessary for global optimization. However, the keyframe and the target insertion process can generate cumulative errors, resulting in low precision of initial poses of keyframes and targets. To address this problem, a local optimization process is introduced to reduce error accumulation. The local optimization is performed when a newly inserted target (the target with the blue edge in Figure 4b) is observed by more than N th keyframes. The local optimization involves a set {s} of keyframes that connected directly to the target (those red ones in Figure 4b), and targets that directly connected to the keyframe set {s} (those with the red board in Figure 4b), and the connections between them. Different from the global optimization, only the projection errors are minimized in the local optimization, and the oldest target is fixed.
In this step, the precise poses of M targets, the poses of N keyframes, and the connection relationships between them are obtained. Next, an analytical solution will be used to obtain the initial values of the calibration parameters.

Estimating Initial Values of Calibration Parameters
In this step, a modified analytical method [17] is used to estimate the initial values of the calibration parameters. The difference is that the coordinate frames vary slightly, which results in variations in the derivation. Considering the relative pose changes between the jth keyframe and the (j + 1)th keyframe in Figure 2, we obtain the following: Decomposing Equation (10) into rotation and translation forms: The rotation matrix c j c j+1 R is converted into the axis-angle representation r =r · ∆θ j j+1 .r = [r x , r y , r z ] is the normalized axis, and ∆θ j j+1 is the rotation angle. The rotation of the rigid body keeps uniform throughout, thus,r should be in the same direction as the z r -axis of the robot coordinate frame {r j }. Then, α 2 and α 3 are given [27] by: Because the robot moves on a 2D plane and the z r -axis is perpendicular to this plane, the robot rotation change r j r j+1 R between the jth keyframe and the (j + 1)th keyframe is only a rotation around the z r -axis: r j r j+1 R = Rot z (∆θ j j+1 ). The rotation angle ∆θ j j+1 can also be calculated by the odometry model (Equation (1)) with an initial value of 0: where K is the number of encoder samples between two adjacent keyframes; and β L and β R are intermediate variables. Thus, N − 1 Equation (13) can be derived using N keyframes. If N > 3, an overdetermined equation is formed: which can be solved by the least square method: If all ∆θ j j+1 are equal, then the function is numerically poor. Therefore, the random threshold of the angle is used in Section 3.2.4 to avoid this problem.
Consider the second line of Equation (11). As sin(α 1 ) and cos(α 1 ) introduce nonlinear terms in r c R, two intermediate variables: s1 = sin(α 1 ), c1 = cos(α 1 ) that are constrained by s1 2 + c1 2 = 1 are introduced. Because r c t only has two degrees of freedom, only two functions are provided by the second row of Equation (11): where a1 − a5 and b1 − b5 are known coefficients that derived from the second line of Equation (11). Then, N − 1 Equation (17) are obtained by N keyframes , forming a overdetermined equation when N > 4: With the proper choice of keyframes (as in Section 3.2.4), Rank(Ω) = 4 [17]. Thus, Ω has a one-dimensional null space. To solve Equation (18), we decompose Ω by the SVD method: where U is a 2(N − 1) × 2(N − 1) unitary matrix; Σ is a 2(N − 1) × 5 rectangular diagonal matrix with five non-negative singular values listed in decreasing order along the main diagonal. V is a 5 × 5 unitary matrix whose columns are right-singular vectors. Thus, the fifth column of V, v 5 , spans the null space of Ω. Accordingly, the general solution of Equation (18) is given by: where η is a constant factor. Under the constraint of s1 2 + s2 2 = 1, η can be determined by where v 5,1 and v 5,2 are the first and the second elements of v 5 . Then, s1, c1, k L , t x , and t y are solved. Thus, Clearly, α 2 and α 3 calculated using the analytical method only requires the information of two keyframes and can thus easily experience noise interference. The condition of Rank(Ω) = 4 is also easily affected by noises and does not hold. The least squares and SVD methods do not consider the difference in observation noises. The above problems reduce the accuracy of the analytical method. Tacking the analytical solutions as the initial values, two optimization steps are designed to obtain more accurate calibration parameters as follows.

Optimization of Calibration Parameters
In this step, only the calibration parameters are optimized by minimizing the odometry observation errors: The error term is calculated between two adjacent keyframes: the jth and (j + 1)th keyframe. It is designed as the difference between the robot pose change calculated by the odometry and the robot pose change derived from the keyframe poses: where the log() operator transforms a 4 × 4 homogeneous transformation matrix into a six-dimensional column vector that containing three ZYZ Euler angles and three translation elements. Σ odom(j, j+1) is the covariance matrix with the odometry observation. Next, we demonstrate how to obtain the covariance matrix. The left/right real encoder increments between two encoder samples is assumed to obey a Gaussian distribution [22]:ŝ The mean is calculated by the two encoder readings, and the left/right variance is proportional to the absolute value of the increment. Assuming that the 3 × 3 covariance of the robot pose at the time point of the kth encoder is Σ o,k , then the robot pose covariance of the next moment k + 1 is given by the linear error deduction method [22]: where G o is the Jacobian matrix of Equation (1) vs. the robot pose [x k , y k , θ k ] T ; G e is the Jacobian matrix of Equation (1) vs. the encoder displacement [s L k , s R k ] T ; and Σ e,k is the covariance of the left and the right encoder displacements, according to Equation (25): If there are K encoder samples between the jth keyframe and the (j + 1)th keyframe, the robot pose covariance at time point of the (j + 1)th keyframe is obtained by iteratively using the Equation (26) with initial robot pose covariance Σ o,1 = 0: It is the odometry observation covariance between the two adjacent keyframes. It should be noted that the odometry observation covariance used in Equation (24) is a 6 × 6 matrix. To ensure the establishment of Equation (24), the variances of the other two angles are assumed to be the same as θ, and the translational variance of z is assumed to equal to the minimum of variances of x and y:

Final Optimization
Lastly, a total optimization is performed to minimize all errors. Ceres Solver [28] is used to solve all optimization problems in this paper:

Experiments
Several experiments were performed to verify the accuracy and the stability of the proposed method and to test the effect of four strategies: the adoption of multiple composite targets, the use of two optimization steps, the introduction of planar constraints, and the increases of the number of targets. Figure 5 displays the Redbot used in our experiments. The Redbot is a small differential-drive mobile robot. The nominal left/right wheel radius r L/R is 0.05 m, the wheel spacing b is 0.32 m, the reduction ratio of the left/right motor is 7.2:1, and the encoder produces 1024 ticks per motor round with the frequency of 100 Hz. A daheng MER-302-56U3M/C camera with a Kawa LM3NCM wide-angle lens (Torrance, CV, USA) was installed. The camera resolution was set to 2048 × 1536 pixels, and the frame rate was set to 10 Hz. A Thinkpad laptop collected the encoder readings and images with a rosbag tool [29] provided in the Robot Operating System (ROS) [30]. The rosbag tool can record multiple types of data with timestamps. The checkboard pattern size is 6 × 7, and its grid size is 0.07 m. The side length of the aruco marker is 0.14 m. Composite targets were printed on A1-sized papers and attached to 15 mm thick Polyvinyl chloride (PVC) boards.

Performance Test
The purpose of this experiment was to test the stability of our method on different data. To verify the impact of the target placement form, as shown in Figure 6, ten targets were placed in three forms: A, B, and C. Then, the robot was controlled to move arbitrarily in front of the targets. For each placement form, three different data were recorded (e.g., A1, A2, and A3 for the placement form A). To verify the method, three other variants were compared as follows. In the NoPlanar method, the planar motion error terms in Equations (6) and (30) were deleted to verify the role of the planar constraints. In the Analytical method, the last two optimization steps (Sections 3.4 and 3.5) were removed to test the function of the last two optimization steps. In the Aruco method, the four corners of each aruco marker were used as map points instead of the corners of the checkerboard pattern, intended to prove the advantages of using composite targets.
Because the thresholds were random when selecting the keyframes, the result of each run of the algorithm could differ. For each data in Figure 6, each method was run ten times. The results of eight calibration parameters are shown in Figure 7, with the standard deviations of the ten runs. The calibration results of different methods varied. In terms of the mean values of the results, relatively small fluctuations across different data appeared in the proposed method and the NoPlanar method because the two optimization steps improved the stability. Additionally, these two methods used the composite targets compared with the Aruco method. The composite targets contain many more precision map points; hence, these two methods achieved higher stability. The standard deviations represent the effect of keyframe selection. The standard deviations of the Analytical method were the largest given the absence of the two optimization steps. By contrast, the standard deviations of the proposed method and the NoPlanar method were relatively small compared with the Analytical method and the Aruco method, further confirming the positive effects of the two optimization steps and the composite targets. Moreover, the means and standard deviations of the proposed method and the NoPlanar method were similar, indicating that the introduction of planar constraints exerted little influence on the means and standard deviations of the calibration results; thus, they had little effect on calibration stability. Some differences also emerged between different target placement forms in the proposed method. Figure 7 presents that the target placement form C achieved the highest stability, followed by B with A the worst. Moreover, the calibration results of placement forms B and C were similar. Figure 6 shows that the three trajectories of placement form A assumed open chain shapes; therefore, the networks built by the keyframes and targets were also open chains, resulting in unstable calibration results. On the contrary, the trajectories of placement form C were circles, and the networks assumed a closed form. Consequently, the calibration results were more stable than the other two target placement forms. The networks of placement form B were somewhere between A and C. Therefore, the targets should be arranged in a circular shape to obtain more stable calibration results.
Although we could obtain the mean and variance of the calibration parameters, we could not test the accuracy of the calibration results directly due to the absence of real calibration parameter values. To address this problem, two indirect experiments were designed to verify the calibration accuracy of the odometry and the external parameters of the robot.

Odometry Calibration Accuracy Test
If the odometry parameters are calibrated accurately, then the pose estimated by odometry will be accurate. An experiment was designed based on this assumption. The robot was controlled to run for a long distance and controlled back to the initial pose. Ideally, the end pose should be the same as the initial pose with a value of 0. However, due to errors in the odometry calibration, the two poses may be different. We used the endpoint error to indicate odometry calibration accuracy. Data were recorded in this experiment as shown in Figure 8. The lengths of these two trajectories were 79.6 m and 54.5 m. All calibration results in Section 4.2 were used to calculate the endpoint errors illustrated in Figure 9.
In most cases, the proposed method obtained the highest odometry accuracy due to using multiple composite targets, introducing planar constraints, and adopting two optimization steps. The odometry errors with calibration results from target placement form C were smaller than those from A and B, suggesting that the circular target placement form achieved higher odometry calibration accuracy. Although the means and standard deviations of the calibration results of the proposed and NoPlanar methods were similar, the proposed method achieved higher odometry calibration accuracy, potentially because the introduction of plane constraints improved the estimation accuracy of keyframe poses. Figure 10 shows keyframe position differences between the proposed method and the NoPlaner method with one data (B1). The robot moved on a plane, and the estimated keyframe positions should also be distributed on the same plane; however, keyframe positions estimated by the NoPlanar method were not on the same plane and demonstrated larger errors (see Figure 10b). Conversely, the proposed method introducing planar constraints obtained relative better keyframe position estimations (see Figure 10a).

Robot Extrinsic Calibration Accuracy Test
To test the accuracy of the robot extrinsic calibration, as shown in Figure 11, the camera was placed at four positions on the robot for calibration, and ten targets were arranged in a circle. These four positions formed a rectangle. The structure of the robot could determine the length (0.2 m) and width (0.16 m) of the rectangle. One data was collected per position. As in Section 4.2, each method was run ten times with the data. Ideally, greater similarity between the nominal rectangle and the rectangle formed by calibration results will lead to more accurate calibration results. As shown in Figure 11, the covariance ellipses with 95% confidence represent the distributions of [t x , t y ] T and the symbols * indicate the means of [t x , t y ] T . The nominal rectangle is shown in Figure 11 by dotted lines. Considering the mean values, the four corner points obtained by the proposed method and NoPlanar method were similar to the corner points of the nominal rectangle. However, the Analytical method and Aruco method demonstrated large gaps. For the covariance ellipse, the ellipse size of the proposed method was smaller than others, indicating that the composite target and the optimization steps used in this paper can improve the accuracy and stability of extrinsic parameter calibration. It was not possible to test absolute accuracy, but the relative accuracy was evaluated. Mean values were used to calculate the average length and width of the rectangle along with the average angle of the four angles. Then, they were compared with the nominal length, width, and angle to compute the relative errors. Results are listed in Table 1. Although only the angular error of the proposed method was smallest compared to other methods, the width and length errors of the proposed method were highly similar to the minimum errors. Unfortunately, the accuracy of α 1 , α 2 , and α 3 could not be assessed effectively. Figure 7 reveals that the results of the proposed method with a target placement of C demonstrated small angle fluctuations within a range of less than 0.01 rad. Such stability can meet the needs of conventional applications. Overall, the robot extrinsic calibration accuracy and stability of the proposed method was relatively higher than others.

The Impact of the Number of Targets
To test the impact of the number of targets on the performance of the proposed method, 1-10 targets were used and arranged as presented in Figure 12. For each form, the proposed method was run ten times. Results are shown in Figure 13 with the standard deviations. As the number of targets increased, the calibration result became more stable, the jitter of the means became smaller, and the standard deviations were reduced. The indirect odometry calibration accuracy test method used in Section 4.3 was also used here with results shown in Figure 14. Overall, as the number of targets increased, the odometry error exhibited a downward trend. However, some abnormal situations appeared, presumably because (1) the trajectories of the 10 pieces of data were identical; (2) the number of targets observed by the keyframes were different; and (3) the moving speeds of the robot varied, causing different motion blurs among the 10 data and introducing different noises.

Design Odometry Parameters Comparison
The odometry parameters calibrated by the proposed method were also compared with the design odometry parameters. The left/right wheel factor k L/R can be obtained by the robot's mechanical parameters: where r L/R is left/right radius, N L/R is pulse number of a round of the left/right encoder that is mounted on the shaft of the left/right motor, and I L/R is the reduction ratio of the left/right motor. Using the designed parameters shown in Section 4.1, we get the design odometry parameters: parameters: r L/R = 3.7410 × 10 −5 , b = 0.32 m. The compared odometry parameters calibrated by the proposed method were obtained from the data in front of the target placement form C in Section 4.2.
The mean results of data C1-C3 were used: k L = 4.0652 × 10 −5 , k R = 4.0668 × 10 −5 , and b = 0.3166. The two sets of odometry parameters were compared using the method in Section 4.3. These two sets of odometry parameters were loaded into the robot respectively, and the two sets of data in Section 4.3 were run. The obtained trajectories are shown in Figure 15. It can be seen that, under the design odometry parameters, the odometry trajectories quickly diverge. In contrast, the odometry parameters calibrated by the proposed method enable high accuracy odometry trajectories.

Discussion
Our approach is similar to the work of [15]. The authors used multiple aruco markers to estimate camera poses and designed a two-step pipeline to simultaneously calibrate the odometry and the robot extrinsic parameters. First, the initial parameters are estimated through a non-iterative process by exploiting plane motion constraints; then, the parameters are refined by a joint optimization. This method has been proven to be robust to image noises and requires only a few aruco markers to be arranged in the environment, and is simple to operate. However, the method has its own limitations.
First, one of the problems of aruco markers is that the accuracy of their corner positions is not too high, even after applying sub-pixel refinement, which results in a low estimation of the camera poses; moreover, the effect of the number of the aruco markers on the calibration result is not tested. Second, the plane constraint is only introduced into the initial value estimation process, but is not into the join optimization solution step; in addition, only the simulation experiment is carried out, but no real experiment is performed to verify the role of the plane constraint. In contrast, the proposed method uses multiple composite targets, which combine the advantages of the aruco marker and the checkerboard pattern. The corners of checkerboard patterns can be refined more accurately since each corner is surrounded by two black squares, which results in a more precision calibration result than that of the aruco marker. It has been tested by our experiments. In addition, an automatic pipeline to combine these composite targets to select keyframes and estimate keyframe poses is proposed. In addition, we design an experiment to test the impact of the number of the composite targets. Moreover, we introduce two types of planar constraints and use them in all calibration processes. In the end, the effects of planar constraints were tested experimentally.
We have to point out that the proposed method uses multiple composite targets that are a bit complicated to fabricate and arrange. In contrast, the natural feature-based method [12][13][14] does not require targets or other equipment, and is thus easier to use. However, it also has its own limitations. For example, it requires sufficient texture in the environment, which means that the calibration results may vary in different environments. Moreover, the monocular camera pose estimated by the visual simultaneous localization and mapping method or the structure from motion method is up to a scale; the scale is usually obtained by pre-calibrated odometry parameters [13]. In this paper, the purpose of using multiple composite targets is to improve the stability and accuracy of the calibration. Of course, using fewer targets can reduce the complexity of the calibration, but the calibration accuracy and stability will reduce. In practical applications, a balanced selection can be made between the number of targets and the calibration accuracy and stability.

Conclusions
In this paper, we propose an automatic pipeline to simultaneously estimate the odometry and robot extrinsic parameters of a differential-drive mobile robot equipped with a monocular camera. This approach does not limit the path of the robot and does not require an initial assumption of calibration parameters, producing relatively more accurate and stable results than other methods. To address the low-accuracy problem of traditional artificial landmark tools, we propose a composite target consisting of an aruco marker and a checkerboard pattern and introduce a method to automatically combine multiple composite targets, select keyframes, and estimate keyframe poses. Initial values of the calibration parameters are computed by an analytical method and then optimized via two optimization steps. Several experiments were conducted to test the stability and accuracy of the proposed approach as well as the effectiveness and roles of key strategies. Results confirm the comparable performance of this method.

Conflicts of Interest:
The authors declare no conflict of interest.