1. Introduction
Multi-sensor fusion has attracted the interest of SLAM researchers in the field of autonomous robots. VINS [
1], ORB-SLAM3 [
2], and Smart Markers [
3] are excellent visual–inertial systems. However, the precise extrinsic parameters of camera-IMU are a prerequisite for these systems. Although some systems (such as VINS and VI-ORB-SLAM [
4]) provide online calibration methods, they often fail to calibrate the extrinsic parameters of a camera-IMU mounted on vehicles. This is because the vehicle generally drives on a flat surface and does not generate enough excitation. In this case, accurate camera poses are beneficial, allowing for us to calibrate the external parameters of camera-IMU using the hand–eye method [
5]. Many outstanding visual SLAM systems can be used to obtain camera poses, such as ORB-SLAM2 [
6], LSD-SLAM [
7], and LDSO [
8]. However, these methods, based on natural features or optical flow, could fail under illumination conditions, or instances with high dynamics, vigorous rotation, or a low-texture environment [
9]. Additionally, they cannot obtain camera poses with a metric scale in the monocular case. Therefore, we prefer to sacrifice the flexibility of such methods to improve accuracy and robustness using artificial planar landmarks.
The planar marker is an easy-to-obtain artificial mark. We can easily print the ArUco [
10,
11] or AprilTag [
12] markers in our laboratory. Therefore, planar-marker-based SfM or SLAM are very popular [
13,
14,
15,
16,
17]. Although marker-based SLAM is much simpler than feature-based SLAM, a challenging problem is that the planar marker contains ambiguity. The pose is the relative pose between the camera and the marker. Due to the noise in the four corners of the detection marker, two different solutions will be estimated in practice, and ambiguity only occurs in the rotational components of the pose [
18,
19,
20]. Generally, we use the ’infinitesimal plane-based pose estimation’ (IPPE) method [
20] to obtain the two solutions, which include the correct one.
Figure 1 shows two solutions
(the red bounding box) and
(the green bounding box), which are returned by IPPE method for a planar marker with ID 205. The red bounding box is the correct solution. The literature [
14,
15] has proposed several means to distinguish the correct solution. Nevertheless, these methods are still not robust enough and have difficulties in rotational movements.
To avoid or alleviate the difficulties in rotational movements, we propose enhancements to the traditional SPM-SLAM system [
15]. We present an improved version of this existing algorithm. We mainly improve four steps in SPM-SLAM. First, in the one-frame initialization, the correct maker pose is only chosen by IPPE if the marker is within a suitable view of the camera and the
ratio from Equation (
4) is lower than the threshold (
Section 4.1). The two-frame initialization is successful if the distance or angle between two frames is greater than the corresponding threshold (
Section 4.1). This makes it easier for the initialization to succeed under rotational movements. Second, the current frame can be inserted into the map as a keyframe if the angle between the current frame and the reference keyframe is greater than the threshold. This ensures that our system inserts enough keyframes under the rotation movements to continue tracking (
Section 4.2). Third, we present a new approach to select the best marker pose from a set of images by minimizing the M2M error (
Section 4.3). Finally, during the relocalization process, when one marker is detected, its correct pose is chosen using the same method as the one-frame initialization approach. When multiple markers are detected, their correct poses are chosen by minimizing the M2M error (
Section 4.4). We tested our system on the public dataset and our dataset. Compared with SPM-SLAM [
15], UcoSLAM [
16], and TagSLAM [
17], our system achieves better accuracy in most public sequences, and the speed of our system is much faster than that of UcoSLAM and TagSLAM. The results of our dataset show that our method is more robust and obtains more planar markers in the map than SPM-SLAM under rotational movements. Additionally, the open-source code is available on GitHub.
2. Related Research
As our work is dedicated to the planar marker-based SLAM system, a review of the natural feature-based SLAM literature is beyond the scope of this paper. Both Ref. [
9] and Ref. [
21] contain detailed reviews of natural feature-based SLAM systems. This section will focus on the marker-based SLAM and the problem of marker pose ambiguity.
SPM-SLAM [
15] is the first real-time marker-based SLAM system. It is able to deal with ambiguity problems and operate in large indoor environments. The SLAM system sequentially runs the tracking, mapping, and loop-closing processes in a single thread and achieves a good performance in most environments. However, our experiments (
Section 5.2) indicated that SPM-SLAM often failed to locate in certain situations where the camera’s motion included rotational movements. UcoSLAM [
16] is an extended version of SPM-SLAM and fuses the keypoints with squared fiducial markers. When disabling keypoints, it is similar to SPM-SLAM. However, it needs more computational resources and the addition of keypoints did not significantly improve SPM-SLAM’s tracking accuracy in our experiments.TagSLAM [
17] is also a visual SLAM with AprilTag fiducial markers. It considers every frame to be a keyframe and relies on iSAM2 [
22], which uses a factor graph to represent the pose optimization problems. However, TagSLAM cannot run in real-time unless a trusted map of tag poses is already available, because the graph grows over time and the CPU load increases. Our experiments indicated that TagSLAM is unstable. MarkerMapper [
14] is an offline method that obtains a map of fiducial markers and then obtains camera poses. It has several limitations. The most serious one is that the proposed method cannot run in real-time.
Additionally, the
ratio test [
20] is a general method to resolve the planar-marker’s pose ambiguity. Marker-based SfM [
14], SPM-SLAM [
15], and UcoSLAM [
16] also apply this method. The
ratio test means that if the
ratio from Equation (
4) is below a certain threshold, the IPPE solution with lower reprojection errors is the correct one. However, this was proved to not always be correct. Unlike the
ratio test, which used one frame to solve marker pose ambiguity, S. -F. Ch’ng [
23] exploited multi-view constraints for disambiguation. This method can efficiently choose the correct pose from two ambiguous poses by formulating a clique-constrained rotation averaging problem and a maximum weighted clique problem. However, the method consumes a high amount of computing resources and is not suitable for SLAM running in real-time.
4. Enhanced Methods
4.1. Initialization
The initialization process aims to establish a world reference system for the map, and add keyframes and markers with correct poses to the map, which is vital to an SLAM system. As in SPM-SLAM, we adopt one-frame and two-frame initialization. However, we set different conditions to decide whether the process succeeds.
One-frame initialization succeeds if the
ratio of at least one marker is lower than the threshold
in SPM-SLAM [
15]. However, when the marker is on the edge of an image, it is sometimes misjudged. In our system, the following two conditions must be met for one-frame initialization to succeed.
where
is calculated by (
5). The condition (
) means that a marker
m must be observed within a proper scope of view. The value
is related to the camera’s angle of view (FOV). This is set to 0.4–0.6 times the camera FOV used in our experiment.
For two-frame initialization, our method is the same as that of SPM-SLAM, which estimates the pose of frames and markers that are ambiguously observed in the two frames [
15]. Two-frame initialization is successful if one of the following two conditions is met.
where
is the distance between
and
. The angle
between
and
is calculated by (
6). In contrast to SPM-SLAM’s method, the condition
helps our system to successfully initialize under rotational movements.
4.2. New Keyframes and Markers Insertion
When a frame pose is successfully estimated in the tracking process, we need to decide whether the frame can be judged as a keyframe and inserted into the map. The insertion of keyframes has a heavy influence on local optimization and tracking. Too many keyframes may cause local optimization to consume a high amount of computational resources, while fewer keyframes may lead to a poor tracking performance, or even tracking failure. Therefore, keyframes can be inserted into the map when one of the following conditions are satisfied:
If at least one new marker is detected in a frame, the frame is spawned as a new keyframe and added to the map with the new marker(s). Notably, the pose(s) of the new marker(s) are set to zero. Later, these are estimated from multiple views (
Section 4.3).
If the distance between the current frame and the reference keyframe is larger than the threshold , the frame is added to the map.
If the angle between the current frame and the reference keyframe is larger than the threshold , the frame is added to the map.
Unlike SPM-SLAM, which only uses conditions 1 and 2, the use of condition 3 ensures that our system instantly inserts new keyframes during rotational movements. This improves the robustness of the system, allowing for it to adapt to more environments.
4.3. Marker Pose Estimation
Once a new marker
m is added to the map, the marker poses
should rapidly be obtained. We propose a novel method to estimate the marker pose from a set of keyframes that observe that marker
m. Let us denote
as the set of keyframes. The poses of
with keyframes are represented by
, which are known after the tracking process. Considering the pose ambiguity of marker
m in the set of keyframes
, we can obtain possible poses for the marker
m according to (
1), which is denoted by:
Our goal is to choose the best pose
of the marker
m from
. Given a set of
, the relative rotations from marker
i to marker
j (M2M) have four solutions, which are represented by
where
are the rotational components of the two ambiguous poses of marker
i in the frame
. They have a
matrix. Let us denote
as the observed markers with the correct poses from the set of keyframes
. Remember that these poses are known, and that we denote
as the rotational components.
Our method is to select the best pose
from
by minimizing the M2M error between marker
m and markers
in the set of keyframes
. The M2M error is represented by
where
is the rotational components of
, and
denotes the Frobenius norm. The best solution is the one that minimizes the M2M error e(
) in the set of keyframes
. Note that the translation component is already included in
. Therefore, we can find a correct pose for marker
m by (
11). Furthermore, the angle or distance between the keyframes in
must be larger than threshold
or threshold
, respectively. This could make (
11) more effective. Note that the parameters
and
are the same as those mentioned in
Section 4.2.
4.4. Relocalization
The relocalization is run when the system fails to track. Let us denote as a marker with the correct pose detected in current frame , and n is the number of markers in .
If , the process will fail.
If
, two methods can be used to obtain a solution. One is that, if the marker satisfies condition (
7), we can obtain the marker’s unambiguous pose
. Then, we can successfully compute the current frame pose
using (
1). The other method is that, if the marker is not satisfied with the condition (
7), we can obtain the marker’s ambiguous poses
and
using (
1). Naturally, the two poses (
and
) of the current frame are computed. Then, the two distances between the current frame and the reference frame are calculated using the two poses. Finally, the pose with the smaller distance is the correct one.
If
, let us denote
as the ambiguous poses of markers
in current frame
, and denote
as the correct poses. The M2M error in
is represented by:
where
(
matrix) is the rotational component of
, and
is the rotational component of
. Here, note that if the right rotational component is selected, the translation component is also selected. Accordingly, our goal is firstly to select a pair of poses
and
from
by minimizing the M2M error (
14), and then calculate the pose with (
1). As the correct poses (
) are known, we can obtain two solutions (
and
) for the current frame pose. If the distance between
and
is lower than the threshold
(default is 0.01m in our system), the relocalization process will succeed and the two poses (
and
) are available. Otherwise, relocalization will fail.
5. Experiments and Results
This section evaluates our method’s performance in the publicly available SPM dataset [
15] and our dataset. The SPM dataset has eight sequences and provides ground truth trajectories for each sequence. Our dataset contains six indoor sequences and two outdoor sequences, which include more rotational movements. We aim to evaluate the tracking accuracy and robustness of our methods. Since the SPM dataset offers the ground truth, it was used to evaluate the tracking accuracy. We assessed the robustness of our system on our dataset, which includes many rotational movements. All tests were run on an Intel Core i5-11600K 3.9GHz desktop computer.
In addition, our method can be compared with other marker-based SLAM systems, which include SPM-SLAM, TagSLAM, and UcoSLAM, as SPM-SLAM, UcoSLAM, and our system have the same parameters. In order to be comparable, these same parameters must be equal. The parameters
and
(
Section 4) employed in all three systems were set at 7 mm and 0.333 in the two datasets. The other parameters
and
, which were only employed in our system, were set at 5° and 0.01 m, which proved to be good values in most of our experiments.
5.1. SPM Dataset
The SPM dataset contains eight video sequences, with fiducial markers recorded in a laboratory, and provides a ground truth. The camera works at 60 Hz with a resolution after rectification of 1920 × 1080 pixels. The sequences 01, 02, 03, 07, and 08 were recorded by a camera pointed toward the walls, and others were recorded by a camera pointed towards the ceiling. We evaluated the accuracy of the trajectory using the Absolute Trajectory Error (ATE) measure, which is the translational RMSE. In addition, the system’s mean speed was also considered as an evaluation indicator.
SPM-SLAM, TagSLAM, and UcoSLAM were selected for comparison in the experiments.
Table 1 shows the ATEs and Trck of the methods tested on the eight sequences.
Figure 4 shows the trajectories for ground truth (dash gray), SPM-SLAM (blue), TagSLAM (green), UcoSLAM (red), and ours (purple). As can be observed, TagSLAM’s accuracy was the worst, especially in sequence 01, where its ATE was 0.43 m. Compared with SPM-SLAM and UcoSLAM, our method’s accuracy was slightly better in most sequences.
Figure 5 shows the mean speed of the four systems. The graph shows that TagSLAM and UcoSLAM are much slower than other methods. This is because TagSLAM considers every frame to be a keyframe and UcoSLAM combines keypoints and markers, forming the CPU load. In contrast, our method and SPM-SLAM can run at about 145 frames per second (fps) on the SPM dataset.
5.2. Our Dataset
The test on the public dataset showed that our method significantly outperforms TagSLAM and UcoSLAM in accuracy and speed. However, the accuracy of our method is slightly better than SPM-SLAM in most sequences, and the speed is almost the same as SPM-SLAM. Therefore, to further evaluate that our method is more robust than SPM-SLAM in terms of rotational movements, it was tested and compared with SPM-SLAM on our dataset.
For this test, we recorded eight video sequences. Sequences 01–06 were recorded in a laboratory, and 22 markers were placed on the wall in an approximated dimension of 1.5 × 1.5 square meters. Sequences 07 and 08 were recorded outdoors, and 48 markers were placed on the ground in an approximated dimension of 20 × 15 and 30 × 20 square meters.
Figure 6 shows the markers on the ground for sequences 07 and 08. It is worth noting that outdoor sequences 07 and 08 were recorded with a camera mounted on a vehicle. The camera captured images at a 1920 × 1080 pixel resolution and a frame rate of 30Hz. For sequences 01 and 02, the camera was pointed towards the wall and was spinning. Furthermore, sequence 01 was rotated about 270 degrees, and sequence 02 started and ended at the same position to facilitate loop-closure detection. Different from sequences 01–02, sequences 03–06 have more positional movements and only have rotational movements at the corners. For outdoor sequences, sequence 08 showed more rotational movement than sequence 07. Sequences 02–08 can facilitate the detection of loop closure. Although we could not obtain the ground truth for the camera poses, the tracking success rate (Trck) and the number of markers reconstructed in the mapping were used to evaluate the system’s robustness. The Trck is the ratio of the number of tracked frames to the total number of frames in a sequence.
Table 2 shows the number reconstructed markers and the tracking rate. As can be observed, the SPM-SLAM Trcks were lower than 55% on sequences 01–03 and 05–06, while the proposed method achieved over 99% on all sequences. Additionally, SPM-SLAM only reconstructed all markers in sequence 07. In contrast, our method reconstructed all markers in all sequences.
Figure 7 shows the reconstruction results of both methods for sequences 01 and 02. This clearly shows that more markers were reconstructed by our method than SPM-SLAM.
Figure 8a–h show the reconstructed trajectory of both methods for sequences 03–06. We can observe that our method outperforms SPM-SLAM, especially in sequences 03 and 06, where SPM-SLAM almost completely failed to reconstruct the trajectory.
Figure 8i–l show the reconstructed trajectory of both methods for sequences 07 and 08 (outdoors). As can be observed, our method was slightly better than SPM-SLAM on sequence 07 and much better than SPM-SLAM on sequence 08. This is not unexpected, since sequence 08 has more rotational movement than sequence 07. In summary, the experiments show that the robustness of our system is better than that of the SPM-SLAM under rotational movements.