ISPRS International Journal of Geo-Information
  • Article
  • Open Access

9 July 2024

Globally Optimal Relative Pose and Scale Estimation from Only Image Correspondences with Known Vertical Direction

1 Global Navigation Satellite System (GNSS) Research Center, Wuhan University, Wuhan 430072, China
2 School of Future Technology, South China University of Technology, Guangzhou 510641, China
3 School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
This article belongs to the Topic 3D Computer Vision and Smart Building and City, 2nd Volume

Abstract

Installing multi-camera systems and inertial measurement units (IMUs) in self-driving cars, micro aerial vehicles, and robots is becoming increasingly common. An IMU provides the vertical direction, which allows the coordinate frames of two views to be aligned with a common direction and reduces the degrees of freedom (DOFs) of the rotation matrix from 3 to 1. In this paper, we propose a globally optimal solver to calculate the relative pose and scale of generalized cameras with a known vertical direction. First, a cost function is established to minimize the algebraic error in the least-squares sense. Then, the cost function is transformed into two polynomials with only two unknowns. Finally, the eigenvalue method is used to solve for the relative rotation angle. The performance of the proposed method is verified on both simulated data and the KITTI dataset. Experiments show that our method is more accurate than the existing state-of-the-art solvers in estimating the relative pose and scale. Compared to the best method among the comparison methods, the method proposed in this paper reduces the rotation matrix error, translation vector error, and scale error by 53%, 67%, and 90%, respectively.

1. Introduction

Estimating the relative pose between two views is a cornerstone of multi-view geometry and has been applied in many fields, such as visual odometry (VO) [1,2], structure from motion (SfM) [3,4], and simultaneous localization and mapping (SLAM) [5,6,7]. Camera-based indoor positioning is another application [8,9,10]. A multi-camera system consists of multiple individual cameras rigidly mounted on a common platform. It offers a large field of view and high precision in relative pose estimation. Therefore, many researchers have studied how to improve the accuracy, robustness, and efficiency of relative pose estimation algorithms.
Compared with the standard pinhole camera, the multi-camera system lacks a single projection center. To address this, Plücker lines are used to represent the light rays that pass through the multi-camera system [8]. This model is called the generalized camera model (GCM) [11], which describes the relationship between the Plücker lines and the generalized essential matrix. Generalized cameras are widely used in fields such as autonomous driving, robotics, and micro aerial vehicles. Pless [11] proposed calculating the relative pose of a multi-camera system from 17 point correspondences. Several methods have since been developed for multi-camera relative pose estimation using the generalized camera model [12,13,14,15,16,17]. However, these algorithms can estimate multi-camera poses only if the distances between all camera centers are known. This assumption limits the use of the generalized camera model. In this paper, we focus on estimating the relative pose and scale of the generalized camera model.
The relative pose and scale of the multi-camera system consist of a total of 7 degrees of freedom (DOFs), including 3 DOFs of the rotation matrix, 3 DOFs of translation vectors, and 1 DOF for scale, as shown in Figure 1. To improve the relative pose estimation of the camera and reduce the minimum number of required feature point pairs, an auxiliary sensor, such as an IMU, is often added [13,14,15,18,19,20]. An IMU can provide the yaw, pitch, and roll angles. The pitch and roll angles are more accurate than the yaw angle [21]. Therefore, we can use the pitch and roll angle provided by the IMU to reduce the degrees of freedom of the rotation matrix from 3 to 1. In this case, the relative pose and scale of the multi-camera system consist of a total of 5 degrees of freedom (DOFs). Thus, only five feature point pairs are needed to estimate the relative poses and scales of the camera [22].
Figure 1. The rotation matrix, translation vector, and scale are R , t , and s , respectively.
Depending on the number of points required to solve for the relative pose of a multi-camera system, solvers can be divided into minimal and non-minimal sample solvers. When the distances between camera centers are known, many researchers have used both minimal and non-minimal samples to solve for the relative pose. A constraint equation relating the rotation matrix, translation vector, scale, and Plücker lines of the generalized camera model is proposed in [22], together with a minimal solver for the case of a known vertical direction. Because it works directly with 2D-2D correspondences, this approach avoids the 3D point noise that affects methods based on 2D-3D point correspondences. A simple heuristic global energy minimization scheme based on local minimum suppression is proposed in [23]. However, this method is not a closed-form solver.
This paper mainly focuses on globally optimal relative pose and scale estimation from 2D-2D point correspondences. In devices such as mobile phones, unmanned aerial vehicles, and autonomous vehicles, the cameras and the IMU are usually rigidly fixed together, so we assume the vertical direction is known. The main contributions of this paper are summarized below:
  • A novel globally optimal solver to estimate relative pose and scale is proposed from N 2D-2D point correspondences (N > 5). This problem is transformed into a cost function based on the least-squares sense to minimize algebraic error.
  • We transform the cost function into two polynomial equations in two unknowns, expressed in terms of the relative rotation angle parameter. The highest degree of the rotation angle parameter is 16.
  • We derive and provide a solver based on polynomial eigenvalues to calculate the relative rotation angle parameter. The translation vector and scale information are obtained from the corresponding eigenvectors.
The rest of this paper is organized as follows. We review the related work in Section 2. In Section 3, we introduce the methodology, including the generalized epipolar constraint, the problem description, and the globally optimal solver. We test the performance of the proposed method on synthetic and real-world data and discuss the results in Section 4. The conclusions are presented in Section 5.

3. Methodology

3.1. Epipolar Constraint

In a multi-camera system, we define a corresponding point pair ( x_k^i, x_k^j, R_k^i, R_k^j, t_k^i, t_k^j ), where x_k^i is the normalized homogeneous coordinate of the feature point observed by the i-th camera in frame k and x_k^j is the normalized homogeneous coordinate of the corresponding feature point observed by the j-th camera in frame k + 1. The rotation matrix and translation vector of the i-th camera in frame k are R_k^i and t_k^i. The rotation matrix and translation vector of the j-th camera in frame k + 1 are R_k^j and t_k^j, as shown in Figure 2. l_k^i and l_k^j denote the pair of corresponding Plücker line vectors in frames k and k + 1. The Plücker line vectors are written as follows:
$$\mathbf{l}_k^i = \begin{bmatrix}\mathbf{f}_k^i\\ \mathbf{t}_k^i\times\mathbf{f}_k^i\end{bmatrix},\qquad \mathbf{l}_k^j = \begin{bmatrix}\mathbf{f}_k^j\\ \mathbf{t}_k^j\times\mathbf{f}_k^j\end{bmatrix}, \tag{1}$$
where f_k^i and f_k^j denote the unit direction vectors of the corresponding rays in the two generalized cameras, which are written as follows:
$$\mathbf{f}_k^i = \frac{\mathbf{R}_k^i\mathbf{x}_k^i}{\left\|\mathbf{R}_k^i\mathbf{x}_k^i\right\|},\qquad \mathbf{f}_k^j = \frac{\mathbf{R}_k^j\mathbf{x}_k^j}{\left\|\mathbf{R}_k^j\mathbf{x}_k^j\right\|}. \tag{2}$$
Figure 2. The rotation matrix and translation vector of the i-th camera in frame k are R_k^i and t_k^i. The rotation matrix and translation vector of the j-th camera in frame k + 1 are R_k^j and t_k^j. The rotation matrix, translation vector, and scale between the aligned frames k and k + 1 are R_y, t̃, and s.
The generalized epipolar constraint is written as follows:
$$\mathbf{l}_k^{jT}\begin{bmatrix}\mathbf{E} & \mathbf{R}\\ \mathbf{R} & \mathbf{0}\end{bmatrix}\mathbf{l}_k^i = 0, \tag{3}$$
where R and t represent the original rotation matrix and translation vector, respectively, E = [t]_× R is the essential matrix, and [t]_× denotes the skew-symmetric matrix of t. Expanding Equation (3) yields
$$\mathbf{f}_k^{jT}\mathbf{E}\,\mathbf{f}_k^i + \mathbf{f}_k^{jT}\left(\mathbf{R}[\mathbf{t}_k^i]_\times - [\mathbf{t}_k^j]_\times\mathbf{R}\right)\mathbf{f}_k^i = 0. \tag{4}$$
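As a concrete illustration of Equations (1)–(4), the following sketch (Python/NumPy; the function and variable names are ours, not from the paper) builds the Plücker lines of two corresponding rays from the per-camera extrinsics and evaluates the generalized epipolar residual for a hypothesized relative pose R, t:

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]_x such that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def plucker_line(R_cam, t_cam, x_norm):
    """Plucker line (direction, moment) of the ray observing the normalized
    homogeneous image point x_norm, as in Equations (1) and (2)."""
    f = R_cam @ x_norm
    f = f / np.linalg.norm(f)            # unit direction vector
    return f, np.cross(t_cam, f)         # (f, t x f)

def generalized_epipolar_residual(R, t, line_i, line_j):
    """Residual of the generalized epipolar constraint, Equations (3)-(4);
    it vanishes for noise-free correspondences and the true (R, t)."""
    f_i, m_i = line_i
    f_j, m_j = line_j
    E = skew(t) @ R                      # essential matrix E = [t]_x R
    return f_j @ E @ f_i + f_j @ R @ m_i + m_j @ R @ f_i
```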
A large number of solvers have been proposed for this problem. However, these methods assume that the scale information has already been resolved. In many applications the scale is ambiguous, so the above methods cannot be applied directly. Therefore, we mainly focus on relative pose estimation when the scale is unknown. Equation (5) can be easily obtained from Equation (4):
$$\mathbf{f}_k^{jT}\mathbf{E}\,\mathbf{f}_k^i + \mathbf{f}_k^{jT}\left(\mathbf{R}[\mathbf{t}_k^i]_\times - s\,[\mathbf{t}_k^j]_\times\mathbf{R}\right)\mathbf{f}_k^i = 0, \tag{5}$$
where s represents the scale. Similar to Equation (3), Equation (5) can be expressed as follows:
$$\mathbf{l}_k^{jT}\begin{bmatrix}\mathbf{E} & \mathbf{R}\\ s\mathbf{R} & \mathbf{0}\end{bmatrix}\mathbf{l}_k^i = 0. \tag{6}$$
A multi-camera system is usually coupled with an IMU, which can provide the roll (θ_x) and pitch (θ_z) angles. We define the Y-axis of the camera coordinate system to be parallel to the vertical direction, so that the X-Z plane is orthogonal to the Y-axis. R_imu and R'_imu denote the alignment rotations provided by the IMU at frames k and k + 1, respectively, where
$$\mathbf{R}_{\mathrm{imu}} = \mathbf{R}_x\mathbf{R}_z, \tag{7}$$
where R x and R z can be written as follows:
$$\mathbf{R}_x = \begin{bmatrix}1 & 0 & 0\\ 0 & \cos\theta_x & -\sin\theta_x\\ 0 & \sin\theta_x & \cos\theta_x\end{bmatrix},\qquad \mathbf{R}_z = \begin{bmatrix}\cos\theta_z & -\sin\theta_z & 0\\ \sin\theta_z & \cos\theta_z & 0\\ 0 & 0 & 1\end{bmatrix}. \tag{8}$$
The rotation matrix and translation vector between the aligned frames k and k + 1 are R y and t ˜ . The relationship between the original relative pose ( R and t ) and the aligned relative pose ( R y and t ˜ ) can be expressed as follows:
$$\mathbf{R} = (\mathbf{R}'_{\mathrm{imu}})^T\mathbf{R}_y\mathbf{R}_{\mathrm{imu}},\qquad \mathbf{t} = (\mathbf{R}'_{\mathrm{imu}})^T\tilde{\mathbf{t}}, \tag{9}$$
where
$$\mathbf{R}_y = \begin{bmatrix}\cos\theta_y & 0 & \sin\theta_y\\ 0 & 1 & 0\\ -\sin\theta_y & 0 & \cos\theta_y\end{bmatrix}, \tag{10}$$
where θ_y represents the rotation angle about the Y-axis. The rotation matrix R_y can be rewritten with the Cayley parameterization as follows:
$$\mathbf{R}_y = \frac{1}{1+y^2}\begin{bmatrix}1-y^2 & 0 & 2y\\ 0 & 1+y^2 & 0\\ -2y & 0 & 1-y^2\end{bmatrix}, \tag{11}$$
where y = tan(θ_y/2). The Cayley parameterization is degenerate at θ_y = 180°, but this case is very rare in practical applications [33].
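The following sketch (Python/NumPy, illustrative names) constructs R_x, R_z, R_imu, and the Cayley-parameterized R_y of Equations (7), (8), (10), and (11), using the standard right-handed rotation conventions assumed in the reconstruction above:

```python
import numpy as np

def rot_x(theta_x):
    """Rotation about the X-axis (roll), Equation (8)."""
    c, s = np.cos(theta_x), np.sin(theta_x)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_z(theta_z):
    """Rotation about the Z-axis (pitch), Equation (8)."""
    c, s = np.cos(theta_z), np.sin(theta_z)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def imu_rotation(theta_x, theta_z):
    """R_imu = R_x R_z, Equation (7): aligns the camera frame so that the
    Y-axis is parallel to the vertical direction."""
    return rot_x(theta_x) @ rot_z(theta_z)

def rot_y_cayley(y):
    """Cayley-parameterized rotation about the Y-axis, Equation (11),
    with y = tan(theta_y / 2)."""
    d = 1.0 + y * y
    return np.array([[1.0 - y * y, 0.0, 2.0 * y],
                     [0.0, d, 0.0],
                     [-2.0 * y, 0.0, 1.0 - y * y]]) / d

# Sanity check: the Cayley form reproduces Equation (10).
theta_y = 0.3
R_y_direct = np.array([[np.cos(theta_y), 0.0, np.sin(theta_y)],
                       [0.0, 1.0, 0.0],
                       [-np.sin(theta_y), 0.0, np.cos(theta_y)]])
assert np.allclose(rot_y_cayley(np.tan(theta_y / 2.0)), R_y_direct)
```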
By substituting Equation (9) into Equation (6), we can obtain
$$\left(\begin{bmatrix}\mathbf{R}'_{\mathrm{imu}} & \mathbf{0}\\ \mathbf{0} & \mathbf{R}'_{\mathrm{imu}}\end{bmatrix}\mathbf{l}_k^j\right)^T\begin{bmatrix}[\tilde{\mathbf{t}}]_\times\mathbf{R}_y & \mathbf{R}_y\\ s\mathbf{R}_y & \mathbf{0}\end{bmatrix}\begin{bmatrix}\mathbf{R}_{\mathrm{imu}} & \mathbf{0}\\ \mathbf{0} & \mathbf{R}_{\mathrm{imu}}\end{bmatrix}\mathbf{l}_k^i = 0. \tag{12}$$
By substituting Equation (1) into Equation (12), we can obtain
$$\begin{bmatrix}\mathbf{R}'_{\mathrm{imu}}\mathbf{f}_k^j\\ \mathbf{R}'_{\mathrm{imu}}(\mathbf{t}_k^j\times\mathbf{f}_k^j)\end{bmatrix}^T\begin{bmatrix}[\tilde{\mathbf{t}}]_\times\mathbf{R}_y & \mathbf{R}_y\\ s\mathbf{R}_y & \mathbf{0}\end{bmatrix}\begin{bmatrix}\mathbf{R}_{\mathrm{imu}}\mathbf{f}_k^i\\ \mathbf{R}_{\mathrm{imu}}(\mathbf{t}_k^i\times\mathbf{f}_k^i)\end{bmatrix} = 0. \tag{13}$$

3.2. Problem Description

The constraint of Equation (13) derived in the previous section can be rewritten as follows:
$$\begin{bmatrix} \left(\mathbf{f}_k^j \times (\mathbf{R}'_{\mathrm{imu}})^T\mathbf{R}_y\mathbf{R}_{\mathrm{imu}}\mathbf{f}_k^i\right)^T & \mathbf{f}_k^{jT}[\mathbf{t}_k^j]_\times(\mathbf{R}'_{\mathrm{imu}})^T\mathbf{R}_y\mathbf{R}_{\mathrm{imu}}\mathbf{f}_k^i & \mathbf{f}_k^{jT}(\mathbf{R}'_{\mathrm{imu}})^T\mathbf{R}_y\mathbf{R}_{\mathrm{imu}}(\mathbf{t}_k^i\times\mathbf{f}_k^i) \end{bmatrix}\begin{bmatrix}\tilde{\mathbf{t}}\\ s\\ 1\end{bmatrix} = \mathbf{M}^T\begin{bmatrix}\tilde{\mathbf{t}}\\ s\\ 1\end{bmatrix} = 0, \tag{14}$$
where M is a 5 × 1 vector. Suppose there are N point correspondences. For the minimal case N = 5, Sweeney et al. propose a 5-point solver in [22]. In this paper, we focus on non-minimal point correspondences (N > 5). We can stack the constraints of all correspondences:
$$\mathbf{M}^T\begin{bmatrix}\tilde{\mathbf{t}}\\ s\\ 1\end{bmatrix} = \begin{bmatrix}\mathbf{m}_1 & \mathbf{m}_2 & \cdots & \mathbf{m}_N\end{bmatrix}^T\begin{bmatrix}\tilde{\mathbf{t}}\\ s\\ 1\end{bmatrix} = \mathbf{0}. \tag{15}$$
We minimize the algebraic error in the least-squares sense and establish the cost function
$$\underset{\mathbf{R}_y,\,\tilde{\mathbf{t}},\,s}{\arg\min}\ \hat{\mathbf{t}}^T\mathbf{C}\,\hat{\mathbf{t}}, \tag{16}$$
where t̂ = [ t̃^T s 1 ]^T and C = M M^T. Matrix C can be expressed as follows:
$$\mathbf{C} = \begin{bmatrix}C_{11} & C_{12} & C_{13} & C_{14} & C_{15}\\ C_{12} & C_{22} & C_{23} & C_{24} & C_{25}\\ C_{13} & C_{23} & C_{33} & C_{34} & C_{35}\\ C_{14} & C_{24} & C_{34} & C_{44} & C_{45}\\ C_{15} & C_{25} & C_{35} & C_{45} & C_{55}\end{bmatrix}. \tag{17}$$
The only unknown in matrix C is y. Supposing λ_C,min is the smallest eigenvalue of matrix C, the optimization problem is expressed by
$$\mathbf{R}_y = \underset{\mathbf{R}_y}{\arg\min}\ \lambda_{\mathbf{C},\min}. \tag{18}$$
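Because y is the only unknown in C, Equation (18) can be checked numerically by scanning y and evaluating the smallest eigenvalue of C(y). The sketch below is not the closed-form solver derived next; it is a brute-force reference useful for validation, and it assumes a user-supplied routine build_C(y) that assembles the 5 × 5 matrix C from the stacked correspondences:

```python
import numpy as np

def smallest_eigenvalue(build_C, y):
    """lambda_{C,min}(y): smallest eigenvalue of the symmetric matrix C(y)."""
    return np.linalg.eigvalsh(build_C(y))[0]

def brute_force_rotation_parameter(build_C, y_range=(-10.0, 10.0), samples=20001):
    """Dense-sampling reference solution of Equation (18); it returns the y
    that minimizes the smallest eigenvalue of C(y) over the scanned range."""
    ys = np.linspace(y_range[0], y_range[1], samples)
    values = [smallest_eigenvalue(build_C, y) for y in ys]
    return ys[int(np.argmin(values))]
```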
The characteristic polynomial of C can be written as
$$\det(\lambda\mathbf{I} - \mathbf{C}) = \lambda^5 + f_1\lambda^4 + f_2\lambda^3 + f_3\lambda^2 + f_4\lambda + f_5, \tag{19}$$
where λ is an eigenvalue of matrix C, and I is the 5 × 5 identity matrix. The specific expressions for f_1, f_2, f_3, f_4, and f_5 contain only the unknown y.
For convenience of narration, we use λ instead of λ_C,min. Based on the properties of eigenvalues, Equation (19) can be rewritten as follows:
$$\lambda^5 + f_1\lambda^4 + f_2\lambda^3 + f_3\lambda^2 + f_4\lambda + f_5 = 0. \tag{20}$$
dλ/dy = 0 is a necessary condition for λ to be a minimum of the smallest eigenvalue of C with respect to y. Differentiating Equation (20) with respect to y and imposing dλ/dy = 0, we can obtain
$$\frac{df_1}{dy}\lambda^4 + \frac{df_2}{dy}\lambda^3 + \frac{df_3}{dy}\lambda^2 + \frac{df_4}{dy}\lambda + \frac{df_5}{dy} = 0. \tag{21}$$
Define δ = 1 + y². Then, we can obtain
$$f_1 = \frac{g_1}{\delta^2},\quad f_2 = \frac{g_2}{\delta^4},\quad f_3 = \frac{g_3}{\delta^6},\quad f_4 = \frac{g_4}{\delta^7},\quad f_5 = \frac{g_5}{\delta^8}, \tag{22}$$
$$\frac{df_1}{dy} = \frac{h_1}{\delta^3},\quad \frac{df_2}{dy} = \frac{h_2}{\delta^5},\quad \frac{df_3}{dy} = \frac{h_3}{\delta^7},\quad \frac{df_4}{dy} = \frac{h_4}{\delta^8},\quad \frac{df_5}{dy} = \frac{h_5}{\delta^9}, \tag{23}$$
where g_1, g_2, g_3, g_4, g_5, h_1, h_2, h_3, h_4, and h_5 are polynomials in y only. The highest degree of y in each is shown in Table 1.
Table 1. Highest degree of variable y .
Define β = δλ. Equations (20) and (21) can be rewritten as follows:
$$\begin{cases}\delta^3\beta^5 + \delta^2 g_1\beta^4 + \delta g_2\beta^3 + g_3\beta^2 + g_4\beta + g_5 = 0\\ \delta^2 h_1\beta^4 + \delta h_2\beta^3 + h_3\beta^2 + h_4\beta + h_5 = 0\end{cases} \tag{24}$$
We can rewrite Equation (24) as follows:
$$\begin{bmatrix}\delta^3 & \delta^2 g_1 & \delta g_2 & g_3 & g_4 & g_5\\ 0 & \delta^2 h_1 & \delta h_2 & h_3 & h_4 & h_5\end{bmatrix}\begin{bmatrix}\beta^5\\ \beta^4\\ \beta^3\\ \beta^2\\ \beta\\ 1\end{bmatrix} = \mathbf{0}. \tag{25}$$

3.3. Globally Optimal Solver

Equation (24) consists of two equations in six monomials of β. To equalize the number of equations and the number of monomials, the first equation of Equation (24) is multiplied by β³, β², and β, and the second equation is multiplied by β⁴, β³, β², and β. We thus obtain the following seven equations:
$$\begin{cases}
\delta^3\beta^6 + \delta^2 g_1\beta^5 + \delta g_2\beta^4 + g_3\beta^3 + g_4\beta^2 + g_5\beta = 0\\
\delta^3\beta^7 + \delta^2 g_1\beta^6 + \delta g_2\beta^5 + g_3\beta^4 + g_4\beta^3 + g_5\beta^2 = 0\\
\delta^3\beta^8 + \delta^2 g_1\beta^7 + \delta g_2\beta^6 + g_3\beta^5 + g_4\beta^4 + g_5\beta^3 = 0\\
\delta^2 h_1\beta^5 + \delta h_2\beta^4 + h_3\beta^3 + h_4\beta^2 + h_5\beta = 0\\
\delta^2 h_1\beta^6 + \delta h_2\beta^5 + h_3\beta^4 + h_4\beta^3 + h_5\beta^2 = 0\\
\delta^2 h_1\beta^7 + \delta h_2\beta^6 + h_3\beta^5 + h_4\beta^4 + h_5\beta^3 = 0\\
\delta^2 h_1\beta^8 + \delta h_2\beta^7 + h_3\beta^6 + h_4\beta^5 + h_5\beta^4 = 0
\end{cases} \tag{26}$$
Based on Equations (24) and (26), we can easily obtain nine equations with nine monomials.
$$\mathbf{V}_{9\times 9}\,\mathbf{X}_{9\times 1} = \mathbf{0}, \tag{27}$$
where V is a 9 × 9 matrix and X is a 9 × 1 vector, given by
$$\mathbf{V} = \begin{bmatrix}
\delta^3 & \delta^2 g_1 & \delta g_2 & g_3 & g_4 & g_5 & 0 & 0 & 0\\
0 & \delta^3 & \delta^2 g_1 & \delta g_2 & g_3 & g_4 & g_5 & 0 & 0\\
0 & 0 & \delta^3 & \delta^2 g_1 & \delta g_2 & g_3 & g_4 & g_5 & 0\\
0 & 0 & 0 & \delta^3 & \delta^2 g_1 & \delta g_2 & g_3 & g_4 & g_5\\
\delta^2 h_1 & \delta h_2 & h_3 & h_4 & h_5 & 0 & 0 & 0 & 0\\
0 & \delta^2 h_1 & \delta h_2 & h_3 & h_4 & h_5 & 0 & 0 & 0\\
0 & 0 & \delta^2 h_1 & \delta h_2 & h_3 & h_4 & h_5 & 0 & 0\\
0 & 0 & 0 & \delta^2 h_1 & \delta h_2 & h_3 & h_4 & h_5 & 0\\
0 & 0 & 0 & 0 & \delta^2 h_1 & \delta h_2 & h_3 & h_4 & h_5
\end{bmatrix}, \tag{28}$$
$$\mathbf{X} = \begin{bmatrix}\beta^8 & \beta^7 & \beta^6 & \beta^5 & \beta^4 & \beta^3 & \beta^2 & \beta & 1\end{bmatrix}^T. \tag{29}$$
The elements of V are univariate polynomials in y , and the highest degree of y is 16. Equation (27) can be rewritten as follows:
$$\left(\mathbf{V}_0 + y\mathbf{V}_1 + \cdots + y^{16}\mathbf{V}_{16}\right)\mathbf{X} = \mathbf{0}. \tag{30}$$
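In an implementation, each entry of V is stored as a univariate polynomial in y, and the coefficient matrices V_0, …, V_16 of Equation (30) are obtained by collecting powers of y. A minimal sketch (NumPy; polynomials are ascending-order coefficient arrays, and the names are illustrative):

```python
import numpy as np

MAX_DEG = 16  # highest degree of y in the entries of V

def poly_mul(p, q):
    """Product of two polynomials given as ascending coefficient arrays."""
    return np.convolve(p, q)

# Example of forming one entry of V: delta^2 * g1, with delta = 1 + y^2.
delta = np.array([1.0, 0.0, 1.0])          # coefficients of 1 + y^2
# g1 = ...  (coefficient array of g1(y), obtained from Equation (22))
# entry = poly_mul(poly_mul(delta, delta), g1)

def split_poly_matrix(V_poly):
    """Given a 9x9 nested list of ascending coefficient arrays, return the
    list [V_0, ..., V_16] of numeric 9x9 matrices such that
    V(y) = V_0 + y*V_1 + ... + y^16 * V_16, as in Equation (30)."""
    V_k = [np.zeros((9, 9)) for _ in range(MAX_DEG + 1)]
    for i in range(9):
        for j in range(9):
            coeffs = np.atleast_1d(np.asarray(V_poly[i][j], dtype=float))
            for k, c in enumerate(coeffs[:MAX_DEG + 1]):
                V_k[k][i, j] = c
    return V_k
```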
We define the matrices B, J, and L as
$$\mathbf{B} = \begin{bmatrix}\mathbf{0} & \mathbf{I} & & \\ & \ddots & \ddots & \\ & & \mathbf{0} & \mathbf{I}\\ -\mathbf{V}_0 & -\mathbf{V}_1 & \cdots & -\mathbf{V}_{15}\end{bmatrix},\quad \mathbf{J} = \begin{bmatrix}\mathbf{I} & & & \\ & \ddots & & \\ & & \mathbf{I} & \\ & & & \mathbf{V}_{16}\end{bmatrix},\quad \mathbf{L} = \begin{bmatrix}\mathbf{X}\\ y\mathbf{X}\\ \vdots\\ y^{15}\mathbf{X}\end{bmatrix}. \tag{31}$$
Equation (30) can then be rewritten as B L = y J L, so y is an eigenvalue of J⁻¹B, which can be written as follows:
$$\mathbf{J}^{-1}\mathbf{B} = \begin{bmatrix}\mathbf{0} & \mathbf{I} & & \\ & \ddots & \ddots & \\ & & \mathbf{0} & \mathbf{I}\\ -\mathbf{V}_{16}^{-1}\mathbf{V}_0 & -\mathbf{V}_{16}^{-1}\mathbf{V}_1 & \cdots & -\mathbf{V}_{16}^{-1}\mathbf{V}_{15}\end{bmatrix}, \tag{32}$$
where V_0, V_1, …, V_15 and V_16 are 9 × 9 matrices.
In practice, V_16 contains zero columns and is therefore singular, so V_16⁻¹ cannot be computed reliably. Since V_0 has full rank, we define z = 1/y, and Equation (30) can be rewritten as follows:
$$\left(z^{16}\mathbf{V}_0 + z^{15}\mathbf{V}_1 + z^{14}\mathbf{V}_2 + \cdots + \mathbf{V}_{16}\right)\mathbf{X} = \mathbf{0}, \tag{33}$$
where z is an eigenvalue of the matrix G, given by
$$\mathbf{G} = \begin{bmatrix}\mathbf{0} & \mathbf{I} & & \\ & \ddots & \ddots & \\ & & \mathbf{0} & \mathbf{I}\\ -\mathbf{V}_0^{-1}\mathbf{V}_{16} & -\mathbf{V}_0^{-1}\mathbf{V}_{15} & \cdots & -\mathbf{V}_0^{-1}\mathbf{V}_1\end{bmatrix}. \tag{34}$$
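A sketch of the polynomial eigenvalue step, assuming the 9 × 9 coefficient matrices V_0, …, V_16 of Equation (30) are available as NumPy arrays: it assembles the block companion matrix G of Equation (34) for the reversed polynomial in z = 1/y and returns the candidate rotation parameters y. The full companion matrix here is 144 × 144; the size-reduction step of [32], described below, is omitted in this sketch.

```python
import numpy as np

def companion_matrix(V):
    """Block companion matrix G of Equation (34) for the reversed polynomial
    (z^16 V0 + z^15 V1 + ... + V16) X = 0, where V = [V0, ..., V16]."""
    n, deg = 9, 16
    G = np.zeros((n * deg, n * deg))
    # Identity blocks implementing the shift z * (z^m X) = z^(m+1) X.
    G[:n * (deg - 1), n:] = np.eye(n * (deg - 1))
    # Last block row: -V0^{-1} V16, -V0^{-1} V15, ..., -V0^{-1} V1.
    V0_inv = np.linalg.inv(V[0])
    for k in range(deg):
        G[n * (deg - 1):, n * k: n * (k + 1)] = -V0_inv @ V[deg - k]
    return G

def candidate_rotation_parameters(V, tol=1e-8):
    """Real eigenvalues z of G give candidate Cayley parameters y = 1/z."""
    z = np.linalg.eigvals(companion_matrix(V))
    z_real = z[np.abs(z.imag) < tol].real
    z_real = z_real[np.abs(z_real) > tol]   # discard z ~ 0 (y -> infinity)
    return 1.0 / z_real
```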
A method for reducing the size of the polynomial eigenvalue problem is proposed in [32]: zero columns and the corresponding zero rows of matrix G are removed. The eigenvalues of matrix G are then calculated using the Schur decomposition. Once y is known, the translation vector and scale can be obtained through Equation (15). The algorithm flow chart is shown in Figure 3.
Figure 3. Algorithm flow chart.
The method proposed in this paper is suitable for cameras and IMUs that are rigidly connected. The specific steps of the algorithm are as follows: (1) The inputs to the algorithm are the feature point pairs, the pitch and roll angles provided by the IMU, and the calibration parameters. (2) Based on the input data, calculate the coefficients of each element of matrix C in Equation (17) with respect to the variable y. (3) Calculate g1, g2, g3, g4, g5, h1, h2, h3, h4, and h5 according to Equations (20)–(23). (4) Calculate matrix G according to Equation (34). (5) The eigenvalues of matrix G give the rotation parameter y, and thus the rotation matrix. (6) The translation vector and scale are obtained by substituting the rotation matrix into Equation (15).
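Step (6) amounts to a linear least-squares problem: once y (and therefore R_y) is fixed, Equation (15) is linear in [t̃^T, s, 1]^T, so its stacked coefficient matrix has a one-dimensional (approximate) null space. A hedged sketch, assuming a user-supplied routine stack_M(y, data) that returns the N × 5 stacked coefficient matrix of Equation (15):

```python
import numpy as np

def recover_translation_and_scale(stack_M, y, data):
    """Solve Equation (15) for the aligned translation and the scale once
    the rotation parameter y is known. stack_M(y, data) must return the
    N x 5 matrix whose rows are the per-correspondence vectors m_i^T."""
    A = stack_M(y, data)                  # N x 5
    _, _, Vt = np.linalg.svd(A)
    v = Vt[-1]                            # right singular vector of the
                                          # smallest singular value
    v = v / v[-1]                         # enforce the last entry to be 1
    t_tilde, s = v[:3], v[3]
    return t_tilde, s
```

In a full pipeline, one would keep the candidate y whose recovered t̂ gives the smallest cost in Equation (16).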

4. Experiments

In this section, we test the accuracy of the rotation matrix, translation vector, and scale on synthetic and real-world data. Since the solver proposed in this paper is a globally optimal solver with a known vertical direction, we chose Sw [22] for comparison. The method proposed in this paper is called OURs. The rotation, translation, translation direction, and scale errors are defined as follows:
$$\varepsilon_R = \arccos\!\left(\frac{\operatorname{trace}\!\left(\mathbf{R}_{gt}\mathbf{R}^T\right) - 1}{2}\right)$$
$$\varepsilon_t = \frac{2\,\|\mathbf{t}_{gt} - \mathbf{t}\|}{\|\mathbf{t}_{gt}\| + \|\mathbf{t}\|}$$
$$\varepsilon_{t,\mathrm{dir}} = \arccos\!\left(\frac{\mathbf{t}_{gt}^T\,\mathbf{t}}{\|\mathbf{t}_{gt}\|\,\|\mathbf{t}\|}\right)$$
$$\varepsilon_s = \frac{|s - s_{gt}|}{s_{gt}}$$
where R_gt, t_gt, and s_gt are the ground-truth rotation, translation, and scale, respectively, and R, t, and s are the estimated rotation, translation, and scale, respectively.
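The four error metrics can be computed as follows (NumPy sketch mirroring the formulas as reconstructed above; the clipping guards against round-off outside [-1, 1]):

```python
import numpy as np

def rotation_error_deg(R_gt, R):
    """Angular error of the rotation matrix, epsilon_R."""
    c = np.clip((np.trace(R_gt @ R.T) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(c))

def translation_error(t_gt, t):
    """Relative translation magnitude error, epsilon_t."""
    return 2.0 * np.linalg.norm(t_gt - t) / (np.linalg.norm(t_gt) + np.linalg.norm(t))

def translation_direction_error_deg(t_gt, t):
    """Angle between estimated and ground-truth translation directions, epsilon_{t,dir}."""
    c = np.clip(t_gt @ t / (np.linalg.norm(t_gt) * np.linalg.norm(t)), -1.0, 1.0)
    return np.degrees(np.arccos(c))

def scale_error(s_gt, s):
    """Relative scale error, epsilon_s."""
    return abs(s - s_gt) / s_gt
```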

4.1. Experiments on Synthetic Data

To demonstrate the performance of the method proposed in this paper, we first perform experiments on a synthetic dataset. We randomly generate five cameras in space. Three hundred random 3D points are generated within a range of −5 to 5 m (X-axis), −5 to 5 m (Y-axis), and 5–20 m (Z-axis). The resolution of the image is 640 × 480 pixels, and the focal length of the camera is 400 pixels. We assume that the principal point is located at the center of the image at (320, 240) pixels. The scale s is randomly generated in the range [0.5, 2]. Rotation matrices and translation vectors for the five cameras to the reference frame are randomly generated. We establish the frame of reference in the middle of the five cameras. The rotation angle between two adjacent reference frames ranges from −10° to 10°. The direction of the translation vector between two adjacent frames is also randomly generated, and the norm of the translation vector is between 1 and 2 m.
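The synthetic setup described above can be reproduced roughly as follows (Python/NumPy sketch; the bounds match the stated ranges, while the exact sampling of the inter-frame rotation is our assumption since the paper only bounds the rotation angle):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_points(num_points=300):
    """3D points in [-5, 5] x [-5, 5] x [5, 20] metres."""
    return rng.uniform([-5.0, -5.0, 5.0], [5.0, 5.0, 20.0], size=(num_points, 3))

def random_relative_motion(max_angle_deg=10.0):
    """Relative motion between two frames: rotation angles bounded by 10 degrees,
    translation norm in [1, 2] m, scale in [0.5, 2]."""
    rx, ry, rz = np.radians(rng.uniform(-max_angle_deg, max_angle_deg, size=3))
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    R = Rz @ Ry @ Rx
    direction = rng.normal(size=3)
    t = direction / np.linalg.norm(direction) * rng.uniform(1.0, 2.0)
    s = rng.uniform(0.5, 2.0)
    return R, t, s

def add_image_noise(uv_pixels, sigma_px=1.0):
    """Gaussian noise (in pixels) added to projected image points."""
    return uv_pixels + rng.normal(scale=sigma_px, size=uv_pixels.shape)
```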
Parameter sensitivity analysis: Because the solver proposed in this paper uses non-minimal samples to estimate the relative pose and scale of the camera, we analyze the accuracy of the estimates as the number of points varies. We set the image noise to 1 pixel. The noise of the pitch angle and the noise of the roll angle are both zero degrees. Figure 4 shows the accuracy of rotation, translation, and scale estimation by OURs with different numbers of feature points. As shown in Figure 4, the rotation error, translation error, and scale error estimated by the OURs method all decrease as the number of points increases.
Figure 4. Effect of the number of feature points on the accuracy of rotation, translation, and scale estimation by the method proposed in this paper. (a) Rotation error (degree); (b) translation error (degree); (c) translation error; (d) scale error.
Noise resilience: The Sw and OURs solvers are tested with added image noise under four motions: random motion ( t = [t_x t_y t_z]^T ), planar motion ( t = [t_x 0 t_z]^T ), forward motion ( t = [0 0 t_z]^T ), and sideways motion ( t = [t_x 0 0]^T ). We add image noise ranging from 0 to 1 pixel. IMU noise is simulated in the synthetic data by adding noise to the pitch and roll angles; the image noise is fixed at 1 pixel when noise is added to the roll and pitch angles, and the maximum noise of the roll and pitch angles is 0.5 degrees. Figure 5 shows the error values of the rotation matrix, translation vector, and scale calculated by OURs and Sw when image noise, pitch angle noise, and roll angle noise are added under random motion. Figure 6 shows the corresponding results under planar motion, Figure 7 under sideways motion, and Figure 8 under forward motion. In each figure, the second column shows the results with pitch angle noise and the third column shows the results with roll angle noise; the first row represents the rotation matrix error, the second and third rows represent the translation vector errors, and the fourth row represents the scale error. From Figure 5, Figure 6, Figure 7 and Figure 8, it can be observed that the method proposed in this paper yields rotation matrix errors, translation vector errors, and scale errors smaller than those estimated by the Sw method. The effectiveness of the proposed method is demonstrated through simulated data.
Figure 5. Estimating errors in the rotation matrix, translation vector, and scale information under random motion. The first column shows the calculation results of adding image noise. The second column shows the calculation results of adding pitch angle noise. The third column shows the calculation results of adding roll angle noise. The first, second, third and fourth rows represent the values of ε R , ε t , ε t , dir and ε s respectively.
Figure 6. Estimating errors in the rotation matrix, translation vector, and scale information under planar motion. The first column shows the calculation results of adding image noise. The second column shows the calculation results of adding pitch angle noise. The third column shows the calculation results of adding roll angle noise. The first, second, third and fourth rows represent the values of ε R , ε t , ε t , dir and ε s respectively.
Figure 7. Estimating errors in the rotation matrix, translation vector, and scale information under sideways motion. The first column shows the calculation results of adding image noise. The second column shows the calculation results of adding pitch angle noise. The third column shows the calculation results of adding roll angle noise. The first, second, third and fourth rows represent the values of ε R , ε t , ε t , dir and ε s respectively.
Figure 8. Estimating errors in the rotation matrix, translation vector, and scale information under forward motion. The first column shows the calculation results of adding image noise. The second column shows the calculation results of adding pitch angle noise. The third column shows the calculation results of adding roll angle noise. The first, second, third and fourth rows represent the values of ε R , ε t , ε t , dir and ε s respectively.
Random motion: When the image noise is 1 pixel, the average rotation matrix error calculated by the OURs method is 0.011°, with a median of 0.008° and a standard deviation of 0.010. The average rotation matrix error calculated by the Sw method is 0.197°, with a median of 0.06° and a standard deviation of 0.559. The average translation vector error calculated by the OURs method is 0.249°, with a median of 0.156° and a standard deviation of 0.374. The average translation vector error calculated by the Sw method is 6.927°, with a median of 1.323° and a standard deviation of 14.908. The average scale error calculated by the OURs method is 0.005, with a median of 0.004 and a standard deviation of 0.005. The average scale error calculated by the Sw method is 0.031, with a median of 0.015 and a standard deviation of 0.050. When the pitch noise is 0.5°, the average rotation matrix error calculated by the OURs method is 0.164°, with a median of 0.089° and a standard deviation of 0.224. The average rotation matrix error calculated by the Sw method is 0.624°, with a median of 0.245° and a standard deviation of 1.321. The average translation vector error calculated by the OURs method is 1.397°, with a median of 0.196° and a standard deviation of 4.239. The average translation vector error calculated by the Sw method is 5.562°, with a median of 1.099° and a standard deviation of 12.152. The average scale error calculated by the OURs method is 0.010, with a median of 0.006 and a standard deviation of 0.014. The average scale error calculated by the Sw method is 0.045, with a median of 0.025 and a standard deviation of 0.068. When the roll noise is 0.5°, the average rotation matrix error calculated by the OURs method is 0.183°, with a median of 0.103° and a standard deviation of 0.230. The average rotation matrix error calculated by the Sw method is 0.591°, with a median of 0.253° and a standard deviation of 1.136. The average translation vector error calculated by the OURs method is 2.051°, with a median of 0.221° and a standard deviation of 8.799. The average translation vector error calculated by the Sw method is 5.595°, with a median of 1.021° and a standard deviation of 12.733. The average scale error calculated by the OURs method is 0.010, with a median of 0.006 and a standard deviation of 0.012. The average scale error calculated by the Sw method is 0.037, with a median of 0.019 and a standard deviation of 0.052.
Planar motion: When the image noise is 1 pixel, the average rotation matrix error calculated by the OURs method is 0.015°, with a median of 0.012° and a standard deviation of 0.013. The average rotation matrix error calculated by the Sw method is 0.368°, with a median of 0.097° and a standard deviation of 1.134. The average translation vector error calculated by the OURs method is 0.194°, with a median of 0.093° and a standard deviation of 0.722. The average translation vector error calculated by the Sw method is 4.864°, with a median of 0.802° and a standard deviation of 12.211. The average scale error calculated by the OURs method is 0.005, with a median of 0.004 and a standard deviation of 0.005. The average scale error calculated by the Sw method is 0.0374, with a median of 0.016 and a standard deviation of 0.069. When the pitch noise is 0.5°, the average rotation matrix error calculated by the OURs method is 0.127°, with a median of 0.071° and a standard deviation of 0.156. The average rotation matrix error calculated by the Sw method is 0.883°, with a median of 0.357° and a standard deviation of 1.644. The average translation vector error calculated by the OURs method is 4.802°, with a median of 0.838° and a standard deviation of 11.473. The average translation vector error calculated by the Sw method is 7.813°, with a median of 1.847° and a standard deviation of 14.928. The average scale error calculated by the OURs method is 0.015, with a median of 0.008 and a standard deviation of 0.019. The average scale error calculated by the Sw method is 0.049, with a median of 0.030 and a standard deviation of 0.059. When the roll noise is 0.5°, the average rotation matrix error calculated by the OURs method is 0.156°, with a median of 0.090° and a standard deviation of 0.182. The average rotation matrix error calculated by the Sw method is 0.931°, with a median of 0.353° and a standard deviation of 1.897. The average translation vector error calculated by the OURs method is 5.271°, with a median of 1.006° and a standard deviation of 12.505. The average translation vector error calculated by the Sw method is 6.234°, with a median of 1.762° and a standard deviation of 12.573. The average scale error calculated by the OURs method is 0.019, with a median of 0.008 and a standard deviation of 0.020. The average scale error calculated by the Sw method is 0.046, with a median of 0.026 and a standard deviation of 0.065.
Sideways motion: When the image noise is 1 pixel, the average rotation matrix error calculated by the OURs method is 0.015°, with a median of 0.011° and a standard deviation of 0.012. The average rotation matrix error calculated by the Sw method is 0.295°, with a median of 0.093° and a standard deviation of 0.797. The average translation vector error calculated by the OURs method is 0.250°, with a median of 0.151° and a standard deviation of 0.373. The average translation vector error calculated by the Sw method is 6.638°, with a median of 1.231° and a standard deviation of 13.752. The average scale error calculated by the OURs method is 0.005, with a median of 0.004 and a standard deviation of 0.005. The average scale error calculated by the Sw method is 0.0374, with a median of 0.016 and a standard deviation of 0.076. When the pitch noise is 0.5°, the average rotation matrix error calculated by the OURs method is 0.160°, with a median of 0.094° and a standard deviation of 0.185. The average rotation matrix error calculated by the Sw method is 1.019°, with a median of 0.441° and a standard deviation of 1.708. The average translation vector error calculated by the OURs method is 9.559°, with a median of 2.765° and a standard deviation of 17.419. The average translation vector error calculated by the Sw method is 9.693°, with a median of 3.385° and a standard deviation of 15.701. The average scale error calculated by the OURs method is 0.020, with a median of 0.012 and a standard deviation of 0.024. The average scale error calculated by the Sw method is 0.044, with a median of 0.026 and a standard deviation of 0.058. When the roll noise is 0.5°, the average rotation matrix error calculated by the OURs method is 0.169°, with a median of 0.099° and a standard deviation of 0.198. The average rotation matrix error calculated by the Sw method is 0.732°, with a median of 0.274° and a standard deviation of 1.365. The average translation vector error calculated by the OURs method is 8.354°, with a median of 2.348° and a standard deviation of 12.123. The average translation vector error calculated by the Sw method is 10.764°, with a median of 3.230° and a standard deviation of 19.194. The average scale error calculated by the OURs method is 0.020, with a median of 0.010 and a standard deviation of 0.026. The average scale error calculated by the Sw method is 0.045, with a median of 0.026 and a standard deviation of 0.062.
Forward motion: When the image noise is 1 pixel, the average rotation matrix error calculated by the OURs method is 0.016°, with a median of 0.013° and a standard deviation of 0.015. The average rotation matrix error calculated by the Sw method is 0.421°, with a median of 0.135° and a standard deviation of 1.215. The average translation vector error calculated by the OURs method is 0.321°, with a median of 0.149° and a standard deviation of 1.325. The average translation vector error calculated by the Sw method is 7.156°, with a median of 1.27° and a standard deviation of 15.035. The average scale error calculated by the OURs method is 0.005, with a median of 0.004 and a standard deviation of 0.005. The average scale error calculated by the Sw method is 0.036, with a median of 0.016 and a standard deviation of 0.068. When the pitch noise is 0.5°, the average rotation matrix error calculated by the OURs method is 0.082°, with a median of 0.048° and a standard deviation of 0.104. The average rotation matrix error calculated by the Sw method is 0.965°, with a median of 0.411° and a standard deviation of 1.614. The average translation vector error calculated by the OURs method is 2.195°, with a median of 0.589° and a standard deviation of 5.921. The average translation vector error calculated by the Sw method is 9.442°, with a median of 2.364° and a standard deviation of 17.648. The average scale error calculated by the OURs method is 0.012, with a median of 0.007 and a standard deviation of 0.015. The average scale error calculated by the Sw method is 0.049, with a median of 0.029 and a standard deviation of 0.060. When the roll noise is 0.5°, the average rotation matrix error calculated by the OURs method is 0.229°, with a median of 0.141° and a standard deviation of 0.260. The average rotation matrix error calculated by the Sw method is 1.068°, with a median of 0.537° and a standard deviation of 1.639. The average translation vector error calculated by the OURs method is 11.024°, with a median of 2.628° and a standard deviation of 22.204. The average translation vector error calculated by the Sw method is 9.431°, with a median of 3.476° and a standard deviation of 14.699. The average scale error calculated by the OURs method is 0.020, with a median of 0.012 and a standard deviation of 0.024. The average scale error calculated by the Sw method is 0.046, with a median of 0.026 and a standard deviation of 0.068.

4.2. Experiments on Real-World Data

To further validate the effectiveness of the proposed method, the KITTI dataset was chosen for real-data evaluation [34]. The KITTI dataset was jointly created by the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago. It is currently one of the largest computer vision benchmark datasets for autonomous driving scenarios. The raw dataset is divided into the categories ‘Road’, ‘City’, ‘Residential’, ‘Person’, and ‘Campus’. The car is equipped with GPS, an IMU, one 64-line 3D LiDAR, and two grayscale cameras. The KITTI dataset provides ground truth for 11 sequences (00–10). The pitch angle and roll angle can be extracted from the IMU sensor data. The intrinsic parameters of the cameras and the rotation and translation between the two cameras and the reference frame are given in the data documentation [34]. We utilized the SIFT algorithm to obtain corresponding feature point pairs. Figure 9 shows the results of feature extraction and matching using SIFT on the KITTI dataset; for clarity, we selected one out of every five pairs of matched points for display.
Figure 9. Test image pair from KITTI dataset with feature detection.
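Feature extraction and matching as described above can be reproduced with OpenCV's SIFT implementation; a minimal sketch (the ratio-test threshold is our choice and is not taken from the paper):

```python
import cv2

def match_sift(img1_path, img2_path, ratio=0.8):
    """Detect SIFT keypoints in two grayscale KITTI images and return the
    matched pixel coordinates after Lowe's ratio test."""
    img1 = cv2.imread(img1_path, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(img2_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < ratio * n.distance]
    pts1 = [kp1[m.queryIdx].pt for m in good]
    pts2 = [kp2[m.trainIdx].pt for m in good]
    return pts1, pts2
```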
Table 2 shows the error results of the rotation matrix, translation vector, and scale estimated by the OURs method and the Sw method on the KITTI dataset. From Table 2, we can see that the accuracy of the rotation matrix, translation vector, and scale estimated by the OURs method is significantly better than that of the Sw method. Table 3 shows the percentage by which the error calculated by the OURs method is reduced compared to the error calculated by the Sw method. For the rotation matrix, the OURs method computes the error on average about 53% less than the Sw method. The maximum error of the OURs method decreased by 71%, and the minimum decreased by 32%. For the translation vector, the OURs method computes the error on average about 67% less than the Sw method. The maximum error of the OURs method decreased by 73%, and the minimum decreased by 61%. For the scale, the OURs method computes the error on average about 90% less than the Sw method. The maximum error of the OURs method decreased by 95%, and the minimum decreased by 80%. The standard deviation of the rotation matrix estimated by the OURs method is 65% less than that of the Sw method. The standard deviation of the translation vector estimated by the OURs method is 51% less than that of the Sw method. The standard deviation of the scale estimated by the OURs method is 94% less than that of the Sw method.
Table 2. Rotation, translation, and scale error on KITTI sequence by the Sw method and the OURs method.
Table 3. The percentage by which the error calculated by the OURs method is reduced compared to the error calculated by the Sw method.
To further verify the performance of the proposed method, we added the Kneip method [23], a recent method for estimating relative pose and scale from 2D-2D correspondences only, without the use of an IMU. This method solves for the relative pose and scale using a non-minimal number of samples and requires at least eight point correspondences. The Kneip method transforms relative pose and scale estimation into a symmetric eigenvalue problem and uses a simple heuristic global energy minimization scheme based on local minimum suppression. We also added the Peter method [35], an orthogonal Procrustes approach that requires a minimum of three 3D-3D correspondences to find the similarity transformation. The 3D points are obtained by triangulation in each view.
Table 4 shows the rotation matrix errors, translation vector errors, and scale errors calculated by the Kneip method and the Peter method. As can be seen from Table 4, the average value, maximum value, minimum value, standard deviation, and root mean square error of the rotation matrix error estimated by the Kneip method are 0.8052°, 1.0643°, 0.5956°, 0.1340, and 0.8163. The average value, maximum value, minimum value, standard deviation, and root mean square error of the translation vector error estimated by the Kneip method are 5.1662°, 6.4101°, 4.2787°, 0.5699, and 5.1975. The average value, maximum value, minimum value, standard deviation, and root mean square error of the scale error estimated by the Kneip method are 0.0509, 0.0957, 0.0131, 0.0322, and 0.0603. The average value, maximum value, minimum value, standard deviation, and root mean square error of the rotation matrix error estimated by the Peter method are 1.8892°, 2.0592°, 1.7606°, 0.0894, and 1.8913. The average value, maximum value, minimum value, standard deviation, and root mean square error of the translation vector error estimated by the Peter method are 6.7648°, 9.8001°, 5.2621°, 1.2423, and 6.8779. The average value, maximum value, minimum value, standard deviation, and root mean square error of the scale error estimated by the Peter method are 0.0963, 0.1254, 0.0712, 0.0180, and 0.0979.
Table 4. Rotation, translation, and scale error on KITTI sequence by the Kneip method and the Peter method.

4.3. Discussion

The method proposed in this paper estimates the relative pose (4 DOFs) and scale from image correspondences. Therefore, the Sw method was chosen as the comparison method for the simulation experiments. From Figure 5, Figure 6, Figure 7 and Figure 8, it can be observed that the method proposed in this paper yields rotation matrix errors, translation vector errors, and scale errors smaller than those estimated by the Sw method, demonstrating its effectiveness on simulated data. In the experiments on real data, to further verify the performance of the proposed method, we added the Kneip method and the Peter method as comparison methods. The errors of the rotation matrix, translation vector, and scale calculated by our method are all smaller than those of the comparison methods. The root mean square error of the rotation matrix estimated by the OURs method is 56% less than that of the Sw method, 90% less than that of the Kneip method, and 96% less than that of the Peter method. The root mean square error of the translation vector estimated by the OURs method is 67% less than that of the Sw method, 77% less than that of the Kneip method, and 82% less than that of the Peter method. The root mean square error of the scale estimated by the OURs method is 92% less than that of the Sw method, 98% less than that of the Kneip method, and 99% less than that of the Peter method.

5. Conclusions

We propose a new globally optimal solver to estimate the relative pose and scale of a multi-camera system from only image correspondences with a known vertical direction. First, we transformed the problem into a cost function that minimizes the algebraic error in the least-squares sense. Based on the characteristic polynomial and the condition that its first derivative equals zero, two independent polynomial equations in two unknowns were derived; both are expressed in terms of the rotation angle parameter. We utilized the polynomial eigenvalue method to solve for the rotation angle parameter, and the translation vector and scale were obtained from the corresponding eigenvectors. We demonstrated the superiority of the proposed method in relative pose and scale estimation on synthetic data and the KITTI dataset. Compared to the best method among the comparison methods, the method proposed in this paper reduced the rotation matrix error, translation vector error, and scale error by 53%, 67%, and 90%, respectively.
From Equation (14), it can be observed that the error sources of the proposed algorithm mainly come from the accuracy of feature extraction and matching, as well as the accuracy of the IMU. The measurement accuracy of the IMU is determined by the gyroscope. Our next task is to study how to enhance the accuracy of feature extraction and matching.

Author Contributions

Conceptualization, Zhenbao Yu and Shirong Ye; methodology, Zhenbao Yu and Pengfei Xia; software, Zhenbao Yu and Ronghe Jin; validation, Zhenbao Yu, Shirong Ye, Kang Yan, and Ronghe Jin; formal analysis, Zhenbao Yu and Changwei Liu; investigation, Zhenbao Yu and Changwei Liu; resources, Zhenbao Yu, Shirong Ye, Kang Yan, and Changwei Liu; data curation, Shirong Ye and Changwei Liu; writing—original draft preparation, Zhenbao Yu and Ronghe Jin; writing—review and editing, Zhenbao Yu and Pengfei Xia; visualization, Zhenbao Yu, Changwei Liu, Kang Yan, and Pengfei Xia; supervision, Zhenbao Yu, Shirong Ye, and Ronghe Jin; project administration, Zhenbao Yu, Changwei Liu, Kang Yan, and Ronghe Jin; funding acquisition, Shirong Ye. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 41974031), the National Key Research and Development Program of China (No. 2019YFC1509603), and the Ministry of Industry and Information Technology of China through the High Precision Timing Service Project under Grant TC220A04A-80.

Data Availability Statement

The original contributions presented in this study are included in this article; further inquiries can be directed to the corresponding author.

Acknowledgments

We would like to thank the editor and anonymous reviewers for their constructive comments and suggestions for improving this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, J.; Xu, L.; Bao, C. An Adaptive Pose Fusion Method for Indoor Map Construction. ISPRS Int. J. Geo-Inf. 2021, 10, 800. [Google Scholar] [CrossRef]
  2. Svärm, L.; Enqvist, O.; Kahl, F.; Oskarsson, M. City-scale localization for cameras with known vertical direction. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1455–1461. [Google Scholar] [CrossRef] [PubMed]
  3. Li, C.; Zhou, L.; Chen, W. Automatic Pose Estimation of Uncalibrated Multi-View Images Based on a Planar Object with a Predefined Contour Model. ISPRS Int. J. Geo-Inf. 2016, 5, 244. [Google Scholar] [CrossRef]
  4. Raposo, C.; Barreto, J.P. Theory and practice of structure-from-motion using affine correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5470–5478. [Google Scholar]
  5. Wang, Y.; Liu, X.; Zhao, M.; Xu, X. VIS-SLAM: A Real-Time Dynamic SLAM Algorithm Based on the Fusion of Visual, Inertial, and Semantic Information. ISPRS Int. J. Geo-Inf. 2024, 13, 163. [Google Scholar] [CrossRef]
  6. Qin, J.; Li, M.; Liao, X.; Zhong, J. Accumulative Errors Optimization for Visual Odometry of ORB-SLAM2 Based on RGB-D Cameras. ISPRS Int. J. Geo-Inf. 2019, 8, 581. [Google Scholar] [CrossRef]
  7. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  8. Niu, Q.; Li, M.; He, S.; Gao, C.; Gary Chan, S.H.; Luo, X. Resource-efficient and Automated Image-based Indoor Localization. ACM Trans. Sens. Netw. 2019, 15, 19. [Google Scholar] [CrossRef]
  9. Poulose, A.; Han, D.S. Hybrid Indoor Localization on Using IMU Sensors and Smartphone Camera. Sensors 2019, 19, 5084. [Google Scholar] [CrossRef] [PubMed]
  10. Kawaji, H.; Hatada, K.; Yamasaki, T.; Aizawa, K. Image-based indoor positioning system: Fast image matching using omnidirectional panoramic images. In Proceedings of the 1st ACM International Workshop on Multimodal Pervasive Video Analysis, Firenze, Italy, 25–29 October 2010. [Google Scholar]
  11. Pless, R. Using many cameras as one. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison, WI, USA, 18–20 June 2003; Volume 2, p. II-587. [Google Scholar]
  12. Stewénius, H.; Oskarsson, M.; Åström, K.; Nistér, D. Solutions to minimal generalized relative pose problems. In Proceedings of the Workshop on Omnidirectional Vision, Beijing, China, October 2005. [Google Scholar]
  13. Hee Lee, G.; Pollefeys, M.; Fraundorfer, F. Relative pose estimation for a multi-camera system with known vertical direction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 540–547. [Google Scholar]
  14. Liu, L.; Li, H.; Dai, Y.; Pan, Q. Robust and efficient relative pose with a multi-camera system for autonomous driving in highly dynamic environments. IEEE Trans. Intell. Transp. 2017, 19, 2432–2444. [Google Scholar] [CrossRef]
  15. Sweeney, C.; Flynn, J.; Turk, M. Solving for relative pose with a partially known rotation is a quadratic eigenvalue problem. In Proceedings of the 2014 2nd International Conference on 3D Vision, Tokyo, Japan, 8–11 December 2014; Volume 1, pp. 483–490. [Google Scholar]
  16. Li, H.; Hartley, R.; Kim, J. A linear approach to motion estimation using generalized camera models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  17. Kneip, L.; Li, H. Efficient computation of relative pose for multi-camera systems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 446–453. [Google Scholar]
  18. Wu, Q.; Ding, Y.; Qi, X.; Xie, J.; Yang, J. Globally optimal relative pose estimation for multi-camera systems with known gravity direction. In Proceedings of the International Conference on Robotics and Automation, Philadelphia, PA, USA, 23–27 May 2022; pp. 2935–2941. [Google Scholar]
  19. Guan, B.; Zhao, J.; Barath, D.; Fraundorfer, F. Minimal cases for computing the generalized relative pose using affine correspondences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6068–6077. [Google Scholar]
  20. Guan, B.; Zhao, J.; Barath, D.; Fraundorfer, F. Relative pose estimation for multi-camera systems from affine correspondences. arXiv 2020, arXiv:2007.10700v1. [Google Scholar]
  21. Kukelova, Z.; Bujnak, M.; Pajdla, T. Closed-form solutions to minimal absolute pose problems with known vertical direction. In Proceedings of the Asian Conference on Computer Vision, Queenstown, New Zealand, 8–12 November 2010; pp. 216–229. [Google Scholar]
  22. Sweeney, C.; Kneip, L.; Hollerer, T.; Turk, M. Computing similarity transformations from only image correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3305–3313. [Google Scholar]
  23. Kneip, L.; Sweeney, C.; Hartley, R. The generalized relative pose and scale problem: View-graph fusion via 2D-2D registration. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision, Lake Placid, NY, USA, 7–10 March 2016; pp. 1–9. [Google Scholar]
  24. Grossberg, M.D.; Nayar, S.K. A general imaging model and a method for finding its parameters. In Proceedings of the Eighth IEEE International Conference on Computer Vision, Vancouver, BC, Canada, 7–14 July 2001; Volume 2, pp. 108–115. [Google Scholar]
  25. Kim, J.H.; Li, H.; Hartley, R. Motion estimation for nonoverlap multicamera rigs: Linear algebraic and L∞ geometric solutions. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1044–1059. [Google Scholar]
  26. Guan, B.; Zhao, J.; Barath, D.; Fraundorfer, F. Minimal solvers for relative pose estimation of multi-camera systems using affine correspondences. Int. J. Comput. Vision 2023, 131, 324–345. [Google Scholar] [CrossRef]
  27. Zhao, J.; Xu, W.; Kneip, L. A certifiably globally optimal solution to generalized essential matrix estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12034–12043. [Google Scholar]
  28. Sweeney, C.; Fragoso, V.; Höllerer, T.; Turk, M. gDLS: A scalable solution to the generalized pose and scale problem. In Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 16–31. [Google Scholar]
  29. Bujnak, M.; Kukelova, Z.; Pajdla, T. 3d reconstruction from image collections with a single known focal length. In Proceedings of the IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 1803–1810. [Google Scholar]
  30. Fitzgibbon, A.W. Simultaneous linear estimation of multiple view geometry and lens distortion. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA, 8–14 December 2001; Volume 1, p. I. [Google Scholar]
  31. Ding, Y.; Yang, J.; Kong, H. An efficient solution to the relative pose estimation with a common direction. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation, Paris, France, 31 May–31 August 2020; pp. 11053–11059. [Google Scholar]
  32. Kukelova, Z.; Bujnak, M.; Pajdla, T. Polynomial eigenvalue solutions to minimal problems in computer vision. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 1381–1393. [Google Scholar] [CrossRef]
  33. Larsson, V.; Astrom, K.; Oskarsson, M. Efficient solvers for minimal problems by syzygy-based reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 820–829. [Google Scholar]
  34. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  35. Schonemann, P. A generalized solution of the orthogonal Procrustes problem. Psychometrika 1966, 31, 1–10. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
