2.1. The Principle of GVO
The principle of the proposed method is shown in Figure 1, where the trajectories of the vehicle are simplified from 3D (three-dimensional) space to 2D (two-dimensional) space to aid understanding: the height information of the 3D trajectories is omitted, and the trajectories are drawn on a 2D plane. In Figure 1, the light blue lines indicate the trajectories measured by GPS, the red lines represent the motion trajectories estimated by VO, and the broken lines in yellow and dark blue are the attitudes estimated by VO. The trajectories of the vehicle estimated by GPS and the visual odometer are denoted by $\{\mathbf{p}^{G}_{n}\}$ and $\{\mathbf{p}^{V}_{n}\}$, respectively, where $n$ is the index of the trajectory points. As shown in Figure 1, the GPS and the visual odometer first estimate the flight trajectory separately. Secondly, the trajectories estimated by the visual odometer and GPS are transformed into the same coordinate system. Finally, we assume that the error of the GPS measurements $\mathbf{p}^{G}_{n}$ is unbiased white noise [29] and that there is an accumulative error in the positions $\mathbf{p}^{V}_{n}$ estimated by VO. The GPS positioning results are therefore taken as measurements of the true position to correct the estimations of VO, and the scale factor between $\{\mathbf{p}^{G}_{n}\}$ and $\{\mathbf{p}^{V}_{n}\}$ is derived. Considering that the camera poses obtained by VO contain attitude information, the attitude of the vehicle can be estimated as well.
2.2. The Position and Attitude Estimation
The update frequency of GPS is usually 5–10 Hz, while the computational frequency of VO is 25–30 Hz, several times the update frequency of GPS. It is therefore necessary to use a hardware or software alignment method to synchronize the poses obtained by VO with the measurements of GPS.
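As a concrete illustration of software alignment, the sketch below pairs each GPS measurement with the nearest VO frame by timestamp. This is a hypothetical nearest-neighbor scheme, not necessarily the one used in the paper; the array names and the `max_dt` tolerance are assumptions.

```python
import numpy as np

def align_gps_to_vo(gps_t, vo_t, max_dt=0.05):
    """For each GPS timestamp, find the nearest VO frame timestamp.

    Returns (gps_idx, vo_idx) pairs whose time difference is below
    max_dt seconds; GPS samples without a close VO frame are dropped.
    """
    gps_t = np.asarray(gps_t)
    vo_t = np.asarray(vo_t)
    pairs = []
    for i, t in enumerate(gps_t):
        j = int(np.argmin(np.abs(vo_t - t)))  # nearest VO frame index
        if abs(vo_t[j] - t) <= max_dt:
            pairs.append((i, j))
    return pairs

# GPS at 5 Hz, VO at 25 Hz over one second
gps_t = np.arange(0.0, 1.0, 0.2)
vo_t = np.arange(0.0, 1.0, 0.04)
print(align_gps_to_vo(gps_t, vo_t))
```

Every fifth VO frame lines up with a GPS sample here; with real clocks the tolerance absorbs jitter between the two sensors.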
Let $\mathbf{p}^{G}_{k}$ be the position measurement of GPS, which has been synchronized with VO at frame $k$, and let $\mathbf{T}_{k-1,k}$ be the transform matrix estimated by VO between frames $k-1$ and $k$. For the sake of brevity, 3D coordinates such as $\mathbf{p}^{V}_{k}$ are automatically converted to homogeneous coordinates $[\mathbf{p}^{V\top}_{k},\,1]^{\top}$ when multiplied with the transform matrix $\mathbf{T}_{k-1,k}$. $\mathbf{T}_{k-1,k}$ is given as follows,

$$\mathbf{T}_{k-1,k} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^{\top} & 1 \end{bmatrix} \quad (1)$$

where $\mathbf{T}_{k-1,k} \in SE(3)$, $\mathbf{R} \in SO(3)$ is the rotation matrix, $SO(3)$ is the special orthogonal group, and $\mathbf{t}$ is the translation vector.
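Assuming the $4\times 4$ homogeneous form just described, the inter-frame transforms $\mathbf{T}_{k-1,k}$ can be chained by matrix multiplication to accumulate the camera pose; a minimal numpy sketch (function names are illustrative):

```python
import numpy as np

def make_T(R, t):
    """Assemble a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def accumulate(T_list):
    """Chain inter-frame transforms T_{k-1,k} into the pose of the last frame."""
    T = np.eye(4)
    for Tk in T_list:
        T = T @ Tk
    return T

# two unit translations along x compose to a translation of 2
T1 = make_T(np.eye(3), [1.0, 0.0, 0.0])
pose = accumulate([T1, T1])
print(pose[:3, 3])
```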
The position measurements of GPS are usually given in geographic coordinates: longitude $\lambda$, latitude $\varphi$, and height $h$. In order to obtain the attitude angles of pitch and roll, the East-North-Up (ENU) coordinate system is selected as the intermediate coordinate system to fuse the measurements of GPS and VO, where the transformations between geographic coordinates and the ENU frame are given in [30]. After time alignment between GPS and VO, the GPS coordinates are transformed into the ENU frame as $\mathbf{p}^{G}_{k}$ to be integrated with VO.
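The geodetic-to-ENU transformation referenced from [30] can be sketched as follows, going through Earth-Centered Earth-Fixed (ECEF) coordinates under WGS-84 parameters (angles in radians; this is a standard formulation, not code from the paper):

```python
import numpy as np

A = 6378137.0               # WGS-84 semi-major axis [m]
F = 1.0 / 298.257223563     # WGS-84 flattening
E2 = F * (2.0 - F)          # first eccentricity squared

def geodetic_to_ecef(lat, lon, h):
    """Latitude/longitude in radians and height in metres -> ECEF coordinates."""
    n = A / np.sqrt(1.0 - E2 * np.sin(lat) ** 2)  # prime vertical radius
    x = (n + h) * np.cos(lat) * np.cos(lon)
    y = (n + h) * np.cos(lat) * np.sin(lon)
    z = (n * (1.0 - E2) + h) * np.sin(lat)
    return np.array([x, y, z])

def geodetic_to_enu(lat, lon, h, lat0, lon0, h0):
    """ENU coordinates of (lat, lon, h) relative to reference (lat0, lon0, h0)."""
    d = geodetic_to_ecef(lat, lon, h) - geodetic_to_ecef(lat0, lon0, h0)
    sl, cl = np.sin(lon0), np.cos(lon0)
    sp, cp = np.sin(lat0), np.cos(lat0)
    # rotation from ECEF into the local ENU frame at the reference point
    R = np.array([[-sl, cl, 0.0],
                  [-sp * cl, -sp * sl, cp],
                  [cp * cl, cp * sl, sp]])
    return R @ d

ref = (np.radians(30.0), np.radians(120.0), 50.0)
print(geodetic_to_enu(*ref, *ref))   # the reference point maps to the origin
```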
As the relationship between the initial local frame of VO and the ENU frame is fixed, the transformation matrix between them is constant, and the position and attitude of the camera in the ENU frame can be obtained from the poses estimated by VO. The transformation between the measurements of VO and GPS in the ENU frame can be represented by,

$$\mathbf{p}^{V}_{k} - \mathbf{p}^{V}_{0} = \mathbf{S}\,\mathbf{R}^{\top}\left(\mathbf{p}^{G}_{k} - \mathbf{p}^{G}_{0}\right) \quad (2)$$

where $\mathbf{p}^{V}_{k}$ is the position in the local frame of VO at frame $k$, $\mathbf{p}^{V}_{0}$ is the origin of the local frame of VO, and $\mathbf{p}^{G}_{0}$ is the origin of the GPS measurements in the ENU frame. $\mathbf{R}$ is the rotation matrix from the local frame of VO to the ENU frame, and $\mathbf{S} = \mathrm{diag}(s_{e}, s_{n}, s_{u})$ is a diagonal matrix representing the scale factors from the ENU frame to the local frame of VO in the east, north, and up directions, respectively.
To simplify Equation (2), $\mathbf{M}$ and $\mathbf{b}$ are defined as,

$$\mathbf{M} = \mathbf{S}\,\mathbf{R}^{\top} \quad (3)$$
$$\mathbf{b} = \mathbf{p}^{V}_{0} - \mathbf{M}\,\mathbf{p}^{G}_{0} \quad (4)$$

so that (2) becomes

$$\mathbf{p}^{V}_{k} = \mathbf{M}\,\mathbf{p}^{G}_{k} + \mathbf{b} \quad (5)$$

For the convenience of calculation, the elements in (5) are moved to the left side as given below,

$$\mathbf{M}\,\mathbf{p}^{G}_{k} + \mathbf{b} - \mathbf{p}^{V}_{k} = \mathbf{0} \quad (6)$$

However, there is usually no analytic solution for $\mathbf{M}$ and $\mathbf{b}$ due to the white noise in the GPS measurements and the accumulative error of VO, so the least-squares method is used to solve this equation. To obtain $\mathbf{M}$ and $\mathbf{b}$, define the solution vector $\mathbf{x}$ as follows,

$$\mathbf{x} = \left[m_{11}, m_{12}, m_{13}, m_{21}, \ldots, m_{33}, b_{1}, b_{2}, b_{3}\right]^{\top} \quad (7)$$

When the measurements of GPS and VO from frame 1 to frame $k$ are introduced into (6), it can be rewritten as,

$$\mathbf{A}\,\mathbf{x} = \mathbf{y} \quad (8)$$

where $\mathbf{A} \in \mathbb{R}^{3k \times 12}$ stacks the GPS measurements and $\mathbf{y} \in \mathbb{R}^{3k}$ stacks the VO positions. As shown in (6), each pair of positioning results of GPS and VO gives three equations, while there are 12 unknown variables. This over-determined system can therefore be solved using more than four pairs of positioning results of GPS and VO, and the solution vector $\hat{\mathbf{x}}$ can be obtained after the fourth frame of the input image sequence. Then, the estimated matrices $\hat{\mathbf{M}}$ and $\hat{\mathbf{b}}$ are recovered from $\hat{\mathbf{x}}$ as follows,

$$\hat{\mathbf{M}} = \begin{bmatrix} \hat{m}_{11} & \hat{m}_{12} & \hat{m}_{13} \\ \hat{m}_{21} & \hat{m}_{22} & \hat{m}_{23} \\ \hat{m}_{31} & \hat{m}_{32} & \hat{m}_{33} \end{bmatrix}, \quad \hat{\mathbf{b}} = \left[\hat{b}_{1}, \hat{b}_{2}, \hat{b}_{3}\right]^{\top} \quad (9)$$
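A minimal sketch of this linear least-squares step, assuming the affine relation $\mathbf{p}^{V} = \mathbf{M}\mathbf{p}^{G} + \mathbf{b}$ with the 12 unknowns stacked as the entries of $\mathbf{M}$ followed by $\mathbf{b}$ (synthetic data; `np.linalg.lstsq` stands in for whatever solver the authors used):

```python
import numpy as np

def fit_M_b(p_gps, p_vo):
    """Least-squares fit of p_vo ~ M @ p_gps + b from N >= 4 point pairs.

    Each pair contributes three equations; the 12 unknowns are the
    entries of M (row-major) followed by the three entries of b.
    """
    p_gps = np.asarray(p_gps, dtype=float)
    y = np.asarray(p_vo, dtype=float).reshape(-1)
    A = np.zeros((3 * len(p_gps), 12))
    for i, p in enumerate(p_gps):
        for r in range(3):
            A[3 * i + r, 3 * r:3 * r + 3] = p   # coefficients of row r of M
            A[3 * i + r, 9 + r] = 1.0           # coefficient of component r of b
    x, *_ = np.linalg.lstsq(A, y, rcond=None)
    return x[:9].reshape(3, 3), x[9:]

# synthetic check: a known scaled rotation plus offset is recovered
rng = np.random.default_rng(0)
M_true = 2.0 * np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
b_true = np.array([1.0, -2.0, 3.0])
p_gps = rng.normal(size=(6, 3))                 # non-collinear sample points
p_vo = p_gps @ M_true.T + b_true
M_est, b_est = fit_M_b(p_gps, p_vo)
```

With noise-free synthetic data the fit is exact; with real GPS noise the residual is what the later optimization stage reduces further.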
where $\hat{\mathbf{M}}$ contains the estimated nonorthogonal matrix that approximates the scaled orthogonal matrix $\mathbf{S}\,\mathbf{R}^{\top}$. Since the rotation part of $\hat{\mathbf{M}}$ does not always meet the orthogonality condition, a further processing step is needed: each row of (9) is divided by its norm, as shown in (10),

$$\hat{s}_{i} = \left\|\hat{\mathbf{m}}_{i}\right\|, \quad \hat{\mathbf{R}}'^{\top} = \hat{\mathbf{S}}^{-1}\hat{\mathbf{M}}, \quad i \in \{e, n, u\} \quad (10)$$

where $\hat{\mathbf{m}}_{i}$ is the $i$-th row of $\hat{\mathbf{M}}$. Here, the scale factor $\hat{\mathbf{S}} = \mathrm{diag}(\hat{s}_{e}, \hat{s}_{n}, \hat{s}_{u})$ is obtained. But $\hat{\mathbf{R}}'$ may still not be orthogonal, and we have to find the orthogonal matrix that best approximates $\hat{\mathbf{R}}'$. The Singular Value Decomposition (SVD) method is used to solve this problem, and it is given as follows,

$$\hat{\mathbf{R}}' = \mathbf{U}\,\boldsymbol{\Sigma}\,\mathbf{V}^{\top}, \quad \hat{\mathbf{R}} = \mathbf{U}\,\mathbf{V}^{\top} \quad (11)$$

where $\mathbf{U}$ and $\mathbf{V}$ are orthogonal matrices and $\boldsymbol{\Sigma}$ is a diagonal matrix with positive elements.
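The scale extraction and SVD orthogonalization can be sketched as below, assuming $\mathbf{M} = \mathbf{S}\mathbf{R}^{\top}$ so that the row norms of $\hat{\mathbf{M}}$ give the per-axis scale factors (the function name is illustrative):

```python
import numpy as np

def decompose_M(M_est):
    """Split a fitted M ~ S @ R.T into scale factors and an orthogonal rotation.

    Row norms of M give the diagonal of S; the normalized matrix is then
    projected onto the closest orthogonal matrix via SVD.
    """
    s = np.linalg.norm(M_est, axis=1)        # per-axis scale factors
    R_prime = (M_est / s[:, None]).T         # near-orthogonal rotation estimate
    U, _, Vt = np.linalg.svd(R_prime)
    R_hat = U @ Vt                           # closest orthogonal matrix
    return np.diag(s), R_hat

# a 90-degree rotation uniformly scaled by 2 decomposes back cleanly
S_hat, R_hat = decompose_M(np.array([[0.0, -2.0, 0.0],
                                     [2.0, 0.0, 0.0],
                                     [0.0, 0.0, 2.0]]))
```

The SVD projection is the standard nearest-orthogonal-matrix result; for noisy inputs the product `U @ Vt` is orthogonal by construction even when `R_prime` is not.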
In this way, $\hat{\mathbf{R}}$ and $\hat{\mathbf{S}}$ are calculated by using more than four pairs of positioning results of GPS and VO, where the selected points must not lie on a straight line, because a linear relationship between the points would produce infinitely many solutions. The selection of points can be conducted using the GPS positioning results: as long as they are not collinear, they can be used to calculate $\hat{\mathbf{R}}$ and $\hat{\mathbf{S}}$. When more data are involved in the calculation after a period of flight, the results become more accurate. This is one limitation of the proposed method: the positioning results of GPS and VO used to calculate $\hat{\mathbf{R}}$ and $\hat{\mathbf{S}}$ must not lie on a straight line. Another limitation is a defect of VO itself: the ground must appear in the view of the camera, and if only ocean and sky are visible, the estimations of VO will have large errors.
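The collinearity restriction can be checked before solving; a simple rank test on the centered GPS points (a hypothetical helper, not from the paper):

```python
import numpy as np

def points_span_plane(points, tol=1e-6):
    """Return True if the points are not all on one straight line.

    After centering, collinear points give a matrix of rank <= 1,
    which would make the stacked linear system rank-deficient.
    """
    P = np.asarray(points, dtype=float)
    P = P - P.mean(axis=0)
    return np.linalg.matrix_rank(P, tol=tol) >= 2

line = [[0, 0, 0], [1, 1, 0], [2, 2, 0], [3, 3, 0]]   # degenerate track
bent = [[0, 0, 0], [1, 1, 0], [2, 2, 0], [3, 0, 1]]   # usable track
print(points_span_plane(line), points_span_plane(bent))
```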
Then, the orientation $\mathbf{R}^{V}_{k}$ in the pose $\mathbf{T}_{k}$ estimated by VO can be used to derive the attitude of the vehicle $\mathbf{R}^{E}_{k}$ in the ENU frame, which is given by

$$\mathbf{R}^{E}_{k} = \hat{\mathbf{R}}\,\mathbf{R}^{V}_{k} \quad (12)$$

Through the equations above, the rotation matrix and scale factor between GPS and VO are obtained, and the attitude of the vehicle is derived as well.
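Once the attitude matrix in the ENU frame is available, yaw, pitch, and roll can be read off with a standard Z-Y-X Euler decomposition; a sketch (the convention choice is an assumption, and the decomposition degenerates near ±90° pitch):

```python
import numpy as np

def rotation_to_euler(R):
    """Yaw, pitch, roll (radians) from a rotation matrix, Z-Y-X convention.

    Assumes pitch stays away from +/-90 degrees, where this
    convention loses one degree of freedom (gimbal lock).
    """
    yaw = np.arctan2(R[1, 0], R[0, 0])
    pitch = -np.arcsin(np.clip(R[2, 0], -1.0, 1.0))
    roll = np.arctan2(R[2, 1], R[2, 2])
    return yaw, pitch, roll

# a pure 90-degree yaw rotation
R = np.array([[0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
yaw, pitch, roll = rotation_to_euler(R)
print(np.degrees([yaw, pitch, roll]))
```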
2.3. The Optimization of GVO
The rotation matrix and scale factor between GPS and VO derived in the last section are not optimal solutions, because the accumulative errors in the poses obtained by VO can lead to large errors in the estimated rotation matrix and scale factor. In order to reduce the errors of the estimated position and attitude as much as possible, an optimization process is required.
According to (5), the position measurements of VO in the ENU frame are given as follows,

$$\mathbf{p}^{VE}_{k} = \hat{\mathbf{M}}^{-1}\left(\mathbf{p}^{V}_{k} - \hat{\mathbf{b}}\right) = \hat{\mathbf{R}}\,\hat{\mathbf{S}}^{-1}\left(\mathbf{p}^{V}_{k} - \hat{\mathbf{b}}\right) \quad (13)$$

where $\mathbf{p}^{VE}_{k}$ is the position estimated by VO, transformed into the ENU frame.
After the transformation in (13), the optimization process is carried out, and the proposed optimization algorithm is given in Algorithm 1.
Algorithm 1 Optimization of GVO
Require: Positioning results of GPS $\mathbf{p}^{G}_{k}$ and poses of the camera $\mathbf{T}_{k}$.
1: Optimize $\hat{\mathbf{R}}$ and $\hat{\mathbf{S}}$ by minimizing the error between the GPS measurements and the positions estimated by VO, with initial values obtained by (10) and (11).
2: Optimize the poses of the camera $\mathbf{T}^{E}_{k}$ by minimizing the error between the GPS measurements and the positions estimated by VO.
3: Use BA (Bundle Adjustment) to optimize the camera poses again by minimizing the image reprojection error between the matched 3D points in the ENU frame and their keypoints in the image.
4: if the optimization converges then
5:  return the optimized $\hat{\mathbf{R}}$, $\hat{\mathbf{S}}$, and $\mathbf{T}^{E}_{k}$.
6: else
7:  repeat steps 1 to 3
8: end if
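Step 1 of Algorithm 1 minimizes the discrepancy between the GPS positions and the VO positions mapped into the ENU frame. Below is a sketch of the residual function such a nonlinear solver would drive toward zero, assuming the mapping $\hat{\mathbf{R}}\hat{\mathbf{S}}^{-1}(\mathbf{p}^{V} - \hat{\mathbf{b}})$; variable names are illustrative:

```python
import numpy as np

def step1_residuals(R_hat, S_hat, b_hat, p_vo, p_gps):
    """Per-axis residuals: GPS positions minus VO positions mapped to ENU.

    A nonlinear least-squares solver (e.g. Levenberg-Marquardt) would
    minimize the squared norm of this vector, starting from the linear
    estimates of the previous section.
    """
    mapped = (np.asarray(p_vo) - b_hat) @ (R_hat @ np.linalg.inv(S_hat)).T
    return (np.asarray(p_gps) - mapped).reshape(-1)

# with perfectly consistent synthetic data, the residuals vanish
R_hat = np.eye(3)
S_hat = 2.0 * np.eye(3)
b_hat = np.array([1.0, 0.0, 0.0])
p_gps = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
p_vo = p_gps @ (S_hat @ R_hat.T).T + b_hat   # forward model of the fit
res = step1_residuals(R_hat, S_hat, b_hat, p_vo, p_gps)
```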
There are three major steps in the optimization process. First, $\hat{\mathbf{R}}$ and $\hat{\mathbf{S}}$ obtained by (10) and (11) are taken as initial values to accelerate the convergence of the optimization. They are optimized by minimizing the errors between the GPS measurements and the positions estimated by VO, which is given by,

$$\min_{\hat{\mathbf{R}},\,\hat{\mathbf{S}}} \sum_{i=1}^{k} \left\|\mathbf{p}^{G}_{i} - \hat{\mathbf{R}}\,\hat{\mathbf{S}}^{-1}\left(\mathbf{p}^{V}_{i} - \hat{\mathbf{b}}\right)\right\|^{2} \quad (14)$$

Let $\mathbf{T}^{E}_{k}$ be the poses of VO, which have been transformed into the ENU frame,

$$\mathbf{T}^{E}_{k} = \begin{bmatrix} \hat{\mathbf{R}}\,\mathbf{R}^{V}_{k} & \mathbf{p}^{VE}_{k} \\ \mathbf{0}^{\top} & 1 \end{bmatrix} \quad (15)$$

After the optimization of $\hat{\mathbf{R}}$ and $\hat{\mathbf{S}}$, the poses of the camera $\mathbf{T}^{E}_{k}$ at each frame are optimized by minimizing the errors between the GPS positioning results and the positions estimated by VO, which is represented by,

$$\min_{\mathbf{T}^{E}_{i}} \sum_{i=1}^{k} \left\|\mathbf{p}^{G}_{i} - \mathbf{p}^{VE}_{i}\right\|^{2} \quad (16)$$

where $\mathbf{p}^{VE}_{i}$ is the translation part of $\mathbf{T}^{E}_{i}$.
At last, we use BA (Bundle Adjustment) to optimize the camera poses again by minimizing the image reprojection error between the matched 3D points in the ENU frame and their keypoints in the image, following the method in [31]. The position of a ground mark $\mathbf{P}^{E}_{j}$ in the ENU frame is given by

$$\mathbf{P}^{E}_{j} = \hat{\mathbf{R}}\,\hat{\mathbf{S}}^{-1}\left(\mathbf{P}^{V}_{j} - \hat{\mathbf{b}}\right) \quad (17)$$

where $\mathbf{P}^{V}_{j}$ is the position of the ground mark in the local frame of VO.
After the optimization process above, the system judges whether the optimization result has converged. If it has, the optimized result is output; otherwise, the optimization process is repeated until it converges.