Position and Attitude Estimation Method Integrating Visual Odometer and GPS.

The monocular visual odometer is widely used in the navigation of robots and vehicles, but it has the defect that the scale of the estimated trajectory is unknown. In this paper, we present a position and attitude estimation method integrating the visual odometer and the Global Positioning System (GPS), in which the GPS positioning results are taken as a reference to minimize the trajectory estimation error of the visual odometer and to derive the attitude of the vehicle. Hardware-in-the-loop simulations were carried out; the experimental results showed that the positioning error of the proposed method was less than 1 m, and that the accuracy and robustness of the attitude estimation results were better than those of state-of-the-art vision-based attitude estimation methods.


Introduction
Position and attitude are essential measurements in the guidance and control of an Unmanned Aerial Vehicle (UAV). They are usually obtained from an Inertial Navigation System (INS) [1,2] and corrected by the Global Positioning System (GPS) [3][4][5], magnetic sensors [6], visual odometers [7,8], etc. However, these methods require an initial alignment of the INS, which means the vehicle must stay stationary at the beginning while the INS is being aligned [9,10]. This problem can be solved by transfer alignment with external attitude measurement equipment [11][12][13], but that increases the cost and weight of the navigation system. Monocular cameras are widely used in the visual navigation of UAVs because they are inexpensive and small. For a UAV equipped with a monocular camera, attitude estimation can be realized without an INS, and methods of estimating the attitude with pure computer vision have been studied by many researchers. A variety of such methods have been proposed, such as the horizon detection method [14,15] and the vanishing points method [16][17][18]. Timotheatos proposed a vehicle attitude estimation method based on horizon detection [19], which detects the horizon line with a Canny edge detector and a Hough transform, along with an optimization step performed by a Particle Swarm Optimization (PSO) algorithm. The roll angle of the vehicle is determined from the slope of the horizon, and the pitch angle is computed from the current and initial horizon positions. This method was further studied by Dumble [20], who extracts the horizon more accurately based on spatial correlation and removes the straight-horizon-line assumption. Additionally, Hwangbo proposed an attitude estimation method based on vanishing points [21], in which the attitude of the vehicle is estimated under the assumption that the outlines of buildings are either vertical or parallel to the horizon.
However, all these methods rely on external references to estimate attitude, and they may fail when no references are present or when the references do not match their assumptions.
In addition to using external references, the attitude of the vehicle can be obtained by another vision-based attitude estimation method, the Visual Odometer (VO).
The principle of the proposed method is shown in Figure 1, where the trajectories of the vehicle are simplified from 3D (3-Dimensional) space to 2D (2-Dimensional) space to aid understanding: the height information in 3D space is omitted, and the trajectories are represented on a 2D plane. In Figure 1, the light blue lines indicate the trajectories measured by GPS, the red lines represent the motion trajectories estimated by VO, and the broken lines in yellow and dark blue are the attitudes estimated by VO. The trajectory of the vehicle is estimated by GPS and by the Visual Odometer, represented by x n and x vo n, respectively, where n is the index of the points. As shown in Figure 1, firstly, the GPS and the visual odometer estimate the flight trajectories separately. Secondly, the trajectories estimated by the visual odometer and GPS are transformed into the same coordinate system. Finally, we assume that the error of the GPS measurements x n is unbiased white noise [29] and that there is an accumulative error in the position x vo n estimated by VO. The GPS positioning results are therefore taken as measurements of the true position to correct the estimates of VO, and the scale factor between x vo n and x n is derived. Since the camera pose obtained by VO contains the attitude information, the attitude of the vehicle can be estimated as well.
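The core idea, using GPS to fix the unknown scale of a monocular VO trajectory, can be illustrated with a toy example (the numbers below are hypothetical; the full method with rotation and per-axis scale factors is derived in the following sections):

```python
import numpy as np

# Hypothetical 2D trajectory: GPS positions (metres, in a local plane) and the
# same path as reported by monocular VO, which is off by an unknown scale.
true_scale = 2.5
x_gps = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 5.0], [30.0, 15.0]])
x_vo = x_gps / true_scale      # VO trajectory in its own scale-free units

# Least-squares estimate of the scale s minimising sum ||x_gps - s * x_vo||^2:
# s = sum(x_vo . x_gps) / sum(x_vo . x_vo)
s = np.sum(x_vo * x_gps) / np.sum(x_vo * x_vo)
x_vo_corrected = s * x_vo      # VO trajectory rescaled onto the GPS reference
```

Here the recovered scale matches the true one because the toy data are noise-free; with real measurements, the estimate is a least-squares compromise over the whole trajectory.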

The Position and Attitude Estimation
The update frequency of GPS is usually 5-10 Hz, while the computational frequency of VO is 25-30 Hz, several times the update frequency of GPS. So, a hardware or software alignment method is necessary to realize synchronization between the pose T n obtained by VO and the position x n estimated by GPS.
Let x k be the position measurement of GPS, which has been synchronized with VO at frame k, and let T k be the transform matrix estimated by VO between frames k − 1 and k. For brevity, a 3D coordinate such as x = (x, y, z) is automatically converted to homogeneous form x = [x y z 1] T when multiplied with the transform matrix T k . T k is given as follows, where T 0 = I ∈ R 4×4 , R k ∈ SO(3) is the rotation matrix, SO(3) is the special orthogonal group SO(3) = { R k ∈ R 3×3 | R k R T k = I, det(R k ) = 1 }, and t k ∈ R 3 is the translation vector.
The position measurements of GPS are usually expressed in geographic coordinates: longitude L k , latitude B k , and height H k . In order to obtain the pitch and roll attitude angles, the East-North-Up (ENU) coordinate system is selected as the intermediate coordinate system to fuse the measurements of GPS and VO, where the transformations between geographic coordinates and the ENU frame are given in [30]. After time alignment between GPS and VO, the coordinates of GPS are transformed into the ENU frame, x enu k = (x enu k , y enu k , z enu k ), to be integrated with VO.
As the relationship between the initial local frame of VO and the ENU frame is fixed, the transformation matrix between them is constant, and the position and attitude of the camera in the ENU frame can be obtained from the poses estimated by VO. The transformation between the measurements of VO and GPS in the ENU frame can be represented as follows, where x vo k is the position in the local frame of VO at frame k, x vo 0 = (0 0 0) is the origin of the local frame of VO, and x enu 0 = (0 0 0) is the origin of the GPS measurements in the ENU frame. R is the rotation matrix from the local frame of VO to the ENU frame, and K is a diagonal matrix representing the scale factors from the ENU frame to the local frame of VO, in which the three diagonal elements represent the scale factors in the east, north, and up directions, respectively.
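The geodetic-to-ENU conversion referenced from [30] can be sketched as follows. This is a standard WGS-84 implementation (going through Earth-Centered Earth-Fixed coordinates), not necessarily the exact formulation of [30]; angles are in radians and heights in metres:

```python
import numpy as np

# WGS-84 ellipsoid constants
A = 6378137.0                 # semi-major axis (m)
F = 1.0 / 298.257223563       # flattening
E2 = F * (2.0 - F)            # first eccentricity squared

def lla_to_ecef(lat, lon, h):
    """Geodetic coordinates (radians, metres) -> ECEF coordinates."""
    s, c = np.sin(lat), np.cos(lat)
    n = A / np.sqrt(1.0 - E2 * s * s)     # prime-vertical radius of curvature
    return np.array([(n + h) * c * np.cos(lon),
                     (n + h) * c * np.sin(lon),
                     (n * (1.0 - E2) + h) * s])

def lla_to_enu(lat, lon, h, lat0, lon0, h0):
    """Geodetic point -> local East-North-Up frame anchored at (lat0, lon0, h0)."""
    d = lla_to_ecef(lat, lon, h) - lla_to_ecef(lat0, lon0, h0)
    sl, cl = np.sin(lat0), np.cos(lat0)
    so, co = np.sin(lon0), np.cos(lon0)
    east = -so * d[0] + co * d[1]
    north = -sl * co * d[0] - sl * so * d[1] + cl * d[2]
    up = cl * co * d[0] + cl * so * d[1] + sl * d[2]
    return np.array([east, north, up])
```

For example, a point 100 m directly above the anchor maps to ENU coordinates (0, 0, 100).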
To simplify Equation (2), t k and d k are defined as given below, thus obtaining (5). For convenience of calculation, the elements in (5) are moved to the left side, as given in (6). However, there is usually no analytic solution for R and K due to the white noise in the GPS measurements and the accumulative error of VO, so the least squares method is used to solve this equation. To obtain R and K, define the solution vector m as follows:
m = [r 11 r 12 r 13 r 21 r 22 r 23 r 31 r 32 r 33 k 1 k 2 k 3] T
When the measurements of GPS and VO from frame 1 to frame k are introduced into (6), it can be rewritten as (8). As shown in (6), each pair of positioning results of GPS and VO gives three equations, while there are 12 unknown variables. This over-determined system can therefore be solved using more than four pairs of positioning results of GPS and VO, and the solution vector m can be obtained after the fourth frame of the input image sequence. Then, the estimated matrices R̃ and K are given as follows, where R̃ is the estimated non-orthogonal matrix that approximates the orthogonal matrix R. Since R̃ does not always satisfy the orthogonality condition, further processing of R̃ is needed, and (9) is divided by R̃, as shown in (10).
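One practical way to realize this stacked least-squares solve can be sketched in numpy on synthetic (hypothetical) data. As a slight simplification of the 12-unknown formulation above, the nine entries of the combined matrix M = R̃K are estimated linearly from the point pairs, and the scale factors are then separated as the column norms of M:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: a 30-degree yaw rotation R and uniform scale K.
yaw = np.radians(30.0)
R_true = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                   [np.sin(yaw),  np.cos(yaw), 0.0],
                   [0.0,          0.0,         1.0]])
K_true = np.diag([2.0, 2.0, 2.0])

# Simulated VO positions (generic, not collinear) and noisy GPS/ENU positions.
x_vo = rng.uniform(-50.0, 50.0, size=(12, 3))
x_enu = (R_true @ K_true @ x_vo.T).T + rng.normal(0.0, 0.1, size=(12, 3))

# Each point pair contributes three rows to the linear system A @ m = b,
# where m stacks the nine entries of M = R~ K row by row.
A = np.zeros((3 * len(x_vo), 9))
b = x_enu.reshape(-1)
for i, p in enumerate(x_vo):
    for r in range(3):
        A[3 * i + r, 3 * r:3 * r + 3] = p
m, *_ = np.linalg.lstsq(A, b, rcond=None)
M = m.reshape(3, 3)

# Since the columns of a rotation matrix have unit norm, the column norms of M
# are the scale factors; R~ = M K^{-1} then approximates the rotation R.
K_est = np.diag(np.linalg.norm(M, axis=0))
R_tilde = M @ np.linalg.inv(K_est)
```

With noise-free data the recovery would be exact; with noise, R̃ is only approximately orthogonal, which is why the SVD projection described next is needed.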
The scale factor K is obtained from this equation, but R̃ may still not be orthogonal, so we have to find the orthogonal matrix that best approximates R̃. The Singular Value Decomposition (SVD) method is used to solve this problem, as given below, where U and V are orthogonal matrices and Σ is a diagonal matrix with positive elements. In this way, R is calculated using more than four pairs of positioning results of GPS and VO, where the selected points must not lie on a straight line, because a linear relationship between the points would produce infinitely many solutions. The selection of the points can be conducted using the positioning results of GPS: as long as they are not on a straight line, they can be used to calculate R. When more data are involved in the calculation after a period of flight, the results become more accurate. This requirement, that the positioning results of GPS and VO used to calculate R and K must not lie on a straight line, is one limitation of the proposed method. Another limitation is a defect of VO itself: the ground must appear in the view of the camera, and if only ocean and sky are visible, the estimates of VO will contain large errors.
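The SVD-based projection onto the nearest orthogonal matrix can be sketched as follows (a standard construction; the determinant correction guards against the projection returning a reflection rather than a rotation):

```python
import numpy as np

def nearest_rotation(R_tilde):
    """Project a nearly-orthogonal matrix onto SO(3): with R~ = U Sigma V^T,
    the closest rotation in the Frobenius norm is U V^T, with a determinant
    correction to rule out reflections."""
    U, _, Vt = np.linalg.svd(R_tilde)
    d = np.sign(np.linalg.det(U @ Vt))
    return U @ np.diag([1.0, 1.0, d]) @ Vt

# Example: a 30-degree yaw rotation perturbed so it is no longer orthogonal.
yaw = np.radians(30.0)
R_true = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                   [np.sin(yaw),  np.cos(yaw), 0.0],
                   [0.0, 0.0, 1.0]])
R = nearest_rotation(R_true + 0.01)   # result is orthogonal and close to R_true
```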
Then, the orientation R k in the pose T k of VO can be used to derive the attitude of the vehicle a k in the ENU frame, which is given by the following equation. Through the equations above, the rotation matrix and scale factor between GPS and VO are obtained, and the attitude of the vehicle is derived as well.
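Extracting attitude angles from a rotation matrix depends on the Euler-angle convention, which the text does not specify; the sketch below assumes the common aerospace Z-Y-X (yaw-pitch-roll) composition and is valid away from the pitch = ±90° singularity:

```python
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    """Z-Y-X convention: R = Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in radians."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return Rz @ Ry @ Rx

def rotation_to_euler(R):
    """Invert the Z-Y-X composition above (valid for |pitch| < 90 degrees)."""
    pitch = -np.arcsin(R[2, 0])
    roll = np.arctan2(R[2, 1], R[2, 2])
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return yaw, pitch, roll

# Round trip: build a rotation from known angles and recover them.
yaw, pitch, roll = rotation_to_euler(euler_to_rotation(0.3, 0.1, -0.2))
```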

The Optimization of GVO
However, the rotation matrix and scale factor between GPS and VO derived in the last section are not optimal, because accumulative errors in the poses obtained by VO can lead to large errors in the estimated rotation matrix and scale factor. In order to reduce the error of the estimated position and attitude as much as possible, an optimization process has to be performed.
According to (5), the position measurements of VO in the ENU frame are given as follows, where x ve k ∈ R 3 is the position estimated by VO, transformed into the ENU frame. After the transformation in (13), the optimization process is carried out, and the proposed optimization algorithm is given in Algorithm 1. There are three major steps in the optimization process. Firstly, R and K obtained by (10) and (11) are taken as initial values to accelerate the convergence of the optimization; they are optimized by minimizing the errors between the GPS measurements and the positions estimated by VO, as given below. Let T ve k be the poses of VO transformed into the ENU frame. After the optimization of R and K, the camera pose T ve k at each frame is optimized by minimizing the errors between the GPS positioning results and the poses estimated by VO. At last, Bundle Adjustment (BA) is used to optimize the camera poses by minimizing the image reprojection error between the matched 3D points in the ENU frame and their key points in the image, following the method in [31]. The position of the ground mark x ve mark in the ENU frame is given by the following equation, where x vo mark is the position of the ground mark in the local frame of VO. After the optimization process above, the system judges whether the optimization result has converged. If it has, the result is output; otherwise, the optimization process is repeated until convergence.
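One way to carry out the first optimization step, minimizing the error between the GPS measurements and the rotated and scaled VO positions, is an alternating scheme. This is only a sketch of the idea on synthetic data, not the authors' Algorithm 1: with R fixed, each diagonal scale factor has a closed-form least-squares solution, and with K fixed, the optimal rotation is the orthogonal Procrustes solution obtained by SVD:

```python
import numpy as np

def refine_R_K(x_enu, x_vo, R0, K0, iters=20):
    """Alternating least-squares refinement of the rotation R and diagonal
    scale K minimising sum_k ||x_enu_k - R K x_vo_k||^2, starting from the
    initial estimates R0, K0."""
    R, K = np.asarray(R0, float), np.asarray(K0, float)
    for _ in range(iters):
        # K step: with R fixed, R^T x_enu_k ~ K x_vo_k gives each scale factor
        # k_j in closed form as a per-axis least-squares fit.
        y = x_enu @ R                      # row k is (R^T x_enu_k)^T
        k = np.sum(y * x_vo, axis=0) / np.sum(x_vo * x_vo, axis=0)
        K = np.diag(k)
        # R step: with K fixed, the best rotation is the orthogonal Procrustes
        # fit between the scaled VO points K x_vo_k and the GPS points x_enu_k.
        H = (x_vo @ K).T @ x_enu
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, K

# Synthetic check: recover a known rotation and per-axis scales from noisy data,
# starting from a rough (non-orthogonal) initial rotation estimate.
rng = np.random.default_rng(1)
yaw = np.radians(30.0)
R_true = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                   [np.sin(yaw),  np.cos(yaw), 0.0],
                   [0.0, 0.0, 1.0]])
K_true = np.diag([2.0, 2.1, 1.9])
x_vo = rng.uniform(-50.0, 50.0, size=(20, 3))
x_enu = (R_true @ K_true @ x_vo.T).T + rng.normal(0.0, 0.05, size=(20, 3))
R_opt, K_opt = refine_R_K(x_enu, x_vo, R_true + 0.05, np.eye(3))
```

The pose-level refinement and the BA step would then operate on the per-frame poses and reprojection errors, which requires the feature correspondences and is omitted here.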

The Experimental Environment
Hardware-in-the-loop simulations were conducted to verify the proposed method, and the block diagram of the simulation system is shown in Figure 2.
As shown in Figure 2, the simulation system consisted of three parts: the simulation computer, the PX4 embedded flight controller, and the Nvidia Jetson TX2 embedded image processing board. The computer was used to simulate the visual environment, the PX4 embedded flight controller was used to control the attitude of the aircraft, and the Nvidia Jetson TX2 embedded image processing board was used to run the GVO method.
The CPU of the computer was an Intel i7-7700K at 4.20 GHz, the GPU was an Nvidia GTX 1080 with 8 GB of memory, and the RAM was 32 GB. The integral proportional guidance law was added to the PX4 embedded flight controller on top of the existing codebase to assist guidance control, where the code can be downloaded from [32]. The size and weight of the TX2 module were 50 × 87 mm and 85 g, respectively. The power consumption of the TX2 when running the algorithm was 20 W, and it cost $399.
The block diagram of the working process of the simulation system is shown in Figure 3. First, the 3D environment was rendered by X-Plane 10 on the computer, and real-time image sequences were captured during flight according to the configured installation and intrinsics of the camera. Then, the image sequences were transmitted to the NVIDIA image processing board to run the proposed GVO method and obtain the estimated position and attitude of the UAV, where the GPS positioning results were obtained from the PX4 flight controller. The estimated position and attitude were fed back to PX4 so that it could realize attitude control using the waypoints preset in the QGroundControl software on the computer and send the attitude control message to X-Plane 10. When X-Plane 10 received the attitude control message from PX4, it changed the flight status of the aircraft, taking into account the dynamic model of the aircraft and the wind model of the environment. In addition, the flight status of the UAV was fed back to PX4 to form a closed loop, which assisted PX4 in controlling the attitude of the UAV during flight. The update frequency of the PX4 controller, the image output frequency of X-Plane 10, and the image sampling frequency of the NVIDIA Jetson were all 50 Hz.

Positioning Results of GVO
To verify the positioning accuracy of GVO in different conditions, multiple heights were employed to conduct simulations, and the simulation conditions are shown in Table 1.

The flying speed in Table 1 is the cruising speed of the MyTwinDream, a fixed-wing UAV manufactured by FPVmodel in Xiamen, China. Due to wind disturbance, the speed of the UAV was not fixed during flight. The camera parameters here were those of a commonly used onboard camera. The positioning results of GVO and GPS are shown in Figure 4, where the blue lines are estimated by GVO and the red lines are measurements of GPS with a standard deviation of 5 m. As can be seen from Figure 4, the trajectories estimated by GVO basically overlapped with the original GPS measurements. The statistics of the positioning errors of GVO are given in Table 2, and the corresponding bar diagram is shown in Figure 5. As can be seen from Table 2 and Figure 5, the mean deviation of the positioning results was less than 1 m, and the standard deviation of GVO was less than 0.7 m, much smaller than that of pure GPS with its standard deviation of 5 m.
The convergence property of the proposed method was tested as well. The positioning errors of GVO are shown in Figure 6; they converged in all cases, and the slowest case converged after the 250th frame at a frame rate of 25 frames per second. This shows that the system converged in a short time, within 10 s, when using the proposed method.

Attitude Estimation Results
To assess the robustness and accuracy of GVO in attitude estimation, this section compares GVO with two state-of-the-art monocular visual attitude estimation methods: the horizon detection method and the vanishing points method.
The premise of the horizon detection method for attitude estimation is that the horizon must appear in the view of the camera. This algorithm estimates the roll angle of the camera by detecting the slope of the horizon, and the pitch angle is estimated from the distance between the center of the horizon and the central axis of the camera, as shown in Figure 7a. The vanishing points method estimates attitude when there are a large number of artificial constructions: the attitude is estimated from the vanishing points of the outlines of the buildings, where the outlines are assumed to be parallel or vertical to the horizon, as shown in Figure 7b. Simulations were carried out in multiple environments, such as plain, mountain, and city, to compare the proposed method with the different state-of-the-art attitude estimation methods, and the experimental conditions are given in Table 3.
The horizon detection method was compared with the proposed method in the plain and mountain areas, and the experimental results are given in Figures 8 and 9, respectively. As shown in Figure 8, the attitude estimation results of GVO were very close to those of the horizon detection method in the plain area, but the horizon detection method failed occasionally, at the times when the estimated values jumped to zero. The horizon detection method checks whether the detected lines meet a length threshold to ensure horizon detection accuracy, and it fails when no detected lines meet the requirements of the horizon assumption. The GVO method was not affected by the environment, so there were no failures of GVO in the simulations. The simulation results of GVO and the horizon detection method in the mountainous area are shown in Figure 9. As shown in Figure 9, the accuracy and success rate of GVO were obviously higher than those of the horizon detection method. Compared with the plain area, the horizon in the mountainous area was more difficult to detect. We can see from Figure 9 that the failure frequency of the horizon detection method was much higher than in the plain area, and its attitude estimation accuracy in the mountainous area was worse.
Meanwhile, GVO still worked properly, which proved that the advantages of GVO are more prominent in mountainous areas. The statistics of GVO and the horizon detection method in the plain and mountain environments are given in Table 4. As shown in Table 4, the accuracy of GVO was slightly higher than that of the horizon detection method, but the success rate of GVO was much higher, especially in the mountainous area. As long as VO and GPS work properly, the success rate of GVO is always 100%, which proves that the proposed method can estimate the attitude efficiently without being influenced by the external environment. So, the GVO method has the advantages of higher accuracy and better robustness than the horizon detection method.
As shown in Figure 10, simulations in the city were conducted to compare GVO with the vanishing points method, and the statistics are given in Table 5. As can be seen from Figure 10 and Table 5, the success rate of the vanishing points method decreased sharply with increasing height, while the success rate of the GVO method remained 100%. As the height increased, the outlines of the buildings became shorter, and the detection accuracy of the vanishing points method was reduced when the lengths of the lines were too short to meet the detection threshold. GVO was not affected by the variation of height and had higher attitude estimation accuracy. Overall, GVO showed better accuracy and robustness in attitude estimation than the state-of-the-art methods. In the simulations above, the success rate of GVO in attitude estimation was always 100% in spite of the variation of environment and height; in comparison, the average success rates of the horizon detection method and the vanishing points method were 91.3% and 87.5%, respectively.
The attitude estimation error of GVO ranged from 1.1° to 1.3°, while the error of the horizon detection method ranged from 2.0° to 2.1° in the plain area and from 2.2° to 2.5° in the mountainous area. The average error of the vanishing points method ranged from 2.7° to 3.1° in the city.

Conclusions
In this paper, an algorithm integrating the Visual Odometer and GPS was proposed to estimate the position and attitude of a UAV. The integration method of VO and GPS was described, and the estimates of the proposed method were optimized as well. Hardware-in-the-loop simulations were conducted. The TX2 image processing board that we used was released in 2016, and there are now alternative image processing chips with smaller sizes, such as the Huawei Hi3559A (38 × 38 mm, manufactured by HiSilicon in Shanghai, China), which is more suitable for use on a UAV; such chips enable this method to be applied in engineering practice. The experimental results showed that the mean deviation of the positioning results of GVO was less than 1 m, and the standard deviation was less than 0.7 m. Simulations comparing the proposed method with two state-of-the-art visual attitude estimation methods were carried out to verify its attitude estimation, and the results showed that the GVO method performed better in both accuracy and robustness.