1. Introduction
Long-endurance UAVs are characterized by long flight durations, long ranges, and high operating altitudes. Their navigation systems must therefore maintain robust and accurate localization over long periods. The primary navigation method on current long-endurance UAVs is the GNSS–inertial integrated system. However, GNSS is susceptible to interference, which can lead to task failure or even catastrophe in challenging environments.
As an emerging navigation technology, visual navigation is fully autonomous and resistant to interference, making it an alternative to GNSS. Visual navigation methods can be broadly divided into two categories: Relative Visual Localization (RVL) methods and Absolute Visual Localization (AVL) methods [
1]. RVL uses image flows to estimate the relative motion, such as Visual Odometry (VO) [
2,
3,
4] and Visual Simultaneous Localization and Mapping (VSLAM) [
2,
5,
6], while the AVL compares camera images with a reference map to determine the carrier’s absolute position.
As most long-duration UAVs are equipped with navigation-level inertial systems for accurate dead reckoning, this work focuses on using AVL techniques to compensate for the inertial drifts and achieve cross-day-and-night accurate localization.
In AVL methods, template matching and feature matching are two common approaches for matching aerial images with reference maps [
1]. In order to achieve better matching results, obtaining a top-down view similar to the reference map is necessary. However, in UAV application scenarios, due to the horizontal attitude changes in the UAV, the camera axis is typically not perpendicular to the ground. Although mechanisms such as gimbals and stabilizers have been used to maintain the camera’s perpendicularity to the ground [
7,
8,
9,
10], the limitations in UAV size make it difficult to accommodate these mechanical structures with multiple sensors in different positions on the UAV.
Long-endurance UAVs undertake long flights; therefore, the visual geo-registration method must also work at night. However, most traditional AVL research focuses on visible-light image matching, and visible-light-based AVL methods usually perform poorly at nighttime. Using thermal images instead is a practical way to address this problem. AVL methods also require reference maps, but few institutions are devoted to building complete, high-precision infrared radiation (IR) maps. Multimodal image matching (MMIM) is an active research topic in image matching and has been studied widely in the medical field [
11]. Matching IR and visible images (IR-VIS) is one of the most important research directions in MMIM. Similar to conventional visible-image matching methods, IR-VIS methods can also be divided into template matching [
12,
13] and feature matching [
14,
15,
16,
17]. Learning-based methods [
18,
19] are also used in complex scenes where descriptors cannot be extracted directly. However, most MMIM methods focus only on the registration of street scenes and other structured scenes.
We propose a long-term localization method by fusing the inertial navigation and cross-domain visual geo-registration component, as shown in
Figure 1. This method includes three main components: inertial navigation, visible/thermal vision-based geo-localization, and inertial–visual integrated navigation. Compared to traditional matching-based localization algorithms, our research targets the day-and-night visual localization needs of long-term operation. Therefore, we propose describing images with deep-learning features and matching them with a graph neural network. In addition, we analyze the potential errors in visual matching localization and propose compensation methods. The main contributions of our work are as follows:
To match visible and thermal camera images to a remote sensing RGB map, we investigate several visual features and propose outlier filtering methods to achieve cross-day-and-night geo-registration performance.
To obtain accurate localization results from geo-registration, we analyze the influence of horizontal attitude on the visual geo-localization, and we propose a compensation method to correct the localization from raw image registration.
We conduct actual long-duration flight experiments in different situations. The experiments include visual registration using various feature-extracting methods on visible light and thermal images, geo-localization with attitude compensation, and integrated navigation. The experimental results prove the effectiveness of our methods.
This paper is organized as follows.
Section 2 presents the research related to our work.
Section 3 introduces the proposed method, including our integrated navigation framework, cross-domain visual registration method, geo-localization method, and filter state updating.
Section 4 describes our experiments, including system setup, dataset description, data evaluation criteria, and experiment results.
Section 5 presents our conclusion.
2. Related Works
We focus on integrating dual-vision (visible and thermal) geo-registration with inertial measurement units (IMUs) for day-and-night operation. Therefore, we review two related categories of work: visual navigation methods and IR-VIS methods.
2.1. Visual Navigation
Traditional visual navigation technologies can be divided into RVL and AVL [
1]. VO and VSLAM are the two leading technologies under RVL, with VO often serving as a component of VSLAM. Classic pure visual relative localization algorithms include SVO [
18], DSO [
19], and ORB-SLAM [
20], among others. Many researchers have concentrated on fusing vision with other sensors to compensate for the integration errors of pure visual relative localization, especially integrating the inertial measurement unit (IMU) with vision systems. Representative works include VINS-mono [
4] and ORB-SLAM3 [
5]. In recent years, RVL based on non-visible-light images has also been applied in navigation [
21]. RVL can estimate the relative position of the carrier, but without prior geographical coordinates, RVL cannot determine the carrier’s absolute position in the geographical coordinate system. Moreover, VO has cumulative errors, and VSLAM requires loop closure to achieve high-precision localization and mapping, which limits the application of RVL technology in UAVs, especially for long-term and large-scale navigation requirements.
AVL aims to match camera images with reference maps to determine the carrier’s position on the map. When the reference map can be aligned with the geographical coordinate system, the absolute position of the carrier in the geographical coordinate system can be determined. Image matching methods are mainly divided into template matching and feature matching. Template matching involves treating aerial images as part of the reference map, so the prerequisite is to transform them into images with the same scale and direction as the reference map before comparing them. The main parameters used include Normalized Cross-Correlation (NCC) [
9,
22], Mutual Information (MI) [
23], and Normalized Information Distance (NID) [
8]. Though template matching methods achieve higher localization precision than feature-based methods [
24], they need to compare the camera images with the map at the pixel level (raw intensity or normalized intensity); this requires ample space to store the maps and high computation costs to calculate the similarities throughout the maps. At the same time, for better registration between aerial images and reference maps, mechanisms such as gimbals and stabilizers are commonly used to ensure the camera is always perpendicular to the ground [
7,
8].
Another AVL method is to match features between aerial images and reference maps. Although feature-matching methods introduce errors in the feature detection and description steps, they require less storage and computation because they compress raw maps and images into sparse features, which allows efficient real-time implementation on embedded airborne systems. Conte [
25] used edge detection to match aerial images with maps, but the results showed a low matching rate. M. Mantelli [
10] proposed abBRIEF features based on BRIEF features [
26], showing better performance than traditional BRIEF on AVL. J. Surber [
27] and others used a self-built 3D map as a reference and matched it with BRISK features [
26], also utilizing weak GPS priors to reduce visual aliasing. M. Shan and others [
28] constructed an HOG feature query library for the reference map. Since it includes a global search process, this method can determine the carrier’s absolute position without providing prior location information. A. Nassar [
29] proposed using SIFT features for image feature description and introduced Convolutional Neural Networks (CNNs) for feature matching and vehicle localization in images.
In summary, while RVL has the disadvantage of cumulative errors, AVL can relatively accurately determine the carrier’s position without relying on prior location information. However, most work directly uses the center point of the matched image in the reference map to approximate the carrier’s position. Incorrect matching and UAV attitudes can significantly affect the results of visual matching localization. Moreover, most works focus on visible-light images, which have poor matching performance at night.
2.2. IR-VIS Method
IR-VIS methods can be classified into template matching and feature matching. In template matching, Yu et al. [
13] proposed a strategy to enhance Normalized Mutual Information (NMI) matching. They employed a grayscale weighted window method to extract edge features, thereby reducing the NMI’s joint entropy and local extrema. The author of [
12] first transforms the image into an edge map and describes the geometric relations for coarse and fine matching using affine and Free-Form Deformation (FFD) models. That work optimizes matching by maximizing the overall MI similarity between edge maps. As for feature matching, Hrkać et al. [
14] detected Harris corners from IR and visible images, then used a simple similarity transformation to match them. However, this experiment is performed in situations where images have stable corners. Ma et al. [
15] built an inherent feature of the image by extracting the edge map of the image. They propose a Gaussian field criterion to achieve registration. This method shows good performance in registering IR and visible face images. The authors of [
16] proposed a scale-invariant Probabilistic Intensity-based Image Feature Detector (PIIFD) for corner feature description. In addition, an affine matrix estimation method based on the Bayesian framework is proposed. The authors of [
17] first extracted the edge using the morphological gradient method. They used a C_SIFT detector on the edge map to search for distinct points and used BRIEF for description, finally making scale- and orientation-invariant matching come true.
With the development of deep learning, matching methods based on learning have become essential in IR-VIS research. The authors of [
30] proposed a two-phase Graph Neural Network (GNN), which includes a domain transfer network and a geometrical transformer module. This method was used to obtain better-warped images across different modalities. Baruch et al. [
31] used a hybrid CNN framework to extract and match features jointly. The framework consists of a Siamese CNN and a dual non-weight-sharing CNN, which can capture multimodal image features.
We can see that although many IR-VIS methods have been proposed, most focus on the registration of structured scenes, such as cities, buildings, and streets, and some research targets facial images. Using IR-VIS methods for visual geo-registration has rarely been studied. Unlike structured scenes and human faces, geo-registration deals with unstructured scenes in the natural environment. In particular, scenes such as forest, desert, Gobi, and ocean offer few features and pose a great challenge for matching.
3. Method
This section presents our visual–inertial navigation system (VINS) framework, which comprises three components: inertial navigation, vision geo-registration, and integrated navigation. The vision geo-registration component matches camera images (including both visible-light and thermal images) to the reference map and calculates the geolocation of the UAV. The INS component provides high-frequency navigation state and covariance (including attitude, velocity, and position) prediction. The integrated navigation component utilizes vision geo-registration results to compensate for the drift caused by INS prediction. The framework is illustrated in
Figure 2.
3.1. Visual–Inertial System State Construction and Propagation
In this part, we present the filter used in the integrated navigation, which includes state construction, nominal state propagation, and covariance propagation. We use the State Transformation Extended Kalman Filter (ST-EKF) [
32,
33] to estimate the navigation state.
3.1.1. Nominal State Propagation
Following Ref. [
33], we define the nominal state vector as
$$\mathbf{x} = \begin{bmatrix} \mathbf{q}_b^n & \mathbf{v}^n & \mathbf{p} \end{bmatrix}^{\mathrm{T}},$$
where $\mathbf{q}_b^n$, $\mathbf{v}^n$, and $\mathbf{p}$ are the attitude, velocity, and position, respectively; all three need to be calculated using IMU data. Unlike most visual/inertial navigation systems based on Micro-Electromechanical System IMUs (MEMS IMUs) [
4,
5,
6], which apply numerical integration on the state differential equations, we used optimal INS algorithms with coning and sculling corrections [
34,
35,
36,
37], as long-duration vehicles are usually equipped with navigation-grade IMUs.
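As a concrete illustration of this attitude update with coning compensation, the sketch below implements the classical two-sample coning correction for the body-frame rotation vector. It is a minimal example rather than the authors' flight code; the function names and the use of a rotation-vector update are our assumptions based on the standard multi-sample algorithms cited above, and the earth/transport-rate correction is omitted for brevity.

```python
import numpy as np

def two_sample_rotation_vector(dtheta1, dtheta2):
    """Two-sample coning compensation (classical optimal coefficient 2/3).

    dtheta1, dtheta2: incremental angle vectors (rad) from two successive
    gyro sampling sub-intervals within one attitude update interval.
    Returns the equivalent rotation vector for the whole interval.
    """
    dtheta1 = np.asarray(dtheta1, dtype=float)
    dtheta2 = np.asarray(dtheta2, dtype=float)
    # Sum of increments plus the coning correction term (2/3) * dtheta1 x dtheta2
    return dtheta1 + dtheta2 + (2.0 / 3.0) * np.cross(dtheta1, dtheta2)

def rotvec_to_dcm(phi):
    """Convert a rotation vector to a direction cosine matrix (Rodrigues formula)."""
    angle = np.linalg.norm(phi)
    if angle < 1e-12:
        return np.eye(3)
    axis = phi / angle
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

# Example: propagate the body-to-navigation DCM with one two-sample update
C_b_n = np.eye(3)                                    # previous attitude
phi = two_sample_rotation_vector([1e-4, 2e-4, 0.0],  # first gyro increment
                                 [1.5e-4, 1e-4, 0.0])
C_b_n = C_b_n @ rotvec_to_dcm(phi)                   # body-side update (earth/transport rate omitted)
```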
3.1.2. Filter State
As in Ref. [33], the error state can be expressed as
$$\delta\mathbf{x} = \begin{bmatrix} \boldsymbol{\phi}^{\mathrm{T}} & \delta\mathbf{v}^{\mathrm{T}} & \delta\mathbf{p}^{\mathrm{T}} & \mathbf{b}_g^{\mathrm{T}} & \mathbf{b}_a^{\mathrm{T}} \end{bmatrix}^{\mathrm{T}},$$
where $\boldsymbol{\phi}$, $\delta\mathbf{v}$, and $\delta\mathbf{p}$ are the attitude error, velocity error, and position error, respectively, and $\mathbf{b}_g$ and $\mathbf{b}_a$ are the biases of the gyros and accelerometers, respectively.
To achieve consistent state estimation results, we employ the ST-EKF model to express the kinematics of the error states in the North-East-Down (NED) frame as
$$\begin{aligned}
\dot{\boldsymbol{\phi}} &= -\boldsymbol{\omega}_{in}^n \times \boldsymbol{\phi} - \mathbf{C}_b^n \mathbf{w}_g, \\
\delta\dot{\mathbf{v}} &= -\left(\boldsymbol{\omega}_{ie}^n + \boldsymbol{\omega}_{in}^n\right) \times \delta\mathbf{v} + \mathbf{g}^n \times \boldsymbol{\phi} + \left(\hat{\mathbf{v}}\times\right)\mathbf{C}_b^n \mathbf{w}_g - \mathbf{C}_b^n \mathbf{w}_a,
\end{aligned}$$
where $\boldsymbol{\omega}_{ie}^n$ is the earth rotation vector; $\boldsymbol{\omega}_{in}^n$ is the angular rate vector of the navigation frame relative to the inertial frame; $\hat{\mathbf{v}}$ is the estimated value of velocity; $\boldsymbol{\phi}$ is the attitude error vector in the navigation frame; $\mathbf{C}_b^n$ is the direction cosine matrix (DCM) from the body frame to the navigation frame; $\mathbf{g}^n$ is the gravity vector; $\mathbf{w}_g$ and $\mathbf{w}_a$ are the white noises of the gyros and accelerometers, respectively; and the symbol $(\cdot\times)$ represents the transformation from a vector to its skew-symmetric matrix. $\boldsymbol{\omega}_{ie}^n$ and $\boldsymbol{\omega}_{in}^n$ can be expressed as follows:
$$\boldsymbol{\omega}_{ie}^n = \begin{bmatrix} \omega_{ie}\cos L \\ 0 \\ -\omega_{ie}\sin L \end{bmatrix}, \qquad
\boldsymbol{\omega}_{in}^n = \begin{bmatrix} \omega_{ie}\cos L + \dfrac{v_E}{R_N + h} \\ -\dfrac{v_N}{R_M + h} \\ -\omega_{ie}\sin L - \dfrac{v_E \tan L}{R_N + h} \end{bmatrix},$$
where $v_N$ and $v_E$ are the northward and eastward velocity, respectively; $R_M$ and $R_N$ are the radius of curvature in the meridian and prime vertical, respectively; $L$ is the latitude; and $h$ is the height.
Note that, compared with the common kinematics of error states, the specific force in the velocity error differential equation is replaced by the gravity vector, which helps to improve the state estimation accuracy and consistency in dynamic conditions [
32,
33].
According to Equations (3) and (4), the error model of the INS at time $t$ can be described as follows:
$$\delta\dot{\mathbf{x}}(t) = \mathbf{F}(t)\,\delta\mathbf{x}(t) + \mathbf{G}(t)\,\mathbf{w}(t),$$
where $\mathbf{F}(t)$ is the system matrix, $\mathbf{G}(t)$ is the noise input matrix, and $\mathbf{w}(t)$ is the noise vector of the system. They are the same as those in the traditional Extended Kalman Filter (EKF) of integrated navigation.
3.1.3. Covariance Propagation
From the vision geo-registration that will be presented in
Section 3.2 and INS prediction presented in
Section 3.1.1, we obtain a 2D visual localization position $\mathbf{p}_{vis}$ and an INS prediction position $\mathbf{p}_{ins}$. We define the observation vector $\mathbf{z} = \mathbf{p}_{ins} - \mathbf{p}_{vis}$. Then, the observation equation can be described as:
$$\mathbf{z}_{2\times 1} = \mathbf{H}_{2\times 15}\,\delta\mathbf{x}_{15\times 1} + \boldsymbol{\nu}_{2\times 1}.$$
The subscripts above represent the dimensions of the matrices.
Then, the observation matrix $\mathbf{H}$ can be given as:
$$\mathbf{H} = \begin{bmatrix} \mathbf{0}_{2\times 6} & \mathbf{I}_{2\times 2} & \mathbf{0}_{2\times 7} \end{bmatrix}.$$
In the covariance propagation, we need to predict the covariance matrix $\mathbf{P}_{k|k-1}$ first as follows:
$$\mathbf{P}_{k|k-1} = \boldsymbol{\Phi}_{k-1}\,\mathbf{P}_{k-1}\,\boldsymbol{\Phi}_{k-1}^{\mathrm{T}} + \mathbf{Q}_{k-1},$$
where $\boldsymbol{\Phi}$ is the discretized system matrix, and $\mathbf{Q}$ is the noise distribution covariance matrix.
Then, we calculate the Kalman gain $\mathbf{K}_k$ as:
$$\mathbf{K}_k = \mathbf{P}_{k|k-1}\,\mathbf{H}^{\mathrm{T}}\left(\mathbf{H}\,\mathbf{P}_{k|k-1}\,\mathbf{H}^{\mathrm{T}} + \mathbf{R}\right)^{-1},$$
where $\mathbf{R}$ is the measurement noise covariance matrix of the sensors.
Finally, we update the covariance matrix:
$$\mathbf{P}_k = \left(\mathbf{I} - \mathbf{K}_k\mathbf{H}\right)\mathbf{P}_{k|k-1}\left(\mathbf{I} - \mathbf{K}_k\mathbf{H}\right)^{\mathrm{T}} + \mathbf{K}_k\,\mathbf{R}\,\mathbf{K}_k^{\mathrm{T}}.$$
The diagonal of the updated covariance matrix contains the elements related to the horizontal position ($\sigma_N^2$ and $\sigma_E^2$), which will be used to construct the Gaussian elliptic constraint in integrated navigation.
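A compact sketch of these covariance propagation and update steps is given below (NumPy). It follows the generic discrete-time equations with a 15-dimensional error state and a 2D position observation; the state ordering and the placeholder matrices Φ, Q, and R are illustrative assumptions, not the values used in our system.

```python
import numpy as np

STATE_DIM = 15          # [attitude err(3), velocity err(3), position err(3), gyro bias(3), acc bias(3)]
H = np.zeros((2, STATE_DIM))
H[:, 6:8] = np.eye(2)   # observe the horizontal components of the position error

def predict_covariance(P, Phi, Q):
    """P_k|k-1 = Phi P Phi^T + Q."""
    return Phi @ P @ Phi.T + Q

def update(P, z, R):
    """Kalman gain, error-state correction, and Joseph-form covariance update."""
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    dx = K @ z                               # error-state correction (z = p_ins - p_vis)
    I_KH = np.eye(STATE_DIM) - K @ H
    P_new = I_KH @ P @ I_KH.T + K @ R @ K.T  # Joseph form keeps P symmetric
    return dx, P_new

# Toy usage with illustrative matrices
P = np.eye(STATE_DIM) * 1e-2
Phi = np.eye(STATE_DIM)                      # placeholder discretized system matrix
Q = np.eye(STATE_DIM) * 1e-6                 # placeholder process noise
R = np.eye(2) * 25.0                         # placeholder measurement noise (m^2)
P = predict_covariance(P, Phi, Q)
dx, P = update(P, np.array([3.0, -2.0]), R)
```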
3.2. Cross-Domain Visual Registration
Our work mainly aims at long-duration navigation, which means that visual geo-localization needs to remain effective at nighttime. This section will present the method for achieving cross-domain visual registration.
3.2.1. Feature Extraction and Matching Method
To achieve day–night vision-based localization, the key is to develop visual features effective in visible-light and thermal images. Although a series of hand-crafted features and learned features have been used in RGB-image geo-registration [
7,
8,
9,
10], few works have shown the capabilities in thermal images.
To obtain a unified feature for cross-domain image matching, we investigated multiple hand-crafted and learned features, including SIFT [
38], XFeat [
39], and SuperPoint [
40].
Figure 3 shows samples of the performance of these features on visible-light and thermal images. SIFT works on visible-light images but demonstrates limited effectiveness on thermal images. XFeat can be applied to both visible and thermal images, but it yields fewer matched points and a lower matching rate than SuperPoint. Overall, SuperPoint performs best on both visible-light and thermal matching. In the experimental section, we design a comparative experiment to support this conclusion.
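To make this comparison concrete, the sketch below outlines how such a feature study can be run. It uses OpenCV's SIFT as the hand-crafted baseline and a mutual-nearest-neighbour descriptor matcher as a simplified stand-in for the learned matchers used later (SuperGlue/SGMNet); the learned extractors (SuperPoint, XFeat) are not shown because their APIs depend on the implementation used, so the sketch only assumes that each method produces keypoints and descriptors.

```python
import cv2
import numpy as np

def sift_features(gray):
    """Hand-crafted baseline: SIFT keypoints and descriptors on a grayscale image."""
    sift = cv2.SIFT_create()
    kps, desc = sift.detectAndCompute(gray, None)
    pts = np.array([kp.pt for kp in kps], dtype=np.float32)
    return pts, desc

def mutual_nn_matches(desc_a, desc_b):
    """Simplified stand-in for a learned matcher: mutual nearest neighbours."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    ab = d.argmin(axis=1)           # best b for each a
    ba = d.argmin(axis=0)           # best a for each b
    return [(i, j) for i, j in enumerate(ab) if ba[j] == i]

def matching_rate(pairs_per_frame, threshold=50):
    """Fraction of frames whose match count reaches the acceptance threshold."""
    accepted = sum(1 for n in pairs_per_frame if n >= threshold)
    return accepted / max(len(pairs_per_frame), 1)
```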
3.2.2. Reference Visible Map Pre-Processing
For long-duration navigation, the task areas can be extensive, which can cause difficulties in both storage and real-time processing. Therefore, we need to pre-process the reference map before the task to improve efficiency during the flight.
Although the UAV is equipped with both visible and thermal cameras, we propose using only remote-sensing visible maps as the reference, as high-resolution visible maps are easy to obtain.
To improve the algorithm’s efficiency, we divide the map into sub-maps and use the predicted location from INS to load the sub-maps.
Figure 4 shows the method for building the reference map set.
Firstly, the remote visible-light map along the planned flight path needs to be pre-downloaded. Then, we separate the whole map into sub-maps with a size of $L_E \times L_N$, where $L_E$ and $L_N$ are the actual lengths of the sub-map in the east and north directions, respectively. When the drone flies at the maximum height of the flight, the sub-map should contain as much of the scenery of the camera image as possible. This means the following:
$$L_E \ge 2\,H_{\max}\tan\frac{\theta}{2}, \qquad L_N \ge 2\,H_{\max}\tan\frac{\theta}{2},$$
where $H_{\max}$ is the maximum height of the UAV, and $\theta$ is the field of view (FOV) of the camera. At the same time, when visual matching localization fails, pure inertial navigation will drift. Therefore, it is meaningful to appropriately increase $L_E$ and $L_N$ to avoid map retrieval failure caused by localization divergence, which, in turn, would lead to the complete failure of scene-matching localization. In addition, when the image matching area is close to the edge of a map block, matching may fail because the matching area is incomplete. Therefore, when cropping the map, we first crop it into blocks of size $\frac{L_E}{2} \times \frac{L_N}{2}$ and then stitch adjacent $2 \times 2$ map blocks into a sub-map. This ensures that there is a 50% overlap between adjacent sub-maps. Finally, we obtain the sub-map image set $\{M_{i,j}\}$ numbered in row and column order.
Then, we build an index of the sub-map set by the latitude and longitude coordinates of the center. Finally, we extract features and descriptors for each sub-map and associate them with the index in
Table 1.
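The following sketch illustrates the sub-map construction and indexing described above: the map is cut into half-size blocks, adjacent 2 × 2 blocks form sub-maps with 50% overlap, and each sub-map is indexed by the geographic coordinates of its center. The geo-referencing model (a simple per-pixel longitude/latitude step) and all function names are our assumptions for illustration only.

```python
import numpy as np

def build_submap_index(map_img, lon0, lat0, dlon, dlat, block_px):
    """Split a geo-referenced map into 50%-overlapping sub-maps.

    map_img:    H x W (x C) remote-sensing image, row 0 at the northern edge.
    lon0, lat0: geographic coordinates of the top-left pixel.
    dlon, dlat: per-pixel steps in longitude and latitude (dlat < 0 going south).
    block_px:   size of a half-size block in pixels (a sub-map is 2 * block_px).
    """
    H, W = map_img.shape[:2]
    index = []                           # list of (center_lon, center_lat, sub_map)
    step = block_px                      # stepping by one block gives 50% overlap
    size = 2 * block_px                  # a sub-map is a 2 x 2 group of blocks
    for r in range(0, H - size + 1, step):
        for c in range(0, W - size + 1, step):
            sub = map_img[r:r + size, c:c + size]
            center_lon = lon0 + (c + size / 2) * dlon
            center_lat = lat0 + (r + size / 2) * dlat
            index.append((center_lon, center_lat, sub))
    return index

def query_submap(index, lon, lat):
    """Return the sub-map whose center is closest to the INS-predicted position."""
    dists = [(lon - e[0]) ** 2 + (lat - e[1]) ** 2 for e in index]
    return index[int(np.argmin(dists))][2]
```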
3.2.3. Camera Image Pre-Processing
For long-duration tasks, the UAV can fly at different heights and with various headings, which leads to scale and view variations in the camera images. To reduce the visual aliasing between the camera images and the map, image rotation and scaling are applied before geo-registration. Furthermore, we apply a gamma transformation for thermal images to enhance the contrast, thereby highlighting features more and increasing the number of matching pairs.
We perform rotation and scaling pre-processing on the images before feature point extraction to align the geographic coordinate system represented by the images with the reference map (the reference map's default upward direction being north) and to make the pixel resolution close to that of the reference map. The scaling factors $s_x$ and $s_y$ can be given by:
$$s_x = \frac{h_r}{f_x\, r_{map}}, \qquad s_y = \frac{h_r}{f_y\, r_{map}},$$
where $h_r$ is the relative height, $r_{map}$ is the pixel resolution of the sub-map (the length in the world coordinate system represented by a single pixel, with units in meters per pixel (m/pix)), and $f_x$ and $f_y$ are the camera intrinsics, which can be given by Zhang's camera calibration method [41].
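A minimal OpenCV sketch of this pre-processing step is shown below. It rotates the camera image by the INS heading so that the image "up" direction points north and rescales it to approximately the map resolution. The symbols follow the reconstruction above (relative height, map resolution, focal lengths), and the sign of the rotation angle depends on the camera mounting, which we simply assume here.

```python
import cv2

def align_to_map(image, yaw_deg, rel_height, map_res, fx, fy):
    """Rotate by the heading and rescale so the pixel size matches the map.

    yaw_deg:    UAV heading from the INS (degrees, assumed equal to the image rotation).
    rel_height: height above ground (m).
    map_res:    map resolution (m/pixel).
    fx, fy:     focal lengths in pixels from camera calibration.
    """
    h, w = image.shape[:2]
    # Rotate about the image center so that the top of the image points north
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), yaw_deg, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h))
    # Scaling factors: camera ground sampling distance divided by map resolution
    sx = rel_height / (fx * map_res)
    sy = rel_height / (fy * map_res)
    new_size = (max(int(w * sx), 1), max(int(h * sy), 1))
    return cv2.resize(rotated, new_size, interpolation=cv2.INTER_AREA)
```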
Infrared imaging is based on the thermal radiation and temperature characteristic differences of ground scenes. However, in high-altitude ground imaging, the collected scenes are mainly large areas of buildings, forests, farmland, roads, etc., with high scene similarity and minor thermal radiation differences, leading to infrared imaging having blurred contours, high noise, and low contrast, which is not conducive to feature extraction. Therefore, before extracting features from infrared images, pre-processing of the infrared images can be performed to extract the potential features of the images better. We choose the gamma transformation to adjust image contrast and brightness to enhance details:
$$I' = c\, I^{\gamma},$$
where $I$ is the normalized grayscale value of the infrared image, $c$ is the grayscale scaling factor, $\gamma$ is the gamma value (in this study, we use 0.8), and $I'$ is the normalized grayscale value after the gamma transformation.
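A small sketch of this gamma transformation applied to the thermal images is given below, assuming 8-bit input, the gamma value of 0.8 stated above, and a scaling factor c of 1.

```python
import numpy as np

def gamma_transform(ir_image, gamma=0.8, c=1.0):
    """Apply I' = c * I^gamma to a normalized 8-bit infrared image."""
    norm = ir_image.astype(np.float32) / 255.0       # normalize to [0, 1]
    out = c * np.power(norm, gamma)                   # gamma < 1 brightens dark regions
    return np.clip(out * 255.0, 0, 255).astype(np.uint8)
```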
3.2.4. Camera Map Registration
For visual registration, the key lies in matching camera images with reference maps. This section will illustrate the complete process, including sub-map querying, geo-registration method, and anomaly match checking.
Figure 5 shows the whole process.
To achieve camera map registration, the first step is to search for the corresponding sub-map. In
Section 3.2.2, the sub-maps are indexed by their 2D geo-position. Therefore, we can query the required sub-map features by the position predicted by the INS prediction component.
SuperGlue [
42] is a widely used matching method for SuperPoint features. However, the front end of SuperGlue contains multiple attention mechanisms and fully connected networks, which presents a disadvantage for real-time processing. Therefore, we introduce SGMNet [
43] to replace this component to improve the efficiency.
When the camera image and sub-map features are sent to SGMNet, the network first computes two groups of reliable candidate matches as seed matches. The seed matches then pass through a multi-layer seeded graph network to determine the best matches. Finally, the best matches are sent to the back end of SuperGlue to obtain the final matches.
Due to pre-processing, the matched points of the camera image and sub-map should be connected with lines that are approximately parallel and of similar lengths. Note that, while camera images may be distorted due to pitch and roll, the distortion can be neglected in our method because the application scenario involves medium- to high-altitude operations. Furthermore, the pitch and roll of long-endurance drones during flight are relatively small.
For incorrect matches, the lines are usually not geometrically close, so it is possible to filter out incorrect matched point pairs by using the consistency of the matched-point line vectors, thereby removing the inconsistent pairs from the set of matched point pairs. Assume that after matching there are $N$ pairs of matched points in the collection, where the feature point coordinates in the camera image are $(u_i^c, v_i^c)$ and the corresponding feature point coordinates on the sub-map are $(u_i^m, v_i^m)$. Then, the line vector for each matched point pair is as follows:
$$\mathbf{l}_i = \left(u_i^m - u_i^c,\; v_i^m - v_i^c\right), \qquad i = 1, \dots, N.$$
For each $\mathbf{l}_i$, we can calculate its magnitude $d_i$ and slope $k_i$:
$$d_i = \left\| \mathbf{l}_i \right\|, \qquad k_i = \frac{v_i^m - v_i^c}{u_i^m - u_i^c}.$$
We define the distance consistency threshold $T_d$ and the direction consistency threshold $T_k$; then, we can obtain the consistency set $S_i$ for each $\mathbf{l}_i$:
$$S_i = \left\{ j \;\middle|\; \left| d_j - d_i \right| < T_d,\; \left| k_j - k_i \right| < T_k \right\}.$$
$S_i$ represents the set of matched point pairs that are similar to the corresponding $\mathbf{l}_i$. For each frame, there is a set $S_{\max}$ with the maximum capacity. We then take the matched point pairs in $S_{\max}$ as the filtered matches of the frame. If the number of filtered matches does not meet the threshold of a successful registration, the frame is considered a failed registration, thereby eliminating incorrectly matched images. If the number of matches meets the threshold, the obtained matches should have eliminated the vast majority of incorrect matches.
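The line-vector consistency filter described above can be sketched as follows (NumPy). The threshold values and the use of the vector angle instead of the raw slope (which avoids division by zero for vertical lines) are illustrative choices, not the exact settings used in our system.

```python
import numpy as np

def filter_matches(pts_cam, pts_map, t_dist=20.0, t_ang=np.deg2rad(5.0), min_matches=50):
    """Keep the largest set of matches with consistent line-vector length and direction.

    pts_cam, pts_map: (N, 2) arrays of matched pixel coordinates.
    Returns the indices of the filtered matches, or None if registration fails.
    """
    vec = pts_map - pts_cam                      # line vector of every matched pair
    mag = np.linalg.norm(vec, axis=1)            # magnitudes
    ang = np.arctan2(vec[:, 1], vec[:, 0])       # directions (used instead of the slope)
    best = None
    for i in range(len(vec)):
        # Wrap the angle difference to [-pi, pi] before thresholding
        dang = np.abs(np.angle(np.exp(1j * (ang - ang[i]))))
        consistent = np.where((np.abs(mag - mag[i]) < t_dist) & (dang < t_ang))[0]
        if best is None or len(consistent) > len(best):
            best = consistent
    if best is None or len(best) < min_matches:
        return None                              # failed registration
    return best
```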
3.3. Geo-Localization from Visual Registration
After matching, we obtain a set of matched point pairs. In this section, we describe how the location is obtained from the visual registration to produce the visual geo-localization result. Furthermore, we analyze the influence of attitude on geo-localization and design a compensation method.
3.3.1. Geo-Localization
The matching relationship between the images can be modeled as an affine transformation. Consequently, in our approach, the geo-location of any point of the reference map that appears in the images taken from the UAV can be determined by computing the affine transformation matrix $\mathbf{A}$. We use homography matrix estimation based on Random Sample Consensus (RANSAC) to determine the affine matrix. Note that the roll and pitch of a long-endurance UAV are small during normal flight, and long-endurance UAVs usually fly at high altitude; thus, it can be assumed that there is little 3D variation in the camera images.
If we assume that the center of the camera image $\mathbf{c}_{cam}$ corresponds to the location of the UAV, we can transform the camera image into the reference map by using $\mathbf{A}$. We obtain the transformed center of the camera image $\mathbf{c}'$, and the center of the reference map $\mathbf{c}_m$ represents the real-world location $\mathbf{p}_{map}$. Then, the location of the UAV $\mathbf{p}_{uav}$ can be given by:
$$\mathbf{p}_{uav} = \mathbf{p}_{map} + r_{map}\left(\mathbf{c}' - \mathbf{c}_m\right),$$
where the pixel offset between $\mathbf{c}'$ and $\mathbf{c}_m$ is scaled by the map resolution $r_{map}$ to obtain the east and north offsets, which are then converted into longitude and latitude.
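A sketch of this step with OpenCV is shown below: a RANSAC homography maps the camera image center onto the sub-map, and the pixel offset from the sub-map center is converted to a geographic position. The conversion from east/north meters to longitude/latitude uses a simple spherical-Earth approximation, which is our simplification for illustration.

```python
import cv2
import numpy as np

EARTH_R = 6378137.0  # spherical approximation (m)

def geolocate(pts_cam, pts_map, img_shape, map_shape, map_center_lonlat, map_res):
    """Estimate the UAV geo-position from filtered match pairs."""
    A, inliers = cv2.findHomography(pts_cam, pts_map, cv2.RANSAC, 5.0)
    if A is None:
        return None
    # Project the camera image center into the sub-map
    h, w = img_shape[:2]
    center = np.array([[[w / 2.0, h / 2.0]]], dtype=np.float32)
    u, v = cv2.perspectiveTransform(center, A)[0, 0]
    # Pixel offset from the sub-map center, converted to east/north meters
    mh, mw = map_shape[:2]
    east = (u - mw / 2.0) * map_res
    north = (mh / 2.0 - v) * map_res      # image v grows downward, north grows upward
    lon0, lat0 = map_center_lonlat
    lat = lat0 + np.degrees(north / EARTH_R)
    lon = lon0 + np.degrees(east / (EARTH_R * np.cos(np.radians(lat0))))
    return lon, lat
```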
3.3.2. Attitude Error Compensation
In this part, we use the NED (north-east-down) local geodetic coordinate frame as the navigation coordinate frame (n-frame) and the FRD (front-right-down) frame as the carrier coordinate frame (b-frame). Then, we can define the rotation matrices:
$$\mathbf{C}_b^n = \mathbf{R}_z(\psi)\,\mathbf{R}_y(\theta)\,\mathbf{R}_x(\gamma), \qquad \mathbf{C}_n^b = \left(\mathbf{C}_b^n\right)^{\mathrm{T}},$$
where $\gamma$, $\theta$, and $\psi$ are roll, pitch, and yaw, respectively. We define the unit vector $\mathbf{u}^b = [0\ \ 0\ \ 1]^{\mathrm{T}}$ along the down axis of the b-frame (the camera boresight). Note that the superscripts "$n$" and "$b$" represent vectors expressed in the corresponding frames. Then, we can transform this vector into the n-frame as:
$$\mathbf{u}^n = \mathbf{C}_b^n\,\mathbf{u}^b = \begin{bmatrix} u_N & u_E & u_D \end{bmatrix}^{\mathrm{T}}.$$
Then, the eastward error $\delta E$ and northward error $\delta N$ can be given by:
$$\delta E = h_r\,\frac{u_E}{u_D}, \qquad \delta N = h_r\,\frac{u_N}{u_D},$$
where $h_r$ is the relative height of the UAV.
Figure 6 shows the error caused by attitude.
Note that the error in Equation (19) is in meters within the n-frame rather than in latitude and longitude. Therefore, we need to transform the error into latitude and longitude. Then, the visual geo-localization can be described as follows:
$$L_{c} = L_{o} - \frac{\delta N}{R_M + h_{a}}, \qquad \lambda_{c} = \lambda_{o} - \frac{\delta E}{\left(R_N + h_{a}\right)\cos L},$$
where the positions with subscript "$o$" are the original visual geo-localization position, the positions with subscript "$c$" are the real horizontal position after attitude compensation, $h_{a}$ is the absolute height of the UAV, and $L$ is the latitude of the UAV.
After redefining the error, we need to build the new observation equation. The former term is the same as in Equation (6). As for the latter term in Equation (20), if we assume that there is little variation in height, we can regard $R_M + h_a$ and $\left(R_N + h_a\right)\cos L$ as constants. Then, we apply Taylor's formula and retain the first-order component, which yields the observation equation in Equation (21).
In Equation (21), the additional attitude-related terms are usually small in mid- and low-latitude regions, and in our experiment they were always negligible. Therefore, these terms can be omitted, and the observation equation reduces to the same form as Equation (6).
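The attitude compensation of this section can be sketched as below. It computes the ground offset of the camera boresight from the roll, pitch, yaw, and relative height, and subtracts it from the raw geo-localization. The rotation convention (ZYX Euler angles for a FRD body frame in NED) follows the reconstruction above, and the meter-to-degree conversion is again a spherical-Earth simplification.

```python
import numpy as np

EARTH_R = 6378137.0  # spherical approximation (m)

def boresight_offset(roll, pitch, yaw, rel_height):
    """North/east ground offset (m) of the camera boresight from the nadir point."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    # Third column of C_b^n = Rz(yaw) Ry(pitch) Rx(roll): the body down axis in NED
    d = np.array([cy * sp * cr + sy * sr,
                  sy * sp * cr - cy * sr,
                  cp * cr])
    dN = rel_height * d[0] / d[2]
    dE = rel_height * d[1] / d[2]
    return dN, dE

def compensate(lon, lat, roll, pitch, yaw, rel_height):
    """Remove the attitude-induced bias from a raw visual geo-localization (angles in rad)."""
    dN, dE = boresight_offset(roll, pitch, yaw, rel_height)
    lat_c = lat - np.degrees(dN / EARTH_R)
    lon_c = lon - np.degrees(dE / (EARTH_R * np.cos(np.radians(lat))))
    return lon_c, lat_c
```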
3.4. Filter State Update with Geo-Localization Observations
For the EKF, a poor observation often has a devastating impact on the estimation of the filtering states. Therefore, when there are outliers in the observations, adopting a conservative screening method can better ensure the proper functioning of the EKF. We use the above methods to maximize the correctness of geo-localization as much as possible. Despite this, the geo-localization results will inevitably contain some wrong localization consequences. To eliminate these inaccurate results, we introduce a conservative Gaussian elliptic constraint [
44] to inspect the availability of visual geo-localization before applying ST-EKF.
Figure 7 shows the Gaussian elliptic constraint.
In the ST-EKF, the diagonal elements of the covariance matrix $\mathbf{P}$ represent the error bounds of the state variables. Then, we can use the horizontal-position-related elements $\sigma_N^2$ and $\sigma_E^2$ in $\mathbf{P}$ to establish a Gaussian elliptic equation:
$$\frac{z_N^2}{\sigma_N^2} + \frac{z_E^2}{\sigma_E^2} \le T_g,$$
where $z_N$ and $z_E$ are the northward and eastward components of the observation $\mathbf{z}$, and $T_g$ is the threshold of the Gaussian ellipse, which is a hyperparameter.
If $\mathbf{z}$ does not satisfy the Gaussian elliptic constraint, the observation will not be used in integrated navigation.
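A minimal sketch of this constraint check is given below, assuming the north/east position variances are read from the corresponding diagonal entries of P (here at indices 6 and 7, matching the error-state ordering assumed in Section 3.1) and that the threshold is a tuned hyperparameter.

```python
def passes_gaussian_ellipse(z, P, threshold, idx_n=6, idx_e=7):
    """Accept the visual observation only if it lies inside the covariance ellipse.

    z: 2D innovation [north, east] (INS prediction minus visual fix).
    P: error-state covariance matrix; idx_n/idx_e index its horizontal-position variances.
    """
    sigma_n2 = P[idx_n, idx_n]
    sigma_e2 = P[idx_e, idx_e]
    score = z[0] ** 2 / sigma_n2 + z[1] ** 2 / sigma_e2
    return score <= threshold
```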
The pseudocode of the state update is given in Algorithm 1:
Algorithm 1: Vision–Inertial Integrated Navigation
1. Input: navigation state of the last frame x, IMU measurements [gyro(1×3), acc(1×3)]ᵀ, vision observation z, covariance matrix P
2. Output: integrated navigation state x and covariance matrix P
3. x ← INS prediction from x using [gyro(1×3), acc(1×3)]ᵀ
4. if z satisfies Equation (22) then
5.   Compute Φ according to x
6.   P ← Φ P Φᵀ + Q
7.   K ← P Hᵀ (H P Hᵀ + R)⁻¹
8.   δx ← K (z − H x)
9.   P ← (I − K H) P (I − K H)ᵀ + K R Kᵀ
10.  x ← x + δx
11. end if
12. return x, P
4. Experiment
To prove the effectiveness of the proposed method, we designed a set of experiments, including a comparison of several feature-matching methods, a visual geo-localization experiment, and an integrated navigation experiment.
4.1. System Setup
The experiment location was in Jinmen, Hubei. We used a CH-4 UAV as the carrier, and the experimental system includes a Ring Laser Gyro (RLG) IMU with a drift rate of 0.05°/h running at 200 Hz, a set of down-looking cameras consisting of a visible-light camera and an infrared camera, a barometric altimeter, and a satellite receiver that outputs differential GPS data as the ground truth. The experiment area is shown in
Figure 8a.
Figure 8b shows the UAV and the camera setup, and
Table 2 lists the main parameters of the cameras.
4.2. Dataset Description and Evaluation Method
Two datasets were used to test the proposed method. Dataset 1 was collected on 22 May 2024, which includes daytime and nighttime RGB and thermal camera images. Dataset 2 was obtained on 9 May 2024; this dataset only has visible-light images. The details of the two datasets are shown in
Table 3.
Based on the two datasets, three groups of experiments were conducted. The first experiment is feature matching. There are three feature extraction methods used in the experiment: SIFT, XFeat and SuperPoint. In the experiment, we chose 50 matches as the threshold of an acceptable visual registration and defined the matching rate as the proportion of accepted visual registrations in the dataset. In addition, both visible-light and thermal images are registered with a visible-light map.
The second experiment is the geo-localization experiment. We use features extracted by SuperPoint to match images and evaluate precision using root mean square error (RMSE) in experiments. We set 100 m as the threshold of a valid geo-localization result. Like the definition of matching rate, we define the valid matching rate as the proportion of the valid geo-localization in the dataset. In addition, we compare the accuracy and number of effective geo-localizations with or without attitude compensation to verify the effectiveness of attitude compensation.
The last experiment is the integrated navigation experiment. Since IMU data are only available in Dataset 2, we use Dataset 2 alone for the integrated navigation experiment. Note that we used a multi-sample attitude algorithm with optimized coning coefficients; given our navigation-grade IMU, we chose the two-sample algorithm [
35,
36,
37] to solve the inertial navigation states. We prove the effectiveness of the proposed algorithm by comparing errors with or without the proposed filtering methods.
4.3. Experimental Results
4.3.1. Registration Rate Results
This experiment presents the matching rates of the three features, SIFT, XFeat, and SuperPoint, on the two datasets.
Figure 9 and
Figure 10 show the performance of the three features in Dataset 1 and Dataset 2, respectively.
The matching rate of SIFT on both visible-light and thermal images is far lower than that of XFeat and SuperPoint. On visible-light images, XFeat achieves a matching rate of 31.14%, close to the 32.72% of SuperPoint. However, on thermal images, XFeat performs worse than SuperPoint. SuperPoint performs best on infrared images among the three features, with a matching rate of 68.67%. Considering the effectiveness of registration, the effective matching rate of SIFT on visible-light images is only 2.9%, and only two images exhibit linear errors of less than 100 m. Although XFeat maintains satisfactory localization precision on visible-light images, its performance on thermal images remains suboptimal. SuperPoint has the best localization performance, with an effective matching rate of 22.47% on visible-light images and 54.03% on thermal images.
Additionally, from
Figure 9, it can be concluded that while the geo-localization rate of visible images is superior to that of thermal images, thermal images can address the limitation of visible images being unusable at nighttime.
Generally, the factors affecting the matching rate include the FOV of cameras, the reference map accuracy, the complexity of the flight area feature changes and others. The results of the three features in Dataset 2 exhibit performance consistent with the aforementioned analysis. Note that the matching rate in Dataset 2 is lower than that in Dataset 1. It is mainly because the cruising altitude of Dataset 2 is lower than that of Dataset 1, which means images in Dataset 2 have a smaller FOV than those in Dataset 1. This indicates that images from Dataset 2 contain fewer distinctive features for extraction, consequently leading to a lower matching rate. Additionally, as a lightweight model, XFeat demonstrates greater degradation than SuperPoint in matching performance due to FOV contraction and the reduced availability of features. The sample images of XFeat and SuperPoint in
Section 3.2.4 also demonstrate similar conclusions.
4.3.2. Visual Geo-Localization Results
This experiment mainly presents the efficiency of attitude compensation. We use RMSE to describe the precision of geo-localization.
We evaluate the RMSE of geo-localization above in
Figure 11. The figure shows that attitude compensation improves the precision of geo-localization. In addition, after attitude compensation, visible-light geo-localization is more precise than thermal geo-localization. Finally, by comparing the precision between Dataset 1 and Dataset 2, we find that flight height influences the precision of geo-localization. The reason is that, according to Equation (11), the pixel resolution of the map and the relative height determine the scaling factor. If the scaling factor is excessively small, it is difficult to extract enough features for registration, which leads to a worse matching rate. Therefore, for better registration performance, we try to keep the scaling factor between 0.5 and 1, which means that the flight altitude should be confirmed and a map with the corresponding pixel resolution chosen before the flight.
Figure 12 presents visible-light and thermal sample images on which attitude compensation has a noticeable effect. Both samples are registered well, but the geo-localization results show significant errors. This illustrates that the horizontal attitude angles significantly affect visual geo-localization as the UAV approaches a turning point. In our experiment, the UAV consistently operated in level flight at a specified altitude, so the experimental data only clearly indicate that the roll angle significantly influences the geo-localization precision.
4.3.3. Visual–Inertial Integrated Navigation Results
We use Dataset 2 to present the effectiveness of our VINS method.
Figure 13 shows the trajectory of integrated navigation.
The experiment lasted 5170 s. The start point was obtained from IMU/GNSS integrated navigation. Then, we ran the proposed VINS and pure inertial navigation (PINS) from the start point. From
Section 4.3.1, we know that there are 997 correct matches. In integrated navigation, 1007 matches are used in the ST-EKF, which indicates that the proposed attitude compensation and Gaussian elliptic constraint can improve the availability of vision geo-registration. We use GPS/INS integrated navigation as the ground truth and compare the VINS results with and without the Gaussian elliptic constraint. We evaluate the precision using the root mean square error (RMSE).
Table 4 shows the precision statistics of the experiment. When using integrated navigation without the Gaussian elliptic constraint, the RMSE is 112.65 m. In contrast, the RMSE of the proposed method is 42.38 m. As demonstrated in
Figure 13 and
Table 4, the Gaussian elliptic constraint identified and excluded five outliers. These high-error points, if incorporated into the EKF, would significantly compromise the precision of the integrated navigation solution. This experimental result statistically corroborates the efficacy of the proposed Gaussian elliptic constraint.
Figure 14 shows some sample matches eliminated by the Gaussian elliptic constraint. Most eliminated matches occur in environments characterized by repetitive and homogeneous terrain. Visual registration cannot remain robust in these environments: either there are too few features to match, or the image is matched to terrain that merely resembles the real position, resulting in a localization error such as the one in
Figure 14. From the trajectory, we can see that such mismatches have a negative effect on the ST-EKF, but they are difficult to eliminate using visual strategies alone. The Gaussian elliptic constraint can eliminate them by using the covariance matrix of the ST-EKF, leaving more accurate observations for the integrated navigation system.
5. Conclusions
In this study, we establish a VINS framework to solve long-term navigation problems in GNSS-denied situations. The framework consists of vision geo-registration, INS prediction, and the ST-EKF, and it uses vision geo-registration to correct the pure-INS (PINS) drift. We analyze the influence of attitude on visual geo-localization and propose a compensation method that increases its precision. The proposed method provides the ST-EKF with more observations and smaller position errors in GNSS-denied situations. The designed experiments prove the effectiveness of our method.
From the experiments, we find that vision geo-registration does not work well in mountainous or other highly repetitive scenes. Improving visual geo-registration in such situations will be the next step of our research.