Article

UAVs-Based Visual Localization via Attention-Driven Image Registration Across Varying Texture Levels

1 School of Artificial Intelligence, Shenyang Aerospace University, Shenyang 110136, China
2 School of Computer Science and Engineering, Dalian Minzu University, Dalian 116600, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(12), 739; https://doi.org/10.3390/drones8120739
Submission received: 13 November 2024 / Revised: 5 December 2024 / Accepted: 6 December 2024 / Published: 9 December 2024

Abstract

This study investigates the difficulties associated with image registration due to variations in perspective, lighting, and ground object details between images captured by drones and satellite imagery. It proposes an image registration and drone visual localization algorithm based on an attention mechanism. Initially, an improved Oriented FAST and Rotated BRIEF (ORB) algorithm incorporating a quadtree-based feature point homogenization method is designed to extract image feature points, providing support for the initial motion estimation of UAVs. Following this, we combined a convolutional neural network with an attention mechanism and the inverse compositional Lucas-Kanade method to further extract image features. This integration facilitates the efficient registration of drone images with satellite tiles. Finally, we utilized the registration results to correct the initial motion of the drone and accurately determine its location. Our experimental findings indicate that the proposed algorithm achieves an average absolute positioning error of less than 40 m for low-texture flight paths and under 10 m for high-texture paths. This significantly mitigates the positioning challenges that arise from inconsistencies between drone images and satellite maps. Moreover, our method demonstrates a notable improvement in computational speed compared to existing algorithms.

1. Introduction

In recent years, Unmanned Aerial Vehicle (UAV) technology has rapidly advanced, resulting in its widespread use across several key sectors, such as meteorological observation, agricultural monitoring, geological surveying, and military operations [1,2,3,4]. These applications demand high-precision location estimates to ensure the autonomous flight and effective task execution of UAVs, making efficient positioning systems essential. UAV navigation systems currently employ various methods, including Global Navigation Satellite Systems (GNSS), Inertial Navigation Systems (INS), and radio navigation systems. GNSS is particularly prevalent due to its extensive coverage and established reliability [5]. However, its positioning accuracy can be adversely affected by external factors and may struggle to provide stability in environments with weak or absent signals [6]. Therefore, it is crucial to explore UAV positioning technologies that can adapt to a range of environmental conditions.
The rapid advancement of computer vision technology and visual sensors has underscored the benefits of vision-based positioning and navigation techniques in addressing various challenges [7]. These methods provide effective solutions for the safe and efficient operation of Unmanned Aerial Vehicles (UAVs), especially in environments where Global Navigation Satellite Systems (GNSS) signals are unreliable, making their potential applications highly promising. As a result, vision-based positioning has become a focal point of research. Vision-based positioning can be divided into two main approaches: absolute visual positioning and relative visual positioning [8]. Absolute visual positioning involves comparing UAV aerial images captured with prior satellite images to identify the most similar regions through a search process, enabling the UAV to ascertain its location on the satellite map [9]. While this approach offers high accuracy, it depends on prior satellite maps and is sensitive to environmental variations. Conversely, relative visual positioning estimates the motion of the drone by analyzing consecutive frames captured by the UAV, without relying on prior maps. However, this method is susceptible to cumulative errors during prolonged flights. Researchers are actively investigating improved visual positioning techniques to enhance UAV autonomous navigation in complex environments.
Oriented FAST and Rotated BRIEF (ORB) is an efficient algorithm for detecting and describing image feature points, widely used in applications of computer vision such as object detection and motion tracking [10]. The algorithm is based on Features from Accelerated Segment Test (FAST) [11] for rapid feature point detection and integrates Binary Robust Independent Elementary Features (BRIEF) [12] to construct binary descriptors. ORB introduces two key improvements over the original methods: first, it calculates the grayscale centroid of the neighborhood around each feature point, enabling orientation invariance and robustness to image rotation. Second, it employs a pyramid structure to achieve scale invariance, enabling robust feature detection across varying image scales. Furthermore, ORB refines the binary pattern generation process of BRIEF descriptors, improving feature matching accuracy while maintaining high computational efficiency. Compared to traditional feature extraction algorithms such as SIFT and SURF, ORB provides superior robustness at significantly lower computational costs. This makes it particularly suitable for real-time applications with stringent performance requirements, such as UAV navigation, image stitching, and target tracking.
Subsequently, numerous excellent image feature-based methods have been used for visual localization. For example, Patel et al. [13] utilized the SURF algorithm to align UAV aerial images with satellite remote sensing maps from Google Maps, determining the position of the UAV based on the registration results. Majidizadeh et al. [14] employed a U-Net network to segment roads and buildings in aerial imagery; however, this algorithm was limited to urban settings due to its reliance on road segmentation. Zhong et al. [15] proposed a monocular visual odometry method based on information entropy and Lucas-Kanade optical flow for the SLAM domain. The works in [16,17] proposed visual SLAM algorithms that fuse point and line features, achieving simultaneous localization and mapping in indoor low-texture scenes, but these are not applicable to aerial scenes. Goforth et al. [18] implemented a neural network-based image registration technique to correct UAV motion, but this approach faced challenges with slow computation speeds and suboptimal performance in low-texture areas. Such limitations are also prevalent in visual odometry methods [19], which tend to accumulate errors over long distances and require adequate overlap between consecutive frames. Therefore, enhancing existing methods to mitigate cumulative errors is of paramount importance.
Building upon previous research, this study proposes a novel positioning method that combines relative visual positioning with absolute visual positioning. Initially, visual odometry techniques are employed to match consecutive frames of UAV aerial images, allowing for a preliminary estimation of the motion trajectory of UAV. Subsequently, the UAV aerial images are matched with satellite images of the target area to effectively eliminate cumulative errors. Experimental validation demonstrates that this method performs exceptionally well under low-texture image conditions, providing significant academic and practical value for UAV positioning research in low-texture environments such as wilderness and mountainous regions. The main contributions of this study are as follows:
  • This study introduces a quadtree feature point uniformization algorithm applied to Oriented FAST and Rotated BRIEF (ORB) feature points. The aim is to reduce the negative impact of densely concentrated feature points in areas with significant texture variation on motion estimation. This enhancement improves the accuracy of motion estimation for UAVs during high-speed flight in low-texture environments;
  • This study presents a twin neural network registration algorithm that leverages an attention mechanism by integrating the Convolutional Block Attention Module (CBAM) into the feature extraction layer of the neural network. The Lucas-Kanade algorithm is then employed to align the feature maps processed by the convolutional network. This enhanced algorithm shows a marked improvement in matching performance for low-texture images;
  • This study proposes a novel vision positioning method that combines relative and absolute visual positioning through image registration. This method effectively mitigates the cumulative error associated with relative visual positioning during long-distance flights, enabling rapid and accurate positioning in low-texture environments, such as agricultural fields and river landscapes.
The subsequent sections of this paper are organized as follows: Section 2 reviews related research, focusing on image feature extraction and matching methods, as well as their applications in visual positioning. In Section 3, we provide a detailed explanation of the proposed vision positioning method based on image registration, covering the structural details of the image registration network, the estimation of the initial motion of UAV, motion correction, and specific details of the positioning process. Section 4 presents the experimental design and result analysis. Finally, Section 5 concludes the study, summarizing the research process and experimental findings, while also outlining future research directions.

2. Related Works

Image matching is a fundamental element of UAV visual positioning and serves as a critical step in the image registration process. This process involves extracting and matching features, as well as estimating the transformation matrix between two images to align them within the same coordinate system. Numerous researchers have proposed various classic image matching algorithms, including those based on templates, grayscale, and features. Among these methods, features such as points, lines, and textures are crucial for efficiently representing the entire image due to their simplicity and stability, forming the foundation of mainstream image matching algorithms [20]. Harris et al. introduced the Harris corner detection operator, which defines corners based on the rate of intensity change in two orthogonal directions [21]. This operator employs second-order moments or autocorrelation matrices to expedite the search for local extrema while maintaining directional invariance. To further reduce computational complexity, faster feature extraction methods based on pixel grayscale comparisons in local regions were developed, including the SUSAN operator [22] and FAST [11]. Rublee et al. later introduced the Oriented FAST and Rotated BRIEF (ORB) algorithm, which has been widely adopted for visual tasks [10]. However, these methods are often sensitive to scale and affine transformations. To address these challenges, David Lowe proposed the classic scale-invariant feature transform (SIFT) algorithm [23], followed by enhanced algorithms like SURF [24] and ASIFT [25]. These algorithms demonstrate strong performance in terms of translation and rotation invariance and are robust against variations in illumination, noise, and slight changes in viewing angles. Nonetheless, traditional image matching algorithms typically rely on manually designed feature extraction and matching strategies, which, although effective in simple scenes, may struggle in complex backgrounds and highly variable images.
With the rapid development of deep learning, it has been widely applied in various fields such as natural language processing and computer vision [26,27,28]. In the field of image processing, deep learning-based methods have gradually overcome the limitations of traditional feature matching techniques, owing to their exceptional ability in deep feature learning and representation. These methods excel in learning and representing complex features, resulting in significant performance improvements in image processing. Consequently, they have increasingly supplanted traditional handcrafted feature approaches, positioning themselves at the forefront of contemporary image matching technology. Modern descriptors generally train deep networks on pre-cropped patches derived from SIFT keypoints. Notable examples of these descriptors include Deepdesc [29], L2-Net [30], and LogPolarDesc [31]. In recent research, Verdie et al. [32] developed a reliable keypoint detection framework called TILDE. This framework successfully addresses the challenges posed by significant illumination variations due to changes in weather, seasons, and time, facilitating the reliable detection of repeatable keypoints. Barroso-Laguna et al. [33] developed KeyNet, a keypoint detector that utilizes a multi-scale shallow structure. This method efficiently extracts keypoints and exhibits strong adaptability to changes in image perspective. Similarly, Daniel et al. [34] introduced SuperPoint, a self-supervised learning framework based on fully convolutional neural networks. This framework features both a localization decoder and a description decoder, enabling the simultaneous extraction of feature point descriptors. Building on this foundation, SuperGlue [35] proposed a multi-layer graph neural network that integrates an attention mechanism. This innovative approach consolidates contextual information from images, allowing the matching process to consider both local features of keypoints and global information across the entire graph. By transforming the matching challenge into an optimal transport problem, SuperGlue significantly enhances matching performance. Another promising approach is the use of end-to-end learning networks, such as LIFT [36] and LoFTR [37], which also demonstrate exceptional performance in feature point matching tasks. LoFTR leverages self-attention and cross-attention layers from the Transformer architecture to extract feature descriptors from two images. Initially, it establishes pixel-level dense matching at a coarse level, followed by a refinement process at a finer level, effectively achieving reliable matching results even in low-texture regions.
D2-Net [38] tackles the challenge of pixel-level matching in complex imaging conditions through dense feature extraction and description, resulting in sparse feature points and their corresponding descriptors. Hou et al. [39] applied D2-Net to map matching for UAV visual positioning, demonstrating that the average positioning error under seasonal variations meets the localization requirements for UAVs. However, this method lacks robustness during long-distance flights and does not consider the effects of low-texture environments. In urban landing scenarios, Xu et al. [40] introduced the YOLOX network with an attention mechanism, achieving notable performance. Meanwhile, Marius-Mihail Gurgu [41] employed SuperGlue for localization in non-urban environments, such as fields and forests; however, this approach relies on retrieving the rotational angle of the onboard camera from image metadata. The 2chADCNN model [42] integrates a dual-channel fully convolutional neural network with an attention mechanism to focus on season-invariant features, yielding favorable results in scenarios with seasonal changes. Nevertheless, this method still depends on prior satellite maps for practical localization, which limits its applicability. An innovative approach proposed in [43] combines MEMS-based inertial sensors with visual odometry by enhancing the visual odometry pipeline. This method replaces the feature matching step with the Lucas-Kanade algorithm and employs an Extended Kalman Filter (EKF) to fuse data from both systems for accurate estimation of position and heading, while correcting for errors in the inertial sensors. Building on these prior works, we aim to leverage the benefits of visual odometry while addressing its cumulative error challenges.

3. Proposed Methodology

Image registration methods based on visual odometry focus primarily on aligning two homogeneous images. These methods generally perform better than absolute visual positioning across various environments, including urban, indoor, and outdoor settings. However, a significant drawback of visual odometry is its cumulative error over long-distance flights, which does not affect absolute visual positioning. Building on insights from previous research, this paper introduces a novel algorithm for image registration and UAV visual positioning that leverages an attention mechanism. This innovative approach aims to effectively integrate visual odometry with absolute visual positioning. By doing so, it can significantly reduce cumulative errors during long-distance flights, thereby improving localization accuracy.
The following sections will provide a detailed description of the method we propose. We begin by outlining the overall workflow of our approach, which consists of three key stages. Section 3.1 introduces the first stage, focusing on the improved Oriented FAST and Rotated BRIEF (ORB) algorithm and its application in estimating the initial motion of the drone. Section 3.2 thoroughly discusses the second stage, highlighting the principles and implementation details of the attention-based twin neural network we designed. Section 3.3 corresponds to the third stage, where we explain the process of motion correction for the drone and the detailed implementation of the final positioning.
Figure 1 illustrates the flowchart of the proposed localization algorithm, which consists of three key stages. First, in the initial stage, we employ an improved visual odometry algorithm to estimate the initial motion matrix of the drone by matching feature points between consecutive image frames captured by the UAV. This innovative approach significantly enhances both the accuracy and speed of motion estimation. Specifically, visual odometry algorithms are prone to cumulative errors. To address this, we improve the Oriented FAST and Rotated BRIEF (ORB) algorithm and extract feature points from consecutive frames of drone images. The FLANN algorithm is then used to match the feature points and estimate the initial motion. This stage is discussed in detail in Section 3.1, with the goal of quickly estimating the UAV’s initial motion, while the cumulative errors remain unaddressed at this point. Then, in the second stage, the initial motion of the UAV estimated in the first stage allows us to obtain the current UAV aerial image as well as its corresponding satellite image. We design an attention-based twin neural network to extract image features from both the UAV’s aerial image and the satellite tile image. This network will be described in detail in Section 3.2. The extracted image features from the drone’s aerial images and satellite tile images are then fed into the third stage. Finally, in the third stage, the image features of the UAV aerial image and the satellite tile image obtained in the second stage are used with the inverse composition Lucas-Kanade algorithm to achieve precise registration between the satellite image and the UAV aerial image, generating a correction matrix to eliminate the cumulative errors present in the visual odometry process. By combining the initial motion matrix of the drone with the correction matrix, the final motion of the UAV for each frame can be determined. Through the continuous estimation of motion, the UAV can accurately localize itself within the satellite map, achieving precise positioning. This stage is described in detail in Section 3.3.
To further clarify, the core idea of our method is to combine relative visual localization and absolute visual localization for UAVs, overcoming two major limitations of existing techniques: (1) eliminating the impact of cumulative errors in visual odometry, and (2) significantly reducing the time required for searching the satellite image database in absolute visual localization methods. The first stage is essentially part of the UAV relative visual localization. However, during long-distance flights, cumulative errors cannot be eliminated. To address this, we have designed an improved Oriented FAST and Rotated BRIEF (ORB) feature extraction algorithm to rapidly estimate the UAV’s initial motion by extracting image features. The second and third stages are essentially part of the UAV absolute visual localization. Compared to traditional absolute visual localization methods, since the initial motion of the drone has already been estimated in the first stage, there is no need to search through the entire satellite image database. We designed an attention-based twin neural network to extract and register image features from both the UAV aerial images and the corresponding satellite tile images, thereby correcting the UAV’s motion and eliminating cumulative errors. This approach effectively improves both the accuracy and speed of the localization process.

3.1. Estimation of Initial Motion for UAVs

Affine transformations maintain the linearity of images, which helps reduce distortions in the representation of roads and building boundaries [44]. As a result, these transformations are commonly employed in the kinematic and dynamic modeling of drones [45]. Affine transformations offer six degrees of freedom, enabling various geometric adjustments such as rotation, translation, and scaling. To estimate the continuous motion of the drone, it is crucial to establish the initial motion matrix $A_{rel}$ based on the relationships between affine transformations of consecutive images captured by the drone. Next, the aerial images from the current frame are aligned with satellite tile images using a neural network alongside the Lucas-Kanade algorithm, resulting in a correction matrix $A_{corr}$. This correction is then applied to the initial motion of the drone, leading to a more precise estimation of its movement. The affine matrix $A_{rel}$ between two frames of neighboring images can be defined as follows:
$$A_{rel} = \begin{bmatrix} \alpha & -\beta & t_x \\ \beta & \alpha & t_y \\ 0 & 0 & 1 \end{bmatrix} \quad (1)$$
where $\alpha$ and $\beta$ represent the parameters for the rotation and height changes of the drone, with $\alpha = s \cdot \cos\theta$ and $\beta = s \cdot \sin\theta$; $s$ indicates the scaling factor between the UAV aerial images and the reference map, and $\theta$ denotes the yaw angle between two consecutive frames. The parameters $t_x$ and $t_y$ correspond to the translation values of the drone in the x and y directions, respectively. The scaling factor $s$ is determined by the ratio of the ground sample distance (GSD) of the UAV aerial image to that of the satellite image. Taking the experimental dataset from the Shenyang region in China as an example, the $GSD_s$ of the Google satellite map at zoom level 17 is 1.76 m per pixel. The $GSD_u$ of the UAV aerial image can be calculated from the metadata of the aerial images and is approximately 0.6 m per pixel. Therefore, the scaling factor is approximately 0.34, meaning the drone's aerial image is scaled to 34% of its original size.
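The scale factor and the affine matrix of Equation (1) can be assembled directly from these quantities. The short sketch below reproduces the Shenyang example figures; the function names and the use of NumPy are illustrative assumptions rather than part of the original implementation.

```python
import numpy as np

def scale_factor(gsd_uav=0.6, gsd_sat=1.76):
    """s = GSD_u / GSD_s; with the Shenyang example this is roughly 0.34."""
    return gsd_uav / gsd_sat

def build_A_rel(s, theta_rad, t_x, t_y):
    """Four-parameter affine of Equation (1): rotation theta, isotropic scale s, translation."""
    alpha = s * np.cos(theta_rad)
    beta = s * np.sin(theta_rad)
    return np.array([[alpha, -beta, t_x],
                     [beta,   alpha, t_y],
                     [0.0,    0.0,   1.0]])
```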
To accurately track the movement of the drone on the satellite map, it is essential to first establish a correlation between the initial frame captured at takeoff and the satellite map. At the moment the initial frame $f = 1$ is captured, the position and heading of the drone are assumed to be known. With this information, the homography matrix $A_{rel}^{f}$ can be computed, accounting for the approximate location, orientation, and scale of the initial frame on the satellite map. As the drone navigates, the relationship between the current frame $f$ and the previous frame $f-1$ is described by the homography matrix, which captures the changes in position. The position $L_f$ of the drone in the current image can be expressed as:
$$L_f = A_{rel}^{f} \cdot A_{rel}^{f-1} \cdot A_{rel}^{f-2} \cdots L_1 \quad (2)$$
where $A_{rel}^{f}$ denotes the affine transformation that occurs between frames $f$ and $f-1$, and $L_1$ represents the position of the first frame of the UAV aerial image in the satellite image. This demonstrates an iterative process, where the position of the first frame image, $L_1$, determines the position of the second frame image, $L_2$, and so on, until the position of the current frame, $L_f$, is reached. This transformation accounts for the rotation and translation of the drone as it moves between these two adjacent frames.
Based on the aforementioned principles, this study designs an improved Oriented FAST and Rotated BRIEF (ORB) algorithm to detect feature points, thereby estimating the motion of the drone. Initially, feature points are detected in both the current frame $f$ and the previous frame $f-1$. Next, the FLANN [46] algorithm is used to match the detected points, while the RANSAC [47] algorithm is applied to eliminate any mismatched pairs. For the matched feature point pairs, the pixel coordinates in frame $f-1$ are denoted as $(x, y)$, while those in frame $f$ are represented as $(x', y')$. The relationship between the affine matrix $A_{rel}$ of the adjacent frames and the feature point coordinates can be expressed as:
$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = A_{rel} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha & -\beta & t_x \\ \beta & \alpha & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad (3)$$
Expanding Equation (3) yields:
$$\begin{cases} x' = \alpha x - \beta y + t_x \\ y' = \beta x + \alpha y + t_y \end{cases} \quad (4)$$
To accurately determine the initial motion $A_{rel}$, at least three pairs of feature points are needed. Thus, acquiring a sufficient number of reliable matching point pairs between adjacent frames is essential. However, feature points tend to cluster around corners and areas with significant texture variations in the images. Consequently, many feature points may lie outside the overlapping regions of consecutive images, while only a limited number are detected within the overlapping areas. This uneven distribution can lead to incorrect calculations of the initial motion parameters. Figure 2 illustrates the distribution of feature points in the UAV aerial images.
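As a rough illustration of this step, the sketch below estimates $A_{rel}$ from matched feature points with OpenCV. It uses the stock ORB detector in place of the quadtree-uniformized variant described in the following paragraphs, and `cv2.estimateAffinePartial2D`, which fits exactly the $\alpha, \beta, t_x, t_y$ model of Equation (1) under RANSAC; all parameter values are assumptions.

```python
import cv2
import numpy as np

def estimate_initial_motion(frame_prev, frame_curr, n_features=1000):
    """Estimate the 3x3 matrix A_rel between two consecutive grayscale UAV frames."""
    orb = cv2.ORB_create(nfeatures=n_features)
    kp1, des1 = orb.detectAndCompute(frame_prev, None)
    kp2, des2 = orb.detectAndCompute(frame_curr, None)

    # FLANN matching with an LSH index, suitable for binary ORB descriptors
    index_params = dict(algorithm=6, table_number=6, key_size=12, multi_probe_level=1)
    flann = cv2.FlannBasedMatcher(index_params, dict(checks=50))
    matches = flann.knnMatch(des1, des2, k=2)
    good = [m[0] for m in matches if len(m) == 2 and m[0].distance < 0.7 * m[1].distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])   # coordinates in frame f-1
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])   # coordinates in frame f

    # RANSAC-based fit of the rotation + uniform scale + translation model of Eq. (1)
    M, inliers = cv2.estimateAffinePartial2D(pts1, pts2, method=cv2.RANSAC,
                                             ransacReprojThreshold=3.0)
    return np.vstack([M, [0.0, 0.0, 1.0]])                   # promote 2x3 to 3x3 form
```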
To resolve the problem of concentrated feature point distribution illustrated in Figure 2, it is essential that the overlapping regions of consecutive images are sufficiently large to ensure accurate motion estimation of the drone. However, when the drone flies at high speeds, the overlap between consecutive images is often significantly reduced. This study proposes a quadtree-based feature point uniformization algorithm to ensure a more balanced distribution of feature points across the entire image. This approach enables accurate motion estimation even during high-speed drone flight.
The quadtree-based feature point uniformization algorithm consists of four stages, as illustrated in Figure 3. The process begins by setting an initial cell size according to the aspect ratio of the UAV aerial images, followed by the initial cell partitioning. Next, the Oriented FAST and Rotated BRIEF (ORB) feature point detection algorithm is applied to extract feature points within each cell. In the second stage, each cell is further divided into smaller nodes. Initially, it is assumed that each cell contains only one node. If a node has more than one feature point, it is split into four sub-nodes. Conversely, nodes without any feature points are removed. In the third stage, nodes are recursively split until either the total node count reaches a predetermined threshold or further division is no longer feasible. In the fourth stage, the feature point with the highest Harris corner response value is selected from each node. Ultimately, other feature points are removed to achieve a uniform distribution of feature points.
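A minimal sketch of this uniformization procedure is given below. It simplifies the first stage to a single root cell covering the whole image, splits the most populated node first, and keeps the keypoint with the highest response per node (OpenCV exposes this as `KeyPoint.response`); the node-count threshold is an assumption.

```python
import cv2

def quadtree_uniformize(keypoints, img_w, img_h, max_points=500):
    """Return a spatially uniform subset of ORB keypoints via recursive quadtree splitting."""
    nodes = [(0.0, 0.0, float(img_w), float(img_h), list(keypoints))]  # root cell = whole image
    while len(nodes) < max_points:
        splittable = [n for n in nodes if len(n[4]) > 1]
        if not splittable:                               # no node can be divided further
            break
        x, y, w, h, kps = max(splittable, key=lambda n: len(n[4]))
        nodes.remove((x, y, w, h, kps))
        hw, hh = w / 2, h / 2
        for (nx, ny) in [(x, y), (x + hw, y), (x, y + hh), (x + hw, y + hh)]:
            sub = [k for k in kps if nx <= k.pt[0] < nx + hw and ny <= k.pt[1] < ny + hh]
            if sub:                                      # drop empty child nodes
                nodes.append((nx, ny, hw, hh, sub))
    # keep the strongest keypoint in each surviving node
    return [max(n[4], key=lambda k: k.response) for n in nodes]

# usage: kps = cv2.ORB_create(3000).detect(gray, None); kps = quadtree_uniformize(kps, w, h)
```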

3.2. Siamese Neural Network Based on Attention Mechanism

This study introduces a novel attention-enhanced Siamese neural network designed specifically for registering UAV aerial images with satellite tiles. By incorporating attention mechanisms, the model significantly boosts the feature extraction capability of the Siamese network. As depicted in Figure 4, the network consists of two identical branches, each composed of three convolutional blocks followed by a Convolutional Block Attention Module (CBAM). Inspired by the initial three blocks of the VGG16 architecture, the first two convolutional blocks in each branch consist of two ReLU-activated convolutional layers and a max-pooling layer. The third block includes two convolutional layers with ReLU activation, while the final layer in this block is unactivated. The CBAM consists of two submodules, namely the channel attention module and the spatial attention module, connected in series [48]. Within each branch, the CBAM refines the feature representation by employing these two attention sub-modules. The channel attention mechanism reduces the spatial dimension while keeping the channel dimension intact, which enhances the network's ability to capture and emphasize important channel features. Spatial attention compresses the channel dimension while preserving the spatial dimension, which allows the model to focus on key positional information of objects within the image.
This attention mechanism enables the network to automatically identify and emphasize critical feature regions within UAV aerial images and satellite images, such as buildings in agricultural fields, exposed riverbeds, and large structures within urban areas. This ensures that these significant features retain their prominence even after the images undergo transformations such as translation and rotation. The integration of the Convolutional Block Attention Module (CBAM) not only enhances the ability of the model to capture subtle texture variations in the images but also significantly improves the overall performance of the model, providing robust support for the accurate registration of UAV aerial images and satellite tile images.
Next, a detailed description of the attention mechanism-based Siamese neural network is provided. First, UAV aerial images and satellite tile images are used as the two inputs to the Siamese neural network, which generates feature maps $F \in \mathbb{R}^{C \times H \times W}$ after passing through the three convolutional blocks. Subsequently, the feature map $F$ undergoes max-pooling and average-pooling operations in the channel attention module. The resulting one-dimensional vectors are passed through a shared fully connected layer (MLP), and the sum of the outputs generates a one-dimensional channel attention map. This channel attention map is then multiplied element-wise with the input to obtain the adjusted feature map $F'$ after the channel attention module. The calculation process of the channel attention module is as follows:
$$M_C(F) = \sigma\big(MLP(AvgPool(F)) + MLP(MaxPool(F))\big) = \sigma\big(W_1(W_0(F_{avg}^{c})) + W_1(W_0(F_{max}^{c}))\big) \quad (5)$$
where $\sigma$ denotes the sigmoid function, $W_0$ and $W_1$ are the two weights of the MLP, and $F_{avg}^{c}$ and $F_{max}^{c}$ are the features that have undergone the average-pooling and max-pooling layers, respectively, within the channel attention module.
Then, the feature map $F'$ undergoes max-pooling and average-pooling operations along the channel dimension in the spatial attention module. The two resulting two-dimensional maps are concatenated and processed through a convolution operation to generate the two-dimensional spatial attention map $M_S \in \mathbb{R}^{1 \times H \times W}$. The obtained $M_S$ is then element-wise multiplied with the feature map $F'$ to produce the final feature map $F''$. The calculation process of the spatial attention module is as follows:
$$M_S(F) = \sigma\big(f^{7\times 7}([AvgPool(F); MaxPool(F)])\big) = \sigma\big(f^{7\times 7}([F_{avg}^{s}; F_{max}^{s}])\big) \quad (6)$$
where $f^{7\times 7}$ represents a convolution operation with a filter size of 7 × 7, and $F_{avg}^{s}$ and $F_{max}^{s}$ are the features that have undergone the average-pooling and max-pooling layers, respectively, within the spatial attention module.
The overall process by which the Convolutional Block Attention Module (CBAM) generates $F''$ can be expressed as:
$$F' = M_C(F) \otimes F, \qquad F'' = M_S(F') \otimes F' \quad (7)$$
where ⊗ represents element-wise multiplication.
Finally, the UAV aerial images and satellite tile images are processed through the Siamese neural network to obtain the UAV aerial image feature map and the satellite tile image feature map, denoted as $F_U$ and $F_M$, respectively.
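The following PyTorch sketch shows one possible reading of this branch design. Channel counts follow the first three VGG16 blocks (64, 128, 256), single-channel grayscale inputs are assumed, and the CBAM reduction ratio and 7 × 7 spatial kernel follow common defaults; none of these hyperparameters are specified in the text, so they should be treated as assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as in Equations (5)-(7)."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, F):
        b, c, _, _ = F.shape
        # channel attention, Eq. (5): shared MLP over avg- and max-pooled channel descriptors
        attn_c = torch.sigmoid(self.mlp(F.mean(dim=(2, 3))) + self.mlp(F.amax(dim=(2, 3))))
        F = F * attn_c.view(b, c, 1, 1)                      # F' = M_C(F) * F
        # spatial attention, Eq. (6): 7x7 conv over channel-wise avg- and max-pooled maps
        s = torch.cat([F.mean(dim=1, keepdim=True), F.amax(dim=1, keepdim=True)], dim=1)
        return F * torch.sigmoid(self.conv(s))               # F'' = M_S(F') * F'

class SiameseBranch(nn.Module):
    """One branch of the Siamese network; both inputs pass through the same shared weights."""
    def __init__(self, in_channels=1):
        super().__init__()
        def block(cin, cout):                                # two ReLU convs + max-pooling
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(),
                                 nn.MaxPool2d(2))
        self.features = nn.Sequential(
            block(in_channels, 64), block(64, 128),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),    # third block
            nn.Conv2d(256, 256, 3, padding=1),               # final conv left unactivated
            CBAM(256))

    def forward(self, x):
        return self.features(x)

# usage: branch = SiameseBranch(); F_U = branch(uav_batch); F_M = branch(tile_batch)
```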

3.3. Image Registration Based on the Lucas-Kanade Algorithm

The role of the Lucas-Kanade inverse compositional algorithm in image registration is to estimate the displacement information between images using optical flow, while leveraging the inverse compositional method to improve computational efficiency. The core idea is based on the assumption that pixel intensity values between images remain consistent over short time intervals, and the motion is estimated using local gradient information, thereby achieving image alignment and registration.
In this study, we denote the feature maps of drone aerial images and satellite images as $F_U$ and $F_M$, respectively. To align these two original images, we employ the inverse compositional Lucas-Kanade algorithm. The motion model linking $F_U$ and $F_M$ is expressed as $W(x; p)$, where $x = (x_i, y_i)$ represents the pixel coordinates in the original image. The total number of pixels in the image is $N$, and $p = (p_1, p_2, \ldots, p_6)^T$ signifies the parameter vector associated with the motion model. Equation (1) presents the six-degrees-of-freedom affine matrix, which is crucial for this alignment process. In this section, we detail the relationship between $p$ and $A_{rel}$:
$$W(x; p) = \begin{bmatrix} 1+p_1 & p_2 & p_3 \\ p_4 & 1+p_5 & p_6 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha & -\beta & t_x \\ \beta & \alpha & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad (8)$$
Next, we apply a method based on photometric consistency loss to align the images. This process involves calculating the sum of squared differences in grayscale values between corresponding pixels in the two images. Our goal is to determine the optimal parameters p that minimize the grayscale differences between the drone aerial image and the satellite tile image, which has been altered by the motion model. This relationship can be expressed as follows:
$$\min_{p} \left\| F_U(x) - F_M\big(W(x; p)\big) \right\|_2^2 \quad (9)$$
Finally, we correct $A_{rel}$ using the affine matrix $A_{corr}$ to obtain the final motion of the drone. The calculation formula is as follows:
$$A_{rel}^{f} \leftarrow A_{corr}^{f} \cdot A_{rel}^{f} \quad (10)$$
where $A_{corr}^{f}$ represents the correction matrix between the drone image in the $f$-th frame and the corresponding satellite tile image.
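As a simplified stand-in for this registration step, the sketch below minimizes the photometric objective of Equation (9) over the six affine parameters by gradient descent with PyTorch autograd, rather than the closed-form inverse compositional Lucas-Kanade updates used in the paper. Note that `affine_grid` works in normalized [-1, 1] coordinates, so the recovered translation must be rescaled to pixel (and then map) units before the result can serve as $A_{corr}^{f}$ in Equation (10); the iteration count and learning rate are assumptions.

```python
import torch
import torch.nn.functional as Fnn

def register_feature_maps(F_U, F_M, iters=200, lr=1e-2):
    """Align F_M (satellite feature map) to F_U (UAV feature map); both shaped (1, C, H, W)."""
    p = torch.zeros(6, requires_grad=True)                   # p1..p6, identity warp at start
    opt = torch.optim.Adam([p], lr=lr)
    for _ in range(iters):
        theta = torch.stack([torch.stack([1 + p[0], p[1], p[2]]),
                             torch.stack([p[3], 1 + p[4], p[5]])]).unsqueeze(0)
        grid = Fnn.affine_grid(theta, list(F_U.shape), align_corners=False)
        warped = Fnn.grid_sample(F_M, grid, align_corners=False)   # samples F_M(W(x; p))
        loss = ((F_U - warped) ** 2).mean()                  # photometric objective, Eq. (9)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                                    # assemble the 3x3 warp of Eq. (8)
        A = torch.eye(3)
        A[0, 0], A[0, 1], A[0, 2] = 1 + p[0], p[1], p[2]
        A[1, 0], A[1, 1], A[1, 2] = p[3], 1 + p[4], p[5]
    return A
```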

3.4. Train

The training image dataset is sourced from a publicly available collection hosted on the United States Geological Survey's Earth Explorer platform. It covers a region of 5.9 × 7.5 km in New Jersey, USA, and encompasses a variety of landscapes, including urban, forested, suburban, and rural areas. The dataset comprises images captured during spring, summer, and autumn, each with dimensions of 7582 × 5946 pixels and a spatial resolution of 1 m. Prior to neural network training, the raw images undergo a warping procedure, and the resulting processed images serve as the input data for the network. We begin by randomly selecting two images from the raw dataset. From these images, we crop a pair of patches at corresponding locations, with side lengths varying between 175 and 225 pixels. One patch remains unchanged, while the other undergoes warping using randomly generated affine parameters, which include scaling, translation, and rotation. These two patches, one processed and the other unchanged, constitute a training pair that is then fed into the twin neural network based on an attention mechanism. During the training phase, we generated a total of 3000 pairs of image patches as training data. The loss function employed for training is the corner loss, calculated as follows:
$$Loss(p, \hat{p}) = \frac{1}{4} \sum_{i=1}^{4} \left\| W(c_i; p) - W(c_i; \hat{p}) \right\|_2 \quad (11)$$
where $\hat{p}$ refers to the true values of the warping parameters utilized to distort the images, while $c_1, \ldots, c_4$ indicate the positions of the four corners. The network weights are updated by backpropagating the corner loss, using the Adam optimizer with a learning rate set at $1 \times 10^{-3}$.
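The sketch below illustrates the training-pair generation and the corner loss of Equation (11). The warp magnitude ranges and the use of OpenCV are assumptions, since the text only specifies patch sizes of 175-225 pixels and random affine parameters for scaling, translation, and rotation.

```python
import cv2
import numpy as np

def make_training_pair(img_a, img_b, rng):
    """Crop co-located patches from two raw images; warp one with a random affine."""
    size = int(rng.integers(175, 226))                  # patch side length in pixels
    h, w = img_a.shape[:2]
    y0, x0 = int(rng.integers(0, h - size)), int(rng.integers(0, w - size))
    patch_fixed = img_a[y0:y0 + size, x0:x0 + size]
    patch_src = img_b[y0:y0 + size, x0:x0 + size]
    # random affine: rotation, isotropic scale, translation (ranges are assumptions)
    angle = rng.uniform(-25, 25)
    scale = rng.uniform(0.9, 1.1)
    M = cv2.getRotationMatrix2D((size / 2, size / 2), angle, scale)
    M[:, 2] += rng.uniform(-10, 10, size=2)             # add a random translation
    patch_warped = cv2.warpAffine(patch_src, M, (size, size))
    return patch_fixed, patch_warped, M                 # M holds the true parameters p_hat

def corner_loss(M_pred, M_true, size):
    """Eq. (11): mean distance between the four patch corners mapped by each warp."""
    corners = np.array([[0, 0], [size, 0], [size, size], [0, size]], dtype=np.float32)
    pts = np.hstack([corners, np.ones((4, 1), dtype=np.float32)])   # homogeneous corners
    diff = pts @ M_pred.T - pts @ M_true.T              # (4, 2) corner displacements
    return np.linalg.norm(diff, axis=1).mean()

# usage: rng = np.random.default_rng(0); fixed, warped, M = make_training_pair(a, b, rng)
```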

4. Experiments

The experiments were carried out using Python 3.7, CUDA 10.2, and the deep learning framework PyTorch 1.7.1, with an NVIDIA GeForce 940MX graphics card as the hardware platform (NVIDIA Corporation, Santa Clara, CA, USA). To assess the effectiveness of the proposed method for accurately positioning drone-captured images relative to satellite maps across different texture conditions, this section presents experiments involving two flight paths in both high-texture and low-texture regions. The experimental dataset was gathered from the Changbai area in Heping District, Shenyang, Liaoning Province, China. The low-texture route covers a distance of 3.06 km and includes farmland, rivers, and a few buildings. In contrast, the high-texture route features a substantial number of buildings and roads, with the first high-texture route measuring 1.84 km and the second measuring 2.75 km. The satellite remote sensing maps utilized in this study are from Google Maps at zoom level 17. The drone-captured images have a resolution of 750 × 750 pixels, with a 30% overlap between adjacent frames. It is crucial to note that the drone images and satellite images for all routes were captured at different times and using different sensors. As a result, visual discrepancies exist between the drone-captured images and the satellite images, including variations caused by vegetation, roads, buildings, and shadows. Figure 5 provides a comparative view of the drone-captured images and the satellite images.
This section outlines comparative experiments with various algorithms to assess the effectiveness of the proposed method. The comparison will involve calculating initial motion using different point feature detection algorithms, including SURF, SIFT, and an improved Oriented FAST and Rotated BRIEF (ORB) algorithm. Furthermore, we will employ distinct neural network architectures for image feature extraction. Specifically, we will extract features using both the original VGG16 model and a modified version that integrates the Convolutional Block Attention Module (CBAM) after the third convolutional block. The registration results from these methods will be compared to perform an ablation study of the Convolutional Block Attention Module (CBAM).

4.1. Feature Point Extraction and Matching

To determine the position of the drone, we first estimate the matrix $A_{rel}^{1}$, which links the images of the drone to the satellite maps based on known takeoff information. Subsequently, we conduct feature point detection and matching on consecutive frames captured by the drone, employing a 30% overlap between these frames. In cases where the overlap is minimal, the quality and uniform distribution of feature points become crucial for accurately estimating the initial motion of the drone. This study enhances the Oriented FAST and Rotated BRIEF (ORB) algorithm by implementing a quadtree feature point uniformization technique, which ensures an even distribution of extracted feature points throughout the image. Figure 6 shows the distribution of feature points in the images captured by the drone.
The experimental results reveal that using the Oriented FAST and Rotated BRIEF (ORB) algorithm alone results in most feature points being clustered on edges with prominent textures, such as those found in buildings, grass, and roads, while feature points are scarce in low-texture areas. In contrast, the enhanced ORB algorithm achieves a more uniform distribution of feature points, thereby mitigating the clustering effect. Following the feature point detection, we apply the FLANN algorithm for feature point matching. The matching results for adjacent frames with low-texture and high-texture images are presented in Figure 7 and Figure 8, respectively.
In this study, we employ the RANSAC algorithm for feature point pair selection. RANSAC requires a minimum of four correct matching pairs to obtain accurate initial motion parameters. The experimental results demonstrate that the number of successful matches between low-texture images using the Oriented FAST and Rotated BRIEF (ORB) algorithm alone is relatively low, with some scenes yielding fewer than four pairs. In contrast, the combination of the ORB algorithm and the quadtree uniformization technique significantly increases the number of matching pairs for low-texture images. High-texture images, characterized by their clear and distinct features, facilitate the identification of a larger number of correct matches. However, images processed with the quadtree uniformization algorithm may still present some erroneous matches. This issue arises because, while the algorithm enhances the even distribution of feature points, it can also reduce their quality, resulting in incorrect matches. Despite this, the high volume of successfully matched feature point pairs enables accurate calculations of the initial motion of the drone, even after filtering out the erroneous matches.

4.2. Image Registration and Localization for Low-Texture Areas

This section utilizes neural networks to extract features from both drone-captured images and satellite tiles. The inverse compositional Lucas-Kanade algorithm is then applied for image registration, allowing for the correction of the motion of the drone and precise determination of its location on the satellite map. Figure 9 illustrates the registration results for specific low-texture route images. The upper part of the figure displays the satellite tile image, while the lower part presents the drone image. The central section shows the registration results of the satellite tile using the drone image as a template, with a highlighted rectangular box to facilitate a clearer comparison of the outcomes. The proposed network structure builds upon the VGG16 model by incorporating the Convolutional Block Attention Module (CBAM) attention module, which enhances the preservation of essential features in critical areas, such as buildings and roads. Experimental results indicate that, despite variations between the drone-captured images and the satellite tiles, the registration results achieved with the proposed method significantly surpass those obtained using the standard VGG16 network structure.
Figure 10 illustrates the positioning results for low-texture routes, with the flight direction in the satellite image moving from the lower left corner to the upper right corner. The results indicate that from the ninth frame captured by the drone, the distance between the predicted coordinates obtained using the VGG16-based positioning method and the actual location progressively increases. In contrast, the predicted drone position coordinates derived from the proposed method remain significantly closer to the true location.
This study utilizes absolute positioning error as an objective metric to evaluate the positioning results. It specifically measures the two-dimensional Euclidean distance between the predicted coordinates and the actual GPS coordinates for each frame. Within the same network architecture based on the VGG16 structure, we compare the positioning errors associated with the SURF, SIFT, and improved Oriented FAST and Rotated BRIEF (ORB) algorithms. Furthermore, using the improved ORB algorithm as the feature point detection method, this study also examines the positioning errors obtained from both the VGG16 network structure and the proposed method. Figure 11 displays the positioning error line chart for the methods discussed. The data shows that, under the same network architecture, the positioning errors of the improved ORB (ORB_Qt) algorithm are notably lower than those of the SURF and SIFT algorithms. In low-texture areas, the errors for each frame using the VGG16 network structure tend to increase gradually. In contrast, the errors for each frame produced by the proposed method remain consistently small and stable. These results indicate that the positioning outcomes from the proposed network structure are clearly superior to those obtained with the VGG16 network structure.
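For reference, the absolute positioning error used here reduces to the following computation, assuming the predicted and GPS coordinates are both expressed in meters (e.g., in a projected UTM frame); the function name is illustrative only.

```python
import numpy as np

def absolute_positioning_error(pred_xy, gps_xy):
    """Per-frame and mean 2D Euclidean error; inputs shaped (num_frames, 2) in meters."""
    err = np.linalg.norm(np.asarray(pred_xy) - np.asarray(gps_xy), axis=1)
    return err, err.mean()
```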

4.3. Image Registration and Localization for High-Texture Areas

Figure 12 illustrates the registration results for images from high-texture routes. The top section features the satellite tile image, while the bottom section displays the UAV aerial images. The middle section presents the registration outcomes of the satellite tile image, using the UAV aerial image as a reference. The experimental results indicate that both the VGG16-based method and the proposed approach effectively register drone images with satellite tile images. This performance can be attributed to the presence of numerous richly textured buildings and roads within the images.
Figure 13 illustrates the positioning results for high-texture routes, where the flight direction is from left to right on the satellite map. The findings show that the predicted drone coordinates from both methods closely align with the actual positions, demonstrating a high level of positioning accuracy.
The comparison experiments for high-texture and low-texture routes utilized the same methodology. Figure 14 presents the positioning errors for high-texture routes. It is clear from the figure that the errors of the three feature point detection algorithms are largely consistent across the same network architecture. When applying the same feature point detection algorithm, the proposed network structure produces positioning errors similar to those of the VGG16-based method. However, the results from the proposed method exhibit greater stability.
Table 1 gives the time required to compute the initial motion parameters for the four different methods on the four routes, as well as the average localization error. The times indicated in Table 1 represent the duration required for the initial motion calculations of the drone, particularly focusing on feature point detection. The results clearly show that the feature point detection method used in this study significantly reduces computation time compared to other approaches, achieving a 90% reduction relative to SURF [24], an 83% reduction compared to SIFT [23], and a 41% reduction compared to the method in [18]. Furthermore, the proposed method demonstrates superior performance in terms of average positioning error. On two high-texture routes, the localization error was consistently less than 10 m, with an average localization error of 5.713 m over a cumulative flight distance of 4.59 km. In contrast, on two low-texture routes, the lack of significant point, line, and surface features resulted in fewer extractable feature points, leading to a localization error of less than 40 m, with an average localization error of 22.6 m over a cumulative flight distance of 3.06 km. Overall, the positioning error approaches GNSS levels, indicating promising experimental results.

4.4. Ablation Study

To validate the logical soundness and practical effectiveness of the proposed visual localization algorithm, we conducted ablation experiments focused on the attention mechanism module. In these experiments, the first and third stages of the algorithm, as described in the main text, were left unchanged and served as the baseline framework (hereafter referred to as “BASE”), while modifications were applied only to the attention mechanism module in the second stage. The attention mechanism module comprises the Channel Attention Module (CAM) and the Spatial Attention Module (PAM). The method without the attention mechanism module is referred to as “BASE”, while the variants containing only the Channel Attention Module and only the Spatial Attention Module are referred to as “BASE+CAM” and “BASE+PAM”, respectively. The complete method, which incorporates both modules, is denoted as “BASE+CAM+PAM”. Quantitative results for the entire dataset are presented in Table 2.
As shown in Table 2, the proposed complete localization method achieves higher localization accuracy across four different routes compared to methods without the attention mechanism module, with particularly significant improvements in low-texture regions. Specifically, in low-texture areas where image features are sparse or indistinct, conventional convolution operations struggle to effectively extract critical features. In contrast, the attention mechanism significantly enhances localization accuracy by focusing on the limited but crucial areas in the image, such as riverbanks and field edges. In high-texture regions, where image features are abundant, the localization accuracy of methods without the attention mechanism module approaches GNSS-level precision during long-distance flights. As a result, the addition of the attention mechanism provides only slight improvements in these regions. Overall, integrating the attention mechanism into the network architecture is a critical component of the proposed visual localization method. Aerial images are typically characterized by a large amount of information, complex content, the coexistence of global and local features, and diverse scales. Traditional convolution operations, constrained by their limited receptive fields, tend to be overly sensitive to local features while neglecting global context. The attention mechanism addresses this limitation by capturing long-range dependencies between spatial locations, enabling globally-aware feature extraction and enhancing the robustness of feature representations.

5. Conclusions

This paper introduces a drone visual positioning algorithm that combines image registration with an attention mechanism. By effectively integrating convolutional neural networks with the inverse compositional Lucas-Kanade registration algorithm, we enhance the accuracy of drone positioning. For feature point detection, we utilize Oriented FAST and Rotated BRIEF (ORB) and a quadtree-based method for uniformizing feature points, which significantly accelerates computation speed. This is crucial for meeting the demands of fast drone flights, especially in scenarios where consecutive frames exhibit low overlap. Furthermore, we have refined the feature extraction network by incorporating a Convolutional Block Attention Module (CBAM) in the feature extraction layer. This addition improves the accuracy of image feature extraction, particularly in cases where there are significant differences between low-texture UAV aerial images and satellite imagery. Experimental results indicate that our approach not only enhances the computational speed but also improves the accuracy of drone visual positioning algorithms, demonstrating substantial value for practical engineering applications.
Using UAV aerial imagery for visual positioning is a crucial method for navigation in environments where GNSS is unavailable. Visual odometry estimates the motion of the drone by analyzing sequences of aerial images. However, long-distance flights may result in cumulative errors. By combining visual odometry with absolute visual positioning methods, these errors can be effectively mitigated, leading to improved accuracy. We acknowledge that incorporating absolute visual positioning methods increases computational requirements, which can be challenging given the limited processing power of onboard drone systems. Consequently, future research will focus on optimizing computational models to reduce this burden and shorten positioning times. Additionally, the experimental results represent an acceptable outcome, with localization accuracy reasonably close to that of GNSS. However, this depends on the specific application scenario. For example, in low-texture areas where precise tracking and target localization are required, further improvement in localization accuracy is necessary. In such cases, focusing on dense image features rather than point features would be beneficial, and employing an end-to-end approach could be a feasible solution. In high-texture areas, point features are abundant and diverse, so the focus should be on the stability of feature extraction. Moreover, it is important to design feature extraction methods that are robust to factors such as lighting conditions and seasonal variations.

Author Contributions

Conceptualization, Y.R., G.D. and T.Z.; methodology, Y.R., G.D. and T.Z.; software, Y.R., X.C. and M.Z.; validation, M.X., G.D. and T.Z.; formal analysis, Y.R., G.D. and T.Z.; investigation, Y.R., M.Z. and X.C.; resources, Y.R., X.C. and M.X.; data curation, M.X., G.D. and T.Z.; writing—original draft preparation, Y.R., M.Z. and T.Z.; writing—review and editing, Y.R., G.D. and X.C.; visualization, Y.R., G.D. and T.Z.; supervision, Y.R., M.Z. and X.C.; project administration, Y.R., X.C. and M.X.; funding acquisition, Y.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Natural Science Foundation of Liaoning Province (No. 2021-MS-265), the Fundamental Research Funds for the Universities of Liaoning Province (No. LJ212410143046), the Research Foundation of Liaoning Educational Department (No. LJKMZ2-0220400), and the Basic Research Project (Key Research Project) of the Education Department of Liaoning Province (No. JYTZD2023009).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Scherer, J.; Yahyanejad, S.; Hayat, S.; Yanmaz, E.; Andre, T.; Khan, A.; Rinner, B. An autonomous multi-UAV system for search and rescue. In Proceedings of the First Workshop on Micro Aerial Vehicle Networks, Systems, and Applications for Civilian Use, Florence, Italy, 18 May 2015; pp. 33–38. [Google Scholar]
  2. Siebert, S.; Teizer, J. Mobile 3D Mapping for Surveying Earthwork Using an Unmanned Aerial Vehicle (UAV). In Proceedings of the International Symposium on Automation and Robotics in Construction, Montreal, QC, Canada, 11 August 2013. [Google Scholar]
  3. Tokekar, P.; Hook, J.V.; Mulla, D.; Isler, V. Sensor Planning for a Symbiotic UAV and UGV System for Precision Agriculture. IEEE Trans. Robot. 2016, 32, 1498–1511. [Google Scholar] [CrossRef]
  4. Lu, Y.; Macias, D.; Dean, Z.S.; Kreger, N.R.; Wong, P.K. A UAV-Mounted Whole Cell Biosensor System for Environmental Monitoring Applications. IEEE Trans. Nanobiosci. 2015, 14, 811–817. [Google Scholar] [CrossRef] [PubMed]
  5. Tomaštík, J.; Mokroš, M.; Surový, P.; Grznárová, A.; Merganič, J. UAV RTK/PPK method—An optimal solution for mapping inaccessible forested areas? Remote Sens. 2019, 11, 721. [Google Scholar] [CrossRef]
  6. Choi, J.; Myung, H. BRM localization: UAV localization in GNSS-denied environments based on matching of numerical map and UAV images. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 4537–4544. [Google Scholar]
  7. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  8. Lu, Z.; Liu, F.; Lin, X. Vision-based localization methods under GPS-denied conditions. arXiv 2022, arXiv:2211.11988. [Google Scholar]
  9. Couturier, A.; Akhloufi, M.A. A review on absolute visual localization for UAV. Robot. Auton. Syst. 2021, 135, 103666. [Google Scholar] [CrossRef]
  10. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  11. Trajković, M.; Hedley, M. Fast corner detection. Image Vis. Comput. 1998, 16, 75–87. [Google Scholar] [CrossRef]
  12. Calonder, M.; Lepetit, V.; Strecha, C.; Fua, P. Brief: Binary robust independent elementary features. In Proceedings of the 11th European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2010; pp. 778–792. [Google Scholar]
  13. Patel, B.; Barfoot, T.D.; Schoellig, A.P. Visual localization with Google Earth images for robust global pose estimation of UAVs. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 6491–6497. [Google Scholar]
  14. Majidizadeh, A.; Hasani, H.; Jafari, M. Semantic segmentation of UAV images based on U-NET in urban area. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, 10, 451–457. [Google Scholar] [CrossRef]
  15. Zhong, L.; Meng, L.; Hou, W.; Huang, L. An improved visual odometer based on Lucas-Kanade optical flow and ORB feature. IEEE Access 2023, 11, 47179–47186. [Google Scholar] [CrossRef]
  16. Zhang, G.; Yuan, Q.; Liu, Y. Research on Optimization Method of Visual Odometer Based on Point Line Feature Fusion. In Proceedings of the 2023 7th International Conference on High Performance Compilation, Computing and Communications, Jinan, China, 17–19 June 2023; pp. 274–280. [Google Scholar]
  17. Mu, Q.; Guo, S. Improved algorithm of indoor visual odometer based on point and line feature. In Proceedings of the 2022 2nd International Conference on Control and Intelligent Robotics, Nanjing, China, 24–26 June 2022; pp. 794–799. [Google Scholar]
  18. Goforth, H.; Lucey, S. GPS-denied UAV localization using pre-existing satellite imagery. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 2974–2980. [Google Scholar]
  19. He, M.; Zhu, C.; Huang, Q.; Ren, B.; Liu, J. A review of monocular visual odometry. Vis. Comput. 2020, 36, 1053–1065. [Google Scholar] [CrossRef]
  20. Ma, J.; Jiang, X.; Fan, A.; Jiang, J.; Yan, J. Image matching from handcrafted to deep features: A survey. Int. J. Comput. Vis. 2021, 129, 23–79. [Google Scholar] [CrossRef]
  21. Harris, C.; Stephens, M. A combined corner and edge detector. In Proceedings of the Alvey Vision Conference, Manchester, UK, 1988; pp. 147–151. [Google Scholar]
  22. Smith, S.M.; Brady, J.M. SUSAN—A new approach to low level image processing. Int. J. Comput. Vis. 1997, 23, 45–78. [Google Scholar] [CrossRef]
  23. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  24. Bay, H.; Tuytelaars, T.; Van Gool, L. Surf: Speeded up robust features. In Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar]
  25. Morel, J.M.; Yu, G. ASIFT: A new framework for fully affine invariant image comparison. SIAM J. Imaging Sci. 2009, 2, 438–469. [Google Scholar] [CrossRef]
  26. Wang, Q.; Huang, Z.; Fan, H.; Fu, S.; Tang, Y. Unsupervised person re-identification based on adaptive information supplementation and foreground enhancement. IET Image Process. 2024. [Google Scholar] [CrossRef]
  27. Ren, W.; Luo, J.; Jiang, W.; Qu, L.; Han, Z.; Tian, J.; Liu, H. Learning Self-and Cross-Triplet Context Clues for Human-Object Interaction Detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9760–9773. [Google Scholar] [CrossRef]
  28. Zheng, Q.; Zhao, P.; Zhang, D.; Wang, H. MR-DCAE: Manifold regularization-based deep convolutional autoencoder for unauthorized broadcasting identification. Int. J. Intell. Syst. 2021, 36, 7204–7238. [Google Scholar] [CrossRef]
  29. Simo-Serra, E.; Trulls, E.; Ferraz, L.; Kokkinos, I.; Fua, P.; Moreno-Noguer, F. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 118–126. [Google Scholar]
  30. Tian, Y.; Fan, B.; Wu, F. L2-net: Deep learning of discriminative patch descriptor in Euclidean space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 661–669. [Google Scholar]
  31. Ebel, P.; Mishchuk, A.; Yi, K.M.; Fua, P.; Trulls, E. Beyond Cartesian representations for local descriptors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 253–262. [Google Scholar]
  32. Verdie, Y.; Yi, K.; Fua, P.; Lepetit, V. Tilde: A temporally invariant learned detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5279–5288. [Google Scholar]
  33. Barroso-Laguna, A.; Riba, E.; Ponsa, D.; Mikolajczyk, K. Key.net: Keypoint detection by handcrafted and learned CNN filters. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5836–5844. [Google Scholar]
  34. DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar]
  35. Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4938–4947. [Google Scholar]
  36. Yi, K.M.; Trulls, E.; Lepetit, V.; Fua, P. Lift: Learned invariant feature transform. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VI 14; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 467–483. [Google Scholar]
  37. Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8922–8931. [Google Scholar]
  38. Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-Net: A trainable CNN for joint description and detection of local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8092–8101. [Google Scholar]
  39. Hou, H.; Xu, Q.; Lan, C.; Lu, W.; Zhang, Y.; Cui, Z.; Qin, J. UAV pose estimation in GNSS-denied environment assisted by satellite imagery deep learning features. IEEE Access 2020, 9, 6358–6367. [Google Scholar] [CrossRef]
  40. Xu, Y.; Zhong, D.; Zhou, J.; Jiang, Z.; Zhai, Y.; Ying, Z. A novel UAV visual positioning algorithm based on A-YOLOX. Drones 2022, 6, 362. [Google Scholar] [CrossRef]
  41. Gurgu, M.M.; Queralta, J.P.; Westerlund, T. Vision-based GNSS-free localization for UAVs in the wild. In Proceedings of the 2022 7th International Conference on Mechanical Engineering and Robotics Research (ICMERR), Krakow, Poland, 9–11 December 2022; pp. 7–12. [Google Scholar]
  42. Ren, Y.; Liu, Y.; Huang, Z.; Liu, W.; Wang, W. 2ChADCNN: A template matching network for season-changing UAV aerial images and satellite imagery. Drones 2023, 7, 558. [Google Scholar] [CrossRef]
  43. Abdelaziz, S.I.K.; Elghamrawy, H.Y.; Noureldin, A.M.; Fotopoulos, G. Body-centered dynamically-tuned error-state extended Kalman filter for visual inertial odometry in GNSS-denied environments. IEEE Access 2024, 12, 15997–16008. [Google Scholar] [CrossRef]
  44. Pang, W.; Zhu, D.; Chu, Z.; Chen, Q. Distributed adaptive formation reconfiguration control for multiple AUVs based on affine transformation in three-dimensional ocean environments. IEEE Trans. Veh. Technol. 2023, 72, 7338–7350. [Google Scholar] [CrossRef]
  45. Hajder, L.; Barath, D. Relative planar motion for vehicle-mounted cameras from a single affine correspondence. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 8651–8657. [Google Scholar]
  46. Wang, S.; Guo, Z.; Liu, Y. An image matching method based on SIFT feature extraction and FLANN search algorithm improvement. J. Phys. Conf. Ser. 2021, 2037, 012122. [Google Scholar] [CrossRef]
  47. Martínez-Otzeta, J.M.; Rodríguez-Moreno, I.; Mendialdua, I.; Sierra, B. RANSAC for robotic applications: A survey. Sensors 2022, 23, 327. [Google Scholar] [CrossRef] [PubMed]
  48. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV) 2018, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Figure 1. The proposed pipeline for our visual localization method begins by utilizing visual odometry techniques to process consecutive aerial images captured by the drone. This step estimates the initial motion of the drone and determines its initial position within the satellite imagery. Subsequently, an image registration network is employed to correct the motion of the drone, aiming to eliminate cumulative errors and ultimately ascertain the precise location of the UAV.
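To make the correction step in Figure 1 concrete, the minimal sketch below illustrates how a drifting visual-odometry estimate could be replaced by the position implied by the UAV-to-satellite homography. It is an illustrative sketch only, not the implementation used in this work; all names (e.g., correct_position, H_uav_to_sat) and the translation-only toy homography are assumptions.

```python
import numpy as np

def apply_homography(H, pt):
    """Map a 2D point through a 3x3 homography (homogeneous coordinates)."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return np.array([x / w, y / w])

def correct_position(vo_position_px, H_uav_to_sat, image_center_px):
    """Replace the drifting VO estimate with the registered position.

    vo_position_px  -- UAV position on the satellite tile predicted by visual odometry
    H_uav_to_sat    -- homography from the UAV image to the satellite tile,
                       as produced by an image registration step
    image_center_px -- principal point of the UAV image (assumed to lie beneath the UAV)
    """
    registered = apply_homography(H_uav_to_sat, image_center_px)
    # The difference between the two estimates is the accumulated VO drift.
    drift = registered - vo_position_px
    return registered, drift

if __name__ == "__main__":
    H = np.array([[1.0, 0.0, 412.0],
                  [0.0, 1.0, 388.0],
                  [0.0, 0.0, 1.0]])  # toy translation-only homography
    corrected, drift = correct_position(np.array([400.0, 395.0]), H, np.array([0.0, 0.0]))
    print(corrected, drift)
```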
Figure 2. (a,b) show consecutive aerial frames captured by the drone in two different scenes. The red boxes mark the region shared by the consecutive frames, where the clustering of feature points is clearly visible: most feature points lie in regions with significant texture variation. If the overlapping areas of consecutive drone-captured frames do not include regions with pronounced texture changes, the accuracy of image matching can be significantly reduced, leading to erroneous estimation of the drone's initial motion.
Figure 3. (a–d) show the four steps of the quadtree-based feature point homogenization algorithm. The algorithm guarantees an even distribution of feature points across the entire image area, preventing the calculation inaccuracies that can arise when points cluster in specific regions. As a result, this uniform distribution enhances the accuracy of feature matching.
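As a reading aid for Figure 3, the sketch below shows one simplified way to homogenize ORB keypoints with a quadtree: cells are split recursively and only the strongest keypoint in each leaf cell is kept. This is a simplified illustration under assumed parameters (e.g., max_depth, the placeholder file name), not the exact algorithm used in this work.

```python
import cv2

def quadtree_homogenize(keypoints, x0, y0, x1, y1, max_depth=6, depth=0):
    """Recursively split the region and keep the strongest keypoint per leaf cell."""
    inside = [k for k in keypoints if x0 <= k.pt[0] < x1 and y0 <= k.pt[1] < y1]
    if not inside:
        return []
    if len(inside) == 1 or depth == max_depth:
        return [max(inside, key=lambda k: k.response)]  # strongest corner only
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    kept = []
    for qx0, qy0, qx1, qy1 in [(x0, y0, mx, my), (mx, y0, x1, my),
                               (x0, my, mx, y1), (mx, my, x1, y1)]:
        kept += quadtree_homogenize(inside, qx0, qy0, qx1, qy1, max_depth, depth + 1)
    return kept

img = cv2.imread("uav_frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder file name
orb = cv2.ORB_create(nfeatures=2000)
kps = orb.detect(img, None)
uniform_kps = quadtree_homogenize(kps, 0, 0, img.shape[1], img.shape[0])
uniform_kps, desc = orb.compute(img, uniform_kps)
```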
Figure 4. We propose an improved twin neural network architecture based on part of the VGG16 structure, which incorporates both channel attention and spatial attention modules. This enhancement aims to improve the model's ability to capture subtle texture variations in images.
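The channel and spatial attention in Figure 4 follow the general CBAM design [48]. The PyTorch sketch below is a generic illustration of such modules; the reduction ratio, kernel size, and the point at which the block would be inserted into VGG16 are assumptions rather than the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        w = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * w

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)            # channel-wise average map
        mx, _ = x.max(dim=1, keepdim=True)           # channel-wise max map
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class AttentionBlock(nn.Module):
    """Channel attention followed by spatial attention, applied to a feature map."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))

feat = torch.randn(1, 128, 64, 64)  # e.g., an intermediate VGG16-style feature map
print(AttentionBlock(128)(feat).shape)
```

Applying channel attention before spatial attention follows the sequential ordering reported to work best in [48].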
Figure 5. (a) represents UAV aerial images, while (b) represents satellite images. UAV aerial images and satellite images of the same area exhibit significant visual discrepancies due to differences in imaging principles, capture times, and other factors. These discrepancies pose considerable challenges for image matching.
Figure 6. (a) shows the feature points extracted by the original Oriented FAST and Rotated BRIEF (ORB) algorithm, while (b) shows those extracted by the improved ORB algorithm. The visualized distributions demonstrate that the proposed method effectively mitigates the concentration of feature points in regions with high texture variation, and the number of feature points in the overlapping areas of consecutive frames increases significantly. The resulting uniform distribution represents the different parts of the image more comprehensively, improving the ability to capture details and variations and enhancing overall performance.
Figure 7. Matching results of UAV aerial images in low-texture areas. (a) presents the matching results obtained after feature point extraction using the original Oriented FAST and Rotated BRIEF (ORB) algorithm, while (b) presents the matching results following feature point extraction with the improved ORB algorithm. The enhanced feature point extraction algorithm significantly increases the number of correctly matched point pairs.
Figure 8. Matching results of UAV aerial images in high-texture areas. (a) presents the matching results obtained after feature point extraction using the original ORB algorithm, while (b) presents the matching results following feature point extraction with the improved ORB algorithm. The enhanced feature point extraction algorithm significantly increases the number of correctly matched point pairs.
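The matching illustrated in Figures 7 and 8 can be approximated with standard OpenCV components: ORB detection, brute-force Hamming matching, and RANSAC-based homography estimation. The snippet below is a generic sketch with placeholder file names and thresholds, not the evaluation code used to produce these figures.

```python
import cv2
import numpy as np

img1 = cv2.imread("frame_t.png", cv2.IMREAD_GRAYSCALE)   # consecutive UAV frames
img2 = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)  # (placeholder file names)

orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance suits binary BRIEF descriptors; cross-check keeps symmetric matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# RANSAC rejects the remaining outliers and yields the inter-frame homography.
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=3.0)
print("inliers:", int(inlier_mask.sum()), "of", len(matches))
```

Using crossCheck=True retains only mutually nearest matches, which removes many outliers before the RANSAC stage.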
Figure 9. Registration results of drone aerial images in low-texture areas. (a) represents the registration results based on VGG16, and (b) represents the registration results based on the proposed method in this paper.
Figure 10. Visualization results of absolute localization errors in low-texture areas. (a) shows the localization results based on the VGG16 network architecture, and (b) shows the localization results based on the method proposed in this paper.
Figure 11. Results of absolute localization error from comparative experiments of different algorithms in low-texture areas. (a) presents the absolute localization errors of various feature point detection algorithms, and (b) presents the absolute localization errors of different network architectures.
Figure 12. Registration results of drone aerial images in high-texture areas. (a) represents the registration results based on VGG16, and (b) represents the registration results based on the proposed method in this paper.
Figure 13. Visualization results of absolute localization errors in high-texture areas. (a) shows the localization results based on the VGG16 network architecture, and (b) shows the localization results based on the method proposed in this paper.
Figure 14. Results of absolute localization error from comparative experiments of different algorithms in high-texture areas. (a) presents the absolute localization errors of various feature point detection algorithms, and (b) presents the absolute localization errors of different network architectures.
Table 1. Comparative experimental results of the proposed method against several other benchmark methods. The proposed method demonstrates superior speed and accuracy in localization in both high-texture and low-texture areas compared to the other methods.
Experimental Route      | SIFT [23]            | SURF [24]            | Method [18]          | Ours
                        | Time (s)  | Error (m) | Time (s)  | Error (m) | Time (s)  | Error (m) | Time (s)  | Error (m)
Low-texture route I     | 198.429   | 281.332   | 346.932   | 543.122   | 53.288    | 33.06     | 33.176    | 38.670
Low-texture route II    | 140.862   | 6.919     | 260.785   | 6.925     | 47.517    | 17.82     | 26.538    | 6.502
High-texture route I    | 92.120    | 4.515     | 174.056   | 4.481     | 20.159    | 5.061     | 17.276    | 3.914
High-texture route II   | 118.591   | 7.539     | 168.616   | 7.599     | 28.367    | 6.324     | 18.186    | 7.512
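The absolute localization error in Table 1 (and in the ablation of Table 2 below) can be interpreted as the average Euclidean distance between estimated and ground-truth UAV positions along a route; the minimal sketch below, with hypothetical values, illustrates the computation under that assumption.

```python
import numpy as np

def mean_absolute_localization_error(estimated_xy, ground_truth_xy):
    """Mean Euclidean distance (in metres) between estimated and reference positions."""
    estimated_xy = np.asarray(estimated_xy, dtype=float)
    ground_truth_xy = np.asarray(ground_truth_xy, dtype=float)
    return float(np.linalg.norm(estimated_xy - ground_truth_xy, axis=1).mean())

# Toy example with three positions expressed in metres in a local map frame.
est = [(0.0, 0.0), (10.0, 5.0), (20.0, 9.0)]
gt = [(1.0, 0.0), (10.0, 7.0), (18.0, 9.0)]
print(mean_absolute_localization_error(est, gt))  # -> 1.666...
```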
Table 2. Quantitative results of the ablation experiments on four different flight paths (absolute localization error in meters). ✓ indicates that the corresponding module is used.
BASE | CAM | PAM | Low-Texture Route I | Low-Texture Route II | High-Texture Route I | High-Texture Route II
 ✓   |     |     | 58.11               | 16.56                | 9.523                | 10.66
 ✓   | ✓   |     | 46.31               | 10.23                | 5.061                | 90.34
 ✓   |     | ✓   | 44.57               | 8.223                | 5.166                | 7.135
 ✓   | ✓   | ✓   | 38.67               | 6.502                | 3.914                | 7.512
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
