1. Introduction
In recent years, unmanned aerial vehicles (UAVs) have surged in popularity and in the scope of their applications. While UAVs have been previously used in the military domain, the use of drones in the civilian sector is also booming. Drones are frequently used in agriculture [
1], safety and emergency missions [
2], and logistics [
3]. Integrating UAVs with reliable autonomous decision making is crucial for reducing time and resource costs.
To accurately position a UAV, the current technology relies on global navigation satellite system (GNSS) sensors. This, however, is not reliable in all scenarios, as global positioning system (GPS) signals are susceptible to natural interference, spoofing, and jamming. To put it simply, there is a need for autonomous positioning technology that does not depend on intermittently available sensor data. In GNSS-challenged settings, a standard self-contained solution is the inertial navigation system (INS), typically implemented as a strapdown inertial navigation system (SINS), which integrates inertial measurement unit data to estimate attitude, velocity, and position. In practice, however, stand-alone INS/SINS suffers from error accumulation (so-called drift) over time due to sensor biases and noise [4] and therefore is commonly aided by an external source of information (e.g., GNSS, vision, LiDAR, map/terrain constraints) to reduce the accumulated error. Similarly, LiDAR-based mapping approaches, while more accurate, introduce additional constraints due to increased equipment and computational requirements. These issues motivate the exploration of vision-based algorithms either as a substitute or a supplement to such methods.
With the rise of computer vision (CV) algorithms and the availability of remote sensing imagery, vision-based methods have been employed to position a UAV within a satellite image. In this context, estimating the drone’s position can be achieved using handcrafted feature-matching approaches. These methods are the basis of many vision simultaneous localization and mapping (SLAM) pipelines; however, they suffer from poor robustness when viewpoint, illumination, and scale change. For instance, scale-invariant feature transform [
5], frequently used in vision SLAM to find and match keypoints within images, is notorious for degraded performance under noise, viewpoint and light changes [
6]. Some matching-based approaches integrate deep learning (DL) into their pipeline. For instance, ref. [
7] proposed using a convolutional neural network (CNN) to extract local descriptors and match them using a graph neural network. Finally, ref. [
8] used a custom CNN when extracting global features, thus enhancing key-point matching with fused local and global features. However, while these approaches are computationally cheap and fast, they still suffer from poor robustness when viewpoint, illumination, and scale variations are present.
Due to advancements in CV, different detector-free localization approaches have been proposed. Currently, there are two main ways to position a drone using DL pipelines: either retrieve the most similar satellite patch from a satellite image gallery or precisely locate the point within a satellite image. While satellite retrieval was the predominant CV-based approach [
9,
10,
11], it suffers from some drawbacks. First, to achieve localization via the retrieval method onboard, additional preprocessing and disk space are required. Every patch of the satellite image must have its features extracted, which increases the computational cost during deployment. Then, similarities between the current view and all of the images in the constructed satellite gallery need to be calculated every time the position needs to be estimated. Moreover, the computational load of such procedures scales up with the size of the satellite map.
In recent years, Dai et al. [
12] proposed an end-to-end precision localization framework referred to as finding point in map (FPI). The FPI approach aims to solve some of the problems of the satellite retrieval method. In particular, it aims to reduce heavy data preprocessing, memory cost, and improve precise positioning by locating the UAV within the satellite map directly and producing a dense heatmap of possible locations. Recently, Chen et al. [
13] proposed a one-stream model that drastically reduced model computational complexity. Since then, research in FPI has shifted from using two-stream methods that are Siamese-like to one-stream backbone extractors and restoring the original output resolution with various decoding strategies. Overall, in FPI, a lot of work has been done in designing better, more lightweight one-stream feature extractors [
13,
14,
15,
16], as this is where critical advancements in lowering computational cost can be achieved. In this context, the decoding strategy should not be overlooked, as it has been shown that different decoding designs are crucial to overall performance as well. However, the specific decoding strategies usually rely on the particular architecture of the backbone, as some of them do cross-modal modeling in every stage [
13,
14], and others only at the lower levels [
15,
16]. Rather than proposing a new backbone architecture, we suggest adopting the most widely used [
13,
17] one-stream pyramid vision transformer with conditional positional encoding (OS-PCPVT) as a strong baseline. In recent years, the literature on dense prediction tasks has shifted from standard feature pyramid network (FPN)-based decoders towards more task-specific FPN designs. We believe in the potential of custom FPN refining methods, with special focus on certain scales, particularly for precise UAV localization in challenging, real-world environments.
In any DL framework, the scope of generalizability matters as much as the whole method. This makes training and validation data of great importance and, as of this day, there is only one dataset that is used throughout all FPI research—UL14 [
12]. In other words, only one benchmark dataset exists, as FPI in itself is a relatively new task, and traditional UAV–satellite pair datasets are insufficient. This hinders the comparison and evaluation of different methodologies because, although the dataset itself is properly constructed, it has its own limitations. The main problem with UL14 is that the data are very homogeneous. This property holds true for both the content (similar architecture and areas) and quality (UAV flights were conducted at a low height; satellite imagery showed little to no change). For this reason, evaluating methods on this dataset alone is not sufficient for conducting more rigorous comparisons of methods and extending the findings to a more natural setting. While there is some research that extends the analysis beyond UL14 [
14,
15,
16,
18,
19], the comparative datasets are not as rigorous as the UL14, because in those studies only one satellite search map is used [
18], the impact of satellite size on UL14 is not reported [
14], or the satellite map is insufficient to extend precision positioning findings towards bigger satellite maps [
15,
16,
18].
In summary, despite recent progress in FPI-based UAV localization, several gaps remain. First, most existing works primarily focus on backbone architecture design; decoder structures, although frequently customized, rarely employ level-aware strategies that explicitly account for the complementary semantic richness of deeper feature layers and the spatial accuracy of the shallower ones. Second, evaluation is overwhelmingly centered on the UL14 dataset, which, while well-constructed, exhibits limited content diversity and relatively small satellite search areas. Third, the behavior of FPI methods under large-area, high-altitude localization scenarios remains under-explored, even though such conditions are common in real-world UAV operations. This work explicitly targets these three gaps. Seeking to enrich the scientific literature in UAV localization, our key contributions can be summarized as follows:
Seeking to emulate the difficulty and benchmarking characteristics of the UL14 dataset, we developed a novel UAV–satellite pair dataset for extended FPI task analysis. Our dataset, UAV-Sat, retains the quality of different satellite size imagery in the test set, yet the content of the dataset is more general purpose and representative of diverse UAV applications.
Development of a lightweight transformer-based network featuring a novel FPN design that incorporates multi-receptive deformable convolutions to enhance the final backbone feature layer, along with feature alignment between high- and mid-level feature maps.
Extensive comparative analysis of our method on two datasets (UL14, UAV-Sat) and an exhaustive ablation study of our method.
3. Materials, Method and Evaluation
3.1. Datasets and Preprocessing
To evaluate our method, we utilized two publicly available UAV–satellite pair datasets: UL14 [
12] and UAV-VisLoc [
40]. The UL14 dataset was left as is; from UAV-VisLoc, however, we derived our own UAV-Sat dataset. While UL14 is crucial for evaluation as a benchmark dataset for the FPI task, the addition of UAV-Sat allows us to extend our findings to a setting in which satellite maps are comparatively larger and more variable and the UAV images are sampled from a higher altitude. In the following subsections, we outline the specifics of each dataset and our data preprocessing configuration. Both the UL14 and UAV-Sat datasets were created so that the validation and test sets include augmented images. This makes the method’s validation much more reliable.
3.1.1. UL14 Dataset
UL14 consists of UAV–satellite image pairs from 14 different university campuses in China, 10 of which are for training and 4 for testing. The training set comprises UAV images at 512 × 512 pixels, and satellite images at 1280 × 1280 pixels. The test set contains UAV images at 256 × 256 pixels, and satellite images at 768 × 768 pixels. The satellite imagery in this dataset has a spatial resolution of 0.294 m/pix. Unprocessed training set satellite imagery is centrally aligned, meaning that the UAV is at the center of the paired satellite patch. Satellite images from the test set, however, do not share this quality; they are not centrally aligned and also have varying sampling scales. In the test set, for each UAV image there are 12 satellite patch images, in which the UAV location is randomly distributed, and satellite scales vary from 700 to 1800 pixels.
UAV images are sampled from 80, 90, or 100 m heights, which makes this dataset low altitude in terms of UAV data. Most satellite images in UL14 exhibit a relatively straight-down view, with little visible tilt. Qualitatively, since UAV images are sampled from university campuses, this dataset is urban-focused, with no suburban areas or challenging nature environments (such as deserts, valleys, mountains). Together with the absence of any noticeable off-nadir satellite imagery, UL14 is a relatively limited dataset for general purpose UAV positioning applications.
3.1.2. The Creation of UAV-Sat Dataset
To evaluate and enhance the applicability of our method and to address the limitations of the UL14 dataset, from the publicly available UAV-VisLoc dataset we created our own UAV-Sat dataset that aligns with the objectives of the FPI task. Originally, UAV-VisLoc comprised UAV–satellite image pairs and was collected from 9 various regions in China. Differing from UL14, this dataset is more diverse in terms of visual content: UAV-VisLoc contains images from both manmade (urban and suburban) and natural (river valleys, farmlands, mountainous areas) environments. UAV flight heights range from 405 to 840 m, making this dataset more applicable in evaluating the performance in high-altitude scenarios. Compared with the satellite imagery of UL14, UAV-VisLoc satellite images exhibit a greater amount of change because of the time difference between satellite and UAV capture. This makes UAV-VisLoc better suited for evaluating methods under realistic conditions in which satellite imagery may be suboptimal. The satellite imagery in this dataset has a spatial resolution of 0.3 m/pix. From UAV-VisLoc we created our own UAV-Sat dataset by using the following procedures:
Inclusion of UAV–satellite pairs within the whole dataset: We chose to exclude pairs that contained UAV images with uninformative or non-pertinent features. For this purpose, UAV images were center-cropped, and if the images were deemed uninformative after this procedure, they were excluded from the final dataset. We consider an image to be informative if it contains some sort of permanent feature (e.g., a docking station on a shore, a road within a forest, permanent natural landmarks). We also excluded pairs whose satellite images could not be cropped to 3500 × 3500 pixels without padding.
Train, validation, and test set splitting: After initial image selection, training, validation, and test sets were constructed, splitting the whole dataset into training and test sets (85/15 ratio) and then splitting the training set again into training and validation sets (90/10 ratio). Final splitting proportions were 76.5/8.5/15.
Train set preparation: For the training set, we chose to crop satellite images to 3500 × 3500 pixels to get enough coverage area for random scale crop (RSC, see Section 3.1.3) augmentation and enrich the final dataset with bigger satellite coverage images. Such image cropping yields image pairs that cover approximately a 1 square kilometer satellite area. UAV images were not processed further (the same center crop from step 1 was retained).
Validation and test set preparation: As in the training set, we also cropped satellite images to 3500 × 3500 pixels and retained the center crop of UAV from step 1. Additionally, each satellite image underwent RSC augmentation in a similar manner to the UL14 dataset. For every image, we constructed 12 satellite images of varying sizes, with the minimum satellite image size being 2400 pixels and the maximum being 3500 pixels. Notably, each of those 12 images had its target location pixel randomly sampled.
Image format and saving: After the initial processing (steps 1–4), images were saved at a fixed resolution of 512 × 512 and 1280 × 1280 for UAV and satellite images, respectively, in the training set and 256 × 256 and 768 × 768 for UAV and satellite images, respectively, in the validation/test sets.
These steps were taken to ensure that our UAV-Sat dataset would be as comparable in evaluation difficulty to the UL14 as possible.
Table 1 and
Figure 1 highlight a comparison between the two datasets.
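The arithmetic behind the preparation steps above can be checked with a short sketch. The function names are hypothetical and introduced only for illustration; the numeric values (85/15 and 90/10 splits, 3500-pixel crops, 0.3 m/pix, 768-pixel test resolution) are the ones quoted in the text.

```python
# Hypothetical helper names; numeric inputs are the values quoted in the text.

def split_proportions(test_frac=0.15, val_frac_of_train=0.10):
    """Return (train, val, test) shares after the two-stage split."""
    trainval = 1.0 - test_frac
    val = trainval * val_frac_of_train
    train = trainval - val
    return train, val, test_frac


def ground_coverage_m(crop_px, gsd_m_per_px):
    """Side length in meters of a square satellite crop."""
    return crop_px * gsd_m_per_px


def resized_gsd(crop_px, gsd_m_per_px, out_px):
    """Meters per pixel after resizing a crop_px-wide crop to out_px pixels."""
    return crop_px * gsd_m_per_px / out_px


train, val, test = split_proportions()
print(f"train/val/test = {train:.3f}/{val:.3f}/{test:.3f}")

side = ground_coverage_m(3500, 0.3)
print(f"3500 px crop: {side:.0f} m per side, {side ** 2 / 1e6:.2f} km^2")

# Smallest and largest UAV-Sat test crops, both saved at 768 x 768.
for crop in (2400, 3500):
    print(crop, "px ->", round(resized_gsd(crop, 0.3, 768), 3), "m/pix")
```

The last loop illustrates why meter-based metrics depend on the crop size: a larger crop resized to the same resolution makes each pixel span more meters.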
3.1.3. Data Preprocessing
As previous studies have done, in the data preprocessing module we included RSC [
12] (visualized in
Figure 2). Given a centered image (i.e., an image where the target is at the center), this augmentation selects a new random point and crops a specific area around it. The coordinates of the true location are then recalculated, and the target location is no longer centered. Instead, it appears at a random location within the cropped area. This augmentation has two main parameters: map_size and cover_rate. The cover_rate parameter controls the range around the centered true location where a new random point can be sampled, and the map_size controls how much of the area around the randomly sampled point is covered.
In our case, this specific augmentation was used to expand both the training sets (RSC applied randomly with respect to the cover area) and the validation/test sets (the same augmentation applied statically, with the cover area fixed to a set of predefined sizes). By applying RSC in such a way, we ensure that there is no center bias during training and that our validation/test sets are more challenging, and thus our evaluation is more reliable.
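A minimal sketch of the RSC idea follows, operating on coordinates only. The exact semantics of cover_rate are an assumption here: it is taken to bound the shift of the crop center to a fraction of half the crop size. The function name and the clamping behavior at the image border are also illustrative and not taken from the reference implementation.

```python
import random


def random_scale_crop(image_size, map_size, cover_rate, rng=random):
    """Sketch of RSC on a square image whose target sits at the center.

    image_size: side of the centered satellite image (pixels, int)
    map_size:   side of the square crop (pixels, int), <= image_size
    cover_rate: in [0, 1); ASSUMED to limit the crop-center shift to
                +/- cover_rate * map_size / 2 around the target
    Returns the crop box (x0, y0, x1, y1) and the target in crop coordinates.
    """
    if not (0 < map_size <= image_size and 0.0 <= cover_rate < 1.0):
        raise ValueError("invalid map_size or cover_rate")
    target = image_size // 2
    max_shift = int(cover_rate * map_size / 2)
    origin, new_target = [], []
    for _ in range(2):  # x axis, then y axis
        shift = rng.randint(-max_shift, max_shift)
        start = target + shift - map_size // 2
        # Keep the crop window inside the image.
        start = min(max(start, 0), image_size - map_size)
        origin.append(start)
        new_target.append(target - start)
    x0, y0 = origin
    return (x0, y0, x0 + map_size, y0 + map_size), tuple(new_target)


rng = random.Random(0)
box, target_xy = random_scale_crop(3500, 2400, 0.8, rng)
print("crop box:", box, "target inside crop:", target_xy)
```

For the static validation/test variant described above, the same routine could be called once per predefined map_size with a fixed seed, so that the resulting crops are reproducible.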
Additionally, our data preprocessing pipeline includes an image normalization module, as we scale our images to be in the range of . This type of normalization was selected since our initial experiments (conducted only on UL14) yielded superior performance compared with min-max scaling (value range ). Such preprocessing was performed during the training, validation, and testing phases.
3.2. Backbone Network
To conduct our experiments, we leveraged a pretrained OS-PCPVT [
9] backbone as our feature extractor. OS-PCPVT is a transformer-based hierarchical network (outlined in
Figure 3) that utilizes both self-attention and cross-attention to extract relevant features and their interactions early. Moreover, it is a dual-input network that processes both UAV and satellite images simultaneously.
Every stage of OS-PCPVT works as follows: Both inputs undergo a patch embedding operation, after which all tokens are concatenated and enter the transformer block. After the first transformer block, the positional encoding generation is performed and then the transformer block is applied N times. N values are 2, 4, and 6 for stages 1, 2, and 3, respectively. Just like PVT, OS-PCPVT has spatial reduction attention, which keeps the backbone light. In the transformer block, self-attention is performed for the UAV branch (keys, values, and queries are sampled only from UAV input), and cross-attention is performed for the satellite branch (keys and values are sampled from UAV and queries are sampled from satellite input). In such a way, the UAV branch exclusively models the UAV features, and the satellite branch models the interaction between the UAV and the satellite.
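The routing of queries, keys, and values between the two branches can be illustrated with a deliberately stripped-down sketch. It uses single-head scaled dot-product attention without learned projections, spatial reduction, positional encoding, or residual connections, so it is not the OS-PCPVT block itself; all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of token vectors."""
    d = len(keys[0])
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)  # Q @ K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values)


def one_stream_block(uav_tokens, sat_tokens):
    """UAV branch: self-attention. Satellite branch: cross-attention."""
    # Queries, keys, and values all come from the UAV tokens.
    uav_out = attention(uav_tokens, uav_tokens, uav_tokens)
    # Queries come from the satellite tokens; keys and values from the UAV.
    sat_out = attention(sat_tokens, uav_tokens, uav_tokens)
    return uav_out, sat_out


uav = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sat = [[0.5, 0.5], [2.0, -1.0], [0.0, 0.0], [-1.0, 3.0]]
uav_out, sat_out = one_stream_block(uav, sat)
print(len(uav_out), "UAV tokens,", len(sat_out), "satellite tokens")
```

In this sketch the UAV output never depends on the satellite tokens, while every satellite output token is a weighted mixture of UAV values, which mirrors the division of roles described above.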
3.3. MuRDE-FPN
To keep our model lightweight, we limit ourselves to modeling only on the satellite branch from the OS-PCPVT backbone. We are able to do such modeling, given that the output from every stage in the satellite branch is constructed using cross-attention. In traditional FPN, the fusion of different scales is achieved as follows: each layer is unified with a convolutional layer with kernels of size 1, and the current and smaller (in terms of spatial dimensions) feature maps are upsampled and then added (or concatenated) with the previous layer. We aimed to enhance the classical FPN structure so it would better serve the FPI task. The central design principle of MuRDE-FPN is the selective refinement of the most semantically rich but spatially degraded feature level, rather than uniform enhancement across all pyramid levels. Additionally, traditional fusion methods (addition, concatenation, weighted fusion) are insufficient to model cross-scale interactions, and for this reason MuRDE-FPN also includes a spatial alignment module. The general outline of our architecture is visualized in
Figure 4.
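For reference, the classical top-down FPN fusion described above can be sketched on nested lists. This is a generic illustration with hypothetical function names, nearest-neighbor upsampling, and addition as the fusion operator; it assumes each deeper level halves the spatial size and does not include any MuRDE-FPN component.

```python
def conv1x1(fmap, weight):
    """fmap: [C_in][H][W]; weight: [C_out][C_in]. Returns [C_out][H][W]."""
    height, width = len(fmap[0]), len(fmap[0][0])
    return [[[sum(wk * fmap[k][i][j] for k, wk in enumerate(w_out))
              for j in range(width)] for i in range(height)]
            for w_out in weight]


def upsample2x(fmap):
    """Nearest-neighbor upsampling by a factor of two."""
    return [[[row[j // 2] for j in range(2 * len(row))]
             for row in ch for _ in range(2)] for ch in fmap]


def add(a, b):
    """Element-wise sum of two [C][H][W] maps of equal shape."""
    return [[[x + y for x, y in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def fpn_top_down(features, lateral_weights):
    """features: shallow-to-deep list of maps; returns fused maps, same order."""
    # Lateral 1x1 convolutions unify the channel dimension of every level.
    laterals = [conv1x1(f, w) for f, w in zip(features, lateral_weights)]
    outputs = [laterals[-1]]
    # Walk from the deepest level upwards, upsampling and adding.
    for lateral in reversed(laterals[:-1]):
        outputs.insert(0, add(lateral, upsample2x(outputs[0])))
    return outputs


def ones(c, h, w):
    return [[[1.0] * w for _ in range(h)] for _ in range(c)]


feats = [ones(2, 4, 4), ones(3, 2, 2), ones(1, 1, 1)]
weights = [[[1.0, 1.0]], [[1.0, 1.0, 1.0]], [[1.0]]]
fused = fpn_top_down(feats, weights)
print([(len(f), len(f[0]), len(f[0][0])) for f in fused])
```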
First, before lateral channel unification, we included an efficient channel attention (ECA) [
41] operation for every feature map. ECA is a lightweight channel reweighting technique that adaptively suppresses or emphasizes channels without dimensionality reduction. In practice, ECA performs average pooling to extract channel-wise spatial descriptors, which then undergo one-dimensional convolution to capture local cross-channel interactions. We used channel-adaptive kernel selection, as proposed in the original implementation. Final channel-attention weights are normalized using the sigmoid function. Including ECA before channel unification allows for a more effective channel recalibration. We specifically chose this channel reweighting method in order to keep our final computational complexity almost intact, as ECA is very lightweight. After ECA, feature maps were unified to the same channel dimensionality (64 channels) using convolution with kernel size one.
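The ECA steps described above (global average pooling, a one-dimensional convolution across channels, sigmoid gating) can be sketched as follows. The kernel-size rule with gamma = 2 and b = 1 follows the ECA-Net formulation as we understand it and should be treated as an assumption; the convolution weights would be learned in practice and are passed in explicitly here.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Channel-adaptive odd kernel size (assumed ECA-Net rule)."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def eca(fmap, conv_weights):
    """fmap: [C][H][W]; conv_weights: 1D kernel of odd length."""
    pad = len(conv_weights) // 2
    # Global average pooling: one scalar descriptor per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]
    padded = [0.0] * pad + pooled + [0.0] * pad
    gates = []
    for c in range(len(pooled)):
        # 1D convolution over neighboring channels, no dimensionality reduction.
        z = sum(w * padded[c + i] for i, w in enumerate(conv_weights))
        gates.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid
    # Reweight every channel by its gate.
    return [[[g * v for v in row] for row in ch]
            for g, ch in zip(gates, fmap)]


for c in (64, 128, 256, 512):
    print(c, "channels -> kernel size", eca_kernel_size(c))
```

Because the only learnable parameters are the few weights of the 1D kernel, the added cost is negligible compared with the rest of the decoder, which matches the motivation given above.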
Then, in our design, we included two key blocks: multi-receptive deformable enhancement (MuRDE) and feature alignment module (FAM). In both modules, deformable convolution (DCNv1) [42] is a key operator to achieve our goals. This type of convolution is an extension of the standard convolution with the inclusion of learnable spatial offsets. The output of such a convolution $y$ at a location $p_0$ is calculated as:
$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n + \Delta p_n),$$
where $x$ is the input feature map, $p_n$ is sampled from a regular convolution sampling grid $\mathcal{R}$ at a location $p_0$, and $\Delta p_n$ denotes the learned offset for the same sampling point $p_n$. Then, $w(p_n)$ denotes learnable convolution weights at point $p_n$. In other words, this type of convolution is capable of adapting to input and capturing irregular patterns. In its second variant, DCNv2 [43], a modulating mask $\Delta m_n$ is introduced:
$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n + \Delta p_n)\, \Delta m_n.$$
Modulation masks are also learnable, which means that DCNv2 is additionally capable of controlling the contribution of each sampling location. We outline the comparison between the computation of DCNv1 and DCNv2 in Figure 5. In our implementation, two lightweight convolution layers, both initialized with zero weights and bias, were used to generate offsets and a modulation mask. The deformable convolution samples features at fractional locations using bilinear interpolation.
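To make the sampling mechanics concrete, the sketch below implements a single-channel deformable convolution with optional modulation in plain Python. It is an illustration under simplifying assumptions (stride 1, odd kernel, output the same size as the input, zero outside the image); offsets and masks are passed in directly rather than predicted by convolution layers, and the function names are hypothetical.

```python
import math


def bilinear(img, y, x):
    """Sample img at fractional (y, x); positions outside the image read as 0."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    val = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                val += wy * wx * img[yy][xx]
    return val


def deform_conv2d(img, weight, offsets=None, mask=None, dilation=1):
    """Single-channel (modulated) deformable convolution.

    weight:  [k][k] kernel with odd k
    offsets: offsets[i][j][n] = (dy, dx) for kernel tap n at output (i, j)
    mask:    mask[i][j][n] = modulation scalar for tap n; None means 1
    """
    k = len(weight)
    r = dilation * (k // 2)
    # Regular sampling grid: tap positions relative to the output location.
    taps = [(dilation * a - r, dilation * b - r, weight[a][b])
            for a in range(k) for b in range(k)]
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for n, (py, px, wn) in enumerate(taps):
                dy, dx = offsets[i][j][n] if offsets else (0.0, 0.0)
                m = mask[i][j][n] if mask else 1.0
                out[i][j] += wn * bilinear(img, i + py + dy, j + px + dx) * m
    return out


img = [[float(4 * r + c) for c in range(4)] for r in range(4)]
box = [[1.0] * 3 for _ in range(3)]
# With no offsets and no mask this reduces to a standard convolution.
print(deform_conv2d(img, box)[1][1])
```

With all offsets at zero and all mask values at one, the operator coincides with a regular convolution, so the learned offsets and masks can be read as a data-dependent correction to the fixed sampling grid.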
To enhance and refine the last layer of the backbone, we specifically designed the MuRDE block. While the last layer from the satellite branch in OS-PCPVT is the most semantically rich, it suffers from spatial loss, since the original satellite input is downsampled 16 times. Because of this, we believe that the additional refinement and enhancement of such a feature map is beneficial. To achieve this, the MuRDE block uses DCNv2 as follows: Given an input feature map, different configurations of DCNv2 are computed. Configurations differ in kernel size and dilation, or, in other words, the MuRDE block is able to capture context at multiple receptive fields. Outputs from every DCNv2 convolution are then fused by simple addition, followed by group normalization and the ReLU activation function (shown in Figure 6). The final configuration of the parameters in the MuRDE block can be seen in Figure 6, and our final configuration of the MuRDE module includes convolutions that have receptive fields of 3 × 3, 5 × 5, and 9 × 9. In our implementation, DCNv2 groups were set to 1.
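The fusion step at the end of the MuRDE block (addition of branch outputs, group normalization, ReLU) can be sketched independently of the deformable convolutions that produce the branches. The number of normalization groups is not stated in the text, so num_groups is a free parameter here; the affine scale and shift of group normalization are omitted, and the function names are hypothetical.

```python
import math


def group_norm(fmap, num_groups, eps=1e-5):
    """fmap: [C][H][W]; per-group zero mean, unit variance (no affine terms)."""
    size = len(fmap) // num_groups
    out = []
    for g in range(num_groups):
        group = fmap[g * size:(g + 1) * size]
        vals = [v for ch in group for row in ch for v in row]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in row] for row in ch]
                   for ch in group)
    return out


def fuse_branches(branches, num_groups=1):
    """Sum equally shaped branch outputs, then group norm and ReLU."""
    summed = [[[sum(vals) for vals in zip(*rows)] for rows in zip(*chans)]
              for chans in zip(*branches)]
    normed = group_norm(summed, num_groups)
    return [[[max(0.0, v) for v in row] for row in ch] for ch in normed]


# Two toy branch outputs with shape [C=2][H=1][W=2].
branch_a = [[[-1.0, 0.0]], [[0.0, 1.0]]]
branch_b = [[[0.0, 1.0]], [[-1.0, 0.0]]]
print(fuse_branches([branch_a, branch_b], num_groups=1))
```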
Inspired by [
44,
45,
46,
47], our final model also has a FAM block (
Figure 7). We include FAM only to reduce the misalignment of the last two feature layers (or specifically, C2 output from OS-PCPVT and output from the MuRDE module), as in our case, including this module between the first two (C1 and C2) levels actually degraded the performance. In our implementation of FAM, we used DCNv2 with a kernel of size 3; dilation and groups were set to 1.
FAM works as follows: Given two feature maps, those coming from deeper in the backbone are prone to more misalignment. To address this, the two feature maps are concatenated, and offsets and modulation masks are computed for the deformable convolution. Then, the map that suffers from more misalignment (in our case, output from the MuRDE block, as it was derived from a more downsampled feature layer) is refined with DCNv2, and the aligned map is then fused with the other map; in our case, simple addition was used.
3.4. Experimental Setup and Evaluation
Our experiments were conducted on an NVIDIA GeForce RTX 4090 GPU (manufactured by MSI, sourced in Vilnius, Lithuania). The experiments were implemented in Python 3.12.3 and PyTorch 2.5.0 with CUDA 12.4. To ensure a consistent training environment, we used a single seed (set to 42). Since a pretrained OS-PCPVT backbone was used throughout all of our experiments, we set discriminative learning rates: 5 × 10⁻⁵ for the backbone and 1 × 10⁻⁴ for the head and neck with the AdamW optimizer, with its parameters set to default values. To achieve optimal learning rate scheduling, our method uses MultiStepLRScheduler with milestones set to 10, 14, and 16 and the gamma parameter set to 0.2. We trained our model for 30 epochs, with a batch size of 8, using a Hanning window loss [39], with window size set to 33.
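The effect of the step schedule on the two learning rates can be worked out in closed form. The sketch below assumes that the decay takes effect from the milestone epoch onward with 0-indexed epochs, which is how PyTorch's MultiStepLR behaves; the scheduler actually used may place the boundary one epoch differently. The function name is hypothetical.

```python
from bisect import bisect_right


def multistep_lr(base_lr, epoch, milestones=(10, 14, 16), gamma=0.2):
    """Learning rate at a 0-indexed epoch under multi-step decay."""
    # Number of milestones already reached determines the decay exponent.
    return base_lr * gamma ** bisect_right(milestones, epoch)


for epoch in (0, 9, 10, 14, 16, 29):
    backbone = multistep_lr(5e-5, epoch)
    head = multistep_lr(1e-4, epoch)
    print(f"epoch {epoch:2d}: backbone {backbone:.2e}, head/neck {head:.2e}")
```

Under this assumption, both rates are reduced by a total factor of 0.2³ = 0.008 from epoch 16 until the end of the 30-epoch run.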
To evaluate other FPI methods on the UAV-Sat dataset, we opted not to use the same training hyperparameters as for our model but instead we followed procedures that these models use in their own setting as closely as possible. In other words, we kept the original network architectures, number of epochs, batch sizes, loss functions and their parameters, optimizers, schedulers and their respective parameters, learning rates, data normalization procedures, and input sizes. This was done in order to give each method the best shot possible, as, in training DL algorithms, the choice of various hyperparameters matters a lot, and we believe that each model was optimized best by their authors.
In line with the current FPI research, we evaluated our and other methods using RDS and meter-level accuracy (MA@K) metrics. RDS can be defined as:
$$\mathrm{RDS} = e^{-k \cdot \mathrm{RD}},$$
where k is a scaling factor (set to 10), and $\mathrm{RD}$ is the relative distance between the predicted image location $(x_p, y_p)$ and ground truth image coordinates $(x_g, y_g)$:
$$\mathrm{RD} = \sqrt{\frac{(d_x / w)^2 + (d_y / h)^2}{2}},$$
where $d_x = |x_p - x_g|$, $d_y = |y_p - y_g|$, and $w$ and $h$ are the width and height of the image. In this way, RDS is a crucial metric for evaluating models’ performance at the pixel level and is resistant to image scale transformation. Note that higher values of RDS are better, and the score is distributed between 0 and 1. This is the main model optimization metric as well. However, RDS is not very helpful when evaluating real-life distance performance. For this reason, an additional metric, MA@K, is defined as:
$$\mathrm{MA@}K = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{SD_i < K},$$
where $N$ is the number of samples. Here $\mathbb{1}_{SD < K}$ is a condition defined as:
$$\mathbb{1}_{SD < K} = \begin{cases} 1, & SD < K, \\ 0, & \text{otherwise}, \end{cases}$$
where $SD$ denotes the distance in meters and $K$ marks the chosen meter threshold. Note that MA@K is sensitive to scale. For larger images, the resizing operation gives way to larger unit pixel values in meters. This in turn means that pixel distances between predicted and true locations might be small, while the corresponding meter distances are larger than in smaller images.
We also introduce mean spatial distance (MSD), which is a derivative of SD. For the whole dataset, we define $\mathrm{MSD}$ as:
$$\mathrm{MSD} = \frac{1}{N} \sum_{i=1}^{N} \sqrt{\Delta x_i^2 + \Delta y_i^2},$$
where $\Delta x_i$ and $\Delta y_i$ denote the distance between the ground truth and predicted points in horizontal and vertical directions expressed in meters for one sample. Just like MA@K, this metric is sensitive to the scale of the image and should not be used as a primary evaluation metric.
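The three metrics can be sketched in a few lines. The formulas coded here are assumptions consistent with the description above: RDS as exp(−k · RD) with RD the root mean of the squared, size-normalized pixel offsets, MA@K as the share of samples with a meter distance strictly below K, and MSD as the mean meter distance. Function names and the meters-per-pixel argument are introduced for illustration only.

```python
import math


def rds(pred, gt, width, height, k=10.0):
    """Relative, scale-normalized score in (0, 1]; 1 is a perfect hit."""
    dx = abs(pred[0] - gt[0]) / width
    dy = abs(pred[1] - gt[1]) / height
    return math.exp(-k * math.sqrt((dx ** 2 + dy ** 2) / 2.0))


def spatial_distance(pred, gt, m_per_px):
    """Euclidean distance in meters between two pixel locations."""
    return math.hypot((pred[0] - gt[0]) * m_per_px,
                      (pred[1] - gt[1]) * m_per_px)


def ma_at_k(distances_m, k_m):
    """Fraction of samples whose meter distance is below the threshold."""
    return sum(d < k_m for d in distances_m) / len(distances_m)


def msd(distances_m):
    """Mean spatial distance in meters over all samples."""
    return sum(distances_m) / len(distances_m)


# Same 10-pixel error on a 768 x 768 map cut from two different crop sizes.
pred, gt = (394, 384), (384, 384)
print("RDS:", round(rds(pred, gt, 768, 768), 4))
for crop_px in (2400, 3500):
    m_per_px = crop_px * 0.3 / 768
    print(crop_px, "px crop:", round(spatial_distance(pred, gt, m_per_px), 2), "m")
```

The example shows the scale sensitivity noted above: the RDS is identical for both crops, whereas the meter distance grows with the area covered by the resized satellite image.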
5. Ablation Study
To further investigate the impact of different fusion, refinement, and enhancement strategies on the performance, we conducted an ablation analysis on the UL14 dataset. In the first analysis, we compare our strategies in an additive manner. We selected our baseline to be a simple FPN, as described in [
29], and with FPN++ we denote a variant of traditional FPN, where an additional refinement convolution of kernel size 3 is added after upsampling. Next, we additively introduce our proposed steps: ECA, multi-receptive enhancement (MuRE), MuRDE, and the addition of FAM, but only to fuse the output of MuRE/MuRDE and C2. In MuRE, the enhancing logic is the same as our method, except we use classic convolutions here. General results of UL14 are shown in
Table 5.
It can be seen that almost every strategy improved performance. Most notably, the biggest increase in performance can be seen between FPN and FPN++ (+7.18) and a cumulative increase of +8.14 when comparing our final method with FPN++. Comparing FPN++ and FPN++ with ECA, we chose to include ECA in our final model because of slightly better performance in the 3 and 5 m range. MA@3 was observed at 18.6 and 18.37, and MA@5 was seen to be at 38.58 and 38.44 for FPN++ with ECA and FPN++ without ECA, respectively. The benefit of the MuRDE block is evident, as it increases performance by 7.57 compared with the FPN++ with ECA configuration. Finally, the benefit of using DCNv2 instead of standard convolution is apparent, as MuRDE outperforms MuRE by 3.03 RDS points. Meter accuracies can be seen in
Figure 11, and the same tendencies can be seen there as well. Regarding FAM feature fusion strategy, while the RDS increased by only 0.59 points, improvements can be specifically seen in the precision range (3–10 m).
Additionally, we visualize the impact of different methods on the UL14 dataset in
Figure 12. In the figure, the first line showcases the heatmap on the whole satellite image, and the second line showcases a zoomed-in view. We specifically visualize those cases where RDS improvements were observed (compared with FPN++, ECA configuration). In those cases, MuRDE and FAM blocks can be seen to be beneficial not only in reducing overall distance between predicted and ground truth locations but also in concentrating the predicted area.
To investigate how kernels of different sizes and dilations (and therefore kernels of different receptive fields) affect the performance, we also conducted additional analysis. In this ablation experiment, we limited ourselves to kernels of sizes 3 and 5, given that the final model should be as light as possible, and kernels of smaller sizes are computationally cheaper.
The results are shown in
Table 6, and we compare experiments with the following kernel configurations: single deformable kernel of size 3 and dilation 1 (SDK); double deformable kernel of sizes 3 and 5 and dilations 1 and 1 respectively (DDK-A); double deformable kernel of sizes 3 and 5 and dilations 1 and 2 (DDK-B); triple kernel consisting of size 3, 5, 5 convolutions and 1, 1, 2 dilations (TDK). In such a way we constructed four different receptive field MuRDE modules: size 3 receptive field (SDK), size 3 and 5 receptive fields (DDK-A), size 3 and 9 receptive fields (DDK-B) and size 3, 5, 9 receptive field modules (TDK, our final model configuration).
Figure 13 visualizes the meter metrics for the MuRDE configuration, and it can be seen that, just like the increase in RDS, these metrics increased as well when comparing single, double, and triple kernel configurations.
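The receptive fields quoted for each configuration follow from the standard relation between kernel size and dilation, effective size = dilation × (kernel − 1) + 1. The short sketch below reproduces them; it describes only the span of the regular sampling grid, since the learned offsets of a deformable kernel can move samples beyond it. The dictionary and function names are hypothetical.

```python
def effective_kernel(kernel_size, dilation):
    """Span of a dilated kernel's regular sampling grid, in pixels."""
    return dilation * (kernel_size - 1) + 1


def same_padding(kernel_size, dilation):
    """Padding that keeps the spatial size unchanged at stride 1."""
    return dilation * (kernel_size - 1) // 2


# (kernel size, dilation) pairs for each ablation configuration.
CONFIGS = {
    "SDK": [(3, 1)],
    "DDK-A": [(3, 1), (5, 1)],
    "DDK-B": [(3, 1), (5, 2)],
    "TDK": [(3, 1), (5, 1), (5, 2)],
}

for name, branches in CONFIGS.items():
    fields = [effective_kernel(k, d) for k, d in branches]
    pads = [same_padding(k, d) for k, d in branches]
    print(f"{name}: receptive fields {fields}, paddings {pads}")
```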
6. Conclusions
This work presents MuRDE-FPN, a decoder-centered UAV localization strategy for the FPI task. On the standard UL14 benchmark, MuRDE-FPN achieved an RDS of 84.26 and improved same-backbone baselines (OS-FPI and DCD-FPI) by +8.01 and +7.11 RDS, respectively, while also increasing meter-level accuracies. We additionally introduce a UAV-Sat dataset that is intended to extend the findings to higher altitudes and bigger search area scenarios. On this dataset, MuRDE-FPN attained an RDS of 63.74 and outperformed OS-FPI and DCD-FPI by +4.71 and +7.30 RDS, respectively, indicating generalization beyond the UL14 setting. Additionally, our model achieved 60.66 RDS in very high-altitude conditions, a +5.5 increase in RDS compared with the second-best method, OS-FPI. The proposed improvements come with a moderate increase in complexity relative to DCD-FPI (+0.28 M parameters and +1.12 GMACs), while remaining substantially lighter than two-branch alternatives.
Nonetheless, the present study has some important limitations. We did not evaluate our or other methods in difficult conditions, such as UAV image occlusions or extremely difficult scenarios (rainy, foggy, or snowy weather). Moreover, we did not quantify robustness to real flight degradations. Future work will therefore focus on extensive evaluation under such conditions and on integration of the proposed vision-based localization output as an aiding signal for navigation pipelines in GNSS-challenged scenarios.
7. Discussion
In our work, we treated the FPI task as a mixture of target tracking and segmentation, as, on the one hand, finding point in an image is targeting but, on the other hand, precision positioning should restore the final map as a dense heatmap. To enrich the precise UAV positioning literature, we propose our MuRDE-FPN model, which utilizes a one-stream backbone with a custom-designed FPN decoder. Our model architecture, MuRDE-FPN, addresses key challenges of multi-scale feature enrichment and spatial misalignment. By introducing the MuRDE module at the deepest stage of the feature pyramid, we enhance high-level semantic representations, and FAM enables the adaptive merging of the enriched features while preserving spatial consistency. We believe that our findings indicate the potential of custom FPN refining methods in the FPI task, with a special focus on specific scales, as with our method we observed the improved performance of UAV localization in two datasets: UL14 and UAV-Sat. The improvements were seen in both relative (RDS) and absolute (MSD, MA@k) metrics. We also found that our method improves localization with bigger satellite search maps and high altitudes compared with other methods.
We are, however, limited by our backbone design, and for this reason, our implications can be extended only to a specific setting: one-stream hierarchical backbone designs, in which cross-modal modeling is done at every stage. Even then, given the improvements in performance, some of our design choices can be integrated into different backbones. For instance, Chen et al. [
14] noticed that effective usage of the last feature map significantly enhanced performance compared with other methods. Our findings are very similar, as we note an increase in performance when enhancing the most semantically rich layer as well. We adopt a multi-receptive approach and observe a performance increase when using non-deformable convolution as well. However, it is still unclear what the magnitude of improvement would be if such an approach were integrated into other backbones, as other FPI works [
14,
15,
16] note that the backbone design by itself impacts the performance as well.
Model size and memory requirements matter a lot in UAV applications since onboard deployment dictates a scarcity of resources. While our method is much lighter in terms of model size than the compared two-branch methods, some achieved similar results on the UL14 with lower computational cost [
14,
15,
16]. In the future, we will address this issue with the smallest decrease in performance possible. In addition to popular model design solutions (model pruning techniques) to reduce the computational overhead, we will also aim to investigate different training strategies (knowledge distillation, deep supervision, contrastive pretraining) for optimal performance.
We also propose a new dataset for proper evaluation of the FPI methods. While similar datasets (derived from UAV-VisLoc) were proposed before [
14,
16,
18], our derivation is unique in that the validation set is similar in its difficulty compared with the UL14 dataset and enables evaluation on much bigger satellite images. Our dataset also contains high altitude UAV footage, which makes it inherently more diverse for practical UAV applications. There are, however, some gaps that this dataset is unable to address. It is not clear how the end-to-end methods perform overall under different quality data. In particular, a gap exists when evaluating images with visible/substantial change or occlusion, e.g., in areas that experience rapid urbanization, destruction, climate change, night-time footage, motion artefacts, or seasonal variation. In the future, we will aim to address this gap with additional synthetic expansion (e.g., using generative adversarial networks) of our dataset to include such difficult footage.