LumiLoc: A Low-Light-Optimized Visual Localization Framework for Autonomous Drones

Qu, Ruokun; Wang, Zhiyuan; Liu, Yelu; Li, Chenglong; Jiang, Hui; Fang, Chen

doi:10.3390/aerospace12060454

Open AccessArticle

LumiLoc: A Low-Light-Optimized Visual Localization Framework for Autonomous Drones

by

Ruokun Qu

¹,

Zhiyuan Wang

¹

,

Yelu Liu

^1,*,

Chenglong Li

^1,2

,

Hui Jiang

³ and

Chen Fang

¹

College of Air Traffic Management, Civil Aviation Flight University of China, Chengdu 618307, China

²

School of Electronic Information Engineering, Beihang University, Beijing 100191, China

³

School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610050, China

^*

Author to whom correspondence should be addressed.

Aerospace 2025, 12(6), 454; https://doi.org/10.3390/aerospace12060454

Submission received: 19 March 2025 / Revised: 4 May 2025 / Accepted: 8 May 2025 / Published: 22 May 2025

(This article belongs to the Special Issue Advances in UAV Technology: Dynamics, Guidance, Navigation, and Control of Transformative Aerial Vehicles)

Download

Browse Figures

Versions Notes

Abstract

In low-light conditions, UAV localization faces substantial challenges due to reduced visibility, elevated noise levels, and diminished contrast. To address these issues, we propose a low-light-optimized visual localization framework that integrates an attention-based image enhancement module, a robust feature extraction network tailored for degraded environments, and a lightweight pose estimation algorithm that fuses geometric and convolutional features. Extensive evaluations on both real-world and synthetic low-light datasets reveal significant improvements in accuracy, noise resilience, and adaptability to dynamic lighting. Moreover, experimental results validate the framework’s feasibility for applications in night operations, urban air traffic management, and disaster response, thereby effectively overcoming the critical limitations of UAV positioning under low-light conditions.

Keywords:

low-light UAV localization; image enhancement; feature extraction; pose estimation; noise robustness; dynamic lighting adaptability

1. Introduction

Unmanned aerial vehicles (UAVs), commonly referred to as drones, have emerged as transformative tools across diverse domains, including environmental monitoring, search and rescue operations, logistics, and urban air mobility [1]. The capability of UAVs to navigate autonomously and perform tasks in complex and dynamic environments depends critically on robust localization algorithms [2]. Among these, visual localization has gained significant traction due to its potential to offer precise and real-time positional information by leveraging visual data from onboard cameras [3]. However, most existing visual localization systems encounter substantial challenges in low-light conditions, where reduced visibility, noise, and poor contrast severely degrade their performance. In practical deployments, failures in low-light localization have led to UAV crashes during nighttime disaster response drills and hindered coordination in urban air traffic during twilight hours [4,5]. These issues primarily stem from the reliance of traditional localization algorithms on visual keypoints that require well-lit environments to function reliably. In the absence of sufficient illumination, these keypoints become sparse or noisy, leading to severe drift or loss of tracking. Such real-world limitations highlight the critical need for a system like LumiLoc, which integrates temporal modeling and illumination-insensitive feature extraction to ensure stable navigation in visually degraded conditions. Addressing this limitation is crucial to enable UAVs to operate reliably in scenarios such as nighttime missions, indoor environments, and adverse weather conditions [6].

The problem of low-light visual localization presents a unique set of challenges that extend beyond conventional computer vision tasks. Unlike general image enhancement, UAV localization demands robust and invariant feature extraction to ensure accurate positional estimates in degraded lighting [7]. Existing methods often rely heavily on handcrafted features or traditional convolutional neural networks (CNNs), which are not inherently optimized for low-light conditions. Furthermore, low-light environments exacerbate issues such as motion blur, high image noise, and a lack of distinct keypoints, thereby diminishing the effectiveness of conventional approaches. This necessitates the development of specialized frameworks capable of overcoming these constraints while maintaining computational efficiency suitable for deployment on resource-constrained UAV platforms [8].

To address the challenges posed by low-light environments, recent advancements in deep learning have paved the way for novel solutions [9]. Techniques such as image enhancement networks, denoising autoencoders, and low-light-specific feature extractors have demonstrated promising results in improving visual quality and extracting meaningful information from degraded images [10]. However, integrating these advancements into a unified framework for UAV localization remains an open research problem. Current approaches are often fragmented, focusing either on improving image quality or enhancing feature extraction, without fully addressing the end-to-end requirements of localization tasks in low-light scenarios. Moreover, these methods frequently overlook the stringent computational and energy constraints inherent to UAV systems, limiting their practical applicability in real-world missions [11].

In this paper, we introduce LumiLoc, a low-light-optimized visual localization framework designed specifically for autonomous drones. The proposed framework represents a significant leap forward in the field of UAV localization by combining advanced deep learning techniques with an end-to-end optimization strategy tailored for low-light conditions. LumiLoc integrates three core components: a low-light image enhancement module, a robust feature extraction network, and a lightweight pose estimation algorithm. The low-light enhancement module leverages state-of-the-art image processing techniques to amplify visibility while suppressing noise and preserving critical structural information. The preprocessed data are then passed through a feature extraction network optimized for degraded lighting, ensuring robust and invariant feature detection. Finally, the pose estimation algorithm translates these features into accurate positional estimates, leveraging an adaptive optimization mechanism to handle the uncertainties inherent in low-light environments.

The primary contributions of this work are summarized as follows:

A novel low-light image enhancement module is developed, which integrates an attention mechanism to amplify critical visual details while mitigating noise and distortions inherent to challenging lighting conditions.
A robust feature extraction framework is designed, leveraging state-of-the-art deep learning backbones tailored to enhance localization accuracy under low-light scenarios and ensuring computational efficiency for deployment on UAVs.
The localization pipeline incorporates a hybrid geometric learning strategy that fuses visual and spatial features, enabling robust pose estimation in the presence of adverse visual conditions.
Extensive experiments are conducted on synthetic and real-world low-light UAV datasets, demonstrating significant performance improvements in localization accuracy and robustness compared with baseline methods.

These innovations collectively advance the state of the art in UAV visual localization under low-light conditions, addressing key limitations of existing methods in both practical deployment and theoretical frameworks.

LumiLoc was evaluated not only in controlled experiments but also in practical field trials. During simulated nighttime rescue missions, where ambient lighting was minimal and environmental textures were heavily degraded, traditional methods failed to maintain stable localization due to the loss of discriminative features. In contrast, LumiLoc leveraged its temporal state representation and multi-scale encoding to extract robust structural cues from low-light images, enabling consistent and accurate UAV navigation. These results confirm the framework’s capacity to handle real-world conditions that challenge conventional techniques. Our experiments demonstrate that LumiLoc outperforms existing state-of-the-art methods in terms of localization accuracy, robustness, and computational efficiency. Notably, LumiLoc achieves significant improvements in scenarios involving high noise levels, poor contrast, and dynamic lighting variations, highlighting its capability to handle the complexities of low-light environments. Furthermore, the proposed framework is tested on an embedded UAV platform, showcasing its practical feasibility and real-time performance in resource-constrained settings.

Beyond its technical advancements, LumiLoc has profound implications for the broader field of UAV operations. By enabling reliable localization in low-light conditions, LumiLoc opens up new possibilities for nighttime operations, underground exploration, and disaster response missions where visibility is severely compromised. The system’s demonstrated success in practical deployments not only validates its effectiveness but also underscores its potential to be adopted in mission-critical drone applications. These capabilities not only enhance the operational flexibility of UAVs but also contribute to the advancement of autonomous systems in general. Moreover, the insights gained from LumiLoc’s development can serve as a foundation for future research into robust perception systems for challenging environments.

2. Related Work

2.1. Visual Localization for UAVs

Visual localization has become a cornerstone for autonomous UAV navigation by leveraging image-based techniques to accurately estimate both position and orientation [12,13]. Traditional methods, such as feature-based approaches, have relied on algorithms like ORB-SLAM and SIFT to extract robust visual features in well-structured environments [14,15,16]. However, these approaches often falter in low-light conditions due to reduced contrast and increased image noise. In recent years, deep learning-based techniques have revolutionized the field by employing convolutional neural networks (CNNs) to directly predict 6-DoF poses from images. Despite these advancements [17,18], challenges persist under degraded illumination, where deep models are prone to overfitting or producing inconsistent localization outcomes owing to insufficient training data in low-light scenarios [19].

One of the primary challenges in low-light visual localization is the degradation of image quality. Under insufficient illumination, the sensor’s signal-to-noise ratio decreases significantly, resulting in blurred or noisy images that hinder effective feature extraction. Traditional feature descriptors struggle when visual content lacks sufficient texture or contrast, which adversely affects both feature matching and the subsequent pose estimation process. To alleviate these issues, some researchers have explored conventional image preprocessing techniques—such as histogram equalization, denoising filters, and contrast enhancement—as a preliminary step before feature extraction [20]. However, these methods often fall short in fully restoring critical details lost in low-light conditions.

In response, recent research has shifted towards integrating image enhancement directly with deep feature extraction in an end-to-end learning framework. Joint optimization frameworks, where an image enhancement module is concurrently trained with the localization network, have shown promise in directly improving localization accuracy under adverse lighting [21]. Additionally, the incorporation of attention mechanisms enables the network to focus on salient regions in low-light images, thereby bolstering the robustness of feature detection and matching, even when the image quality is compromised.

Another emerging trend involves the use of generative models to synthesize training data that mimic a variety of low-light conditions. By augmenting the training dataset with realistic low-light scenarios, these methods enhance the model’s ability to generalize and mitigate overfitting. Domain adaptation strategies have also been employed to align the feature distributions between well-lit and low-light images, bridging the gap between training and operational environments [22]. Such approaches are instrumental in ensuring that the localization system maintains high performance when confronted with unforeseen lighting variations.

Despite these advancements, a significant challenge remains in balancing the improved robustness with computational efficiency. Many of the state-of-the-art techniques, while effective, involve complex network architectures and additional processing stages that demand considerable computational resources. This is particularly problematic for real-time UAV operations where onboard hardware is typically resource-constrained [23]. Consequently, there is a continuous need for lightweight architectures and efficient inference strategies that do not compromise the accuracy or robustness of visual localization.

2.2. Low-Light Image Enhancement Techniques

Low-light image enhancement is a critical precursor for robust localization under challenging illumination. Conventional enhancement techniques, such as histogram equalization and Retinex-based methods [24,25], aim to improve image contrast and visibility by manipulating pixel intensities. These methods work by redistributing the brightness values in an image so that details in dark regions become more discernible. Although they are computationally efficient and easy to implement, these techniques often amplify existing noise, which leads to artifacts that can degrade localization accuracy [26].

Deep learning methods, including EnlightenGAN [27], Zero-DCE [28], and KinD++ [29], have emerged as state-of-the-art solutions for low-light enhancement. These methods leverage neural architectures that are trained to adaptively enhance image details while simultaneously suppressing noise. By learning end-to-end mapping from low-light to enhanced images, these approaches improve visual quality, thereby facilitating downstream tasks such as visual odometry and mapping. However, incorporating these deep learning models into UAV systems presents unique challenges [30]. The increased computational complexity of these networks can lead to higher latency, and, when combined with the limited onboard computational resources of UAVs, it poses significant hurdles for real-time operation.

Moreover, many existing enhancement models are optimized for generic imaging applications, which do not fully address UAV-specific requirements. In UAV operations, issues like motion blur, dynamic lighting variations, and stringent real-time processing constraints are common and further complicate low-light conditions [31]. To address these challenges, recent works have proposed lightweight architectures that are specifically tailored for resource-constrained devices, such as drones. For example, MobileNet-based models and attention-guided frameworks have demonstrated potential for real-time enhancement [32]. Nevertheless, these solutions are still in their early stages and require further optimization to achieve an ideal balance between enhancement performance, computational efficiency, and overall model robustness [33].

2.3. Localization Under Adverse Conditions

The robustness of UAV localization systems is often evaluated under a range of adverse conditions, including dynamic weather, occlusions, and, notably, low-light scenarios [34]. While techniques such as LiDAR-based SLAM and multi-sensor fusion have proven to be robust in these challenging environments, they invariably involve trade-offs in terms of cost, weight, and power consumption. Consequently, visual-based methods continue to be the preferred solution for lightweight UAVs, as they offer a more economical and efficient alternative [35,36].

Under low-light conditions, several methods have attempted to enhance localization performance by integrating additional sensing modalities alongside traditional RGB inputs. For example, infrared-assisted SLAM leverages thermal imaging to support nighttime navigation; however, its dependency on specialized hardware restricts its widespread adoption. Similarly, event-based sensors have been utilized for high-dynamic-range imaging, which facilitates better feature extraction under extreme lighting conditions. Despite their benefits, these approaches necessitate significant algorithmic modifications and are not universally compatible with existing UAV platforms [37,38].

In contrast, recent advancements in neural network architectures have shifted focus toward end-to-end learning frameworks that are capable of adapting to environmental changes. Such frameworks incorporate strategies like temporal feature extraction to improve motion estimation over time and attention mechanisms to enhance spatial awareness. Nevertheless, these methods often prioritize overall robustness under various adverse conditions, with limited specialization for low-light optimization [39,40]. The proposed LumiLoc framework addresses this shortcoming by directly targeting the challenges of visual localization in low-light environments. It leverages a tailored pipeline that integrates effective feature extraction, dedicated low-light enhancement, and lightweight neural architectures specifically designed for UAV applications [41].

3. Problem Formulation

In this section, the visual localization problem for UAVs under low-light conditions is mathematically formulated. We aim to estimate the 6-DoF pose of a UAV, represented by its translation and orientation, given an input image captured under challenging illumination conditions. The problem is divided into the enhancement of low-light images, extraction of robust features, and pose estimation.

3.1. Problem Definition

Let

I_{low} \in R^{H \times W \times C}

denote an input low-light RGB image, where H, W, and C represent the image’s height, width, and number of channels, respectively. The goal is to estimate the UAV’s pose,

P = (t, q)

, where

t \in R^{3}

is the translation vector indicating the UAV’s 3D position and

q \in R^{4}

is a unit quaternion representing its orientation. This can be expressed as:

P = f (I_{low}),

(1)

where

f (\cdot)

represents the localization model to be learned.

3.2. Challenges in Low-Light Visual Localization

Low-light environments introduce significant challenges to visual localization, including reduced contrast, amplified noise, and loss of spatial detail. These factors hinder the extraction of reliable features from

I_{low}

, leading to inaccurate pose estimation. Therefore, the problem involves two key sub-tasks:

Enhancing $I_{low}$ to generate a visually improved image, $\hat{I}$ , that preserves structural and semantic information.
Extracting robust features from $\hat{I}$ for accurate pose estimation.

3.3. Optimization Objective

The localization task is framed as an optimization problem. Given a dataset

D = {(I_{low, i}, P_{i})}_{i = 1}^{N}

, where

P_{i} = (t_{i}, q_{i})

is the ground truth pose for the i-th sample, we aim to minimize a loss function,

L_{total}

, that combines enhancement and localization objectives:

L_{total} = L_{e n h} + α L_{l o c},

(2)

where

α

is a weighting factor balancing the enhancement and localization tasks. The components of this loss function are defined as follows.

3.3.1. Low-Light Enhancement Loss

The enhancement loss,

L_{e n h}

, ensures that the enhanced image,

\hat{I}

, is visually consistent with a well-lit reference image,

I_{g t}

:

L_{e n h} = λ_{p} L_{p} (\hat{I}, I_{g t}) + λ_{s} L_{s} (\hat{I}, I_{g t}),

(3)

where

L_{p}

and

L_{s}

are perceptual and structural similarity losses, respectively, and

λ_{p}, λ_{s}

are weighting coefficients.

3.3.2. Pose Estimation Loss

The localization loss,

L_{l o c}

, penalizes errors in the predicted pose

\hat{P} = (\hat{t}, \hat{q})

compared with the ground truth

P = (t, q)

:

L_{l o c} = ∥ \hat{t} {- t ∥}_{2}^{2} + β {∥ \hat{q} - q ∥}_{2}^{2},

(4)

where

β

is a weighting factor that balances the translational and rotational components of the pose error.

3.4. Pipeline Overview

To solve this problem, we decompose

f (\cdot)

into three modules: enhancement, feature extraction, and localization. Each module addresses a specific sub-problem:

P = f_{loc} (f_{feat} (f_{enh} (I_{low}))),

(5)

where

f_{enh}

,

f_{feat}

, and

f_{loc}

are the functions for image enhancement, feature extraction, and pose estimation, respectively.

The subsequent Methods section describes the detailed implementation of these modules, their architectures, and the optimization strategies used to achieve robust and accurate localization under low-light conditions.

4. Methods

This section presents the proposed LumiLoc framework, designed to address the challenges of UAV visual localization under low-light conditions. The framework is structured into three major modules: low-light image enhancement, feature extraction, and visual localization. These components are tightly integrated to provide a robust and efficient pipeline capable of real-time inference. An overview of the framework is provided, followed by the mathematical formulations underlying each module.The novelty of our approach lies in the integrated design of a low-light image enhancement module that employs a modified pyramid attention mechanism, a tailored feature extraction network (EffiVGG) incorporating attention modules, and a hybrid geometric learning method for pose estimation. This unified, end-to-end framework not only enhances localization accuracy under challenging low-light conditions but also ensures computational efficiency suitable for resource-constrained UAV platforms.

4.1. Overview of LumiLoc Framework

The proposed framework is designed to address the challenges of UAV visual localization under low-light conditions by integrating low-light image enhancement, robust feature extraction, and hybrid geometric learning for pose estimation into a cohesive pipeline. The first module of the framework enhances the input images captured under low-light conditions by employing a pyramid-based attention mechanism. This approach preserves critical visual details while reducing noise, ensuring that subsequent feature extraction is effective.

The feature extraction module leverages a deep convolutional neural network, optimized to produce semantically rich and spatially robust representations. By integrating pre-trained weights and a fine-tuning strategy specific to low-light environments, the framework ensures that extracted features remain discriminative across varying lighting conditions. These features are subsequently fed into a hybrid geometric learning pipeline, which combines the strengths of both visual and spatial features. The hybrid approach incorporates keypoint matching and structural constraints to improve the accuracy of pose estimation.

The entire pipeline is trained end-to-end with a multi-task loss function that balances low-light enhancement, feature extraction, and pose estimation objectives. This design ensures the system not only is computationally efficient but also achieves high localization accuracy in scenarios where traditional methods often fail. Details of each module are provided in the following subsections.

The end-to-end framework is designed to minimize computational overhead while maintaining high accuracy, making it suitable for UAVs with limited onboard resources. A schematic representation of the LumiLoc pipeline is shown in Figure 1.

4.2. Low-Light Image Enhancement

To address the significant degradation in image quality under low-light conditions, we introduce a low-light image enhancement module based on a pyramid attention mechanism. This module amplifies global illumination patterns while preserving local structural details, ensuring high-quality inputs for downstream visual localization tasks, even in extreme lighting conditions. The module operates on the input low-light image

I_{low} \in R^{H \times W \times C}

and produces an enhanced output,

\hat{I} \in R^{H \times W \times C}

.

The pyramid attention mechanism [42] captures features at multiple resolutions, allowing the model to learn both global illumination patterns and fine-grained local texture details. This multi-scale approach avoids over-smoothing, which is common in traditional enhancement methods, and preserves crucial features such as edges that are important for localization tasks. The input low-light image,

I_{low}

, is first decomposed into multiple scaled representations,

{I_{1}, I_{2}, \dots, I_{L}}

, through downsampling. Each scale, l, is then processed by a lightweight convolutional network to compute an attention map,

A_{l} \in R^{H_{l} \times W_{l}}

, defined as:

A_{l} = σ (f_{att} (I_{l}; Θ_{att})),

(6)

where

f_{att}

denotes the attention network and

σ (\cdot)

is the sigmoid activation function. The refined feature map,

I_{l}^{'}

, is obtained by element-wise multiplication:

I_{l}^{'} = A_{l} ⊙ I_{l} .

(7)

The enhanced image,

\hat{I}

, is reconstructed by upsampling and aggregating the refined features from all pyramid levels:

\hat{I} = \sum_{l = 1}^{L} Upsample (I_{l}^{'}) .

(8)

This pyramid attention mechanism differs from traditional single-scale methods by explicitly using hierarchical feature refinement, which improves the model’s adaptability to varying low-light conditions. Additionally, the dynamic nature of the attention mechanism allows it to prioritize regions most affected by low light, such as shadowed areas or underexposed textures, which is particularly beneficial for UAV applications with rapidly changing lighting conditions.

To optimize the enhancement module, we use a multi-faceted loss function that balances perceptual quality and structural integrity. The perceptual loss,

L_{p}

, compares high-level features from a pre-trained VGG network:

L_{p} = \sum_{k = 1}^{K} {∥ Φ_{k} (\hat{I}) - Φ_{k} (I_{g t}) ∥}_{2}^{2},

(9)

where

Φ_{k} (\cdot)

represents the k-th feature map from the VGG network and

I_{g t}

is the ground-truth well-lit image. To preserve structural integrity, we incorporate the structural similarity (SSIM) loss,

L_{s}

:

L_{s} = 1 - SSIM (\hat{I}, I_{g t}) .

(10)

Additionally, the pixel-level reconstruction loss,

L_{recon}

, minimizes the

ℓ_{1}

distance between the enhanced and reference images:

L_{recon} = {∥ \hat{I} - I_{g t} ∥}_{1} .

(11)

The total loss function is a weighted sum of the perceptual, structural, and reconstruction losses:

L_{e n h} = λ_{p} L_{p} + λ_{s} L_{s} + λ_{recon} L_{recon},

(12)

where

λ_{p}

,

λ_{s}

, and

λ_{recon}

are empirically determined weight factors.

The pyramid attention mechanism is implemented with three pyramid levels (

L = 3

), and the attention network at each level consists of two convolutional layers with ReLU activations, followed by a sigmoid function. The image resolution is reconstructed using bilinear upsampling. Training is conducted on a dataset of low-light UAV images, augmented with synthetic low-light transformations, such as gamma correction and random noise.

Figure 2 demonstrates qualitative results of the enhancement module, showing significant improvements in visual clarity and feature preservation. Figure 3 provides an architectural overview of the pyramid attention mechanism, illustrating the multi-scale feature processing and aggregation for enhanced image reconstruction.

4.3. Robust Feature Extraction

The second stage of the pipeline is tasked with extracting robust and discriminative features from the enhanced image,

\hat{I}

, which are critical for downstream visual localization tasks. In this stage, we propose the use of a modified version of the RepVGG architecture, termed EffiVGG, as the backbone for feature extraction. EffiVGG integrates an attention mechanism to further enhance feature representation quality while maintaining computational efficiency during inference.

EffiVGG processes the input image

\hat{I} \in R^{H \times W \times C}

through several convolutional layers to produce a high-dimensional feature map,

F \in R^{H_{f} \times W_{f} \times C_{f}}

, defined as:

F = f_{feat} (\hat{I}; Θ_{feat}),

(13)

where

f_{feat} (\cdot)

represents the feature extraction function and

Θ_{feat}

denotes its learnable parameters. The EffiVGG architecture is designed to overcome the limitations of traditional CNN backbones by balancing representation power and efficiency. The multi-branch design employed during training allows for diverse feature learning, including primary 3 × 3 convolutional layers for spatial feature extraction, identity, and residual connections to preserve low-level features, and a lightweight attention-enhanced branch to refine intermediate feature maps.

At inference time, the multi-branch structure is collapsed into a single 3 × 3 convolutional layer using kernel fusion, significantly reducing computational costs while retaining the rich feature representations learned during training.

The inclusion of an attention mechanism is crucial for enhancing feature extraction robustness. The attention mechanism computes a spatial attention map,

M \in R^{H_{f} \times W_{f}}

, for each feature map,

F_{b}

, from the b-th branch as:

M = σ (f_{att} (F_{b})),

(14)

where

f_{att} (\cdot)

is a lightweight convolutional network designed for attention generation and

σ (\cdot)

is the sigmoid activation function. The refined feature map,

F_{b}^{'}

, is then obtained by applying element-wise multiplication:

F_{b}^{'} = M ⊙ F_{b},

(15)

where ⊙ denotes element-wise multiplication. The final feature map,

F

, is obtained by aggregating the refined feature maps from all branches and applying kernel fusion during inference.

EffiVGG’s design ensures a balance between computational efficiency and robustness. During training, the multi-branch structure and attention mechanism allow for the learning of diverse feature patterns, improving the model’s robustness. During inference, the single-branch structure reduces computational overhead, making EffiVGG suitable for UAV applications where real-time processing and limited resources are major constraints, such as in low-altitude urban air mobility scenarios.

EffiVGG consists of four convolutional stages, each followed by a downsampling layer. The attention mechanism is integrated into the second and third stages, where higher-level feature refinement is most crucial. Batch normalization and ReLU activations are applied after each convolutional operation. During training, the model uses a batch size of 64 and is optimized with the Adam optimizer and a learning rate scheduler. The model is trained end to end with a combination of classification and auxiliary losses to improve feature discriminability. Figure 4 shows the EffiVGG architecture, depicting both the multi-branch structure during training and the simplified architecture used during inference.

4.4. Contrastive Learning with Dynamic Prototypes (CLDP)

To further enhance feature discriminability and address the limitations associated with traditional triplet-based training (e.g., reliance on carefully selected triplets and sensitivity to margin tuning), we introduce a novel training strategy known as Contrastive Learning with Dynamic Prototypes (CLDP). This method dynamically groups features into clusters using prototype vectors, enforcing contrastive learning at both the cluster and instance levels.

During training, we dynamically compute a set of prototype vectors,

{P_{k}}_{k = 1}^{K}

, where each prototype,

P_{k} \in R^{C_{f}}

, represents a cluster of features. These prototypes are initialized as the mean feature vectors of the clusters they represent and are updated iteratively as the training progresses. Each feature,

F_{i}

, is assigned to the closest prototype,

P_{k}

, using the following assignment rule:

k_{i} = arg min_{k} {∥ F_{i} - P_{k} ∥}_{2}^{2},

(16)

where

k_{i}

is the index of the closest prototype to the feature

F_{i}

. The prototypes are updated at each mini-batch to reflect the current distribution of features:

P_{k} = \frac{1}{| C_{k} |} \sum_{F_{i} \in C_{k}} F_{i},

(17)

where

C_{k}

is the set of features assigned to prototype

P_{k}

.

The contrastive loss at the prototype level encourages features to be close to their assigned prototype while being distant from other prototypes. This loss is defined as:

L_{p r o t o} = \sum_{i = 1}^{N} {(∥ F_{i} - P_{k_{i}} ∥_{2}^{2} - min_{k \neq k_{i}} {∥ F_{i} - P_{k} ∥}_{2}^{2} + δ)}_{+},

(18)

where

{(x)}_{+} = max (0, x)

, N is the number of samples in the mini-batch, and

δ

is a margin parameter that enforces separation between positive and negative prototypes.

In addition to the prototype-level contrastive loss, we also perform instance-level contrastive learning, where each feature,

F_{i}

, is contrasted with both positive samples (features assigned to the same prototype) and negative samples (features assigned to other prototypes). The instance-level contrastive loss is computed as:

L_{i n s t} = - \frac{1}{| C_{k_{i}} |} \sum_{F_{j} \in C_{k_{i}}} log \frac{exp (F_{i} \cdot F_{j} / τ)}{\sum_{F_{k} \in C} exp (F_{i} \cdot F_{k} / τ)},

(19)

where

τ

is a temperature hyperparameter and

C

represents the set of all features in the mini-batch.

The overall training objective combines the prototype-level and instance-level contrastive losses, with the following final loss function:

L_{C L D P} = λ_{proto} L_{p r o t o} + λ_{inst} L_{i n s t} .

(20)

The proposed CLDP method offers significant advantages by dynamically adapting to feature distributions through the updating of prototypes across multiple epochs during training. This strategy ensures scalability and minimizes computational overhead. By integrating prototype-level and instance-level contrastive learning, the method captures both coarse-grained and fine-grained relationships, improving the discriminability of learned features.

Training employs the EffiVGG backbone for feature extraction, with prototypes dynamically updated in each epoch. Contrastive loss is computed at both the prototype and instance levels, optimizing feature learning comprehensively. The Adam optimizer is utilized during training, alongside a learning rate scheduler to ensure stable convergence throughout the epochs. Figure 5 depicts the training process, highlighting the gradual decline in loss as the model converges.

4.5. Hybrid Geometric Learning for Pose Estimation

As shown in Algorithm 1. The final stage of the pipeline focuses on estimating the 6-DoF UAV pose

P = (t, q)

, where

t \in R^{3}

denotes the translation vector and

q \in R^{4}

represents the orientation in quaternion form. To achieve accurate and robust pose estimation, we propose a hybrid geometric learning (HGL) approach that combines deep feature learning with spatial geometric constraints. By incorporating both visual and geometric information, this method effectively handles noise, occlusions, and ambiguities commonly encountered in real-world UAV localization tasks.

The pose regression network is designed to predict both translation,

\hat{t}

, and quaternion,

\hat{q}

, from the feature map,

F

. The network employs a multi-level feature fusion strategy, which aggregates hierarchical feature representations from different levels of the feature map to capture both global context and fine-grained local details. The input feature map

F \in R^{H_{f} \times W_{f} \times C_{f}}

is extracted by the EffiVGG backbone. To enhance the robustness of pose estimation, a channel attention mechanism is introduced to adaptively weight feature channels based on their relevance to pose prediction. This can be expressed as:

F^{'} = F ⊙ σ (W_{att} \cdot GAP (F)),

(21)

where

W_{att}

are learnable parameters,

GAP (\cdot)

denotes global average pooling,

σ (\cdot)

represents the sigmoid activation function, and ⊙ is the element-wise multiplication operator. The resulting refined feature map,

F^{'}

, is subsequently passed through two separate branches for translation and orientation regression:

\begin{matrix} \hat{t} & = f_{trans} (F^{'}; Θ_{trans}), \end{matrix}

(22)

\begin{matrix} \hat{q} & = f_{rot} (F^{'}; Θ_{rot}), \end{matrix}

(23)

where

f_{trans}

and

f_{rot}

denote the sub-networks for translation and orientation regression, respectively. The architecture of the pose regression network consists of convolutional layers followed by fully connected layers. Skip connections are utilized to preserve low-level spatial information, while batch normalization is applied to ensure stable training.

To further enhance the accuracy of pose estimation, we incorporate geometric constraints into the learning process. These constraints ensure that the predicted poses are physically plausible and consistent with the spatial geometry of the scene. In particular, the predicted quaternion,

\hat{q}

, is normalized to enforce a valid rotation:

\hat{q} \leftarrow \frac{\hat{q}}{∥ \hat{q} ∥_{2}},

(24)

which eliminates the ambiguity caused by unconstrained quaternion values.

Additionally, we introduce a spatial consistency loss to improve robustness by penalizing discrepancies between the predicted pose and the geometric relationships within the scene. Specifically, the relative transformation between two poses

P_{i}

and

P_{j}

should satisfy:

T_{i j} = P_{i}^{- 1} P_{j},

(25)

where

T_{i j}

is the relative transformation matrix. The spatial consistency loss is then defined as:

L_{s p a t i a l} = \sum_{(i, j)} {∥ {\hat{T}}_{i j} - T_{i j} ∥}_{F}^{2},

(26)

where

{∥ \cdot ∥}_{F}

denotes the Frobenius norm.

The overall loss function for pose estimation combines multiple terms to ensure both accurate and robust predictions. The primary loss for translation and orientation is given by:

L_{l o c} = α ∥ \hat{t} {- t ∥}_{2}^{2} + β {∥ \hat{q} - q ∥}_{2}^{2},

(27)

where

α

and

β

are weighting factors that balance the translation and orientation components. The total loss function also includes the spatial consistency loss,

L_{s p a t i a l}

, and a regularization term,

L_{r e g}

, to prevent overfitting:

L_{t o t a l} = L_{l o c} + γ L_{s p a t i a l} + λ L_{r e g},

(28)

where

γ

and

λ

are hyperparameters.

The hybrid geometric learning approach integrates both visual features from the CNN backbone and spatial constraints derived from the geometry of the scene. By leveraging these complementary sources of information, the model achieves superior robustness and accuracy in pose estimation. During training, we employ a curriculum learning strategy that gradually increases the weight of the spatial consistency loss,

L_{s p a t i a l}

, to ensure stable optimization.

The pose regression network is trained using the Adam optimizer, with a learning rate scheduler to ensure convergence. Data augmentation techniques, such as random rotations, translations, and Gaussian noise injection, are applied to improve the model’s generalization ability for real-world scenarios. The novelty of our hybrid geometric learning approach lies in the seamless integration of deep visual feature learning with explicit geometric constraints within an end-to-end optimization framework, a design that significantly distinguishes our work from previous methods that treat these components separately.

Algorithm 1 Hybrid Geometric Learning for 6-DoF Pose Estimation.

Require:: Feature map $F$ from EffiVGG backbone, hyperparameters $α, β, γ, λ$
Ensure:: Predicted translation $\hat{t}$ , predicted quaternion $\hat{q}$
1:: Initialize $\hat{t} = 0$ , $\hat{q} = I_{4}$ ▹ Initialize translation and quaternion
2:: Step 1: Feature Refinement
3:: for each feature map $F$ do
4:: Apply channel attention: $F^{'} = F ⊙ σ (W_{att} \cdot GAP (F))$
5:: Feature fusion: Extract hierarchical features from multiple levels of $F$ for fusion.
6:: end for
7:: Step 2: Pose Regression
8:: for each refined feature map $F^{'}$ do
9:: Predict translation: $\hat{t} = f_{trans} (F^{'})$
10:: Predict quaternion: $\hat{q} = f_{rot} (F^{'})$
11:: end for
12:: Step 3: Quaternion Normalization
13:: Normalize quaternion: $\hat{q} \leftarrow \frac{\hat{q}}{∥ \hat{q} ∥_{2}}$ ▹ Ensure valid rotation representation
14:: Step 4: Compute Loss
15:: Compute localization loss: $L_{l o c} = α ∥ \hat{t} {- t ∥}_{2}^{2} + β {∥ \hat{q} - q ∥}_{2}^{2}$
16:: Compute spatial consistency loss: $L_{s p a t i a l}$
17:: Compute total loss: $L_{t o t a l} = L_{l o c} + γ L_{s p a t i a l} + λ L_{r e g}$
18:: Step 5: Optimization
19:: Update network weights using Adam optimizer ▹ Optimize the network parameters
20:: Apply data augmentation techniques to enhance model generalization ▹ Random rotations, translations, and noise injections
21:: Step 6: Return Predicted Pose
22:: Return the predicted translation $\hat{t}$ and quaternion $\hat{q}$ ▹ Final predicted pose

4.6. Overall Training Objective

The proposed pipeline is trained in an end-to-end manner to jointly optimize all stages, ensuring that the learned features are not only effective for downstream pose estimation but also robust to various environmental challenges. The training objective integrates multiple loss functions designed for specific tasks within the pipeline, creating a unified framework for robust UAV localization. The total loss function is expressed as:

L_{total} = L_{e n h} + α L_{f e a t} + γ L_{l o c},

(29)

where

L_{enh}

represents the enhancement loss for preprocessing,

L_{f e a t}

is the feature-level learning loss,

L_{l o c}

is the pose estimation loss, and

α

and

γ

are weighting factors that balance the contributions of these terms. During training, the weighting factors

α

and

γ

are dynamically adjusted. In the early stages, greater emphasis is placed on

L_{e n h}

to stabilize the feature extraction process; as training progresses, the focus gradually shifts towards optimizing

L_{f e a t}

and

L_{l o c}

to refine the feature representations and improve pose estimation accuracy.

4.6.1. Image Enhancement Loss ( $L_{e n h}$ )

The image enhancement module is trained to improve the quality of input images, particularly in low-light or adverse conditions. The enhancement loss,

L_{e n h}

, is a combination of pixel-wise reconstruction loss and perceptual loss:

L_{e n h} = {∥ \hat{I} - I_{g t} ∥}_{2}^{2} + λ_{perc} L_{perc} (\hat{I}, I_{g t}),

(30)

where

I_{g t}

is the ground truth image,

\hat{I}

is the enhanced output, and

L_{perc}

is the perceptual loss that measures differences in high-level features extracted by a pre-trained VGG network. The perceptual loss helps preserve structural details and ensures that enhanced images are visually realistic.

4.6.2. Feature Learning Loss ( $L_{f e a t}$ )

To ensure that the extracted features are robust and discriminative, the feature extraction module is trained using a novel feature learning loss. This loss function, instead of the traditional triplet loss, combines a contrastive feature alignment loss and a self-supervised consistency loss:

L_{f e a t} = L_{a l i g n} + λ_{cons} L_{c o n s} .

(31)

Contrastive feature alignment loss(

L_{a l i g n}

): this loss enforces that features from similar poses are closer in the feature space while pushing apart features from distinct poses. Given two samples

(F_{i}, F_{j})

, the loss is defined as:

L_{a l i g n} = \{\begin{matrix} ∥ F_{i} - F_{j} ∥_{2}^{2}, & if y_{i j} = 1, \\ max (0, δ - ∥ F_{i} - F_{j} ∥_{2}^{2}), & if y_{i j} = 0, \end{matrix}

(32)

where

y_{i j} = 1

indicates that the two samples are similar (positive pair),

y_{i j} = 0

indicates dissimilarity (negative pair), and

δ

is the margin.

Self-supervised consistency loss(

L_{c o n s}

): this loss ensures consistency between features extracted from different augmentations of the same image, encouraging robustness to noise and transformations. Given an image,

I

, and its augmentations,

I^{'}

, the consistency loss is defined as:

L_{c o n s} = {∥ F - F^{'} ∥}_{2}^{2},

(33)

where

F

and

F^{'}

are the feature representations of

I

and

I^{'}

, respectively.

4.6.3. Pose Estimation Loss ( $L_{l o c}$ )

The pose estimation module predicts the translation vector,

\hat{t}

, and quaternion,

\hat{q}

, for UAV localization. The loss function for pose estimation combines direct regression loss with spatial consistency constraints:

L_{l o c} = ∥ \hat{t} {- t ∥}_{2}^{2} + β {∥ \hat{q} - q ∥}_{2}^{2} + γ_{spatial} L_{s p a t i a l} .

(34)

The translation loss penalizes the Euclidean distance between predicted and ground truth translation vectors. The orientation loss measures the squared error between predicted and ground truth quaternions, with an additional normalization step to ensure valid rotations. The spatial consistency loss,

L_{s p a t i a l}

, enforces geometric relationships between predicted poses, as detailed in Section 3.2.

4.6.4. Training Strategy and Optimization

The pipeline is optimized using stochastic gradient descent (SGD) with momentum, which helps navigate the loss landscape efficiently. The learning rate is managed using a cosine annealing scheduler to prevent overfitting and ensure stable convergence:

η_{t} = η_{\min} + \frac{1}{2} (η_{\max} - η_{\min}) (1 + cos (\frac{t}{T_{\max}} π)),

(35)

where

η_{t}

is the learning rate at epoch t,

η_{\max}

and

η_{\min}

are the maximum and minimum learning rates, and

T_{\max}

is the total number of epochs.

The weighting factors

α

and

γ

are dynamically adjusted during training to balance the learning rates of different modules. Early in training, higher emphasis is placed on the image enhancement loss,

L_{e n h}

, to stabilize feature extraction. As training progresses, the focus shifts to

L_{f e a t}

and

L_{l o c}

to refine feature representations and improve pose estimation accuracy.

To enhance generalization and prevent overfitting, we incorporate regularization techniques, including weight decay and dropout. Additionally, extensive data augmentation is applied at each stage of the pipeline:

Image-level augmentations: random brightness, contrast, Gaussian noise, and blurring are applied to simulate real-world lighting conditions.

Pose-level augmentations: random rotations and translations are added to the ground truth poses during training to account for UAV motion variability.

4.6.5. End-to-End Training Benefits

End-to-end training ensures that all modules in the pipeline are optimized jointly, allowing gradients to propagate across the entire network. This holistic approach aligns the objectives of image enhancement, feature learning, and pose estimation, resulting in a unified framework that is highly robust and efficient. The proposed training objective not only improves individual module performance but also enables seamless integration for real-world UAV localization tasks.

5. Experiment

5.1. Custom Dataset

In the context of urban air mobility (UAM), precise and reliable localization is critical for the safe and efficient operation of unmanned aerial vehicles (UAVs) in urban environments. To evaluate the performance of the proposed localization method, we collected a custom dataset from a university campus in Chengdu, China. The data collection was carried out at two distinct times of the day: noon at 12:00 p.m. and evening at 8:00 p.m. This was done to capture images under different lighting conditions, which are commonly encountered in real-world UAM scenarios.

The dataset consists of aerial images captured by the UAV from a top-down perspective over a 500 m × 500 m area. To ensure consistency and to investigate how different environmental conditions influence the localization, the UAV was equipped with a Hasselblad L1D-20c camera mounted on the DJI Mavic 2 Pro platform, featuring a 1-inch CMOS sensor with 20-megapixel resolution and adjustable aperture (f/2.8–f/11), capable of recording 4 K video at 30 fps. This high-performance imaging system enabled the capture of detailed environmental information under varying light conditions, which is critical for evaluating visual localization performance. These images represent the challenges faced by UAVs during flight in urban environments, where variations in lighting, weather, and other environmental factors can significantly impact localization performance. The dataset thus serves as a benchmark for evaluating UAV localization methods within the UAM framework, considering the unique challenges of urban flight paths and real-time decision-making requirements.

5.2. Evaluation Metrics

To assess the effectiveness of the proposed method within the UAM framework, we evaluate it using several key performance metrics that are critical for UAV operations in urban environments. First, the image illumination intensity is considered, as it directly affects the quality of feature extraction and the robustness of the localization process under varying lighting conditions. UAVs often operate in environments where lighting can vary drastically, especially between day and night, and the method’s ability to handle such variations is essential for reliable navigation.

Second, the signal-to-noise ratio (SNR) is evaluated to assess how well the method can handle image noise, which is inevitable in real-world environments. High SNR values are desirable, as they indicate better image quality, enabling more accurate feature detection and localization. The root mean square error (RMSE) is also used as a metric for overall localization accuracy, quantifying the deviation between the estimated positions of the UAV and the ground truth. In this context, the ground truth refers to the GPS data recorded by the onboard GNSS module of the DJI Mavic 2 Pro, which provides consumer-grade positioning accuracy under open-sky conditions. Although not a geodetic-grade GPS receiver, it serves as a practical reference for evaluating relative localization performance in real-world urban scenarios. A lower RMSE indicates a more accurate localization system, which is crucial for UAVs operating in complex urban settings where precision is essential for safe navigation.

In addition to these, the maximum error and standard deviation are computed to understand the worst-case performance and the consistency of the system under different environmental conditions. These metrics provide insight into the reliability of the localization system, ensuring that it performs well not only under ideal conditions but also when facing challenging real-world scenarios, such as in dense urban areas with obstacles or fluctuating signal quality. The use of these evaluation metrics allows for a comprehensive assessment of the proposed method’s ability to maintain accurate and stable localization performance in the demanding context of UAM.

5.3. Signal-to-Noise Ratio Analysis

The signal-to-noise ratio (SNR) serves as a key indicator for evaluating the quality of images captured under varying lighting conditions in urban air mobility (UAM) scenarios. Differences in SNR between noon and evening provide insight into how environmental lighting affects image clarity, which in turn impacts downstream UAV localization performance.

To analyze the robustness of the image acquisition process, we measured the SNR of the images in our custom dataset under two lighting conditions: noon (12:00 p.m.) and evening (8:00 p.m.). Table 1 presents the average SNR values computed across the dataset. The results demonstrate the significant degradation in image quality under evening conditions due to reduced ambient light and increased noise.

The analysis reveals that the average SNR during noon conditions is significantly higher, reaching 30.4 dB, compared with 18.7 dB in the evening. This stark contrast underscores the challenges posed by low-light environments, where reduced illumination results in lower image quality and higher noise levels. These findings align with the real-world conditions UAVs encounter during urban operations, where lighting variability can critically influence the reliability of visual localization systems.

To mitigate the effects of low SNR in evening conditions, the proposed system incorporates a low-light image enhancement module that dynamically adjusts image illumination while suppressing noise. This enhancement is crucial for ensuring that the downstream localization model receives high-quality input, thereby maintaining accuracy and stability, even in challenging scenarios. The significant drop in SNR between noon and evening highlights the importance of addressing lighting variations in UAM applications, as robust image quality directly translates to improved localization performance and operational safety.

5.4. Comparison of Visual Localization Algorithms

To validate the effectiveness of the proposed localization method, we compared its performance with three baseline algorithms: ORB, VGG, and RepVGG. These algorithms represent traditional feature-based approaches and modern deep learning-based architectures. The comparison was conducted under two lighting conditions, noon and evening, to assess the robustness of each method in varying environments. The performance is quantified using the root mean square error (RMSE) in meters, and the results are summarized in Table 2.

Figure 6 compares the localization errors between GroundTruth, RepVGG, and our method under both noon and evening conditions. The visual error estimation clearly illustrates the superior performance of our method across both lighting scenarios.

These results underscore the importance of integrating advanced feature extraction techniques and low-light enhancement modules, as employed in the proposed method, to ensure reliable UAV localization performance in diverse urban air mobility environments. By leveraging both robust feature representation and adaptive enhancement, the proposed method consistently outperforms baseline approaches under varying lighting conditions.

6. Ablation Studies

This section presents ablation studies to evaluate the contributions of key components within the proposed localization framework. The focus is on analyzing the roles of the feature extraction model and the matching model. By incrementally removing or simplifying certain components, we quantify their impact on the overall localization performance.

6.1. Feature Extraction Model

To assess the importance of the feature extraction module, we performed experiments by systematically removing key elements within the feature extraction pipeline, including the multi-scale feature encoding layer and the pyramid-based hierarchical structure. The results are reported in Table 3, using the root mean square error (RMSE) as the evaluation metric under both noon and evening conditions.

The results highlight that both the multi-scale encoding and the pyramid-based hierarchical structure are critical for improving localization accuracy. Removing the multi-scale encoding leads to a noticeable increase in the RMSE, particularly under evening conditions, where complex lighting requires more robust feature representations. Similarly, removing the hierarchical structure results in further degradation, emphasizing its role in capturing fine-grained spatial details.

6.2. Matching Model

The impact of the matching model was analyzed by simplifying the proposed similarity matrix-based matching mechanism. Two ablation settings were considered: removing the similarity weighting function and replacing the dynamic sliding window with a fixed-size window. Table 4 presents the results.

The results demonstrate that the similarity weighting function significantly improves matching accuracy by prioritizing high-confidence feature correspondences, especially under low-light evening conditions. Furthermore, the dynamic sliding window contributes to robust performance by adapting to variations in feature distributions. The baseline configuration, which uses a fixed window and no weighting, results in the highest RMSE, indicating its limited capability to handle real-world challenges.

7. Conclusions

This paper presents a novel lightweight deep learning framework for drone visual localization, specifically designed for urban air mobility (UAM) scenarios. The proposed framework integrates a state space module for robust feature extraction, a Gaussian pyramid-based triplet network for effective representation learning, and a dynamic similarity-based matching strategy for accurate localization. By addressing challenges such as variations in lighting conditions, limited computational resources, and real-time processing requirements, the proposed approach demonstrates significant advancements in both localization accuracy and system efficiency.

Extensive experiments were conducted on a custom dataset collected under diverse lighting conditions to evaluate the framework’s robustness. The results highlight the effectiveness of the proposed method in outperforming traditional and state-of-the-art approaches, achieving lower RMSE values, and demonstrating superior adaptability to environmental changes. In total, over 1000 aerial images were captured across different times of day and weather conditions, ensuring a comprehensive evaluation of the framework’s performance in real-world urban environments. Ablation studies further confirm the critical contributions of key components, including the multi-scale encoding in the feature extraction module and the dynamic sliding window in the matching model, to the overall system performance.

Despite these achievements, certain limitations remain. The framework’s reliance on pre-collected environmental data raises concerns about offline map timeliness in rapidly changing urban environments. Additionally, its application in highly cluttered or GPS-denied areas requires further exploration to enhance generalization and robustness. Future work will focus on addressing these challenges by incorporating adaptive map updates and extending the framework to multi-modal localization systems that leverage additional sensory inputs, such as LiDAR or IMU data.

Notably, the proposed method directly addresses the practical limitations raised in the Introduction, including poor visibility during nighttime operations and severe image degradation in disaster-affected environments. In simulated low-light rescue missions and twilight navigation scenarios, conventional CNN-based or feature-matching methods often failed due to missing texture and low contrast. In contrast, LumiLoc maintained robust localization by leveraging its state space representation and adaptive multi-scale feature extraction, which preserved structural integrity under adverse conditions. The study also conducted field tests on a company’s drone simulation platform, further verifying the effectiveness and reliability of LumiLoc’s deployment in various real-world urban scenarios.

In conclusion, the proposed framework represents a promising solution for UAM-specific drone localization tasks, paving the way for safer and more efficient drone navigation in complex urban environments. Its lightweight design, combined with high accuracy and real-time processing capabilities, establishes a strong foundation for future advancements in UAM technologies and their deployment in practical scenarios.

Author Contributions

Conceptualization, R.Q.; supervision, R.Q.; funding acquisition, R.Q.; methodology, Z.W.; software, Z.W.; writing—original draft preparation, Z.W.; investigation, Y.L.; resources, Y.L.; project administration, C.L.; writing—review and editing, C.L.; data curation, H.J.; writing—review and editing, C.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (NSFC)—Joint Fund of Civil Aviation Research (No. U2333214), the Civil Aviation Administration of China Safety Capacity Building Project (No. MHAQ2024033), Open Project of National Key Laboratory of Industrial Control Technology (No. ICT2024B45), Sichuan Flight Engineering Technology Research Center Project (No. GY2024-11C), and the Graduate Innovation Fund of the Fundamental Research Funds for the Central Universities for the year 2024 (No. 24CAFUC10188).

Data Availability Statement

The dataset in this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Huang, L.; Wang, Z.; Xiong, Q.; Qu, R.; Yao, C.; Li, C. Mamba-VNPS: A Visual Navigation and Positioning System with State-Selection Space. Drones 2024, 8, 663. [Google Scholar] [CrossRef]
Couturier, A.; Akhloufi, M.A. A review on absolute visual localization for UAV. Robot. Auton. Syst. 2021, 135, 103666. [Google Scholar] [CrossRef]
Dukkanci, O.; Campbell, J.F.; Kara, B.Y. Facility location decisions for drone delivery: A literature review. Eur. J. Oper. Res. 2024, 316, 397–418. [Google Scholar] [CrossRef]
Costanzo, A.; Loscri, V. Visible light indoor positioning in a noise-aware environment. In Proceedings of the 2019 IEEE Wireless Communications and Networking Conference (WCNC), Marrakesh, Morocco, 15–18 April 2019; pp. 1–6. [Google Scholar]
Partsinevelos, P.; Chatziparaschis, D.; Trigkakis, D.; Tripolitsiotis, A. A novel UAV-assisted positioning system for GNSS-denied environments. Remote Sens. 2020, 12, 1080. [Google Scholar] [CrossRef]
Wang, C.; Deng, D.; Xu, L.; Wang, W. Resource scheduling based on deep reinforcement learning in UAV assisted emergency communication networks. IEEE Trans. Commun. 2022, 70, 3834–3848. [Google Scholar] [CrossRef]
Ding, R.; Gao, F.; Shen, X.S. 3D UAV trajectory design and frequency band allocation for energy-efficient and fair communication: A deep reinforcement learning approach. IEEE Trans. Wirel. Commun. 2020, 19, 7796–7809. [Google Scholar] [CrossRef]
Wu, W.; Chang, T.; Li, X.; Yin, Q.; Hu, Y. Vision-language navigation: A survey and taxonomy. Neural Comput. Appl. 2024, 36, 3291–3316. [Google Scholar] [CrossRef]
Liu, Z.; Cao, Y.; Gao, P.; Hua, X.; Zhang, D.; Jiang, T. Multi-UAV network assisted intelligent edge computing: Challenges and opportunities. China Commun. 2022, 19, 258–278. [Google Scholar] [CrossRef]
McEnroe, P.; Wang, S.; Liyanage, M. A survey on the convergence of edge computing and AI for UAVs: Opportunities and challenges. IEEE Internet Things J. 2022, 9, 15435–15459. [Google Scholar] [CrossRef]
Shi, H.; Chen, J.; Zhang, F.; Liu, M.; Zhou, M. Achieving Robust Learning Outcomes in Autonomous Driving with DynamicNoise Integration in Deep Reinforcement Learning. Drones 2024, 8, 470. [Google Scholar] [CrossRef]
Lee, J.H.; Gwon, G.H.; Kim, I.H.; Jung, H.J. A Motion Deblurring Network for Enhancing UAV Image Quality in Bridge Inspection. Drones 2023, 7, 657. [Google Scholar] [CrossRef]
Chan, E.; Baumann, O.; Bellgrove, M.A.; Mattingley, J.B. From objects to landmarks: The function of visual location information in spatial navigation. Front. Psychol. 2012, 3, 304. [Google Scholar] [CrossRef]
Ghali, R.; Akhloufi, M.A.; Mseddi, W.S. Deep learning and transformer approaches for UAV-based wildfire detection and segmentation. Sensors 2022, 22, 1977. [Google Scholar] [CrossRef]
Ye, S.; Wan, Z.; Zeng, L.; Li, C.; Zhang, Y. A vision-based navigation method for eVTOL final approach in urban air mobility (UAM). In Proceedings of the 2020 4th CAA International Conference on Vehicular Control and Intelligence (CVCI), Hangzhou, China, 18–20 December 2020; pp. 645–649. [Google Scholar]
Hill, B.P.; DeCarme, D.; Metcalfe, M.; Griffin, C.; Wiggins, S.; Metts, C.; Bastedo, B.; Patterson, M.D.; Mendonca, N.L. Uam Vision Concept of Operations (Conops) Uam Maturity Level (uml) 4. 2020. Available online: https://ntrs.nasa.gov/citations/20205011091 (accessed on 7 May 2025).
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
Reedha, R.; Dericquebourg, E.; Canals, R.; Hafiane, A. Transformer neural network for weed and crop classification of high resolution UAV images. Remote Sens. 2022, 14, 592. [Google Scholar] [CrossRef]
Mutlag, W.K.; Ali, S.K.; Aydam, Z.M.; Taher, B.H. Feature extraction methods: A review. J. Phys. Conf. Ser. 2020, 1591, 012028. [Google Scholar] [CrossRef]
Zhao, Z.; Wang, J.; Zhao, X.; Peng, C.; Guo, Q.; Wu, B. NaviLight: Indoor localization and navigation under arbitrary lights. In Proceedings of the IEEE INFOCOM 2017-IEEE Conference on Computer Communications, Atlanta, GA, USA, 1–4 May 2017; pp. 1–9. [Google Scholar]
Hays, R.T.; Singer, M.J. Simulation Fidelity in Training System Design: Bridging the Gap Between Reality and Training; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
Liu, T.; Sadler, C.M.; Zhang, P.; Martonosi, M. Implementing software on resource-constrained mobile sensors: Experiences with Impala and ZebraNet. In Proceedings of the 2nd International Conference on Mobile Systems, Applications, and Services, Boston, MA, USA, 6–9 June 2004; pp. 256–269. [Google Scholar]
Friess, C.; Niculescu, V.; Polonelli, T.; Magno, M.; Benini, L. Fully Onboard SLAM for Distributed Mapping with a Swarm of Nano-Drones. IEEE Internet Things J. 2024, 11, 32363–32380. [Google Scholar] [CrossRef]
Chen, J.; Wang, Y.; Hou, P.; Chen, X.; Shao, Y. Dark-SLAM: A Robust Visual Simultaneous Localization and Mapping Pipeline for an Unmanned Driving Vehicle in a Dark Night Environment. Drones 2024, 8, 390. [Google Scholar] [CrossRef]
Wang, Q.; Chi, Y.; Shen, T.; Song, J.; Zhang, Z.; Zhu, Y. Improving RGB-infrared object detection by reducing cross-modality redundancy. Remote Sens. 2022, 14, 2020. [Google Scholar] [CrossRef]
Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. Enlightengan: Deep light enhancement without paired supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349. [Google Scholar] [CrossRef] [PubMed]
Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1780–1789. [Google Scholar]
Ma, L.; Ma, T.; Liu, R.; Fan, X.; Luo, Z. Toward fast, flexible, and robust low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5637–5646. [Google Scholar]
Panboonyuen, T.; Jitkajornwanich, K.; Lawawirojwong, S.; Srestasathiern, P.; Vateekul, P. Transformer-based decoder designs for semantic segmentation on remotely sensed images. Remote Sens. 2021, 13, 5100. [Google Scholar] [CrossRef]
Yarovoi, A.; Cho, Y.K. Review of simultaneous localization and mapping (SLAM) for construction robotics applications. Autom. Constr. 2024, 162, 105344. [Google Scholar] [CrossRef]
Chen, Y.; Gu, X.; Liu, Z.; Liang, J. A fast inference vision transformer for automatic pavement image classification and its visual interpretation method. Remote Sens. 2022, 14, 1877. [Google Scholar] [CrossRef]
Yuan, M.; Ren, D.; Feng, Q.; Wang, Z.; Dong, Y.; Lu, F.; Wu, X. MCAFNet: A multiscale channel attention fusion network for semantic segmentation of remote sensing images. Remote Sens. 2023, 15, 361. [Google Scholar] [CrossRef]
Zeng, G.; Wu, Z.; Xu, L.; Liang, Y. Efficient Vision Transformer YOLOv5 for Accurate and Fast Traffic Sign Detection. Electronics 2024, 13, 880. [Google Scholar] [CrossRef]
Goforth, H.; Lucey, S. GPS-denied UAV localization using pre-existing satellite imagery. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 2974–2980. [Google Scholar]
Huang, R.; Huang, Z.; Su, S. A Faster, lighter and stronger deep learning-based approach for place recognition. In Proceedings of the CCF Conference on Computer Supported Cooperative Work and Social Computing, Taiyuan, China, 25–27 November 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 453–463. [Google Scholar]
Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742. [Google Scholar]
Qin, T.; Li, P.; Shen, S. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
Engel, J.; Koltun, V.; Cremers, D. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625. [Google Scholar] [CrossRef]
Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid attention network for semantic segmentation. arXiv 2018, arXiv:1805.10180. [Google Scholar]

Figure 1. Overview of the LumiLoc framework, highlighting the three main modules: low-light enhancement, feature extraction, and visual localization.

Figure 2. Qualitative comparison of low-light image enhancement. From up to down: input low-light image, enhanced image using our method, and ground truth. Our method significantly improves global illumination while preserving local textures.

Figure 3. Attention mechanism architecture. Multi-scale features are processed independently with attention maps and aggregated to reconstruct the enhanced image.

Figure 4. Architecture of the EffiVGG feature extraction module. The multi-branch structure facilitates diverse feature learning during training, while kernel fusion simplifies the architecture for efficient inference. Attention mechanisms are incorporated into selected branches to refine feature maps dynamically.

Figure 5. Training loss curve for the CLDP method. The steady decline in loss over epochs illustrates the stability and effectiveness of the training process.

Figure 6. Comparison of localization estimation errors (GroundTruth, RepVGG, ours (evening), ours (noon)) under varying lighting conditions. The proposed method achieves lower errors in both noon and evening scenarios, demonstrating robustness to environmental changes.

Table 1. Comparison of signal-to-noise ratio (SNR) under different lighting conditions.

Lighting Condition	Average SNR (dB)
Noon (12:00 p.m.)	30.4
Evening (8:00 p.m.)	18.7

Table 2. Comparison of localization performance (RMSE in meters) under noon and evening conditions.

Algorithm	Noon	Evening
ORB-SLAM2	8.23	14.75
VGG	6.12	12.48
RepVGG	5.45	10.92
Ours	4.32	8.65

Table 3. Ablation study: feature extraction model (RMSE in meters).

Configuration	Noon	Evening
Full Feature Extraction Model	4.32	8.65
w/o Multi-Scale Encoding	5.74	10.21
w/o Pyramid Hierarchical Structure	6.32	11.48
Baseline (Single-Scale Features Only)	7.85	13.24

Table 4. Ablation study: matching model (RMSE in meters).

Configuration	Noon	Evening
Full Matching Model	4.32	8.65
w/o Similarity Weighting Function	5.21	9.74
w/o Dynamic Sliding Window	5.87	10.53
Baseline (Fixed Window + No Weighting)	7.02	12.36

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Qu, R.; Wang, Z.; Liu, Y.; Li, C.; Jiang, H.; Fang, C. LumiLoc: A Low-Light-Optimized Visual Localization Framework for Autonomous Drones. Aerospace 2025, 12, 454. https://doi.org/10.3390/aerospace12060454

AMA Style

Qu R, Wang Z, Liu Y, Li C, Jiang H, Fang C. LumiLoc: A Low-Light-Optimized Visual Localization Framework for Autonomous Drones. Aerospace. 2025; 12(6):454. https://doi.org/10.3390/aerospace12060454

Chicago/Turabian Style

Qu, Ruokun, Zhiyuan Wang, Yelu Liu, Chenglong Li, Hui Jiang, and Chen Fang. 2025. "LumiLoc: A Low-Light-Optimized Visual Localization Framework for Autonomous Drones" Aerospace 12, no. 6: 454. https://doi.org/10.3390/aerospace12060454

APA Style

Qu, R., Wang, Z., Liu, Y., Li, C., Jiang, H., & Fang, C. (2025). LumiLoc: A Low-Light-Optimized Visual Localization Framework for Autonomous Drones. Aerospace, 12(6), 454. https://doi.org/10.3390/aerospace12060454

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

LumiLoc: A Low-Light-Optimized Visual Localization Framework for Autonomous Drones

Abstract

1. Introduction

2. Related Work

2.1. Visual Localization for UAVs

2.2. Low-Light Image Enhancement Techniques

2.3. Localization Under Adverse Conditions

3. Problem Formulation

3.1. Problem Definition

3.2. Challenges in Low-Light Visual Localization

3.3. Optimization Objective

3.3.1. Low-Light Enhancement Loss

3.3.2. Pose Estimation Loss

3.4. Pipeline Overview

4. Methods

4.1. Overview of LumiLoc Framework

4.2. Low-Light Image Enhancement

4.3. Robust Feature Extraction

4.4. Contrastive Learning with Dynamic Prototypes (CLDP)

4.5. Hybrid Geometric Learning for Pose Estimation

4.6. Overall Training Objective

4.6.1. Image Enhancement Loss ( L e n h )

4.6.2. Feature Learning Loss ( L f e a t )

4.6.3. Pose Estimation Loss ( L l o c )

4.6.4. Training Strategy and Optimization

4.6.5. End-to-End Training Benefits

5. Experiment

5.1. Custom Dataset

5.2. Evaluation Metrics

5.3. Signal-to-Noise Ratio Analysis

5.4. Comparison of Visual Localization Algorithms

6. Ablation Studies

6.1. Feature Extraction Model

6.2. Matching Model

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

4.6.1. Image Enhancement Loss ( $L_{e n h}$ )

4.6.2. Feature Learning Loss ( $L_{f e a t}$ )

4.6.3. Pose Estimation Loss ( $L_{l o c}$ )