AGCNeRF: Air–Ground Collaborative Visual Mapping and Navigation via Landmark-Enhanced Neural Radiance Fields

Lu, Chenxi; Yu, Meng; Wang, Yin; Li, Hua

doi:10.3390/drones10030171

Open AccessArticle

AGCNeRF: Air–Ground Collaborative Visual Mapping and Navigation via Landmark-Enhanced Neural Radiance Fields

by

Chenxi Lu

¹

,

Meng Yu

^1,*,

Yin Wang

¹

and

Hua Li

²

¹

College of Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China

²

Ningbo Key Laboratory of High Energy Density Battery, Yuyao Innovation Institute, Zhejiang Wanli University, Ningbo 315100, China

^*

Author to whom correspondence should be addressed.

Drones 2026, 10(3), 171; https://doi.org/10.3390/drones10030171

Submission received: 22 January 2026 / Revised: 15 February 2026 / Accepted: 20 February 2026 / Published: 28 February 2026

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

Proposes AGC-NeRF, an air–ground collaborative framework that utilizes UAV-generated Neural Radiance Field (NeRF) maps and autonomous saliency-based landmark selection to guide UGV navigation.
Develops a landmark-aware state estimator for ground vehicles that integrates a modified Charbonnier loss function to improve robustness against illumination changes and texture scarcity.

What are the implications of the main findings?

Demonstrates that referencing aerial NeRF priors significantly reduces UGV pose estimation errors, outperforming traditional LiDAR-based SLAM methods (A-LOAM) and existing visual navigation baselines.
Provides a lightweight, vision-only solution for autonomous navigation in high-risk, GPS-denied environments, eliminating the dependency on heavy sensors for search and rescue missions.

Abstract

Unmanned vehicles are becoming increasingly essential in executing high-risk missions in unknown environments such as search and rescue. As the complexity of operational environments escalates, carrying out unmanned tasks becomes cumbersome or even infeasible for a single vehicle, hampered by limited perception and operational constraints. Aiming at enhancing the flexibility of unmanned operations under complicated scenarios, this study introduces AGC-NeRF, an innovative air–ground collaborative exploration framework that harnesses the functional complementarity of UAVs and UGVs—enabling a UGV to navigate through a complex scenario with the assistance of a UAV via referencing a neural radiance map. First, a UAV is employed to collect aerial images for reconstructing the environment to be explored by a UGV, leveraging its aerial perspective to achieve wide-area coverage and global environmental perception that is unattainable for a single UGV. Concurrently, an innovative image saliency evaluation approach is introduced to meticulously select landmarks that are contributive to the UGV’s navigation system, yielding a pre-trained NeRF model of the operation scene. Then, a landmark-aware 6-DOF ego-motion estimator and collision-free trajectory optimizer are designed for the UGV based on the NeRF map. Finally, an online replanning architecture is established which relies on a ground station for NeRF training and state optimization by synergizing the trajectory planner and the state estimator, which forms a dual-agent vision-only navigation pipeline. Simulations and experiments validate that AGC-NeRF enables reliable UGV trajectory planning and state estimation in unknown environments, demonstrating superior efficacy and robustness of the air–ground collaborative paradigm.

Keywords:

air–ground collaboration; saliency evaluation; visual navigation; neural radiance field

1. Introduction

With the increasing demand for automation, unmanned intelligence has placed Unmanned Aerial Vehicles (UAVs) [1] and Unmanned Ground Vehicles (UGVs) [2] at the forefront of high-risk mission execution, including search and rescue or surveillance in environments that pose potential threats to human life. This has significantly accelerated the application research for UAV and UGV autonomous systems [3,4]. However, the escalating complexity of exploration tasks has outstripped the capabilities of an individual agent, mostly due to the limitations of their inherent characteristics. For example, although UAVs are terrain-independent and capable of aero-exploration, their battery life and limited payloads pose great challenges to their long-term operation. On the other hand, UGVs have significantly greater payload and endurance capacity, yet their operations are heavily terrain-dependent and often hindered by obstacles or narrow terrain conditions. As a result, the complementary characteristic of UGVs’ and UAVs’ functions have fostered their cooperation on unmanned operation missions, leading to the emergence of air–ground collaboration research for multi-robot systems [5,6].

Utilizing onboard sensors for environment interpretation and trajectory planning is the backbone of air–ground collaborative navigation systems [7]. A collaborative navigation system usually leverages the perceived information acquired by heterogeneous sensors mounted onboard the agents, e.g., inertial measurement unit (IMU), camera, or LiDAR. In this work, we focus on employing cameras to navigate both a UAV and a UGV in a mutually assisting manner. The main reason for utilizing only cameras is their light weight, which is well-suited for UAV operations. The visual representation of the environment can be polymorphous, such as RGB/gray image (monocular camera), voxel grid/depth image (binocular camera), or Signed Distance Field (SDF). Recently, research on deep learning-based 3D implicit geometric representation known as Neural Radiance Fields (NeRFs) [8] has attracted considerable interest due to its proficiency in reconstructing and rendering photorealistic models. Currently, a substantial number of studies have demonstrated the advantages of NeRFs in the field of robotic vision. In contrast to conventional explicit geometric representation methods (e.g., voxels, meshes, and point clouds), NeRFs are trained directly on dense multi-view image streams, which omits the traditional feature extraction and matching procedures [9,10]. NeRFs represent the geometric form of objects in a scene via continuous volume density, which is highly suitable for robotic trajectory planning [11]. This further motivates this study, in which we integrated the NeRF representation into an air–ground cooperative navigation system, harnessing its volume density information to generate dynamically viable and collision-free trajectories for unmanned vehicles.

The navigation of UGVs requires a priori global map information to carry out path planning, where algorithms like RRT [12] or A* [13] can be employed to generate several paths for the UGV to follow. However, as the exploration environment becomes increasingly intricate and complex, factors like topographical undulation, sensory biases, and the loss of GPS/GNSS signals might mislead the UGV into deviated paths. To this end, a state estimator can be employed to track the UGV pose with respect to pre-selected landmarks. The relative pose estimation can be further used to plan a local vehicle path according to the newly perceived scene information. Some studies [14,15] suggest to deploy several artificial landmarks for better UGV localization, yet in complicated environments, manually deploying landmarks can be labor-intensive and hazardous to humans. Arbanas et al. [16] propose to utilize real-time communication between UGVs and UAVs to fuse the sensory data and achieve high-precision mutual localization. However, the visual data after fusion has not been employed to guide the UGV to correct its planned path, which, to a certain extent, poses a potential threat to the safety of unmanned systems.

Taking into account the aforementioned challenges, our aim is to address the problem of collaborative visual navigation for an air–ground dual-agent system in hazardous off-road environments. The proposed method starts with reconstructing the environmental map using sequential aerial vision under the NeRF architecture. During the UAV’s exploratory flight, a visual saliency evaluation method is introduced to autonomously select landmarks that provide reliable navigational cues tailored to the UGV’s navigation system. Notably, the UAV constructs the NeRF map and shares this dynamically updated NeRF representation, along with navigation-friendly landmarks, with the UGV. Equipped with this a priori NeRF map and landmarks acquired by the UAV, the UGV first performs global path planning and then estimates its current state by referencing landmarks in real time during traversal, thereby enabling online trajectory replanning. Such mechanism ensures the UGV’s capacity of flexible collision-free navigation with high accuracy. In summary, our main contributions are as follows:

We propose AGC-NeRF, a novel air–ground collaborative framework, which includes a method for the best viewpoint selection. This method enables the UAV to autonomously identify the best observation viewpoints and capture environmental images, providing a prior NeRF map of the unknown hazardous environment for the UGV’s ground navigation.
An innovative automatic navigation landmark selection strategy based on image saliency evaluation algorithms is designed to identify navigational landmarks for guiding the UGV’s planning module.
A comprehensive ground navigation framework consisting of a trajectory planner and a landmark-driven state estimator in a NeRF world is presented, realizing high-precision state estimation and collision-free navigation for the UGV.

To explicitly clarify the position of AGC-NeRF, we compare it with existing frameworks in Table 1. Unlike single-agent methods (e.g., iNeRF [17], NeRF-Navigation [18]) that passively use static maps, AGC-NeRF utilizes a heterogeneous air–ground loop where the global prior is actively constructed and filtered via a saliency-based landmark selection mechanism. This active attention mechanism addresses the robustness issues of standard photometric optimization in texture-poor regions. Furthermore, compared to map-free cooperative localization which relies on sparse relative sensing, our method provides dense, photorealistic obstacle perception and global consistency. While this incurs higher computational costs, it offers the fine-grained environmental context indispensable for safe navigation in hazardous, unstructured environments.

The rest of this paper is organized as follows: Section 2 briefly introduces related work. Section 3 formulates the problem to be addressed. Section 4 and Section 5 present the detailed procedure of the proposed air–ground collaborative framework. Section 6 presents the simulation and experiment results and some necessary discussion. Finally, conclusions and future works are summarized in Section 7.

2. Related Work

2.1. UGV–UAV Collaborative System

There is currently a wealth of research on multi-robot collaborative systems. In the study proposed by Ropero et al. [6], a collaborative exploration scenario of planetary surfaces by a UAV and a UGV was devised, the purpose of which is minimizing the total travel distance to several pre-determined waypoints. Simulation results show that due to the inherent limitations of the UGV, it is difficult to visit all target spots, especially if the target locates at a narrowed tunnel or hazardous region. Consequently, a collaborative navigation solution was suggested wherein the UGV functions as a mobile charging station to provide sustained operation for the UAV, while the UAV is responsible for visiting all the ground-inaccessible targets. a’Xing et al. [19] proposed an efficient UAV–UGV cooperative system, which focuses on map construction and path planning. The objective is to enhance the precision in recognizing important targets while avoiding obstacles. Furthermore, this study aims to devise the shortest and most seamless path for the UGV by employing an optimized A* algorithm. Liang et al. proposed a ground station design for UAV/UGV collaboration [10], in which the ground station features a three-dimensional path planning method using an enhanced Ant Colony and Gray Wolf Optimization algorithm given a digital map module. The enhanced Ant Colony exhibits the shortest path length under an improved grid configurations mechanism, which saves 30% calculation time and energy compared to the original method, experimental results demonstrate the real-time performance and accuracy. Kim et al. [20] introduced an innovative air–ground cooperative mapping framework that enhances the path planning capabilities for UGVs by referencing an a priori 3D point-cloud map that is constructed by UAVs. Meanwhile, the proposed method is capable of removing redundant sensory data in an autonomous manner, which reduces the expense of data exchange between the UAV and UGV. Nevertheless, the entire system is still highly dependent on the quality of communication. Recent advancements have proposed solutions to these challenges. For instance, [21,22] demonstrated real-time visual sensing frameworks integrating onboard deep inference and edge-aware data transmission. These works highlight the importance of balancing high-fidelity reconstruction with computational constraints and provide valuable insights for future large-scale environment monitoring.

2.2. 3D Scene Representations

The fidelity of environmental map representation plays an important role in unmanned vehicle navigation. Many NeRF-driven navigation studies [17,18,23] have shown that a neural implicit representation is capable of presenting detailed map textures that are beneficial for robotic perception or path planning. Neural implicit representations utilize a neural network to outline the three-dimensional geometry of complex objects, including their color, texture, and other information. In essence, the 3D implicit representation employs a set of multi-view images containing 5D coordinates

(x, y, z, θ, φ)

and learns a function in the form of

f_{Θ} (p) = σ

, where

f

is a function of the parameters

Θ

of the neural network,

p

is a low-dimensional query point such as an

(x, y, z)

coordinate, and

σ

is the quantity of interest (usually scalar).

With the advancement of differentiable volume rendering techniques, numerous NeRF variants have been introduced to improve computational efficiency. For example, NSVF [24] significantly improves the training and rendering efficiency by explicitly encoding scenes in a sparse voxel grid with online pruning. Plenoxels [25] optimizes voxel features without neural network initialization, wherein the training process is accelerated using an ad hoc designed CUDA optimizer. Instant-NGP [26] implicitly encodes voxel features into a multi-resolution hash grid. TensoRF [27] decomposes 3D voxels into low-rank vector-matrix multiplications, achieving a training speed comparable to Instant-NGP even without CUDA implementation.

Recently, 3D Gaussian Splatting (3DGS) has emerged as a promising explicit representation, enabling photorealistic rendering by optimizing anisotropic Gaussian primitives [14,28,29]. While 3DGS offers superior convergence and rendering speed, this work focuses on autonomous UAV–UGV navigation tasks, where globally consistent geometry and fine structural cues are critical for robust localization and planning under noisy and dynamic observations. Therefore, a NeRF-based representation is adopted because the inherent volumetric continuity and regularization properties of NeRF better align with the high-precision environmental modeling objectives in this paper.

2.3. Visual Navigation

The classical approaches to visual localization and Simultaneous Localization and Mapping (SLAM) in robotics typically employ a multi-level paradigm, where sparse representations, such as image key points, are used for feature tracking and pose estimation [14,30,31]. In addition, dense representations can also be employed to represent the 3D environment. The back-end of classical localization methods often relies on state estimation techniques, including maximum a posteriori (MAP) estimation [32], Kalman filters [33], particle filters [11], and grid-based histogram filters [34]. These techniques can work in conjunction with feature recognition and tracking for real-time vehicle pose estimation and sparse mapping [35].

Recently, research has been conducted on integrating localization and mapping with neural implicit representation for better scene understanding. Sucar et al. [11] addressed the mapping and online NeRF 3D reconstruction from visual data, demonstrating a competitive accuracy compared to the traditional SLAM pipelines. Zhu et al. [23] introduced NICE-SLAM, an extension of iMAP, it utilizes the idea of MLP-based scene representation to larger room environments. Both iMAP and NICE-SLAM utilize depth information from stereo cameras. In the case of RGB-only information, Yen et al. [17] developed iNeRF. Utilizing a pre-trained NeRF model and an initial pose estimate, iNeRF optimizes the 6-DoF vehicle pose via iteratively backpropagating the gradient of image residual with respect to the vehicle pose. Similarly, Adamkiewicz et al. [18] proposed NeRF-Navigation, in which the loss functions are optimized in a manner similar to iNeRF, but with more efficient modification. Maggio et al. [36] introduced a localization method that combines Monte Carlo localization with NeRF, achieving plausible localization results even without initial pose guesses.

3. Problem Formulation

This paper proposes an air–ground collaborative framework for a UGV’s visual navigation in a neural radiance (NeRF) world. A NeRF (

N : ℝ^{3} \times ℝ^{2} \mapsto ℝ^{3} \times ℝ_{+}

) maps a 5D coordinate including a 3D location

p = (x, y, z)

and view direction

(θ, φ)

to an emitted color

c = (r, g, b)

and volume density

σ

. The volume density

σ

is solely dependent on the position, we define

σ (p)

as the volume density at location

p

, and

σ (p)

is also the output of the NeRF model estimated at

p

. Similarly, we define

C_{i}

as the color rendered by the NeRF model for pixel

i

under the camera pose

T \in S E (3)

, where

S E

represents the Special Euclidean group.

We consider the problem of a UGV, equipped with a monocular camera, that seeks to navigate through an unknown hazardous environment with the assistant of a UAV that is also equipped with a monocular camera. The workflow of the UGV–UAV cooperative navigation architecture is illustrated in Figure 1.

Before detailing the framework, we clarify the operational assumptions and the scope of this study. The primary objective of AGC-NeRF is to validate the methodological feasibility of utilizing NeRF for fine-grained 3D reconstruction and implicit obstacle perception in unknown environments. To this end, we assume the availability of a ground station equipped with high-performance computing units (e.g., discrete GPUs) to handle the computationally intensive NeRF training and state optimization tasks. Consequently, strictly limited onboard energy consumption and UAV flight endurance are considered constraints to be optimized in future engineering phases rather than immediate bottlenecks for this proof-of-concept exploration. This setup allows us to focus on demonstrating the superior mapping quality and navigation accuracy of the proposed air–ground collaborative paradigm. The overall workflow consists of two threads as shown in Figure 1, which are as follows:

Aerial Mapping: A UAV first pre-explores the environment, collecting multi-view images for NeRF reconstruction of the entire scene. During the UAV’s flight time, the UAV evaluates the image saliency via its on-board camera to select landmarks suitable for the following UGV’s navigation and stores several images in the vicinity of landmarks for pruning their NeRF models.

Ground Navigation: The UGV seeks to plan a collision-free path from the initial state

Χ_{0}

to the target state

Χ_{f}

. During this process, the UGV evaluates collision probability and estimates ego-motion via referencing the NeRF models of landmarks established during aerial mapping.

Similar to [18], we approximate the UGV’s body with a finite collection of points

B

at which the collision is checked. Typically, this involves a 3D grid of points that represent a bounding box of the UGV; however, given that a NeRF model provides an implicit form of representation, the volume density of these sampled points does not have a direct correlation with their spatial occupancy. Specifically, the volume density representation denotes the differential probability of obstructing a ray at a given spatial point [18].

In a NeRF representation, the volume density

σ (x)

fundamentally represents the differential probability of a light ray terminating at location

x

. We acknowledge that this optical opacity is distinct from physical occupancy. Explicit geometric representations, such as Signed Distance Fields (SDFs) or Occupancy Grids, directly model surface geometry and free space. However, SDF-based methods often require depth supervision or strong geometric regularization, which poses challenges for monocular UAV mapping in unstructured scenes. In contrast, NeRF density is optimized for photometric consistency. While this can lead to artifacts (e.g., “floaters” or semi-transparent fog) in texture-rich or specular regions, our operational environment consists mainly of static, opaque obstacles such as rocks and stone pillars. In such scenarios, high optical density serves as a sufficient and conservative proxy for physical obstacles. To quantify the risk, we approximate the UGV as a set of body points

B

. A collision occurs if any point in the body set intersects with the obstacle space

ℵ_{c o l l i s i o n}

. According to Boole’s inequality, the probability of the union of events is bounded by the sum of their individual probabilities. Therefore, the collision probability of the UGV at time

t

can be formulated by

p_{t}^{c o l l i s i o n} = P (\underset{b_{t} \in B}{\cup} b_{t} \in ℵ_{c o l l i s i o n}) \leq \sum_{b_{t} \in B} σ (b_{t}) s (b_{t}),

(1)

where

s (b_{t})

is the distance traveled by a body-fixed point

b

at time

t

. Minimizing this upper bound effectively keeps the UGV away from high-density regions, providing a conservative and safe navigation strategy.

With an initial NeRF map, the UGV can autonomously generate several waypoints to traverse amid a high-density area (i.e., a scene full of obstacles) represented by NeRF modeling. Since the quality of path planning relies heavily on the accuracy of pose estimation, many robotic path planning methods introduce loop-closure means in their pose estimate process; a loop-closure process can be briefly summarized as correcting the vehicle’s pose via re-observing features that are already registered in the global map, by which means the continuously accumulated pose estimation errors can be effectively corrected. Inspired by loop-closure in SLAM, we introduce a re-localization method based on iNeRF [17], which corrects the UGV pose by actively re-observing the landmarks selected by the UAV and updates the belief about the current UGV pose. Finally, the UGV fine-tunes the local trajectory by referencing the pose that has been corrected through the re-localization process.

4. Aerial Mapping

In this section, we first present a strategy to utilize aerial images to train a 3D NeRF map representation. Then, an image saliency evaluation algorithm for autonomous landmark selection is established. Finally, we propose a local map pruning method using newly added aerial images.

4.1. NeRF Representation with Aerial Images

Accurate map building is the foundation of vehicle path planning. It is also a vital factor that has major influence on the success of target arrival. We assume that the UAV collects images via its on-board monocular camera and sends these images to the ground station for further processing. We first establish a set of predetermined waypoints that facilitate the UAV’s initial exploration of the target environment. Following the results of this preliminary reconnaissance, we identify suitable flight paths for capturing aerial imagery, thereby ensuring a comprehensive coverage of the scene. Next, we use COLMAP [37], a robust Structure-from-Motion (SfM) method, to estimate the camera poses for the input images

{I_{e}^{1}, \dots, I_{e}^{n} | n = 1, 2, \dots}

. This process enables the extraction of camera extrinsic parameters

{p_{e}^{n} = (x_{e}^{n}, y_{e}^{n}, z_{e}^{n}), θ_{e}^{n}, φ_{e}^{n} | n = 1, 2, \dots}

associated with each aerial image, which are essential for the subsequent 3D reconstruction. Following the SfM process, the collected images and their corresponding 5D coordinates are fed into the NeRF network for training, with the objective of generating a 3D implicit model of the environment. For the sake of practical implementation, an off-the-shelf PyTorch v2.0.1 version of the Instant-NGP is employed for data training under the NeRF structure [26].

4.2. Landmark Selection

In a typical UGV–UAV collaboration task, the flying capability of UAVs can be leveraged to construct a map with high consistency with the real world. However, the inconsistency between the prior map and the real world would heavily affects the following autonomous task. To maintain a high level of map consistency, we design an autonomous landmark selection method based on image saliency evaluation, aiming to provide landmarks that are beneficial for the UGV to carry out re-localization and path fine-tuning. The pipeline of autonomous navigation landmark selection is illustrated in Figure 2.

Our saliency examination framework integrates two components: a visually salient region evaluation algorithm (OTSU [38]) and a feature point detection algorithm (ORB [39]). It is worthy of mentioning that the detected feature points will also be employed in the State Estimator (Section 5.3) during the re-localization process, thereby avoiding redundant computation. Note that in the saliency evaluation part, an object with plausible imagery saliency does not necessarily mean that it is suitable to serve as a navigational landmark. To this end, we propose a threshold-based criterion that takes into account both the proportion of the salient area and the total number of feature points within that area to determine if an object qualifies as a landmark. Specifically, we set the saliency area threshold to 40% and the feature point count to 300. These values were empirically determined based on preliminary experiments in the simulation environment. We performed a sensitivity analysis by varying these thresholds by

\pm 10 %

. The results indicated that the landmark detection performance remained stable within this range, confirming that the system is robust to minor parameter variations and does not require hyper-precise tuning.

To make the visual landmark detection adaptive to scene perception in a computationally efficient manner, we design a “sparse-to-frequent” landmark detection strategy. The landmark detection frequency for image saliency evaluation is set to 1 fps (i.e., one frame of saliency evaluation per second) from the beginning, and it will gradually increase to 10 fps if no landmark is detected. Once a landmark is identified, the detection frequency is reset to 1 fps, and the strategy is reiterated. Such a strategy encourages the UAV to find more effective landmarks under limited computational resources, thereby creating more favorable conditions for the subsequent autonomous tasks performed by the UGV.

4.3. Local Map Pruning

After an aerial image is evaluated as salient and containing landmarks, we generate a local map that surrounds the landmark by online NeRF re-training. The generated NeRF representation of the local map will then be implemented as a reference in UGV path planning. To collect sufficient data for NeRF re-training, we design an autonomous circumnavigation program for the UAV to fly around the landmark of interest, which is illustrated in Figure 3.

The UAV’s circumnavigation rules are outlined in Algorithm 1. The UAV is initially guided to a randomly selected observation point, with the starting circling radius set to zero. At this position, the saliency index

E

of the landmark in the captured aerial image

I_{a e r i a l}

is computed using the evaluation method detailed in Section 4.2. By comparing the saliency index

E_{l a n d m a r k}

with the threshold defined in Section 4.2, the system assesses whether the current observation radius is excessively large or insufficient. The radius is then iteratively adjusted, and the UAV is redirected to the updated observation position accordingly. This iterative process continues until the saliency index reaches the desired threshold at a specific observation radius. Once the optimal saliency is achieved, this radius is deemed as the optimal observation distance, and the UAV proceeds to perform the circumnavigation at this optimized radius

r_{c i r c u m}

.

Algorithm 1 UAV circumnavigation rules

Input:

r_{i n i t}, p_{l a n d m a r k}, E_{l a n d m a r k}, I_{a e r i a l}

Output:

r_{c i r c u m}

1:: $r_{c i r c u m} \leftarrow 0$
2:: $p_{u a v} \leftarrow FlightControl (r_{i n i t})$
3:: while $p_{u a v} - p_{l a n d m a r k} \neq r_{c i r c u m}$ do
4:: $E \leftarrow Saliency evaluation (I_{a e r i a l})$ [Section 4.2]
5:: if $E \neq E_{l a n d m a r k}$ then
6:: $r \leftarrow AdjustRadius$
7:: $p_{u a v} \leftarrow FlightControl (r)$
8:: else
9:: $r_{c i r c u m} \leftarrow r$
10:: $p_{u a v} \leftarrow FlightControl (r_{c i r c u m})$
11:: end if
12:: end while

This strategy ensures that the images obtained provide a rich visual context of the landmarks from various angles. Subsequently, the image sequences

{I_{l}^{1}, \dots, I_{l}^{n} | n = 1, 2, \dots}

, coupled with the determined camera extrinsic parameters

{p_{l}^{n} = (x_{l}^{n}, y_{l}^{n}, z_{l}^{n}), θ_{l}^{n}, φ_{l}^{n} | n = 1, 2, \dots}

, are incorporated into the NeRF training process. The NeRF network is then re-trained, enabling the generation of a local refined 3D map around the landmark. Optimization of 3D landmark models enhances the fidelity of their textural details, thereby significantly improving the accuracy of pixel color matching for regions of interest in UGV re-localization.

5. Ground Navigation

In this section, we first simplify the UGV system as a differentially flat system, which allows us to circumvent the complexities associated with dynamic constraints during the planning phase. Then, we introduce a UGV trajectory planner designed to formulate a path that minimizes the probability of collision and control effort. Moreover, we propose a landmark-aware state estimator for estimating the 6-DOF pose and velocity of the UGV. This estimator leverages the landmarks identified in Section 4 for visual re-localization, realizing high-precision pose correction. Finally, we establish an online local trajectory fine-tuning module that synergizes the trajectory planner and the state estimator, resulting in a vision-only UGV autonomous navigation pipeline.

5.1. Differential Flatness

In this section, we introduce a particular property of “differential flatness” into our system to expedite the planning process. This allows the usage of a smaller set of “flat outputs” and their derivatives to represent the inputs and states of such systems. Notably, UGVs are known to be differentially flat, with their position and yaw angle as flat outputs [40]. The kinematic model of a UGV is shown in Figure 4, where

L

denotes the wheelbase of the UGV.

R

denotes the wheel radius of the UGV.

δ

denotes the steering angle of the UGV. It is worth noting that although this study primarily focuses on addressing the navigation challenges specific to UGVs, the principles of differential flatness can also be applied to other agents if they exhibit certain flat outputs. If system outputs exist that allow all the state and input variables to be expressed through these outputs and their derivatives, the principles of differential flatness can apply to those agents. For instance, in UAV systems, the position coordinates and yaw angle are flat outputs, while the attitude and thrust can be computed from these outputs and their derivatives. Therefore, UAV systems can also be represented using differential flatness principles.

Typically, a trajectory planning pipeline for differential flat systems aims to find polynomial trajectories for the flat outputs to minimize the objective function (such as jerk or snap) subject to waypoint constraints. Such optimization progress can be cast as a quadratic programming problem that can be solved in a computationally efficient manner. Moreover, collision avoidance maneuvers can be seamlessly integrated into this framework. However, to maintain convexity in the optimization problem, navigation strategies through obstacle-laden regions often require pre-defined, hard-coded decisions.

5.2. Trajectory Planner

Following the differential flatness assumption, we propose a trajectory planning approach that takes into account both collision avoidance and control efficiency in this section. Our approach differs from traditional methods as we leverage the NeRF modeling to implicitly represent the environment rather than explicitly describing obstacles in closed form (such as polytopes). Rather than optimizing trajectories only between static waypoints, we employ a denser set of waypoints in NeRF environmental representation that allows for dynamic optimization of waypoint locations. The Adam optimizer is employed as the backbone optimization tool, as in many other trajectory planning studies.

The trajectory planner seeks a set of flat output waypoints

W = {p_{0}, \dots, p_{f}}

, by following which the combined objective cost can be minimized:

J (W) = \sum_{τ = 0}^{f} [\sum_{b_{i} \in B} σ (R_{τ} b_{i} + {\hat{χ}}_{τ}) s (b_{i}) + u_{τ}^{T} Γ u_{τ}]

(2)

where

{\hat{χ}}_{τ}

is the position component of a differentially flat state

χ_{τ}

,

R_{τ}

is the rotation matrix from the UGV body frame to the world frame,

s (\cdot)

is a function that returns the distance traveled by the point in the set of the UGV’s point clouds, and

Γ

is the diagonal matrix used to penalize the weights of control effort. The first objective seeks to minimize the probability of collision, as defined in (1), and the second seeks to minimize control effort. Note that

R_{τ}

,

s (\cdot)

, and

u

are functions of the decision variables

{p_{0}, \dots, p_{f}}

, as they are derived from the surrounding waypoints under the constraint of UGV kinematics.

We employ a heuristic search-based path planning algorithm based on A* to generate an initial set of waypoints, starting from the UGV’s current position to the target destination. These waypoints are then refined through an optimization process that utilizes gradient descent to balance multiple objectives, including obstacle avoidance and the minimization of control effort. Figure 5 illustrates an optimized trajectory that deviates from an initial straight path to circumvent regions of high volume density within the NeRF representation.

5.3. Landmark-Aware State Estimator

When a UGV operates in a GPS-denied environment, its pose estimation error inevitably accumulates with its traversal. Such a phenomenon is also known as the cumulative navigation error problem in SLAM studies, and it might lead to notable deviation from the planned route or excessively cautious path planning decisions. To remedy this issue, one can introduce a loop-closure process to correct the pose estimate. For instance, the UGV can recognize several saliency image features and reference their global positions to carry out loop-closure. Nevertheless, it is not always guaranteed that each image would contain salient features qualified for pose estimation; in cases where an image feature lacks diversity in texture and is difficult to match with its counterpart in the global map, the loop-closure might lead to an even worse pose estimation result, thereby compromising the entire task. To make the loop-closure process more robust and reliable, we propose a NeRF-based re-localization method that capitalizes on navigational landmarks registered in the UAV circumnavigation mission. In addition, we address the challenge of updating the pose belief of the UGV based on given measurements and its most recent control action.

Our method builds upon the work presented in [17], where an initial pose estimate is optimized by minimizing the photometric loss between the observed image and the NeRF-rendered image. Unlike the single-shot estimators that rely heavily on the initial conditions, our method employs an Extended Kalman Filter (EKF) for state estimation. Specifically, we define the UGV state vector

p_{t}

to include the position, orientation quaternion, linear velocity, and angular velocity. At each timestep, the UGV captures a new image

I_{o b s}

and executes a control action

a_{t}

. The EKF process begins with the prediction step, where the prior state estimate is propagated through the non-linear kinematic model:

p_{t | t - 1} = f (p_{t - 1}, a_{t}),

(3)

where

f

represents the UGV’s dynamics. To propagate the system uncertainty, we linearize the non-linear motion model

f

via a First-Order Taylor Expansion. The Jacobian matrix

A_{t - 1}

is computed with respect to the state vector

x

at the previous estimate

p_{t - 1}

:

A_{t - 1} = {\frac{\partial f (x, a_{t})}{\partial x}|}_{x = p_{t - 1}} .

(4)

Subsequently, the predicted error covariance matrix

Σ_{t | t - 1}

is computed by projecting the previous covariance

Σ_{t - 1}

through the Jacobian and adding the process noise covariance

Q_{t - 1}

:

\sum_{t | t - 1} = A_{t - 1} \sum_{t - 1} A_{t - 1}^{T} + Q_{t - 1} .

(5)

Following this prediction, the update step is performed. Unlike traditional filters that use explicit geometric measurements, we utilize the differentiable nature of NeRFs. We compute the gradient of the photometric residual between the observed image

I_{o b s}

and the rendered image at the predicted state

p_{t | t - 1}

via backpropagation. This gradient serves as the observation Jacobian to update the state

p_{t | t - 1}

and covariance

Σ_{t | t - 1}

, effectively fusing the NeRF map prior with real-time visual data.

The original iNeRF algorithm utilizes an L2 loss function to measure the photometric loss, which has a tendency to uniformly smooth illumination gradients. While suitable for stable indoor lighting, this approach may obscure critical details under varying outdoor conditions. To solve this issue, we design a variant of Charbonnier loss that weakens its sensitivity to light by introducing a bias term:

L_{p h o t o} ({I^{'}}_{t}, I_{t}) = \sqrt{{‖{I^{'}}_{t} - I_{t}‖}^{2} + ε_{t}^{2}}

(6)

where

I_{t}

is the observed image,

{I^{'}}_{t}

is the rendered image acquired via the equipped NeRF model,

ε_{t} = c \cdot |p_{o d o m}^{1 : t}| / L_{o b}

is a time-dependent parameter,

c

is an empirical parameter and is set to

1 \times 10^{- 6}

in this study,

|p_{o d o m}^{1 : t}|

is total traverse odometry after the UGV reaches the viewing site, and

L_{o b}

is the observation radius of the landmarks. Equation (6) is the same as the standard L1 loss when the UGV just arrives at the landmark viewing site, and it grows to

c

as the UGV approaches the target landmark. This modified loss function enhances sensitivity to gradient changes, improving robustness against rendering inconsistencies caused by illumination variations. Additionally, the Charbonnier loss offers a smoother transition near zero compared to the L2 loss, which preserves fine details more effectively, particularly at image edges. These attributes render the Charbonnier loss highly suitable for an outdoor navigation context.

As in [18], we select a series of pixels

J

for evaluation using existing image feature detectors (e.g., ORB) to identify areas of interest and bias sampling around regions with higher gradient information. The pose of the robot

T_{t}

can be constructed from the position and rotation elements of

p_{t}

. With this information, the cost function to be minimized is defined as

J (p_{t}) = \sqrt{{‖{I^{'}}_{t} (J) - I_{t} (J)‖}_{S_{t}^{- 1}}^{2} + ε_{t}^{2}} + {‖p_{t | t - 1} - p_{t}‖}_{\sum_{t | t - 1}^{- 1}}^{2}

(7)

where

S_{t}

represents the covariance of measurement noise and the symbol

{‖x‖}_{M}^{2} = x^{T} M x

denotes the weighted

l_{2}

norm. Minimizing this equation gives the updated mean

p_{t}

. Outlier rejection is performed for each pixel loss to reduce variance.

Finally, we employ the known relationship between the Hessian of the Gaussian loss function and the covariance [41] to generate the posterior covariance,

\sum_{t} = {({\frac{\partial^{2} J (x)}{\partial x^{2}}|}_{x = p_{t}})}^{- 1} .

(8)

5.4. Online Replanning

We combine the trajectory planner from Section 5.2 and the state estimator from Section 5.3 to formulate an online replanning strategy. Such a strategy requires an initial state belief of the UGV pose, a predefined target destination, a set of pre-selected landmarks collected by the UAV, and a pre-trained NeRF map.

The online replanning module works in a call-on-demand manner in the path planning system. Concretely, the path planning system for a UGV is initiated by dense waypoints that are generated by the algorithm proposed in Section 5.2. Once the UGV arrives in the vicinity of a landmark labeled by the UAV along its path, it calls the online replanning module to fine-tune its later path: it first updates its state belief via the landmark-aware State Estimator (Section 5.3). The mean of the updated posterior is then utilized as the revised initial state within the trajectory planner. The trajectory is then re-optimized using updated UGV states, ensuring that the remaining path is collision-free and energy efficient. Meanwhile, the UAV continues to track the rest of the previous waypoints for registering more landmarks. This process is repeated in an iterative manner until the UGV successfully reaches the predetermined destination. This approach enables the UGV to dynamically update its trajectory plans by mitigating the impact of distractions, cumulative navigational error, and insufficient environmental perception capability, thereby enhancing the efficacy and flexibility of autonomous navigation.

6. Simulations and Experimental Trial

In this section, we evaluate the proposed air–ground collaborative framework in both a high-fidelity hazardous simulated environment (Stonehenge) and a real-world environment. We selected the Stonehenge environment for simulation because it represents a typical unstructured scenario featuring irregular geometric obstacles, severe occlusions, and rich high-frequency textures. Compared to simple flat terrains or structured urban environments, these characteristics pose significantly greater challenges for NeRF reconstruction and visual localization, making it a rigorous benchmark for evaluating our algorithm’s robustness. The experimental analysis is organized as follows. We first demonstrate that high-quality NeRF models can be reliably constructed from UAV-acquired aerial imagery. Next, we validate the proposed landmark selection strategy and analyze the effectiveness of local NeRF map pruning. We then conduct comparative studies to assess the performance of the trajectory planner. In addition, we evaluate the accuracy of the landmark-based state estimator. Finally, we extend the evaluation to real-world experiments to verify the framework’s robustness under natural lighting variations, sensor noise, and imperfect textures, followed by an assessment of its capability for online replanning.

6.1. NeRF Representation Using Aerial Images

The fidelity of the NeRF model in representing the environment is crucial to the success of UGV navigation, as its obstacle perception is entirely contingent upon the volume density derived from the NeRF output. Consequently, to ascertain whether the NeRF-based scene reconstruction suffices for UGV navigation demands or not, we set a comparative experimental setup against the conventional NeRF reconstruction using 360-degree environment images. To simulate the UAV aerial photography process, we manually create a flight trajectory within the Blender 5.0 software for environmental free-exploration. The UAV (denoted by a camera in Blender) follows the predefined trajectory, capturing environmental images.

We use an off-the-shelf PyTorch implementation of Instant-NGP as the 3D reconstruction method. We set up three different camera layout schemes for the Stonehenge environment: (a) 360-degree panoramic coverage, (b) a simple camera (UAV) motion trajectory, and (c) a complex camera (UAV) motion trajectory. 150 images were acquired separately using each scheme to produce the datasets. These three sets of multi-view image datasets are then fed into torch-ngp for training respectively. The qualitative analysis is shown in Figure 6, with quantitative analysis detailed in Table 2. The qualitative analysis indicates that a simple camera (UAV) motion trajectory often falls short in achieving comprehensive scene coverage. Within regions with rich texture details, numerous artifacts have emerged, leading to a scene reconstruction failure. The incorporation of such NeRF models into UGV navigation tasks is further complicated by the fact that these artifacts contribute erroneous volume density information. This spurious scene representation can significantly impact the UGV’s ability to perceive and avoid obstacles. In contrast, a circumnavigating UAV trajectory has been tailored by taking into account the flight kinematic and FoV constraints (the complex trajectory in Figure 6). Comparative analysis with conventional 360-degree panoramic layouts reveals that the image dataset established along the circum-navigating trajectory outperform the 360-degree panorama in terms of Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) metrics. Qualitative analysis demonstrates that images captured by the UAV following a tailored circumnavigating path can effectively mitigate the artifacts arising from ray sampling. In this study, we use the circumnavigating trajectory as a template that is scalable to the scene size, the purpose of which is to provide an accurate NeRF representation for the UGV to carry out re-localization and path planning.

6.2. Landmark Selection and Map Pruning

We set the threshold for determining whether an object in the environment can be classified as a landmark to be no less than 40% of the salient region and the number of feature points in the salient region is greater than 300. Figure 7 shows the results of the landmark saliency evaluation for six different sets of images, where the objects in the first two images satisfy the requirement and are selected as landmarks.

Furthermore, we conduct an evaluation to ascertain the efficacy of our proposed ‘sparse-to-frequent’ image detection strategy. As is shown in Table 3, a comparative analysis is performed between the number of landmarks identified and the computational time expended by three distinct strategies: low-frequency, high-frequency, and our ‘sparse-to-frequent’ strategy, all applied to the same aerial image sequence. The comparative results indicate that our proposed strategy is capable of detecting more landmarks and, in the meantime, consumes fewer computational resources.

We embed images captured by the UAV around landmarks, along with their respective 5D coordinates, into the existing NeRF model for re-training. As shown in Table 4, the re-trained NeRF model exceeds the original model in all reconstruction accuracy metrics, realizing local map repair for UGV navigation scenes, and also improving the accuracy of visual localization of UGVs using landmarks, which is then verified in Section 6.4.

6.3. Trajectory Planner

The study in [18] has demonstrated that real-world collisions can be well proxied by utilizing the volume density of a NeRF and verified that a UAV can efficiently plan a smooth and collision-free trajectory. We extend this approach to verify the trajectory planning capabilities of a UGV within a hazardous off-road environment. Unlike a UAV, a UGV is subject to more stringent environmental constraints, presenting a heightened challenge for navigation within a NeRF representation.

We compare two widely used path planning techniques in Stonehenge’s NeRF environment representation: the Dijkstra and the Rapid Search Random Tree (RRT). Dijkstra is a classical method for determining the shortest path in a graph. It leverages the concept of breadth-first traversal to calculate the minimum path length between a starting node and all other nodes; RRT is a method based on state-space sampling. It initiates by randomly selecting a point within the search space to serve as a potential node. We selected these classical geometric planners as baselines specifically to isolate and evaluate the quality of the NeRF-based environmental representation. While modern kinodynamic or learning-based planners could offer superior motion efficiency, our primary objective here is to verify that the NeRF-derived density field provides sufficient and accurate geometric information to support valid path generation. Subsequently, it constructs a connection between the root node and this newly identified node, simultaneously assessing for any potential collisions with the surrounding obstacles. RRT requires a binary collision metric, so we first convert the NeRF into a mesh. In order to extract the control actions needed to track the generated trajectories, here we use a differential flatness controller for all the planning algorithms.

To evaluate the proposed trajectory planner’s performance, we run all three methods (Ours, Dijkstra, and RRT) on eight different trajectories with a set of obstacles in the Stonehenge environment. Figure 8a shows the results of the qualitative analysis of the three different trajectory planning methods. Figure 8b shows the start and end points of the 8 trajectories we set up and labels the failure cases. Figure 9 shows the comparative analysis of the mean collision and control costs for the three trajectory planners, along with their respective failure rates in this experiment. In this context, a failed case indicates that the agent collides with obstacles during traversal.

6.4. Landmark-Aware State Estimator

To verify the performance of our proposed landmark-aware UGV state estimator, we conduct ablation studies under the Stonehenge environment. We utilize iNeRF (with a dynamic prior) and NeRF-Navigation as baselines. Then, we apply the improved Charbonnier loss and the landmark-aware localization method to the state estimator, respectively. Additionally, to demonstrate the effectiveness of the vision-based approach, we also compare it with the traditional LiDAR-based SLAM algorithm A-LOAM in terms of pose estimation accuracy. To ensure a fair comparison using the same sensor modality, the point cloud input for A-LOAM is generated by back-projecting the depth images from the RGB-D camera, rather than using a physical LiDAR. Specific parameter settings for A-LOAM were configured to accommodate the depth camera input (e.g., mapping resolution set to 0.2 m and minimum feature distance to 0.4 m). We evaluate these methods using an identical set of initial states and actions, performing 20 sets of trials for each method and calculating the average error over the entire process. The results of the ablation studies are shown in Table 5, where A_c and A_l denote the improved Charbonnier loss and the landmark-aware localization method. Although the translation error, angular velocity error, and velocity error of our landmark-aware localization module are higher than that of NeRF-Navigation, the performance is remarkably improved when it is integrated with the modified Charbonnier loss. The proposed method achieves a competitive improvement in terms of pose estimation accuracy, where the rotation error is reduced by 28.9%, the translation error is reduced by 26.7%, rotation error is reduced by 12.5%, and velocity error is reduced by 23.1%, respectively. Furthermore, by integrating the landmark-aware localization method into A-LOAM, the rotational error is reduced by 28.6% and the translational error is reduced by 25.0%. This suggests that these two modules are highly complementary in the state estimation process. From the comparison of pose estimation accuracy with A-LOAM, it is evident that the vision-based approach still exhibits a certain gap compared to the LiDAR-based method. However, the vision-only method not only outperforms LiDAR-based methods in terms of mapping quality (as the reconstruction results shown in Figure 10) but also offers advantages in terms of lightweight design and cost-effectiveness.

6.5. Online Replanning

We evaluate the performance of the entire UGV navigation framework in the Stonehenge scene. In our study, we simulate the influence of the ground environment on the UGV by incorporating identical noise characteristics into the finite difference equation of the UGV as those utilized in the state estimator experiments. We set up the same eight sets of starting and terminate location as in the trajectory planner experiment to verify the performance of online replanning. Although the online replanning module requires higher control costs to achieve collision-free trajectory planning, it can effectively prevent UGV collisions caused by open-loop execution along the initial path. Exemplary paths with and without loop-closure treatment under the same origin and destination are shown in Figure 11. An illustrative case further shows that the proposed online replanning mechanism would form a timely updated sensing of the obstacles (right picture in Figure 11), thereby avoiding a collision between the UGV and a pillar that occurred in regular cases, demonstrating its advantages in terms of flexibility and robustness.

6.6. Real-World Experiments

To evaluate the performance of our proposed method in real-world environments, we conduct extensive experiments in a laboratory scenario. In the experiments, a quadrotor and an Ackermann-steering mobile robot are used, as shown in Figure 12. The NVIDIA Jetson Orin NX serves as the onboard computing platform for the UGV, running all the navigation algorithms. For environmental perception, the UAV is equipped with a self-stabilizing camera that captures 1920 × 1080 pixel images at 30 fps, while the UGV is fitted with a RealSense D435i RGB-D camera to provide high-quality RGB imagery for visual localization and scene understanding. The 3D reconstruction process leveraging NeRF is implemented on a ground station equipped with an Intel Core i9-13900KF CPU and an NVIDIA RTX 4090 GPU. High-precision motion tracking was achieved using a NOKOV optical motion capture system, providing millimeter-level ground truth data for quantitative evaluation of visual localization accuracy.

In our experiments, five different combinations of starting and ending points were configured as shown in Figure 13. The UGV was required to perform safe obstacle-avoidance navigation based on the initial NeRF map generated from the UAV’s exploration and the extracted visually salient landmarks. The outcomes of UAV pre-exploration and UGV autonomous obstacle-avoidance navigation are illustrated in Figure 14. The quantitative results summarized in Table 6 provide strong evidence that the proposed AGC-NeRF framework maintains stable performance in the real-world environment. In terms of aerial NeRF reconstruction, the UAV-acquired image set yields a PSNR of 28.26, an SSIM of 0.91, and an LPIPS of 0.11, demonstrating that the reconstructed radiance fields preserve both the structural regularity and perceptual fidelity of the laboratory environment. For the UGV navigation component, the proposed landmark-aware state estimator achieves competitive estimation accuracy, with an average rotation error of 0.025 rad, translation error of 0.030 m, angular velocity error of 0.033 rad/s, and linear velocity error of 0.030 m/s. These results indicate that AGC-NeRF effectively suppresses cumulative drift, a well-known limitation of monocular visual odometry, by incorporating the proposed saliency-driven landmark selection and the modified Charbonnier photometric loss. Even when subjected to texture-deficient regions or brightness variations, the localization pipeline exhibits no performance collapse, validating the reliability of its re-localization strategy under real-world sensory ambiguities.

6.7. Timing and Performance

We run these experiments on a computer with an Intel Core i9-13900KF @ 3.0 GHz CPU and Nvidia RTX 4090 GPU. The NeRF reconstruction and rendering time depends on the number of images and the batch size. The computation time of both the trajectory planner and the state estimator depend on the number of iterations. In the initialization phase, the initial trajectory needs to utilize more than 20 s to perform more than 2000 iterations. The state estimator typically consumes 25 s to perform 300 steps of gradient descent iterations. It is crucial to clarify that our online replanning strategy operates in a “call-on-demand” manner rather than a high-frequency real-time control loop. The UGV relies on onboard visual odometry for continuous local control, while the computationally intensive NeRF-based re-localization is triggered only when the UGV reaches the vicinity of a landmark. This “stop-and-compute” latency is acceptable for exploration tasks in static, hazardous environments where safety supersedes speed.

7. Conclusions

In this paper, we propose a novel air–ground collaborative exploration framework, which enables a UGV to navigate autonomously in unknown environments with the assistance of a UAV through a NeRF-based scene representation. The framework consists of two core components: UAV-assisted mapping and ground navigation, forming a closed-loop collaborative mechanism tailored for unknown environment exploration. The UAV serves as a scene rehearsal director that provides both landmark and initial map for the UGV, and the UGV further benefits from referencing the rehearsed scenes for achieving flexible and accurate path planning under complicated unknown environment. Simulation and experiment results show that our proposed method can efficiently reconstruct high-fidelity NeRF models with detailed texture information even in unknown, cluttered scenarios. Furthermore, the integration of the proposed saliency-driven landmark selection method and improved photometric loss function significantly enhances the UGV’s state estimation accuracy, ensuring robust navigation performance in GPS-denied, texture-sparse, or obstacle-rich unknown environments. This research provides a practical solution for breaking through the technical bottlenecks of single-vehicle exploration in unknown environments, offering valuable technical support for high-risk missions such as post-disaster search and rescue, remote wilderness surveying, and underground facility detection.

Nevertheless, the proposed AGC-NeRF still has limitations: it heavily relies on high-performance hardware (e.g., discrete GPUs at a ground station) for NeRF reconstruction and optimization and lacks adaptability to large-scale unknown environments. In future research, we will focus on lightweighting the entire navigation pipeline to reduce hardware dependencies and expanding the experimental validation from controlled laboratory settings to large-scale outdoor field trials. This will allow us to rigorously test the system’s scalability and robustness under uncontrolled lighting and terrain conditions.

Author Contributions

Conceptualization, C.L., M.Y. and Y.W.; Methodology, C.L. and M.Y.; Software, C.L.; Validation, C.L. and M.Y.; Resources, M.Y., Y.W. and H.L.; Writing—original draft, C.L.; Writing—review and editing, C.L., M.Y., Y.W. and H.L.; Visualization, C.L.; Supervision, M.Y., Y.W. and H.L.; Project administration, M.Y., Y.W. and H.L.; Funding acquisition, M.Y. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Key Research and Development Program of China (No. 2023YFB4005900), National Natural Science Foundation of China (No. U20B2001), and Yongjiang Talent Engineering Technology Innovation Team Project of Zhejiang Province (No. 2022A-008−C). We thank the aforementioned projects for their financial support.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

Castellani, M.; García-Macías, E.; Meoni, A.; Ubertini, F. Fast deformation assessment of riveted steel railway bridges during load testing using unmanned aerial vehicles (UAVs). Measurement 2025, 253, 117412. [Google Scholar] [CrossRef]
Ye, L.; Du, B.; Yu, L.; Xu, X.; Zhang, J. A comprehensive estimation method for vehicle motion states based on model constraints. Measurement 2025, 242, 116153. [Google Scholar] [CrossRef]
Messaoudi, K.; Oubbati, O.S.; Rachedi, A.; Lakas, A.; Bendouma, T.; Chaib, N. A survey of UAV-based data collection: Challenges, solutions and future perspectives. J. Netw. Comput. Appl. 2023, 216, 103670. [Google Scholar] [CrossRef]
Dinelli, C.; Racette, J.; Escarcega, M.; Lotero, S.; Gordon, J.; Montoya, J.; Dunaway, C.; Androulakis, V.; Khaniani, H.; Shao, S. Configurations and applications of multi-agent hybrid drone/unmanned ground vehicle for underground environments: A review. Drones 2023, 7, 136. [Google Scholar] [CrossRef]
Fei, Z.; Wang, X.; Wu, N.; Huang, J.; Zhang, J.A. Air-ground integrated sensing and communications: Opportunities and challenges. IEEE Commun. Mag. 2023, 61, 55–61. [Google Scholar] [CrossRef]
Ropero, F.; Munoz, P.; R-Moreno, M.D. TERRA: A path planning algorithm for cooperative UGV–UAV exploration. Eng. Appl. Artif. Intell. 2019, 78, 260–272. [Google Scholar] [CrossRef]
Güler, S.; Yıldırım, İ.E. A distributed relative localization approach for air-ground robot formations with onboard sensing. Control Eng. Pract. 2023, 135, 105492. [Google Scholar] [CrossRef]
Li, Z.; Jiang, C.; Gu, X.; Xu, Y.; Cui, J. Collaborative positioning for swarms: A brief survey of vision, LiDAR and wireless sensors based methods. Def. Technol. 2024, 33, 475–493. [Google Scholar] [CrossRef]
Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
Liang, X.; Zhao, S.; Chen, G.; Meng, G.; Wang, Y. Design and development of ground station for UAV/UGV heterogeneous collaborative system. Ain Shams Eng. J. 2021, 12, 3879–3889. [Google Scholar] [CrossRef]
Sucar, E.; Liu, S.; Ortiz, J.; Davison, A.J. imap: Implicit mapping and positioning in real-time. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 6229–6238. [Google Scholar]
LaValle, S. Rapidly-Exploring Random Trees: A New Tool for Path Planning; Research Report 98-11; Computer Science Department, Iowa State University: Ames, IA, USA, 1998. [Google Scholar]
Hart, P.E.; Nilsson, N.J.; Raphael, B. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 100–107. [Google Scholar] [CrossRef]
Lu, C.; Yu, M.; Li, H.; Cui, H. Landmark-aware autonomous odometry correction and map pruning for planetary rovers. Acta Astronaut. 2025, 226, 86–96. [Google Scholar] [CrossRef]
Zeng, Q.; Kan, Y.; Tao, X.; Hu, Y. LiDAR positioning algorithm based on ICP and artificial landmarks assistance. Sensors 2021, 21, 7141. [Google Scholar] [CrossRef]
Arbanas, B.; Ivanovic, A.; Car, M.; Orsag, M.; Petrovic, T.; Bogdan, S. Decentralized planning and control for UAV–UGV cooperative teams. Auton. Robot. 2018, 42, 1601–1618. [Google Scholar] [CrossRef]
Yen-Chen, L.; Florence, P.; Barron, J.T.; Rodriguez, A.; Isola, P.; Lin, T.-Y. inerf: Inverting neural radiance fields for pose estimation. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 1323–1330. [Google Scholar]
Adamkiewicz, M.; Chen, T.; Caccavale, A.; Gardner, R.; Culbertson, P.; Bohg, J.; Schwager, M. Vision-only robot navigation in a neural radiance world. IEEE Robot. Autom. Lett. 2022, 7, 4606–4613. [Google Scholar] [CrossRef]
a’Xing, X.; Jin, Z.; Haolong, F.; Tao, Z.; Dongjie, L. Vision-based map building and path planning method in unmanned air/ground vehicle cooperative systems. J. Eng. 2020, 2020, 520–525. [Google Scholar] [CrossRef]
Kim, P.; Price, L.C.; Park, J.; Cho, Y.K. UAV-UGV cooperative 3D environmental mapping. In Proceedings of the ASCE International Conference on Computing in Civil Engineering 2019, Atlanta, GA, USA, 17–19 June 2019; pp. 384–392. [Google Scholar]
Koubaa, A.; Ammar, A.; Abdelkader, M.; Alhabashi, Y.; Ghouti, L. AERO: AI-enabled remote sensing observation with onboard edge computing in UAVs. Remote Sens. 2023, 15, 1873. [Google Scholar] [CrossRef]
Bakirci, M. Internet of Things-enabled unmanned aerial vehicles for real-time traffic mobility analysis in smart cities. Comput. Electr. Eng. 2025, 123, 110313. [Google Scholar] [CrossRef]
Zhu, Z.; Peng, S.; Larsson, V.; Xu, W.; Bao, H.; Cui, Z.; Oswald, M.R.; Pollefeys, M. Nice-slam: Neural implicit scalable encoding for slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12786–12796. [Google Scholar]
Liu, L.; Gu, J.; Zaw Lin, K.; Chua, T.-S.; Theobalt, C. Neural sparse voxel fields. Adv. Neural Inf. Process. Syst. 2020, 33, 15651–15663. [Google Scholar]
Fridovich-Keil, S.; Yu, A.; Tancik, M.; Chen, Q.; Recht, B.; Kanazawa, A. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5501–5510. [Google Scholar]
Müller, T.; Evans, A.; Schied, C.; Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 2022, 41, 1–15. [Google Scholar] [CrossRef]
Chen, A.; Xu, Z.; Geiger, A.; Yu, J.; Su, H. Tensorf: Tensorial radiance fields. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 333–350. [Google Scholar]
Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 2023, 42, 131–139. [Google Scholar] [CrossRef]
Lee, J.; Bretl, T. GSFeatLoc: Visual Localization Using Feature Correspondence on 3D Gaussian Splatting. arXiv 2025, arXiv:2504.20379. [Google Scholar] [CrossRef]
Klein, G.; Murray, D. Parallel tracking and mapping for small AR workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007; pp. 225–234. [Google Scholar]
Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robot. 2017, 32, 1309–1332. [Google Scholar] [CrossRef]
Usman, A.B.; Gutierrez, J. DATM: A dynamic attribute trust model for efficient collaborative routing. Ann. Oper. Res. 2019, 277, 293–310. [Google Scholar] [CrossRef]
Franchi, M.; Bucci, A.; Zacchini, L.; Ridolfi, A.; Bresciani, M.; Peralta, G.; Costanzi, R. Maximum a posteriori estimation for AUV localization with USBL measurements. IFAC-PapersOnLine 2021, 54, 307–313. [Google Scholar] [CrossRef]
Deng, X.; Mousavian, A.; Xiang, Y.; Xia, F.; Bretl, T.; Fox, D. PoseRBPF: A Rao–Blackwellized particle filter for 6-D object pose tracking. IEEE Trans. Robot. 2021, 37, 1328–1342. [Google Scholar] [CrossRef]
Bacca, B.; Salvi, J.; Cufí, X. Appearance-based mapping and localization for mobile robots using a feature stability histogram. Robot. Auton. Syst. 2011, 59, 840–857. [Google Scholar] [CrossRef]
Maggio, D.; Abate, M.; Shi, J.; Mario, C.; Carlone, L. Loc-nerf: Monte carlo localization using neural radiance fields. arXiv 2022, arXiv:2209.09050. [Google Scholar]
Schonberger, J.L.; Frahm, J.-M. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4104–4113. [Google Scholar]
Otsu, N. A threshold selection method from gray-level histograms. Automatica 1975, 11, 23–27. [Google Scholar] [CrossRef]
Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
Fliess, M.; Lévine, J.; Martin, P.; Rouchon, P. Flatness and defect of non-linear systems: Introductory theory and examples. Int. J. Control 1995, 61, 1327–1361. [Google Scholar] [CrossRef]
Yuen, K.-V. Appendix A: Relationship between the Hessian and covariance matrix for Gaussian random variables. In Bayesian Methods for Structural Dynamics and Civil Engineering; John Wiley & Sons: Singapore, 2010; pp. 257–262. [Google Scholar]

Figure 1. Two threads in the AGC-NeRF framework. (Left) The Aerial Mapping thread is for pre-exploring the environment and providing map and landmark information for UGV navigation. (Right) The Ground Navigation thread is for planning obstacle avoidance paths and state estimation in a hazardous off-road environment.

Figure 2. The pipeline of autonomous navigation landmark selection method based on saliency detection algorithms. The UAV firstly captures sequential images. Then, the visual saliency of these images is detected via simultaneously examining the image saliency and local image feature richness. An object is selected as a landmark if its saliency score exceeds the pre-set threshold.

Figure 3. Upon arriving at the landmark’s vicinity, the UAV adjusts its re-observation radius based on the computed saliency index. It then executes a circumnavigation maneuver to capture multi-view images of the landmark.

Figure 4. The kinematic model of the UGV.

Figure 5. Illustration of the trajectory planning mechanism within the NeRF representation. (Left) The UGV kinematic model in the ground truth environment, where

u

denotes the control effort vector applied to the vehicle and

b

represents the discrete body points used for collision checking. (Right) The corresponding NeRF volume density field

σ

, where darker regions indicate higher density (obstacles). The planner optimizes the path by transitioning from the initial trajectory (red squares) to the optimized collision-free trajectory (blue diamonds). The symbol

s

denotes the traveled distance between consecutive waypoints, which is used to calculate the collision probability integral.

Figure 5. Illustration of the trajectory planning mechanism within the NeRF representation. (Left) The UGV kinematic model in the ground truth environment, where

u

denotes the control effort vector applied to the vehicle and

b

represents the discrete body points used for collision checking. (Right) The corresponding NeRF volume density field

σ

, where darker regions indicate higher density (obstacles). The planner optimizes the path by transitioning from the initial trajectory (red squares) to the optimized collision-free trajectory (blue diamonds). The symbol

s

denotes the traveled distance between consecutive waypoints, which is used to calculate the collision probability integral.

Figure 6. Comparison of NeRF reconstruction results for three camera layout schemes in Stonehenge scene.

Figure 7. Results of the landmark saliency evaluation for six different sets of images.

Figure 8. A comparison between our trajectory planner and Dijkstra and RRT baselines. (a) Examples of trajectories planned by the three planners in the Stonehenge environment. Our planner can generate a smooth and minimum control effort trajectory by moving the waypoints initialized by A*. While both Dijkstra and RRT plan collision-free trajectories, their irregular shape leads to a high control cost. (b) The start and end points of the 8 sets of trials are indicated by matching colors, with crosses indicating planning failures (their colors correspond to the colors in (a)).

Figure 9. A comparison between collision and control cost and failure rate of the three planners. The collision and control cost averaged over the 20 initialization iterations. Ours planner achieves the highest trajectory planning success rate using the smallest amount of control and collision costs.

Figure 10. Comparison of mapping results between the vision-only method (NeRF) and the LiDAR-based method (A-LOAM).

Figure 11. Evaluation of the online replanning module under environmental disturbances. (Left) 2D waypoint visualization in the X-Y plane. The blue dots represent the initial global waypoints, and the orange dots indicate the locally re-planned waypoints triggered to correct the deviation. The red star and green triangle denote the start and end positions, respectively. The arrow indicates the specific location where environmental noise was introduced. (Right) 3D visualization of the navigation outcome at the noise-introduced position. The trajectory without online replanning results in a collision with the obstacle, demonstrating the necessity of the proposed replanning mechanism for collision avoidance.

Figure 12. Real-world experimental platform.

Figure 13. Experimental environment and scenario setup. The same color indicates the corresponding start and end points.

Figure 14. (Left) NeRF-based 3D reconstruction results from aerial images. (Right) UGV autonomous navigation and obstacle-avoidance performance.

Table 1. Comparison of AGC-NeRF with state-of-the-art navigation frameworks.

Method	Map Representation	System Architecture	Landmark Strategy	Collaboration Type
iNeRF	Implicit (NeRF)	Single-Agent (Camera)	Passive (Global Matching)	None
NeRF-Nav	Implicit (NeRF)	Single-Agent (Camera)	Passive (Collision Check)	None
Map-free Coop.	Sparse/None	Multi-Agent Swarm	Relative Sensing (UWB/Vision)	State Exchange Only
AGC-NeRF (Ours)	Implicit + Explicit (Landmarks)	Heterogeneous (UAV + UGV)	Active Saliency Selection	Active Mapping & Guidance

Table 2. Quantitative analysis of NeRF reconstruction in Stonehenge scenes with three camera layout schemes.

Camera Layout Scheme	PSNR ↑	SSIM ↑	LPIPS ↓	Time ↓
Simple Trajectory	29.057	0.841	0.078	34.85 s
Complex Trajectory	40.556	0.984	0.017	34.41 s
360-degree Panorama	36.656	0.968	0.022	34.38 s

Table 3. Comparison of three different image detection strategy.

	Number of Landmarks	Time
Low-frequency	4	6.24 s
High-frequency	6	12.03 s
Sparse-to-frequent	6	7.19 s

Table 4. Comparison between the original NeRF model and the re-trained NeRF model.

	PSNR ↑	SSIM ↑	LPIPS ↓
Original NeRF	40.556	0.984	0.017
Re-trained NeRF	40.961	0.988	0.012

Table 5. Ablation studies in the Stonehenge environment. All tasks aimed to verify the performance of each module in our proposed landmark-aware UGV state estimator. The results are presented as mean ± standard deviation over 20 independent trials. Statistical significance was verified via paired t-tests, where

p < 0.05

indicates a significant improvement over the baselines. A_c represents the improved Charbonnier loss. A_l represents the landmark-aware localization method.

Table 5. Ablation studies in the Stonehenge environment. All tasks aimed to verify the performance of each module in our proposed landmark-aware UGV state estimator. The results are presented as mean ± standard deviation over 20 independent trials. Statistical significance was verified via paired t-tests, where

p < 0.05

indicates a significant improvement over the baselines. A_c represents the improved Charbonnier loss. A_l represents the landmark-aware localization method.

Method	Stonehenge Environment
Method	Rotation Error (Rad)	Translation Error (m)	$ω$ Error (Rad/s)	Velocity Error (m/s)
iNeRF w/dynamics	0.052 ± 0.006	0.065 ± 0.008	0.041 ± 0.005	0.15 ± 0.02
NeRF-Nav	0.038 ± 0.005	0.045 ± 0.006	0.040 ± 0.004	0.13 ± 0.02
Ours w/o A_c	0.035 ± 0.004	0.047 ± 0.006	0.045 ± 0.005	0.19 ± 0.03
Ours w/o A_l	0.031 ± 0.003	0.041 ± 0.005	0.039 ± 0.004	0.12 ± 0.02
Ours	0.027 ± 0.002	0.033 ± 0.002	0.035 ± 0.003	0.10 ± 0.01
A-LOAM	0.021 ± 0.004	0.024 ± 0.002	-	-
A-LOAM w/A_l	0.015 ± 0.002	0.018 ± 0.001	-	-

Table 6. Quantitative analysis of AGC-NeRF performance in the real-world environment (B203 Lab). The metrics are reported for 5 independent trials, with the final column presenting the mean ± standard deviation to demonstrate system stability.

Method		Metrics	B203 Lab Environment					Avg. ± Std.
Method		Metrics	01	02	03	04	05	Avg. ± Std.
AGC-NeRF	UAV Mapping	PSNR	28.76	28.91	27.80	29.31	26.50	28.26 ± 1.13
		SSIM	0.92	0.92	0.90	0.94	0.89	0.91 ± 0.05
		LPIPS	0.11	0.10	0.13	0.09	0.14	0.11 ± 0.02
	UGV Navigation	Rot. err	0.026	0.024	0.028	0.022	0.027	0.025 ± 0.002
		Trans. err	0.041	0.037	0.040	0.035	0.042	0.039 ± 0.003
		$ω$ Error	0.034	0.033	0.036	0.029	0.031	0.033 ± 0.003
		Vel. err	0.028	0.031	0.032	0.024	0.033	0.030 ± 0.004

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lu, C.; Yu, M.; Wang, Y.; Li, H. AGCNeRF: Air–Ground Collaborative Visual Mapping and Navigation via Landmark-Enhanced Neural Radiance Fields. Drones 2026, 10, 171. https://doi.org/10.3390/drones10030171

AMA Style

Lu C, Yu M, Wang Y, Li H. AGCNeRF: Air–Ground Collaborative Visual Mapping and Navigation via Landmark-Enhanced Neural Radiance Fields. Drones. 2026; 10(3):171. https://doi.org/10.3390/drones10030171

Chicago/Turabian Style

Lu, Chenxi, Meng Yu, Yin Wang, and Hua Li. 2026. "AGCNeRF: Air–Ground Collaborative Visual Mapping and Navigation via Landmark-Enhanced Neural Radiance Fields" Drones 10, no. 3: 171. https://doi.org/10.3390/drones10030171

APA Style

Lu, C., Yu, M., Wang, Y., & Li, H. (2026). AGCNeRF: Air–Ground Collaborative Visual Mapping and Navigation via Landmark-Enhanced Neural Radiance Fields. Drones, 10(3), 171. https://doi.org/10.3390/drones10030171

Article Menu

AGCNeRF: Air–Ground Collaborative Visual Mapping and Navigation via Landmark-Enhanced Neural Radiance Fields

Highlights

Abstract

1. Introduction

2. Related Work

2.1. UGV–UAV Collaborative System

2.2. 3D Scene Representations

2.3. Visual Navigation

3. Problem Formulation

4. Aerial Mapping

4.1. NeRF Representation with Aerial Images

4.2. Landmark Selection

4.3. Local Map Pruning

5. Ground Navigation

5.1. Differential Flatness

5.2. Trajectory Planner

5.3. Landmark-Aware State Estimator

5.4. Online Replanning

6. Simulations and Experimental Trial

6.1. NeRF Representation Using Aerial Images

6.2. Landmark Selection and Map Pruning

6.3. Trajectory Planner

6.4. Landmark-Aware State Estimator

6.5. Online Replanning

6.6. Real-World Experiments

6.7. Timing and Performance

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI