1. Introduction
With the increasing demand for automation, unmanned intelligence has placed Unmanned Aerial Vehicles (UAVs) [
1] and Unmanned Ground Vehicles (UGVs) [
2] at the forefront of high-risk mission execution, including search and rescue or surveillance in environments that pose potential threats to human life. This has significantly accelerated the application research for UAV and UGV autonomous systems [
3,
4]. However, the escalating complexity of exploration tasks has outstripped the capabilities of an individual agent, mostly due to the limitations of their inherent characteristics. For example, although UAVs are terrain-independent and capable of aero-exploration, their battery life and limited payloads pose great challenges to their long-term operation. On the other hand, UGVs have significantly greater payload and endurance capacity, yet their operations are heavily terrain-dependent and often hindered by obstacles or narrow terrain conditions. As a result, the complementary characteristic of UGVs’ and UAVs’ functions have fostered their cooperation on unmanned operation missions, leading to the emergence of air–ground collaboration research for multi-robot systems [
5,
6].
Utilizing onboard sensors for environment interpretation and trajectory planning is the backbone of air–ground collaborative navigation systems [
7]. A collaborative navigation system usually leverages the perceived information acquired by heterogeneous sensors mounted onboard the agents, e.g., inertial measurement unit (IMU), camera, or LiDAR. In this work, we focus on employing cameras to navigate both a UAV and a UGV in a mutually assisting manner. The main reason for utilizing only cameras is their light weight, which is well-suited for UAV operations. The visual representation of the environment can be polymorphous, such as RGB/gray image (monocular camera), voxel grid/depth image (binocular camera), or Signed Distance Field (SDF). Recently, research on deep learning-based 3D implicit geometric representation known as Neural Radiance Fields (NeRFs) [
8] has attracted considerable interest due to its proficiency in reconstructing and rendering photorealistic models. Currently, a substantial number of studies have demonstrated the advantages of NeRFs in the field of robotic vision. In contrast to conventional explicit geometric representation methods (e.g., voxels, meshes, and point clouds), NeRFs are trained directly on dense multi-view image streams, which omits the traditional feature extraction and matching procedures [
9,
10]. NeRFs represent the geometric form of objects in a scene via continuous volume density, which is highly suitable for robotic trajectory planning [
11]. This further motivates this study, in which we integrated the NeRF representation into an air–ground cooperative navigation system, harnessing its volume density information to generate dynamically viable and collision-free trajectories for unmanned vehicles.
The navigation of UGVs requires a priori global map information to carry out path planning, where algorithms like RRT [
12] or A* [
13] can be employed to generate several paths for the UGV to follow. However, as the exploration environment becomes increasingly intricate and complex, factors like topographical undulation, sensory biases, and the loss of GPS/GNSS signals might mislead the UGV into deviated paths. To this end, a state estimator can be employed to track the UGV pose with respect to pre-selected landmarks. The relative pose estimation can be further used to plan a local vehicle path according to the newly perceived scene information. Some studies [
14,
15] suggest to deploy several artificial landmarks for better UGV localization, yet in complicated environments, manually deploying landmarks can be labor-intensive and hazardous to humans. Arbanas et al. [
16] propose to utilize real-time communication between UGVs and UAVs to fuse the sensory data and achieve high-precision mutual localization. However, the visual data after fusion has not been employed to guide the UGV to correct its planned path, which, to a certain extent, poses a potential threat to the safety of unmanned systems.
Taking into account the aforementioned challenges, our aim is to address the problem of collaborative visual navigation for an air–ground dual-agent system in hazardous off-road environments. The proposed method starts with reconstructing the environmental map using sequential aerial vision under the NeRF architecture. During the UAV’s exploratory flight, a visual saliency evaluation method is introduced to autonomously select landmarks that provide reliable navigational cues tailored to the UGV’s navigation system. Notably, the UAV constructs the NeRF map and shares this dynamically updated NeRF representation, along with navigation-friendly landmarks, with the UGV. Equipped with this a priori NeRF map and landmarks acquired by the UAV, the UGV first performs global path planning and then estimates its current state by referencing landmarks in real time during traversal, thereby enabling online trajectory replanning. Such mechanism ensures the UGV’s capacity of flexible collision-free navigation with high accuracy. In summary, our main contributions are as follows:
We propose AGC-NeRF, a novel air–ground collaborative framework, which includes a method for the best viewpoint selection. This method enables the UAV to autonomously identify the best observation viewpoints and capture environmental images, providing a prior NeRF map of the unknown hazardous environment for the UGV’s ground navigation.
An innovative automatic navigation landmark selection strategy based on image saliency evaluation algorithms is designed to identify navigational landmarks for guiding the UGV’s planning module.
A comprehensive ground navigation framework consisting of a trajectory planner and a landmark-driven state estimator in a NeRF world is presented, realizing high-precision state estimation and collision-free navigation for the UGV.
To explicitly clarify the position of AGC-NeRF, we compare it with existing frameworks in
Table 1. Unlike single-agent methods (e.g., iNeRF [
17], NeRF-Navigation [
18]) that passively use static maps, AGC-NeRF utilizes a heterogeneous air–ground loop where the global prior is actively constructed and filtered via a saliency-based landmark selection mechanism. This active attention mechanism addresses the robustness issues of standard photometric optimization in texture-poor regions. Furthermore, compared to map-free cooperative localization which relies on sparse relative sensing, our method provides dense, photorealistic obstacle perception and global consistency. While this incurs higher computational costs, it offers the fine-grained environmental context indispensable for safe navigation in hazardous, unstructured environments.
The rest of this paper is organized as follows:
Section 2 briefly introduces related work.
Section 3 formulates the problem to be addressed.
Section 4 and
Section 5 present the detailed procedure of the proposed air–ground collaborative framework.
Section 6 presents the simulation and experiment results and some necessary discussion. Finally, conclusions and future works are summarized in
Section 7.
3. Problem Formulation
This paper proposes an air–ground collaborative framework for a UGV’s visual navigation in a neural radiance (NeRF) world. A NeRF () maps a 5D coordinate including a 3D location and view direction to an emitted color and volume density . The volume density is solely dependent on the position, we define as the volume density at location , and is also the output of the NeRF model estimated at . Similarly, we define as the color rendered by the NeRF model for pixel under the camera pose , where represents the Special Euclidean group.
We consider the problem of a UGV, equipped with a monocular camera, that seeks to navigate through an unknown hazardous environment with the assistant of a UAV that is also equipped with a monocular camera. The workflow of the UGV–UAV cooperative navigation architecture is illustrated in
Figure 1.
Before detailing the framework, we clarify the operational assumptions and the scope of this study. The primary objective of AGC-NeRF is to validate the methodological feasibility of utilizing NeRF for fine-grained 3D reconstruction and implicit obstacle perception in unknown environments. To this end, we assume the availability of a ground station equipped with high-performance computing units (e.g., discrete GPUs) to handle the computationally intensive NeRF training and state optimization tasks. Consequently, strictly limited onboard energy consumption and UAV flight endurance are considered constraints to be optimized in future engineering phases rather than immediate bottlenecks for this proof-of-concept exploration. This setup allows us to focus on demonstrating the superior mapping quality and navigation accuracy of the proposed air–ground collaborative paradigm. The overall workflow consists of two threads as shown in
Figure 1, which are as follows:
Aerial Mapping: A UAV first pre-explores the environment, collecting multi-view images for NeRF reconstruction of the entire scene. During the UAV’s flight time, the UAV evaluates the image saliency via its on-board camera to select landmarks suitable for the following UGV’s navigation and stores several images in the vicinity of landmarks for pruning their NeRF models.
Ground Navigation: The UGV seeks to plan a collision-free path from the initial state to the target state . During this process, the UGV evaluates collision probability and estimates ego-motion via referencing the NeRF models of landmarks established during aerial mapping.
Similar to [
18], we approximate the UGV’s body with a finite collection of points
at which the collision is checked. Typically, this involves a 3D grid of points that represent a bounding box of the UGV; however, given that a NeRF model provides an implicit form of representation, the volume density of these sampled points does not have a direct correlation with their spatial occupancy. Specifically, the volume density representation denotes the differential probability of obstructing a ray at a given spatial point [
18].
In a NeRF representation, the volume density
fundamentally represents the differential probability of a light ray terminating at location
. We acknowledge that this optical opacity is distinct from physical occupancy. Explicit geometric representations, such as Signed Distance Fields (SDFs) or Occupancy Grids, directly model surface geometry and free space. However, SDF-based methods often require depth supervision or strong geometric regularization, which poses challenges for monocular UAV mapping in unstructured scenes. In contrast, NeRF density is optimized for photometric consistency. While this can lead to artifacts (e.g., “floaters” or semi-transparent fog) in texture-rich or specular regions, our operational environment consists mainly of static, opaque obstacles such as rocks and stone pillars. In such scenarios, high optical density serves as a sufficient and conservative proxy for physical obstacles. To quantify the risk, we approximate the UGV as a set of body points
. A collision occurs if any point in the body set intersects with the obstacle space
. According to Boole’s inequality, the probability of the union of events is bounded by the sum of their individual probabilities. Therefore, the collision probability of the UGV at time
can be formulated by
where
is the distance traveled by a body-fixed point
at time
. Minimizing this upper bound effectively keeps the UGV away from high-density regions, providing a conservative and safe navigation strategy.
With an initial NeRF map, the UGV can autonomously generate several waypoints to traverse amid a high-density area (i.e., a scene full of obstacles) represented by NeRF modeling. Since the quality of path planning relies heavily on the accuracy of pose estimation, many robotic path planning methods introduce loop-closure means in their pose estimate process; a loop-closure process can be briefly summarized as correcting the vehicle’s pose via re-observing features that are already registered in the global map, by which means the continuously accumulated pose estimation errors can be effectively corrected. Inspired by loop-closure in SLAM, we introduce a re-localization method based on iNeRF [
17], which corrects the UGV pose by actively re-observing the landmarks selected by the UAV and updates the belief about the current UGV pose. Finally, the UGV fine-tunes the local trajectory by referencing the pose that has been corrected through the re-localization process.
4. Aerial Mapping
In this section, we first present a strategy to utilize aerial images to train a 3D NeRF map representation. Then, an image saliency evaluation algorithm for autonomous landmark selection is established. Finally, we propose a local map pruning method using newly added aerial images.
4.1. NeRF Representation with Aerial Images
Accurate map building is the foundation of vehicle path planning. It is also a vital factor that has major influence on the success of target arrival. We assume that the UAV collects images via its on-board monocular camera and sends these images to the ground station for further processing. We first establish a set of predetermined waypoints that facilitate the UAV’s initial exploration of the target environment. Following the results of this preliminary reconnaissance, we identify suitable flight paths for capturing aerial imagery, thereby ensuring a comprehensive coverage of the scene. Next, we use COLMAP [
37], a robust Structure-from-Motion (SfM) method, to estimate the camera poses for the input images
. This process enables the extraction of camera extrinsic parameters
associated with each aerial image, which are essential for the subsequent 3D reconstruction. Following the SfM process, the collected images and their corresponding 5D coordinates are fed into the NeRF network for training, with the objective of generating a 3D implicit model of the environment. For the sake of practical implementation, an off-the-shelf PyTorch v2.0.1 version of the Instant-NGP is employed for data training under the NeRF structure [
26].
4.2. Landmark Selection
In a typical UGV–UAV collaboration task, the flying capability of UAVs can be leveraged to construct a map with high consistency with the real world. However, the inconsistency between the prior map and the real world would heavily affects the following autonomous task. To maintain a high level of map consistency, we design an autonomous landmark selection method based on image saliency evaluation, aiming to provide landmarks that are beneficial for the UGV to carry out re-localization and path fine-tuning. The pipeline of autonomous navigation landmark selection is illustrated in
Figure 2.
Our saliency examination framework integrates two components: a visually salient region evaluation algorithm (OTSU [
38]) and a feature point detection algorithm (ORB [
39]). It is worthy of mentioning that the detected feature points will also be employed in the
State Estimator (
Section 5.3) during the re-localization process, thereby avoiding redundant computation. Note that in the saliency evaluation part, an object with plausible imagery saliency does not necessarily mean that it is suitable to serve as a navigational landmark. To this end, we propose a threshold-based criterion that takes into account both the proportion of the salient area and the total number of feature points within that area to determine if an object qualifies as a landmark. Specifically, we set the saliency area threshold to 40% and the feature point count to 300. These values were empirically determined based on preliminary experiments in the simulation environment. We performed a sensitivity analysis by varying these thresholds by
. The results indicated that the landmark detection performance remained stable within this range, confirming that the system is robust to minor parameter variations and does not require hyper-precise tuning.
To make the visual landmark detection adaptive to scene perception in a computationally efficient manner, we design a “sparse-to-frequent” landmark detection strategy. The landmark detection frequency for image saliency evaluation is set to 1 fps (i.e., one frame of saliency evaluation per second) from the beginning, and it will gradually increase to 10 fps if no landmark is detected. Once a landmark is identified, the detection frequency is reset to 1 fps, and the strategy is reiterated. Such a strategy encourages the UAV to find more effective landmarks under limited computational resources, thereby creating more favorable conditions for the subsequent autonomous tasks performed by the UGV.
4.3. Local Map Pruning
After an aerial image is evaluated as salient and containing landmarks, we generate a local map that surrounds the landmark by online NeRF re-training. The generated NeRF representation of the local map will then be implemented as a reference in UGV path planning. To collect sufficient data for NeRF re-training, we design an autonomous circumnavigation program for the UAV to fly around the landmark of interest, which is illustrated in
Figure 3.
The UAV’s circumnavigation rules are outlined in Algorithm 1. The UAV is initially guided to a randomly selected observation point, with the starting circling radius set to zero. At this position, the saliency index
of the landmark in the captured aerial image
is computed using the evaluation method detailed in
Section 4.2. By comparing the saliency index
with the threshold defined in
Section 4.2, the system assesses whether the current observation radius is excessively large or insufficient. The radius is then iteratively adjusted, and the UAV is redirected to the updated observation position accordingly. This iterative process continues until the saliency index reaches the desired threshold at a specific observation radius. Once the optimal saliency is achieved, this radius is deemed as the optimal observation distance, and the UAV proceeds to perform the circumnavigation at this optimized radius
.
| Algorithm 1 UAV circumnavigation rules |
Input:
Output:
|
- 1:
- 2:
- 3:
while do - 4:
[ Section 4.2] - 5:
if then - 6:
- 7:
- 8:
else - 9:
- 10:
- 11:
end if - 12:
end while
|
This strategy ensures that the images obtained provide a rich visual context of the landmarks from various angles. Subsequently, the image sequences , coupled with the determined camera extrinsic parameters , are incorporated into the NeRF training process. The NeRF network is then re-trained, enabling the generation of a local refined 3D map around the landmark. Optimization of 3D landmark models enhances the fidelity of their textural details, thereby significantly improving the accuracy of pixel color matching for regions of interest in UGV re-localization.
5. Ground Navigation
In this section, we first simplify the UGV system as a differentially flat system, which allows us to circumvent the complexities associated with dynamic constraints during the planning phase. Then, we introduce a UGV trajectory planner designed to formulate a path that minimizes the probability of collision and control effort. Moreover, we propose a landmark-aware state estimator for estimating the 6-DOF pose and velocity of the UGV. This estimator leverages the landmarks identified in
Section 4 for visual re-localization, realizing high-precision pose correction. Finally, we establish an online local trajectory fine-tuning module that synergizes the trajectory planner and the state estimator, resulting in a vision-only UGV autonomous navigation pipeline.
5.1. Differential Flatness
In this section, we introduce a particular property of “differential flatness” into our system to expedite the planning process. This allows the usage of a smaller set of “flat outputs” and their derivatives to represent the inputs and states of such systems. Notably, UGVs are known to be differentially flat, with their position and yaw angle as flat outputs [
40]. The kinematic model of a UGV is shown in
Figure 4, where
denotes the wheelbase of the UGV.
denotes the wheel radius of the UGV.
denotes the steering angle of the UGV. It is worth noting that although this study primarily focuses on addressing the navigation challenges specific to UGVs, the principles of differential flatness can also be applied to other agents if they exhibit certain flat outputs. If system outputs exist that allow all the state and input variables to be expressed through these outputs and their derivatives, the principles of differential flatness can apply to those agents. For instance, in UAV systems, the position coordinates and yaw angle are flat outputs, while the attitude and thrust can be computed from these outputs and their derivatives. Therefore, UAV systems can also be represented using differential flatness principles.
Typically, a trajectory planning pipeline for differential flat systems aims to find polynomial trajectories for the flat outputs to minimize the objective function (such as jerk or snap) subject to waypoint constraints. Such optimization progress can be cast as a quadratic programming problem that can be solved in a computationally efficient manner. Moreover, collision avoidance maneuvers can be seamlessly integrated into this framework. However, to maintain convexity in the optimization problem, navigation strategies through obstacle-laden regions often require pre-defined, hard-coded decisions.
5.2. Trajectory Planner
Following the differential flatness assumption, we propose a trajectory planning approach that takes into account both collision avoidance and control efficiency in this section. Our approach differs from traditional methods as we leverage the NeRF modeling to implicitly represent the environment rather than explicitly describing obstacles in closed form (such as polytopes). Rather than optimizing trajectories only between static waypoints, we employ a denser set of waypoints in NeRF environmental representation that allows for dynamic optimization of waypoint locations. The Adam optimizer is employed as the backbone optimization tool, as in many other trajectory planning studies.
The trajectory planner seeks a set of flat output waypoints
, by following which the combined objective cost can be minimized:
where
is the position component of a differentially flat state
,
is the rotation matrix from the UGV body frame to the world frame,
is a function that returns the distance traveled by the point in the set of the UGV’s point clouds, and
is the diagonal matrix used to penalize the weights of control effort. The first objective seeks to minimize the probability of collision, as defined in (1), and the second seeks to minimize control effort. Note that
,
, and
are functions of the decision variables
, as they are derived from the surrounding waypoints under the constraint of UGV kinematics.
We employ a heuristic search-based path planning algorithm based on A* to generate an initial set of waypoints, starting from the UGV’s current position to the target destination. These waypoints are then refined through an optimization process that utilizes gradient descent to balance multiple objectives, including obstacle avoidance and the minimization of control effort.
Figure 5 illustrates an optimized trajectory that deviates from an initial straight path to circumvent regions of high volume density within the NeRF representation.
5.3. Landmark-Aware State Estimator
When a UGV operates in a GPS-denied environment, its pose estimation error inevitably accumulates with its traversal. Such a phenomenon is also known as the cumulative navigation error problem in SLAM studies, and it might lead to notable deviation from the planned route or excessively cautious path planning decisions. To remedy this issue, one can introduce a loop-closure process to correct the pose estimate. For instance, the UGV can recognize several saliency image features and reference their global positions to carry out loop-closure. Nevertheless, it is not always guaranteed that each image would contain salient features qualified for pose estimation; in cases where an image feature lacks diversity in texture and is difficult to match with its counterpart in the global map, the loop-closure might lead to an even worse pose estimation result, thereby compromising the entire task. To make the loop-closure process more robust and reliable, we propose a NeRF-based re-localization method that capitalizes on navigational landmarks registered in the UAV circumnavigation mission. In addition, we address the challenge of updating the pose belief of the UGV based on given measurements and its most recent control action.
Our method builds upon the work presented in [
17], where an initial pose estimate is optimized by minimizing the photometric loss between the observed image and the NeRF-rendered image. Unlike the single-shot estimators that rely heavily on the initial conditions, our method employs an Extended Kalman Filter (EKF) for state estimation. Specifically, we define the UGV state vector
to include the position, orientation quaternion, linear velocity, and angular velocity. At each timestep, the UGV captures a new image
and executes a control action
. The EKF process begins with the prediction step, where the prior state estimate is propagated through the non-linear kinematic model:
where
represents the UGV’s dynamics. To propagate the system uncertainty, we linearize the non-linear motion model
via a First-Order Taylor Expansion. The Jacobian matrix
is computed with respect to the state vector
at the previous estimate
:
Subsequently, the predicted error covariance matrix
is computed by projecting the previous covariance
through the Jacobian and adding the process noise covariance
:
Following this prediction, the update step is performed. Unlike traditional filters that use explicit geometric measurements, we utilize the differentiable nature of NeRFs. We compute the gradient of the photometric residual between the observed image and the rendered image at the predicted state via backpropagation. This gradient serves as the observation Jacobian to update the state and covariance , effectively fusing the NeRF map prior with real-time visual data.
The original iNeRF algorithm utilizes an L2 loss function to measure the photometric loss, which has a tendency to uniformly smooth illumination gradients. While suitable for stable indoor lighting, this approach may obscure critical details under varying outdoor conditions. To solve this issue, we design a variant of Charbonnier loss that weakens its sensitivity to light by introducing a bias term:
where
is the observed image,
is the rendered image acquired via the equipped NeRF model,
is a time-dependent parameter,
is an empirical parameter and is set to
in this study,
is total traverse odometry after the UGV reaches the viewing site, and
is the observation radius of the landmarks. Equation (6) is the same as the standard L1 loss when the UGV just arrives at the landmark viewing site, and it grows to
as the UGV approaches the target landmark. This modified loss function enhances sensitivity to gradient changes, improving robustness against rendering inconsistencies caused by illumination variations. Additionally, the Charbonnier loss offers a smoother transition near zero compared to the L2 loss, which preserves fine details more effectively, particularly at image edges. These attributes render the Charbonnier loss highly suitable for an outdoor navigation context.
As in [
18], we select a series of pixels
for evaluation using existing image feature detectors (e.g., ORB) to identify areas of interest and bias sampling around regions with higher gradient information. The pose of the robot
can be constructed from the position and rotation elements of
. With this information, the cost function to be minimized is defined as
where
represents the covariance of measurement noise and the symbol
denotes the weighted
norm. Minimizing this equation gives the updated mean
. Outlier rejection is performed for each pixel loss to reduce variance.
Finally, we employ the known relationship between the Hessian of the Gaussian loss function and the covariance [
41] to generate the posterior covariance,
5.4. Online Replanning
We combine the trajectory planner from
Section 5.2 and the state estimator from
Section 5.3 to formulate an online replanning strategy. Such a strategy requires an initial state belief of the UGV pose, a predefined target destination, a set of pre-selected landmarks collected by the UAV, and a pre-trained NeRF map.
The online replanning module works in a call-on-demand manner in the path planning system. Concretely, the path planning system for a UGV is initiated by dense waypoints that are generated by the algorithm proposed in
Section 5.2. Once the UGV arrives in the vicinity of a landmark labeled by the UAV along its path, it calls the online replanning module to fine-tune its later path: it first updates its state belief via the landmark-aware State Estimator (
Section 5.3). The mean of the updated posterior is then utilized as the revised initial state within the trajectory planner. The trajectory is then re-optimized using updated UGV states, ensuring that the remaining path is collision-free and energy efficient. Meanwhile, the UAV continues to track the rest of the previous waypoints for registering more landmarks. This process is repeated in an iterative manner until the UGV successfully reaches the predetermined destination. This approach enables the UGV to dynamically update its trajectory plans by mitigating the impact of distractions, cumulative navigational error, and insufficient environmental perception capability, thereby enhancing the efficacy and flexibility of autonomous navigation.
6. Simulations and Experimental Trial
In this section, we evaluate the proposed air–ground collaborative framework in both a high-fidelity hazardous simulated environment (Stonehenge) and a real-world environment. We selected the Stonehenge environment for simulation because it represents a typical unstructured scenario featuring irregular geometric obstacles, severe occlusions, and rich high-frequency textures. Compared to simple flat terrains or structured urban environments, these characteristics pose significantly greater challenges for NeRF reconstruction and visual localization, making it a rigorous benchmark for evaluating our algorithm’s robustness. The experimental analysis is organized as follows. We first demonstrate that high-quality NeRF models can be reliably constructed from UAV-acquired aerial imagery. Next, we validate the proposed landmark selection strategy and analyze the effectiveness of local NeRF map pruning. We then conduct comparative studies to assess the performance of the trajectory planner. In addition, we evaluate the accuracy of the landmark-based state estimator. Finally, we extend the evaluation to real-world experiments to verify the framework’s robustness under natural lighting variations, sensor noise, and imperfect textures, followed by an assessment of its capability for online replanning.
6.1. NeRF Representation Using Aerial Images
The fidelity of the NeRF model in representing the environment is crucial to the success of UGV navigation, as its obstacle perception is entirely contingent upon the volume density derived from the NeRF output. Consequently, to ascertain whether the NeRF-based scene reconstruction suffices for UGV navigation demands or not, we set a comparative experimental setup against the conventional NeRF reconstruction using 360-degree environment images. To simulate the UAV aerial photography process, we manually create a flight trajectory within the Blender 5.0 software for environmental free-exploration. The UAV (denoted by a camera in Blender) follows the predefined trajectory, capturing environmental images.
We use an off-the-shelf PyTorch implementation of Instant-NGP as the 3D reconstruction method. We set up three different camera layout schemes for the Stonehenge environment: (a) 360-degree panoramic coverage, (b) a simple camera (UAV) motion trajectory, and (c) a complex camera (UAV) motion trajectory. 150 images were acquired separately using each scheme to produce the datasets. These three sets of multi-view image datasets are then fed into torch-ngp for training respectively. The qualitative analysis is shown in
Figure 6, with quantitative analysis detailed in
Table 2. The qualitative analysis indicates that a simple camera (UAV) motion trajectory often falls short in achieving comprehensive scene coverage. Within regions with rich texture details, numerous artifacts have emerged, leading to a scene reconstruction failure. The incorporation of such NeRF models into UGV navigation tasks is further complicated by the fact that these artifacts contribute erroneous volume density information. This spurious scene representation can significantly impact the UGV’s ability to perceive and avoid obstacles. In contrast, a circumnavigating UAV trajectory has been tailored by taking into account the flight kinematic and FoV constraints (the complex trajectory in
Figure 6). Comparative analysis with conventional 360-degree panoramic layouts reveals that the image dataset established along the circum-navigating trajectory outperform the 360-degree panorama in terms of Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) metrics. Qualitative analysis demonstrates that images captured by the UAV following a tailored circumnavigating path can effectively mitigate the artifacts arising from ray sampling. In this study, we use the circumnavigating trajectory as a template that is scalable to the scene size, the purpose of which is to provide an accurate NeRF representation for the UGV to carry out re-localization and path planning.
6.2. Landmark Selection and Map Pruning
We set the threshold for determining whether an object in the environment can be classified as a landmark to be no less than 40% of the salient region and the number of feature points in the salient region is greater than 300.
Figure 7 shows the results of the landmark saliency evaluation for six different sets of images, where the objects in the first two images satisfy the requirement and are selected as landmarks.
Furthermore, we conduct an evaluation to ascertain the efficacy of our proposed ‘sparse-to-frequent’ image detection strategy. As is shown in
Table 3, a comparative analysis is performed between the number of landmarks identified and the computational time expended by three distinct strategies: low-frequency, high-frequency, and our ‘sparse-to-frequent’ strategy, all applied to the same aerial image sequence. The comparative results indicate that our proposed strategy is capable of detecting more landmarks and, in the meantime, consumes fewer computational resources.
We embed images captured by the UAV around landmarks, along with their respective 5D coordinates, into the existing NeRF model for re-training. As shown in
Table 4, the re-trained NeRF model exceeds the original model in all reconstruction accuracy metrics, realizing local map repair for UGV navigation scenes, and also improving the accuracy of visual localization of UGVs using landmarks, which is then verified in
Section 6.4.
6.3. Trajectory Planner
The study in [
18] has demonstrated that real-world collisions can be well proxied by utilizing the volume density of a NeRF and verified that a UAV can efficiently plan a smooth and collision-free trajectory. We extend this approach to verify the trajectory planning capabilities of a UGV within a hazardous off-road environment. Unlike a UAV, a UGV is subject to more stringent environmental constraints, presenting a heightened challenge for navigation within a NeRF representation.
We compare two widely used path planning techniques in Stonehenge’s NeRF environment representation: the Dijkstra and the Rapid Search Random Tree (RRT). Dijkstra is a classical method for determining the shortest path in a graph. It leverages the concept of breadth-first traversal to calculate the minimum path length between a starting node and all other nodes; RRT is a method based on state-space sampling. It initiates by randomly selecting a point within the search space to serve as a potential node. We selected these classical geometric planners as baselines specifically to isolate and evaluate the quality of the NeRF-based environmental representation. While modern kinodynamic or learning-based planners could offer superior motion efficiency, our primary objective here is to verify that the NeRF-derived density field provides sufficient and accurate geometric information to support valid path generation. Subsequently, it constructs a connection between the root node and this newly identified node, simultaneously assessing for any potential collisions with the surrounding obstacles. RRT requires a binary collision metric, so we first convert the NeRF into a mesh. In order to extract the control actions needed to track the generated trajectories, here we use a differential flatness controller for all the planning algorithms.
To evaluate the proposed trajectory planner’s performance, we run all three methods (Ours, Dijkstra, and RRT) on eight different trajectories with a set of obstacles in the Stonehenge environment.
Figure 8a shows the results of the qualitative analysis of the three different trajectory planning methods.
Figure 8b shows the start and end points of the 8 trajectories we set up and labels the failure cases.
Figure 9 shows the comparative analysis of the mean collision and control costs for the three trajectory planners, along with their respective failure rates in this experiment. In this context, a failed case indicates that the agent collides with obstacles during traversal.
6.4. Landmark-Aware State Estimator
To verify the performance of our proposed landmark-aware UGV state estimator, we conduct ablation studies under the Stonehenge environment. We utilize iNeRF (with a dynamic prior) and NeRF-Navigation as baselines. Then, we apply the improved Charbonnier loss and the landmark-aware localization method to the state estimator, respectively. Additionally, to demonstrate the effectiveness of the vision-based approach, we also compare it with the traditional LiDAR-based SLAM algorithm A-LOAM in terms of pose estimation accuracy. To ensure a fair comparison using the same sensor modality, the point cloud input for A-LOAM is generated by back-projecting the depth images from the RGB-D camera, rather than using a physical LiDAR. Specific parameter settings for A-LOAM were configured to accommodate the depth camera input (e.g., mapping resolution set to 0.2 m and minimum feature distance to 0.4 m). We evaluate these methods using an identical set of initial states and actions, performing 20 sets of trials for each method and calculating the average error over the entire process. The results of the ablation studies are shown in
Table 5, where A
c and A
l denote the improved Charbonnier loss and the landmark-aware localization method. Although the translation error, angular velocity error, and velocity error of our landmark-aware localization module are higher than that of NeRF-Navigation, the performance is remarkably improved when it is integrated with the modified Charbonnier loss. The proposed method achieves a competitive improvement in terms of pose estimation accuracy, where the rotation error is reduced by 28.9%, the translation error is reduced by 26.7%, rotation error is reduced by 12.5%, and velocity error is reduced by 23.1%, respectively. Furthermore, by integrating the landmark-aware localization method into A-LOAM, the rotational error is reduced by 28.6% and the translational error is reduced by 25.0%. This suggests that these two modules are highly complementary in the state estimation process. From the comparison of pose estimation accuracy with A-LOAM, it is evident that the vision-based approach still exhibits a certain gap compared to the LiDAR-based method. However, the vision-only method not only outperforms LiDAR-based methods in terms of mapping quality (as the reconstruction results shown in
Figure 10) but also offers advantages in terms of lightweight design and cost-effectiveness.
6.5. Online Replanning
We evaluate the performance of the entire UGV navigation framework in the Stonehenge scene. In our study, we simulate the influence of the ground environment on the UGV by incorporating identical noise characteristics into the finite difference equation of the UGV as those utilized in the state estimator experiments. We set up the same eight sets of starting and terminate location as in the trajectory planner experiment to verify the performance of online replanning. Although the online replanning module requires higher control costs to achieve collision-free trajectory planning, it can effectively prevent UGV collisions caused by open-loop execution along the initial path. Exemplary paths with and without loop-closure treatment under the same origin and destination are shown in
Figure 11. An illustrative case further shows that the proposed online replanning mechanism would form a timely updated sensing of the obstacles (right picture in
Figure 11), thereby avoiding a collision between the UGV and a pillar that occurred in regular cases, demonstrating its advantages in terms of flexibility and robustness.
6.6. Real-World Experiments
To evaluate the performance of our proposed method in real-world environments, we conduct extensive experiments in a laboratory scenario. In the experiments, a quadrotor and an Ackermann-steering mobile robot are used, as shown in
Figure 12. The NVIDIA Jetson Orin NX serves as the onboard computing platform for the UGV, running all the navigation algorithms. For environmental perception, the UAV is equipped with a self-stabilizing camera that captures 1920 × 1080 pixel images at 30 fps, while the UGV is fitted with a RealSense D435i RGB-D camera to provide high-quality RGB imagery for visual localization and scene understanding. The 3D reconstruction process leveraging NeRF is implemented on a ground station equipped with an Intel Core i9-13900KF CPU and an NVIDIA RTX 4090 GPU. High-precision motion tracking was achieved using a NOKOV optical motion capture system, providing millimeter-level ground truth data for quantitative evaluation of visual localization accuracy.
In our experiments, five different combinations of starting and ending points were configured as shown in
Figure 13. The UGV was required to perform safe obstacle-avoidance navigation based on the initial NeRF map generated from the UAV’s exploration and the extracted visually salient landmarks. The outcomes of UAV pre-exploration and UGV autonomous obstacle-avoidance navigation are illustrated in
Figure 14. The quantitative results summarized in
Table 6 provide strong evidence that the proposed AGC-NeRF framework maintains stable performance in the real-world environment. In terms of aerial NeRF reconstruction, the UAV-acquired image set yields a PSNR of 28.26, an SSIM of 0.91, and an LPIPS of 0.11, demonstrating that the reconstructed radiance fields preserve both the structural regularity and perceptual fidelity of the laboratory environment. For the UGV navigation component, the proposed landmark-aware state estimator achieves competitive estimation accuracy, with an average rotation error of 0.025 rad, translation error of 0.030 m, angular velocity error of 0.033 rad/s, and linear velocity error of 0.030 m/s. These results indicate that AGC-NeRF effectively suppresses cumulative drift, a well-known limitation of monocular visual odometry, by incorporating the proposed saliency-driven landmark selection and the modified Charbonnier photometric loss. Even when subjected to texture-deficient regions or brightness variations, the localization pipeline exhibits no performance collapse, validating the reliability of its re-localization strategy under real-world sensory ambiguities.
6.7. Timing and Performance
We run these experiments on a computer with an Intel Core i9-13900KF @ 3.0 GHz CPU and Nvidia RTX 4090 GPU. The NeRF reconstruction and rendering time depends on the number of images and the batch size. The computation time of both the trajectory planner and the state estimator depend on the number of iterations. In the initialization phase, the initial trajectory needs to utilize more than 20 s to perform more than 2000 iterations. The state estimator typically consumes 25 s to perform 300 steps of gradient descent iterations. It is crucial to clarify that our online replanning strategy operates in a “call-on-demand” manner rather than a high-frequency real-time control loop. The UGV relies on onboard visual odometry for continuous local control, while the computationally intensive NeRF-based re-localization is triggered only when the UGV reaches the vicinity of a landmark. This “stop-and-compute” latency is acceptable for exploration tasks in static, hazardous environments where safety supersedes speed.
7. Conclusions
In this paper, we propose a novel air–ground collaborative exploration framework, which enables a UGV to navigate autonomously in unknown environments with the assistance of a UAV through a NeRF-based scene representation. The framework consists of two core components: UAV-assisted mapping and ground navigation, forming a closed-loop collaborative mechanism tailored for unknown environment exploration. The UAV serves as a scene rehearsal director that provides both landmark and initial map for the UGV, and the UGV further benefits from referencing the rehearsed scenes for achieving flexible and accurate path planning under complicated unknown environment. Simulation and experiment results show that our proposed method can efficiently reconstruct high-fidelity NeRF models with detailed texture information even in unknown, cluttered scenarios. Furthermore, the integration of the proposed saliency-driven landmark selection method and improved photometric loss function significantly enhances the UGV’s state estimation accuracy, ensuring robust navigation performance in GPS-denied, texture-sparse, or obstacle-rich unknown environments. This research provides a practical solution for breaking through the technical bottlenecks of single-vehicle exploration in unknown environments, offering valuable technical support for high-risk missions such as post-disaster search and rescue, remote wilderness surveying, and underground facility detection.
Nevertheless, the proposed AGC-NeRF still has limitations: it heavily relies on high-performance hardware (e.g., discrete GPUs at a ground station) for NeRF reconstruction and optimization and lacks adaptability to large-scale unknown environments. In future research, we will focus on lightweighting the entire navigation pipeline to reduce hardware dependencies and expanding the experimental validation from controlled laboratory settings to large-scale outdoor field trials. This will allow us to rigorously test the system’s scalability and robustness under uncontrolled lighting and terrain conditions.