Article

Cooperative Air–Ground Perception Framework for Drivable Area Detection Using Multi-Source Data Fusion

1 Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
2 University of Science and Technology of China, Hefei 230052, China
3 Nanjing Polytechnic Institute, Nanjing 211548, China
4 Anhui Engineering Laboratory for Intelligent Driving Technology and Application, Hefei 230088, China
5 Jianghuai Advance Technology Center, Hefei 230088, China
* Author to whom correspondence should be addressed.
Drones 2026, 10(2), 87; https://doi.org/10.3390/drones10020087
Submission received: 14 December 2025 / Revised: 24 January 2026 / Accepted: 26 January 2026 / Published: 27 January 2026

Highlights

What are the main findings?
  • We propose a novel three-stage cooperative air–ground perception framework that synergistically integrates UAV topological reasoning, robust cross-view semantic localization, and adaptive multimodal fusion for robust drivable area detection.
  • The framework introduces three key innovations: a topology-aware segmentation network (DynCoANet), a semantic-enhanced particle filter for precise alignment, and a distance-adaptive fusion transformer (DAFT) for confidence-aware feature fusion.
What are the implications of the main findings?
  • This work establishes that the tight co-design of perception, localization, and fusion modules is essential for autonomous systems to achieve robustness against occlusions and sensor limitations in complex environments.
  • The presented framework provides a practical and effective solution for enhancing the safety and reliability of unmanned ground vehicles in challenging real-world transportation scenarios, such as unstructured roads and unregulated intersections.

Abstract

Drivable area (DA) detection in unstructured off-road environments remains challenging for unmanned ground vehicles (UGVs) due to limited field-of-view, persistent occlusions, and the inherent limitations of individual sensors. While existing fusion approaches combine aerial and ground perspectives, they often struggle with misaligned spatiotemporal viewpoints, dynamic environmental changes, and ineffective feature integration, particularly at intersections or under long-range occlusion. To address these issues, this paper proposes a cooperative air–ground perception framework based on multi-source data fusion. Our three-stage system first introduces DynCoANet, a semantic segmentation network incorporating directional strip convolution and connectivity attention to extract topologically consistent road structures from UAV imagery. Second, an enhanced particle filter with semantic road constraints and diversity-preserving resampling achieves robust cross-view localization between UAV maps and UGV LiDAR. Finally, a distance-adaptive fusion transformer (DAFT) dynamically fuses UAV semantic features with LiDAR BEV representations via confidence-guided cross-attention, balancing geometric precision and semantic richness according to spatial distance. Extensive evaluations demonstrate the effectiveness of our approach: on the DeepGlobe road extraction dataset, DynCoANet attains an IoU of 61.14%; cross-view localization on KITTI sequences reduces average position error by approximately 10%; and DA detection on OpenSatMap outperforms Grid-DATrNet by 8.42% in accuracy for large-scale regions (400 m × 400 m). Real-world experiments with a coordinated UAV-UGV platform confirm the framework’s robustness in occlusion-heavy and geometrically complex scenarios. This work provides a unified solution for reliable DA perception through tightly coupled cross-modal alignment and adaptive fusion.

1. Introduction

Unmanned ground vehicles (UGVs) operating in unstructured off-road environments require robust perception systems to identify drivable areas (DAs) under dynamically challenging conditions. DA detection is critical for safe navigation in terrains lacking clear boundaries, where vegetation, terrain irregularities, and occlusions create ambiguities. Traditional single-sensor approaches exhibit inherent limitations: LiDAR-based methods provide precise 3D terrain modeling through Bayesian generalized kernel inference and normal vector estimation, yet still suffer from sparse data and noise in low-reflectivity or occluded regions [1,2], while camera-based semantic segmentation struggles with illumination sensitivity and environmental variability. Though multimodal fusion frameworks combining LiDAR and cameras have been explored, persistent challenges arise from misaligned spatiotemporal perspectives between ground and aerial platforms, dynamic environmental changes, and suboptimal feature fusion mechanisms, particularly at intersections or under long-range occlusion scenarios [3].
Existing ground-based fusion systems often fail to resolve road connectivity ambiguities caused by discontinuous observations from a single viewpoint. Recent aerial–ground collaborative systems, such as the UAV-UGV frameworks of [4,5], leverage cross-view 2D/3D point cloud matching to construct dense global risk maps, but their probabilistic fusion methods struggle with real-time adaptability to dynamic obstacles like moving vegetation. Meanwhile, large-scale satellite datasets like [6] provide high-resolution (0.15 m/pixel) instance-level annotations aligned with autonomous driving benchmarks, yet their offline nature limits direct application to real-time UGV navigation. Current fusion strategies inadequately address dynamic region-of-interest (ROI) perception, as static fusion weights cannot adapt to rapidly changing environments. For instance, Zhong et al. [7] proposed a LiDAR texture-based method leveraging bird’s-eye view (BEV) projection and multi-frame fusion to mitigate dust noise, yet its dependency on single-modality LiDAR limits adaptability to perspective variations between ground and aerial sensors. Furthermore, Xu et al. [8] demonstrated a dual-projection strategy based on elevation and range maps with Bayesian fusion for rough terrains, but their ground segmentation relies on handcrafted features vulnerable to dynamic obstacles. Conventional methods also lack mechanisms to handle directional dependencies at critical navigation zones like intersections, where path continuity must be inferred from fragmented sensor inputs, as highlighted in recent architectural analyses of drivable area estimation [9,10,11]. While existing collaborative systems have demonstrated potential, they often suffer from a fundamental limitation: the decoupled design of perception, localization, and fusion modules. This isolation leads to cascading errors: inaccurate aerial segmentation corrupts the semantic prior, misaligned cross-view localization introduces projection artifacts, and static fusion rules fail to reconcile conflicting information, ultimately undermining robustness in precisely the complex, occlusion-prone scenarios where collaboration is most needed.
To break this bottleneck, we propose a paradigm shift towards a tightly-coupled perception-localization co-design framework. Unlike sequential pipelines, our approach establishes a synergistic loop where each component informs and refines the others. This is realized through three interconnected innovations: (1) DynCoANet generates topologically-consistent road maps from UAV imagery through directional strip convolutions and connectivity attention, providing a geometrically meaningful prior rather than mere pixel-wise labels, which is crucial for subsequent geometric alignment under occlusion. (2) An enhanced particle filter leverages this rich semantic prior not just as a matching template, but as a probabilistic motion constraint, directly coupling semantic understanding with state estimation to achieve robust cross-view localization under vegetation and intersection occlusion. (3) The distance-adaptive fusion transformer (DAFT) dynamically resolves the perception–localization duality: it does not merely fuse features but explicitly models and compensates for remaining localization uncertainty and sensor-specific reliability degradation with distance, enabling a graceful performance transition from LiDAR-dominant near-field to UAV-semantics-dominant far-field.
Collectively, these contributions move beyond incremental improvements in individual tasks. They constitute a unified three-stage cooperative air–ground perception framework where aerial topology reasoning guides and is validated by ground-level geometric perception, effectively closing the loop between global semantics and local geometry. This work demonstrates that the tight co-design of perception and localization is not only beneficial but essential for achieving reliable autonomous navigation in unstructured, occlusion-heavy environments.

2. Related Work

2.1. Semantic Segmentation

The evolution of road extraction has transitioned from pixel-level classification to structural topology inference. Early approaches predominantly framed the task as semantic segmentation using encoder–decoder architectures. While convolutional networks achieved high per-pixel accuracy, their segmentation-centric paradigm inherently disregarded global connectivity, necessitating morphological post-processing that often introduced over-segmentation artifacts. Recent advances like DSMSA-Net [12] incorporated multi-scale attention units to enhance feature representation but failed to bridge the fundamental disconnect between local predictions and global topology. This limitation motivated graph-based methods that explicitly model road networks through iterative construction. Bastani et al. [13] pioneered CNN-guided graph expansion with RoadTracer, while Xu et al. [14] enhanced robustness in RNGDet via multi-scale backbone features and instance-aware segmentation heads. Although these graph-driven approaches demonstrate superior topological metrics compared to segmentation methods, their sequential inference incurs substantial computational costs and remains vulnerable to occlusions such as vegetation.
Attention mechanisms have been increasingly adopted to capture long-range road dependencies. Zhou et al. [15] combined ResNet encoders with dilated convolutions to balance receptive fields, though fixed dilation rates limited geometric adaptability. Dai et al. [16] employed deformable attention for curved road modeling at increased computational cost. Mei et al. [17] utilized directional strip convolutions but encountered difficulties with abrupt orientation changes. Collectively, while advancing directional awareness, these methods present a clear trade-off between computational efficiency and dynamic geometric adaptation.
Ultimately, preserving end-to-end topological continuity persists as the field’s central challenge. He et al. [18] proposed keypoint detection with greedy linking, yet still required complex post hoc refinement. This exemplifies how prevailing approaches predominantly rely on separate post-processing stages rather than end-to-end topological learning, severely limiting their applicability in occlusion-heavy scenarios. The reliance on auxiliary refinement stages underscores a persistent gap in achieving inherent topological learning. While recent methods have made strides in capturing directional dependencies or graph-based topology, they largely remain within a segmentation-centric paradigm that treats road extraction as an isolated task. In contrast, our DynCoANet is designed with cross-modal collaboration in mind: by generating topologically consistent road maps with explicit geometric meaning, it provides a structured prior that is inherently suitable for subsequent geometric alignment in UAV-UGV systems. This represents a departure from methods that output mere pixel-wise labels, as our segmentation explicitly serves the downstream need for robust cross-view localization under occlusion.

2.2. Cross-View Localization

Cross-view localization, centered on particle filter-based state estimation, has evolved to address multi-modal fusion in complex environments. While traditional Monte Carlo localization frameworks achieved early success indoors, they encountered significant scalability challenges in large-scale urban settings [19]. A paradigm shift occurred with the integration of semantic information into particle filters, pioneered by Miller et al. [20], who enabled cross-modal matching between LiDAR point clouds and aerial imagery via truncated distance fields. This innovation markedly improved viewpoint invariance but was fundamentally constrained by computational inefficiency and limited scale adaptability.
Recent progress has focused on hybrid geometric-semantic representations to refine cross-view alignment. Guo et al. [21] introduced a reward–penalty mechanism for satellite-to-ground semantic matching, achieving sub-meter accuracy through probabilistic correlation. However, its dependence on a fixed-scale assumption and dense GPS priors restricts its applicability in GPS-denied scenarios. Complementary research by Li et al. [22] on multi-robot map fusion demonstrated the utility of Gaussian processes for heterogeneous data integration, though their method lacked the real-time capability essential for mobile platforms.
Scale ambiguity remains a critical challenge in aerial–ground registration. Learning-based strategies, such as that proposed by Barsan et al. [23], utilize LiDAR intensity maps for scale estimation to achieve centimeter-level precision, albeit at the cost of relying on expensive sensor suites. This line of work highlights the ongoing effort to balance precision, cost, and generalizability in scale estimation. Maintaining particle diversity in high-dimensional state spaces is another persistent issue. While systematic resampling [24] effectively mitigates particle degeneracy, it inadvertently induces artificial variance reduction. This creates a need for more sophisticated diversity preservation mechanisms beyond fixed-threshold strategies, which are prone to premature convergence in complex environments like urban canyons. Concurrently, robust estimation theory has been integrated into particle filters to manage sensor outliers. The M-estimation framework builds upon seminal work in consistency maximization [25]. Extending these concepts to semantic feature space is crucial, as conventional robust kernels like the Huber loss can be suboptimal for handling outliers from transient semantic classes, pointing to a need for more semantically-aware robust kernels. Existing cross-view localization methods typically treat semantic information as a passive matching cue or a static constraint. Our work fundamentally rethinks this relationship: the enhanced particle filter actively utilizes the topological prior from DynCoANet not merely as a template, but as a probabilistic motion constraint that dynamically couples semantic understanding with state estimation. This tight integration is particularly crucial for UAV-UGV cooperation, where persistent occlusion and viewpoint variations require localization to be semantically aware and adaptive, rather than relying solely on geometric correspondences that may be sparse or ambiguous.

2.3. Drivable Area Detection

The adoption of BEV representation has fundamentally transformed multi-sensor fusion in autonomous driving. BEVFusion [26] pioneered this paradigm with a unified camera-LiDAR fusion framework via deformable BEV pooling, setting a benchmark for spatial feature integration. Subsequent work, such as SatForHDMap [27], extended this concept by employing hierarchical transformers to align vehicle perspectives with satellite maps, thereby enhancing environmental awareness. However, these foundational approaches exhibit key limitations: BEVFusion’s static fusion strategy fails to account for the distance-dependent sparsity of LiDAR point clouds, while SatForHDMap’s reliance on rigid coordinate transformations cannot adapt to dynamic localization errors. Consequently, their inability to perform context-aware, dynamic sensor weighting significantly hinders robustness in complex operational environments.
Transformer-based architectures have become the standard for modeling cross-modal interactions within the BEV space. BEV-guided multi-modality fusion [28] advanced this direction through a position-aware attention mechanism, improving coherence across LiDAR, camera, and radar data. Building on this, Grid-DATrNet [29] incorporated global attention to better handle occlusions. Despite these improvements, a common shortfall of such architectures is their sensor-agnostic formulation, which overlooks the need for modality-specific confidence estimation. While attempts like distance-sensitive masking in SatForHDMap have been made, their reliance on fixed thresholds remains inadequate for complex, dynamic occlusion patterns. Thus, current attention mechanisms continue to grapple with three intertwined challenges: dynamic confidence weighting, occlusion-resilient feature selection, and the effective preservation of semantic–geometric complementarity across modalities.
LiDAR-centric approaches offer superior geometric precision but suffer catastrophic degradation at range due to point cloud sparsity. Conversely, camera-based solutions like HDMapNet [30] provide rich semantic information but are acutely vulnerable to illumination changes and projective distortions. Hybrid paradigms, exemplified by PointPainting [31], attempt to bridge this gap via early fusion. Yet, their use of rigid fusion rules often introduces spatial misalignments and fails to fully exploit cross-scale modality complementarity. This landscape underscores a persistent and critical research gap: the lack of an adaptive fusion architecture capable of dynamically synthesizing the geometric fidelity of LiDAR with the semantic richness of camera data, while intelligently compensating for the inherent degradation patterns of each sensor. Current fusion strategies for drivable area detection often employ static rules or fixed attention mechanisms that cannot adapt to the complementary degradation patterns of heterogeneous sensors. Our distance-adaptive fusion transformer (DAFT) addresses this limitation by explicitly modeling the distance-dependent reliability of each modality and dynamically arbitrating between LiDAR precision and UAV semantics. This design choice is especially well-suited for UAV-UGV cooperation, as it formalizes the intuitive notion that ground sensors dominate near-field perception while aerial semantics become increasingly valuable at longer ranges, enabling graceful performance transition across the entire operational domain.

3. Materials and Methods

3.1. Framework

The proposed collaborative perception framework is designed to achieve robust drivable area detection in unstructured environments by establishing a synergistic loop between aerial semantic understanding and ground-level geometric perception. As illustrated in Figure 1, the system operates through three core, interconnected stages: (1) UAV-based road topology extraction, (2) enhanced cross-view localization, and (3) reliability-aware multimodal fusion. This architecture enables UGVs to overcome the limitations of single-platform perception by dynamically integrating complementary information from coordinated aerial and ground perspectives.
In the first stage, aerial imagery captured by a UAV is processed to generate a global semantic map of road structures. This stage focuses on extracting topologically consistent road networks from an overhead view, which provides a prior map that is inherently resilient to ground-level occlusions. The output serves as a semantic and structural prior for the entire system. The second stage establishes precise spatial correspondence between the aerial semantic map and the UGV’s local LiDAR point cloud. Utilizing an enhanced particle filtering approach, this component solves the cross-view localization problem by aligning the heterogeneous data sources. It addresses key challenges such as scale ambiguity and sensor noise, thereby enabling accurate projection of the aerial prior into the UGV’s ego-centric coordinate frame. The third stage performs the core fusion task. It dynamically integrates the projected aerial semantic features with the geometrically precise BEV features derived from ground LiDAR. This fusion is guided by a confidence-aware mechanism that evaluates and balances the reliability of each modality across different spatial contexts, ultimately producing a refined and robust drivable area map for navigation. Collectively, these stages form a closed-loop perception system. The aerial topology guides and refines local perception, while the ground-level observations continuously validate and enhance the global semantic understanding. This collaborative design ensures robust performance in complex, occlusion-heavy scenarios where single-source perception is insufficient.

3.2. Stage One: Semantic Segmentation for Topology Extraction

To generate a reliable prior map resilient to ground-level occlusions, we propose the Dynamic Connectivity Attention Network (DynCoANet). As illustrated in Figure 2, DynCoANet addresses three core challenges in aerial road extraction: geometric variability, directional diversity, and fragmented topology. The network comprises a deformable multi-scale encoder, a dynamic directional branch, and a connectivity-aware decoder.
Standard convolutional kernels, with their fixed sampling grids, struggle to model the curved and non-rigid shapes of roads. To enable geometric adaptation, we augment a ResNet-101 backbone with deformable convolutions at intermediate stages. For a given input feature map $F_{in}$, the deformable convolution learns a set of 2D offsets $\Delta p \in \mathbb{R}^{H \times W \times 2N}$, where $N$ is the number of sampling points (e.g., $N = 9$ for a $3 \times 3$ kernel), shifting the sampling locations to align with road structures:

$$F_{def}^{(s)} = \sum_{k=1}^{N} w_k \cdot F_{in}\left(p + p_k + \Delta p_k\right), \quad s \in \{2, 3\}$$

where $w_k$ and $p_k$ are the pre-defined weight and offset of the $k$-th sampling point in the standard kernel, respectively. A regularization loss $\mathcal{L}_{offset} = \lambda \lVert \Delta p \rVert_2$ is applied to prevent excessive distortion. To capture roads at multiple scales, we enhance the atrous spatial pyramid pooling (ASPP) module by applying deformable convolutions with different dilation rates $r \in \{6, 12, 18\}$, yielding a rich multi-scale contextual feature $F_{ctx}$.
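For concreteness, a minimal PyTorch sketch of this geometric-adaptation step is shown below. It assumes torchvision's DeformConv2d operator and a hypothetical offset-predictor head; it is an illustration of the idea, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableStage(nn.Module):
    """Sketch of a deformable 3x3 stage with offset regularization (assumed design)."""
    def __init__(self, channels: int, lambda_reg: float = 0.1):
        super().__init__()
        # 2 * 3 * 3 = 18 offset channels: one (dx, dy) pair per sampling point of a 3x3 kernel
        self.offset_pred = nn.Conv2d(channels, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        self.lambda_reg = lambda_reg

    def forward(self, f_in: torch.Tensor):
        offsets = self.offset_pred(f_in)                       # learned offsets Δp
        f_def = self.deform_conv(f_in, offsets)                # sampling grid shifted toward road structure
        loss_offset = self.lambda_reg * offsets.pow(2).mean()  # L_offset discourages excessive warping
        return f_def, loss_offset
```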
Road connectivity critically depends on local orientation, especially at intersections. To explicitly model this, we introduce a direction predictor that generates a pixel-wise orientation weight map $W_d$. It takes the geometrically aligned features $F_{def}$ as input and estimates the dominance of $D$ discrete directions (e.g., $D = 8$):

$$W_d = \mathrm{Softmax}\left( C_{1\times1}\left( \mathrm{GeLU}\left( C_{3\times3}\left( F_{def}^{(2,3)} \right) \right) \right) \right), \quad W_d \in \mathbb{R}^{H \times W \times D}$$

These weights then guide a set of directional strip convolutions $\{K_d\}$, as shown in Figure 3. Each $K_d$ is a depthwise separable kernel elongated along a specific direction. The final direction-aware feature $F_{dir}$ is a weighted combination:

$$F_{dir} = \sum_{d=1}^{D} W_d \odot \left( K_d * F_{def} \right)$$

where $\odot$ denotes element-wise multiplication and $*$ denotes depthwise convolution. This design allows the network to adaptively emphasize features along the most salient local road direction.
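A simplified sketch of this direction-aware branch follows. Only two axis-aligned depthwise strips are used as stand-ins for the $D$ oriented kernels, and all layer names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectionAwareBranch(nn.Module):
    """Sketch: pixel-wise direction weights gating a bank of depthwise strip convolutions.
    Only horizontal/vertical strips are shown; the paper uses D (e.g., 8) oriented kernels."""
    def __init__(self, channels: int, num_dirs: int = 2, strip_len: int = 9):
        super().__init__()
        self.direction_predictor = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, num_dirs, kernel_size=1),
        )
        pad = strip_len // 2
        self.strips = nn.ModuleList([
            nn.Conv2d(channels, channels, (1, strip_len), padding=(0, pad), groups=channels),  # horizontal strip
            nn.Conv2d(channels, channels, (strip_len, 1), padding=(pad, 0), groups=channels),  # vertical strip
        ])

    def forward(self, f_def: torch.Tensor) -> torch.Tensor:
        w_d = F.softmax(self.direction_predictor(f_def), dim=1)    # per-pixel direction weights W_d
        f_dir = sum(w_d[:, d:d + 1] * strip(f_def)                  # weighted sum of strip responses
                    for d, strip in enumerate(self.strips))
        return f_dir
```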
To recover a topologically coherent road network from high-level features, we design a decoder built upon connectivity attention (CoA) units. Each CoA unit employs parallel dilated convolutions (rates 1, 3, 5) to gather multi-scale context, followed by a criss-cross attention (CCA) module to capture long-range spatial dependencies crucial for bridging gaps:

$$F_{cca} = \mathrm{CCA}\left( \sum_{r \in \{1,3,5\}} C_{3\times3}^{dil=r}\left( F_{dec}^{l-1} \right) \right)$$

A squeeze-and-excitation block then recalibrates channel-wise importance. To optimally combine the complementary semantic context ($F_{ctx}$) and local direction ($F_{dir}$) features, we introduce a parametric fusion gate $G$:

$$G = \sigma\left( C_{1\times1}\left( [F_{ctx}; F_{dir}] \right) \right), \quad G \in [0,1]^{H \times W \times C}$$

$$F_{fused} = G \odot F_{dir} + (1 - G) \odot F_{ctx} + F_{skip}$$

The gate $G$ learns to dynamically adjust the fusion ratio based on local cues: for instance, relying more on directional features near intersections and more on semantic context in open areas. Finally, a convolutional block attention module (CBAM) further refines the fused features before producing the pixel-wise road probability map. The network is trained with a composite loss function $\mathcal{L}_{total} = \mathcal{L}_{seg} + \alpha \mathcal{L}_{offset} + \beta \mathcal{L}_{connectivity}$, where $\mathcal{L}_{seg}$ is the standard cross-entropy loss, and $\mathcal{L}_{connectivity}$ is a topology-aware loss that penalizes disconnected endpoints.
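The gating step itself reduces to a few lines. The sketch below assumes the two feature maps and the skip connection share the same resolution and channel count.

```python
import torch
import torch.nn as nn

class ParametricFusionGate(nn.Module):
    """Sketch of the gated fusion of contextual (F_ctx) and directional (F_dir) features."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate_conv = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_ctx: torch.Tensor, f_dir: torch.Tensor, f_skip: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_conv(torch.cat([f_ctx, f_dir], dim=1)))  # gate G in [0, 1]
        return g * f_dir + (1.0 - g) * f_ctx + f_skip                        # gated blend plus skip connection
```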

3.3. Stage Two: SE(3) Cross-View Semantic Localization

Building upon the aerial road topology graph $G_{aerial}$ from Stage One, this stage aims to estimate the precise 6-DoF transformation $T_{align} \in SE(3)$ that aligns the UAV’s global semantic map with the UGV’s local LiDAR point cloud $P_{lidar}$. We formulate this as a semantic particle filtering problem, designed to be robust against scale ambiguity, sensor heterogeneity, and environmental dynamics. The detailed workflow of the proposed enhanced particle filter is illustrated in Figure 4.
The state vector for each particle is defined as $x = (p, s)$, where $p = (x, y, \theta) \in SE(2)$ denotes the UGV’s 2D planar pose (projected from the 3D motion), and $s \in \mathbb{R}^{+}$ is a scale factor converting pixel coordinates in the aerial map to meters. This explicit scale parameter is crucial for resolving the inherent scale ambiguity between image pixels and metric LiDAR points. The state transition from time $t-1$ to $t$ incorporates odometry measurements $u_{t-1}$ and is regularized by the aerial road topology. The pose is updated as follows:

$$p_t = p_{t-1} \oplus u_{t-1} + n_p, \quad n_p \sim \mathcal{N}(0, \Sigma_p)$$

where $\oplus$ denotes pose composition in $SE(2)$. Crucially, the process noise covariance $\Sigma_p$ is adaptively inflated when a particle deviates from the road network:

$$\Sigma_p \leftarrow \Sigma_p \cdot \left( 1 + \lambda \cdot \mathrm{TDF}_{road}(p_t) \right)$$

Here, $\mathrm{TDF}_{road}(\cdot)$ is the truncated distance field computed from $G_{aerial}$, and $\lambda$ is a penalty factor. This mechanism suppresses particles drifting into semantically implausible (e.g., off-road) areas, especially during temporary odometry failures.
The scale factor evolves under a log-normal random walk to ensure positiveness and gradual convergence:

$$s_t = s_{t-1} \cdot \exp(n_s), \quad n_s \sim \mathcal{N}(0, \sigma_s^2)$$

The scale uncertainty $\sigma_s^2$ is annealed based on the traveled distance to progressively lock in a stable estimate.
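A minimal NumPy sketch of this prediction step (pose composition, road-aware noise inflation, and the log-normal scale walk) is given below. The tdf_road lookup and the vectorized particle layout are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def propagate_particles(poses, scales, odom, tdf_road, sigma_p, sigma_theta,
                        sigma_s, lam, rng):
    """Sketch of the state transition: SE(2) pose composition with road-aware noise
    inflation and a log-normal scale random walk. `tdf_road(xy)` is assumed to return
    the truncated distance to the aerial road graph for each (x, y) in the input array."""
    # Compose each particle pose (x, y, theta) with the odometry increment (dx, dy, dtheta)
    c, s = np.cos(poses[:, 2]), np.sin(poses[:, 2])
    poses = poses.copy()
    poses[:, 0] += c * odom[0] - s * odom[1]
    poses[:, 1] += s * odom[0] + c * odom[1]
    poses[:, 2] += odom[2]

    # Inflate translational process noise for particles far from the road network
    inflation = 1.0 + lam * tdf_road(poses[:, :2])
    poses[:, 0] += rng.normal(0.0, sigma_p * inflation)
    poses[:, 1] += rng.normal(0.0, sigma_p * inflation)
    poses[:, 2] += rng.normal(0.0, sigma_theta, size=len(poses))

    # Log-normal random walk keeps the map-to-metric scale factor strictly positive
    scales = scales * np.exp(rng.normal(0.0, sigma_s, size=len(scales)))
    return poses, scales
```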
The likelihood of a LiDAR scan $z_t$ given a hypothesized state $x^{[i]}$ is computed by semantically aligning the transformed point cloud with the aerial map. First, each LiDAR point $q_j^l$ (with semantic label $l_j$ from the UGV’s onboard segmentation) is projected into the global frame:

$$q_j^g = s \cdot R(\theta)\, q_j^l + [x, y]^T$$

where $R(\theta)$ is the 2D rotation matrix. The alignment score is then computed using the aerial TDF:

$$\tilde{r}_j = \mathrm{TDF}(q_j^g)$$

To mitigate the influence of dynamic objects (e.g., vehicles, pedestrians) and transient vegetation, we employ a robust, class-dependent kernel $\rho(\cdot)$:

$$r_j = \begin{cases} \tilde{r}_j, & \text{if } l_j = \text{road} \\ \rho(\tilde{r}_j; \tau), & \text{otherwise} \end{cases}$$

where $\rho(r; \tau)$ is a robust kernel with threshold $\tau$. The particle weight is then updated proportionally to the negative exponential of the total robust residuals:

$$\omega_t^{[i]} \propto \omega_{t-1}^{[i]} \cdot \exp\left( -\gamma \sum_j r_j \right)$$

This design ensures that static road features dominate the likelihood, while non-road points contribute without disproportionately penalizing misalignments caused by dynamic objects.
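The corresponding weight update can be sketched as follows, assuming a Cauchy robust kernel (consistent with the experimental setup in Section 4.2) and a hypothetical tdf lookup into the aerial truncated distance field.

```python
import numpy as np

def update_weights(weights, poses, scales, points, labels, tdf, gamma, tau, road_label=0):
    """Sketch of the semantic likelihood: project labeled LiDAR points into the aerial
    frame for each particle, read the road TDF, apply a Cauchy kernel to non-road
    residuals, and reweight. `tdf(xy)` and `road_label` are illustrative assumptions."""
    new_weights = np.empty_like(weights)
    for i, ((x, y, theta), s) in enumerate(zip(poses, scales)):
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        q_global = s * points[:, :2] @ R.T + np.array([x, y])   # project points with pose and scale
        r = tdf(q_global)                                        # alignment residuals
        non_road = labels != road_label
        r[non_road] = 0.5 * tau**2 * np.log1p((r[non_road] / tau) ** 2)  # Cauchy robust kernel
        new_weights[i] = weights[i] * np.exp(-gamma * r.sum())
    return new_weights / new_weights.sum()
```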
To maintain efficiency and robustness, we dynamically manage the particle set. The particle count $N_t$ is adjusted based on the estimated state uncertainty $\Sigma_t$:

$$N_t = \mathrm{clip}\left( N_{base} \cdot \frac{\det(\Sigma_t)^{1/2}}{\det(\Sigma_0)^{1/2}},\; N_{min},\; N_{max} \right)$$

This allocates more particles during high-uncertainty phases (e.g., initial convergence, after occlusion) and fewer during stable tracking.
Particle weights are annealed using a variable exponent $\beta_t$ to control the selectivity of the likelihood function:

$$\tilde{\omega}_t^{[i]} = \left( \omega_t^{[i]} \right)^{\beta_t}, \quad \beta_t = \frac{1}{1 + \exp\left( -\kappa \cdot \mathrm{ESS}_t / N_t \right)}$$

where $\mathrm{ESS}_t = 1 / \sum_{i=1}^{N_t} \left( \omega_t^{[i]} \right)^2$ is the effective sample size. A low $\beta_t$ (high uncertainty) flattens the weight distribution, preserving diversity; a high $\beta_t$ (low uncertainty) sharpens it, accelerating convergence.
Systematic resampling is triggered when $\mathrm{ESS}_t$ falls below a threshold $\mathrm{ESS}_{th}$. The complete workflow is summarized in Algorithm 1, which integrates motion prediction, semantic likelihood evaluation, and the adaptive management strategies described above, outputting the aligned transformation $T_{align}$ for Stage Three.
Algorithm 1: Enhanced Particle Filter for Mobile Robot Localization.
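Since the pseudocode of Algorithm 1 appears as a figure in the published article, the sketch below reconstructs one filter iteration from the equations above. It reuses the hypothetical helpers from the previous sketches, omits the adaptive particle-count adjustment for brevity, and should be read as an illustration rather than the authors' code.

```python
import numpy as np

def enhanced_particle_filter_step(state, odom, scan, maps, params, rng):
    """Sketch of one iteration: motion prediction, semantic likelihood, annealing,
    and ESS-triggered systematic resampling. Helper names are illustrative."""
    poses, scales, weights = state["poses"], state["scales"], state["weights"]

    # 1. Motion prediction with road-aware noise inflation and scale random walk
    poses, scales = propagate_particles(poses, scales, odom, maps["tdf"],
                                        params["sigma_p"], params["sigma_theta"],
                                        params["sigma_s"], params["lambda"], rng)

    # 2. Semantic likelihood evaluation with robust kernel
    weights = update_weights(weights, poses, scales, scan["points"], scan["labels"],
                             maps["tdf"], params["gamma"], params["tau"],
                             road_label=params["road_label"])

    # 3. Annealing: sharpen or flatten weights depending on the effective sample size
    ess = 1.0 / np.sum(weights ** 2)
    beta = 1.0 / (1.0 + np.exp(-params["kappa"] * ess / len(weights)))
    weights = weights ** beta
    weights /= weights.sum()

    # 4. Systematic resampling when particle diversity drops below the threshold
    if ess < params["ess_threshold"] * len(weights):
        idx = systematic_resample(weights, rng)        # assumed helper returning particle indices
        poses, scales = poses[idx], scales[idx]
        weights = np.full(len(idx), 1.0 / len(idx))

    state.update(poses=poses, scales=scales, weights=weights)
    return state, weighted_mean_pose(poses, scales, weights)   # T_align estimate (assumed helper)
```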

3.4. Stage Three: Drivable Area Detection

Integrating UAV semantic maps with LiDAR-based BEV features for autonomous navigation faces three intertwined challenges. Firstly, cumulative projection errors arise from UAV-to-UGV coordinate transformation, combining localization inaccuracies, calibration residuals, and asynchronous motion artifacts. These errors scale with operational distance, violating centimeter-level alignment requirements. Secondly, LiDAR geometric fidelity degrades significantly at long ranges due to quadratic point cloud sparsity and dynamic occlusion, rendering traditional registration methods ineffective. Thirdly, sensor reliability exhibits opposing distance-dependent behaviors: LiDAR provides precise short-range geometry but suffers from semantic ambiguity, while UAV semantics remain consistent at long ranges but lack geometric precision. Traditional fixed-threshold fusion strategies are unable to adapt to this reliability trade-off, resulting in critical performance gaps in transition zones. To address these limitations, we propose a distance-adaptive fusion transformer (DAFT). By dynamically weighting UAV semantic and LiDAR geometric features based on spatial confidence, DAFT achieves robust cross-modal alignment without relying on explicit registration. The framework uses dual-branch position encoding to model uncertainty propagation, enabling context-aware fusion for diverse navigation scenarios, as depicted in Figure 5.
The UAV’s semantic segmentation mask, $F_{\mathrm{UAV}} \in \mathbb{R}^{256 \times 256 \times 5}$, undergoes dimensional alignment via a $1 \times 1$ convolutional layer, expanding its channels to 64, yielding $F_{\mathrm{UAV}} \in \mathbb{R}^{256 \times 256 \times 64}$. Batch normalization follows this convolution to stabilize feature magnitudes across training batches, ensuring numerical consistency during backpropagation. This channel expansion ensures that the semantic probability distributions are preserved while enabling dimensional parity with the LiDAR-derived features. Concurrently, the UGV’s LiDAR point cloud is processed through a PointPillars encoder, which generates BEV features, $F_{\mathrm{LiDAR}} \in \mathbb{R}^{256 \times 256 \times 64}$, capturing geometric properties such as maximum height, point density, and intensity within 0.2 m × 0.2 m grid cells. The PointPillars encoder utilizes cascaded 2D convolutional blocks with kernel sizes [3, 3], [5, 5], and [5, 5], followed by LayerNorm for spatial activation normalization. This dual-branch processing ensures modality-specific feature extraction while maintaining spatial resolution alignment for the subsequent fusion operations.
For positional encoding, we use a multi-frequency sinusoidal projection for LiDAR features. The BEV grid coordinates $(x_i, y_j)$ are encoded as follows:

$$\mathrm{PE}_{\mathrm{LiDAR}}^{(2k)}(x_i) = \sin\left( \frac{x_i}{\lambda_k^{1/64}} \right), \quad \mathrm{PE}_{\mathrm{LiDAR}}^{(2k+1)}(x_i) = \cos\left( \frac{x_i}{\lambda_k^{1/64}} \right)$$

where $\lambda_k \in \{1, 10, 100\}$ meters represents different frequency components, capturing both local geometric details and global positional context. For UAV features, the pixel coordinates $(u_p, v_q)$ are transformed into UGV coordinates using the extrinsic parameters $T$ and $R$, followed by sinusoidal encoding with 32 frequency components. The projection is performed using the camera intrinsics $K$ in the transformation $[x, y, 1]^T = K^{-1} R^T \left( T - [u_p, v_q, 0]^T \right)$, establishing a cross-modal geometric correspondence.
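A short sketch of the coordinate encoding is given below. The interleaving of sin/cos pairs up to the full 64-channel budget is an assumption; only the {1, 10, 100} m wavelength components are taken from the text.

```python
import torch

def sinusoidal_bev_encoding(coords: torch.Tensor, wavelengths=(1.0, 10.0, 100.0)) -> torch.Tensor:
    """Sketch of the multi-frequency sinusoidal positional encoding for BEV grid
    coordinates. The paper interleaves sin/cos pairs up to 64 channels; the exact
    frequency schedule beyond the {1, 10, 100} m components is an assumption here."""
    feats = []
    for lam in wavelengths:
        feats.append(torch.sin(coords / lam))   # small λ captures local detail
        feats.append(torch.cos(coords / lam))   # large λ captures global positional context
    return torch.cat([f.unsqueeze(-1) for f in feats], dim=-1)   # (..., 2 * len(wavelengths))
```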
The fusion of the UAV and LiDAR modalities is guided by a distance-adaptive mechanism, where the weight $\alpha(d)$ is determined by the Euclidean distance $d$ from each BEV grid center to the UGV’s current position. The weighting function and fused positional encoding are defined as follows:

$$\alpha(d) = \sigma\left( 0.1 \times (d - d_0) \right)$$

$$\mathrm{PE}_{\mathrm{fused}} = (1 - \alpha) \cdot \mathrm{PE}_{\mathrm{LiDAR}} + \alpha \cdot \mathrm{PE}_{\mathrm{UAV}}$$

The transition distance $d_0 = 30$ m in Equation (17) is chosen based on the characteristics of typical LiDARs. Statistical analysis of the KITTI dataset shows that within approximately 30 m, the LiDAR point density is sufficient for reliable geometric feature extraction. Beyond this range, point density decays quadratically, leading to noisier geometric cues. Concurrently, UAV semantic features from imagery remain confident at this distance. Thus, $d_0 = 30$ m enables a smooth transition from LiDAR-dominant near-field perception to UAV-semantics-augmented far-field perception, aligning the fusion strategy with sensor physics. This function ensures that, for distances less than 30 m, the encoding is dominated by the LiDAR features, while for distances greater than 30 m, the encoding is increasingly influenced by the UAV features. This approach smoothly transitions between the two modalities based on their respective reliabilities, allowing for dynamic adjustments during training.
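The blending itself is a one-line operation once the per-cell distance map is available, as the sketch below illustrates; tensor shapes are assumptions made for clarity.

```python
import torch

def distance_adaptive_fusion(pe_lidar: torch.Tensor, pe_uav: torch.Tensor,
                             dist: torch.Tensor, d0: float = 30.0, k: float = 0.1) -> torch.Tensor:
    """Sketch of the distance-adaptive blending α(d) = σ(k·(d − d0)).
    Assumed shapes: pe_lidar and pe_uav are (H, W, C), dist is (H, W) in meters."""
    alpha = torch.sigmoid(k * (dist - d0)).unsqueeze(-1)   # ≈0 near the vehicle, →1 far away
    return (1.0 - alpha) * pe_lidar + alpha * pe_uav        # LiDAR-dominant near, UAV-dominant far
```

The sigmoid keeps the transition smooth and differentiable, so the effective crossover point can still shift during training rather than acting as a hard range cutoff.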
The proposed architecture processes fused features through a multi-head self-attention mechanism to enhance spatial understanding of integrated representations. Each attention head computes contextual relationships using the formulation:
$$\mathrm{Head}_i = \mathrm{Softmax}\left( \frac{Q_i K_i^T}{\sqrt{D}} + B_{\mathrm{rel}} \right) V_i$$

$$Z^{(l+1)} = \mathrm{LayerNorm}\left( Z^{(l)} + \mathrm{Attn}\left( Z^{(l)} W_Q,\, Z^{(l)} W_K,\, Z^{(l)} W_V \right) \right)$$

where $Q_i$, $K_i$, and $V_i$ denote learnable projections for the $i$-th head, $B_{\mathrm{rel}}$ encodes relative positional biases between grid cells, and $D$ represents the feature dimension. The queries are initialized through an MLP-based aggregation of local $3 \times 3$ neighborhoods enhanced by depthwise separable convolutions, enabling efficient local feature extraction. These queries subsequently interact with global context features $Z^{(L)}$ generated by a 6-layer transformer encoder that employs position-aware gating in its attention mechanism. The hierarchical processing allows shallow layers to capture local patterns while deeper layers $Z^{(L)}$ establish long-range spatial dependencies.
To enhance feature fusion robustness, we propose a confidence-guided cross-attention mechanism that modulates attention weights using segmentation certainty maps $M \in [0, 1]^{256 \times 256}$. The modified attention computation is expressed as follows:

$$\mathrm{Attn}_{\mathrm{cross}} = \mathrm{Softmax}\left( \frac{Q_{\mathrm{BEV}} K^T}{\sqrt{D}} + \log(M + \epsilon) \right) V$$

where $\epsilon = 10^{-6}$ ensures numerical stability. The logarithmic term $\log(M + \epsilon)$ adaptively suppresses attention weights in low-confidence regions (confidence score $< 0.5$) while amplifying contributions from reliable areas, effectively pruning unreliable features without breaking gradient flow. This spatial attention gating enables the model to focus on high-certainty regions from UAV segmentation maps while maintaining end-to-end differentiability, particularly beneficial for handling occlusions and sensor noise in aerial imagery scenarios.
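A compact sketch of this confidence-guided attention is shown below; the tensor shapes and the flattening of the certainty map over key positions are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def confidence_guided_cross_attention(q_bev: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                                      confidence: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Sketch of confidence-guided cross-attention: the log of the UAV segmentation
    certainty map is added to the attention logits, suppressing low-confidence keys.
    Assumed shapes: q_bev (B, Nq, D), k and v (B, Nk, D), confidence (B, Nk) in [0, 1]."""
    d = q_bev.shape[-1]
    logits = q_bev @ k.transpose(-2, -1) / d ** 0.5               # scaled dot-product scores
    logits = logits + torch.log(confidence + eps).unsqueeze(1)    # log(M + ε) bias per key position
    attn = F.softmax(logits, dim=-1)
    return attn @ v
```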

4. Experiments and Results

4.1. Semantic Segmentation

To validate the effectiveness of the proposed DynCoANet for aerial road extraction, we conduct comprehensive experiments on the DeepGlobe Road Extraction Challenge dataset. The dataset consists of 4696 training samples and 1530 test samples. We employ standard data augmentation techniques including random rotation, horizontal flipping, and Gaussian blurring on 512 × 512 pixel crops to enhance model robustness against geometric and photometric variations.
The performance is evaluated using standard pixel-wise metrics: Precision, Recall, F1-score, Intersection over Union (IoU), and Mean IoU (MIoU). The IoU metric quantifies the geometric overlap between predictions and ground truth, defined as follows:
$$\mathrm{IoU} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. For multi-class scenarios, we compute MIoU as follows:
$$\mathrm{MIoU} = \frac{1}{K+1} \sum_{i=0}^{K} \frac{p_{ii}}{\sum_{j=0}^{K} p_{ij} + \sum_{j=0}^{K} p_{ji} - p_{ii}}$$

where $p_{ij}$ represents the number of pixels of class $i$ predicted as class $j$, and $K$ is the number of classes.
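These metrics can be computed directly from a confusion matrix, as in the following sketch (our own utility, not part of any released code):

```python
import numpy as np

def iou_miou(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """Sketch of the evaluation metrics: per-class IoU from the confusion matrix and
    their mean (MIoU). `pred` and `gt` are integer label maps of identical shape."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)     # conf[i, j] = pixels of class i predicted as j
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp                         # predicted as the class but actually another
    fn = conf.sum(axis=1) - tp                         # class pixels predicted as something else
    iou = tp / np.maximum(tp + fp + fn, 1)             # per-class IoU = TP / (TP + FP + FN)
    return iou, iou.mean()                             # (per-class IoU, MIoU)
```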
All experiments were conducted on an NVIDIA RTX 4060 GPU. Our implementation uses a ResNet-101 backbone pre-trained on ImageNet, enhanced with deformable convolutions at stages 2–3 for geometric adaptation to curved roads. The decoder incorporates connectivity attention (CoA) units that combine criss-cross attention with multi-scale dilated convolutions (rates 1, 3, 5) to resolve occlusions. Feature maps are compressed to 128 channels after the ASPP module for computational efficiency. We trained with a batch size of 16 on 512 × 512 pixel crops for 100 epochs using the Adam optimizer with an initial learning rate of $1 \times 10^{-4}$ and a polynomial decay scheduler (power = 0.9). The loss weights were set to $\alpha = 0.1$ and $\beta = 0.05$. Quantitative results are presented in Table 1. DynCoANet achieves an IoU of 61.14% and an F1-score of 87.67%, outperforming D-LinkNet, RoadCNN, and CoANet by significant margins. Notably, DynCoANet improves MIoU by 9.07% compared to CoANet, demonstrating superior balance between road segmentation accuracy and background suppression. This performance gain can be attributed to the synergistic integration of deformable multi-scale encoding and dynamic directional convolutions, which collectively adapt to non-linear road geometries and scale variations. Unlike CoANet’s fixed strip convolutions, our directional predictor generates pixel-wise orientation weights that align convolutional kernels with local road curvatures and slopes. The efficiency metric shows that while DynCoANet’s inference time is higher than CoANet’s, this reflects a reasonable accuracy-efficiency trade-off given the substantial performance improvement.
Qualitative analysis further validates the advantages of DynCoANet under challenging conditions. As shown in Figure 6, conventional methods struggle with environmental obstructions such as dense vegetation coverage and tree canopy occlusion at intersections. These natural interferences create partial visibility conditions that lead to disconnected road segments in predictions from baseline methods. In contrast, DynCoANet successfully recovers fragmented road segments under heavy vegetation occlusion and intersection shadows. The cascaded attention mechanism establishes long-range dependencies across discontinuous segments through cross-attention, enabling logical completion of vegetation-obscured pathways. The synergistic architecture operates through complementary roles: CoA units establish initial connectivity hypotheses while CBAM modules iteratively prune spurious connections, jointly reinforcing geometrically plausible road continuations.
Figure 7 provides a detailed analysis of two representative scenarios that highlight DynCoANet’s capability to enhance road network continuity. In the first scenario, where three continuous roads converge at an intersection, baseline methods fragment one road segment under dense tree canopy occlusion. DynCoANet’s directional branching module successfully reconstructs this discontinuity through geometrically constrained strip convolutions that adaptively extrapolate path directions while synchronizing width variations with surrounding road features. The second scenario reveals a more severe segmentation failure where two critical segments are missing, disrupting both road continuity and intersection integrity. The hierarchical attention mechanism in our decoder prioritizes essential connections, leveraging the deformable encoder’s multi-scale contextual understanding to recover strategically vital links while suppressing spurious connections. This selective connectivity restoration stems from DynCoANet’s architectural synergy: the deformable encoder preserves geometric invariants across scales, the directional branching enhances orientation-aware feature representation, and the hierarchical attention dynamically weights spatial–channel relationships.

4.2. Cross-View Localization

To evaluate the proposed cross-view semantic localization framework, we conduct experiments on the SemanticKITTI and KITTI odometry datasets, focusing on sequences 00, 02, and 09. These sequences encompass diverse urban environments, including open streets, dense vegetation, and tunnel-like structures that present significant challenges for localization accuracy. Ground truth poses and LiDAR scans from KITTI are utilized for quantitative evaluation. To simulate realistic UAV deployment conditions, we generate synthetic aerial imagery via Google Earth Studio with varying poses and integrate them into the testing pipeline. As illustrated in Figure 8, satellite imagery corresponding to the trajectories is sourced from Google Earth, with resolutions adjusted to match ground-level data. System performance is evaluated using absolute position error (APE) in meters, computational efficiency, and convergence success rates within predefined error thresholds.
The enhanced particle filter parameters are configured as follows. The initial particle count is set to $N_0 = 10{,}000$, with adaptive bounds $N_{min} = 5000$ and $N_{max} = 20{,}000$ to balance robustness and computational efficiency. Systematic resampling is triggered when the effective sample size falls below $\mathrm{ESS}_{th} = 0.5\, N_t$. Process noise parameters are defined as $\Sigma_p = 0.15$ and $\Sigma_\theta = 0.004$, modeling typical odometry drift. The scale factor noise is $\Sigma_s = 0.01$, enabling gradual convergence. For the annealing mechanism, we set $\beta = 0.1$ to control the smoothness of weight updates. The road constraint penalty factor is integrated into the motion model with $\lambda = 2.0$, while the measurement model employs a Cauchy kernel with scale parameter $\tau = 3.0$ meters for robust outlier rejection.
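For reference, these reported settings can be collected into a single configuration dictionary; the field names below are our own and are only meant to mirror the values listed above.

```python
# Illustrative configuration mirroring the parameter values reported in Section 4.2
# (field names are ours; the paper does not prescribe a configuration format).
PF_CONFIG = {
    "n_init": 10_000,          # initial particle count N_0
    "n_min": 5_000,            # lower bound on the adaptive particle count
    "n_max": 20_000,           # upper bound on the adaptive particle count
    "ess_threshold": 0.5,      # resample when ESS_t < 0.5 * N_t
    "sigma_p": 0.15,           # translational process noise
    "sigma_theta": 0.004,      # rotational process noise
    "sigma_s": 0.01,           # scale random-walk noise
    "beta": 0.1,               # annealing smoothness
    "lambda": 2.0,             # road-constraint penalty factor
    "tau": 3.0,                # Cauchy kernel scale (meters)
}
```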
Quantitative results demonstrate the superior localization accuracy of our approach compared to baseline methods. As shown in Table 2, our system achieves an APE of 1.7 m, 8.2 m, and 6.4 m on KITTI sequences 00, 02, and 09, respectively, outperforming the original method [20]. Compared to a method that relies on sparse geometric features, our semantic-driven approach improves localization accuracy by an order of magnitude on KITTI 00. Notably, our framework operates online with LiDAR-semantic fusion, eliminating the need for offline preprocessing.
Qualitative analysis further validates the robustness of our approach in geometrically challenging regions. Our adaptive particle filter mitigates drift by dynamically adjusting particle counts based on localization uncertainty. Figure 9 illustrates reduced position errors post-convergence, particularly in orientation estimation, where our polar histogram acceleration enables efficient likelihood computation. The ESS-based resampling strategy ensures particle diversity, preventing premature convergence in feature-sparse environments. Dynamic weighting with annealing exponent $\beta_t$ (Equation (10)) sharpens the probability distribution as convergence progresses, reducing jitter in final pose estimates.
By embedding semantic constraints into the pose estimation framework, our method effectively corrects cumulative drift while maintaining robustness in geometrically sparse areas. These results collectively demonstrate the efficacy of our semantic-enhanced particle filtering approach for achieving precise and reliable cross-view localization in complex urban environments.

4.3. Drivable Area Detection

To rigorously evaluate the proposed distance-adaptive fusion transformer, we conduct experiments on the OpenSatMap benchmark, which integrates data from nuScenes and Argoverse 2 with aligned GPS coordinates. For our multimodal task, we extend the dataset by projecting UAV-derived semantic maps into the UGV’s BEV frame using extrinsic calibration. We compare DAFT against the LiDAR-centric Grid-DATrNet across three ROI scales: 120 m × 100 m, 200 m × 200 m, and 400 m × 400 m, designed to test performance under varying degrees of LiDAR sparsity. Quantitative results, summarized in Table 3, reveal a clear trend: DAFT consistently outperforms Grid-DATrNet, with the advantage magnifying at larger scales. In the 400 m × 400 m ROI, DAFT achieves an accuracy of 83.84% and an F1-score of 0.7349, surpassing the baseline by 8.42% and 0.0761, respectively. The efficiency metrics further illustrate the practical trade-offs: while multimodal fusion introduces initial computational overhead for small ROIs, DAFT’s distance-adaptive mechanism enables superior scalability, eventually achieving both higher accuracy and lower latency for large-scale perception. At long ranges where LiDAR point clouds become extremely sparse and unreliable, this mechanism automatically shifts reliance towards the semantically consistent but geometrically coarse UAV features. In contrast, Grid-DATrNet, relying solely on LiDAR, suffers from inevitable performance degradation due to data sparsity, as evidenced by its sharp accuracy drop to 75.42% at the largest scale.
Qualitative results provide visual evidence of DAFT’s superior robustness in complex scenarios. As shown in Figure 10, our method produces continuous and topologically coherent road boundaries, whereas Grid-DATrNet outputs fragmented predictions, particularly evident at curved road sections. This enhancement in structural consistency stems from the explicit geometric constraints within our fusion decoder, which regularizes the BEV feature learning using structural priors from the aerial map, effectively bridging gaps caused by ground-level occlusions.
Figure 11 focuses on challenging urban intersections from the Argoverse dataset. Under conditions of severe occlusion caused by buildings and vehicles, DAFT maintains robust detection, outperforming Grid-DATrNet by 8.4% in accuracy. This resilience is primarily driven by two components: the spatiotemporal fusion strategy that reduces single-frame noise by enforcing geometric consistency across consecutive frames, and the confidence-guided cross-attention mechanism, which amplifies features from high-certainty UAV regions while suppressing unreliable ones. The visual comparison clearly demonstrates more complete and coherent road structure recovery in occluded areas by our method.
The comprehensive performance gains of DAFT are systematically attributable to its integrated innovations, each addressing a specific weakness of the LiDAR-only baseline: First, the localization-aware fusion network mitigates cumulative projection errors via differentiable pose optimization. This directly reduces misalignment between UAV and LiDAR data, underpinning the overall accuracy improvement across all scales. Second, the distance-adaptive weighting mechanism dynamically balances sensor contributions based on spatial confidence. This is the key to resolving the long-range sparsity problem, as evidenced by the dramatically widening performance gap at the 400 m × 400 m ROI. Third, the confidence-guided cross-attention and spatiotemporal fusion enhance feature reliability in dynamic, occluded environments. This directly translates to the superior scene completion and boundary precision observed in Figure 10 and Figure 11.
To provide deeper insights into the internal workings of the DAFT framework and demonstrate how UAV semantic priors contribute to drivable area estimation, we conduct interpretability analysis through two key visualizations.
Figure 12 illustrates the spatial distribution of the distance-adaptive fusion weight in the BEV space. The color gradient from red to blue represents a continuous transition in fusion strategy: red regions indicate where LiDAR-based perception is dominant, typically in the near-field; blue regions indicate where UAV semantic priors become primary, especially in the far-field and occluded areas. This smooth variation demonstrates the model’s ability to adaptively balance the two modalities based on distance, forming a complementary perception strategy.
Figure 13 illustrates the adaptive fusion mechanism of DAFT, where the drivable area estimation is effectively enhanced in both near and far fields. This is achieved through an internal process that first applies distance-adaptive fusion to balance LiDAR details in the near-field with UAV semantics in the far-field. Subsequently, confidence-guided attention dynamically prioritizes the most reliable features across all regions, as evidenced by the spatial variation in the attention heatmap. This visualization confirms that our method robustly integrates complementary information, utilizing precise LiDAR data nearby and expansive UAV coverage at distance to produce a coherent and improved cooperative detection result.
Collectively, these innovations enable DAFT to overcome the fundamental limitations of Grid-DATrNet, its dependence on a single degraded modality and lack of mechanisms to handle long-range sparsity or semantic–geometric conflicts. The results conclusively validate the efficacy of adaptive, confidence-aware multimodal fusion for robust drivable area perception in complex autonomous navigation.

4.4. Real-World Experimental Validation

To validate the practicality and robustness of the proposed collaborative perception framework in unstructured environments, we deployed a coordinated ground–air system. The platform, illustrated in Figure 14, integrates a modified UGV and a custom quadrotor UAV. The UGV is equipped with a 128-beam rotating LiDAR providing high-resolution 3D sensing, an industrial computer running the Robot Operating System for real-time processing, and a long-range wireless communication module. The UAV carries a high-precision gimbal camera and an embedded AI computer, enabling real-time aerial image capture and processing. Both platforms are synchronized through a ROS-based architecture with robust, error-corrected communication, forming the physical basis for seamless aerial–ground data exchange and cooperative perception.
The framework was evaluated in two representative and challenging real-world scenarios: unstructured suburban roads and complex urban intersections without clear lane markings. These scenarios were selected to stress-test the system’s capability to handle persistent occlusions from vegetation and terrain, as well as dynamic occlusions from traffic. To ensure a fair and objective comparison, we adopted the evaluation protocol proposed by Xu et al., measuring drivable area detection performance using Average Precision at Intersection-over-Union thresholds of 0.5 and 0.7. Crucially, unlike the baseline method which relies solely on ground-level LiDAR projections, our framework uniquely leverages and fuses UAV-derived semantic maps.
In suburban road scenarios characterized by dense roadside vegetation and complex road curvatures, standalone ground-based perception is fundamentally limited. As shown in Figure 15, the physical detection range and line-of-sight requirements of LiDAR result in incomplete and fragmented perception of distant road segments and obscured boundaries. Our collaborative system addresses this limitation. The UAV’s aerial perspective provides a global, unobstructed semantic prior of the road topology. As evidenced in Figure 16, this prior is effectively integrated through our fusion framework, enabling the accurate reconstruction of curved road boundaries and the recovery of connectivity in areas where LiDAR data is sparse or entirely missing due to occlusion. This leads to a more complete and reliable drivable area prediction compared to conventional LiDAR-only segmentation.
The urban intersection scenario, depicted in Figure 17, presents a different set of challenges, including dynamic traffic agents, temporary occlusions from large vehicles, and ambiguous navigable regions due to the lack of lane markings. Here, the value of cross-view perception is most apparent. When the UGV’s forward field of view is completely blocked, for instance, by a truck, traditional ground-based systems fail to perceive any risk or navigable space ahead. Our system, however, leverages the oblique perspective from the accompanying UAV. As illustrated in Figure 18, by integrating this aerial viewpoint, the framework can infer and reconstruct the drivable area geometry and semantics within the ground-level blind zone. Similarly, during turning maneuvers with persistent lateral occlusion from adjacent lanes, the UAV-derived geometric constraints enable continuous curb detection and a more confident prediction of the inner turning path, which is critical for safe navigation.
The qualitative results across both scenarios consistently demonstrate that the proposed collaborative framework generates more continuous, complete, and topologically correct drivable area maps compared to methods relying on a single perception modality. This performance enhancement is directly attributable to the core innovations of our framework: the robust cross-view semantic localization that accurately aligns aerial and ground data, and the adaptive multimodal fusion that dynamically synthesizes the geometric fidelity of LiDAR with the occlusion-resilient semantic richness of the aerial imagery. These real-world experiments confirm that the tight coupling of aerial and ground perception is not only feasible but essential for achieving robust environmental understanding in the complex, occlusion-heavy scenarios that challenge autonomous navigation systems.

5. Conclusions

This paper presents a novel collaborative air–ground perception framework to address the critical challenge of robust drivable area detection in unstructured, occlusion-heavy environments. The proposed three-stage solution systematically bridges the gaps between aerial semantic understanding and ground-level geometric perception.
In the first stage, we developed DynCoANet, a semantic segmentation network that incorporates directional strip convolution and connectivity-aware attention. This design explicitly models road topology, enabling the accurate extraction of continuous road structures from UAV imagery despite severe vegetation occlusion and complex curvatures, thereby providing a reliable global prior. The second stage introduced an enhanced particle filter for cross-view localization. By integrating semantic road constraints and diversity-preserving resampling, this method achieves robust and precise spatiotemporal alignment between UAV maps and UGV LiDAR, effectively overcoming scale ambiguity and sensor heterogeneity. The third stage proposed the distance-adaptive fusion transformer. Through confidence-guided cross-attention and distance-aware weighting, DAFT dynamically fuses the geometric precision of LiDAR with the semantic richness of UAV features, ensuring reliable perception both at close range and in distant, fully occluded regions.
The synergistic integration of these components forms the core contribution of this work: a tightly coupled perception–localization architecture. Topological reasoning from the aerial view resolves environmental ambiguities; semantic-enhanced alignment eliminates cross-modal drift; and adaptive, confidence-aware fusion overcomes the inherent limitations of individual sensors. Extensive experimental validation on public benchmarks and real-world field tests demonstrates that this unified framework significantly outperforms LiDAR-centric baselines, delivering more continuous, complete, and safe navigable area predictions in complex scenarios such as unregulated intersections and heavily vegetated roads.
While the framework demonstrates strong performance in structured and mildly unstructured environments, we acknowledge its limitations in extreme conditions that delineate important avenues for future work. The current system assumes safe UAV deployment with high-quality imaging conditions. In adverse weather such as heavy rain or fog, or during nighttime operations, UAV visual perception can be severely degraded or unavailable. Under such circumstances, the system can switch to a degraded mode where the fusion module relies on pre-stored offline high-precision semantic maps as priors, or falls back to single-UGV LiDAR-inertial odometry with geometry-based traversable region detection to maintain basic navigation functionality. Additionally, the framework assumes stable, low-latency communication between UAV and UGV for real-time data exchange. Future research should quantify communication delay impacts and explore edge computing and predictive communication strategies for link outages. The computational requirements, while moderate, also necessitate lightweight adaptations for resource-constrained onboard platforms.
This work establishes that the co-design of perception and localization is essential for autonomous systems operating under significant uncertainty. Looking forward, we plan to extend this collaborative paradigm by investigating dynamic multi-agent planning algorithms that leverage shared cross-modal understanding, and by enhancing the fusion framework to model more generalized forms of sensor and environmental uncertainty.

Author Contributions

Conceptualization, M.Z. and H.L.; methodology, M.Z.; formal analysis, P.Z.; investigation, M.Z. and H.L.; resources, M.Z.; data curation, P.Z.; writing—original draft preparation, M.Z.; writing—review and editing, M.Z., H.L. and P.Z.; visualization, M.Z. and P.Z.; supervision, M.Z.; project administration, H.L.; funding acquisition, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China under Grant 2020AAA0108103, Anhui Province Science and Technology Innovation Plan—L4-level multi-mode perception end-to-end model breakthrough and industrialization project (202423d12050007), Basic Science Research Project for Colleges and Universities in Jiangsu Province of China (23KJA460010), and Scientific Research Foundation of Nanjing Polytechnic Institute (NJPI-2024-02).

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xue, H.; Fu, H.; Ren, R.; Zhang, J.; Liu, B.; Fan, Y.; Dai, B. LiDAR-based drivable region detection for autonomous driving. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1110–1116. [Google Scholar]
  2. Shan, Y.; Fu, Y.; Chen, X.; Lin, H.; Zhang, Z.; Lin, J.; Huang, K. LiDAR based traversable regions identification method for off-road UGV driving. IEEE Trans. Intell. Veh. 2023, 9, 3544–3557. [Google Scholar] [CrossRef]
  3. Wang, L.; Huang, Y. LiDAR–camera fusion for road detection using a recurrent conditional random field model. Sci. Rep. 2022, 12, 11320. [Google Scholar] [CrossRef]
  4. Wang, R.; Wang, K.; Song, W.; Fu, M. Aerial-ground collaborative continuous risk mapping for autonomous driving of unmanned ground vehicle in off-road environments. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 9026–9041. [Google Scholar] [CrossRef]
  5. Yue, Y.; Zhao, C.; Wu, Z.; Yang, C.; Wang, Y.; Wang, D. Collaborative semantic understanding and mapping framework for autonomous systems. IEEE/ASME Trans. Mechatronics 2020, 26, 978–989. [Google Scholar] [CrossRef]
  6. Zhao, H.; Fan, L.; Chen, Y.; Wang, H.; Yang, Y.; Jin, X.; Zhang, Y.; Meng, G.; Zhang, Z. Opensatmap: A fine-grained high-resolution satellite dataset for large-scale map construction. Adv. Neural Inf. Process. Syst. 2024, 37, 59216–59235. [Google Scholar]
  7. Zhong, C.; Li, B.; Wu, T. Off-road drivable area detection: A learning-based approach exploiting lidar reflection texture information. Remote Sens. 2022, 15, 27. [Google Scholar] [CrossRef]
  8. Xu, F.; Liang, H.; Wang, Z.; Lin, L. A framework for drivable area detection via point cloud double projection on rough roads. J. Intell. Robot. Syst. 2021, 102, 45. [Google Scholar] [CrossRef]
  9. Hortelano, J.L.; Villagrá, J.; Godoy, J.; Jiménez, V. Recent developments on drivable area estimation: A survey and a functional analysis. Sensors 2023, 23, 7633. [Google Scholar] [CrossRef] [PubMed]
  10. Miller, I.D.; Cladera, F.; Smith, T.; Taylor, C.J.; Kumar, V. Air-Ground Collaboration with SPOMP: Semantic Panoramic Online Mapping and Planning. IEEE Trans. Field Robot. 2024, 1, 93–112. [Google Scholar] [CrossRef]
  11. Miller, I.D.; Cladera, F.; Smith, T.; Taylor, C.J.; Kumar, V. Stronger together: Air-ground robotic collaboration using semantics. IEEE Robot. Autom. Lett. 2022, 7, 9643–9650. [Google Scholar] [CrossRef]
  12. Khan, S.D.; Alarabi, L.; Basalamah, S. DSMSA-Net: Deep spatial and multi-scale attention network for road extraction in high spatial resolution satellite images. Arab. J. Sci. Eng. 2023, 48, 1907–1920. [Google Scholar] [CrossRef]
  13. Bastani, F.; He, S.; Abbar, S.; Alizadeh, M.; Balakrishnan, H.; Chawla, S.; Madden, S.; DeWitt, D. Roadtracer: Automatic extraction of road networks from aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4720–4728. [Google Scholar]
  14. Xu, Z.; Liu, Y.; Gan, L.; Sun, Y.; Wu, X.; Liu, M.; Wang, L. Rngdet: Road network graph detection by transformer in aerial images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
  15. Zhou, L.; Zhang, C.; Wu, M. D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 182–186. [Google Scholar]
  16. Dai, L.; Zhang, G.; Zhang, R. RADANet: Road augmented deformable attention network for road extraction from complex high-resolution remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  17. Mei, J.; Li, R.J.; Gao, W.; Cheng, M.M. CoANet: Connectivity attention network for road extraction from satellite imagery. IEEE Trans. Image Process. 2021, 30, 8540–8552. [Google Scholar] [CrossRef] [PubMed]
  18. Li, H.; Wang, L.; Cheng, S. HARNU-Net: Hierarchical attention residual nested U-Net for change detection in remote sensing images. Sensors 2022, 22, 4626. [Google Scholar] [CrossRef] [PubMed]
  19. Wang, H.; Shen, Q.; Deng, Z.; Cao, X.; Wang, X. Absolute pose estimation of UAV based on large-scale satellite image. Chin. J. Aeronaut. 2024, 37, 219–231. [Google Scholar] [CrossRef]
  20. Miller, I.D.; Cowley, A.; Konkimalla, R.; Shivakumar, S.S.; Nguyen, T.; Smith, T.; Taylor, C.J.; Kumar, V. Any way you look at it: Semantic crossview localization and mapping with lidar. IEEE Robot. Autom. Lett. 2021, 6, 2397–2404. [Google Scholar] [CrossRef]
  21. Guo, X.; Peng, H.; Hu, J.; Bao, H.; Zhang, G. From Satellite to Ground: Satellite Assisted Visual Localization with Cross-view Semantic Matching. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 3977–3983. [Google Scholar]
  22. Li, J.; Cheng, Y.; Zhou, J.; Chen, J.; Liu, Z.; Hu, S.; Leung, V.C. Energy-efficient ground traversability mapping based on UAV-UGV collaborative system. IEEE Trans. Green Commun. Netw. 2021, 6, 69–78. [Google Scholar] [CrossRef]
  23. Zhang, S.; Shan, J.; Liu, Y. Approximate Inference Particle Filtering for Mobile Robot SLAM. IEEE Trans. Autom. Sci. Eng. 2024. [Google Scholar]
  24. Shi, Y.; Li, H. Beyond cross-view image retrieval: Highly accurate vehicle localization using satellite image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17010–17020. [Google Scholar]
  25. Charroud, A.; El Moutaouakil, K.; Yahyaouy, A.; Onyekpe, U.; Palade, V.; Huda, M.N. Rapid localization and mapping method based on adaptive particle filters. Sensors 2022, 22, 9439. [Google Scholar] [CrossRef] [PubMed]
  26. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2774–2781. [Google Scholar]
  27. Gao, W.; Fu, J.; Shen, Y.; Jing, H.; Chen, S.; Zheng, N. Complementing onboard sensors with satellite maps: A new perspective for HD map construction. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 11103–11109. [Google Scholar]
  28. Man, Y.; Gui, L.Y.; Wang, Y.X. Bev-guided multi-modality fusion for driving perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 21960–21969. [Google Scholar]
  29. Goenawan, C.R.; Paek, D.H.; Kong, S.H. See the Unseen: Grid-Wise Drivable Area Detection Dataset and Network Using LiDAR. Remote Sens. 2024, 16, 3777. [Google Scholar] [CrossRef]
  30. Guan, T.; Xian, R.; Wang, X.; Wu, X.; Elnoor, M.; Song, D.; Manocha, D. AGL-NET: Aerial-ground cross-modal global localization with varying scales. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; IEEE: Piscataway, NJ, USA, 2024; p. 8161. [Google Scholar]
  31. Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.L. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1090–1099. [Google Scholar]
Figure 1. Integrated framework architecture showing (Stage-1) UAV road extraction with deformable convolutions, (Stage-2) particle filter-based cross-view localization, and (Stage-3) reliability-aware multimodal fusion. Arrows indicate data flow between perception modules and the navigation planner.
Figure 2. Overview of the DynCoANet architecture, illustrating the encoder, dynamic strip convolution block, connectivity branch, and the decoder with feature fusion modules.
Figure 3. Overview of the dynamic direction attention mechanism used to generate direction feature maps. The strip convolutions are influenced by the direction attention weights that adapt to different road orientations.
Figure 4. Workflow of the enhanced particle filter for cross-view localization. The iterative process includes: (1) initialization of particles, (2) prediction and weight update with semantic constraints, (3) adaptive resampling for diversity, and (4) state estimation for pose and trajectory.
Figure 5. The diagram illustrates the process for DA prediction, combining UAV image features and UGV LiDAR features using a BEV feature fusion mechanism. The system employs a dynamically weighted positional encoding, a confidence-guided cross-attention mechanism, and differentiable pose-aware optimization for robust perception–localization learning.
Figure 6. Road extraction results on the DeepGlobe dataset. (a) Original satellite imagery. (b) Ground truth segmentation. (c) RoadCNN predictions. (d) CoANet predictions. (e) DynCoANet predictions. Our method demonstrates superior performance under vegetation occlusion and intersection scenarios.
Figure 7. Detailed road extraction analysis on challenging DeepGlobe samples. (a) Original satellite image. (b) Ground truth segmentation. (c) CoANet segmentation. (d) DynCoANet segmentation. (e) Extracted road topology from DynCoANet, demonstrating improved continuity and topological correctness in occluded intersection areas.
Figure 8. Aerial imagery matched to the KITTI dataset. (a) Aerial image and its segmentation. (b–f) Ground LiDAR data and localization results in an intersection occlusion scenario.
Figure 9. (a) Trajectories overlaid on aerial images. (b) Trajectory comparison in an intersection occlusion scenario. (c) Localization comparison on KITTI 00. (d) Position errors on the KITTI dataset.
Figure 10. Multi-scale drivable area detection on the OpenSatMap dataset. Row 1: (a) DA projection with distance coding, (b) raw point cloud, (c) point cloud segmentation, (d) ground truth DA. Rows 2–4: results for 120 m × 100 m / 200 m × 200 m / 400 m × 400 m ROIs, showing (a) UAV perspective, (b) point cloud overlaid on UAV imagery, (c) Grid-DATrNet segmentation, (d) our topology-consistent DA reconstruction.
Figure 11. Multi-scale urban drivable area validation on the Argoverse dataset. Row 1: (a) DA projection with distance coding, (b) raw point cloud, (c) point cloud segmentation, (d) ground truth DA. Rows 2–4: results for 120 m × 100 m / 200 m × 200 m / 400 m × 400 m ROIs at intersections, showing (a) UAV perspective, (b) dynamic point cloud overlaid on UAV imagery, (c) Grid-DATrNet segmentation, (d) our spatiotemporally consistent DA prediction.
Figure 12. Spatial distribution of the continuous, distance-adaptive fusion weight α(d) in the BEV. The color gradient (red to blue) indicates the transition from LiDAR-dominant to UAV-prior-dominant regions. (a) UAV perspective; (b) our spatiotemporally consistent DA prediction; (c) distance-adaptive fusion weight distribution.
Figure 13. Visual components of the DAFT fusion framework. (ad) Input data and semantic segmentation results from UAV and UGV platforms. (ei) Intermediate features and fusion modules: confidence maps, distance-adaptive fusion, and confidence-guided cross-attention. (j) Final cooperative drivable area detection result.
Figure 14. The integrated ground–air experimental platform. (a) The unmanned ground vehicle equipped with a 128-beam LiDAR and computing system. (b) The quadrotor UAV equipped with a gimbal camera and onboard computer.
Figure 15. Experimental setup for off-road environment validation. Three typical scenarios (A, B: vegetation-occluded intersections; C: single-vehicle occlusion on straight road) are marked on the global map. For each scenario, three subfigures are provided: (a) UAV’s aerial perspective, (b) UGV’s ground-level perspective, and (c) UGV’s LiDAR-based drivable area perception without collaborative enhancement.
Figure 16. Drivable area detection in an unstructured off-road environment. For each scenario, six views are presented: (a) UGV’s forward-facing view, (b) UAV’s aerial perspective, (c) UGV LiDAR points overlaid on UAV imagery, (d) DA from collaborative perception in UAV perspective, (e) fragmented DA from UGV-only perception, and (f) collaborative DA focused on UGV’s forward region.
Figure 17. Experimental configuration for urban intersection validation. Three typical intersection scenarios (A, B, C) with varying vehicle occlusion patterns are marked on the global map. For each scenario, three subfigures are provided: (a) UAV’s aerial perspective, (b) UGV’s ground-level perspective, and (c) UGV’s independent LiDAR perception showing fragmented drivable area detection.
Figure 18. Drivable area detection in dynamic urban traffic. For each intersection scenario with varying vehicle occlusion, six views are presented: (a) UGV’s forward-facing view, (b) UAV’s aerial perspective, (c) UGV LiDAR points overlaid on UAV imagery, (d) DA from collaborative perception in UAV perspective, (e) fragmented DA from UGV-only perception, and (f) collaborative DA focused on UGV’s forward region.
Table 1. Performance and efficiency comparison of road segmentation models. Inference time is measured for processing one 512 × 512 image.
Model            IoU_road (%)   MIoU (%)   F1 (%)   Precision (%)   Recall (%)   Inference (ms)
D-LinkNet [15]   57.62          63.00      77.11    76.69           77.53        71
RoadCNN [13]     59.12          65.49      79.08    78.14           80.04        28
CoANet [17]      59.11          69.42      81.22    78.96           83.61        22
Ours             61.14          78.49      87.67    86.45           88.93        45
Table 2. Localization accuracy on the KITTI dataset (average position error; lower is better).
Method          KITTI 00   KITTI 02   KITTI 09
Original [20]   2.0        9.1        7.2
Ours            1.7        8.2        6.4
Table 3. Performance and efficiency comparison on the OpenSatMap dataset across different ROI scales.
                Grid-DATrNet                                                        DAFT (Ours)
ROI Size        Accuracy (%)  F1 Score  Inference (ms)  Computation (GFLOPS)        Accuracy (%)  F1 Score  Inference (ms)  Computation (GFLOPS)
120 m × 100 m   93.28         0.8328    231             142                         94.45         0.8426    262             157
200 m × 200 m   88.35         0.7732    324             182                         90.72         0.8014    306             178
400 m × 400 m   75.42         0.6588    502             268                         83.84         0.7349    455             236
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
