Vision-Based Unmanned Aerial Vehicle Swarm Cooperation and Online Point-Cloud Registration for Global Localization in Global Navigation Satellite System-Intermittent Environments
Highlights
- A unified, lightweight framework is introduced that fuses bio-inspired passive vision swarm coordination with real-time monocular point-cloud registration, enabling UAV teams to maintain formation and achieve global localization without GNSS, active ranging, or communication.
- Experiments with drone–drone and drone–ground robot pairs show that online cooperative map alignment improves spatial consistency compared to baseline monocular SLAM, even under heterogeneous viewpoints, sparse maps, and global positioning outage.
- The framework provides a scalable and sensor-efficient strategy for resilient multi-drone autonomy in GNSS-intermittent environments such as tunnels, industrial interiors, underground facilities, and disaster zones.
- By maintaining both coordination coherence (stable, collision-free group motion) and spatial coherence (shared global map alignment), the method enables real-time cooperative navigation for mixed aerial–ground teams operating under strict payload and communication constraints.
Abstract
1. Introduction
- A unified cooperative perception and coordination architecture integrating passive-vision, biologically inspired swarm-keeping with cross-platform monocular point cloud fusion;
- A real-time registration pipeline robust to sparse, drone-led heterogeneous monocular maps, enabling fast alignment between different autonomous vehicles’ vSLAM maps during motion;
- Integrated experimental validation in an indoor GNSS-denied testbed, demonstrating stable formation keeping from passive vision, global map alignment improving spatial coherence, and real-time cooperative localization under extended GNSS outages.
2. Materials and Methods
2.1. Swarm Coordination Using Passive Vision
1. Identify the closest neighbors within each lateral sector.
2. Compare the relative proximity between opposing lateral sectors; for example, the upper sector (neighbor a at distance d_a) versus the lower sector (neighbor b at distance d_b).
3. Select the closest neighbors in the forward sectors.
4. Compare the relative proximity of forward neighbors to a predefined forward preferred distance d_fwd (based on the expected relative speed in the forward direction), using an approach similar to that in Step 1.
5. If any opposing lateral sector pair contains only one neighbor, apply a similar logic using a lateral preferred distance d_lat.
- For opposing lateral sectors containing neighbors in both, generate a speed correction toward the center of the sector with the more distant neighbor. The correction magnitude is fixed and determined through simulation. If the relative sizes (a proxy for distance) are similar, no correction is applied.
- For opposing lateral sectors with only one neighbor, assume a neighbor exists at the preferred lateral distance d_lat and apply the same correction logic.
- Apply the same approach for forward sectors using the preferred forward distance d_fwd.
- If no neighbors exist in either lateral or forward sectors, no correction is generated.
- Sum all corrections and adjust the commanded speed accordingly, as sketched below.
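A minimal Python sketch of this sector-based logic is given below. The constants D_LAT, D_FWD, CORRECTION, and TOLERANCE are illustrative placeholders for the values tuned in simulation, and the sector layout is simplified to one vertical pair, one horizontal pair, and a single forward sector; a missing neighbor is modeled as None.

```python
import numpy as np

# Illustrative placeholders; the actual values are tuned in simulation.
D_LAT = 1.0        # preferred lateral distance d_lat (m), assumed
D_FWD = 1.5        # preferred forward distance d_fwd (m), assumed
CORRECTION = 0.05  # fixed correction magnitude (m/s), assumed
TOLERANCE = 0.10   # proximity band treated as "similar", assumed

def pair_correction(d_a, d_b, preferred):
    """Correction for one opposing sector pair (positive = toward sector a).

    A missing neighbor (None) is replaced by a virtual neighbor at the
    preferred distance, as described in the list above.
    """
    if d_a is None and d_b is None:
        return 0.0                       # no neighbors: no correction
    d_a = preferred if d_a is None else d_a
    d_b = preferred if d_b is None else d_b
    if abs(d_a - d_b) < TOLERANCE:
        return 0.0                       # similar proximity: no correction
    # Steer toward the center of the sector with the more distant neighbor.
    return CORRECTION if d_a > d_b else -CORRECTION

def swarm_keeping_command(v_cmd, d_up, d_down, d_left, d_right, d_forward):
    """Sum all sector corrections and adjust the commanded velocity
    (body axes: x forward, y left, z up)."""
    dv = np.zeros(3)
    dv[2] = pair_correction(d_up, d_down, D_LAT)     # vertical sector pair
    dv[1] = pair_correction(d_left, d_right, D_LAT)  # horizontal sector pair
    if d_forward is not None:                        # forward sector
        if d_forward > D_FWD + TOLERANCE:
            dv[0] = CORRECTION       # neighbor ahead is far: speed up
        elif d_forward < D_FWD - TOLERANCE:
            dv[0] = -CORRECTION      # neighbor ahead is close: slow down
    return np.asarray(v_cmd, dtype=float) + dv
```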
Passive Distance or Depth Estimation
- Detecting all connected components in the binary image and computing their centroids and pixel areas; discarding components with areas outside a predefined valid range;
- Computing pairwise distances between the remaining centroids and retaining the pair with the smallest separation and an orientation angle within an acceptable range;
- Associating the selected centroids with those from the previous frame using nearest-neighbor matching to maintain unique marker identities, and recording their inter-distance;
- Repeating the process for each incoming frame, with no need for external initialization (see the sketch after this list).
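As an illustration, the following Python/OpenCV sketch implements one frame of this marker-tracking loop. The area bounds, marker spacing, focal length, and orientation threshold are assumed values, and the final depth conversion uses a simple pinhole relation (depth ≈ f · L / pixel inter-distance) that stands in for the mapping used by the actual system.

```python
import cv2
import numpy as np

# Assumed parameters for illustration only.
MIN_AREA, MAX_AREA = 5, 500   # valid blob area range (px), assumed
MARKER_SPACING = 0.10         # physical marker separation L (m), assumed
FOCAL_PX = 600.0              # focal length f in pixels, assumed
MAX_ANGLE_DEG = 30.0          # acceptable pair orientation, assumed

def estimate_neighbor_depth(binary, prev_pair=None):
    """One frame of the loop; binary is a uint8 image with markers nonzero.
    Returns (depth_m, (c1, c2)) or (None, prev_pair) when no pair is found."""
    # Step 1: connected components, centroids, and pixel areas.
    n, _, stats, cents = cv2.connectedComponentsWithStats(binary)
    blobs = [cents[i] for i in range(1, n)          # label 0 is background
             if MIN_AREA <= stats[i, cv2.CC_STAT_AREA] <= MAX_AREA]
    if len(blobs) < 2:
        return None, prev_pair
    # Step 2: closest centroid pair with an acceptable orientation angle.
    best = None
    for i in range(len(blobs)):
        for j in range(i + 1, len(blobs)):
            dx, dy = blobs[j] - blobs[i]
            ang = abs(np.degrees(np.arctan2(dy, dx)))
            ang = min(ang, 180.0 - ang)             # fold to [0, 90]
            if ang > MAX_ANGLE_DEG:
                continue
            d = float(np.hypot(dx, dy))
            if best is None or d < best[0]:
                best = (d, blobs[i], blobs[j])
    if best is None:
        return None, prev_pair
    pix_dist, c1, c2 = best
    # Step 3: nearest-neighbor association with the previous frame keeps
    # marker identities consistent across frames.
    if prev_pair is not None and \
            np.linalg.norm(c1 - prev_pair[1]) < np.linalg.norm(c1 - prev_pair[0]):
        c1, c2 = c2, c1
    # Pinhole relation: depth falls off inversely with pixel inter-distance.
    return FOCAL_PX * MARKER_SPACING / pix_dist, (c1, c2)
```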
2.2. vSLAM Theory
2.2.1. Monocular vSLAM
2.2.2. Stereo Calibration and Image Rectification
- Image Distortion Correction: Radial distortion is corrected using the camera’s intrinsic calibration matrix K, obtained through intrinsic calibration [25].
- Feature Matching: The detected features are matched between the two images by comparing their descriptors (e.g., via Hamming distance).
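A short OpenCV sketch of these two steps is shown below; K and dist are assumed outputs of the intrinsic calibration in [25], and ORB [29] is used here because its binary descriptors are directly comparable by Hamming distance.

```python
import cv2

def undistort_and_match(img_left, img_right, K, dist):
    """Correct radial distortion with K/dist, then match ORB features
    between the two views using brute-force Hamming matching."""
    left = cv2.undistort(img_left, K, dist)
    right = cv2.undistort(img_right, K, dist)
    orb = cv2.ORB_create(nfeatures=1000)
    kp_l, des_l = orb.detectAndCompute(left, None)
    kp_r, des_r = orb.detectAndCompute(right, None)
    if des_l is None or des_r is None:
        return left, right, kp_l, kp_r, []          # no features detected
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_l, des_r), key=lambda m: m.distance)
    return left, right, kp_l, kp_r, matches
```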
2.2.3. Pose Calculation
2.3. Online Cooperative Point Cloud Registration
- Initial Alignment: Begin with an initial transformation guess, often the identity matrix or a prior estimate from a coarse alignment method.
- Closest Point Matching: For each point in the source cloud, identify its nearest neighbor (the closest point) in the target cloud.
- Transformation Estimation: Compute the rigid transformation that minimizes the mean squared error (MSE) between the established matched point pairs.
- Apply Transformation: Update the source point cloud’s position and orientation using the estimated transformation.
- Iteration: Repeat Steps 2 through 4 until convergence, defined by a minimal reduction in the MSE or reaching a maximum iteration limit.
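These steps are the classical point-to-point ICP of Besl and McKay [8]; a compact NumPy/SciPy sketch follows, using the closed-form SVD (Kabsch) solution for Step 3. The iteration limit and tolerance are illustrative, and note that plain ICP estimates a rigid transform only, so any scale difference between sparse monocular maps must be resolved separately.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(source, target, T_init=None, max_iters=50, tol=1e-6):
    """Point-to-point ICP over (N, 3) point clouds; returns a 4x4 transform."""
    T = np.eye(4) if T_init is None else T_init.copy()   # Step 1
    src = source @ T[:3, :3].T + T[:3, 3]
    tree = cKDTree(target)
    prev_mse = np.inf
    for _ in range(max_iters):
        dists, idx = tree.query(src)                     # Step 2
        matched = target[idx]
        # Step 3: rigid transform minimizing the MSE (Kabsch / SVD).
        mu_s, mu_t = src.mean(axis=0), matched.mean(axis=0)
        H = (src - mu_s).T @ (matched - mu_t)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:          # guard against reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_t - R @ mu_s
        src = src @ R.T + t                              # Step 4
        dT = np.eye(4)
        dT[:3, :3], dT[:3, 3] = R, t
        T = dT @ T                        # accumulate global transform
        mse = np.mean(dists ** 2)                        # Step 5
        if abs(prev_mse - mse) < tol:
            break
        prev_mse = mse
    return T
```

Robust variants of this loop (point-to-plane error metrics, correspondence rejection, and sampling strategies) are surveyed in [32].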
3. Results
3.1. Swarm-Keeping Performance
3.1.1. Simulation Test
3.1.2. Flight Test
3.2. Online Point Cloud Registration
3.2.1. First Configuration: Two Drones
3.2.2. Second Configuration: One Ground Vehicle and One Drone
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Tong, P.; Yang, X.; Yang, Y.; Liu, W.; Wu, P. Multi-UAV Collaborative Absolute Vision Positioning and Navigation: A Survey and Discussion. Drones 2023, 7, 261. [Google Scholar] [CrossRef]
- Liu, H.; Fu, Y.; Ma, Y.; Zhang, W. Multimodal Fusion and Dynamic Resource Optimization for Robust Cooperative Localization of Low-Cost UAVs. Drones 2025, 9, 820. [Google Scholar] [CrossRef]
- Lee, I.; Sung, C.; Lee, H.; Nam, S.; Oh, J.; Lee, K.; Park, C. Georeferenced UAV Localization in Mountainous Terrain Under GNSS-Denied Conditions. Drones 2025, 9, 709. [Google Scholar] [CrossRef]
- Yao, F.; Lan, C.; Wang, L.; Wan, H.; Gao, T.; Wei, Z. GNSS-denied geolocalization of UAVs using terrain-weighted constraint optimization. Int. J. Appl. Earth Obs. Geoinf. 2024, 135, 104277. [Google Scholar] [CrossRef]
- Xu, W.; Yang, D.; Liu, J.; Li, Y.; Zhou, M. A Visual Navigation Algorithm for UAV Based on Visual-Geography Optimization. Drones 2024, 8, 313. [Google Scholar] [CrossRef]
- Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning Feature Matching with Graph Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4938–4947. [Google Scholar] [CrossRef]
- Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-Free Local Feature Matching with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8922–8931. [Google Scholar] [CrossRef]
- Besl, P.J.; McKay, N.D. A Method for Registration of 3-D Shapes. IEEE Trans. Pattern Anal. Mach. Intell. 1992, 14, 239–256. [Google Scholar] [CrossRef]
- Kim, G.; Kim, A. Scan Context: Egocentric Spatial Descriptor for Place Recognition Within 3D Point Cloud Map. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 4802–4809. [Google Scholar] [CrossRef]
- Kim, G.; Choi, S.; Kim, A. Scan Context++: Structural Place Recognition Robust to Rotation and Lateral Variations in Urban Environments. IEEE Trans. Robot. 2021, 38, 1856–1874. [Google Scholar] [CrossRef]
- Garcia, G.; Eskandarian, A. Point Cloud Registration for Visual Geo-referenced Localization between Aerial and Ground Robots. In Proceedings of the 22nd International Conference on Informatics in Control, Automation and Robotics, Marbella, Spain, 20–22 October 2025; Volume 2, pp. 211–218. [Google Scholar]
- Ballerini, M.; Cabibbo, N.; Candelier, R.; Cavagna, A.; Cisbani, E.; Giardina, I.; Lecomte, V.; Orlandi, A.; Parisi, G.; Procaccini, A.; et al. Interaction ruling animal collective behavior depends on topological rather than metric distance: Evidence from a field study. Proc. Natl. Acad. Sci. USA 2008, 105, 1232–1237. [Google Scholar] [CrossRef]
- Cavagna, A.; Cimarelli, A.; Giardina, I.; Parisi, G.; Santagati, R.; Stefanini, F.; Viale, M. Scale-free correlations in starling flocks. Proc. Natl. Acad. Sci. USA 2010, 107, 11865–11870. [Google Scholar] [CrossRef]
- Reynolds, C.W. Flocks, herds, and schools: A distributed behavioral model. Comput. Graph. 1987, 21, 25–34. [Google Scholar] [CrossRef]
- Garcia, G.; Eskandarian, A. Bio-Inspired UAS Swarm-Keeping based on Computer Vision. In Proceedings of the 2024 International Conference on Unmanned Aircraft Systems (ICUAS), Chania, Crete, Greece, 4–7 June 2024. [Google Scholar]
- Cieslewski, T.; Choudhary, S.; Scaramuzza, D. Data-Efficient Decentralized Visual SLAM. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 2466–2473. [Google Scholar] [CrossRef]
- Carpin, S. Fast and accurate map merging for multi-robot systems. Auton. Robot. 2008, 25, 305–316. [Google Scholar] [CrossRef]
- Sunil, S.; Mozaffari, S.; Singh, R.; Shahrrava, B.; Alirezaee, S. Feature-Based Occupancy Map-Merging for Collaborative SLAM. Sensors 2023, 23, 3114. [Google Scholar] [CrossRef]
- Chen, W.; Wang, X.; Wang, Z.; Lin, X.; Chen, M.; Hu, K. Overview of Multi-Robot Collaborative SLAM from the Perspective of Data Fusion. Machines 2023, 11, 653. [Google Scholar] [CrossRef]
- Vodisch, N.; Cattaneo, D.; Burgard, W.; Valada, A. CoVIO: Online Continual Learning for Visual-Inertial Odometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Vancouver, BC, Canada, 17–24 June 2023; pp. 2464–2473. [Google Scholar] [CrossRef]
- Major, P.F.; Dill, L.M. The three-dimensional structure of airborne bird flocks. Behav. Ecol. Sociobiol. 1978, 4, 111–122. [Google Scholar] [CrossRef]
- Bajec, I.L.; Zimic, N.; Mraz, M. Simulating flocks on the wing: The fuzzy approach. J. Theor. Biol. 2005, 233, 199–220. [Google Scholar] [CrossRef]
- Martin, G.R. What is binocular vision for? A birds’ eye view. J. Vis. 2009, 9, 1–19. [Google Scholar] [CrossRef]
- Yang, L.; Kang, B.; Huang, Z.; Zhao, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything V2. arXiv 2024, arXiv:2406.09414. [Google Scholar] [CrossRef]
- Zhang, Z. A Flexible New Technique for Camera Calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef]
- Jian, B.; Vemuri, B.C. Robust Point Set Registration Using Gaussian Mixture Models. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 1633–1645. [Google Scholar] [CrossRef]
- Szeliski, R. Computer Vision: Algorithms and Applications; Springer: London, UK, 2010. [Google Scholar]
- Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision, 2nd ed.; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
- Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar] [CrossRef]
- Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded Up Robust Features. In Computer Vision—ECCV 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar] [CrossRef]
- Rusinkiewicz, S.; Levoy, M. Efficient variants of the ICP algorithm. In Proceedings of the Third International Conference on 3-D Digital Imaging and Modeling, Quebec City, QC, Canada, 28 May–1 June 2001; pp. 145–152. [Google Scholar] [CrossRef]
- Crazyflie 2.1 Plus. Available online: https://www.bitcraze.io/products/crazyflie-2-1-plus/ (accessed on 19 November 2025).
- Wifibot Company. Wifibot Lab V4: 4-Wheel Drive Autonomous Platform. 2025. Available online: https://www.wifibot.com (accessed on 21 November 2025).