An Aerial and Ground Multi-Agent Cooperative Location Framework in GNSS-Challenged Environments

: In order to realize the cooperative localization of multi-unmanned platforms in the GNSS-denied environment, this paper proposes a collaborative SLAM (simultaneous localization and mapping, SLAM) framework based on image feature point matching. Without GNSS, a single unmanned platform UGV and UAV (unmanned ground vehicle, UGV; unmanned aerial vehicle, UAV) equipped with vision and IMU (inertial measurement unit, IMU) sensors can exchange information through data communication to jointly build a three-dimensional visual point map, and determine the relative position of each other through visual-based position re-identification and PnP (Perspec-tive-n-Points, PnP) methods. When any agent can receive reliable GNSS signals, GNSS positioning information will greatly improve the positioning accuracy without changing the positioning algo-rithm framework. In order to achieve this function, we designed a set of two-stage position estimation algorithms. In the first stage, we used the modified ORB-SLAM3 algorithm for position estimation by fusing visual and IMU information. In the second stage, we integrated GNSS positioning and cooperative positioning information using the factor graph optimization (FGO) algorithm. Our framework consists of an UGV as the central server node and three UAVs carried by the UGV, that will collaborate on space exploration missions. Finally, we simulated the influence of different visibility and lighting conditions on the framework function on the virtual simulation experiment platform built based on ROS (robot operating system, ROS) and Unity3D. The accuracy of the cooperative localization algorithm and the single platform localization algorithm was evaluated. In the two cases of GNSS-denied and GNSS-challenged, the error of co-location reduced by 15.5% and 19.7%, respectively, compared with single-platform independent positioning.


Introduction
The aerial and ground collaborative unmanned systems are a heterogeneous crossdomain collaborative unmanned system composed of unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs), with complex functions such as perception, positioning, control and navigation.It can not only perform tasks independently, but also interacting with multiple aircraft across domains.A heterogeneous team of UAVs and UGVs can compensate for the lack of mobility, payload, and observation conditions between different platforms.UAVs can quickly cover large areas and have a good point of view for situational assessment.Ground vehicles have longer battery life, can carry large payloads, and actively interact with the environment.
In recent years, single-platform SLAM technology has been developed significantly.Early SLAM framework of sensor fusion mostly adopted the extended Kalman filter (EKF).For example, MSCKF [1] proposed a multi-sensor location information fusion method loosely coupled with visual-inertial odometry (VIO).ORB-SLAM3 [2] proposed by Campos e.g., revealed its potential in the aspects of high precision and high robustness.In the aspect of map fusion, the ORBSLAM-Atlas [3] module used camera pose error covariance to estimate the observability of the camera pose to determine whether to retain the camera pose or create a new map.In the field of multi-sensor fusion SLAM, VINS-FUSION [4] proposed a method that fused loosely-coupled global positioning information and VO/VIO positioning result.Some excellent collaborative SLAM frameworks also emerged from this foundation.When lacking external measurements, the relative position measurement between unmanned platforms mainly relies on visual position re-identification.A common way to obtain loop closures is to use visual place recognition methods, based on image or keypoint descriptors and a bag of words model, such as [4,5].Some recent works have also studied closed-loop detection between distributed robots [6,7].This method finds closed-loops through local communication between robots [8], collects observation data in a central server, and obtains a motion trajectory estimation of each robot through pose graph optimization (PGO).Different from the above, Yun Chang et al. proposed a collaborative SLAM method based on deep learning semantic description features [9].
There are two communication modes of multi-robot collaborative SLAM: distributed and centralized.[9][10][11] are representative works of a distributed framework.In related research of a centralized framework, Zou et al. introduced CoSLAM [12] in the early years, which demonstrated considerable potential of the centralized collaborative SLAM, CCM-SLAM [13] that deploys resource-consuming computations on servers, while still ensuring each agent's autonomy at low computational resource requirements by running a visual odometry system onboard.CVISLAM [14] was the first collaborative SLAM framework for bidirectional communication and extended visual-inertial odometry to the collaborative SLAM domain.It achieved higher accuracy and metric scale estimation.However, this study did not integrate GPS positioning information and thus lacks flexibility.Jialing Liu et al. proposed a collaborative monocular inertial SLAM system for smart phones.This was the first multi-agent collaborative SLAM system to run on a mobile phone [15], supporting cross-device collaboration.Similar work has reported CoVins [16] which can perform collaborative SLAM tasks on a larger scale.This study showed advantages in removing redundant information and reducing the coordination overhead.
All the above research only provides some thought to solving the ground-air collaborative navigation problems.They did not evaluate the specific application of various methods in aerial and ground collaborative navigation problems under GNSS-challenged environments.Even so, many challenges of this application still exist.For example, how to overcome the place recognition of crossing platforms under the aerial-ground difference of visual angles, or how to correct drift errors of GNSS positioning information for different platforms.The previous research mostly considered that a single platform did not need to run a complete SLAM optimization process during collaborative SLAM and only needed to execute visual odometry or visual-inertia odometry.However, with the rapid development of the terminal equipment computing power, we considered that a deploy communication interface, loosely-coupled with two-stage optimization and a complete singleplatform SLAM process on a terminal device at the same time will not only improve the positioning accuracy of the single-platform but also improve the robustness of the single-platform positioning algorithm of the whole system in the case of communication disorder.The key to collaborative positioning in GNSS-challenging environments is to ensure system initialization and positioning without GNSS signals, and to improve overall positioning accuracy with GNSS positioning when GNSS positioning information is available.
In order to solve ground-air collaborative positioning problems, Moafipoor et al. proposed a method that used UGV and UAV collaboration to navigate [17].When GPS is not available, the constraints of the external measurements provided by the extended Kalman filter and tracking filter are used to ensure the normal operation of the navigation function under adverse GPS conditions.In this paper, graph optimization was adopted to solve similar problems, assuming that GPS signals of each agent may be interfered with at any time.Peter Fankhauser et al. proposed a completely integrated method of relative observation between robots, independent of external positioning, and without initial guesstimates about the robot's posture [18].This method was applied to the mutual positioning between a hexacopter and a quadrupedal robot.J Park et al. studied the map point registration between UAV and UGV through feature points and realized the work of spatial data collection by multiple agents in a decentralized environment [19].Hailong Qin proposed a two layer shared novel optimized exploration path planning and navigation framework, which provided optimal exploration paths and integrated the collaborative exploration and mapping efforts through an OctoMap-based volumetric motion planning interface [20].They only considered GNSS-denied environments, not GNSS-challenged environments.In practical applications, on the one hand, we hope to use GNSS positioning information when receiving GNSS signals, and on the other hand, we hope to maintain a certain navigation function in the absence of GPS.
In this paper, we proposed an algorithm framework based on feature point matching and graph optimization for ground-air collaborative positioning in GNSS-challenged environments and verify its function in the virtual simulation system.Compared with previous related work, the main contributions of this paper are summarized as follows: 1. GNSS information is used to eliminate drift errors of the SLAM process in a looselycoupled way.This method can work normally even if the GNSS signal is disturbed or missing; 2. The cooperative location process is divided into several stages, and we try our best to balance the server computing power, communication bandwidth and algorithm performance.This allows the system to perform a complete global map initialization in the environment without GNSS signals; 3. The function of the proposed method is verified by using a virtual simulation system, and the positioning accuracy of the algorithm is analyzed.

System Overview
The architecture of the framework is depicted in Figure 1.At the early stages of system startup, all the UAV take off from a platform on the top of a UGV.After the unmanned platform completes IMU initialization and performs the first global bundle adjustment (BA), map fusion and relative pose solutions are completed between unmanned platforms through the local map.A flowchart of the co-location program executed on the server is shown below: Overall algorithm framework."MPs/KFs" is short for map points and keyframes.In the figure, the oval block represents all kinds of data, and the square block represents the processing of data.Input for each platform includes images, IMU, and GNSS location information.Each data handler thread is responsible for handling a corresponding agent.Pose estimation and global optimization on a single platform and server are realized by factor graph optimization.Finally, the final state after global optimization is output to the subsequent program for path planning, task allocation or map construction.
For convenience of explanation, the first stage of the position estimation coordinate system for a single platform is called the VIO frame.The coordinate system where each agent's local map is located after a single platform has completed the second stage of position estimation is called the local frame.The coordinate system where the global map is located after map fusion is completed on the server is called the global frame.The colocation program on the server is shown in Figure 2. When there is no GNSS positioning or collaborative positioning information, the VIO frame will overlap with the local frame.The global frame overlaps with the local frame of the UGV before initialization, and the process of global map initialization is to obtain the relative positioning relation between each UAV and UGV through PnP solutions and convert the local map of each agent to a global map based on UGV's local map.After initialization, the local frame of each agent will overlap with the global frame.As the unmanned platform continues to move, each agent generates a new keyframe during the local SLAM process and sends these new keyframes and map points to the server through wireless data communication.The server stack will cache map information from each agent.These keyframes and map points will be added to the global map through the initialized relative position changes between platforms.After the program discovers place recognition among platforms through detecting feature points, loop-closure and map fusion of the global map will be executed, as well as optimization.The optimized position and pose will be used to estimate that in the second stage, together with the GNSS positioning from each platform.Finally, the new pose of the new keyframes after collaborative positioning in the closed-loop position will be sent back to the corresponding agent.And the agent that accepts the collaborative positioning information will adjust their pose during the second stage of local optimization.loop is added to the global map that has been initialized.Finally, the GNSS positioning information is integrated to perform global optimization and output the result of co-location.

Keyframe-Based SLAM Algorithm Fusing GNSS Positioning Information and Co-Location Information
We deployed a SLAM framework modified and based on ORB-SLAM3 on each agent to run independently, to ensure each agent can complete independent navigation tasks in the situation where communication links are lost.The algorithm's flow is shown in Figure 3. ORB-SLAM3 was modified as follows:

•
A communication module to the server was added to exchange keyframe and map point information with the server instantly.The details about this part will be introduced in Section 2.3.1; The position estimation of the second stage was added, and both GNSS positioning information and collaborative positioning information were fused.The tracking thread processes sensor information in real time and initially estimates the pose of the current frame.In the process of tracking the local map, the track thread determines whether the current frame is used as a keyframe.Different from ORBSLAM3′s keyframe addition strategy, when valid GNSS positioning information completes timestamp alignment, frames aligned with GNSS information are also inserted as keyframes.
Each node represents a keyframe position and pose in the world frame.In Figure 4, the line between two nodes is called an "edge", which represents the constraint of the amount to be optimized in the optimization of the factor graph.The edge between two consecutive nodes is a local constraint, which comes from ORB-SLAM3′s pose estimation.The other edge is the constraint from the co-location results of the server and GNSS satellite positioning information.We used the VIO factor as the local constraints, and GNSS location information and co-location information as global constraints.An illustration of the second-stage global pose graph structure is shown in Figure 4.
where  is the measured value, provided by the results of the one-stage position estimation;  is the state prediction; and s is the observation equation. is calculated from the pose transformation between the two moments.The specific method used in this paper uses the relative pose of the current moment and the previous moment obtained from the position estimation in the first stage to add to the position coordinate j t x at the previous moment to obtain the position coordinate i t x at the current moment.is the minus operation on the error state of quaternion.We take the pose covariance matrix generated during SLAM as the covariance of local measurements.The essence of the local factor is the relative change in the pose in two keyframes.
Consider the following two situations where GNSS does not work: the GNSS signal is poor or has interference, resulting in a large error in the GNSS positioning information, resulting in a large drift in an agent's navigation trajectory; the GNSS signal is completely disabled, and there is no GNSS positioning information.Upon receipt of a valid GNSS location, the longitudinal dimension heights of the original measurements will be transformed into x, y, z coordinates in the local Cartesian coordinates system (ENU) during GNSS data preprocessing.

[ , , ]
When each agent receives a new pose about a keyframe's collaborative positioning from the server, we introduce the result of collaborative positioning as a new measured value into the process of secondary optimization.We assume that the result obtained by Lastly, in order to prevent VIO from drifting too much, the transformation of the VIO coordinate system to the global coordinate system should be updated after each optimization.The operation must make the position after fusion overlap with the results of GNSS positioning and cooperative positioning as much as possible, and the difference between the two frames should be as equal as possible with VIO.The data from VIO does not impose constraints on the absolute position after fusion, and only requires that the incremental error of the position after fusion and the incremental error of VIO should be as small as possible.

Communication
The communication modules on the agents and server was based on The ROS communication infrastructure [21].It was used for message passing over a wireless network.The communication interface supports two-way communication between the agents and the server, we applied it to update the keyframe pose at a given moment for single-platform SLAM.The server uses it to receive map information from each agent.With this module, the objects waiting to be sent were first serialized/deserialized.Since ROS is not a real-time communication system, time stamps needed to be attached to packets.

Agent-to-Server Communication
In order to realize map data sharing from agent to server and save data bandwidth as much as possible, we added several state variables into the ORB-SLAM3 program to monitor the running status of SLAM on a single platform, including: IMU initialization process, whether visual-inertial bundle adjustment (VIBA) is completed, and tracking the running status of threads.After the IMU completes initialization and VIBA, all the previous keyframes and map point information is serialized and sent.The keyframe information includes the unique ID of this keyframe on the agent, its pose and time stamp, the information of feature points it contains, and the numbers of all map points observed by this keyframe.The map point information contains the ID of this map point on the agent, its three-dimensional coordinates, its descriptors, and the ID of all keyframes that have observed this map point so far.Unique IDs ensure that these messages are not sent repeatedly to reduce the required network bandwidth.The communication module distinguishes between dynamic information (such as the pose of a keyframe that may be adjusted according to the running state of SLAM) and static information (descriptors of feature points).In the process of communicating with the server, each agent will send the smallest packet at the highest frequency (including at least one newly added keyframe information).

Server-to-Agent Communication
Each agent handler thread is responsible for processing keyframes and map points from the corresponding agent on the server.Each keyframe is not processed until all of its observed map points have been deserialized by an agent handler and inserted into the map stack.The server map stack contains separate maps that contain data from each agent.When the server maps complete location identification and merge, the two maps are removed from the map stack and replaced with their merged map until there is only one global map in the entire server map stack.When the server detects the new closedloop and completes the global pose optimization, it sends the ID and pose of the keyframe back to the corresponding agent to be updated.

Global Map Initialization and Intra-Agent Measurements
In order to enable the system to complete the positioning of each agent without GNSS at the initial stage of operation, we let the UAVs and UGVs start from the same position with the same angle of view.In this paper, UAVs are set to take off from the top platform of the UGV.After the IMU initialization is completed (which commonly needs 15-35 keyframes and takes 5-10 s), the agents perform a visual-inertial bundle adjustment.The aim of this action is to improve the precision of the initial map.Then, each agent sends its  .After the above pairwise pairing of the unmanned platforms, the local map of each UAV is finally converted into the coordinate system of the local map of the unmanned vehicle.New incremental keyframes and map points received by the server during subsequent operations will be transferred to the global map with the corresponding poses until a new loop closure or GNSS location provides a new location constraint.

Loop Closure and Global Optimization
The place recognition of loop closure is similar to the global map initialization, which depends on the feature point matching provided by DBoW2, and the initial matching through RANSAC and the PnP algorithm.Different from the global map initialization, considering the long-term drift of single-platform SLAM, a finer relative position transition is required.As above, we suppose that is carried out by using all the map point matching relationships found (the initial value of s is 1), and the goal function is the bidirectional reprojection error

Pose Graph Optimization
After obtaining the optimized transformation matrix cm T  , the redundant map points need to be eliminated in the above matching process for each pair of matching map points describing the same feature point.Then propagation of the corrections must be disseminated to the rest of the map through pose graph optimization.To eliminate duplicate points, for each pair of matches the points from As it has sufficient computing power on the server and is not sensitive to the time cost, global BA optimization can be directly carried out on the adjusted visibility and essential graphs.

Global the Second-Stage Pose Estimation
The factor graph structure of the global two-stage optimization is similar to that of the single platform, the difference is that in the global optimization process only the GNSS positioning information (if any) is needed to be taken into consideration.This is due to the GNSS positioning information from different agents will also generate new position constraints on the global map making the track change significantly and intersect, even if there is no visual re-location at that position.In this case, we would lower the threshold in line with the RANSAC algorithm.If it can pass the detection after reducing the number of inner points, a new closed-loop will be established here.The subsequent processing is consistent with the closed-loop process described above.

Virtual Simulation Experiment Platform
The overall architecture of the multi-unmanned platform simulation system based on Unity3D and ROS architecture is shown in Figure 5.The UAV flight control and visual simulation is based on Flightmare [22].The ROS Gazebo [23] simulation environment was run on Computer A to constrain the movements of UAVs and UGVs through dynamic models, and to generate the true values of the position and motion velocity.Position and velocity errors were superimposed to generate virtual IMU and GNSS data.Among them, GPS positioning information in GPGGA format was chosen as GNSS satellite positioning.The location and timestamp of an agent sent to the visual simulation module was passed through ROS-Unity3D interface.The visual simulation module moves the agent's model to the corresponding coordinates, renders the photo, and finally sends the most up-todate image and timestamp to the algorithm verification program on Computer B. The configuration of the two computers in the figure is shown in the Table 1.The rigid body appearance, physical properties, joint types and other aspects of the unmanned platform are described through URDF files.The motion of the unmanned platform model in the ROS environment can be viewed through Gazebo.Its position in Gazebo is taken as the true value of the pose of the unmanned platform.hybrid A* algorithm to UGV is applied to realize the path planning, and TEB algorithm to realize trajectory tracking.The UAV adopts a four-rotor model, using the quadrotor dynamics to design the control algorithm, and the Euler method and fourth-order Runge-Kutta method to integrate the UAV dynamics equation.

UGV Handler
Computer A

Visual Simulation Based on Unity3D
The visual simulation module includes scene resources and unmanned platform models.The scene resource module is responsible for the construction and optimization of 3D scenes in the visual simulation software, and makes the scenes restore the real environment as much as possible to run smoothly through the 3D model import and level of detail (LOD) optimization.Multiple unmanned platforms can be loaded in the simulation scene according to the demand, and the objects can be driven to move according to the immediate feedback of the data in ROS.Particle effect components are used to simulate weather, such as rain and fog.The light was set in the Unity scene to simulate a sunny day, dusk and night.Various meteorological and light environments were combined to simulate different visibilities in the real world.Lastly, the virtual RGB images were obtained through the Unity camera module.

Virtual Sensor Data Generation
Based on the modeling of the unmanned platform in Section 3.1.2,two sensor components were added: an inertial measurement unit module and a GPS module.IMU and GPS data were obtained by adding noise to the true value of the sensor model in the Gazebo coordinate system.The IMU-related parameters are shown in Table 2 The parameters related to GPS are shown in Table 3.

Experimental Results
This section introduces the simulation and experimental results of the autonomous positioning in an unknown environment by using the heterogeneous UAV and UGV system proposed.In the following experiment, we generated specified waypoints for the three UAVs and a UGV through the virtual simulation and saved the resulting sensor data and time stamps in the ROS bag for subsequent repeating tests.We examined the robustness of the method we proposed in an actual engineering scenario by modifying the ambient light, weather in the scene, and the down view angle of the UAV camera.Then we analyzed and compared the influence of the collaborative positioning framework on the positioning accuracy of each platform with single-platform positioning.The detailed settings for each dataset are shown in the Table 4.Our algorithm depends on visual graph feature points to complete place recognition, which is the foundation of subsequent map fusion and loop closure.Considering that visual angle difference and observation conditions are the main factors affecting feature point matching, we set three meteorological environments in the virtual simulation system: no occlusion, mist and fog, and two lighting environments: normal light and high light ratio under sunset.There was no cover, mist and fog representing the visibility of infinity, 150 m and 50 m, respectively.A schematic of the scenario is shown in Figure 6.The first thing worth paying attention to is the network bandwidth occupied by each agent and server for exchanging data in the virtual communication network environment.In Table 5, we list the average bandwidth and instantaneous traffic peaks of upload and downlink data between each agent and server.The peak of uploaded data occurs during map initialization, and the bandwidth usage is relatively smooth thereafter.There is a positive correlation between bandwidth usage and keyframe generation speed.Here, we compared the number of closed-loops generated with that of the feature points matched on the closed-loop frame during the operation process of the collaborative positioning algorithm on different datasets.From the results in Table 6, we found that the feature point matching had similar performance under two similar viewing angles when the viewing angle of UAV was the same.As many feature point pairs as possible are beneficial not only to achieve robust place recognition, but also to improve the positioning accuracy of SLAM.However, the number of feature point matching between UAV and UGV will decrease with an increase in the angle of view difference, which is more obvious when a UAV flies at low altitude.We believe that setting the camera angle of the UAV to 45° in the downward view can balance the need for visual repositioning among UAVs and between UAVs and UGVs better.As can be seen from the data in Table 7, the change of light affected the detection of feature points, but the number of feature points in the picture was still enough to produce the correct closed-loop.In the mist mode, 59% of feature point pairs were lost, and the number of correct closed-loops produced also dropped dramatically.Moreover, dense fog interfered the feature point detection and place recognition thoroughly.In the dense fog mode, except when the map initialization was completed when all the UAVs and UGVs started from the same position at the first stage of the operation, all the position re-identifications failed at the position where the closed-loop should be generated due to insufficient matching feature points in subsequent operations.To estimate the positioning accuracy of the system, in this section the influence of external observation conditions on the experiment results were ignored and the camera view that generated the largest average number of matching point pairs (45°) was adopted.The experiment for this section was run on Factory 01, 03, 05, 06 on the datasets described in Table 8.We estimated the absolute trajectory error (ATE) of each agent.We tested the positioning performance in GPS-denied and GPS-challenged modes separately.In the GPSchallenged 1 mode, all agents obtained valid GPS location information every 20 s.In the GPS-challenged 2 mode, the UAV could only randomly obtain four valid GPS positions during the operation, and the UGV masked all GPS location information.In the GPS-denied mode, all agents would not use the GPS location data in the dataset.The effect trajectory of the three UAVs and the unmanned vehicle in the collaborative positioning is shown in Figure 7a.

Discussion
From the experiments described in Section 3 we draw the following three conclusions.
1. Experiments in the virtual simulation platform demonstrated that the co-location framework we designed can maintain usable positioning accuracy over hundreds of meters of trajectory when the GNSS signal is rejected.At the same time, the co-location framework outperforms the single platform in terms of accuracy.This is essentially attributed to the new position constraints arising from place recognition between the different platforms.2. The SLAM front-ends we currently employ were not sufficiently stable under poor observation conditions.For example, poor visibility environments or poor near-orthogonal viewing angles affected the proper function and positioning accuracy of the air-ground co-location system.The main reason is that the ORB feature-based SLAM front-end we adopted often cannot provide a sufficient number of feature points for matching to achieve robust location re-identification.Other common front-end methods, such as optical flow method, have more stringent requirements on observation conditions and viewing angles.In the SLAM front-end process, if the threshold of hamming distance for ORB detection is lowered in order to increase the number of matched pairs, it will lead to excessive false matches affecting the accuracy of VIO.Therefore, place recognition that relies on ORB features can only be adapted to the images generated under more optimum observation conditions.The visual front-end used in this manuscript fulfills the basic requirements for cooperative air-ground localization but is not yet sufficiently robust for weak observation conditions.3. The method described in this manuscript allows the location information of each agent in the entire unmanned cluster to be propagated to the other agents in the cluster.This process improves the positioning accuracy while also allowing the server and the individual unmanned platforms to acquire relative positions to each other.We designed a comparison of four cases.They were GPS-challenged co-location, GPS-denied co-location, single-platform independent operation of ORBSLAM3, and single-platform independent operation of ORBSLAM3 fusion GPS.In GPS-denied mode, the simulation was performed on a solar body with no GNSS fix except of Earth, or on Earth but the GNSS signal had interference.In this case, the positioning of each agent completely depended on the camera and IMU.It can be seen from Table 8 that the accuracy of co-location was significantly improved compared with the positioning accuracy of ORBSLAM3 running on a single platform.Assuming that some agents in the system obtained satellite positioning information, the positioning accuracy of all agents was improved.It is worth noting that even if we assume in the experiment that the UGV cannot obtain satellite positioning signals at all, its positioning accuracy still benefits from the co-location algorithm.This is due to the fact that the new constraints brought by the satellite positioning information are propagated to the local maps of all agents through the global map.Through the comparison of the two GPS-challenged modes, we found that only a few GPS points were enough to greatly improve the positioning accuracy.The GPS-challenged 1 mode used several times more GPS positioning points than the GPS-challenged 2 mode, but the positioning accuracy was not significantly improved compared with the latter.
In the experiments designed in this manuscript, we set the UAV to use high-precision GNSS positioning information, while the UGV could not obtain GNSS signals.There were two main reasons for this assumption.The first point was that, in practice, UAVs located in the sky can often obtain GNSS positions with good accuracy through RTK or relative positioning measurements.Whereas vehicles located on the ground may temporarily lose their GNSS position due to obstruction by reinforced concrete buildings, or by entering tunnels and interiors.The second point was that ORBSLAM3 itself is an excellent singleplatform SLAM algorithm, which can have a trajectory drift error of less than 1% with a closed-loop bottleneck.This makes it difficult for the meter-level errors inherent in GNSS to contribute to the improvement in positioning accuracy if differential-free GNSS positioning is used in a virtual scene of limited size.If the scale of the motion trajectory reaches several kilometers, however, even differential-free GNSS positioning information can greatly optimize the positioning accuracy.
In our paper, we focused on collaborative localization.However, in engineering applications, the movement of each agent in the system was not infinite due to the limitation of communication bandwidth and communication distance.Due to the lack of prior information of the global map, the pose information of each agent and the local map information received by the server had time delays, and the global positioning pose updated through the visual closed-loop had no gradient information.In this case, if a motion strategy for convex online learning to train unmanned clusters is required, delayed mirror descent (DMD) [24] would be a good choice.We will use this approach in the subsequent work.In the aspect of multi-agent motion control, our method only used the most basic D* Lite algorithm.Only the trajectory of a single agent was considered to fit the route generated in advance.However, in practical applications, there may be obstacles in the waypoints planned in advance, and the route itself is unreachable.Cooperation among multiple agents can be difficult to achieve.Collaborative algorithms such as [25] realize the decentralized and real-time cooperative pursuit of a single evader in the planar domain.By improving the method proposed in this paper, the control of the prior map obtained by the UAV would be improved.
Finally, it is worth mentioning that in the statistics of network bandwidth usage, we found that the peak bandwidth usage occurs in the map initialization phase, which is about three times the average value.In the case of long distance or interference, this will put higher requirements on the performance of wireless communication devices.When there is a prior map, it will greatly relieve the bandwidth pressure during system initialization.During system initialization, at least four high-quality GNSS positioning points are needed to convert the VIO coordinate system to the northeast sky coordinate system.Meanwhile, ORBSLAM3′s IMU initialization requires initial zero-bias estimation and gravity direction estimation to obtain scale information, which makes the initialization of the global map take too long.How to complete global map initialization more efficiently and quickly or to make better use of prior map information is the next urgent problem to be solved.

Conclusions
This manuscript uses ORB-based feature point re-identification to improve the positioning accuracy of all unmanned platforms in a cluster by fusing the local maps of each unmanned platform with new position constraints and global GNSS positioning information.Through a centralized collaborative positioning service, this can provide low latency positioning information for subsequent collaborative path planning and task allocation algorithms.The co-location algorithm proposed in this manuscript has better accuracy in both GNSS-challenged and GNSS-denied modes than the ORB-SLAM3 algorithm running on a single platform.The two-stage position estimation method used can also be combined with other positioning sensors such as UWBs and barometers in addition to the GNSS global positioning information applied in the manuscript.Current techniques for the localization of air-ground unmanned clusters in complex environments present new demands in the direction of visual front-ends, optimization methods and multi-sensor fusion.The approach proposed in this manuscript can be extended to other unmanned clusters in areas such as UAVs, logistics, agriculture and military.In the future, based on the current results, our subsequent work will further investigate the impact of different vision front-end techniques on location re-identification, a key aspect that critically affects map fusion and closed-loop detection.We will consider the use of feature detection and matching techniques based on deep learning or point and line features to improve the robustness of the algorithmic framework and to improve the localization accuracy based on this.Finally, our proposed approach in this manuscript has high computational resource requirements for map fusion and global optimization, and we will aim to mitigate the computational power and communication bandwidth required for global map maintenance on a server.

Figure 1 .
Figure 1.Overall algorithm framework."MPs/KFs" is short for map points and keyframes.In the figure, the oval block represents all kinds of data, and the square block represents the processing of data.Input for each platform includes images, IMU, and GNSS location information.Each data

Figure 2 .
Figure 2. The running process of the co-location module in Figure 1 is explained in detail.The function of this module is to restore the received MPs/KFs data from each agent to a local map.The local map is further stitched into a global map through visual position re-identification, or a new closed-

Figure 3 .
Figure 3.The tracking thread processes sensor information in real time and initially estimates the pose of the current frame.In the process of tracking the local map, the track thread determines whether the current frame is used as a keyframe.Different from ORBSLAM3′s keyframe addition strategy, when valid GNSS positioning information completes timestamp alignment, frames aligned with GNSS information are also inserted as keyframes.

Figure 4 .
Figure 4. Schematic diagram of the pose graph structure to be optimized in the second stage.The circle is the state quantity, and the yellow square is the constraint of local observation, that is, the relative pose transformation of VIO from ORBSLAM3.The other colored squares are constrained by global observations.

G
t KF in the first-stage pose estimation at that moment in the VIO coordinate system is   , vv qp .The pose obtained by t KF after the co-location is transformed into   , cc qp on the local framework of the agent.The co-location factor is derived as: initial optimized local map to the server.For a certain terminal c Agent , the server will start the alignment from the keyframe c KF at the most recent moment sent in the c Map .First, a subset of maps m Map from the server map stack and brute-force descriptors are chosen that match all keyframes contained within that map with c t KF through DBoW2, and obtained the most likely matched candidate  

.m
i to four times that of the number of agents).For each keyframe, among all the feature points contained in it and its five best common-view keyframes, the 2D-2D feature point matching provided by DBoW2 is used to obtain the 3D-3D matching between their corresponding map points.The map points of candidacy If the reprojection error is less than the threshold, a vote is cast for the corresponding n m t KF , and the frame with the most votes is selected as the matching frame.For further improvement of the closed-loop's robustness, the matching frame and its common-view frame together need to achieve three successful matches to be considered as the complete place recognition.If, at this moment, c Agent is a UAV, m Agent is a UGV, then the UAV's local map will be fused into the UGV's through _ Agent are UAVs, the relative pose transformation will be saved until one of them is fused with the UGV's map, and another will complete fusion through

.
All converted map points contained within the common-view frame of m KF , and those found matching to the key points in c KF .After the above intra-agent measurements have been completed, we need to adjust the map information according to the new closed-loop and fuse all the positioning information from each agent to perform global optimization.The details are as follows.2.5.1.Refinement of Transformation Matrixcm T is used to convert all map points contained within the common-view keyframe of m KF , and matching map points are located in the feature points of c KF .To obtain as many matches as possible, the map points are found that matches c KF in m KF and all of its keyframes at the same time.The nonlinear optimization of

c
Agent are removed, and the points from m Agent inherit all the observations of the removed points.By adding edges to the co-visibility and essential graphs, the observability is propagating between the common-

Figure 5 .
Figure 5.The virtual data generation program is deployed on Computer A, and the agent handler and cooperative location program is deployed on Computer B. They are connected using a gigabit network.

Figure 6 .
Figure 6.Schematic diagram of the virtual simulation scene (a) bright scene with high visibility; (b) a dimly lit but fog-free scene; (c) foggy and dim scene.Different observation conditions will affect the extraction and matching of feature points.

Figure 7 .
Figure 7. (a) The rendering of the co-location of three UAVs and a UGV; (b-e) trajectories and ground truth for a single-platform operation of each agent.

Table 1 .
The hardware configuration of the computers running the virtual simulation platform and the positioning algorithm.

Table 2 .
: IMU data production module parameter configuration table.

Table 3 .
GPS data production module parameter configuration table.

Table 4 .
Parameter Settings for each dataset.

Table 5 .
This table shows the network bandwidth usage after all simulation datasets were run.The results are obtained from seven experiments.

Table 6 .
The number of feature points matching and successful location re-identification under different angles of view and illuminations.

Table 7 .
The number of feature points matching and successful location reidentification under different angles of view and illuminations.

Table 8 .
The accuracy of each agent running in different modes.