1. Introduction
Many applications of multirobot systems frequently encounter the challenge of cooperative localization, including formation control [1], cooperative transportation [2], and search-and-rescue missions [3]. Cooperative localization in multirobot systems refers to the detection and localization of mobile robots (often with limited sensor data, such as relative distance or bearing) relative to other robots or landmarks. It is crucial for multirobot systems because it requires cooperation across the entire team. Currently, state estimation for multiple robots in outdoor environments relies heavily on Global Navigation Satellite Systems (GNSSs) [4]. However, GNSS-based methods are restricted to outdoor environments because they depend on satellite signals. Furthermore, these methods only provide meter-level accuracy, which falls short for tasks requiring high-precision localization of robots. Other methods use motion capture systems or traditional Ultra-WideBand (UWB) localization systems for indoor scenarios without satellite signals [5,6,7]. They rely on deploying physical anchors or base stations to achieve cooperative localization. Therefore, these solutions are unsuitable for unknown environments, such as large areas or indoor spaces, where deploying or modifying infrastructure is challenging. This limitation restricts the overall performance and application potential of multirobot systems and makes their use more difficult and expensive.
With the advancement of autonomous perception in robots, visible light sensors, LiDAR, ultrasonic sensors, and numerous other novel sensors are being applied to cooperative localization [8,9,10]. The data provided by these sensors can improve localization accuracy and enable motion tracking, providing efficient and robust state estimation for robots in indoor and outdoor as well as known and unknown environments [11]. This is because they can detect and understand the surrounding environment, even if localization is not their primary purpose. For instance, high-precision maps constructed with LiDAR form a digital surface model represented by discrete points, which contains the spatial information required for target localization [12]. However, LiDAR, ultrasonic sensors, and optical detectors are usually costly and less easily deployable than camera-based visual sensors. Therefore, there is growing interest in cooperative localization methods employing cameras. Vision-based simultaneous localization and mapping (SLAM) techniques can provide the relative positions of robots in the surrounding environment [13]. Additionally, other robots can be detected directly in the images captured by an onboard camera, and distance and bearing measurements can be extracted from these visual devices. However, cameras suffer from a small field of view, short range, and susceptibility to occlusion by nearby objects. Because of the limited observation angle of visual sensors, relative state estimation outside the field of view of onboard cameras has become a challenging issue. Therefore, a sensor that offers omni-directional perception and is easy to carry, such as a UWB module, is needed to assist the camera with relative measurements. This paper introduces a cooperative localization system based on visual and UWB technologies that can be applied to multirobot frameworks.
There are typically two modes for multirobot cooperative localization: centralized and decentralized. Compared to the centralized framework, decentralized methods reduce the complexity of computation and communication during the cooperative process. In the decentralized framework, each member runs an estimator locally, propagating its state by fusing measurements from its local onboard sensors with relative observations received from team members [14]. In multirobot systems, solving the relative distance and bearing problems in a decentralized manner is preferable and scalable. Huang et al. [15] introduced an adaptive recursive decentralized cooperative localization method for multirobot systems with varying measurement accuracy. Wang et al. [16] proposed a fully decentralized cooperative localization algorithm based on covariance union to address inconsistent state estimation caused by spurious sensor data. Ref. [17] presented a resilient decentralized localization algorithm for tracking inter-robot correlation.
Existing decentralized cooperative localization methods for multirobot systems usually employ traditional filtering algorithms to fuse measurements from different sensors with their prior information, overcoming the limitations of individual sensor technologies to obtain more accurate estimates. Commonly used filtering methods include the extended Kalman filter, the unscented Kalman filter, and the particle filter [18]. However, as multirobot systems have developed, the nonlinearities in navigation and localization systems have become increasingly severe. Filtering methods often require appropriate modeling of the problem and are susceptible to outliers, so traditional filtering techniques cannot meet the high-precision requirements of cooperative localization systems. Factor-graph-based multi-source fusion methods have opened up new avenues for cooperative localization in multirobot systems. Unlike traditional filter-based fusion methods, they optimize the complete set of states by considering historical measurements and system updates simultaneously. These systems can seamlessly integrate new sensors in a plug-and-play fashion, which effectively addresses the nonlinearities in cooperative localization and the asynchronous issues in data transmission. Xu et al. [19] presented a decentralized visual–inertial–UWB swarm state estimation system that flexibly integrates or discards sensor measurements and provides estimates through back-end graph optimization. In factor-graph-based cooperative localization systems, if a sensor becomes unavailable due to signal loss or sensor failure, the system simply avoids adding the relevant factors, without requiring special procedures or coordination. In this case, historical information is used to encode all measurements and states into a factor graph, and sensor fusion is solved through iterative optimization. In [20], a factor-graph-assisted distributed cooperative localization algorithm was developed to address ranging localization errors in cooperative nodes; it integrates a Fisher information matrix between adjacent agent nodes to evaluate the ranging performance of an agent node with respect to its neighbors. In this paper, we formulate multirobot cooperative localization as a graph optimization problem and efficiently achieve accurate state estimation through a factor-graph-based decentralized cooperative localization method.
Vision-based multirobot cooperative localization often requires accurate recognition of robots in order to track the associated measurements used to estimate the association states [21]. However, real-world environments are far from perfect and often contain various sources of interference that make correct data association challenging. Xunt et al. [22] proposed a system that focuses on relative pose estimation for multirobot systems but establishes data association through special patterns formed by infrared LED boards. Nguyen et al. [23] proposed a cooperative localization method using a coupled probabilistic data association filter to handle nonlinear measurements; however, this method is implemented within a filtering structure. While considerable work has been done on multirobot cooperative localization, there is limited research on factor-graph-based methods that use visual information for cooperative localization. To address association uncertainty in cooperative localization, some studies have explored data association algorithms in factor graphs. Gulati et al. [24] utilized data association filters to calculate the correspondence probabilities between multiple topologies of two robot states, which reduced interference in the topological factors. Xu et al. [25] introduced a cooperative localization method based on omni-directional cameras, UWB sensors, and maps from multiple unmanned aerial vehicles that associates visual data through the global nearest neighbor algorithm. Although these methods can effectively track a single target, they struggle to achieve satisfactory results when associating data for multiple targets. Compared to [25], we use a monocular camera and avoid the computational burden of processing a large number of distorted images. Moreover, in this paper, the Joint Probabilistic Data Association (JPDA) algorithm is used to solve the data association problem in multirobot cooperative localization. It can track multiple targets simultaneously with high tracking performance, and a target can be correlated with multiple measurements within the association gate. By fusing the front-end measurements with the back-end through graph optimization based on a factor graph, accurate localization is achieved while improving the resilience of the system in the face of abnormal situations.
We propose a factor-graph-based resilient cooperative localization algorithm (FG-RCL) for multirobot systems. The main contributions of this paper are as follows:
We propose a resilient and accurate cooperative localization system that fuses visual information and UWB measurements without relying on sensors or computing power external to the robots.
The system can handle data association problems in the visual measurement process of multirobot cooperative localization.
The accuracy and robustness of the proposed algorithm are demonstrated through simulation results and experiments on a public dataset.
The remainder of this paper is organized as follows. Section 2 first provides some basic concepts and the system framework; it then introduces the construction of the cost factors and the use of the JPDA algorithm to solve the data association problem of visual detection measurements. Section 3 presents simulation and experimental results. Finally, Section 4 and Section 5 discuss and summarize our work.
2. Materials and Methods
2.1. Notation
In this paper, we use $\mathbb{N}$, $\mathbb{Z}_{\geq 0}$, and $\mathbb{Z}^{+}$ to represent the sets of natural numbers, non-negative integers, and positive integers, respectively. $\mathbb{R}^{m}$ represents the $m$-dimensional real vector space. For a vector or matrix, we use $(\cdot)^{\top}$ to represent its transpose. For a vector $\mathbf{v}$, $\|\mathbf{v}\|$ represents its Euclidean norm. For a matrix $\mathbf{A}$, $\mathbf{A}_{[m \times n]}$ represents the $m \times n$ matrix block in the bottom right corner of $\mathbf{A}$. We use $[\mathbf{v}]_{\times}$ to represent the antisymmetric matrix of a vector; e.g., for $\mathbf{v} = \left[v_{1},\ v_{2},\ v_{3}\right]^{\top}$, there is
$$[\mathbf{v}]_{\times} = \begin{bmatrix} 0 & -v_{3} & v_{2} \\ v_{3} & 0 & -v_{1} \\ -v_{2} & v_{1} & 0 \end{bmatrix}.$$
We add a left superscript to a vector to indicate the reference frame in which it is expressed; e.g., ${}^{A}\mathbf{v}$ is the vector $\mathbf{v}$ expressed in frame $A$. $\mathbf{R}$ and $\mathbf{T}$, together with left superscripts and subscripts, represent the rotation matrix and the transformation matrix, respectively, between two frames. For example, ${}^{A}_{B}\mathbf{R}$ and ${}^{A}_{B}\mathbf{T}$ are the rotation matrix and the transformation matrix, respectively, from frame $B$ to frame $A$.
Quaternions can also be used to represent rotation. A quaternion $\mathbf{q}$ is a four-dimensional complex number composed of one real part and three imaginary parts:
$$\mathbf{q} = q_{w} + q_{x}i + q_{y}j + q_{z}k = \left[q_{w},\ q_{x},\ q_{y},\ q_{z}\right]^{\top}.$$
The multiplication of rotation matrices represents the composition of rotations between reference frames, and the corresponding quaternion operation is as follows:
$${}^{A}_{C}\mathbf{R} = {}^{A}_{B}\mathbf{R}\, {}^{B}_{C}\mathbf{R} \quad \Longleftrightarrow \quad {}^{A}_{C}\mathbf{q} = {}^{A}_{B}\mathbf{q} \otimes {}^{B}_{C}\mathbf{q}.$$
We denote $b$ as the body reference frame of the robot, $w$ as the world reference frame, and $c$ as the camera reference frame. Unless otherwise specified, the body reference frame is implicit: for example, $\mathbf{q}_{c_{k}}$ represents the rotation quaternion from the camera reference frame to the body frame at time $k$, i.e., ${}^{b}_{c_{k}}\mathbf{q}$.
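To make these conventions concrete, the following NumPy sketch (ours, not taken from the paper) implements the Hamilton product and the quaternion-to-rotation-matrix map in the $[q_{w}, q_{x}, q_{y}, q_{z}]$ ordering used above and checks that composing quaternions matches multiplying the corresponding rotation matrices.

```python
import numpy as np

def quat_mult(q1, q2):
    """Hamilton product of quaternions given as [w, x, y, z]."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_to_rot(q):
    """Rotation matrix R(q) for a unit quaternion [w, x, y, z]."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

# Two random unit quaternions standing in for frame-to-frame rotations.
rng = np.random.default_rng(0)
qa = rng.normal(size=4); qa /= np.linalg.norm(qa)
qb = rng.normal(size=4); qb /= np.linalg.norm(qb)

# R(qa ⊗ qb) equals R(qa) @ R(qb), mirroring the composition rule above.
assert np.allclose(quat_to_rot(quat_mult(qa, qb)), quat_to_rot(qa) @ quat_to_rot(qb))
```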
2.2. System Overview
We assume a multirobot system containing $N$ robots, where the $i$-th robot is denoted as Robot $i$, $i \in \{1, 2, \ldots, N\}$. For simplicity, unless otherwise specified, Robot $i$ is discussed in the following content. If the self-motion estimation of Robot $j$ works normally and Robot $j$ can communicate reliably with Robot $i$, then Robot $j$ is within the available set of Robot $i$.
For the team of robots described in this paper, the sensors of each robot include a camera and a UWB module. The cooperative localization system is formulated as a nonlinear optimization problem.
Figure 1 shows the proposed factor graph structure. Hollow circles serve as vertices, which represent the poses to be optimized. Three types of edges connect the poses: (1) the local state estimation factor, which is the incremental estimate of the robot's own motion obtained through visual odometry (VO) and forms a pose constraint between two consecutive time steps; (2) the UWB factor, which represents the relative distance constraint between two robots at the same time; and (3) the visual detection factor, which represents the relative visual measurement constraint of one robot being detected by another. For each edge, the measurement, the information matrix, and the Jacobian matrix of the error function with respect to the state variables must be implemented.
Usually, $SE(3)$ is used to describe the set of rigid-body motions in 3D space. The pose of a robot on $SE(3)$ is represented as
$$\mathbf{T} = \begin{bmatrix} \mathbf{R}(\mathbf{q}) & \mathbf{p} \\ \mathbf{0}^{\top} & 1 \end{bmatrix},$$
where $\mathbf{q}$ and $\mathbf{p}$ represent the orientation quaternion and the position of the body frame relative to the world frame, respectively. $\mathbf{R}(\mathbf{q})$ represents the transformation of a quaternion into a rotation matrix, which can be seen in [26]. The main goal of our work is to estimate the transform ${}^{w}_{b}\mathbf{T}$.
The proposed framework is divided into a front-end and a back-end that run independently on each robot. At the front-end, the raw data are processed separately by the local state estimation (LE), visual detection (VD), and UWB modules. At the back-end, state estimation is performed using a graph optimization method. We use a sliding window to construct cost functions based on the measurements of LE, VD, and UWB at each new time step. The state of Robot $i$ to be estimated includes the robot's orientation quaternion $\mathbf{q}_{i}^{k}$ and position $\mathbf{p}_{i}^{k}$ at time step $k$. We define the state estimate of Robot $i$ at each time step $k$ as $\mathbf{x}_{i}^{k} = \left[\mathbf{q}_{i}^{k},\ \mathbf{p}_{i}^{k}\right]$, so the optimized full state vector of Robot $i$ in the sliding window is defined as
$$\mathcal{X}_{i} = \left[\mathbf{x}_{i}^{k-m+1},\ \mathbf{x}_{i}^{k-m+2},\ \ldots,\ \mathbf{x}_{i}^{k}\right], \quad i \in \{1, 2, \ldots, N\},$$
where $N$ is the total number of robots, and $m$ is the number of keyframes in the sliding window.
LE, VD, and UWB provide observations that couple pairs of states in the sliding window. Using all the observations obtained from LE, VD, and UWB, the cost function at time step $k$ can be constructed as
$$\min_{\mathcal{X}_{i}}\left\{\sum_{L}\left\|\mathbf{r}_{L}\right\|_{\boldsymbol{\Sigma}_{L}}^{2}+\sum_{V}\left\|\mathbf{r}_{V}\right\|_{\boldsymbol{\Sigma}_{V}}^{2}+\sum_{U}\left\|\mathbf{r}_{U}\right\|_{\boldsymbol{\Sigma}_{U}}^{2}\right\},$$
where $L$ is the set of all local state estimation factors, $V$ is the set of all visual detection factors, and $U$ is the set of all UWB factors; $\mathbf{r}_{L}$ denotes the residuals of the local state estimation factors, which ensure local consistency of the state of Robot $i$; $\mathbf{r}_{V}$ denotes the residuals of the visual detection factors; $\mathbf{r}_{U}$ denotes the residuals of the UWB factors; and $\boldsymbol{\Sigma}_{L}$, $\boldsymbol{\Sigma}_{V}$, and $\boldsymbol{\Sigma}_{U}$ are the covariance matrices of the measurement errors.
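As an illustration of how such a sliding-window cost can be assembled and minimized, the following self-contained sketch simplifies the states to positions only, drops the information-matrix weighting, and feeds synthetic measurements to `scipy.optimize.least_squares`. It is only a toy instance of the three factor types under these stated simplifications, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import least_squares

# Toy sliding window: N = 2 robots, m = 4 keyframes, position-only states
# stacked as X = [p_1^1 ... p_1^m, p_2^1 ... p_2^m] with 3D positions.
N, m = 2, 4

def unpack(X):
    return X.reshape(N, m, 3)

def residuals(X, odom, ranges, bearings):
    P = unpack(X)
    r = [P[0, 0] - np.zeros(3)]                       # prior anchoring the gauge
    for (i, k), dp in odom.items():                   # local (LE) factors
        r.append((P[i, k] - P[i, k - 1]) - dp)
    for (i, j, k), d in ranges.items():               # UWB range factors
        r.append(np.atleast_1d(np.linalg.norm(P[j, k] - P[i, k]) - d))
    for (i, j, k), b in bearings.items():             # visual bearing factors
        e = P[j, k] - P[i, k]
        r.append(e / np.linalg.norm(e) - b)
    return np.concatenate(r)

# Perfect synthetic measurements: robot 0 moves along x, robot 1 is offset by 2 m in y.
truth = np.stack([[[k, 0.0, 0.0] for k in range(m)],
                  [[k, 2.0, 0.0] for k in range(m)]], axis=0).astype(float)
odom     = {(i, k): truth[i, k] - truth[i, k - 1] for i in range(N) for k in range(1, m)}
ranges   = {(0, 1, k): float(np.linalg.norm(truth[1, k] - truth[0, k])) for k in range(m)}
bearings = {(0, 1, k): (truth[1, k] - truth[0, k]) / np.linalg.norm(truth[1, k] - truth[0, k])
            for k in range(m)}

x0 = truth.ravel() + 0.3 * np.random.default_rng(1).normal(size=truth.size)
sol = least_squares(residuals, x0, args=(odom, ranges, bearings))
print(np.abs(unpack(sol.x) - truth).max())            # should be close to zero
```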
2.3. Construction of the Cost Factors
Our proposed FG-RCL algorithm divides the measurements into two stages: (1) a local measurement stage, where local state estimation results are obtained through VO or proprioceptive sensors; (2) a relative measurement stage, where the relative measurements of available robots are obtained by the cameras and UWB modules and are then fused with the local state estimation results through graph optimization.
2.3.1. Local State Estimation Factor
In local state estimation, the VO algorithm is used. The transformation matrix ${}^{w}\mathbf{T}_{i}^{k}$ can be obtained from the state vector $\mathbf{x}_{i}^{k}$ of Robot $i$ in the world reference frame at time step $k$, so the measurement of the transformation matrix is obtained as
$${}^{c}\mathbf{T}_{i}^{k} = {}^{c}_{w}\mathbf{T}_{i}\, {}^{w}\mathbf{T}_{i}^{k},$$
where ${}^{c}_{w}\mathbf{T}_{i}$ is the transformation matrix of Robot $i$ from the world reference frame to the camera reference frame. Then, we can use a Gaussian model to represent the local measurement as
$${}^{c}\tilde{\mathbf{T}}_{i}^{k} = {}^{c}\mathbf{T}_{i}^{k} + \mathbf{n}_{L},$$
where $\mathbf{n}_{L}$ is Gaussian white noise conforming to $\mathcal{N}\!\left(\mathbf{0}, \boldsymbol{\Sigma}_{L}\right)$. The measurement noise is modeled as an independent zero-mean Gaussian distribution with covariance $\boldsymbol{\Sigma}_{L}$.
In this paper, the residual of the local measurement denotes the residual of the VO result. The product of $\left({}^{w}\mathbf{T}_{i}^{k-1}\right)^{-1}$ and ${}^{w}\mathbf{T}_{i}^{k}$ is the transformation matrix of Robot $i$ from time step $k-1$ to $k$ in the world reference frame. This predicted relative transformation is then multiplied with the local measurement $\left({}^{c}\tilde{\mathbf{T}}_{i}^{k-1}\right)^{-1}{}^{c}\tilde{\mathbf{T}}_{i}^{k}$ to form the residual of the local measurement.
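A minimal sketch of a between-factor-style residual consistent with the description above is given below, using SciPy rotations. The 6D residual stacks a rotation error (as a rotation vector) and a translation error, which is one common convention rather than necessarily the paper's exact parameterization.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def between_residual(q_prev, p_prev, q_curr, p_curr, q_meas, p_meas):
    """Residual of a VO relative-pose measurement between steps k-1 and k.

    q_* are unit quaternions in [x, y, z, w] (SciPy order); p_* are 3D positions.
    (q_meas, p_meas) is the measured relative motion from step k-1 to k.
    """
    R_prev, R_curr = R.from_quat(q_prev), R.from_quat(q_curr)
    # Predicted relative motion from the two state vertices (world-frame poses).
    R_pred = R_prev.inv() * R_curr
    p_pred = R_prev.inv().apply(p_curr - p_prev)
    # Error between measurement and prediction: rotation part as a rotation vector.
    r_rot = (R.from_quat(q_meas).inv() * R_pred).as_rotvec()
    r_trans = p_pred - p_meas
    return np.concatenate([r_rot, r_trans])

# Zero residual when the measurement equals the true relative motion.
qa = R.from_euler("z", 10, degrees=True).as_quat(); pa = np.array([0.0, 0.0, 0.0])
qb = R.from_euler("z", 25, degrees=True).as_quat(); pb = np.array([1.0, 0.5, 0.0])
q_rel = (R.from_quat(qa).inv() * R.from_quat(qb)).as_quat()
p_rel = R.from_quat(qa).inv().apply(pb - pa)
print(between_residual(qa, pa, qb, pb, q_rel, p_rel))  # ~[0 0 0 0 0 0]
```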
We divide the measurement residual into two parts according to the relative rotation and displacement and derive the Jacobian matrices for the rotational component $\mathbf{r}_{L,q}$ and the translational component $\mathbf{r}_{L,p}$, respectively. The rotational component $\mathbf{r}_{L,q}$ is differentiated with respect to $\mathbf{q}_{i}^{k-1}$, $\mathbf{p}_{i}^{k-1}$, $\mathbf{q}_{i}^{k}$, and $\mathbf{p}_{i}^{k}$ to obtain its Jacobian blocks, and the translational component $\mathbf{r}_{L,p}$ is differentiated with respect to the same state variables to obtain the corresponding Jacobian blocks.
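Because such analytic Jacobian blocks are easy to get wrong, a central-difference check is a standard sanity test. The sketch below is generic and is applied here to the translational part of a between-style residual, whose analytic Jacobian is $[-\mathbf{I},\ \mathbf{I}]$; it is a verification aid, not part of the estimator itself.

```python
import numpy as np

def numerical_jacobian(residual_fn, x, eps=1e-6):
    """Central-difference Jacobian of residual_fn at x, used to validate analytic Jacobians."""
    r0 = np.atleast_1d(residual_fn(x))
    J = np.zeros((r0.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x); dx[i] = eps
        J[:, i] = (np.atleast_1d(residual_fn(x + dx)) - np.atleast_1d(residual_fn(x - dx))) / (2 * eps)
    return J

# Example: translational between-style residual r = (p_k - p_{k-1}) - dp_meas,
# with stacked state x = [p_{k-1}, p_k]. Its analytic Jacobian is [-I, I].
dp_meas = np.array([1.0, 0.0, 0.0])
res = lambda x: (x[3:] - x[:3]) - dp_meas
x = np.array([0.0, 0.0, 0.0, 1.1, 0.2, -0.1])
J_analytic = np.hstack([-np.eye(3), np.eye(3)])
assert np.allclose(numerical_jacobian(res, x), J_analytic)
```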
After completing local state estimation in the first stage, it is also necessary to detect and track the relative measurements between robots in the second stage.
2.3.2. Visual Detection Factor
Relative measurements can be obtained by detecting and tracking targets using vision sensors. In this paper, a monocular camera is employed for target detection. When Robot $j$ appears in the field of view of Robot $i$, Robot $i$ utilizes an object detection algorithm (such as YOLO) to detect Robot $j$. It is assumed that the center coordinate of the detection bounding box in the pixel coordinate frame of the camera is $\left[u,\ v\right]^{\top}$. The position information of the detected Robot $j$ is broadcast by the communication module and then received by Robot $i$. The visual detection measurement can be calculated as the direction relative to the detected robot:
$$\tilde{\mathbf{z}}_{V}^{ij} = h\!\left(\left[u,\ v\right]^{\top}\right) = {}^{c}_{w}\mathbf{R}_{i}\,\frac{\mathbf{p}_{j}^{k} - \mathbf{p}_{i}^{k}}{\left\|\mathbf{p}_{j}^{k} - \mathbf{p}_{i}^{k}\right\|} + \mathbf{n}_{V},$$
where $h(\cdot)$ denotes the function that obtains the direction relative to the detected robot from the coordinates of the bounding box center, and its specific form can be referenced in [27]. ${}^{c}_{w}\mathbf{R}_{i}$ is the rotation matrix of Robot $i$ from the world reference frame to the camera reference frame. The relative vector is frame invariant, and $\mathbf{n}_{V}$ is Gaussian white noise that conforms to $\mathcal{N}\!\left(\mathbf{0}, \boldsymbol{\Sigma}_{V}\right)$. The measurement noise is modeled as an independent zero-mean Gaussian distribution with covariance $\boldsymbol{\Sigma}_{V}$.
The residual of the visual detection measurement is
$$\mathbf{r}_{V} = \tilde{\mathbf{z}}_{V}^{ij} - {}^{c}_{w}\mathbf{R}_{i}\,\frac{\mathbf{p}_{j}^{k} - \mathbf{p}_{i}^{k}}{\left\|\mathbf{p}_{j}^{k} - \mathbf{p}_{i}^{k}\right\|}.$$
The Jacobian matrices of the visual detection residual with respect to $\mathbf{p}_{i}^{k}$ and $\mathbf{p}_{j}^{k}$ are, with $\mathbf{e} = \mathbf{p}_{j}^{k} - \mathbf{p}_{i}^{k}$,
$$\frac{\partial \mathbf{r}_{V}}{\partial \mathbf{p}_{i}^{k}} = \frac{{}^{c}_{w}\mathbf{R}_{i}}{\left\|\mathbf{e}\right\|}\left(\mathbf{I}_{3} - \frac{\mathbf{e}\mathbf{e}^{\top}}{\left\|\mathbf{e}\right\|^{2}}\right), \qquad \frac{\partial \mathbf{r}_{V}}{\partial \mathbf{p}_{j}^{k}} = -\frac{{}^{c}_{w}\mathbf{R}_{i}}{\left\|\mathbf{e}\right\|}\left(\mathbf{I}_{3} - \frac{\mathbf{e}\mathbf{e}^{\top}}{\left\|\mathbf{e}\right\|^{2}}\right).$$
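The sketch below illustrates one plausible form of $h(\cdot)$ and of the visual detection residual, assuming a pinhole camera with hypothetical intrinsics $\mathbf{K}$; the actual $h(\cdot)$ follows [27] and may differ in detail.

```python
import numpy as np

# Hypothetical pinhole intrinsics for illustration (fx, fy, cx, cy).
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0,   0.0,   1.0]])

def h(center_uv):
    """Unit direction in the camera frame from a bounding-box center [u, v]."""
    ray = np.linalg.solve(K, np.array([center_uv[0], center_uv[1], 1.0]))
    return ray / np.linalg.norm(ray)

def visual_detection_residual(center_uv, R_cw, p_i, p_j):
    """Difference between measured and predicted unit bearings from Robot i to Robot j."""
    e = p_j - p_i
    predicted = R_cw @ (e / np.linalg.norm(e))   # world direction rotated into the camera frame
    return h(center_uv) - predicted

# Example: Robot j straight ahead of the camera maps to the principal point (320, 240).
R_cw = np.eye(3)                                  # camera aligned with the world frame here
p_i, p_j = np.zeros(3), np.array([0.0, 0.0, 5.0])
print(visual_detection_residual([320.0, 240.0], R_cw, p_i, p_j))   # ~[0, 0, 0]
```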
2.3.3. UWB Factor
From the UWB module, the relative distance between each pair of nodes can be obtained. The relative distance measurement between Robot $i$ and Robot $j$ can also be modeled with Gaussian noise as
$$\tilde{d}_{ij}^{k} = \left\|\mathbf{p}_{j}^{k} - \mathbf{p}_{i}^{k}\right\| + n_{U},$$
where $n_{U}$ is Gaussian white noise conforming to $\mathcal{N}\!\left(0, \sigma_{U}^{2}\right)$. The measurement noise is modeled as an independent zero-mean Gaussian distribution with covariance $\boldsymbol{\Sigma}_{U} = \sigma_{U}^{2}$.
The residual of the relative distance measurement between Robot $i$ and Robot $j$ is
$$r_{U} = \tilde{d}_{ij}^{k} - \left\|\mathbf{p}_{j}^{k} - \mathbf{p}_{i}^{k}\right\|.$$
The Jacobian matrices of the relative distance residual with respect to $\mathbf{p}_{i}^{k}$ and $\mathbf{p}_{j}^{k}$ are
$$\frac{\partial r_{U}}{\partial \mathbf{p}_{i}^{k}} = \frac{\left(\mathbf{p}_{j}^{k} - \mathbf{p}_{i}^{k}\right)^{\top}}{\left\|\mathbf{p}_{j}^{k} - \mathbf{p}_{i}^{k}\right\|}, \qquad \frac{\partial r_{U}}{\partial \mathbf{p}_{j}^{k}} = -\frac{\left(\mathbf{p}_{j}^{k} - \mathbf{p}_{i}^{k}\right)^{\top}}{\left\|\mathbf{p}_{j}^{k} - \mathbf{p}_{i}^{k}\right\|}.$$
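A minimal sketch of the UWB residual and its analytic Jacobians as written above (the numeric example uses an arbitrary measured range of 5.2 m):

```python
import numpy as np

def uwb_residual_and_jacobians(p_i, p_j, d_meas):
    """Range residual r = d_meas - ||p_j - p_i|| and its Jacobians w.r.t. p_i and p_j."""
    e = p_j - p_i
    d = np.linalg.norm(e)
    r = d_meas - d
    J_pi = e / d           # dr/dp_i (row vector)
    J_pj = -e / d          # dr/dp_j
    return r, J_pi, J_pj

p_i, p_j = np.array([0.0, 0.0, 0.0]), np.array([3.0, 4.0, 0.0])
r, J_pi, J_pj = uwb_residual_and_jacobians(p_i, p_j, d_meas=5.2)
print(round(r, 3), J_pi, J_pj)   # 0.2 [0.6 0.8 0.] [-0.6 -0.8 -0.]
```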
2.4. Data Association for Visual Detection Measurements
The JPDA is a well-known Bayesian target tracking filter. Instead of using a single measurement to update the state, it takes into account all measurements within the validation region of the predicted measurement of a target. JPDA considers each measurement within an overlapping validation region as potentially coming from clutter or from any of the targets whose regions intersect there and then assigns weights to the validated measurements [28]. Here, we only provide a brief overview of calculating the corresponding probabilities of multiple visual detection measurements at a single time step.
Assume that the set of validated measurements is represented as
$$Z(k) = \left\{\mathbf{z}_{j}(k)\right\}_{j=1}^{m_{k}},$$
where $\mathbf{z}_{j}(k)$ is the $j$-th validated measurement, and $m_{k}$ is the number of measurements in the validation region at time step $k$. The cumulative set of measurements up to time step $k$ is
$$Z^{k} = \left\{Z(\kappa)\right\}_{\kappa=1}^{k}.$$
All possible sources of each measurement can be indicated by creating a validation matrix. Define the validation matrix as
$$\Omega = \left[\omega_{jt}\right], \quad j = 1, \ldots, m_{k},\ \ t = 0, 1, \ldots, N_{t},$$
where $\omega_{jt}$ is a binary element, 0 or 1, and $t = 0$ denotes that a measurement originates from clutter. If the $j$-th visual measurement is associated with the target Robot $t$, then the corresponding validation matrix element $\omega_{jt}$ is 1; otherwise, it is 0. Hence, there is
$$\omega_{jt} = \begin{cases} 1, & \text{if measurement } j \text{ lies in the validation region of target } t, \\ 0, & \text{otherwise.} \end{cases}$$
The main purpose of the JPDA is to calculate the probability of associating each validated measurement with the target that could have generated it. The following assumptions are made: (1) each measurement can come from only one source; (2) each target can generate at most one true measurement.
According to the above rules, we can obtain the feasible matrices $\Omega^{e}$, $e = 1, \ldots, n_{E}$, where $n_{E}$ is the number of feasible matrices. Calculating the probability of a joint event requires scanning each feasible matrix row by row. If $\tau_{j} = 1$, then the Gaussian distribution function of the validated measurement is
$$f_{t_{j}}\!\left(\mathbf{z}_{j}(k)\right) = \mathcal{N}\!\left(\mathbf{z}_{j}(k);\ \hat{\mathbf{z}}_{t_{j}}(k \mid k-1),\ \mathbf{S}_{t_{j}}(k)\right),$$
and the probability of the joint event conditioned on the cumulative measurement set is
$$P\left\{\theta^{e}(k) \,\middle|\, Z^{k}\right\} = \frac{1}{c}\, \mu_{F}(\phi) \prod_{j=1}^{m_{k}} \left[f_{t_{j}}\!\left(\mathbf{z}_{j}(k)\right)\right]^{\tau_{j}} \prod_{t=1}^{N_{t}} \left(P_{D}^{t}\right)^{\delta_{t}} \left(1 - P_{D}^{t}\right)^{1 - \delta_{t}},$$
where $\theta^{e}(k)$ denotes the joint event based on the feasible matrix $\Omega^{e}$, $c$ is a normalization constant, $\phi$ is the number of false measurements, $\mu_{F}(\phi)$ is the probability mass function of the number of false measurements, $f_{t_{j}}(\cdot)$ is the probability density function of the true measurement, $\tau_{j}$ is the association indicator of the measurement, and $\delta_{t}$ is the detection indicator of the target. $P_{D}^{t}$ denotes the probability of detection of the target and is set manually according to the actual situation.
The association probability between measurement $j$ and target $t$ is obtained from the joint association probabilities, i.e.,
$$\beta_{jt}(k) = \sum_{e:\ \omega_{jt}^{e} = 1} P\left\{\theta^{e}(k) \,\middle|\, Z^{k}\right\}.$$
The final state estimate $\hat{\mathbf{x}}_{t}(k \mid k)$ at time step $k$ in terms of the association probabilities can be written as
$$\hat{\mathbf{x}}_{t}(k \mid k) = \sum_{j=0}^{m_{k}} \beta_{jt}(k)\, \hat{\mathbf{x}}_{t}^{j}(k \mid k),$$
where $\hat{\mathbf{x}}_{t}^{j}(k \mid k)$ is the updated state of target $t$ based on the event that associates the $j$-th validated measurement with it.
Therefore, using the JPDA filter, we can sum the probabilities of all feasible events at time step $k$ to obtain the association probability of each validated measurement with respect to target $t$ and then assign weights to the measurements based on these association probabilities.
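To make the enumeration of feasible events concrete, the following sketch brute-forces the association probabilities for a small two-target example. It assumes a parametric (Poisson-clutter) weighting with clutter density `lam`, detection probability `P_D`, and a chi-square gate, so the constants differ from, while the structure mirrors, the formulation above.

```python
import itertools
import numpy as np

def gaussian_pdf(z, z_hat, S):
    d = z - z_hat
    return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / np.sqrt(np.linalg.det(2 * np.pi * S))

def jpda_probabilities(z_hats, Ss, zs, P_D=0.9, lam=1e-3, gate=9.21):
    """Brute-force JPDA association probabilities beta[j][t] (Poisson clutter model).

    z_hats[t], Ss[t]: predicted measurement and innovation covariance of target t.
    zs[j]: received measurements. Returns beta with shape (len(zs), len(z_hats)).
    """
    n_z, n_t = len(zs), len(z_hats)
    # Gating: omega[j, t] = True if measurement j falls in the validation region of target t.
    omega = np.array([[(zs[j] - z_hats[t]) @ np.linalg.solve(Ss[t], zs[j] - z_hats[t]) < gate
                       for t in range(n_t)] for j in range(n_z)])
    beta = np.zeros((n_z, n_t))
    total = 0.0
    # A feasible event assigns each target either no measurement (-1) or one gated
    # measurement, with no measurement shared between targets.
    for assign in itertools.product(range(-1, n_z), repeat=n_t):
        used = [a for a in assign if a >= 0]
        if len(used) != len(set(used)):
            continue
        if any(a >= 0 and not omega[a, t] for t, a in enumerate(assign)):
            continue
        w = 1.0
        for t, a in enumerate(assign):
            w *= P_D * gaussian_pdf(zs[a], z_hats[t], Ss[t]) / lam if a >= 0 else (1 - P_D)
        total += w
        for t, a in enumerate(assign):
            if a >= 0:
                beta[a, t] += w
    return beta / total

# Two targets with overlapping validation regions and three measurements (one likely clutter).
z_hats = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
Ss = [np.eye(2), np.eye(2)]
zs = [np.array([0.2, 0.1]), np.array([1.6, 0.0]), np.array([2.9, -0.2])]
print(jpda_probabilities(z_hats, Ss, zs).round(3))
```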
To solve the data association problem in visual detection measurement, we implement the JPDA alongside the factor graph; the probabilities obtained from the JPDA filter are then used as weights for the factors.
We assume that $\beta_{s}^{ij}$ is the probability of the $s$-th validated measurement $\tilde{\mathbf{z}}_{V,s}^{ij}$ of Robot $i$ with respect to Robot $j$; normalizing it yields
$$\bar{\beta}_{s}^{ij} = \frac{\beta_{s}^{ij}}{\sum_{s=1}^{n_{ij}} \beta_{s}^{ij}}.$$
For the $n_{ij}$ measurements of Robot $i$ relative to Robot $j$, each measurement has a certain probability. We can rewrite the visual detection measurement as
$$\tilde{\mathbf{z}}_{V}^{ij} = \sum_{s=1}^{n_{ij}} \bar{\beta}_{s}^{ij}\, \tilde{\mathbf{z}}_{V,s}^{ij}.$$
Therefore, the corresponding visual detection factor is
$$\mathbf{r}_{V} = \sum_{s=1}^{n_{ij}} \bar{\beta}_{s}^{ij} \left(\tilde{\mathbf{z}}_{V,s}^{ij} - {}^{c}_{w}\mathbf{R}_{i}\,\frac{\mathbf{p}_{j}^{k} - \mathbf{p}_{i}^{k}}{\left\|\mathbf{p}_{j}^{k} - \mathbf{p}_{i}^{k}\right\|}\right).$$
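One plausible way to realize the probability-weighted visual detection factor in code is sketched below, reusing the unit-bearing residual form from Section 2.3.2; the weights would come from the JPDA computation above, and weighting the measurements (equivalent here to weighting the residuals) is our reading of the equations rather than the authors' code.

```python
import numpy as np

def weighted_visual_residual(betas, z_list, R_cw, p_i, p_j):
    """Probability-weighted visual detection residual over validated measurements."""
    betas = np.asarray(betas, dtype=float)
    w = betas / betas.sum()                              # normalized association probabilities
    e = p_j - p_i
    predicted = R_cw @ (e / np.linalg.norm(e))           # predicted unit bearing in camera frame
    z_eff = sum(wi * zi for wi, zi in zip(w, z_list))    # combined (weighted) measurement
    return z_eff - predicted

# Three validated unit-bearing measurements with JPDA weights favoring the first.
z_list = [np.array([0.0, 0.0, 1.0]),
          np.array([0.05, 0.0, 0.9987]),
          np.array([0.0, 0.1, 0.995])]
r = weighted_visual_residual([0.7, 0.2, 0.1], z_list, np.eye(3),
                             np.zeros(3), np.array([0.0, 0.0, 4.0]))
print(r)
```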
Data association based on visual detection can be understood from Figure 2. When Robot $i$ detects both Robot $j$ and Robot $l$, there may be multiple object detection bounding boxes or clutter measurements for a single target. Therefore, Robot $i$ has multiple relative measurements with respect to Robot $j$ or Robot $l$, and data association must be performed on these measurements. We use JPDA to enable Robot $i$ to track measurements relative to multiple target robots. First, a validation region is defined for these measurements to exclude invalid ones. In the example in Figure 2, Robot $i$ has four valid measurements relative to Robot $j$ and three valid measurements relative to Robot $l$. The predicted visual detection measurements of Robot $i$ relative to Robot $l$ and Robot $j$ define the centers of the two validation regions, and the validated measurements lie within them. In particular, one measurement lies in the intersection of the two validation regions, indicating that it may come from Robot $i$'s observation of either Robot $l$ or Robot $j$. Then, JPDA is used to calculate the probability of each validated measurement. In Figure 2, the size of the rectangle in the validation region represents the probability of the measurement. These calculated probabilities are used in the above equation. Therefore, only validated relative measurements contribute to the final solution, according to their weights.
4. Discussion
This paper proposes a resilient cooperative localization algorithm based on a factor graph. In this method, a factor graph model of the resilient cooperative localization system is established, and the mathematical model for constructing the cost function is derived in detail. The measurement, the information matrix, the error function, and the Jacobian matrix of the error function with respect to the relevant state variables are provided. The method solves the problem of visual detection data association in factor graphs by using the JPDA algorithm. The front-end is divided into a local measurement stage and a relative measurement stage, and optimized fusion based on a factor graph is implemented in the back-end. Each robot operates independently to achieve decentralized cooperative localization. Therefore, this solution achieves plug-and-play sensor integration, scalability, and resilience in complex environments.
The FG-RCL algorithm is applicable to multirobot cooperative localization in GNSS-denied environments. However, when visual detection measurements are subject to interference, erroneous visual information can affect the optimization of the global cost function during factor graph fusion, leading to an unsatisfactory final estimate. Although this paper uses the JPDA algorithm to solve the multi-target data association problem in visual detection measurement, it still relies on the correct recognition of targets. When the cooperative localization system fails to acquire visual information for a long time or is subjected to strong environmental interference, the localization results can deviate. For example, in a low-light environment, insufficient lighting seriously weakens the recognition performance of the visual object detection module; meanwhile, small targets cannot be recognized effectively, so the system misses part of the relative visual measurements.
In the future, the FG-RCL algorithm can be combined with sensor selection algorithms, such as consistency-checking-based sensor optimization methods [31], to address the problems of sensor deviation and noise model updates within the factor graph framework during the fusion process.
In addition, a robust communication network is essential for cooperative localization systems. In an ideal communication scenario, wireless communication networks can be established through point-to-point methods to achieve localization at lower cost and complexity. However, if the system suffers occasional message loss due to external events, communication within the network can fail. Corresponding resilience measures should therefore be adopted to make the cooperative localization system more robust in such situations.