1. Introduction
Nowadays, unmanned aerial vehicles (UAVs) have received great attention from the robotics research community. In this case, one of the main objectives has been the improvement of the autonomy of these systems. In particular, the multi-rotor aerial systems allow great versatility of movements, making this kind of aerial platform very useful for a great variety of applications. Altogether with the recent advances in computational processing, computer vision has become an important tool in order to improve the autonomy of robotics systems. Cameras are well adapted for embedded systems because they are inexpensive, lightweight and power-saving. For instance, applications of surveillance [
1], tracking and rescue [
2], among others, seem to be feasible for aerial robots equipped with onboard cameras.
A fundamental requirement in order to improve the autonomy of an aerial robot has to do with the capacity of self-location and perception of the operational environment. In this case, for most applications, GPS (Global Positioning System) still represents the main alternative for addressing the localization problem. Nevertheless, the use of GPS presents some drawbacks, for instance, the precision error can be substantial, and it provides poor operability due to multipath propagation. However, several mission profiles require the UAVs to fly in GPS-challenging or GPS-denied environments, as in natural and urban canyons [
3]. The use of range sensors like laser, sonar or radar (see [
4,
5,
6]) allows obtaining knowledge about the environment of the robot. However, this kind of sensor can be expensive and sometimes heavy, and its use in outdoor environments can be somewhat limited. For instance, sonar systems have a limited range of operation. Active laser systems (e.g., LiDAR) represent a very interesting sensing technology; they can operate under any visibility condition (i.e., both day and night, unlike cameras) and can directly provide 3D measurements about the surrounding environment. On the other hand, LiDAR is generally expensive; it can overload the system for certain applications; and it has moving parts, which can generate error.
1.1. Related Work
Visual SLAM is a technique that makes use of visual features as landmarks. Visual SLAM is intended to address the navigation problem of a robot moving in a previously unknown environment, while it provides information about the environment, using mainly angular measurements obtained from cameras. Currently, there are two main approaches for implementing vision-based SLAM systems: (i) filtering-based methods [
7,
8,
9] and (ii) the optimization-based methods [
10,
11]. While the latter approach is shown to give accurate results when the availability of computational power is enough, filtering-based SLAM methods might be still beneficial if limited processing power is available [
12].
Some examples of visual SLAM approaches applied to unmanned aerial vehicles are [
13,
14]. In [
15], a visual SLAM proposal that adds inertial measurements given by an Inertial Measurement Unit (IMU) is presented. The potential problem with this kind of approach is related to the fact that the acceleration obtained from the IMU has a dynamic bias, which is difficult to estimate. In [
16], an EKF-based (Extended Kalman Filter) method is proposed in order to perform visual odometry with an unmanned aircraft. This method uses inertial sensors, a monocular downward facing camera and a range sensor (sonar altimeter). Unlike vision-based SLAM, in visual odometry approaches, there is not a mapping process. Furthermore, in those approaches, the operating altitude of the UAV is limited by the operating range of the sonar. More recently, new approaches appeared addressing the problem of visual-based navigation in GPS-denied environments, such as [
17,
18,
19].
Multi-robot systems have also received great attention from the robotics research community. This attention is motivated by the inherent versatility that this kind of system has for performing tasks that could be difficult for a single robot. The use of several robots shows advantages like cost reductions, more robustness, better performance and efficiency [
20,
21]. In the case of the SLAM problem, in [
22,
23], a centralized architecture is used where all vehicles send their sensor data to a unique Kalman filter. In [
16,
24,
25], the idea of combining monocular SLAM with cooperative, multi-UAV information to improve navigation capabilities in GPS-challenging environments is presented.
In works like [
26,
27,
28,
29], it has been shown that 6DOF-SLAM (six degrees of freedom), based only on angular measurements (i.e., monocular SLAM), is a partially observable system that can be applied to both the single-robot case and the multi-robot case. In [
30], cooperative localization with visual information is addressed. According to the analysis presented in that work, the proposed system is completely observable. However, in this case, only distances and the relative orientations between robots are estimated. This fact can represent a clear drawback for applications where global measurements of the system are required (e.g., absolute position).
1.2. Objectives and Contributions
In this work, nonlinear observability properties of an aerial multi-robot system are analyzed. Based on this analysis, it is shown that the observability properties of this kind of system are improved by the inclusion of measurements of the relative distance between the aerial robots. Furthermore, based on the observability analysis, it is shown that the cooperative approach has theoretical advantages with respect to other configurations like the single-robot monocular SLAM approach. In addition, it is demonstrated that in a system composed of several UAVs, the observation of common landmarks is a sufficient condition in order to propagate through the whole system the information provided by the measurement of the relative distance between two robots. This property allows flexibility in the system as opposed to the absolute need for multiple contacts between robots.
In order to take advantage of all the above theoretical results, in this work, a novel cooperative monocular-based SLAM approach for multi-UAV systems is proposed. The system model is composed of the dynamics of each aerial robot and the Euclidean position of each landmark. The measurements of the system are the projections of the landmarks in the images, provided by the monocular cameras carried individually in every aerial robot. Additionally, as was mentioned before, the availability of some measurements about the relative distance between two robots is assumed.
In order to accomplish the requirement of having measurements of the relative distance between two robots, a technique based on a homography is also presented in this research. The main idea is to exploit the physical structure of the aerial robots in order to obtain measurements of relative distances by means of visual information. In this case, the method is developed assuming a team of quadrotors. It is important to remark that this proposed approach could be also applied to many other aerial platforms. The only requirement for the presented approach is that at least one robot has to be maintained inside the field of view of another aerial robot, while sharing the observation of one common visual landmark (see
Figure 1).
The geometric structure of a typical quadrotor is cross-shaped, and therefore, each rotor is mounted at the different ends of the cross. This kind of physical geometry can allow a standard computer vision algorithm to extract and track the centroids of the rotors. In this case, those centroids can be assumed to be coplanar. In order to compute the relative distance from one quadcopter in the field of view of another one, a homography is applied from the camera coordinate reference system of the observing robot to the plane formed by the four rotors of the robot being observed. The information obtained by the homography is fused with the orientation of the observing robot, provided by an IMU, which finally allows one to obtain measurements of relative distances. It is important to note that, based on the theoretical results presented in this work, it should be straightforward to replace the homographic technique used for estimating the relative distance between UAVs by another technique that would provide a similar measurement.
In addition to the benefit of improving the observability of the system, the relative distance obtained between any pair of robots provides metric information of the system, which is an important issue to be addressed in monocular-based systems. For example, in other configurations, the metric information is obtained purely from inertial systems (i.e., monocular/Inertial Navigation Systems (INS) solutions), but inertial sensors present some drawbacks due to the large drift bias, which is inherent to this kind of sensor [
31,
32].
In the proposed system, in order to take advantage of the multi-UAV cooperative configuration, the initialization process of new map features is carried out through a pseudo-stereo system composed of two monocular cameras mounted on two UAVs respectively (one camera per UAV) that observe common landmarks. This approach allows initializing landmarks with less uncertainty than a pure monocular system since 3D information of the position of landmarks is gathered from the beginning of the observation. It is well known that, in visual SLAM, the initialization process can play an important role in the convergence of the filter. Having a flexible baseline in the pseudo-stereo system allows one to initialize landmarks at distances far away with less uncertainty, unlike stereo systems with a rigid baseline [
32] or delayed monocular initialization methods. The above fact allows the proposed cooperative system to have better performance in environments where landmarks are far from the measurement system, contrary to SLAM approaches based on depth cameras, stereo systems, monocular cameras or sonars.
1.3. Paper Outline
The document is organized in the following manner:
Section 2 presents the specifications of the system;
Section 3 presents the nonlinear observability analysis that represents the theoretical basis of the proposed method;
Section 4 presents the proposed cooperative approach for monocular-based SLAM; in
Section 5, the results obtained from numerical simulations are presented in order to validate the proposal, and finally, in
Section 6, some final remarks are given.
3. Observability Analysis
In this section, the nonlinear observability properties of an aerial multi-robot system are studied. Observability is an inherent property of a dynamic system and has an important role in the accuracy of its estimation process; moreover, this fact has important consequences in the context of SLAM.
A system is defined as observable if the initial state
, at any initial time
, can be determined given the state transition and observation models of the system and observations
, from time
to a finite time
t. In [
37], it is demonstrated that a non-linear system is locally weakly observable if the observability rank condition
is verified, where
is the observability matrix.
As previously mentioned, 6DOF-monocular SLAM represents a kind of partially-observable system with a high number of unobservable modes and states that can be applied to both the single-robot case and the multi-robot case. The following references are examples of works where the problem of the observability of 6DOF-monocular SLAM systems has also been studied, such as [
26,
27,
28,
29].
For the analysis developed in this work, and for the sake of simplicity, the system (
1) is redefined as:
Let be the Euler angles of the j-th camera with respect to the coordinate system W.
The observability matrix
can be computed as:
Let
be the
s-th-order Lie derivative [
38] of the scalar field
with respect to the vector field
. For example, in (
23), the zero-order and first-order Lie derivatives are used for each measurement.
For the measurement given by a monocular camera, according to (
4) and (
22), the following zero-order Lie derivative can be defined:
where:
and:
Note that
denotes the antisymmetric matrix of the vector
. The first-order Lie derivative can also be defined in the following:
where:
with:
and:
Considering the case where relative measurements of the distance between robots are available, the following statement can be defined from (
21) and (
22):
For the zero-order Lie derivative, if
(the index of the observing robot is lesser than the index of the observed robot):
On the other hand, if
, then:
and:
where
is the identity matrix.
For the first-order Lie derivative, if
:
On the other hand, if
(the index of the observing robot is higher than the index of the observed robot), then:
with
With the above considerations, the observability matrix for the proposed system (
22) can be defined as follows:
The maximum rank of the observability matrix (
42) is
, where
is the number of landmarks being measured and
is the number of robots.
is multiplied by three, since this is the number of states per landmark given by the Euclidean parametrization.
is multiplied by 12, since this is the number of states per robot given by its global position, orientation (Euler angles) and its derivatives. Therefore,
will be rank deficient (
).
The unobservable modes are spanned by the right nullspace basis of the observability matrix
; therefore:
It is straightforward to verify that the right nullspace basis of spans for (i.e., ).
From (
43), it can be seen that the system is partially observable and that the unobservable modes cross with the states that correspond to the global position of the robots and the landmarks; these states are unobservable. An important conclusion is that all the vectors of the right null space basis are orthogonal with the rest of the states, and therefore, these states are completely observable.
The results of the observability analysis are summarized in
Table 1.
Some important remarks on the analysis can be extracted:
From the above results, it can be concluded that the proposed cooperative system, although still partially observable, considerably reduces the unobservable modes and states with respect to the 6DOF-monocular SLAM system. This contribution represents an advantage to improve the accuracy and consistency in the estimation process.
5. Computer Simulations Results
In this section, computer simulation results are presented. The computer simulations were performed in order to validate the performance of the proposed method. A MATLAB implementation was used for this purpose.
With the intention of making an exhaustive analysis of the performance of the proposed system, a comparison is carried out with respect to the other three typical single-robot SLAM configurations. The comparison allows one to note the advantages and drawbacks of multi-UAV systems compared with single robot systems.
For the computer simulations setup, two quadcopters equipped with an onboard monocular camera are simulated, while moving maintaining a stable flight formation. In this case, a Quadcopter (Quad 2) navigates over the other (Quad 1) at an arbitrary relative distance. In the computer simulations, it is considered that Quad 1 remains all the time inside the visual field of Quad 2. It is also assumed that there exist some landmarks observed in common by the cameras of both quadcopters.
The characteristics of the three SLAM configurations used for the comparison are described below:
The first configuration to be compared is monocular SLAM. In this case, the estimates are obtained from the monocular camera carried by Quad 1. The Monocular SLAM approach used to implement this configuration is based on the method proposed in [
40]. In this method, the map features are parametrized with the inverse depth parametrization. Both the initialization and update process are performed by means of the monocular measurements. The metric scale of the estimates cannot be retrieved when only monocular vision is used. For this reason, for this configuration, it is assumed that the position of the landmarks seen in the first frame (at the beginning of the flight trajectory) is perfectly known.
The second configuration to be compared is stereo SLAM. In this case, the estimates are obtained from a stereo system, with a baseline of 15 cm, carried by Quad 1. In this method, the map features are parametrized with the Euclidean parametrization. The feature initialization process is carried out directly by means of the 3D information provided by the stereo system. The state update is also performed using the stereo measurements.
The third configuration to be compared is a hybrid system stereo-monocular SLAM. In this case, the estimates are obtained from a stereo system, with a baseline of 15 cm, carried by Quad 1. In this method, the map features are parametrized with the Euclidean parametrization. The features initialization process is carried out directly by means of the 3D information provided by the stereo system. Unlike the second configuration, in this case, the state update is performed through monocular measurements obtained from one of the cameras of the stereo system.
In computer simulations, it is assumed that the initial condition of the quadcopter states is known with certainty. In order to emulate uncertainty, Gaussian noise with pixels is added to the measurements given by the cameras. The measurements from the cameras are taken with a frequency of 10 Hz. The intrinsic parameters used for the cameras are and . The environment is composed of 3D points, randomly distributed over the ground. Furthermore, it is assumed that the camera can detect and track visual features without error, avoiding the data association problem. Furthermore, the problem of the influence of the estimates on the control system was not considered. In other words, an almost perfect control over the vehicle is assumed. The trajectory followed by the vehicles begins near the ground, then it moves away from the initial position taking a higher altitude as the trajectory progresses.
The average NEES (Normalized Estimation Error Squared [
41]) over
Monte Carlo runs was used in order to evaluate the consistency of each method, as proposed in [
42]. The NEES is estimated as follows:
The average NEES is computed from:
Figure 6 shows the real and estimated trajectory obtained from the cooperative system.
Figure 7 shows the real and estimated trajectory obtained with all the configurations. Note that in this case, only the trajectory of Quad 1 is presented. In this simulation, it can be seen that as the trajectory evolves, the error considerably increases for the single-robot configurations. On the other hand, for the proposed (cooperative) method, the error is better bounded.
Figure 8 shows the evolution over time of the real and estimated states (position and orientation) for Quad 1. In this case, the initial results are confirmed. The results of the estimated state of Quad 2 are not shown, but they are closely similar to those presented for Quad 1.
Table 2 summarizes the Mean Squared Error (MSE) for the position in the three axes of Quad 1.
Figure 9 shows the average NEES over 50 Monte Carlo runs obtained for each method. The average NEES is calculated taking into account the twelve variables that define the complete state of the vehicle (position, orientation, linear velocity and angular velocity). It is very interesting to note how the consistency of the filter considerably degenerates in the three cases of the single-robot configurations. On the other hand, for the cooperative case, the consistency of the filter remains practically stable.
Figure 10 shows the relative distances (from Quad 1 to Quad 2) estimated with the method proposed in
Section 2. It can be seen that these measurements are good enough to be used to update the filter (see
Section 4). It is important to remark that the observability results presented in
Section 3 depend on these measurements. The lower-right plot of
Figure 10 shows an image frame captured from the monocular camera carried by Quad 2. In this case, the projection of the landmarks can be appreciated, as well as the projections of the four rotors of Quad 1 needed to compute the homography.
In order to compare the quality of the measurements obtained with the fixed stereo system and those obtained with the cooperative pseudo-stereo system, some computer simulations were performed. In this case, the error was computed for the estimated landmarks’ positions, assuming that the position of Quad 1 was perfectly known along the flight trajectory. For the fixed stereo system, the camera-camera calibration is perfectly known. For the cooperative pseudo-stereo system, the camera-camera calibration is obtained from the homography, and therefore, it presents a certain level of error.
Figure 11 shows the absolute value of the mean error obtained for both methods. In this experiment, the same measurements were performed for both systems. In the lower-right plot, the number of measurements per frame is shown. In the case of the fixed stereo system, the accuracy of its measurements is affected by the small baseline between cameras. This is especially notorious when the vehicle moves far away from the landmarks (the altitude is increased). In the case of the cooperative pseudo-stereo system, the error in estimation is much better bounded, although the calibration of the system is not perfectly known. A suitable explanation has to do with the possibility of having an arbitrarily greater baseline between the cameras.
Figure 12 illustrates the above fact. In this case, the statistical results obtained from simulating the measurement of a single landmark with (i) the cooperative pseudo-stereo system and (ii) a monocular method are presented. In the simulation, the UAV-Camera 1 system is located at
at instant
k. The UAV-Camera 2 system is located at
at instant
k. Thus, the baseline in the cooperative system is equal to
meters. A landmark is located at
. In order to model the inaccuracies associated with the cooperative pseudo-stereo approach, the estimated location of the UAV-Camera 2 system was modeled by adding a Gaussian noise with
cm to its actual location. In order to emulate the monocular measurements, it is assumed that the UAV-Camera 1 system was moved (at some instant
) to
to generate a parallax with respect to the landmark. Thus, the baseline in the monocular system is equal to
meters. The drift associated with the estimated displacement of the UAV-Camera 1 system is modeled by adding Gaussian noise with standard deviation
cm to the actual location at instant
. In all cases, the angular measurements provided by the cameras are corrupted by Gaussian noise with
degrees. Using the above conditions, a Monte Carlo simulation with 1000 executions has been used to estimate the landmark position with linear triangulation. In
Figure 12, ellipsoids are used to illustrate the uncertainties in the estimated positions. According to the simulation results, it is better to have a larger baseline between two cameras with greater position uncertainty (like the cooperative system) than a small baseline with small uncertainty (like monocular measurements with low parallax).
In practical applications, there are several related factors that can severely also affect the performance of a system. For instance, in visual SLAM, the data association problem is critical for these approaches to be reliable. Although currently, there are several methods available for rejecting outliers, it is difficult to completely eliminate this problem. In addition, in cooperative visual systems, the data association problem can be extended from the single-image case to the multiple-image case. Furthermore, a problem that can arise in multi-robot systems, contrary to the mono-robot systems, is related to the communication issues between robots. This problem can cause loss of information or even make the interchange of information impossible during certain periods.
In order to take into account the above practical considerations, a set of additional computer simulations is presented. In this case, based on the same simulation setup used previously, the following aspects are now added: (i) outliers for the visual data association in each camera; (ii) outliers for the cooperative visual data association; (iii) outages of communication between robots; (iv) failures in the homography-based technique used for estimating the relative distance between robots.
In order to emulate the failures of the visual data association process, of the total number of visual correspondences are forced to be outliers in a random manner. In this case, each outlier is modeled by means of a big measurement error of pixels. With the objective of having a good insight into the performance of the proposed method, under the above conditions, a comparison with a reliable general method is carried out. In this case, the method chosen is a monocular SLAM system aided by measurements of the position given by a GPS and attitude measurements obtained from an IMU (monocular SLAM + GPS + IMU).
Table 3 shows the number of failures introduced into the simulation: (i) the number of outliers introduced in the visual tracking process of Quad 1; (ii) the number of outliers introduced in the visual tracking process of Quad 2; (iii) the number of outliers introduced in the visual data association process used for cooperatively measuring the landmarks by means of Quad 1 and Quad 2; (iv) the number of outages in communication between robots, which result in filter update not being carried out with the information given by Quad 2; and (v) the number of failures in the homography-based technique, which result in the filter update not being carried out with the information given by the relative distance between the Quads.
Figure 13 shows the real and estimated trajectory obtained with the two configurations: (i) cooperative SLAM; and (ii) monocular SLAM + GPS + IMU.
Figure 14 shows the evolution over time of the real and estimated states (position and orientation) of Quad 1 obtained with both configurations. Note that in this case, only the trajectory of Quad 1 is presented for illustration purposes, but estimates of Quad 2 are closely similar to those presented for Quad 1.
Table 4 summarizes the mean squared error for the position in the three axes of Quad 1 obtained with both configurations. In this simulation, both configurations have a good performance, in the case of monocular SLAM + GPS + IMU, this result was expected, since this system has enough sources of information to determine all the states. The cooperative system shows a good performance despite all the failures introduced into the system. The above study provides a good insight about the robustness of the proposed (cooperative) system.
Table 5 provides an insight into the performance of the proposed method for estimating the features map. In this case, the total (sum of all) of the mean squared errors for the estimated position of the landmarks is presented for both configurations. Furthermore, the total of the mean squared errors for the initial estimated position of the landmarks is presented. Note that the results are presented for each coordinate of the reference frame
W. The results show that the proposed cooperative system has a better performance than the monocular SLAM + GPS + IMU system, regarding the error obtained in the estimation of the position of the landmarks, although the latter has more sources of information provided by its sensors.
6. Conclusions
In this work, a vision-based cooperative SLAM system with application to unmanned aerial vehicles has been presented. The general idea is to take advantage of a cooperative UAV scheme in order to improve the accuracy and consistency of the state estimation process of the whole system. To achieve this purpose, it was proposed to add some relative distances between the robots as system measurements for updating the EKF. These measures provide metric information to the system, unlike other configurations where the scale of the system is a problem. Through a non-linear observability analysis, it is verified that the observability of the cooperative system improves the observability obtained for a single-robot configuration. In this case, the observability of the system is improved by adding the measures of relative distances. Sufficient conditions required for obtaining the observability results were established. In order to infer the 3D knowledge of the position of the landmarks for initializing the map features with less uncertainty, in the proposed method, pseudo-stereo systems are formed from pairs of aerial robots.
An extensive set of computer simulations was performed in order to validate the proposed method. In the computer simulations, the proposed system was compared against four single-robot configurations of visual SLAM. Based on the results of the simulations, it can be observed how the proposed method (cooperative) improves the estimation of the state with respect to the other configurations. The difference in the performance of the systems is especially notorious when the distance from the cameras to the landmarks increases. Furthermore, it was shown that the consistency of the filter is improved with the proposed method. Computer simulations also show that the accuracy of the measurements obtained from the pseudo-stereo system is better than the measurements obtained from a stereo system with a fixed small baseline.
In computer simulations, an effort has been made in order to emulate several aspects regarding applicability in real scenarios of the proposed approach. For instance, the data association problem has been considered by emulating outliers (mismatches) during the tracking of visual features on each monocular camera, as well as on the pseudo-stereo matching. Furthermore, issues for the multi-robot communication were considered, as well as failures on the homography technique used to provide measurements of the relative distance between robots. However, although computer simulations are useful for evaluating the full statistical consistency of the methods, they can still neglect important practical issues that appear when the methods are used in real scenarios. In this sense, it is important to note that future work should be focused on developing experiments with real data in order to validate the applicability of the proposed approach fully. Therefore, it should be interesting to investigate more practical aspects, like the homography-based technique or the pseudo-stereo matching process.