Monocular Visual SLAM Based on a Cooperative UAV–Target System

To obtain autonomy in applications that involve Unmanned Aerial Vehicles (UAVs), the capacity of self-location and perception of the operational environment is a fundamental requirement. To this effect, GPS represents the typical solution for determining the position of a UAV operating in outdoor and open environments. On the other hand, GPS cannot be a reliable solution for a different kind of environments like cluttered and indoor ones. In this scenario, a good alternative is represented by the monocular SLAM (Simultaneous Localization and Mapping) methods. A monocular SLAM system allows a UAV to operate in a priori unknown environment using an onboard camera to simultaneously build a map of its surroundings while at the same time locates itself respect to this map. So, given the problem of an aerial robot that must follow a free-moving cooperative target in a GPS denied environment, this work presents a monocular-based SLAM approach for cooperative UAV–Target systems that addresses the state estimation problem of (i) the UAV position and velocity, (ii) the target position and velocity, (iii) the landmarks positions (map). The proposed monocular SLAM system incorporates altitude measurements obtained from an altimeter. In this case, an observability analysis is carried out to show that the observability properties of the system are improved by incorporating altitude measurements. Furthermore, a novel technique to estimate the approximate depth of the new visual landmarks is proposed, which takes advantage of the cooperative target. Additionally, a control system is proposed for maintaining a stable flight formation of the UAV with respect to the target. In this case, the stability of control laws is proved using the Lyapunov theory. The experimental results obtained from real data as well as the results obtained from computer simulations show that the proposed scheme can provide good performance.


Introduction
Nowadays, unmanned aerial vehicles (UAVs), computer vision techniques, and flight control systems have received great attention from the research community in robotics. This interest has resulted in the development of systems with a high degree of autonomy. UAVs are very versatile platforms and very useful for several tasks and applications [1,2]. In this context, a fundamental problem to solve is the estimation of the positions of UAVs. For most applications, GPS (Global Positioning System) still represents the main alternative for addressing the localization problem of UAVs. However, GPS comes with some well-known drawbacks associated with its use. For instance, in scenarios where GPS signals are jammed intentionally [3] or when the precision error can be substantial and they provide poor operability due to multipath propagation (e.g., natural and urban canyons [4,5]). Furthermore, there are scenarios where the GPS is inaccessible (e.g., indoor). Hence, to improve accuracy and robustness, additional sensory information, like visual data, can be integrated into the system. Cameras are lightweight, inexpensive, power-saving, and provide lots of information, moreover, they are well adapted to be integrated into embedded systems. In this context, visual SLAM methods are important options that allow a UAV to operate in an a priori unknown environment using only on-board sensors to simultaneously build a map of its surroundings while, at the same time, locating itself in respect to this map. On the other hand, perhaps the most important challenge associated with the application of monocular SLAM techniques has to do with the metric scale [6]. In monocular SLAM systems, the metric scale of the scene is difficult to retrieve, and even if the metric scale is known as an initial condition, the metric scale tends to degenerate (drift) if the system does not incorporate continuous metric information.
Many works can be found in the literature where visual-based SLAM methods are used for UAV navigation tasks (e.g., [7,8]). For SLAM based on monocular vision, different approaches have been followed for addressing the problem of the metric scale. In [9], the position of the first map features is determined by knowing the metrics of an initial pattern. In [10], a method with several assumptions about the structure of the environment is proposed; one of these assumptions is the flatness of the floor. This restricts the use of the method to very specific environments. Other methods like [11] or [12] fuse inertial measurements obtained from an inertial measurement unit (IMU) to recover the metric scale. A drawback associated with this approach has to do with the dynamic bias of the accelerometers which is very difficult to estimate. In [13], the information given by an altimeter is added to the monocular SLAM system to recover the metric scale.
The idea of applying cooperative approaches of SLAM to UAVs has also been explored. For example, [14,15] present a Kalman-filter-based centralized architecture. In [16][17][18], monocular SLAM methods for cooperative multi-UAV systems are presented to improve navigation capabilities in GPS-challenging environments. In [19], the idea of combining monocular SLAM with cooperative human-robot information to improve localization capabilities is presented. Furthermore, a single-robot SLAM approach is presented in [20], where the system state is augmented with the state of the dynamic target. In that work, robot position, map, and target are estimated using a Constrained Local Submap Filter (CLSF) based on an Extended Kalman filter (EKF) configuration. In [21], the problem of cooperative localization and target tracking with a team of moving robots is addressed. In this case, a least-squares minimization approach is followed and solved using sparse optimization. However, the main drawback of this method is related to the fact that the positions of landmarks are assumed a priori. In [22], a range-based cooperative localization method is proposed for multiple platforms with different structures. In this case, the dead reckoning system is implemented by means of an adaptive ant colony optimization particle filter algorithm. Furthermore, a method that incorporates the ultra-wideband technology into SLAM is presented in [23].
In a previous work by the authors [24], the problem of cooperative visual-SLAM based tracking of a lead agent was addressed. With big differences from the present work, where the (single robot) monocular-SLAM problem is addressed, in [24] a team of aerial robots in flight formation had to follow the dynamic lead agent. When two or more camera-robots are considered in the system, the problem of landmark initialization, as well as the problem of recovering the metric scale of the world, can be solved using a visual pseudo-stereo approach. On the other hand, the former problems can constitute a bigger challenge, if only a camera-robot is available in the system. This work deals with this last scenario.

Objectives and Contributions
Recently, in [25], a visual SLAM method using an RGB-D camera was presented. In that work, the information given by the RGB-D camera is used to directly obtain depth information of its surroundings. However, the depth range of that kind of camera is quite limited. In [26], a method for the initialization of characteristics in visual SLAM, employing the algorithm based on planar homography constraints, is presented. In that case, it is assumed that the camera only moves in a planar scene. In [27], a visual SLAM system that integrates a monocular camera and a 1D-laser range finder is presented; it seeks to provide scale recovering and drift correction. On the other hand, LiDAR-like sensors are generally expensive and can over weigh the system for certain applications presenting moving parts which can induce some errors.
Trying to present an alternative to related approaches, in this work, the use of a visual-based SLAM scheme is studied for addressing the problem of estimating the position of an aerial robot and a cooperative target in GPS-denied environments. The general idea is to use a set of a priori unknown static natural landmarks and the cooperation between a UAV and a target for locating both the aerial robot and the target moving freely in the 3D space. This objective is achieved using (i) monocular measurements of the target and the landmarks, (ii) measurements of altitude of the UAV, and (iii) range measurements between UAV and target.
The well-known EKF-SLAM methodology is used as the main estimation technique for the proposed cooperative monocular-based SLAM scheme. In this work, since the observability plays a key role in the convergence and robustness of the EKF ( [28,29]), the observability properties of the system are analyzed using a nonlinear observability test. In particular, it is shown that by the sole addition of altitude measurement, the observability properties of the SLAM system are improved. In this case, the inclusion of the altimeter in monocular SLAM has been proposed previously in other works, but no such observability analyses have been done before.
In monocular-based SLAM systems, the process of initializing the new landmarks into the system state plays an important role in the performance of the system as well [30]. When only monocular measurements of landmarks are available, it is not easy to obtain 3D information from them. In this case, it becomes a difficult task to properly initialize the new map features into the system state due to the missing information. Therefore, a novel technique to estimate the approximate depth of the new visual landmarks is proposed in this work. The main idea is to take advantage of the UAV-Target cooperative scheme to infer the depth of landmarks near the target. In this case, it is shown that by the addition of altitude measurements and by the use of the proposed initialization technique, the problem of recovering the metric scale is overcome.
This work also presents a formation control scheme that allows carrying out the formation of the UAV with respect to the target. Moreover, the stability of the control system is assured utilizing the Lyapunov theory. In simulations, the state estimated by the SLAM system is used as a feedback to the proposed control scheme to test the closed-loop performance of both the estimator and the control. Finally, experiments with real data are presented to validate the applicability and performance of the proposed method.

Paper Outline
This work presents the following structure: mathematical models and system specifications are presented in Section 2. The nonlinear observability analysis is presented in Section 3. The proposed SLAM approach is described in Section 4. The control system is described in Section 5. Section 6 shows the results obtained from numerical simulations and with real data experiments. Finally, conclusions and final remarks of this work are given in Section 7.

System Specification
In this section, the mathematical models that will be used in this work are introduced. First, the model used for representing the dynamics of a UAV-camera system, and the model used for representing the dynamics of the target are described. Then, the model for representing the landmarks as map features is described. Furthermore, measurement models are introduced: (i) the camera projection model, (ii) the altimeter measurement model, and (iii) the range measurement model.
In applications like aerial vehicles, the attitude and heading (roll, pitch, and yaw) estimation is properly handled with available AHRS systems (e.g., [31,32]), so in this work, the estimated attitude of the vehicle is assumed to be provided by an Attitude and Heading Reference Systems (AHRS) as well as the orientation of the camera pointing always toward the ground. In practice, the foregoing assumption can be easily addressed, for instance, with the use of a servo-controlled camera gimbal or digital image stabilization (e.g., [33]). To this effect, it is important to note that the use of reliable commercial-degree AHRS and gimbal devices are assumed.
Taking into account the previous considerations, the system state can be simplified by removing the variables related to attitude and heading (which are provided by the AHRS). Therefore, the problem will be focused on the position estimation.

Dynamics of the System
Let consider the following continuous-time model describing the dynamics of the proposed system (see Figure 1 where the state vector x is defined as: with i = 1, ..., n 1 , where n 1 is the number of landmarks included into the map. In this work, the term landmarks will be used to refer to natural features of the environment that are detected and tracked from the images acquired by a camera. Additionally, let x t = [ x t y t z t ] T represent the position (in meters) of the target, with respect to the reference system W. Let x c = [ x c y c z c ] T represent the position (in meters) of the reference system C of the camera, with respect to the reference system W. Let v t = [ẋ tẏtżt ] T represent the linear velocity (in m s ) of the target. Let v c = [ẋ cẏcżc ] T represent the linear velocity (in m s ) of the camera. Finally, let x a i = [ x i a y i a z i a ] T be the position of the i-th landmark (in meters) with respect to the reference system W. In Equation (1), the UAV-camera system, as well as the target, is assumed to move freely in the three-dimensional space. Let note that a non-acceleration model is assumed for the UAV-camera system and the target. Non-acceleration models are commonly used in monocular SLAM systems. In this case, it will be seen in Section 4 that unmodeled dynamics are represented by means of zero-mean Gaussian noise. In any case, augmenting the target model to consider higher-order dynamics could be straightforward. Furthermore, note that landmarks are assumed to remain static.

Camera Measurement Model for the Projection of Landmarks
Let consider the projection of a single landmark over the image plane of a camera. Using the pinhole model [34] the following expression can be defined: and W R c ∈ SO3 is the rotation matrix, that transforms from the world coordinate reference system W to the coordinate reference system C of the camera. Recall that the rotation matrix W R c is known and constant, by the assumption of using the servo-controlled camera gimbal.

Camera Measurement Model for the Projection of the Target
Let consider the projection of the target over the image plane of a camera. In this case, it is assumed that some visual feature points can be extracted from the target by means of some available computer vision algorithms like [35][36][37][38] or [39].
Using the pinhole model the following expression can be defined: represent the position (in meters) of the target with respect to the coordinate reference system C of the camera, and:

Altimeter Measurement Model
Let consider an altimeter carried by the UAV. Based on altimeter readings, measurements of UAV altitude are obtained, therefore this model is simply defined by: It is important to note that the only strict requirement for the proposed method is the availability of altitude measurements respect to the reference system W. In this case, the typical barometer-based altimeters which are equipped in most UAVs can be configured to provide such kind of measurement [40].

Range Measurement Model
Let consider the availability of a range sensor. Its measurements of the relative distance of the UAV with respect to the target are obtained. In this case, the measurement model is defined by: For practical implementation, several techniques like [41] or [42] can be used to obtain these kinds of measurements. On the other hand, a practical limitation for using these techniques is the requirement of a target equipped with such a device. Thus, the application of the proposed method with non-cooperative targets becomes more challenging.

Observability Analysis
In this section, the nonlinear observability properties of the proposed system are studied. Observability is an inherent property of a dynamic system and has an important role in the accuracy and stability of its estimation process. Moreover, this fact has important consequences in the convergence of the EKF-based SLAM.
In particular, it will be shown that the inclusion of the altimeter measurements improves the observability properties of the SLAM system.
A system is defined as observable if the initial state x 0 , at any initial time t 0 , can be determined given the state transition modelẋ = f(x), the observation model y = h(x), and observations z[t 0 , t] from time t 0 to a finite time t. Given the observability matrix O O O, a non-linear system is locally weakly

Observability Matrix
An observability matrix O O O can be constructed in the following manner: where L s f h represent the s-th-order Lie derivative [44], of the scalar field h respect to the vector field f. In this work, the rank calculation of Equation (9) was carried out numerically using MATLAB. The degree of Lie derivatives, used for computing O O O, was determined by gradually augmenting the matrix O O O with higher-order derivatives until its rank remained constant. Based on this approach, only Lie derivatives of zero and first order were needed to construct the observability matrix for all the cases.
The description of the zero and first order Lie derivatives used for constructing Equation (9) are presented in Appendix A. Using these derivatives the observability matrix n Equation (9) can be expanded as follows: In Equations (9) and (10), Lie derivatives that belong to each kind of measurement are distributed as: first two rows (or first two elements in Equation (9)) are for monocular measurements of the landmarks; second two rows (or second two elements) are for monocular measurements of the target; third two rows (or third two elements) are for altitude measurements; and last two rows (or last two elements) are for range (UAV-target) measurements.

Theoretical Results
Two different cases of system configurations were analyzed. The idea is to study how the observability of the system is affected due to the availability (or unavailability) of the altimeter measurements.

without Altimeter Measurements
In this case, considering only the respective derivatives on the observability matrix in Equation (10) (11) it can be seen that the unobservable modes cross through all states, and thus all states are unobservable. It should be noted that adding Lie derivatives of higher-order to the observability matrix the previous result does not improve.

with Altimeter Measurements
When altimeter measurements are taking into account, the observability matrix in Equation (10) In such a case, the following right nullspace basis N 2 spans the unobservable modes of O O O: It can be verified that the right nullspace basis of O O O spans for N 2 , (i.e., O O ON 2 = 0). From Equation (12) it can be observed that the unobservable modes are related to the global position in x and y of the UAV-camera system, the landmarks, and the target. In this case, the rest of the states are observable. It should be noted that adding Lie derivatives of higher-order to the observability matrix the previous result does not improve.

Remarks on the Theoretical Results
To interpret the former theoretical results, it is important to recall that any world-centric SLAM system is partially observable in the absence of global measurements (e.g., GPS measurements).
In this case, the SLAM system computes the position and velocity within its map, and not respect to the global reference system. Fortunately, this is not a problem for some applications that require local or relative position estimates, for instance the problem addressed in this work that implies to following a moving target.
On the other hand, it is worth noting how the simple inclusion of an altimeter improves the observability properties of the system when GPS measurements are not considered (see Table 1). It is very interesting to observe that, besides the states along the z-axis [z t , z c , z i a ] (as one could expect), the velocity of the camera-robot (which is global-referenced) becomes observable when altitude measurements are included. In this case, since the target is estimated respect to the camera, the global velocity of the target becomes observable.
Accordingly, it is also important to note that, since the range and monocular measurements to the target are used only for estimating the position of the target with respect to the camera-robot, these measurements affect neither the observability of the camera-robot state nor the observability of the landmarks states.
In other words, the target measurements create only a "link" to the camera-robot state that allows estimating the relative position of the target but does not provide any information about the state of the camera-robot, and for this reason, they are not included in the observability analysis.
Later, it will be seen how the target position is used for improving the initialization of nearby landmarks, which in turn improves the robustness and accuracy of the system.
A final but very important remark is to consider that the order of Lie derivatives and the rank calculation of Equation (9) were determined numerically, but not analytically. Therefore, there is still a chance, in rigorous terms, that a subset of the unobservable states determined by the analysis is in reality observable (see [43]).

Ekf-Based Slam
In this work, the standard EKF-based SLAM scheme [45,46] is used to estimate the system state in Equation (2). The architecture of the proposed system is shown in Figure 2.
From Equation (1), the following discrete system can be defined: From Equations (3), (5), (7) and (8), the system measurements are defined as follows: In Equation (14), a t and a c are zero-mean Gaussian noise representing unknown linear accelerations dynamics. Moreover, n k ∼ N (0, Q k ), r k ∼ N (0, R k ) are uncorrelated noise vectors affecting respectively the system dynamics and the system measurements. Let k be the sample step, and ∆t is the time differential. It is important to note that the proposed scheme does not depend on a specific aircraft dynamical model.
The EKF prediction equations are: The EKF update equations are:x with and Let K be the Kalman gain, and let P be the system covariance matrix.

Map Features Initialization
The system state x is augmented with new map features when a landmark is observed for the first time. The landmark can be initialized in one of two different parametrizations: (i) Euclidean parametrization and (ii) Inverse depth parametrization, depending on how close this landmark is from the target. Since the target is assumed to move over the ground, the general idea is to use the range information provided by the target to infer the initial depth of the landmarks near to the target. In this case, it will be assumed that the landmarks near the target lie at a similar altitude, situation encountered in most of the applications. It is important to recall that the initialization of landmarks plays a fundamental role in the robustness and convergence of the EKF-based SLAM.

Initialization of Landmarks near to the Target
A landmark is initialized with a Euclidean parameterization if it is supposed arbitrarily near the target. This assumption is made if the landmark is within a selected area of the image (see Section 4.1.4). In this case, the landmark can be initialized with the information given by the range measurement between the UAV and the target, which is assumed to be approximately equal to the depth that the landmark has respect to the camera. Therefore, the following equation is defined:  arctan 2 p g a n y , p g a n x arctan 2 p g a n z , ( p g a n x ) 2 + p g a n y with p g a n = [ p g a n x p g a n y p g a Where, [ p u n c , p v n c ] define the coordinates (in pixels) of the projection of the n-th landmark over the image of the camera. p ρ n a is initialized as it is shown in [47]. In case of a landmark with inverse depth parametrization, the projection over the image plane of a camera is defined by:

State Augmentation
To initialize a new landmark, the system state x must be augmented by j p x a n x a new ] T , being x a new the new landmark which is initialized by either the Euclidean or the inverse depth parametrization. Thus, a new covariance matrix P new is computed by: where R i is the measurement noise covariance matrix, ∆J is the Jacobian ∂h(x) ∂x , and h(x) is the landmark initialization function.

Landmarks Selection Method
To determine whether a landmark is initialized with Euclidean or inverse depth parametrization, it should be determined arbitrarily if the landmark is considered near or far from the target. To achieve this objective the following heuristic is used (see Figure 3): (1) firstly, a spherical area centered on the target of radius r w is defined; (2) then, the radius r c of the projected spherical area in the camera is estimated; and (3) the landmarks whose projection in the camera are within the projected spherical area ( i d t ≤ r c ) are considered near to the target and thus, they are initialized with Euclidean parameterization (see Section 4.1.1). Otherwise ( i d t > r c ), landmarks are considered far from the target, and they are initialized with inverse depth parametrization (see Section 4. Here, r c is estimated as follows: with and η η η = r w 0 0 T .

Target Tracking Control
To allow a UAV to follow a target a high-level control scheme is presented. Firstly, the kinematic model of the UAV is defined:ẋ Let x q = x q y q z q T represent the UAV position respect to the reference system W (in m).
Let (v x , v y ) represent the UAV linear velocity along the x and y axis (in m/s) respect to the reference system Q. Let v z represent the UAV linear velocity along the z (vertical) axis (in m/s) respect to the reference system W. Let ψ represent the UAV yaw angle respect to W (in radians); and let ω (in radians/s) is the first derivative of ψ (angular velocity).
The proposed high-level control scheme is intended to maintain a stable flight formation of the UAV with respect to the target, by generating velocity commands to the UAV. In this case, it is assumed that a low-level (i.e., actuator-level) velocity control scheme exists, like [48] or [49], that drives the velocities [v x , v y , v z , ω] commanded by a high-level control.

Visual Servoing and Altitude Control
By deriving Equation (5) the following expression can be obtained, neglecting the dynamics of the tangential and radial distortion components, taking into account that x q = x c − q d c , where q d c is the translation vector (in meters) from the coordinate reference system Q to the coordinate reference system C, and assuming q d c to be known and constant: with Furthermore, an altitude differential λ z to be maintained from the UAV to the target is defined: Now, differentiating Equation (37):λ Taking Equations (34), (35) and (38), the following dynamics is defined: where with It will be assumed that disturbances, as well as unmodeled uncertainty, enters the system through the input. In this case, Equation (39) is redefined as: where the term ∆ g (representing the unknown disturbances and uncertainties) satisfies ∆ g ≤ , where is a positive constant, so it is assumed to be bounded. Based on the dynamics in Equation (39), a robust controller is designed using the sliding mode control technique [50]. For the controller, the state-feedback is obtained from the SLAM estimator presented in Section 4. In this case, it is assumed that the UAV yaw angle is obtained directly from the AHRS device. The architecture of closed-loop system is show in Figure 4.

EKF SLAM
Visual Servoing and altitude Control First, the transformationx q =x c − q d c j is defined, to obtain the UAV estimated position in terms of the reference system Q.
To design the controller, the following expression is defined: where K 1 is a positive definite diagonal matrix, and e λ =λ − λ d , and λ d is the reference signal vector. By deriving Equation (43) and substituting in Equations (39), the following expression is obtained: For the former dynamics, the following control law is proposed: where K 2 is a positive definite diagonal matrix. Appendix B shows the proof of the existence ofB −1 . A Lyapunov candidate function is defined to prove the stability of the closed-loop dynamics: (46) with its corresponding derivative: So, by substituting Equation (45) in Equation (47), the following expression can be obtained: where α = λ min (K 2 ). Therefore, if α is chosen such that α > , thenV λ will be negative definite. In this case, the dynamics defining the flight formation will reach the surface s λ = 0 and will remain there in a finite time.

Experimental Results
To validate the performance of the proposed method, simulations and experiments with real data have been carried out.

Simulation Setup
The proposed cooperative UAV-Target visual-SLAM method is validated through computer simulations. For this purpose, a Matlab R implementation of the proposed scheme was used. The simulated UAV-Target environment is composed of 3D landmarks, which are randomly distributed over the ground. In this case, a UAV equipped with the required sensors is simulated. To include uncertainty into the simulations, the following Gaussian noise is added to measurements: for camera measurements σ c = 4 pixels; for altimeter measurements σ a = 25 cm; and for range sensor measurements σ r = 25 cm. All measurements are emulated to be acquired with a frequency of 10 Hz. The magnitude of the camera noise is bigger than the typical noise of real monocular measurement. In this way, it is intended to consider, in addition to the imperfection of the sensor, the errors in camera orientation due to the gimbal assumption. In simulations, the target was moved along a predefined trajectory.

Convergence and Stability Tests
The objective of the test presented in this subsection is to show how the robustness of the SLAM system takes advantage of the inclusion of the altimeter measurements. In other words, both observability conditions described in Section 3 (with and without altimeter measurements) are tested. For this test, a control system is assumed to exist able to maintain the target tracking by the UAV.
It is well known that the initial conditions of the landmarks play an important role in the convergence and stability of SLAM systems. Therefore, a good way to evaluate the robustness of the SLAM system is to force bad initial conditions for the landmarks states. This means that (only for this test) the proposed initialization technique, described in Section 4.1, will no be used. Instead, the initial states of the landmarks x a new , will be randomly determined using different standard deviations for the error position. Note that in this case, the initial conditions ofx t ,x c ,v t ,v c are assumed to be exactly known.
Three different conditions of initial error are considered: σ a = {1, 2, 3} meters, with a continuous uniform distribution. Figure 5 shows the actual trajectories followed by the target and the UAV.

MSEX t (m) MSEY t (m) MSEZ t (m) MSEX c (m) MSEY c (m) MSEZ c (m)
With Altimeter Taking a closer look at Figure 6 and Table 2, it can be observed that both, the observability property and the initial conditions, play a preponderant role in the convergence and stability of the EKF-SLAM. For several applications, the initial position of the UAVs can be assumed to be known. However, in SLAM, the position of the map features must be estimated online. That confirms the great importance of using good features initialization techniques in visual-SLAM; and, as it can be expected, the better observability properties the better performance of the EKF-SLAM system, indeed.

Comparative Study
In this subsection a comparative study between the proposed monocular-based SLAM method and the following methods is presented, There are some remarks about the methods used in the comparison. The method (1) is the approach described in [47]. This method is included only as a reference to highlight that purely monocular methods cannot retrieve the metric scale of the scene. The method (2) is similar to the previous method. But in this case, to set the metric scale of the estimates, the position of a subset of the landmarks seen in the first frame is assumed to be perfectly known (anchors). The method (3) is the approach described in [12]. In this case, inertial measurements obtained from an inertial measurement unit (IMU) are fused into the system. For this IMU-based method, it is assumed that the alignment of the camera and the inertial measurement unit is perfectly known; the dynamic error bias of the accelerometers is neglected as well. The method (4) is the approach proposed in [13]. In this case, altitude measurements given by an altimeter are fused into the monocular SLAM system. The method (5) is a variation of the proposed method. In this case, the landmark initialization technique proposed in Section 4.1 is not used, and instead only the regular initialization method is used. Therefore, this variation of the proposed method is included in the comparative study to highlight the advantage of the proposed cooperative-based initialization technique.
It is worth pointing out that all the methods (included the proposed method) use the same hypothetical initial depth for the landmarks without a priori inference of their position. Also for the comparative study, a control system is assumed to exist able to maintain the target tracking by the UAV.

First Comparison Test
Using the simulation setup illustrated in Figure 5, the performance of all the methods were tested for estimating the position of the camera-robot and the map of landmarks. Figure 7 shows the results obtained from each method when estimating the position of the UAV. Figure 8 shows the results obtained from each method when estimating the velocity of the UAV. For the sake of clarity, the results of Figures 7 and 8 are shown in two columns of plots. Each row of plots represents a reference axis. Table 3 summarizes the Mean Squared Error (MSE) for the estimated relative position of the UAV expressed in each one of the three axes. Table 4   In this comparison test, the performance of all the methods was tested for recovering the metric scale of the estimates. For this test, the target and the UAV follow a circular trajectory for 30 s.
During the flight, the altitude of the UAV was changed (see Figure 9). In this case, it is assumed that all the landmarks on the map are seen from the first frame and that they are kept in the camera field of view throughout the simulation.
The scale factor s is given by [51]: where d real is the real distance, and d est is the estimated distance. For the monocular SLAM problem, there exist different kind of distances and lots of data for real (and estimated) distances: distances between camera and landmarks, distances between landmarks, distances defined by the positions of the camera in time periods (camera trajectory), among other distances. Therefore, in practice, there is not such a standard convention for determining the metric scale. But in general, for determining the scale, the averages of multiple real and estimated distances are considered. In this work, authors propose to use the following approximation, which averages all the distances among the map features.
Let d ij represent the actual distance of the i-th landmark respect to the j-th landmark. Letd ij represent the estimated distance of the i-th landmark respect to the camera j-th landmark, and let n be the total number of landmarks. From (49), if the metric scale is perfectly recovered then s = 1.
For this test, an additional method has been added for comparison purposes. The Monocular SLAM with altimeter (Loosely-coupled) explicitly computes the metric scale by using the ratio between the altitude obtained from an altimeter, and the unscaled altitude obtained from a purely monocular SLAM system. The computed metric scale is used then for scaling the monocular SLAM estimates.
Case 1: The UAV follows a circular flight trajectory while varying its altitude (see Figure 9, upper plot). In this case, the UAV gets back to fly over its initial position, and thus, the initial landmarks are seen again (loop-closure). Figure 9 (lower plot) shows the evolution of the metric scale obtained for each method. In this case, for each method, the metric scale converged to a value, and remains almost constant. Even the monocular SLAM method (yellow) which does not incorporate any metric information, and the monocular SLAM with anchors (green) that only includes metric information at the beginning of the trajectory, exhibit the same behavior. It is important to note that this is the expected behavior since the camera-robot is following a circular trajectory with loop closure where the initial low-uncertainty landmarks are revisited.
Case 2: The UAV follows the same flight trajectory illustrated in Figure 5. In this case, the UAV drifts apart from its initial position, and thus, the initial landmarks are never seen again. Figure 10 (upper plot) shows the evolution of the metric scale obtained for each method. In this case, the monocular SLAM method (yellow) was manually tuned to have a good initial metric scale. The initial conditions of the other methods are alike as those of the Case 1, but the vertical limits of the plot have been adjusted for better visualization. Figure 10 (middle and lower plots respectively) shows the Euclidean mean error in position for the camera-robot e c and the Euclidean mean error in position for the landmarks e a for each method, where Figure 10, as could be expected, for the methods that continuously incorporate metric information into the system through additional sensors, the metric scale converges to a value, and remains approximately constant (after time > 100 s). On the other hand, the methods that do not continuously incorporate metric information (monocular SLAM and monocular SLAM with anchors), exhibit a drift in the metric scale. As one could also expect in SLAM without loop-closure, all the methods present some degree of drifting in position, both for the robot-camera trajectory and the landmarks. The above reasoning is independent of the drift in metric scale (the methods with low drift in scale also present drift in position). Evidently, it is convenient to maintain a low error/drift in scale, because it affects the error/drift in position.
It is interesting to note that the loosely-coupled method (purple) appears to be extremely sensitive to measurements noise. In this case, the increasing "noise" in error position is because the scale correction-ratio increases as the error in the scale of the purely monocular SLAM (yellow) also increases. In other terms, the signal-to-noise ratio (S/N) increases. Surely some adaptations can be done, as filtering the computed metric scale, but a trade-off would be introduced between the time of convergence and the reduction of noise effects.

Estimation and Control Test
A set of simulations were also carried out to test in a closed-loop manner the proposed control scheme. In this case, the estimates obtained from the proposed visual-based SLAM estimation method are used as feedback to the control scheme described in Section 5. The value of the vector λ d , that defines the desired values of the servo visual and altitude control, is: λ d = [ 0, 0, 7 + sin(t · 0.05), atan2(ŷ q ,x q ) ] T . Those values for the desired control mean that the UAV has to remain flying exactly over the target at a varying relative altitude. Figure 11 shows the evolution of the error respect to the desired values λ d . In all the cases, note that the errors are bounded after an initial transient period.  Figure 11. Errors with respct to λ d . Figure 12 shows the real and estimated position of the target and the UAV.   Table 5 summarizes the Mean Squared Error (MSE), expressed in each of the three axes, for the estimated position of: (i) the target, (ii) the UAV, and (iii) the landmarks.   Table 6 summarizes the Mean Squared Error (MSE) for the initial hypotheses of landmarks depth MSEd. Furthermore, Table 6 shows the Mean Squared Error (MSE) for the estimated position of landmarks, expressed in each of the three axes. In this case, since the landmarks near to the target are initialized with a small error, its final position is better estimated. Once again, this result shows the importance of the initialization process of landmarks in SLAM. Table 6. Mean Squared Error for the the initial depth (MSEd) and position estimation of the landmarks.

MSEd (m) MSEX (m) MSEY (m) MSEZ (m)
Far from the target 13 According to the above results, it can be seen that the proposed estimation method has a good performance to estimate the position of the UAV and the target. It can also be seen that the control system was able to maintain a stable flight formation along with all the trajectory respect to the target, using the proposed visual-based SLAM estimation system as a feedback.

Experiments with Real Data
To test the proposed cooperative UAV-Target visual-SLAM method, an experiment with real data was carried out. In this case, a Parrot Bebop 2 R quadcopter [33] (see Figure 13) was used for capturing real data with its sensory system. The set of sensors of the Bebop 2 that were used in experiments consists of (i) a camera with a wide-angle lens and (ii) a barometer-based altimeter. The drone camera has a digital gimbal that allows to fulfill the assumption that the camera is always pointing to the ground. The vehicle was controlled through commands sent to it via Wi-Fi by a Matlab R application running in a ground-based PC. The same ground-based application was used for capturing, via Wi-Fi, the sensor data from the drone. In this case, camera frames with a resolution of 856 × 480 pixels were captured at 24 fps. The altimeter signal was captured at 40 Hz. The range measurement between the UAV and the target was obtained by using the images and geometric information of the target. In experiments, the target was represented by a person walking with an orange ball over his head (See Figure 14). For evaluating the results obtained with the proposed method, the on-board GPS device mounted on the quadcopter was used to obtain a flight trajectory reference. It is important to note that, due to the absence of an accurate ground truth, the relevance of the experiment is two-fold: (i) to show that the proposed method can be practically implemented with commercial hardware; and (ii) to demonstrate that using only the main camera and the altimeter of Bebop 2, the proposed method can provide similar navigation capabilities than the original Bebop's navigation system (which additionally integrate GPS, ultrasonic sensor, and optical flow sensor), in scenarios where a cooperative target is available. Figure 14 shows a frame taken by the UAV on-board camera. The detection of the target is highlighted with a yellow bounding box. The search area of landmarks near the target is highlighted with a blue circle centered on the target. For the experiment, a radius of 1 m was chosen for the sphere centered on the target that is used for discriminating the landmarks. In this frame, some visual characteristics are detected in the image. The red cercles indicate those visual features that are not within the search area near the target, that is, inside the blue circle. Instead, the green circles indicate those detected features within the search area. The visual features that are found within the patch that corresponds to the target (yellow box) are neglected, this behaviour is to avoid considering any visual feature that belongs to the target as a static landmark of the environment.  Figure 15 shows both the UAV and the target estimated trajectories. This figure also shows the trajectory of the UAV given by the GPS and the altitude measurements supplied by the altimeter. Although the trajectory given by the GPS cannot be considered as a perfect ground-truth (especially for the altitude), it is still useful as a reference for evaluating the performance of the proposed visual-based SLAM method, and most especially if the proposed method is intended to be used in scenarios where the GPS is not available or reliable enough. According to the experiments with real data, it can be appreciated that the UAV trajectory has been estimated fairly well.

Conclusions
This work presented a cooperative visual-based SLAM system that allows an aerial robot following a cooperative target to estimate the states of the robot as well as the target in GPS-denied environments. This objective has been achieved using monocular measurements of the target and the landmarks, measurements of altitude of the UAV, and range measurements between UAV and target.
The observability property of the system was investigated by carrying out a nonlinear observability analysis. In this case, a contribution has been to show that the inclusion of altitude measurements improves the observability properties of the system. Furthermore, a novel technique to estimate the approximate depth of the new visual landmarks was proposed.
In addition to the proposed estimation system, a control scheme was proposed, allowing to control the flight formation of the UAV with respect to the cooperative target. The stability of control laws has been proven using the Lyapunov theory.
An extensive set of computer simulations and experiments with real data were performed to validate the theoretical findings. According to the simulations and experiments with real data results, the proposed system has shown a good performance to estimate the position of the UAV and the target. Moreover, with the proposed control laws, the proposed SLAM system shows a good closed-loop performance. Since f c , d u , d v ,ẑ t d > 0, then, B = 0, thereforeB −1 exists.