Towards Multi-Robot Visual Graph-SLAM for Autonomous Marine Vehicles

Abstract: State-of-the-art approaches to Multi-robot localization and mapping still present multiple issues to be improved, offering a wide range of possibilities for research and technology. This paper presents a new algorithm for visual Multi-robot simultaneous localization and mapping, used to join, in a common reference system, the trajectories of several robots that participate simultaneously in a common mission. One of the main problems in centralized configurations, where the leader receives data from the rest of the robots, is the limited communication bandwidth, which delays data transmission and can be overloaded quickly, restricting reactive actions. This paper presents a new approach to Multi-robot visual graph Simultaneous Localization and Mapping (SLAM) that builds a joint topological map, which evolves in different directions according to the trajectories of the different robots. The main contributions of this new strategy are: (a) reducing the visual data to be exchanged among agents to hashes of small dimensions, thus diminishing the data delivery time; (b) running two different phases of SLAM, intra- and inter-session, each with its own loop-closing task and with a trajectory-joining action in between, with high flexibility in their combination; (c) simplifying the complete SLAM process, in concept and implementation, and using it to correct the trajectories of several robots, initially and continuously estimated by means of a visual odometer; and (d) executing the process online, in order to assure the successful accomplishment of the mission, with the planned trajectories and at the planned points. Preliminary results included in this paper show a promising performance of the algorithm on visual datasets obtained at different points on the coast of the Balearic Islands, either by divers or by an Autonomous Underwater Vehicle (AUV) equipped with cameras.


Introduction and Related Work
Simultaneous Localization and Mapping (SLAM) [1] is an essential task for Autonomous Underwater Vehicles (AUV) to achieve their programmed missions successfully and precisely. SLAM consists of building a map of the environment and, at the same time, estimating the robot's own pose within this map. SLAM is presently a de facto localization standard for any kind of autonomous vehicle. Laser range finders or sonar were the sensor modalities of choice at first [2][3][4]. However, research turned to computer vision as soon as the price and capabilities of cameras made it possible [5], since cameras provide higher temporal and spatial data resolution and richer representations of the world.
However, large-scale or long-term operations with a single robot equipped with cameras generate huge amounts of visual data that can collapse the vehicle computer if they are not treated properly. A common strategy to overcome this problem is to explore the areas of interest in different, separate missions, so-called sessions, run with a single robot in different time periods (a Multi-session configuration [6]) or with several robots running simultaneously (a Multi-robot configuration [7]). Therefore, the limited capability of a single robot to operate robustly during long periods of time can be alleviated by running different transits with different agents at the same time through common areas, and joining all individually estimated trajectories in a single coordinate frame. Multi-robot systems also increase robustness in case of failure of any of the robots; however, they need complex coordination and multiple localization systems. Typical applications using teams of robots include aerial surveillance [8], underwater exploration [9], and maintenance of industrial infrastructures or intervention in archaeological sites [10], among others.
The first approaches to Multi-robot SLAM were based on particle filters [11], and introduced the concept of encounters as the relative pose between two robots that can mutually recognize each other and determine their relative poses. These encounters are introduced as additional pose constraints in the particle filter. Some Multi-robot approaches are based on the Anchor-nodes [12,13] proposal, which defined two concepts unconsidered for multiple trajectories until that moment: (a) the Anchor, defined as the offset of a complete trajectory with respect to a global system of coordinates, and (b) the encounter, re-defined as a transformation between two different poses of two different robots that observe the same part of the environment, without it being necessary that both robots recognize each other. In vision-based systems this can be achieved, for instance, by detecting overlapping scenes. In Multi-robot systems, encounters represent additional constraints between different graphs corresponding to different sessions.
Schuster et al. conceived a very precise SLAM approach to localize a team of planetary rovers equipped with an Inertial Measurement Unit (IMU) and a stereo camera [14]. IMU data, visual odometry and wheel odometry are integrated in a local localization Extended Kalman Filter (EKF), and the 3D point clouds of all robots, computed from their respective stereo views, are first stored in each agent and then matched to be joined into global 3D maps. The use of stereo vision and advanced techniques for 3D feature matching and alignment complicates the whole system considerably and generates huge amounts of data to work with and to be exchanged. This solution turns out to be very difficult for underwater missions, given the limited options for fast communication in this medium.
Saeedi et al. offered an extensive survey of Multi-robot systems and strategies, pointing also towards the upcoming trends and challenges [15], such as extending the systems to dynamic and/or large-scale environments or increasing the number of agents in the working teams.
Another issue to consider in SLAM is the detection of loop closings and their use to correct the robot trajectory estimated by means of dead-reckoning sensors, such as inertial units, acoustic beacons, or laser-based or visual odometers. Loop closing is the problem of recalling revisited scenes, and approaches to visual loop closure detection try to recognize the same scene in different images, taken at different and relatively distant time instants, regardless of evident differences in scale or viewpoint [16]. In single-session SLAM, since the robot pose is continuously estimated, the search for images that are candidates to close a loop with a query (from now on called intra-session loop closings) is constrained to a region around the robot pose associated with that query [17]. In contrast, Multi-robot loop detection, i.e., the detection of loop closings among different sessions of different robots (from now on called inter-session loop closings), cannot rely on the AUV poses to constrain the search since, at first, the relative poses between sessions are unknown. Consequently, every query of one session would need to be compared with all the images obtained until that moment in the other sessions, increasing considerably the time dedicated to this task and the amount of visual data to be exchanged.
Exchanging image hashes instead of entire images or sets of image salient points is a way to reduce data transfer requirements in Multi-robot configurations. Hash functions are usually used to authenticate messages sent between a source and a receiver, so that the latter can verify the authenticity of the source. Conventional hashing algorithms are extremely sensitive to the hashed messages: a change of 1 bit in the input message causes dramatic changes in the output. Applications of hashes, understood as exposed before, include image retrieval in large databases, authentication and watermarking, among many others [18]. However, in applications of scene recognition, localization or visual loop closing detection, similar or overlapping images are expected to produce similar or close hashes, while distinct images produce clearly distinctive hashes; this concept is known as perceptual image hashing [19][20][21][22][23]. In particular, McDonald et al. [6] proposed to detect loop closings using a solution based on Bag of Words (BoW) [24] combined with iSAM [12] for batch map optimization, and Negre et al. [22] showed how their new global image descriptor HALOC outperformed other techniques, such as BoW and VLAD [20], in the task of loop closing detection with image hashes. From now on, this text uses hash and global image descriptor interchangeably to refer to the same concept.
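As a toy illustration of the perceptual hashing idea, the following Python sketch reduces an image to a grid-of-means descriptor (a deliberately simple stand-in for real descriptors such as HALOC or NetVLAD) and compares hashes with the L1 norm; an image and a slightly noisy copy of it yield close hashes, while an unrelated image yields a distant one:

```python
import numpy as np

def global_descriptor(img: np.ndarray, grid: int = 8) -> np.ndarray:
    """Toy perceptual hash: mean intensity over a grid x grid partition.
    A crude stand-in for real global descriptors such as HALOC or NetVLAD."""
    h, w = img.shape
    cells = img[: h - h % grid, : w - w % grid].reshape(
        grid, h // grid, grid, w // grid)
    return cells.mean(axis=(1, 3)).ravel()

def hash_distance(h1: np.ndarray, h2: np.ndarray) -> float:
    """L1 distance between two hashes: small for overlapping scenes."""
    return float(np.abs(h1 - h2).sum())

rng = np.random.default_rng(0)
scene = rng.random((128, 128))                     # an image of a scene
noisy = np.clip(scene + 0.01 * rng.standard_normal((128, 128)), 0.0, 1.0)
other = rng.random((128, 128))                     # an unrelated image

d_similar = hash_distance(global_descriptor(scene), global_descriptor(noisy))
d_distinct = hash_distance(global_descriptor(scene), global_descriptor(other))
```

Real perceptual hashes are far more discriminative, but the comparison logic, where a small L1 distance suggests a likely overlap, is the same.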
All the aforementioned references apply hashes to detect loops in SLAM applications for single robots. However, our interest is now focused on Multi-robot systems, and on the application of global image signatures to find loops between images captured by different robots that operate in the same mission over a common area of interest. A few authors have already explored this idea. For instance, Decentralized Visual Simultaneous Localization and Mapping (DSLAM) [25] is a powerful tool for pose-graph Multi-robot decentralized applications in environments where absolute positioning is not available. DSLAM reduces every image to its NetVLAD (a Neural Network Architecture for Place Recognition) global descriptor [26]. To find loop closings, DSLAM seeks, for every query of one robot, the image of another robot whose NetVLAD descriptor presents the shortest distance to the descriptor of the query, provided this distance is below a certain threshold. This process is done for every frame of every robot that is inside a predefined cluster. DSLAM uses ORB-SLAM [27] for continuous localization, which includes a global image characterization based on BoW for initial odometric estimates, and ORB [28] feature matching and RANSAC to calculate the 3D transform between confirmed visual loop closings.
The idea of Cloud Computing is applied in some cases to alleviate the computational load needed for a set of robots to localize themselves and map the environment, running a software architecture based on a multi-layer cloud platform [29]. In this latter reference, robots use ORB-SLAM for self-localization, and the multi-agent SLAM is tested with the KITTI [30] public dataset and using a quadrotor drone in an outdoor environment. A few solutions integrate inertial with image data to perform Multi-robot graph SLAM. In [31], ORB visual features are tracked along consecutive frames and integrated together with the motion given by an IMU in a graph optimization context. BoW is also used to detect candidates to close inter-session loops: the BoW-based global image descriptor of a query image is compared to the global descriptors of all other images of the other agents, selecting a set of candidates to close inter-session loops with the query. Afterwards, a brute-force feature matching with RANSAC is applied to confirm or reject the candidates. Experiments in [31] are performed with aerial robots in industrial environments.
The previous references have been tested only in terrestrial indoor and outdoor environments. The literature is extremely scarce in Multi-robot SLAM addressed, implemented and tested in underwater scenarios with AUVs [9,32]. Underwater computer vision is affected by several challenging problems, such as flickering, reduced range, lack of illumination, haze, light absorption, refraction, and reflection. These limitations increase the need for more robust visual SLAM approaches, which start with accurate camera calibrations. Accuracy in the camera calibration process is critical to reduce drift in the visual odometry and increase precision in the pose transform obtained from images that close loops [33,34]. Furthermore, none of the papers cited previously consider the potential impact of limited communications among robots, because either they are applied in ground or aerial robots or they simply assume full, high-bandwidth connectivity. This assumption is clearly unrealistic in underwater environments, where blue-light laser communications need to be highly directive and acoustic USBL modems work, on average, at 13 Kbps for long-range devices and up to 65 Kbps for mid-range devices. Additionally, these speeds are not suitable to transmit medium- or high-resolution visual data between two robots without a previous compression. For instance, Pfingsthorn et al. [35] proposed a visual pose-graph SLAM approach in which compressed JPEG-format images are sent via acoustic links, only between robots that can mutually recognize their positions and are viewing overlapping areas. Paull et al. [36] reject the use of images for SLAM and entrust the entire localization process to an acoustic modem for data transmission and to instruments that provide relatively small amounts of data compared with images: a compass, a Doppler Velocity Log (DVL) and a Side Scan Sonar. They also apply a new strategy to marginalize unnecessary local information, reducing the size of the transferred packages.
In the context of the ongoing national project TWINBOT (TWIN roBOTs for Cooperative Underwater Intervention Missions) [37], diverse missions of exploration and cooperative intervention have to be run using one or several AUVs in underwater areas with multiple appearances and different benthic habitats. In this project, accurate, fast and reliable robot localization, loop closing and navigation algorithms are crucial for the success of the missions. This paper presents a new approach to Multi-robot visual graph-SLAM, especially designed for 2.5D configurations, where vehicles move at a constant altitude with a camera pointing downwards, with the lens axis constantly perpendicular to the ground or to the vehicle longitudinal axis. This condition simplifies the visual system to 3 Degrees of Freedom (DoF): two for an in-plane translation (x, y) and another for rotation in yaw (θ). This simplification fits aerial and underwater vehicle configurations, if the navigation altitude is large enough compared with the height of the terrain relief [38,39]. However, tests have so far been made only with underwater datasets, since the research developed by our team in general [40], and the TWINBOT project in particular, is applied entirely and solely underwater, and this approach emerged as a solution to be applied on the robots that participate in our project missions.
The approach presented here includes several contributions that represent clear advantages with respect to the existing solutions, namely: (a) As in [25], images are reduced to global descriptors, drastically decreasing the amount of visual data to be exchanged among robots; however, the global descriptor used now is HALOC [22] instead of NetVLAD. The construction of HALOC is simpler and faster than that of NetVLAD, consisting of projecting all image features onto a base of orthogonal vectors, without any need for tedious and long training tasks. This is a clear advantage over the previous work, since HALOC has already been shown to outdo VLAD [20] and BoW, in speed and performance for loop closing detection, in both terrestrial public benchmarks and underwater environments. HALOC also showed a better performance than ORB-SLAM, although only underwater. Additionally, extensive experiments with HALOC performed in marine areas partially colonized with seagrass [41,42] also revealed an excellent efficiency, capacity and utility for loop closing detection in this type of environment.
(b) A second important contribution is the simplification of the whole system with respect to previous approaches. Ours requires neither the computation of relative poses among robots nor a specific strategy to limit their communication and interaction. At every SLAM iteration, the quantity of bytes to be exchanged between robots is so small that communication among all the agents that participate in the mission will not necessarily be limited by it.
(c) The global procedure includes local and global SLAM tasks, with a map joining process in between. The advantage of this point lies in the flexibility to choose the moment at which the map joining is performed, either giving priority to local routes that are as accurate as possible, or delaying the major corrections until all maps have been joined.
(d) The present approach goes one step beyond its predecessors, since the joined graph incorporates and reflects, online, the successive poses of all robots that move simultaneously.
(e) The localization and motion problem is simplified to 2D. Furthermore, it avoids complex multi-layer software architectures or Cloud computing strategies present in previous solutions.
(f) One of the principal objectives has been the reduction of the computational requirements of the algorithm, since these are, in general, limited in lightweight underwater vehicles. Running the algorithm online onboard the vehicles is a must, since it is especially addressed to Multi-robot configurations, and these configurations imply controlling, mapping and guiding several robots moving simultaneously, where, usually, one of them centralizes the processing of the localization data of the whole group.
Furthermore, although they are not directly novel contributions, it is worth mentioning two additional advantages in the implementation: (i) similarly to [32] or [35], once the maps of different robots are joined, standard graph-based topology representations are used, where images form nodes and transforms between two images (either between consecutive frames or between two images that close a loop) form edges or links, and (ii) the graph is optimized by means of the standard g2o [43]; this standardization facilitates the exchange of the different modules on a variety of software platforms and their reuse among different implementations.
Although this is out of the scope of this paper, this vision-based algorithm can complement the navigation facilities of underwater vehicles equipped with multiple types of sensors. In fact, this algorithm can integrate additional sensorial data in the first estimation of the vehicle motion, combining visual odometry with other means of laser or sonar-based dead reckoning [44,45].
The source code of the whole approach has been made publicly available for the scientific community in several GitHub repositories, together with a simple underwater dataset to test the whole procedure. Links to sources are provided in Section 3.
Section 2 contextualizes and details all algorithms proposed to: (a) estimate the visual odometry, (b) detect intra-and inter-session loop closings, (c) perform the local trajectory-based SLAM and (d) join maps and optimize the global graph. Section 3 presents some qualitative and quantitative preliminary results. Finally, Section 4 concludes the paper and gives some indications of ongoing and upcoming tasks to continue and improve this work.

Overview
The proposed localization module is based only on vision, with no intervention of either dead-reckoning navigation instruments, such as an IMU or a DVL, or global positioning systems, such as GPS for surface vehicles or Ultra-Short Baseline (USBL) devices for underwater vehicles.
The structure of the proposed system is as follows: (1) Let us simplify the problem assuming that there are, for instance, two vehicles moving simultaneously over the same area of interest, in such a way that there is no possibility of collision and that part of the area will be explored by both robots.
(2) The approach starts by estimating the trajectory of each robot separately, applying the trajectory-based visual SLAM strategy included in the multi-session scheme of Burguera and Bonin-Font [46]. Let us refer to this step as the intra-session SLAM. The indicated trajectory-based scheme implies that the trajectory of each robot is estimated by compounding [47] the successive displacements calculated from one point to the next. These successive displacements form the state of an Extended Kalman Filter (EKF), which is updated using the transforms given by the confirmed loop closings. In our particular case, each displacement corresponds to the visual odometry calculated between consecutive images, and the candidate images to close a loop with the current image are found by comparing the corresponding image hashes and confirmed by a RANSAC-based algorithm applied on a brute-force visual feature-matching process. The main difference with respect to [46] is that, while in a multi-session localization procedure the trajectory of the currently running robot is joined to another trajectory already completed and available in its totality, in a Multi-robot scheme both robots are moving at the same time to complete a mission in which both participate simultaneously. When joined, both trajectories are incomplete, and continue growing until all robot missions are finished.
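The compounding of successive displacements can be sketched, for the 3-DoF case (x, y, yaw) used here, with the following generic illustration of the ⊕ operator; it is a sketch of the operation described in [47], not the project's actual implementation:

```python
import numpy as np

def compound(x_ab, x_bc):
    """2D pose compounding x_ab ⊕ x_bc: express pose C in frame A.
    Each pose is a tuple (x, y, yaw)."""
    xa, ya, ta = x_ab
    xb, yb, tb = x_bc
    return (
        xa + xb * np.cos(ta) - yb * np.sin(ta),
        ya + xb * np.sin(ta) + yb * np.cos(ta),
        (ta + tb + np.pi) % (2 * np.pi) - np.pi,  # wrap yaw to [-pi, pi)
    )

# chain three odometric displacements: advance 1 m, then turn 90 degrees
odometry = [(1.0, 0.0, np.pi / 2)] * 3
pose = (0.0, 0.0, 0.0)
for step in odometry:
    pose = compound(pose, step)
```

Driving one metre and turning 90 degrees three times traces three sides of a square, which the chained poses reproduce: the final pose is (0, 1) with a heading of −90 degrees.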
(3) Simultaneously to both intra-session SLAM tasks, the system also searches for inter-session loop closings using HALOC. The global signature of each new image captured by each robot is compared with the global signatures of all images of the other robot. Once a certain number of inter-session loop closings are confirmed, both routes are joined in a single graph. Let us refer to this step as the Map Joining.
(4) Once joined, the global map (graph) must be completed with the successive poses of both robots until the end of both sessions. Furthermore, the trajectory-based localization approach applied to both agents separately is no longer valid. Each new displacement of both robots is included in the form of new nodes in the graph. Each new node of the graph will follow the direction of motion of the corresponding vehicle, which means that the graph will grow in two different directions, according to the two different trajectories, but forming a single entity. Let us denote this step as the Multi-robot SLAM.
(5) The global graph completion and optimization is done using a pose-based scheme, i.e., all new nodes corresponding to each robot contain their successive global poses with respect to the origin of a unique world coordinate system, while the links between nodes contain the displacements between them. After the map joining, only the inter-session loop closings are used to optimize the global map. If a loop closing between different sessions is confirmed, its resulting transform is used as an additional constraint between two nodes to optimize the whole graph. In our case, the optimization solver used is the g2o implementation [43] of the Levenberg-Marquardt algorithm [48,49]. Figure 1 shows the global idea behind this first step of the approach.
The visual odometry gives the estimated 2D motion between consecutive images by means of a SIFT feature detection and matching procedure. Algorithm 1 shows the RANSAC-based method used to register two images, i.e., to get the transform (if it exists) between them in translation and rotation. apply_altitude() is a function that converts image feature coordinates from pixels to meters by considering the altitude at which the AUV navigates, as well as the camera parameters. The idea behind this algorithm is that correct correspondences lead to the same roto-translation, while wrong feature matchings lead to different and wrong roto-translations. The algorithm selects a random subset R of correspondences from the total set of correspondences C between two images, and then computes the roto-translation X and the subsequent error using only this subset. Afterwards, if the error introduced by a non-selected matching of C is below a threshold ε_corr, this matching is included in R. If, at any moment, the number of elements in R surpasses a threshold N_min, the roto-translation and the error are computed again using this expanded R.
If the error is below the smallest error obtained until that moment, the roto-translation is kept as a good model. This process is iterated a certain number of times. If partial roto-translations are inconsistent and R never reaches the minimum number of items required, the algorithm does not return any transform, but a boolean called fail set to true. The obtained transform can be assumed to be the odometric displacement between consecutive images, and the trajectory of the robot between steps i and j (X_i^j, assuming step j subsequent to i) can be estimated by means of the compounding operator ⊕ applied to the successive odometric displacements, as described in [50]:

X_i^j = X_i^{i+1} ⊕ X_{i+1}^{i+2} ⊕ … ⊕ X_{j-1}^{j}    (1)
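A minimal sketch of a RANSAC registration loop in the spirit of Algorithm 1 is shown below. It assumes the correspondences have already been converted from pixels to metres (the role of apply_altitude()), fits the 2D roto-translation with a least-squares (Kabsch) procedure, and uses illustrative values for the thresholds eps_corr and n_min, not those of the paper:

```python
import numpy as np

def fit_rototranslation(p, q):
    """Least-squares 2D roto-translation mapping points p onto q (Kabsch)."""
    mp, mq = p.mean(axis=0), q.mean(axis=0)
    H = (p - mp).T @ (q - mq)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = mq - R @ mp
    return R, t

def ransac_register(p, q, iters=200, eps_corr=0.05, n_min=10, seed=0):
    """RANSAC loop in the spirit of Algorithm 1: sample a minimal subset,
    grow it with consistent matchings, and fail if no model ever gathers
    at least n_min inliers."""
    rng = np.random.default_rng(seed)
    best, best_err = None, np.inf
    for _ in range(iters):
        sel = rng.choice(len(p), size=3, replace=False)
        R, t = fit_rototranslation(p[sel], q[sel])
        resid = np.linalg.norm(p @ R.T + t - q, axis=1)
        inliers = resid < eps_corr                # consistent matchings join R
        if inliers.sum() >= n_min:
            R, t = fit_rototranslation(p[inliers], q[inliers])
            err = np.linalg.norm(p[inliers] @ R.T + t - q[inliers],
                                 axis=1).mean()
            if err < best_err:
                best, best_err = (R, t), err
    return best, best is None                     # (model, fail flag)

# synthetic check: recover a known roto-translation despite 5 bad matchings
rng = np.random.default_rng(1)
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([0.5, -0.2])
p = rng.random((50, 2))
q = p @ R_true.T + t_true
q[:5] += 5.0                                      # five wrong correspondences
model, failed = ransac_register(p, q, eps_corr=0.01, n_min=20)
```

As in Algorithm 1, a model built from wrong matchings never accumulates enough consistent correspondences, so only subsets drawn from correct matchings can produce the returned roto-translation.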

Local Loop Detection and Trajectory Optimization
Local loops are those found within a single SLAM session. The Loop Candidates set (LC_t) is the set of images I_i that may close a loop with the last gathered image I_t of each running trajectory. This set is built by searching in a region within a predefined radius δ [38] around the current robot pose as estimated by the odometry:

LC_t = {I_i : ||X_i^t|| < δ},

where X_i^t is computed by Equation (1). Every image contained in the set of loop closing candidates (I_i ∈ LC_t) is registered with I_t using Algorithm 1, in order to build the set of local loops LL_t:

LL_t = {Z_i^t : I_i ∈ LC_t ∧ ¬fail(i, t)},

where ¬fail(i, t) indicates that Algorithm 1 did not fail to get a roto-translation Z_i^t between I_i and I_t. The trajectory estimation obtained by compounding the successive odometric displacements between two points A and B will most likely not coincide with the direct transform between the images obtained at A and B provided by the image registration process of Algorithm 1, if A and B close a loop, due to the drift introduced by the visual odometry and the error inherent to the transform directly obtained from the image registration procedure. Figure 2 illustrates these concepts. Afterwards, a process of global optimization is run to get a trajectory that best combines the pose constraints imposed by the set of local loops (LL_t) and the odometry. As mentioned before, the trajectory of the robot is the state vector of an Iterative Extended Kalman Filter (IEKF). Each new odometric displacement (X_{t-1}^t) computed between the last image and the previous one is used to augment the state vector at time t (X_t^-):

X_t^- = [(X_{t-1}^-)^T (X_{t-1}^t)^T]^T.

In the prediction stage of the IEKF the state vector does not change at all.
If LL_t is not empty, the trajectory is optimized performing the update stage of the IEKF, using the set of loop closings as measurements. The observation function h_i^t associated with each measurement Z_i^t (the transform of each real loop closing) can be defined as h_i^t = X_i^t, computed from the state vector by Equation (1), being (Z_i^t − h_i^t) the innovation of the IEKF for each measurement. With all these elements, one can iterate the classical equations of an EKF to get the optimized trajectory, until ||X_{t,j}^- − X_{t,j-1}^-|| < γ, where j represents the last iteration and γ is a predefined threshold. The classical format of the IEKF involves iterating until the changes between consecutively estimated states are below a certain threshold. However, in the experiments we verified that, after an almost constant number of iterations, the filter had already converged, with a difference between consecutive results below the threshold γ. Because of that, it was decided to repeat all the experiments with a fixed number of iterations, in order to limit the number of executions to be done and save computational resources. In any case, both options can be used in other circumstances, depending on the needs and environmental conditions of each system and field case.
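The iterated update described above can be sketched generically as follows; the observation function h, its Jacobian H_jac and the toy linear measurement are placeholders, not the trajectory-based observation model of the paper:

```python
import numpy as np

def iekf_update(x, P, z, R, h, H_jac, gamma=1e-6, max_iters=10):
    """One Iterative-EKF update: relinearize the observation function h
    around the latest estimate until the state change drops below gamma,
    or a fixed iteration cap is reached (the option used in the paper's
    experiments)."""
    x_i = x.copy()
    for _ in range(max_iters):
        H = H_jac(x_i)
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        # IEKF correction, relinearized around the current iterate x_i
        x_next = x + K @ (z - h(x_i) - H @ (x - x_i))
        done = np.linalg.norm(x_next - x_i) < gamma
        x_i = x_next
        if done:
            break
    P_upd = (np.eye(len(x)) - K @ H) @ P
    return x_i, P_upd

# toy check with a linear observation of the first state component
x0, P0 = np.zeros(2), np.eye(2)
z = np.array([1.0])
R_meas = np.array([[1.0]])
h = lambda v: v[:1]
H_jac = lambda v: np.array([[1.0, 0.0]])
x_upd, P_upd = iekf_update(x0, P0, z, R_meas, h, H_jac)
```

With a linear observation, the loop reproduces the standard Kalman update after two iterations, which matches the observed behavior that motivates fixing the number of iterations.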

Inter-Session Loop Closings
The main problem for joining two trajectories of two robots operating simultaneously is the lack of a geometric relation between their corresponding sessions. Every robot geo-localizes itself with respect to the origin of its own trajectory, but it has no knowledge about the origin of the other trajectories. By finding inter-session loop closings, i.e., images that show, partially or totally, the same area, but taken by two different robots in two different sessions running simultaneously, the maps of the two robots can be joined into a single one [46]. Due to this lack of geometrical relation between the two trajectories, the search for candidates of one robot to close a loop with the last image captured by the other robot cannot be restricted to a certain area. In this case, one should compare the last image gathered by one of the robots with all images taken by the other robot, from the start of the mission until the current moment. Applying a brute-force feature matching algorithm between all these images is unfeasible for online applications, due to the great amount of computing resources and time needed. One way to alleviate this problem is to reduce all images to global descriptors. As in [46], all images of both sessions are reduced to their HALOC global descriptor. The size of HALOC is fixed at 384 floats, since the size of the projective orthogonal vector space used is 3 [22]. This length is independent of the number of visual features found in each image. The global descriptor of every new image of one of the sessions (called the query image) is compared with the global descriptors of all and each of the images taken during the other session.
According to [22], the 5 images whose hashes give the lowest L1-norm of the difference with the query hash, provided this norm is lower than a certain threshold δ, are taken as the inter-session loop closing candidates (GC_t):

GC_t = {I_i ∈ V_p : ||H_i − H_t||_1 < δ},

where V_p is the set of images taken by one of the robots from the start of its session until the current moment, H_i the hash of each of these images, and H_t the hash of the query image. The value of δ is selected experimentally. Once the set of candidates is established, the true positives are confirmed by means of the RANSAC-based Algorithm 1, forming the definitive set (GL_t) of images of the first session that, in principle, close a loop with the last image of the second session:

GL_t = {Z_i^t : i ∈ GC_t ∧ ¬fail(i, t)},

being Z_i^t the transformation found by Algorithm 1. Inter-session loops are accumulated at every iteration of both single SLAM sessions. These transforms Z_i^t are, in fact, a set of geometrical relations between the two different sessions. Assessing the performance of HALOC in loop closing detection, in terms of accuracy, recall and fall-out, is out of the scope of this paper, since it has already been presented in [22,42], with considerably good results underwater.
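This candidate-selection step can be sketched in a few lines; the hash values below are illustrative toy vectors (a real HALOC hash has 384 floats):

```python
import numpy as np

def intersession_candidates(query_hash, session_hashes, delta, k=5):
    """Return the indices of up to k images of the other session whose hash
    is closest to the query hash (L1 norm) and below the threshold delta."""
    dists = np.abs(session_hashes - query_hash).sum(axis=1)
    nearest = np.argsort(dists)[:k]          # the k closest hashes
    return [int(i) for i in nearest if dists[i] < delta]

# three stored hashes; only the first two are close enough to the query
hashes = np.vstack([np.zeros(4), np.ones(4), np.full(4, 10.0)])
query = np.full(4, 0.1)
candidates = intersession_candidates(query, hashes, delta=5.0)
```

The returned indices form GC_t and are then passed to the RANSAC-based registration for confirmation.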

Map Joining
As mentioned in the previous section, the loop closings between different sessions can be used to infer the geometrical relation between the two trajectories of the two robots that perform both missions simultaneously. The objective now is to align, at a certain moment, both surveys in a single global graph, and to maintain this single graph from the moment of the joining until the end of both missions.
Let X_1 denote the trajectory of one of the sessions, and let the first and last images of this trajectory be denoted as I_{1s} and I_{1e}, respectively. Let X_2 denote the trajectory of the second session, and let its first and last images be denoted as I_{2s} and I_{2e}, respectively. Let us denote the number of accumulated inter-session loop closings at a certain moment t as K, and this set of loop closings as Z_G, each one relating one image of the first session with another image of the second session:
Z_G = {Z_{1i}^{2i}, i = 1 … K},

where each Z_{1i}^{2i} represents a transform from image I_{1i} of session 1 to image I_{2i} of session 2, that is, the transform of a loop closing. Each Z_{1i}^{2i} belongs to a certain GL_t. Let us define X_{1e}^{2s} as the transform, or the relative motion, from I_{1e} to I_{2s}. For every loop closing, ideally,

Z_{1i}^{2i} = X_{1i}^{1e} ⊕ X_{1e}^{2s} ⊕ X_{2s}^{2i},

where X_{1i}^{1e} is the displacement from I_{1i} to I_{1e}, and X_{2s}^{2i} is the displacement from I_{2s} to I_{2i}. The proposal consists of an IEKF that gives the value of X_{1e}^{2s} that best matches all the loop closures found until the moment t. The state vector of the IEKF is just the transform X_{1e}^{2s}, the observation function for each loop closing is g_i^G = X_{1i}^{1e} ⊕ X_{1e}^{2s} ⊕ X_{2s}^{2i}, and Z_{1i}^{2i} is the corresponding measurement. With this, one can form the innovation and apply the classical EKF equations iteratively, as explained in Section 2.2.2. X_{1e}^{2s} is the transformation that joins the two sessions, in such a way that the state vectors of both trajectories, formed by displacements, are concatenated through this recently computed transformation as:

X_J = [(X_1)^T (X_{1e}^{2s})^T (X_2)^T]^T,

where X_J represents the joined trajectory, and X_1 and X_2 the state vectors of the first and second trajectories, respectively, from their starting points until the instant t. The idea is illustrated in Figure 3. Once both sessions have been joined, the trajectory-based schema is no longer valid, and the resulting map is transformed into a pose-based graph. All the robot displacements included in X_J are transformed into global poses that constitute the nodes of the global graph as:

X_i = X_{J,1} ⊕ X_{J,2} ⊕ … ⊕ X_{J,i},

where X_i is the pose associated with node i and X_{J,k} is the k-th displacement contained in X_J. The case i ≤ 1e refers to the global pose of a node corresponding to an element of the first trajectory, and the case i > 1e refers to the global pose of a node of the second trajectory.
All the displacements included in X J become the links between successive nodes and each node of the graph is associated with its corresponding image.
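The IEKF described above can be sketched for the planar case as follows. This is a minimal 2D illustration with our own function names and a numerical Jacobian, not the paper's implementation: each loop closing Z 1i 2i is predicted as X 1i 1e ⊕ X 1e 2s ⊕ X 2s 2i, and the classical EKF update is iterated over the accumulated loop closings.

```python
import numpy as np

def compose(a, b):
    """2D pose composition a ⊕ b, with poses given as (x, y, theta)."""
    x, y, th = a
    c, s = np.cos(th), np.sin(th)
    return np.array([x + c * b[0] - s * b[1],
                     y + s * b[0] + c * b[1],
                     th + b[2]])

def iekf_join(loops, x0, R, n_passes=10):
    """Estimate the joining transform X_1e2s from the accumulated
    inter-session loop closings. Each loop is a tuple
    (X_1i1e, X_2s2i, Z_1i2i); R is the measurement noise covariance."""
    x = np.array(x0, dtype=float)   # state: the transform X_1e2s
    P = np.eye(3)                   # initial state covariance (assumed)
    eps = 1e-6
    for _ in range(n_passes):
        for X_a, X_b, z in loops:
            g = compose(compose(X_a, x), X_b)     # predicted measurement
            H = np.zeros((3, 3))                  # numerical Jacobian dg/dx
            for j in range(3):
                dx = x.copy()
                dx[j] += eps
                H[:, j] = (compose(compose(X_a, dx), X_b) - g) / eps
            innov = z - g
            innov[2] = (innov[2] + np.pi) % (2 * np.pi) - np.pi  # wrap angle
            S = H @ P @ H.T + R                   # innovation covariance
            K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
            x = x + K @ innov
            P = (np.eye(3) - K @ H) @ P
    return x, P
```

With noise-free loop closings the filter converges to the true joining transform; with real data, R weights how strongly each new loop closing corrects the current estimate.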

Multi-Robot Graph SLAM
Let us assume that: (a) after both trajectories have been joined in a single graph, both robots are still running their own missions; and (b) the successive poses of both robots must be included in the global graph as new nodes, each one following the corresponding trajectory.
The Multi-robot Graph-SLAM procedure detailed below follows the indications of [51] in terms of structure, node generation, inclusion of loop closings as additional pose constraints, and graph optimization, but in our case particularized for a Multi-robot configuration. The algorithm includes the following points: (1) The local SLAM algorithm explained in Section 2.2 is continuously executed for both trajectories until they are joined, once a certain number of inter-session loop closings have been accumulated. It is better to optimize the local trajectories every N frames, even if only a couple of loop closings are available, so that, when both sessions are joined, the drift has already been reduced locally as much as possible. Otherwise, trajectories could be joined before local optimizations have been applied, transferring local drifts to the global map.
Let us assume that the Multi-robot localization is centralized in the first robot, which will receive, from the second robot: (a) the set of visual features and the global descriptor of the last gathered image, and (b) only if the map joining has to be done, the state vector and the last odometric displacement. The state vector is needed to attach it to the one of the first robot, if required. The last frame global descriptor is needed to find possible loop closures with frames of trajectory 1, the set of features is needed to confirm or reject the possible candidates, and the last odometric displacement of trajectory 2 will be used as a reference after the map joining.
(2) Once both trajectories have been joined in a single global graph, it is time to feed the map with the successive displacements of both robots. It is important to note that the last node of the graph corresponds to the last displacement of the second trajectory, since the first set of elements corresponds to the trajectory of robot 1, then comes the link between trajectories, and finally the elements of trajectory 2. Let us denote the identifier ID of the last node of the global graph as N n2 , where n2 represents the number of nodes in the graph, and is equal to the length of the joined state vector X J (|X J |). Accordingly, the ID of the graph node corresponding to the end of trajectory 1 will be N n1 , where n1 = |X J | − 1 − |X 2 |.
Storing N n1 and N n2 is necessary, since they will be the points of the global graph from which the successive nodes corresponding to trajectories 1 and 2, respectively, will be placed according to the ongoing motion of both vehicles.
The set of iterated actions performed for the Multi-session SLAM are:
1. Let us denote the last (or next) computed odometric displacements of trajectories 1 and 2 as X n1 and X n2 , respectively. These displacements, together with the last images of both sessions, are stored in the system.
2. If trajectory 1 has not finished, add a new node (N n1+1 ) to the graph, linked to N n1 with the transform X n1 . The global pose contained in this node will be X n1+1 = X n1 ⊕ X n1 , that is, the previous global pose composed with the last odometric displacement. N n1+1 will be the last node of trajectory 1.
3. If trajectory 2 has not finished, add a new node (N n2+1 ) to the graph, linked to N n2 with the transform X n2 . The global pose of this node will be X n2+1 = X n2 ⊕ X n2 , computed analogously. N n2+1 will be the last node of trajectory 2. The link between nodes N n1+1 and N n1 , and the link between nodes N n2+1 and N n2 , will contain the values of X n1 and X n2 , respectively. Each new node added to the graph is associated in the code with its corresponding image, regardless of the trajectory it belongs to. In this way, with the node ID one can find its associated image, and with an image identifier, one can find its associated node ID.
4. Search for inter-session loop closings between the last image of session 2 and all images of session 1 using the algorithm explained in Section 2.2.3. Those candidates of session 1 retrieved by HALOC whose transform, after the RANSAC discrimination process, presents a number of inliers lower than a pre-fixed threshold are discarded; let N ALC denote the number of accepted loop closings, each one relating nodes IDai and IDbi.
5. Add one additional link in the graph between nodes IDai and IDbi, whose content is Z IDai IDbi , ∀i, 1 ≤ i ≤ N ALC.
6. Run the graph bundle adjustment using the Levenberg-Marquardt algorithm. Even if, after a certain number of iterations, no inter-session loop closings are found, the graph will be optimized as well, just to re-adjust the odometric trajectory estimates.
7. Return to the first step, and iterate the process until both trajectories are finished. If one of the two trajectories finishes before the other one, the system keeps adding the corresponding nodes of the session that is still on course. Obviously, no additional inter-session loop closings will be found in this case, so every graph optimization will include only the pose estimates given by the visual odometry of the ongoing mission.
The idea is illustrated by Algorithm 2.
Algorithm 2 inputs:
X n1 , X n2 : last odometric displacements of trajectories 1 and 2 before the map joining.
N n1 , N n2 : identifiers (ID) of the last graph nodes corresponding to trajectories 1 and 2, after the map joining.
X n1 , X n2 : global poses corresponding to nodes N n1 and N n2 , after the map joining, with X n1 = X n1−1 ⊕ X n1 (previous global pose composed with the last displacement).
I n1 , I n2 : last images taken by robots 1 and 2 at instants n1 and n2, just before the map joining.
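The node-adding logic of the first steps above can be sketched with a minimal pose-graph structure. This is a simplified illustration with hypothetical names, not the Matlab implementation referenced in the Sources Availability section: each trajectory keeps growing from its own last node, and inter-session loop closings simply add extra edges.

```python
import numpy as np

def compose(a, b):
    """2D pose composition a ⊕ b, poses as (x, y, theta)."""
    x, y, th = a
    c, s = np.cos(th), np.sin(th)
    return np.array([x + c * b[0] - s * b[1],
                     y + s * b[0] + c * b[1],
                     th + b[2]])

class MultiRobotGraph:
    """Minimal pose-based graph: each node stores a global 2D pose, and
    each trajectory grows independently from its own last node."""
    def __init__(self):
        self.poses = []     # global pose of each node
        self.edges = []     # (i, j, relative transform) constraints
        self.last = {}      # trajectory id -> ID of its last node

    def add_node(self, traj, displacement):
        """Append a node for `traj`, chained to that trajectory's last node."""
        if traj in self.last:
            parent = self.last[traj]
            pose = compose(self.poses[parent], displacement)
            self.edges.append((parent, len(self.poses), np.asarray(displacement)))
        else:
            pose = np.asarray(displacement, dtype=float)  # first node: its global pose
        self.poses.append(pose)
        self.last[traj] = len(self.poses) - 1
        return self.last[traj]

    def add_loop(self, i, j, transform):
        """Inter-session loop closing: an extra constraint between nodes
        i and j, to be used later by the graph optimizer."""
        self.edges.append((i, j, np.asarray(transform)))
```

After the map joining, trajectory 2 would be registered with its joined global pose as first node, and calls to add_node for both trajectories interleave as new odometric displacements arrive.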

Experimental Setup
A set of preliminary experiments was performed, simulating the Multi-robot configuration with real underwater data. Three different datasets were used. Every dataset consists of two different video sequences, partially overlapping. Sequences were recorded either by a diver or by an AUV, both carrying a bottom-looking camera with its lens axis perpendicular to the longitudinal axis of the vehicle or of the diver. Divers or autonomous underwater vehicles moved at an approximately constant altitude with respect to the sea bottom. The lack of any other sensory data that could be supplied by an AUV makes the localization system a pure vision-based approach. The three datasets were recorded in several coastal sites of the north and south of Mallorca, at depths between 5 and 13 m. All the environments where the datasets were grabbed present a great variety of bottom textures, including seagrass, stones, sand, algae, moss and pebbles. The first dataset was recorded on the north coast of the island by a diver with an attached GoPro camera, pointing downwards, moving on the surface at an approximately constant altitude of 4 m. Let us refer to both video sequences of this first dataset as S11 and S12. The camera altitude was obtained at the beginning of the video sequence by means of a visual marker of known size, placed at the sea bottom, at the starting point of each trajectory.
A second dataset, formed by two partially overlapping trajectories named S21 and S22, was also recorded on the north coast of Mallorca, far from S11 and S12, again by a diver equipped with a downward-looking GoPro, swimming on the water surface at an approximately constant altitude of 4 m. In this case, the initial altitude was computed thanks to the known dimensions of a structure formed by markers and PVC tubes placed at the sea floor at the origin of both trajectories. The video resolution was 1920 × 1080 pixels, grabbed at 30 frames per second (fps), and prior to their use, all images were scaled down to 320 × 180 pixels.
A third dataset, with two video sequences named S31 and S32, was recorded by a SPARUS II AUV [52], property of the University of the Balearic Islands, at 7.5 fps, moving at a constant altitude of 3 m, in an area of the south of the island with an almost constant depth of 16 m. The navigation altitude is obtained from the vehicle navigation filter, which integrates a Doppler Velocity Log (DVL), an Inertial Measurement Unit (IMU), a pressure sensor, an Ultra Short Baseline (USBL) acoustic modem [53], and a stereo 3D odometer. This dataset made it possible to test the approach in larger environments with complex imagery due to the presence of seagrass on the sea bottom. In particular, S31 was recorded along a trajectory 93 m long, and S32 along a trajectory of 114 m, each covering an approximate area of 300 m^2. Figure 4 shows some samples of images included in the three datasets. All these images show how all regions are colonized with Posidonia oceanica, a seagrass that forms dense and large meadows. Images of dataset 3 show a lack of illumination, which increases at larger depths. Under these conditions, the feature matching process decreases its performance, which directly affects the accuracy of the visual odometry and the loop closing detection using HALOC. In this type of marine environment, and with our robot and its equipment, moving at approximately 1 knot at altitudes between 3 m and 5 m, in areas with a depth between 16 m and 20 m, gave a good trade-off between image overlap and illumination conditions. Due to the particular texture of the Posidonia and the slight motion of its leaves caused by the currents, tracking stable visual features in consecutive overlapping frames is complicated and requires an accurate selection of the type of features and the feature detection/tracking parameters. Errors in this process will compromise the accuracy of the visual odometry and the image registration task for the loop closing confirmation.
Previous pieces of work [22,41,42] already showed the high efficiency of SIFT features for underwater SLAM in areas colonized with Posidonia, in all the tasks involved in the process: visual odometry, image hashing with HALOC, loop closing detection, and pose refinement. Although the SIFT feature detector is slower than other descriptors and delays the RANSAC-based matching process of Algorithm 1, given the robustness and traceability of SIFT, and according to our experience, this additional processing time is a price worth paying: it yields more reliable trajectories than simpler features, which take less time to compute and track but can cause larger inaccuracies in the camera trajectory estimation or in the image registration process.

Experiments and Results
All experiments performed to get these first preliminary results were run offline, simulating the Multi-robot configuration with the visual data obtained at sea. Key images of each video sequence that form the different datasets mentioned in the previous section were extracted, indexed, stored separately on a hard disk, and processed consecutively according to the algorithms exposed in this paper.
Successive field experiments showed that an overlap between consecutive images of 35% to 50% was necessary to obtain a robust visual odometry. On the other hand, reducing the number of stored images was also required in order to save as much memory space as possible. Consequently, a good trade-off between both requirements was obtained by selecting the keyframes of datasets 1 and 2 down-sampling the initial video frame rates to 1.1 fps, on average, and dataset 3 to 3 fps. A total of 226 key images were extracted from S11 and 199 from S12; 152 key frames were extracted from the video sequence S21 and 57 from S22. Finally, a total of 400 key images were extracted from dataset 3, 200 belonging to S31 and 200 to S32.
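The down-sampling above amounts to keeping roughly one frame out of every src_fps/target_fps. A minimal sketch (the function name and the rounding policy are illustrative assumptions, not the actual extraction tool):

```python
def select_keyframes(n_frames, src_fps, target_fps):
    """Indices of the frames kept when down-sampling a video recorded at
    src_fps so that keyframes are extracted at roughly target_fps."""
    step = max(1, round(src_fps / target_fps))
    return list(range(0, n_frames, step))

# a 30 fps video down-sampled to ~1.1 fps keeps about one frame in 27
keep = select_keyframes(n_frames=6000, src_fps=30, target_fps=1.1)
```

In practice, the chosen target rate trades image overlap (needed for robust visual odometry) against storage.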
Local SLAMs are continuously executed for both trajectories in sequential steps of a predefined number of frames N; each local trajectory is accumulated and optimized every N frames, alternating both sessions. N is set differently for each dataset. Figure 5 shows several samples of inter-session loop closings found by HALOC and confirmed by Algorithm 1. Once images and inter-session loops are available, the whole localization and mapping process starts. In other words, the sequence of actions is as follows:
1. For each dataset, extract the key images of both video sequences and store them in separate folders.
2. For each dataset, compute the HALOC global descriptor of each image extracted from both video sequences.
3. For each dataset, compute and store in a file the frame-to-frame odometry for both stored image sets, corresponding to both sessions.
4. At this point, for each dataset and for each of their sessions, the key frames and the odometry have been stored and related through successive identifiers. Thereafter, for each dataset, run the local SLAM procedure, which: (a) starts the algorithm of Section 2.2, building the state vector of each session by steps of N consecutive frames, using the displacements included in each odometry file; and (b) searches among all images of session 1 for the best 5 candidates to close an inter-session loop with each new query of session 2, and filters out all those that do not present enough inliers after running Algorithm 1. HALOC is, obviously, the method used to find these inter-session loop closing candidates. Since each image will be associated with a node of the global graph, computing the transform between two candidates to close a loop and adding this transform between both nodes is straightforward.
This simulation does not permit assessing anything related to execution times because, although the whole process works with real data, the Multi-robot configuration has been simulated, split into three different software packages of different natures. The purpose of these preliminary results is not to give an accurate set of quantitative and numerical results to assess the process in terms of execution time or trajectory accuracy. The aim of this section is to present the implementation of a new approach and a set of preliminary results that provide: (a) a proof of viability and feasibility of the solution, (b) a proof of its utility and suitability to manage, in a single map, two sessions of two different robots that operate simultaneously, in a simple way, (c) a qualitative proof of concept, and (d) the source code and a dataset to be tested, open to further improvements.
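The RANSAC discrimination used to confirm or discard the HALOC candidates can be sketched, for the planar rigid case, as follows. This is a generic illustration written from scratch (all function names are our own), not the paper's Algorithm 1, which operates on SIFT matches with its own thresholds: candidates whose inlier count falls below a minimum are rejected as loop closings.

```python
import numpy as np

def fit_rigid(p, q):
    """Least-squares 2D rigid transform (Kabsch method): q ≈ R p + t."""
    cp, cq = p.mean(axis=0), q.mean(axis=0)
    H = (p - cp).T @ (q - cq)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # keep a proper rotation
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

def rigid_ransac(p, q, iters=200, tol=3.0, seed=0):
    """RANSAC over putative matches p[i] <-> q[i]; returns the transform
    and the inlier mask, to be compared against a minimum-inlier count."""
    rng = np.random.default_rng(seed)
    best_inl = np.zeros(len(p), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(p), size=2, replace=False)  # minimal sample
        R, t = fit_rigid(p[idx], q[idx])
        err = np.linalg.norm(p @ R.T + t - q, axis=1)
        inl = err < tol
        if inl.sum() > best_inl.sum():
            best_inl = inl
    R, t = fit_rigid(p[best_inl], q[best_inl])   # refit on all inliers
    return R, t, best_inl
```

A candidate pair would be accepted as a loop closing only when `best_inl.sum()` exceeds the pre-fixed threshold, and (R, t) then becomes the loop constraint added to the graph.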
Obtaining a ground truth trajectory underwater is a challenging task, unless one can install an infrastructure of acoustic beacons or Long Baseline (LBL) systems, which is costly, complicated to run and manage, and imposes spatial restrictions on the motion of the robots. In our experiments, there is no ground truth and no possibility to get it. The planned trajectories, in the case of those performed with the AUV, cannot be used as a ground truth either, since they differ substantially from the ones that the AUV ends up performing (which is usual in underwater robotics). In consequence, we cannot compare the trajectories estimated by the system with another one that serves as reference. In our case, robustness has been qualitatively validated by two means: (a) comparing the resulting global maps with the mosaics obtained with BIMOS [54], and (b) comparing the direct transforms between images that close loops, obtained with Algorithm 1, with the transforms between the graph nodes related to the same loop closing images. These two means have already been used and validated in previous pieces of work [46]. Figure 6a,b,d,e show, respectively, the trajectories of S11, S12, S21 and S22 estimated by the local SLAM procedure described in Section 2.2. Figure 6c,f show, respectively, the global graphs obtained applying the Multi-robot SLAM procedure described in Section 2.3. Figures 7 and 8 show four photo-mosaics corresponding to sequences S11, S12, S21 and S22. These photo-mosaics have been obtained using BIMOS [54], a mosaicing algorithm based on bags of binary words that has already demonstrated great performance in underwater environments [55,56]. The resulting mosaics have an associated implicit trajectory that fixes the position of each image with respect to the origin of the mosaic coordinate system.
Due to the lack of any trajectory ground truth and the impossibility of getting it, the mosaic is, to a certain extent, a qualitative reference to assess the quality of the resulting joined trajectories of Figure 6c,f, since BIMOS has already demonstrated its good performance on land and underwater. Notice how the mosaic of Figure 7b shows a montage very close to the SLAM trajectory of S11, and Figure 7a a mosaic fitting the SLAM trajectory of S12. It can be assumed that the quality of the single and joined graphs is similar to the quality of the mosaics obtained with BIMOS. If both mosaics are aligned by their left sides, where the marker is located, the result fits very well with the global graph. The same applies to the mosaics of Figure 8. The structure containing the two markers is the point joining both trajectories. Then, the alignment of both figures by the markers gives a result that also fits perfectly with the global graph of Figure 6. Figure 9a,b show the local SLAM trajectories of S31 and S32, and Figure 9c shows the global graph estimated after joining both trajectories and applying the Multi-robot graph SLAM approach.
An illustrative video of the whole process involving the three datasets can be seen in [57]. The video shows, at the beginning, some sequences grabbed underwater and used to test our approach. Afterwards, it shows the whole process for the three datasets: (1) the local SLAM for both separated trajectories, (2) the moment when both sessions are joined and converted into a single global graph, and (3) how the graph continues growing in different directions, each one corresponding to each trajectory involved in the Multi-robot mission. As mentioned in previous sections, the joined graph is optimized every time a set of inter-session loop closings is confirmed.
As this is a pure visual SLAM approach and no other sensory input is included in the multi-localization process, the result is first computed in pixel units. To find the relation between pixels and metric units, the known real dimensions of the markers used to establish the starting and end points of each trajectory (see pictures (a) and (e) of Figure 4) were related to the pixel dimensions of the markers in the images. These relations resulted in coefficients ranging between 0.0015 and 0.0019 m/pixel, depending on each video sequence.
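The scale recovery reduces to a simple ratio. As an illustrative sketch (function names are ours), with a marker of known side length and its measured size in pixels:

```python
def pixel_to_meter_scale(marker_size_m, marker_size_px):
    """Meters-per-pixel coefficient obtained from a marker of known size."""
    return marker_size_m / marker_size_px

def trajectory_to_meters(traj_px, scale):
    """Apply the coefficient to the translational part of a trajectory,
    given as a list of (x, y, theta) tuples in pixels and radians."""
    return [(x * scale, y * scale, th) for x, y, th in traj_px]
```

For instance, a hypothetical 0.30 m marker spanning 180 px gives a coefficient of about 0.00167 m/pixel, within the 0.0015 to 0.0019 range reported above.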
Another way to verify the consistency of the resulting optimized global map is to compare the transform between two images (called I1 and I2) that close an inter-session loop, when it is obtained by two different means: (1) using Algorithm 1, and (2) computing ⊖P1 ⊕ P2, where P1 is the global pose associated with the graph node corresponding to I1, and P2 is the global pose associated with the graph node corresponding to I2. The idea is illustrated in Figure 10: P1 and P2 are the global poses associated with two loop closing nodes, and T I1 I2 is the direct transform between P1 and P2 obtained with the RANSAC-based algorithm applied directly on I1 and I2. In principle, if the graph is consistent, T I1 I2 has to be equivalent to the inverse of P1 composed with P2, which is the transform between I1 and I2 obtained by composing the global poses of the respective nodes of the loop closing images. Tables 1 and 2 show some samples of quantitative results of inter-session loop closings with the corresponding transforms calculated in both aforementioned ways. The first to fourth columns show, respectively, the identifiers of both images that close an inter-session loop and the graph nodes each image is associated with. Column I1 contains images of sessions S11 and S21, and column I2 contains images of sessions S12 and S22. The fifth and sixth columns indicate the 2D transform, in translation and rotation (x, y, θ), computed indirectly through the graph and directly using RANSAC (Algorithm 1). The units of these transforms are expressed in pixels and radians. The seventh column indicates the difference between both transforms, in modulus (meters) and orientation (radians). These samples indicate that: (a) for S11 and S12, the difference of transforms ranges between 0.15 mm and 1.8 cm in modulus and between 0.0089 rad (0.5°) and 0.05 rad (2.86°) in orientation, and (b) the difference for S21 and S22 ranges between 1.02 cm and 14.8 cm in modulus and between 0.0085 rad (0.5°) and 0.21 rad (12.03°) in orientation. These differences are totally acceptable, taking into account that there are errors inherent to the RANSAC transform estimation process, due to the possible (and usual) presence of some inconsistent inlier, and differences (or errors) due to the successive graph optimizations, which also cause subsequent readjustments of all node poses.
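The consistency check can be written directly with the pose composition operators ⊕ and ⊖. A minimal 2D sketch with our own function names:

```python
import numpy as np

def compose(a, b):
    """2D pose composition a ⊕ b, poses as (x, y, theta)."""
    x, y, th = a
    c, s = np.cos(th), np.sin(th)
    return np.array([x + c * b[0] - s * b[1],
                     y + s * b[0] + c * b[1],
                     th + b[2]])

def inverse(a):
    """Inverse pose ⊖a, so that compose(inverse(a), a) is the identity."""
    x, y, th = a
    c, s = np.cos(th), np.sin(th)
    return np.array([-c * x - s * y, s * x - c * y, -th])

def loop_discrepancy(P1, P2, T_direct):
    """Compare the graph-implied transform ⊖P1 ⊕ P2 with the direct
    RANSAC transform; return (modulus, orientation) of the difference."""
    T_graph = compose(inverse(P1), P2)
    d = T_graph - np.asarray(T_direct, dtype=float)
    dth = (d[2] + np.pi) % (2 * np.pi) - np.pi   # wrapped angle difference
    return np.hypot(d[0], d[1]), abs(dth)
```

A consistent graph yields discrepancies close to zero, of the order of the values reported in Tables 1 and 2.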

Some Considerations on Data Reduction
The length of the HALOC global descriptor used in the aforementioned tests is 384 floats, that is, a total of 1536 bytes per image, considering that in C++ a float needs 4 bytes of memory. All this regardless of the image resolution and the amount of SIFT features per image: no matter how big the image is or how many visual features are detected, the length of the hash remains invariable. Conversely, the size of a color image with a very reduced resolution of 320 × 240 pixels would be 320 × 240 × 3 = 230,400 bytes. The saving in memory space for data storage when using the HALOC hash instead of the original images is thus evident. The set of image features must be stored for every image in any case, since they are needed in the later processes of loop closing confirmation. However, the computational cost of comparing two hashes to retrieve the best candidates for loop closing, just calculating the L1-norm of two vectors, is much lower than finding the best candidates with a brute-force recursive feature double-matching with RANSAC. From this point on, one can think of applying additional strategies to limit the communication between robots, complementing, for instance, the solution proposed in [35], where the images are sent only among robots that view, simultaneously, a common point. In this case, we introduce an additional layer to compress the information to be exchanged, since instead of sending JPEG images, robots would send their respective hashes.
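The numbers above can be reproduced with a few lines, together with the L1-norm candidate retrieval. This is a sketch with hypothetical names; the real HALOC library is referenced in the Sources Availability section:

```python
import numpy as np

HASH_LEN = 384            # floats per HALOC hash (as used in the tests)
FLOAT_BYTES = 4           # bytes per float in C++

def hash_bytes():
    """Size of one hash, independent of image resolution and feature count."""
    return HASH_LEN * FLOAT_BYTES

def image_bytes(w=320, h=240, channels=3):
    """Size of an uncompressed color image of the given resolution."""
    return w * h * channels

def best_candidates(query, stored, k=5):
    """IDs of the k stored hashes closest to the query under the L1 norm."""
    dists = np.abs(stored - query).sum(axis=1)
    return np.argsort(dists)[:k]
```

hash_bytes() returns 1536 and image_bytes() returns 230,400, a ratio of 150:1 before even considering any compression of the exchanged images, while the candidate search is a single vectorized distance computation.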

Sources Availability
The source code for the odometry computation has been developed in Matlab and is available at [58]. The source code of the HALOC library is available at [59], for its C++ version, and at [60], for its Python version.
The sources for the local SLAM, the map joining, and the later Multi-robot Graph SLAM have also been developed in Matlab, and they are available for the community at [61]. The pose-based graph management has been programmed using the Matlab library for localization and pose estimation especially addressed to mobile autonomous vehicles [62].

Discussion, Conclusions and Future Work
This paper presents a new approach to visual SLAM for Multi-robot configurations, based on joining, in a single pose-based graph, several trajectories of different robots which operate simultaneously in a common area of interest. The system finds loop closings between images of different robot trajectories by means of a hash-based methodology (HALOC), and uses them to add additional constraints to the global graph. As explained in the text, the use of HALOC clearly guarantees an important reduction in storage space, in the amount of data to be transferred, and in the time dedicated to loop closing detection, especially in centralized Multi-robot configurations. Using HALOC also assures a proper graph optimization, since it has already been proved underwater, showing excellent results in terms of success ratios in loop closing detection.
The strategy for map joining comes from [46], but adapted to a Multi-robot configuration, which differs from a multi-session case in some aspects. This strategy is simple, easy to replicate, effective, and, more importantly, as flexible as possible to modulate the moment of map joining depending on the mission conditions and convenience. This implies a trade-off between the accuracy of the local maps before they are joined and the need for joining all the trajectories as soon as possible, to centralize the global multiple localization of the whole robotic team in a single agent.
Preliminary experiments showed that the application of the new approach for joining, online, multiple ongoing sessions is perfectly feasible, suggesting a certain consistency and reliability in the results from a qualitative point of view.
Although, until now, we have focused our efforts exclusively on the estimation of the camera pose and trajectory, this algorithm has been designed to be applied on board a vehicle. Therefore, one priority ongoing task is testing this algorithm in a team of real vehicles operating in the sea. To this end, a ROS [63] wrapper in C++ is currently being developed and tested.
Up to now, we have focused our efforts exclusively on the estimation of the camera pose and trajectory, assuming that navigation and control are solved issues. In fact, most of the existing research on SLAM makes the same assumption. However, the continuous re-estimation of the vehicle poses by the SLAM algorithm surely affects the control of the vehicles, because the control modules are fed with the poses and velocities. In addition, changes in control affect, in turn, the vehicle navigation. At the moment, the SLAM algorithm is completely decoupled from the control module, but once it is installed on a vehicle, the vehicle velocity and pose provided by our SLAM modules, together with the mission goal points, will be input to the navigation and control subsystems. Another possible line of research under consideration is to make the goal points also depend on SLAM, in order to add exploration to the AUV capabilities.
Other future work includes: (1) extending the tests to additional environments with longer trajectories; (2) extending the assessment of the approach by evaluating the performance of the SLAM pose corrections in the presence of additive Gaussian noise in the visual odometry, together with all the evaluation techniques employed in [46]; and (3) comparing with other Multi-robot software packages not yet tested underwater, such as DSLAM.
Matlab sources are available in a public repository, giving the scientific community the chance to test, replicate, and also improve them.