Experimental Validation of Entropy-Driven Swarm Exploration under Sparsity Constraints with Sparse Bayesian Learning

Increasing the autonomy of multi-agent systems or swarms for exploration missions requires tools for efficient information gathering. This work studies this problem from theoretical and experimental perspectives and evaluates an exploration system for multiple ground robots that cooperatively explore a stationary spatial process. For the distributed model, two conceptually different distribution paradigms are considered. The exploration is based on fusing distributively gathered information using Sparse Bayesian Learning (SBL), which permits representing the spatial process in a compressed manner and thus reduces the model complexity and communication load required for the exploration. An entropy-based exploration criterion is formulated to guide the agents. This criterion uses an estimate of the covariance matrix of the model parameters, which is then quantitatively characterized using a D-optimality criterion. The new sampling locations for the agents are then selected to minimize this criterion. To this end, a distributed optimization of the D-optimality criterion is derived. The proposed entropy-driven exploration is then presented from a system perspective and validated in laboratory experiments with two ground robots. The experiments show that SBL together with the distributed entropy-driven exploration is real-time capable and leads to better performance in terms of exploration time and accuracy compared with similar state-of-the-art algorithms.


Introduction
For exploration tasks that rely on multi-agent systems operating in complex, unstructured terrains, autonomy plays a key role in reducing potential threats to, or tedious work for, human operators, be it in space exploration, disaster relief, or routine industrial facility inspections. The main objective here is to give a human operator more detailed information about the explored area, e.g., in terms of a map, and to support further decision making. While multiple agents do provide an increased sensing aperture and can potentially collect information more efficiently than a single-agent system, they have to rely more heavily on autonomy to compensate for, e.g., possibly large (or unreliable) communication delays [1] or the complexity of teleoperating multiple agents.
One of the approaches to increase the autonomy of multi-agent systems consists of using in situ analysis of the collected data with the agents' own computing resources to decide on future actions. In the context of mapping, such an approach is also known as active information gathering [2,3] or exploration. Note that mapping is generally not restricted to sensing with imaging sensors, such as cameras. The exploration of gas sources [4] or of the magnetic field [5] also falls in this category.
Such an approach to active information gathering is the focus of the presented work. In the following, we provide an overview of work related to the approach discussed in this paper, the arising challenges, and a proposed solution.

Paper Contribution
To address this, the exploration problem with sparsity constraints has been cast into a probabilistic framework, where the parameter covariance can be computed exactly. In [15], we formulated a Bayesian approach toward cooperative sparse parameter estimation for splitting-over-features (SOF) data splitting, and in [16] for splitting-over-examples (SOE) data splitting. However, the distributed computation of the covariance matrix and information-driven exploration has not been considered so far. With this contribution, we close this gap and study an information-driven exploration strategy that is based on a Bayesian approach toward distributed sparse regression. Specifically,
• we consider a distributed computation of the corresponding parameter covariance matrices for information-seeking exploration using a Bayesian formulation of the model, and
• we validate the algorithm's performance both in simulations and in an experiment with two robots exploring the magnetic field variations on a laboratory floor.
The rest of the paper is structured as follows. We begin with a model formulation and model learning in Section 2. In Section 3, we discuss a distributed computation of the exploration criterion for the considered regression problem. Afterwards, we outline the experimental setting, the collection of ground truth data, and the sensor calibration in Section 4, as well as the overall system design in Section 5. The experimental results are summarized in Section 6, and Section 7 concludes this work.

Model Definition
We make use of a classical basis function regression [17] to express an unknown scalar physical process p(x) ∈ R, with x ∈ R^d and d ∈ N. Typically, the process is d-dimensional, with d ∈ {2, 3}. To represent the process p(x), a set of N ∈ N basis functions φ_n(x, π_n) ∈ R, n = 1, ..., N, is used, where π_n ∈ R^s depends on the chosen basis function and s is the number of parameters per basis function.
Each basis function is parameterized with π_n, n = 1, ..., N, which can represent the center of the corresponding basis function, its width, etc. More formally, we assume that

p(x) ≈ Σ_{n=1}^{N} w_n φ_n(x, π_n),    (1)

where w_n ∈ R are generally unknown weights in the representation.
To estimate w_n, n = 1, ..., N, we make M observations of the process p(x) at locations X = [x_1, ..., x_M]^T ∈ R^{M×d}. The corresponding m-th measurement is then represented as

y(x_m) = Σ_{n=1}^{N} w_n φ_n(x_m, π_n) + η(x_m),    (2)

where η(x_m) ∼ N(0, λ^{-1}) is an additive sample of white Gaussian noise with a known precision λ ∈ R_+. By collecting M measurements in a vector y(X) = [y(x_1), ..., y(x_M)]^T ∈ R^M, we can reformulate (2) in a vector-matrix notation. To this end, we define φ(x) = [φ_1(x, π_1), ..., φ_N(x, π_N)]^T, Φ = [φ(x_1), ..., φ(x_M)]^T ∈ R^{M×N}, w = [w_1, ..., w_N]^T ∈ R^N, and η = [η(x_1), ..., η(x_M)]^T ∈ R^M, which allows us to formulate the measurement model in a vectorized form as

y = Φw + η, with η ∼ N(0, λ^{-1}I).    (7)

Based on (7), we define the likelihood of the parameters w as follows:

p(y|w) = N(y|Φw, λ^{-1}I).    (8)

Often, the representation (1) is selected such that N ≫ M, i.e., the model is underdetermined. This implies that there is an infinite number of possible solutions for w. A popular approach to restrict the set of solutions consists of introducing sparsity constraints on the parameters. Within the Bayesian framework, this can be achieved by defining a prior over the parameter weights w. This leads to a class of probabilistic approaches referred to as Sparse Bayesian Learning (SBL).
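To make the vectorized model concrete, the following sketch builds a design matrix from generic basis callables and draws noisy observations according to (7). All names, the example bump-shaped basis functions, and the parameter values are illustrative assumptions, not the experimental configuration.

```python
import numpy as np

def build_design_matrix(X, basis_fns):
    """Phi[m, n] = phi_n(x_m); X has shape (M, d)."""
    return np.column_stack([np.apply_along_axis(phi, 1, X) for phi in basis_fns])

rng = np.random.default_rng(0)
d, M = 2, 50
X = rng.uniform(-1, 1, size=(M, d))                        # measurement locations
# illustrative basis: bumps at random centers c (hypothetical choice)
basis_fns = [lambda x, c=c: np.exp(-np.sum((x - c) ** 2))
             for c in rng.uniform(-1, 1, size=(4, d))]
w = np.array([1.0, 0.0, -0.5, 0.0])                        # sparse weights
lam = 1e4                                                  # noise precision
Phi = build_design_matrix(X, basis_fns)
y = Phi @ w + rng.normal(0.0, lam ** -0.5, M)              # Eq. (7)
```

The same construction works for any family of basis functions φ_n(x, π_n); only the callables change.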
The basic idea of SBL is to assign an appropriate prior to the N-dimensional vector w such that the resulting maximum a posteriori (MAP) estimate of w is sparse, i.e., many of its entries are zero. Typically, SBL specifies a hierarchical factorable prior p(w|γ)p(γ) = ∏_{n=1}^{N} p(w_n|γ_n)p(γ_n), where p(w_n|γ_n) = N(w_n|0, γ_n), n ∈ {1, ..., N} [18][19][20]. For each n ∈ {1, ..., N}, the hyperparameter γ_n, also called the sparsity parameter, regulates the width of p(w_n|γ_n); the product p(w_n|γ_n)p(γ_n) defines a Gaussian scale mixture (the authors in [21] extend this framework by generalizing p(w_n|γ_n) to be the probability density function (PDF) of a power exponential distribution, which makes the hierarchical prior a power exponential scale mixture distribution). Bayesian inference on a linear model with such a hierarchical prior is commonly realized via two types of techniques: MAP estimation of w (Type I estimation; note that many traditional "non-Bayesian" methods for learning sparse representations, such as basis pursuit de-noising or re-weighted ℓp-norm regressions [22][23][24], can be interpreted as Type I estimation within the above Bayesian framework [21]) or MAP estimation of γ (Type II estimation, also called maximum evidence estimation, or empirical Bayes method). Type II estimation has proven (both theoretically and empirically) to perform consistently better than Type I estimation in the present application context. One reason is that the objective function of a Type II estimator typically exhibits significantly fewer local minima than that of the corresponding Type I estimator and promotes greater sparsity [25]. The hyperprior p(γ_n), n ∈ {1, ..., N}, is usually selected to be non-informative, i.e., p(γ_n) ∝ γ_n^{-1} [26][27][28]. The motivation for this choice is twofold.
First, the resulting inference schemes typically demonstrate superior (or similar) performance as compared to schemes derived based on other hyperprior selections [21]. Second, very efficient inference algorithms can be constructed and studied [26][27][28][29][30].
In the following, we consider only SBL Type II optimization, as it usually leads to sparser parameter vectors w [21], and we drop explicit dependencies on the measurements X and basis function parameters Π to simplify notation. The marginalized likelihood for SBL Type II optimization is therefore

p(y|γ) = N(y|0, Σ),    (9)

where Σ = λ^{-1}I + ΦΓΦ^T, Γ = diag{γ}, and I is the identity matrix. Taking the negative logarithm of (9) and dropping constant terms, we obtain the objective function for SBL Type II optimization in the following form:

L(γ) = log|Σ| + y^T Σ^{-1} y.    (10)

An estimate of the hyperparameters γ is then found as

γ̂ = arg min_γ L(γ).    (11)

Once the estimate γ̂ is obtained, the posterior probability density function (PDF) of the parameter weights w can be easily computed: it is known to be Gaussian p(w|y, γ̂) = N(ŵ, Σ_w) with the moments given as

Σ_w = (λΦ^TΦ + Γ̂^{-1})^{-1},  ŵ = λΣ_wΦ^T y,    (12)

where Γ̂ = diag{γ̂} (see also [18]).
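The Type II objective (10) and the posterior moments (12) translate directly into code. The following is a minimal numerical sketch under the stated Gaussian model; the function and variable names are our own, not the authors' implementation.

```python
import numpy as np

def type2_objective(gamma, Phi, y, lam):
    """L(gamma) = log|Sigma| + y^T Sigma^{-1} y, Sigma = lam^{-1} I + Phi Gamma Phi^T."""
    M = len(y)
    Sigma = np.eye(M) / lam + Phi @ np.diag(gamma) @ Phi.T
    _, logdet = np.linalg.slogdet(Sigma)
    return logdet + y @ np.linalg.solve(Sigma, y)

def posterior_moments(gamma, Phi, y, lam):
    """Gaussian posterior p(w | y, gamma) = N(w_hat, Sigma_w), Eq. (12)."""
    Sigma_w = np.linalg.inv(lam * Phi.T @ Phi + np.diag(1.0 / gamma))
    w_hat = lam * Sigma_w @ Phi.T @ y
    return w_hat, Sigma_w
```

With large γ_n (a flat prior) and high noise precision, the posterior mean approaches the least-squares solution, which is a useful sanity check.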

Sparse Bayesian Learning with the Automatic Relevance Determination
The key to a sparse estimate of w is the solution to (11). There are a number of efficient schemes [26][27][28] to solve this problem. The method that we use in this paper is based on [26]. In the following, we briefly outline this algorithm.
In [26], the authors introduced the reformulated automatic relevance determination (R-ARD) by using an auxiliary function that upper-bounds the objective function L(γ) in (10). Specifically, using the concavity of the log-determinant in (10) with respect to γ, the former can be represented using a Fenchel conjugate as

log|Σ| = min_z { z^T γ − h*(z) },    (13)

where z ∈ R^N is a dual variable and h*(z) is the dual (or conjugate) function (see also [31] (Chapter 5) or [32]). Using (13), we can now upper-bound (10) as follows:

L(γ, z) = z^T γ − h*(z) + y^T Σ^{-1} y ≥ L(γ).    (14)

Note that for any γ, the bound becomes tight when minimized over z. This fact is utilized for the numerical estimation of γ, which is the essence of the R-ARD algorithm.
R-ARD alternates between estimating z, which can be found in closed form as [26,31]

z = ∇_γ log|Σ| = diag{Φ^T Σ^{-1} Φ},    (15)

and estimating γ as a solution to a convex optimization problem

γ̂ = arg min_γ L(γ, z).    (16)

In order to solve (16), the authors in [26] proposed to use yet another upper bound on L(γ, z). Specifically, the data-dependent term can be bounded as

y^T Σ^{-1} y ≤ λ‖y − Φw‖²₂ + Σ_{l=1}^{N} w_l²/γ_l,    (17)

with equality when minimized over w, which leads to the bound

L(w, γ, z) = λ‖y − Φw‖²₂ + Σ_{l=1}^{N} ( w_l²/γ_l + z_l γ_l ) − h*(z) ≥ L(γ, z).    (18)

The right-hand side of (18) is convex both in w and γ. As such, for any fixed w, the optimal solution for γ can be easily found as γ_l = z_l^{−1/2}|w_l|, l = 1, ..., N. By inserting the latter in (18), we find the solution for w that minimizes the upper bound L(w, γ, z) as

ŵ = arg min_w { λ‖y − Φw‖²₂ + 2 Σ_{l=1}^{N} z_l^{1/2} |w_l| },    (19)

which can be recognized as a weighted least absolute shrinkage and selection operator (LASSO) cost function. Expression (19) builds a basis for a distributed learning of the SBL parameters, since there exist techniques to optimize a LASSO cost function over a network, which are presented in the following section.
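The resulting alternation between the closed-form dual update (15) and the weighted LASSO (19) can be sketched as follows. Here the LASSO subproblem is solved with plain iterative soft-thresholding (ISTA), which is our illustrative choice; any LASSO solver could be substituted, and all names and iteration counts are assumptions.

```python
import numpy as np

def r_ard(Phi, y, lam, n_outer=20, n_ista=200):
    """Centralized R-ARD sketch: dual update (15) + weighted LASSO (19) via ISTA."""
    M, N = Phi.shape
    gamma = np.ones(N)
    w = np.zeros(N)
    step = 1.0 / (2.0 * lam * np.linalg.norm(Phi, 2) ** 2)  # ISTA step size
    for _ in range(n_outer):
        Sigma = np.eye(M) / lam + Phi @ (gamma[:, None] * Phi.T)
        z = np.einsum('mn,mn->n', Phi, np.linalg.solve(Sigma, Phi))  # Eq. (15)
        for _ in range(n_ista):  # weighted LASSO (19) by soft-thresholding
            w = w - step * (2.0 * lam * Phi.T @ (Phi @ w - y))       # gradient step
            w = np.sign(w) * np.maximum(np.abs(w) - 2.0 * np.sqrt(z) * step, 0.0)
        gamma = np.abs(w) / np.sqrt(z)   # closed-form gamma_l = z_l^{-1/2} |w_l|
    return w, gamma
```

On an overdetermined noiseless problem with a sparse ground truth, this loop recovers the support and drives the γ of inactive weights toward zero.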

The Distributed Automated Relevance Determination Algorithm for SOF Data Splitting
The derivation of the distributed R-ARD (D-R-ARD) for SOF is shown in [14]. Here, we outline the main aspects of the distribution paradigm and the resulting algorithm. The main aspect of heterogeneous data splitting is that each agent has its own model. Therefore, the parameter weights w are distributed among K ∈ N agents as w = [w_1^T, ..., w_K^T]^T, and each agent has its part w_k ∈ R^{N_k}, where N = Σ_{k=1}^{K} N_k. Likewise, the matrix Φ is partitioned among the K agents as Φ = [Φ_1, ..., Φ_K], where Φ_k ∈ R^{M×N_k}. The SOF model is then formulated as

y = Σ_{k=1}^{K} Φ_k w_k + η.    (20)

Similarly, the hyperparameters γ are also partitioned as γ = [γ_1^T, ..., γ_K^T]^T. The solution to cooperative SOF inference then amounts to computing z from (15) and optimizing the upper bound (18) over a network of K agents.
Unfortunately, in the case of the SOF model, the dual variable z = [z_1^T, ..., z_K^T]^T in (15) cannot be computed exactly. Instead, it is upper-bounded [14] elementwise as z_k ≤ ẑ_k, where ẑ_k is computed locally by each agent (see [14] for the explicit expressions). This approximation preserves the upper bound in (18). Consequently, (19) can be reformulated to fit SOF as

ŵ = arg min_w { λ‖y − Σ_{k=1}^{K} Φ_k w_k‖²₂ + 2 Σ_{k=1}^{K} Σ_{l=1}^{N_k} ẑ_{k,l}^{1/2} |w_{k,l}| },

which can be solved distributively via the alternating direction method of multipliers (ADMM) algorithm [33] (Section 8.3). The D-R-ARD algorithm for SOF is summarized in Algorithm 1. When using ADMM to solve for w_k, the only communication between the agents takes place inside of the ADMM algorithm. The communication load of the ADMM algorithm for SOF is discussed in [33] (Chapter 8).

The Distributed Automated Relevance Determination Algorithm for SOE Data Splitting
For SOE, we will assume that the measurements y at locations X are partitioned into K disjoint subsets {y_k(X_k), X_k}_{k=1}^{K}, each associated with the corresponding agent in the network. Hence, each agent k makes M_k observations y_k(X_k) ∈ R^{M_k} at locations X_k ∈ R^{M_k×d}, with M = Σ_{k=1}^{K} M_k. This allows us to rewrite (7) in an equivalent form as

y_k = Φ_k w + η_k,  k = 1, ..., K,    (23)

where Φ_k ∈ R^{M_k×N} and the perturbations η_k ∼ N(0, λ_k^{-1}I_k), k = 1, ..., K, are assumed to be independent between agents. To cast R-ARD in a distributed setting, we need to be able to solve (19) and compute z in (15) over a network of agents. To this end, let us define

D = Φ^T Λ Φ = Σ_{k=1}^{K} λ_k Φ_k^T Φ_k = K ( (1/K) Σ_{k=1}^{K} λ_k Φ_k^T Φ_k ),    (24)

where Λ = diag[λ_1 I_1, ..., λ_K I_K], and I_k is an identity matrix of size M_k × M_k, k = 1, ..., K.
We point out that D, or rather the last factor in (24), can be computed over a network of agents using an averaged consensus algorithm [34,35]. Next, we apply the Woodbury identity to Σ^{-1} to obtain

Σ^{-1} = (Λ^{-1} + ΦΓΦ^T)^{-1} = Λ − ΛΦ(Γ^{-1} + D)^{-1}Φ^TΛ.    (25)

Inserting (25) and (24) into (15), we get

z = diag{Φ^T Σ^{-1} Φ} = diag{D − DΣ_wD},    (26)

where Σ_w = (D + Γ^{-1})^{-1}. Thus, once D becomes available, z can be found distributively using expression (26).
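As an illustration of this step, the sketch below forms D from per-agent contributions and evaluates z via (26). The averaged consensus protocol of [34,35] is emulated here by an exact mean, an assumption made for brevity; all names are our own.

```python
import numpy as np

def soe_dual_variable(Phi_list, lam_list, gamma):
    """Each agent contributes lam_k * Phi_k^T Phi_k; the network average stands
    in for the averaged-consensus step, then z follows from Eq. (26)."""
    K = len(Phi_list)
    D_local = [lam * Phi.T @ Phi for Phi, lam in zip(Phi_list, lam_list)]
    D = K * (sum(D_local) / K)                       # Eq. (24)
    Sigma_w = np.linalg.inv(D + np.diag(1.0 / gamma))
    z = np.diag(D - D @ Sigma_w @ D)                 # Eq. (26)
    return z, Sigma_w
```

The result can be checked against the direct (non-distributed) evaluation z = diag{Φ^T Σ^{-1} Φ}.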
To solve (19) distributively, we first note that for the model (23) the likelihood (8) can be equivalently rewritten as

p(y|w) = ∏_{k=1}^{K} N(y_k|Φ_k w, λ_k^{-1} I_k).    (27)

It is then straightforward to show that the upper bound (18) takes the form

L(w, γ, z) = Σ_{k=1}^{K} λ_k ‖y_k − Φ_k w‖²₂ + Σ_{l=1}^{N} ( w_l²/γ_l + z_l γ_l ) − h*(z).    (28)

Similarly to (18), for any w_l, l = 1, ..., N, the bound is minimized with respect to γ_l at γ_l = |w_l|/√z_l. Inserting the latter in (28), we obtain an objective function for estimating w:

ŵ = arg min_w { Σ_{k=1}^{K} λ_k ‖y_k − Φ_k w‖²₂ + 2 Σ_{l=1}^{N} z_l^{1/2} |w_l| }.    (29)

Expression (29) can be readily solved distributively using an ADMM algorithm (see, e.g., [33] (Chapter 8) and [36]). Once ŵ is found, the optimal hyperparameter values are computed as γ̂_l = |ŵ_l|/√z_l, l = 1, ..., N. In Algorithm 2, we now summarize the key steps of the resulting D-R-ARD algorithm for SOE. As we can see from Algorithm 2, D-R-ARD includes two optimization loops. The inner optimization loop is an ADMM algorithm, which is guaranteed to converge to a solution [33]. The convergence of the outer loop is essentially that of the R-ARD algorithm presented in [26].
Algorithm 2 D-R-ARD for SOE
1: z_n ← 1, ∀n = 1, ..., N
2: Compute D using averaged consensus over λ_k Φ_k^T Φ_k as in (24)
3: repeat
4: Estimate w by solving (29) distributively using ADMM [33,36]
5: Update γ_l = |w_l|/√z_l, l = 1, ..., N, and recompute z as in (26)
6: until convergence

In the D-R-ARD algorithm, two communication steps are required. The first communication step involves the computation of the matrix D, for which we leverage an averaged consensus algorithm. There, each of the A ∈ N consensus steps requires the transmission of N(N + 1)/2 floats due to the symmetry of D. Note that the number A of averaged consensus iterations can vary depending on the connectivity of the network.
The second communication step involves the iterative estimation of the model parameters. Assuming that the update loop of D-R-ARD requires I ∈ N iterations, the distributed estimation of the parameters w with R ∈ N ADMM iterations scales as O(I × ARN). Thus, the total communication load of the D-R-ARD algorithm behaves as AN(N + 1)/2 + O(I × ARN). Note that this estimate of the communication load assumes that the network structure remains unchanged.
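A small worked example of this load estimate, with illustrative (not measured) values for A, I, and R:

```python
# Illustrative communication-load calculation for D-R-ARD:
# total ~ A*N*(N+1)/2 (one-time consensus on D) + I*A*R*N (iterative updates).
# The values of A, I and R below are assumed examples, not values from the paper.
N = 560      # number of basis functions (as in the experiments)
A = 20       # averaged-consensus iterations
I = 15       # outer D-R-ARD iterations
R = 50       # ADMM iterations per outer loop

consensus_floats = A * N * (N + 1) // 2   # symmetric D: N(N+1)/2 floats per step
admm_floats = I * A * R * N               # iterative parameter updates
total_floats = consensus_floats + admm_floats
```

For these example values, the one-time consensus on D (about 3.1 M floats) is already comparable to the iterative part, which motivates the compressed (sparse) model.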

Distributed Entropy-Driven Exploration for Sparse Bayesian Learning
The learning algorithm described in the previous section estimates the parameters of the model w and γ given the measurements y and X. In the following, we focus on the question of how a new measurement is acquired in an optimal fashion. As we will show, the main criterion for this purpose is the information or, more specifically, the entropy change as a function of a possible sampling location.

D-Optimality
One possible strategy to optimally select a new measurement location x̃ is provided by the theory of optimal experiment design. Optimal experiment design aims at optimizing the variance of an estimator through a number of optimality criteria. One of these criteria is the so-called D-optimality: it measures the "size" of an estimator covariance matrix by computing the volume of the corresponding uncertainty ellipsoid. More specifically, the determinant (or rather the logarithm of the determinant) of the covariance matrix is computed. The latter can then be optimized with respect to the experiment parameters. In our case, the covariance matrix Σ_w of the model parameters w is readily given in (12) as the second central moment of p(w|y). Thus, the D-optimality criterion can be formulated as

X* = arg min_X log|Σ_w(X, Π)|,    (30)

where the dependency of Σ_w on the measurement locations X has been made explicit. Note that due to the normality of the posterior PDF p(w|y), the term log|Σ_w(X, Π)| is proportional to the entropy of w; thus, minimization of the criterion (30) implies a reduction of the entropy of the parameter estimates. Note that in contrast to [14], the covariance matrix is not approximated here, but computed exactly based on the resulting probabilistic inference model. Our intention is now to evaluate and optimize (30) as a function of a new possible sampling location x̃. Let us consider a modification of the model (7) as a function of the location x̃. The incorporation of x̃ into (7) would imply that the design matrix Φ is extended as

Φ̃ = [[Φ, φ(X, π̃)], [φ^T(x̃, Π), φ(x̃, π̃)]],    (31)

where π̃ is a new parameterization of a basis function φ based on the new location x̃, i.e., a new regression feature. Let us stress that in general, the potential measurement at x̃ does not have to lead to a new column in (31): the columns, i.e., basis functions in Φ, can be fixed from the initial design of the problem. In the latter case, Φ would be extended only by the row vector φ^T(x̃, Π) = [φ_1(x̃, π_1), ..., φ_N(x̃, π_N)].
However, a basis function with a currently zero parameter weight estimate might be useful for explaining the new measurement value at x̃ and, thus, might be activated. Our next step is to consider
1. how the D-optimality criterion can be evaluated efficiently for the "grown" design matrix Φ̃ in (31), and
2. how the criterion can be evaluated in a distributed fashion.

Measurement Only-Update of the D-Optimality Criterion
We will begin by considering the update of the D-optimality criterion with respect to a new measurement location x̃, assuming that only the number of rows in Φ grows, while the number of features stays constant. In this case, (31) can be represented as

Φ̃ = [Φ^T, φ(x̃, Π)]^T.    (32)

Based on (32), the new covariance matrix Σ̃_w that accounts for the new measurement location x̃ can be computed as

Σ̃_w = (Φ̃^T Λ̃ Φ̃ + Γ̂^{-1})^{-1},    (33)

where Λ̃ = diag{Λ, λ̃} ∈ R^{(M+1)×(M+1)} and λ̃ is the assumed noise precision at the potential measurement location; we assume the measurement noise to be independent white Gaussian across measurements. By combining the terms that depend on x̃, we can represent (33) as

Σ̃_w = ( Σ_w^{-1} + λ̃ φ(x̃, Π) φ^T(x̃, Π) )^{-1}.    (34)

As we see from (34), the addition of a new measurement row causes a rank-1 perturbation of the information matrix Σ_w^{-1}. Using the matrix determinant lemma [37], we can thus compute

|Σ̃_w| = |Σ_w| ( 1 + λ̃ φ^T(x̃, Π) Σ_w φ(x̃, Π) )^{-1},    (35)

and consequently

log|Σ̃_w| = log|Σ_w| − log( 1 + λ̃ φ^T(x̃, Π) Σ_w φ(x̃, Π) ).    (36)

Note that Σ_w is independent of x̃, and thus, only the second term on the right-hand side of (36) is relevant for the optimization.
Finally, the D-optimality criterion with respect to a location x̃ can be formulated as

x̃* = arg max_x̃ log( 1 + λ̃ φ^T(x̃, Π) Σ_w φ(x̃, Π) ),    (37)

where we have exchanged the minimization with a maximization by changing the sign of the cost function.
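The rank-1 update (34)–(36) is easy to verify numerically. The following sketch checks that the matrix determinant lemma reproduces the log-determinant of the updated covariance; all values are synthetic and the names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 12, 6
Phi = rng.normal(size=(M, N))
gamma = rng.uniform(0.5, 2.0, N)
lam, lam_tilde = 4.0, 4.0
# posterior covariance (12) before the new measurement
Sigma_w = np.linalg.inv(lam * Phi.T @ Phi + np.diag(1.0 / gamma))

phi_new = rng.normal(size=N)              # phi(x_tilde, Pi) at the candidate location
# direct evaluation of (34): rank-1 perturbation of the information matrix
Sigma_w_new = np.linalg.inv(np.linalg.inv(Sigma_w)
                            + lam_tilde * np.outer(phi_new, phi_new))

lhs = np.linalg.slogdet(Sigma_w_new)[1]   # log|Sigma_w_new|, brute force
rhs = (np.linalg.slogdet(Sigma_w)[1]      # Eq. (36): determinant-lemma shortcut
       - np.log1p(lam_tilde * phi_new @ Sigma_w @ phi_new))
```

The shortcut on the right-hand side is what makes scanning many candidate locations x̃ cheap: only the quadratic form φ^T Σ_w φ has to be recomputed per candidate.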

Computation of the D-Optimality Criterion with Addition of a New Feature
The computation of the D-optimality criterion becomes more involved when a measurement at a location x̃ is associated with a new feature π̃. This can happen if, e.g., π̃ is the center or location of a new basis function.
Then, based on (31), the new covariance matrix Σ̃_w that accounts for x̃ and π̃ is formulated as

Σ̃_w = ( Φ̃^T Λ̃ Φ̃ + Γ̃^{-1} )^{-1},  Γ̃ = diag{Γ̂, γ̃},    (38)

where γ̃ is the sparsity parameter associated with the new column [φ^T(X, π̃), φ(x̃, π̃)]^T. By combining the terms that depend on x̃, we can represent (38) as

Σ̃_w = ( [[Σ_w^{-1}, Φ^TΛφ(X, π̃)], [φ^T(X, π̃)ΛΦ, γ̃^{-1} + φ^T(X, π̃)Λφ(X, π̃)]] + λ̃ φ̃(x̃, π̃) φ̃^T(x̃, π̃) )^{-1}.    (39)

To simplify the notation, let us define

c(π̃) = Φ^TΛφ(X, π̃),  b(π̃) = γ̃^{-1} + φ^T(X, π̃)Λφ(X, π̃),  φ̃(x̃, π̃) = [φ^T(x̃, Π), φ(x̃, π̃)]^T,    (40)

which can be inserted into (39), leading to

Σ̃_w = ( [[Σ_w^{-1}, c(π̃)], [c^T(π̃), b(π̃)]] + λ̃ φ̃(x̃, π̃) φ̃^T(x̃, π̃) )^{-1}.    (41)

The first term in (41) describes how much the new feature column contributes to the covariance matrix, while the second term represents the contribution of a measurement at location x̃. Let us now insert (41) into the D-optimality criterion in (30). By applying the matrix determinant lemma [37] to the resulting expression, we compute

log|Σ̃_w| = −log|[[Σ_w^{-1}, c(π̃)], [c^T(π̃), b(π̃)]]| − log( 1 + λ̃ φ̃^T(x̃, π̃) [[Σ_w^{-1}, c(π̃)], [c^T(π̃), b(π̃)]]^{-1} φ̃(x̃, π̃) ).    (42)

Now, consider separately the contributions of the two terms on the right-hand side of (42) to the D-optimality criterion. For the first term, we can use the Schur complement [38] q(π̃) = b(π̃) − c^T(π̃) Σ_w c(π̃), such that the first logarithmic term can be reformulated as

−log|[[Σ_w^{-1}, c(π̃)], [c^T(π̃), b(π̃)]]| = log|Σ_w| − log q(π̃).    (43)

Note that Σ_w is independent of x̃ and of π̃, which is a fact that will become useful later.
To simplify the second term on the right-hand side of (42), we first apply the inversion rules for structured (block) matrices [39], which allow us to write

[[Σ_w^{-1}, c(π̃)], [c^T(π̃), b(π̃)]]^{-1} = [[Σ_w + q^{-1}(π̃) Σ_w c(π̃) c^T(π̃) Σ_w, −q^{-1}(π̃) Σ_w c(π̃)], [−q^{-1}(π̃) c^T(π̃) Σ_w, q^{-1}(π̃)]],    (44)

and thus

φ̃^T(x̃, π̃) [[Σ_w^{-1}, c(π̃)], [c^T(π̃), b(π̃)]]^{-1} φ̃(x̃, π̃) = φ^T(x̃, Π) Σ_w φ(x̃, Π) + q^{-1}(π̃) ( c^T(π̃) Σ_w φ(x̃, Π) − φ(x̃, π̃) )².    (45)

Finally, after inserting (43) and (45) into (42), the D-optimality criterion with respect to a location x̃ and feature π̃ can be formulated as

{x̃*, π̃*} = arg max_{x̃, π̃} { log q(π̃) + log( 1 + λ̃ [ φ^T(x̃, Π) Σ_w φ(x̃, Π) + q^{-1}(π̃) ( c^T(π̃) Σ_w φ(x̃, Π) − φ(x̃, π̃) )² ] ) },    (46)

where we have exchanged the minimization with a maximization by changing the sign of the cost function, and we dropped log|Σ_w| as it is independent of x̃ and π̃.
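The Schur-complement step behind (43) can likewise be checked numerically: for a block-extended information matrix, the determinant factors as |J| · q with q = b − c^T J^{-1} c. The sketch below uses synthetic values and our own variable names, mirroring the quantities c, b, and q from (40).

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
A = rng.normal(size=(N, N))
J = A @ A.T + np.eye(N)                 # information matrix Sigma_w^{-1} (PD)
c = rng.normal(size=N)                  # plays the role of c(pi_tilde)
b = 3.0 + c @ np.linalg.solve(J, c)     # plays the role of b(pi_tilde), so q = 3

J_ext = np.block([[J, c[:, None]],      # block-extended information matrix
                  [c[None, :], np.array([[b]])]])
q = b - c @ np.linalg.solve(J, c)       # Schur complement

lhs = np.linalg.slogdet(J_ext)[1]       # log|J_ext|, brute force
rhs = np.linalg.slogdet(J)[1] + np.log(q)  # log|J| + log q, as used in (43)
```

This factorization is what allows the new-feature contribution to enter the criterion only through the scalar log q(π̃).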

Distributed Computation of the D-Optimality Criterion for SOE
Let us begin with evaluating the D-optimality criterion for the SOE case. Evaluating (37) for this data splitting is easier than for SOF.
Since Π is known to each agent, the vector φ(x̃, Π) can be evaluated without any cooperation between the agents. The covariance Σ_w can then be evaluated distributively as Σ_w = (D + Γ̂^{-1})^{-1}, where D is computed using network-wide averaged consensus. To compute (46), a few more steps are needed. Specifically, in addition to Σ_w, we also need to compute the quantities c(π̃) and b(π̃) in (40) to evaluate the criterion. These can likewise be computed using averaged consensus as

c(π̃) = Φ^T Λ φ(X, π̃) = Σ_{k=1}^{K} λ_k Φ_k^T φ(X_k, π̃),    (47)

b(π̃) = γ̃^{-1} + φ^T(X, π̃) Λ φ(X, π̃) = γ̃^{-1} + Σ_{k=1}^{K} λ_k φ^T(X_k, π̃) φ(X_k, π̃).    (48)

Then, using (47) and (48) as well as the distributively computed Σ_w, the criterion (46) can be easily evaluated by each agent.
It is worth noting that γ̃^{-1} in (48) is the only parameter of this exploration criterion that can be set manually. Essentially, it encodes how much we already know about the potential new feature. If γ̃^{-1} is large, the criterion deems the potential measurement location uninformative. Conversely, as γ̃^{-1} → 0, the criterion deems the considered measurement location potentially informative. We set γ̃^{-1} = 0 for all considered measurement locations, such that only the current information in the model determines how informative a measurement location could be.

Distributed Computation of the D-Optimality Criterion for SOF
For SOF, (37) is unsuited for a distributed computation, so some changes have to be made. First, we define auxiliary terms that facilitate the distributed formulation (Equations (49)–(51)), where Π_k = [π_1, ..., π_{N_k}]^T ∈ R^{N_k×s} and Γ̂_k = [γ̂_1, ..., γ̂_{N_k}]^T. All terms in (49)–(51) can then be computed by means of an averaged consensus [40,41]. Next, we reformulate Σ_w with the help of the matrix inversion lemma, which yields (52), so that (37) can be evaluated in a distributed setting for SOF via the function f(x̃, λ̃) given in (53). For the case when the criterion (46) is used for the evaluation of the D-optimality, the variable q(π̃) in (46) and the second additive term there have to be reformulated in a form suitable for SOF data splitting. For the former, we utilize the definitions in (49)–(51) together with (52), such that

q(π̃) = γ̃^{-1} + φ^T(X, π̃) Λ φ(X, π̃) − φ^T(X, π̃) Λ Φ Σ_w Φ^T Λ φ(X, π̃).    (54)

The other term in (46) is then reformulated similarly using the results (49)–(52). As a result, the exploration criterion can be reformulated for SOF as a maximization analogous to (46), with q(π̃) defined in (54) and f(x̃, λ̃) given in (53).

Experimental Setup
This section describes the definition of the experimental setup, the calibration of the sensors, and the collection of ground-truth data for performance evaluation.

Map Construction
The following describes our experimental setup. We conducted the experiments indoors in our laboratory, with two paper boxes as obstacles, as displayed in Figure 1a. Red lines in the figure represent the borders of the experimental area. We use two Commonplace Robotics (https://cpr-robots.com, accessed on 19 March 2022) ground-based robots with mecanum wheels; further in the text, we will refer to the robots as sliders due to their ability to move holonomically. To position the sliders within the environment, the laboratory is equipped with 16 VICON (https://www.vicon.com/, accessed on 19 March 2022) Bonita cameras. For the experiment itself, we assume that the map is a priori known to the system. Thus, we need to record the map before the experiment. To this end, a single slider is equipped with a light detection and ranging (LIDAR) sensor. We use a Velodyne (https://velodynelidar.com/, accessed on 19 March 2022) VLP-16 LIDAR and the corresponding robot operating system (ROS) package, which can be downloaded from the ROS repository. We construct the map while sending waypoints to the slider manually. The steering of the slider is done with the help of the ROS navigation stack [42] together with the Teb Local Planner [43]. The sensor output of the LIDAR and the slider position estimated by the VICON system are then used to generate a map with the Octomap [44] ROS package. Because we use the VICON position of the slider, which is accurate, this mapping procedure is simpler than simultaneous localization and mapping (SLAM) algorithms [45,46]. Figure 1b shows the constructed map, which is afterwards used in the experiment.

Sensor Calibration
Each slider is equipped with an XSens MTw inertial measurement unit (IMU). The sensor comprises a three-axis magneto-resistive magnetometer, an accelerometer, gyroscopes, and a barometer. For the following experiment, we only use the magnetometer. The sensor is attached to a wooden stick to reduce the influence of the metal wheels on the measurement. Although the sliders are equipped with sensors from the same product line of the same manufacturer, their absolute perception differs. Additionally, the sensors can still perceive the metal in the wheels of the robots. Therefore, we need to calibrate the sensors relative to each other, so that they perceive the environment equally, using the approach proposed in [47].
The authors in [47] assume that the readings of one sensor can be expressed as those of another sensor through an affine transformation. To estimate the rotation and translation, multiple readings of all sensors have to be acquired. These readings are then exploited to estimate the rotation and translation relative to one specific sensor by means of a least squares method. In this experiment, each magnetic field sensor takes, at a position x_m, one measurement of the magnetic field per Euclidean axis. During the estimation, the absolute values of these measurements are used. Figure 2a shows the absolute values of the sensor readings of two sensors for multiple measurement locations. The errors of the sensor readings before and after calibration are presented in Figure 2b. The correction thus reduces the bias and the standard deviation of the error between both sensors. However, this calibration is only useful if the orientation of both sensors is constant during the experiment. As the sensors always measure in the same orientation, this assumption is fulfilled for our experiments. For further information on the intrinsic calibration of inertial and magnetic sensors, the reader is referred to [48].

Collecting Ground Truth Data
In order to evaluate the performance of the distributed exploration, we also need to know the actual magnetic field in the laboratory, i.e., ground truth data. To collect the ground truth data, one slider measures the area of the Holodeck in a systematic fashion, where the distance between measurements was set to 5 cm, such that in total 8699 measurement points were collected. At each measurement position, multiple sensor readings are taken and averaged. The resulting ground truth is displayed in Figure 3.

Experimental System Design
Our setup relies on ROS (https://www.ros.org/, accessed on 19 March 2022), which manages the communication between all software modules, called nodes. On each slider, several ROS nodes are running, such as the motor controller, which translates the measurement locations into velocity commands for each wheel, the path planner, and the sensor.
As a path planner, we use the popular A* algorithm [49,50]. We implemented A* both as a global and as a local planner; the latter is utilized for collision avoidance. Therefore, each slider considers not only the global map but also a local map around its current position.
After receiving a new waypoint, the global path planner computes a path in the global map from the current position to the goal while avoiding the obstacles. If no other robotic system is in its path, the goal is reached directly. However, if another slider enters the local frame while the robot is on its way toward the goal, the robot stops, and the path within the local frame is re-planned to avoid collisions. If the planner is not able to find a solution in the local frame within a given time, the global path planning is re-initiated, taking the other slider into account as an obstacle.
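The global planning step described above can be illustrated with a minimal A* sketch on a 4-connected occupancy grid. The grid encoding, unit step costs, and Manhattan heuristic are simplifying assumptions for illustration, not the planner configuration used in the experiments.

```python
import heapq

def astar(grid, start, goal):
    """grid: list of strings, '#' = obstacle; returns a path as a list of (row, col)."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    open_set = [(h(start), 0, start, None)]                  # (f, g, node, parent)
    came_from, g_best = {}, {start: 0}
    while open_set:
        _, g, cur, parent = heapq.heappop(open_set)
        if cur in came_from:          # already expanded with a better cost
            continue
        came_from[cur] = parent
        if cur == goal:               # reconstruct the path back to start
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):    # 4-connected moves
            nxt = (cur[0] + dr, cur[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] != '#':
                if nxt not in g_best or g + 1 < g_best[nxt]:
                    g_best[nxt] = g + 1
                    heapq.heappush(open_set, (g + 1 + h(nxt), g + 1, nxt, cur))
    return None                       # goal unreachable
```

With an admissible heuristic such as the Manhattan distance on unit-cost grids, the first expansion of the goal yields a shortest path; the local re-planning described above simply reruns the same search on a small window with the other slider marked as an obstacle.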
The whole system design for this experiment is shown in Figure 4. The distributed exploration criterion uses the computed map, excluding the locations of the obstacles. In addition, the map information is used by the path planner to find an obstacle-free path to the estimated measurement location x̃. Figure 4 also describes the process flow of the whole system.
For comparison, we will use the non-Bayesian SOF and SOE formulations as discussed in [14]. As these formulations use the ADMM algorithm [33] for estimation, we will refer to them as ADMM for SOF and ADMM for SOE, respectively. The Bayesian learning algorithms discussed in this paper will be referred to as D-R-ARD for SOF and D-R-ARD for SOE (see also Table 1). In the experiments, we set the number of basis functions to N = 560, which also determines the size of the vector w. The basis functions are distributed on a regular grid. We consider Gaussian basis functions with the width set to σ_n = 0.25, such that

φ_n(x, π_n) = exp( −‖x − π_n‖²₂ / (2σ_n²) ),

where π_n ∈ R^s and s = d.
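For illustration, the Gaussian basis described above can be evaluated over a regular grid of centers as follows; the 28 × 20 grid layout and the area dimensions are assumptions chosen only so that N = 560 matches the experiments.

```python
import numpy as np

sigma = 0.25
# assumed 28 x 20 regular grid of centers over an assumed 7 m x 5 m area
gx, gy = np.meshgrid(np.linspace(0, 7, 28), np.linspace(0, 5, 20))
centers = np.column_stack([gx.ravel(), gy.ravel()])      # pi_n, N = 560

def phi(x, centers=centers, sigma=sigma):
    """Evaluate all Gaussian basis functions exp(-||x - pi_n||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((centers - x) ** 2, axis=1) / (2 * sigma ** 2))

feat = phi(np.array([3.5, 2.5]))                         # feature vector at one location
```

A call such as `phi(x_m)` yields one row of the design matrix Φ used throughout the regression.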
After the initialization of the system, every agent takes a first measurement and incorporates it into its local measurement model to compute a first estimate of the regression. Then, each algorithm requires that the intermediate estimated parameter weights are distributed to the neighbors (following Figure 4) to perform an average consensus [40,41]. Consequently, each agent can proceed with the regression using the averaged intermediate parameter weights. When the distributed regression has converged, the agents use the estimated covariance matrix in the distributed exploration step. In this step, the agents propose candidate positions to their neighbors and receive the information required to compute the D-optimality criterion locally. When the best next measurement locations have been chosen, they are passed to the coordination part [51], which verifies that all agents go to different positions. If the measurement location is considered valid, an agent locally plans its path in the global frame to reach the goal. While approaching the goal, each agent checks whether other agents have entered its local frame, to avoid collisions. When all agents have reached their goals, they take measurements, and the process flow continues.
Figure 4. System design with additional path planner and map constraints. Each gray box represents an interaction with other agents; for some boxes, the lower right corner indicates to which part of the system the process belongs. This software setup is representative of the SOE distribution paradigm.
As evaluation metric, we chose the normalized mean square error (NMSE), defined as NMSE = ‖y_true(X_T) − ŷ(X_T)‖² / ‖y_true(X_T)‖², where y_true(X_T) ∈ R^T is the ground truth measured at T ∈ N positions X_T ∈ R^{T×d} and ŷ(X_T) is the corresponding model estimate. Here, we set T = 560, and these locations are equal to the center positions of the Gaussian basis functions. Figure 5 shows the NMSE of all conducted experiments with respect to time (top plot) and to the number of measurements (bottom plot). Each experimental run has a different duration, and the ROS system uses asynchronous interprocess communication, resulting in asynchronous time steps. Hence, all runs of one particular algorithm are visualized as a scatter plot. The number of measurements varies because the computation time per measurement differs. As a consequence, averaging multiple experimental runs along the time axis is not reasonable. For both ADMM algorithms, we conducted four experiments, whereas for each D-R-ARD algorithm we conducted two experiments. The corresponding results are summarized in Figure 5. Looking at the top plot in Figure 5, the D-R-ARD for SOE has the best performance, as its NMSE is reduced faster than that of the other methods.
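The NMSE metric is straightforward to compute; a minimal sketch (the function name is ours):

```python
import numpy as np

def nmse(y_true, y_est):
    """Normalized mean square error: squared error energy of the estimate
    relative to the energy of the ground truth."""
    return np.sum((y_true - y_est) ** 2) / np.sum(y_true ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_est  = np.array([1.0, 2.0, 2.0])
err = nmse(y_true, y_est)   # -> 1/14, since ||e||^2 = 1 and ||y_true||^2 = 14
```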

Regarding the ADMM algorithms, the SOE paradigm has a brief advantage until about 1200 s, after which the SOF paradigm outperforms it. The weak performance of D-R-ARD for SOF might result from the distributed structure of the algorithm, which requires a matrix inversion in each iteration, together with the computationally complex estimation of the parameter weights and variances. In contrast, the corresponding algorithm with the SOE distribution paradigm is able to cache the matrix inversion, which drastically improves performance. Still, the D-R-ARD algorithms generally have a higher computational complexity than the ADMM algorithms, because the Bayesian methods require the covariance to be computed in each iteration, whereas the ADMM algorithm does not.
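The caching advantage of the SOE paradigm can be made concrete with the Sherman-Morrison identity: when a new measurement contributes a rank-one term φφᵀ/σ² to the precision matrix, the cached covariance can be updated directly, avoiding a full O(N³) re-inversion. A minimal sketch, not the paper's implementation (function names are ours):

```python
import numpy as np

def rank_one_covariance_update(Sigma, phi, noise_var=1.0):
    """Update Sigma = P^-1 after adding one measurement with basis vector phi,
    i.e. compute (P + phi phi^T / s2)^-1 via the Sherman-Morrison identity:
    Sigma - (Sigma phi)(Sigma phi)^T / (s2 + phi^T Sigma phi)."""
    Sphi = Sigma @ phi
    return Sigma - np.outer(Sphi, Sphi) / (noise_var + phi @ Sphi)

# Check against a direct inversion on a small example.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
P = A @ A.T + 5.0 * np.eye(5)          # a well-conditioned precision matrix
Sigma = np.linalg.inv(P)
phi = rng.standard_normal(5)
fast = rank_one_covariance_update(Sigma, phi)      # O(N^2) update
slow = np.linalg.inv(P + np.outer(phi, phi))       # O(N^3) re-inversion
```

The update costs O(N²) per measurement instead of O(N³), which is the source of the drastic speed-up noted above.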
The plot at the bottom of Figure 5 displays the NMSE with respect to the number of obtained measurements. There, D-R-ARD for SOF and ADMM for SOF have almost the same performance. However, ADMM for SOF is able to collect substantially more measurements because it is computationally less complex. Consequently, ADMM for SOF not only collects more measurements but is also on a par with D-R-ARD for SOF in terms of efficiency per measurement.
For the SOE distribution paradigm, on the contrary, it is beneficial to use the Bayesian methodology. In the experiments presented here, D-R-ARD for SOE achieves a lower NMSE with fewer measurements than the ADMM for SOE algorithm. This could be due to the fact that D-R-ARD for SOE computes the entropy of the parameter weights exactly rather than approximating it. The exactly computed entropy then appears to serve the D-optimality criterion better than the approximated version used by ADMM for SOE.
To support the claim that the Bayesian framework estimates a better covariance of the parameter weights when the SOE paradigm is applied, Figure 6a,b present the estimated magnetic field and the estimated covariance at different time steps. In both figures, the leftmost plots display the beginning of the experiment and the rightmost plots show the end result. At the beginning of the experiments, both algorithms (ADMM and D-R-ARD for SOE) estimate a sparse covariance with little difference between them. As the number of measurements increases, the approximated covariance becomes smoother, while the covariance estimated in the Bayesian framework stays sparse. This effect might result from the approximation of the covariance introduced in [14], where a penalty parameter must be chosen as a compromise between sparsity and a reasonably well approximated covariance. As a second remark, the ADMM algorithms involve a thresholding operator, which sets all unused basis functions to zero such that these basis functions cannot be considered by the exploration step. This is controlled by a manually set penalty parameter and might be sub-optimal. D-R-ARD for SOE, on the other hand, estimates a hyper-parameter for each basis function based on the current data. Therefore, the influence of each basis function is addressed more individually, which leads to a better covariance estimate.
Due to the way basis functions and parameter weights are introduced in the SOF paradigm, this effect is ultimately not observable between the Bayesian and the frequentist framework.

Conclusions
The presented paper proposes and validates a method for spatial regression using Sparse Bayesian Learning (SBL) and exploration, both of which are implemented over a network of interconnected mobile agents. The spatial process of interest is described as a linear combination of parameterized basis functions; by constraining the weights of these functions using a sparsifying prior, we find a model in which only a few relevant functions contribute to the final representation. The learning is implemented in a distributed fashion, such that no centralized processing unit is necessary. We also considered two conceptually different distribution paradigms: splitting-over-features (SOF) and splitting-over-examples (SOE). To this end, a numerical algorithm based on the alternating direction method of multipliers (ADMM) is used.
The learned representation is used to devise an information-driven optimal data collection approach. Specifically, the perturbation of the parameter covariance matrix with respect to a new measurement location is derived. This perturbation allows us to choose new measurement locations for agents such that the size of the resulting joint parameter uncertainty, as measured by the log-determinant of the covariance, is minimized. We also show how this criterion can be evaluated in a distributed fashion for both distribution paradigms in an SBL framework.
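The log-determinant selection described above can be sketched via the matrix determinant lemma: adding a measurement with basis vector φ(x) and noise variance σ² changes the log-determinant of the covariance by −log(1 + φ(x)ᵀΣφ(x)/σ²), so minimizing the resulting uncertainty amounts to maximizing φ(x)ᵀΣφ(x) over the candidate locations. The following is a hypothetical illustration of this principle, not the paper's equation (46); function names are ours.

```python
import numpy as np

def d_optimality_gain(Sigma, phi, noise_var=1.0):
    """Reduction in log det(covariance) from one new measurement with basis
    vector phi (matrix determinant lemma):
    log det(Sigma) - log det(Sigma_new) = log(1 + phi^T Sigma phi / s2)."""
    return np.log1p(phi @ Sigma @ phi / noise_var)

def best_candidate(Sigma, Phi_candidates, noise_var=1.0):
    """Pick the candidate whose measurement shrinks the joint parameter
    uncertainty (log-determinant of the covariance) the most."""
    gains = [d_optimality_gain(Sigma, phi, noise_var) for phi in Phi_candidates]
    return int(np.argmax(gains))

# Two candidates: the one probing the high-variance parameter direction wins.
Sigma = np.diag([2.0, 0.1])
idx = best_candidate(Sigma, np.eye(2))   # selects candidate 0
```

Evaluating the scalar φᵀΣφ per candidate is what makes the criterion cheap enough to optimize cooperatively at every exploration step.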
The resulting scheme thus includes two key steps: (i) the cooperative fitting of sparse models to the data collected by the agents, and (ii) the cooperative identification of new measurement locations that optimize the D-optimality criterion. To validate the performance of the scheme, we set up an experiment involving two mobile robots that navigated in an environment with obstacles. The robots were tasked with reconstructing the magnetic field measured on the floor of the laboratory by a magnetometer sensor. We tested the proposed scheme against a non-Bayesian sparse regression method with a similar exploration criterion.
The experimental results show that the Bayesian methods explore more efficiently than the benchmark algorithms, where efficiency is measured as the reduction of error over the number of measurements or over time. The reason is that the Bayesian method directly computes the covariance matrix from the data and has fewer parameters that must be manually adjusted. Exploration with these methods is therefore simpler to set up than with the non-Bayesian inference approaches studied before. Yet, for the SOF distribution paradigm, the Bayesian method is computationally so demanding that significantly fewer measurements can be collected in the same amount of time compared with the non-Bayesian learning method.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

ADMM	alternating direction method of multipliers
ROS	robot operating system
SBL	Sparse Bayesian Learning
PDF	probability density function
SOF	splitting-over-features
SOE	splitting-over-examples
NMSE	normalized mean square error
D-R-ARD	distributed R-ARD
R-ARD	reformulated automatic relevance determination
IMU	inertial measurement unit
LIDAR	light detection and ranging
SLAM	simultaneous localization and mapping
UAV	unmanned aerial vehicle
LASSO	least absolute shrinkage and selection operator
MAP	maximum a posteriori