Learning Environmental Field Exploration with Computationally Constrained Underwater Robots: Gaussian Processes Meet Stochastic Optimal Control

Autonomous exploration of environmental fields is one of the most promising tasks to be performed by fleets of mobile underwater robots. The goal is to maximize the information gain during the exploration process by integrating an information-metric into the path-planning and control step. Therefore, the system maintains an internal belief representation of the environmental field which incorporates previously collected measurements from the real field. In contrast to surface robots, mobile underwater systems are forced to run all computations on-board due to the limited communication bandwidth in underwater domains. Thus, reducing the computational cost of field exploration algorithms constitutes a key challenge for in-field implementations on micro underwater robot teams. In this work, we present a computationally efficient exploration algorithm which utilizes field belief models based on Gaussian Processes, such as Gaussian Markov random fields or Kalman regression, to enable field estimation with constant computational cost over time. We extend the belief models by the use of weighted shape functions to directly incorporate spatially continuous field observations. The developed belief models function as information-theoretic value functions to enable path planning through stochastic optimal control with path integrals. We demonstrate the efficiency of our exploration algorithm in a series of simulations including the case of a stationary spatio-temporal field.


Introduction
Autonomous underwater field exploration has been a fast-growing research area in the last decade. With continuous advances in small-scale computing technology, smart micro-robots are expected to play a prominent role in increasingly autonomous and interconnected exploration and monitoring systems [1]; examples of micro underwater robots include the Avexis [2] and the HippoCampus [3] micro autonomous underwater vehicle, see Figure 1. Autonomous multi-agent swarms can be deployed for maritime exploration tasks such as the monitoring of algae growth, oil spill, or underwater currents.
Hereby, the development of maritime exploration and monitoring systems profits from many synergies with research on similar onshore tasks such as the monitoring of urban environmental fields. As a result, micro-robots can monitor environmental processes such as particulate matter, acoustic pollution, or electromagnetic fields. Particularly in hazardous environments, autonomous micro-robots inherit great potential to drastically increase safety and reduce costs compared to concepts involving the deployment of stationary sensors or humans taking measurements. Thereby, depending on the dynamics of the specific robotic platform sophisticated control algorithms such as [4,5] are required in order to allow a safe and robust deployment in the underwater domain. However, missions with mobile underwater robots inherit the challenge of limited communication bandwidth and range. This naturally enforces a high level of autonomy as all key components of the exploration algorithm have to run onboard. Systems for maritime exploration tasks consist of multiple building blocks as depicted in Figure 2. The state of the environment is represented in a so-called belief model which is updated through measurements of the environment. The observations are collected using mobile robots. Thus, a path-planning algorithm is required in order to maximize the expected information gain through future observations along the robots' paths. Since the robots cannot simply take measurements at arbitrary locations, a trade-off arises between the potential information gained and the effort to drive the robots to regions where information-rich observations can be collected.

Related Work
Methods for modeling a robot's environmental belief can be distinguished in physics-based models and non-physics based models. Physics-based models allow the extrapolation of the model to the vicinity of the known belief. Nonetheless, they require solving a partial differential equation (e.g., Navier-Stokes equation), which is computationally demanding. Moreover, such systems require information regarding the boundary conditions which is often not directly available to the system [6]. In recent years, probabilistic belief models have been developed as a promising alternative for efficient field estimation. These data-driven models may be more suitable for computationally constrained robot swarms since they avoid computational-costly solving the partial differential equation and additionally allow to model the uncertainty of the estimated physical process directly. Moreover, data-driven models inherit the potential to infer the process characteristics during operation.
A prominent inference method for learning environmental field beliefs is Gaussian process (GP) regression. Gaussian process regression, or in geo-statistics terminology 'Kriging', originated as a method for statistical inference of ore concentration fields [7]. Modeling environmental fields with GPs is attractive as their mean and covariance functions allow a spatial continuous field representation while additionally providing statistical uncertainty measures. Example applications include the estimation of a temperature field in lakes [8] and the reconstruction of spatio-temporal fields with mobile sensor nodes [9]. Moreover, Xu et al. [10] present a broad variety of different GP models for spatial field estimation. They evaluate their method in a series of experiments to analyze the performance. Recently, Cui et al. proposed the combination of mutual information and rapidly-exploring random trees for underwater path planning in a scalar field sampling scenario [19]. In this context, many publications consider the scenario of exploring large-scale environments such as oceans. However, when reducing the scale to confined environments such as industrial tanks, the robot dynamics cannot be neglected anymore. This problem can be tackled using receding horizon schemes that optimize the full planning horizon while not providing guarantees beyond the horizon. However, such a planning algorithm require a continuous representation of the field belief. Kreuzer and Solowjow use linear functions to interpolate the field belief representation between the grid-points while Xu et al. [20] propose sinusoidal weighting functions which are an attractive nonlinear alternative for smooth field interpolation. The computational costs of these belief algorithms increase with the number of collected observations, rendering their application on real robots impractical. With a continuous belief representation, an exploratory path can be computed using the policy improvement with path integrals (PI 2 ) algorithm proposed by [21]. The PI 2 algorithm origins from the work on solving a nonlinear stochastic optimal control via the use of path integrals [22]. In reinforcement learning terminology, our resulting exploration method could be seen as performing policy iteration. In which, first, the exploration policy is evaluated at each new location through an update of the field belief model. Subsequently, the belief is used as an information-theoretic value function for policy improvement as proposed by [23]. This approach enables rapid evaluation of the belief space for finding an optimal informative path.

Contributions
The contribution of this work is two-fold. First, we combine and extend approaches from previous works [13,23] on field exploration with GMRFs to meet the requirements of an application on a micro-robot platform and on the PI 2 stochastic path planning algorithm. These requirements are • a constant computational complexity over time, • a continuous spatial belief representation which allows efficient path planning.
Therefore, we first extend the PI-GMRF approach proposed in [23] with the concept of sequential Bayesian spatial prediction in [13] to guarantee a constant computational load while maintaining the ability to incorporate off-grid measurements. Second, we extend the recently presented spacetime Kalman filter (STKF) [15] by incorporating the concept of weighted shape functions to render the belief model compatible with PI 2 . The combination of these two different belief models with PI 2 results in two novel algorithms for stochastic field exploration-namely, an improved PI-GMRF and the new PI-STKF.
We conduct two numerical experiments to compare these models regarding their ability to efficiently explore environmental fields and computational complexity. We assume that sufficient prior knowledge on the physical process is available and thus no hyperparameter estimation of the belief model is required. For the sake of brevity, we present the field estimation schemes for a single robot. Albeit, we derive the algorithms in a form that allows a direct extension to multiple agents sharing a centralized belief model.

Paper Structure
The paper is structured as follows. In Section 2, we briefly outline the problem of autonomous field exploration. In Section 3, the belief models, namely the fully Bayesian GMRF and STKF, are introduced and extended to incorporate off-grid measurements. In Section 4, we elaborate on the usage of PI 2 for path planning using information-theoretic value functions. In Section 5, the two belief models are compared with respect to their computational time and estimation results. In Section 6, we analyze the exploration performance of the final algorithm on an unknown spatio-temporal field. Finally, Section 7 concludes this work and highlights potential future research directions.

Problem Statement
We consider an autonomous underwater vehicle which explores an environmental field through point-wise observations in a confined environment. The goal is to minimize the error between the estimated and the true field within a minimal amount of time. Regarding the exploration algorithm, this task can be restated as maximizing the ratio of collected information per time step. To make this task tractable for mobile robot teams, the computational cost of the exploration algorithm has to be bounded over time.

Robot Model and Problem Formulation
A dynamical robot model allows sampling feasible exploratory trajectories which can be evaluated regarding their potential information gain on the environmental field belief. The motion of each robot is described byẋ where x t ∈ R n is the state of the robot. The passive dynamics f robot define the state transition, which for path planning algorithms is commonly described through a simple particle model. The scaled control input u t ∈ R p is the computed state correction through the robot's actuators and G(x t ) ∈ R n×p the control matrix. At each discrete time step, the robot collects measurements y(x t ) to gather information about the environmental field f (x t ) of interest. The sensor model of a single robot is described by where z real depicts the real environmental field values, h(·) is the observation function, and Σ y is the diagonal covariance matrix representing the measurement noise.

Field Belief Representation
In order to efficiently learn an environmental field estimate the robots have to gather information by taking measurements and infer knowledge from the collected data. The collected data is mapped to a belief of the underlying environmental process that enables the agents to plan actions. In the information-theoretic exploration algorithms, the second order moments of the field belief representation are used to evaluate the quality of the planned exploration paths. In this context, the concept of probabilistic kernel models is a natural choice for spatially correlated field values. The correlation of the field values is assumed to be known a priori. Thus, no hyperparameter estimation has to be performed during exploration.
The limited computational capacities of embedded systems such as mobile robots require the belief model to have a comparably small and constant computational cost. Therefore, we extend algorithms based on GMRFs and Kalman Filtering to limit their computational complexity and to enable their combination with a stochastic path planning controller for exploration tasks.
A GMRF defines a (finite-dimensional) random field that follows a multivariate Gaussian distribution while satisfying the Markov property. Due to the Markov assumption, the inverse of the covariance matrix Λ can be defined on a predefined lattice, while also being (desirably) sparsely populated. The GMRF is fully defined by a mean vector µ and a precision matrix Λ −1 as N (µ, Λ −1 ). GMRFs are well suited for the approximation of conditional auto-regressive processes, but require the initialization of a fixed lattice of random variables (RVs). The initialization hinders the application of GMRFs to model temporal process correlations. Therefore, a Kalman regression algorithm, also referred to as spacetime Kalman Filtering (STKF), is utilized to express a belief of the spatio-temporal environmental field. However, both stochastic field belief models provide a measure of belief uncertainty in the form of the conditional variance.

Stochastic Optimal Control Problem
The stochastic optimal controller allows computing an optimal path with respect to the given value function. In this work, we use a field uncertainty based value function, whereby the field uncertainty is computed from the previously developed belief model. We use a stochastic optimal controller to plan a maximal informative path with respect to the belief model's predicted variance. As thoroughly discussed in [24], the predicted variance results in the conditional entropy, which is an indirect informative criterion. Hereby, the term indirect informative means that only the uncertainty of the next potential state is considered, while the information of surrounding field values is not considered. A critical insight for information theoretic path planning is that the posterior variance does not depend on the process values of the obtained measurements if the kernel function is known.
In order to sample potentially feasible paths, the following robot dynamics are considereḋ where x t ∈ R n×1 denotes the system state, f robot (x t ) ∈ R n×1 the passive dynamics and ε t ∈ R p×1 the additive Gaussian noise with variance Σ ε . Moreover, let the index t denote any arbitrary time step, while we use the index t i to emphasize a particular time. The final goal of a stochastic optimal controller is to compute the optimal controls u t with respect to the performance functional The expectation E τ i:H is taken over all trajectories starting at x t i . Also t 0 = 0 s denotes the time at the current agent position, and t H the last time step of the control horizon. We define the finite horizon cost function R τ i:H for a trajectory piece τ i:H with start at time t i and end at t H as The term φ t H denotes a terminal reward at time t H . The immediate cost r t at time t is chosen as with q t = q(x t , t) being a state-dependent cost function, and R being a positive semi-definite weight matrix. Note that the control action u t is linear in (3) and quadratic in (6).

Probabilistic Belief Modeling for Field Exploration
In this section, we present extensions to two existing field belief concepts, namely the GMRF and the STKF approach, to allow information-based exploration control with constant computational cost over time. Both belief algorithm can be used to estimate a stochastic process on a predefined lattice of Gaussian random variables.
In order to enable the incorporation of the belief models into a stochastic optimal control exploration framework, we extend the belief algorithms analog to [23]. Therefore, we incorporate off-grid observations through an affine transformation of field measurements onto the belief grid using spatial shape functions. However, the original concept presented in [23] does not fulfill the requirement of constant computational cost over time.
The GMRF-based belief algorithm was originally proposed in [13] and enables efficient estimation of stationary spatial processes on a discrete grid of Gaussian random variables. The second belief algorithm, proposed in [15], combines GP regression of spatial process components with Kalman filtering of conditional-auto-regressive temporal process components.
Defining the field representation on a lattice raises the question on how to choose the grid discretization based on the fundamental trade-off between the accuracy of the field representation versus the available computational power. The latter aspect is of particular importance in the field of underwater robotics, as off-board computation of the field representation is not feasible due to the very limited communication bandwidth. Thus, the discretization step size has to be selected depending on the actual application scenario. Thereby, prior knowledge on the spatial scale of the physical process is a helpful and valid assumption. For instance, if the user aims to explore small scale processes in a local environment, e.g., an industrial tank, a dense grid is likely to be a better choice than a coarse grid which captures global physical phenomena with acceptable computational burden. Moreover, shape function approximation can be used to interpolate the field belief between the grid points. This allows to efficiently monitor large scale fields where the main interest lies in the exploration of global phenomena rather than local small scale processes.

Shape Functions
The introduced belief algorithms estimate the field on a discrete lattice {V, E } with vertices V = {1, ..., n} and edges E.
The set of continuous field locations is discretized into a finite subset of n spatial input locations S = {s 1 , ..., s n }, such that the vector The lattice S consists of a finite number of sub-domains S e,i , where each is enclosed by four verticess i , with i = 1, ..., 4. For the ease of illustration, S is chosen as regular lattice with edges each of length a and b respectively, as depicted in Figure 3. As proposed by [23], the field value at position q can be approximated through a sum of weighted shape functions φ e i (q) and field values on the vertices f (s i ), such that Figure 3 illustrates shape function φ e i=4 on an element domain F e . Each shape function is zero outside its corresponding element and equal to one at the associated vertex  Figure 3. Shape function on a selected grid element. A local coordinate system K e (x e , y e ) is defined on F e . The origin of K e (x e , y e ) lies in the center of the element. The corresponding coordinate axis are orthogonal to each other and parallel to the respective element edges. The shape functions are defined as Using the shape functions above, φ j defines a mapping between the continuous field and four-element vertices through In this manner, a measurement y j (t k ) can be expressed in terms of the element grid approximation, writing Moreover, the mapping of N measurements at time t k is obtained as Further we define the mapping

Gaussian Markov Random Field Regression
Gaussian Markov random fields define a conditional auto-regressive (CAR) process. A process is a CAR(j) process, if the expectation of a process value is fully defined through the next j adjacent graph vertices. The Markov assumption enables the direct construction of a sparse precision matrix. Given a labeled graph G = (V, E ) with vertices V = {1, ..., n} and edges E, a probabilistic graphical model η defines a GMRF, if the edges E are chosen such that there is no edge between node i and j, if η i ⊥ η j | η −ij , in which −ij denotes the nodes adjacent to i and j, respectively [25]. The pairwise conditional independence properties of x on G are inherent in the subdiagonal entries of the precision matrix Λ. We refer the reader to [25] for an in-depth discussion of GMRFs.
In order to construct the GMRF, the continuously indexed spatial field F * ⊂ R d is discretized into a labeled undirected spatial graph with n * vertex positions S * = x 1 , ..., x n * , where x i denotes the i-th field vertex position (Note that in this work the scalars x or y denote the spatial position coordinates of a two-dimensional spatial position vector x, while a bold y represents an environmental field observation vector.). The set of field locations S * is extended to S with vertex positions S = x 1 , ..., x n , as depicted in Figure 4, to compensate boundary effects occurring due to the GMRF approximation. Then on S, a GMRF η is constructed using the SPDE approach proposed by [12]. Figure 4. The agent's position q k,i at the discrete time step k lies in a spatial field F * with coordinate values x and y. The field F * can be extended to F and is then discretized into a regular grid S to enable the construction of a GMRF and to compensate for boundary effects.
Let η ∼ N (0, Σ) be a GP on R 2 defined by the Matérn covariance function defined as in which · denotes the Euclidean distance in R d and K ν the modified Bessel function of the second kind. The GMRF η ∼ N (0, Λ −1 ) defined on a regular two-dimensional lattice approximates a Matérn GP for ν = 0 if the Gaussian full conditionals read For the case of ν = 1, the Gaussian full conditionals read with a = κ 2 + 4 and θ = [τ, κ] ∈ R 2 >0 being hyperparameters of the model. The additional hyperparameter τ adjusts the GMRF's signal variance independent of κ. The proof of Equations (13) and (14) for the general case of irregular grids is stated in [12]. Figure 5 illustrates the correspondence between the spatial lattice locations and the values in each column of Λ using the previously presented construction scheme. Figure 5. Non-zero elements of a column of the precision matrix Λ associated with one random variable and its neighbor vertices on a regular two-dimensional GMRF lattice.
When designing the GMRF precision matrix, the full conditionals for the border vertices of the GMRF grid affect the estimation result considerably. Three commonly used boundary conditions are the Dirichlet, Neumann, and torus boundary condition [25].

Sequential GMRF Regression
In this Subsection, the GMRF regression algorithm proposed in [13] is extended to enable spatial process estimation with continuous observations. The values of the field are represented by the latent variable z i = z(s i ) ∈ R. The latent variables are expressed using a global linear model, such that Hereby, m = m 1 (s i ), ..., m p (s i ) ∈ R p denotes the regression function vector and the vector β = β 1 , ..., β p contains the unknown regression coefficients. The field belief on the lattice is denoted as z = [z 1 , ..., z n ] . The small-scale correlations of the field are modeled through the zero-mean GMRF η ∼ N (0, Λ η|θ ). We initialize the GMRF precision matrix Λ −1 η|θ with the full conditionals as defined in (13) and (14). A zero-mean Gaussian prior is assumed on the regression coefficients β ∼ N (0, T −1 ) to estimate the regression coefficients as a function of z and θ, where T −1 is initialized as a diagonal matrix with large diagonal elements. The probability distribution of the full latent field with precision matrix Λz |θ ∈ R (n+p)×(n+p) being defined as By exploiting the GMRF's canonical form, only the current available measurements y k are required to sequentially update the conditional precision matrix Λz |θ,y 1:k Λ k|θ and the canonical mean b k = Λ k|θ µ k|θ . A sequential updating algorithm is obtained by inserting the canonical mean definition into the formula for conditioning of a multivariate Gaussian distribution, such that with initial conditions Λ 0|θ = Λz |θ and b 0 = 0. Note that the shape function vectors are extended by a zero vector of length p such that Φ k ∈ R n+p . The final sequential GMRF regression algorithm is summarized in Algorithm 1. In order to obtain the variance vector diag(Λ −1 k|θ ) of the full latent field, without calculating the inverse of the precision matrix, the Sherman-Morrison formula is used, Line 13. The complete derivation is outlined in Appendix A. For the sake of clarity, the notation for the conditioning of the GMRF matrices on the hyperparameters θ is omitted. It is worth mentioning that adding the product Φ k Φ k to Λ 0|θ does not significantly increase the density of the initial precision matrix. Thus, the algorithm has a computational complexity of O(1) over time.

Algorithm 1 Sequential GMRF Regression
Require: Hyperparameter vector θ, Extended field grid S, Regression function vector m get j-th agent location x k,j and measurement y k,j 5:

Hyperparameter Estimation for Sequential GMRF Regression
A possible straightforward extension of the proposed model, is described in Xu et al. [13]. In this work, hyperparameter estimation is incorporated by defining the maximum a posteriori distribution p(θ|y) ∼ p(y|θ)p(θ) with p(θ) being a uniform prior distribution over a discrete set of hyperparameter combinations. Approximating the integral by a discrete sum decreases the computational load compared to a numerical integration over p(θ|y). Furthermore, such an approach scales linearly with the number of hyperparameter combinations and can be extended to incorporation of continuous measurements. However, the method requires that the set of hyperparameters are chosen a priori.

Kalman Regression for Field Estimation
The Kalman regression model by [15], also referred to as spacetime Kalman filtering (STKF), incorporates off-grid measurements by adding new grid vertices to the belief lattice. In the following, we propose the concept of weighted shape functions to fuse new observations from continuous space into the already existing neighboring vertices of the discrete grid, see Figure 6. This allows us to keep the number of vertices and their spatial density constant. In order to make this article self-sufficient, we briefly summarize the derivation of the STKF. Figure 6. An agent takes measurements at the position q while maneuvering through a field F. The environmental field is discretized into a regular lattice with the set of locations S. The STKF random variablesf (t) estimate the spatio-temporal process f (t) on S.

Process Model
Given the spatio-temporal physical process f (x, t), its covariance function K(·) is assumed to be separable in time and space, as well as stationary in time K(x, x , t, t ) = K s (x, x )K t (t, t ). Therefore, the power spectral density (PSD) S r (ω) = W(iω)W(−iω) of the temporal covariance K t can be approximated by a rational function of order 2r. As rational functions are universal function approximators, arbitrary (non-stationary) temporal spectra can be approximated by increasing r.
The individual temporal process component z i (t) defined by K t is represented by a continuous state space model S i in companion form using the Wiener-Khinchin theorem and realization theory, such that where w(t) ∼ N (0, I) and the matrices F, G, and H are in companion form The initial state yields s(0) ∼ N (0, Σ 0 ). The covariance matrix Σ 0 is obtained as the solution to the Lyapunov equation FΣ 0 + Σ 0 F + GG = 0. Note that s(t) does not provide a directly intuitive physical interpretation. The temporal process component is obtained as z(t) = [z 1 (t), ..., z n (t)] ∈ R n . The spatial covariance matrixK S is computed on S through the spatial covariance function K s . Finally, the process on S is obtained by spatially correlating all z i (t) through the Cholesky factorizationK 1/2 S ofK S , reading The spatio-temporal process model is depicted in Figure 7. With s(t) = [s 1 (t), ..., s n (t)] (t) ∈ R n×r and process noise w(t) = [w 1 (t), ..., w n (t)] ∈ R n , Equation (22) is condensed to S : ṡ(t) = (I ⊗ F)s(t) + (I ⊗ G)w(t), Process Model in which the Kronecker product is denoted by ⊗. The previous equations are discretized to enable numerical implementation. Thereby, the discrete time step t k is abbreviated as k with a single time step being T k = t k − t k−1 . The discrete process model of (25) reads s k+1 = As k + w k , The discrete state transition matrix is obtained as The zero mean Gaussian noise w k ∈ R rn is defined by the covariance matrix Q k = I ⊗Q k with The measurement noise vector The discrete observation matrix is obtained as In contrast to [15], where new vertices are initialized to include off-grid information, we map the observation y k collected at timestep k at position q k to neighboring belief vertices through (11). Using the measurement interpolation matrix Φ k , we obtain a transformation from the discrete belief lattice to a continuous field measurement as

Kalman Regression
As the temporal process model in (26) is known and linear, a Kalman filter scheme can be used to estimate the evolution of the temporal process component s k by incorporating observation y k+1 [26]. Furthermore, if the noise is assumed to be zero mean and Gaussian distributed, the Kalman filter estimates are optimal in a mean-square sense. The STKF belief model is summarized in Algorithm 2.
Given the state space model in (26), the temporal state component at time step k + 1 evolves in time according to s k+1 ∼ N (A k s k , Q k ), where Q k is the corresponding process noise matrix. The measurement obtained from the real process is assumed to result from an affine transformation of the temporal state component, reading y k+1 ∼ N (C k+1 s k+1 , Σ y ).
In a first step, the Kalman filter predicts the temporal process component at the next time stepŝ k+1|k based on the previous estimated temporal state componentŝ k|k , such that s k+1|k ∼ N (ŝ k+1|k , Σ k+1|k ). The state mean predictionŝ k+1|k and predicted state covariance matrix Σ k+1|k are depicted in lines 9 and 10 of Algorithm 2 respectively.
In the second step, the Kalman filter updates the temporal state componentŝ k+1|k+1 by conditioning the RV on y k+1 , such that s k+1|k+1 ∼ N (ŝ k+1|k+1 , Σ k+1|k+1 ). The equations for the updated state and covariance matrix are stated in lines 11 to 13 of Algorithm 2. Figure 8 illustrates the STKF model. The process model consists of the state space models of the temporal process components z i (t), which are correlated using the product of the Cholesky decomposition of the spatial covariance matrix C S =K 1/2 S (I ⊗ H). The Kalman filter predicts the next temporal process component, which is then updated using new measurements y k .

Algorithm 2 Kalman regression
Require: (F, G, H) state-space model of S r (ω), measurement noise variance σ 2 y , input location set S, spatial, time kernels K s (·, ·), and h(·) 1:ŝ(0|0) = 0 and Σ(0|0) = I ⊗ Σ 0 . 2: Compute Σ 0 as solution of FΣ 0 + Σ 0 F + GG = 0 3: Compute A k , Q k , Σ y,k and C S =K 1/2 S (I ⊗ H) 4: for t ∈ R >0 do 5: if t ∈ ]t k , t k+1 [ then {open-loop prediction} 6:ŝ(t) = (exp (I⊗F)τ )Σ k|k (exp (I⊗F)τ ) 7: else if t = t k+1 then {Kalman estimation} 8: Compute Φ(q k+1 ) and C k+1 = Φ(q k+1 )C S -Prediction step: 9:ŝ k+1|k = A kŝk|k 10: in which r is the order of a single state space model in (23), n is the number of vertices of S, while P is the number of open-loop predictions performed, and N is the number of agents collecting measurements at each discrete time step [15]. In this work, we do not perform any open-loop predictions, thus P is zero. Note that all variables in (31) are assumed to be constant over time. If spatial hyperparameter estimation is performed, the Cholesky transformation of the new spatial covariance matrix needs to be computed. In this case, the computational load is dominated by the computation of the spatial Cholesky decomposition being O(n 3 ).

Hyperparameter Estimation in Kalman Regression
In the STKF, the spatial Kernel hyperparameters can be estimated using a standard estimation method for GPs, such as maximum a posteriori estimation. However, GP hyperparameter estimation methods have the disadvantage of suffering from the big-n problem. In this sense, finding an optimization method that enables spatial hyperparameter estimation represents a challenge yet to be solved. As a state space model approximates the temporal process component, the temporal process hyperparameters can be estimated using methods developed for parameter estimation in finite state space models, as pointed out in [27]. Such methods inherit the advantage of having linear time computational complexity.

Path Integral Control for Exploratory Path Planning
The final path planning algorithm is summarized in Algorithm 3. In a discrete receding horizon formulation, the optimal control vector is recomputed at each sampling instance, while only the first control input is applied to the path planning model to generate the next way-point.
Exploration of underwater environmental fields is often conducted by underwater robots whose propulsion system produces mainly forward-directed thrust, which allows higher cruising speeds and, thus, faster exploration. These robots typically come with non-holonomic dynamics, e.g., the HipppoCampus micro underwater robot [3]. Hence, analog to [23], we use an unicycle model to make use of the dynamic constraints and efficiently generate path roll-outs. The model reads with the directly controllable system dynamics f c robot = 0, as well as the control transition matrix g (c) = 1 being deterministic. Since g (c) is scalar, the weighted control projection matrix also reduces to a scalar M t j = 1.
The cost for the -th path segment τ i:H, is computed in line 7 of Algorithm 3. Note that if we average over M t i u 0:(H−1) the algorithm could become unstable. As [21] points out, the matrix M t i is a projection of u 0:(H−1) onto the column space of g t j weighted by the metric R −1 . A multiplication with M t i results in a decreasing u 0:(H−1) . The state cost q i, is set to be the predictive variance of the belief model at the associated state.
Afterwards, the probability of each path segment P(τ i:H, ) is obtained by normalizing each probability measure ofS(τ i:H, ) through the sum of path segment probabilities of all roll-outs in line 10. In this line, λ is a sensitivity parameter that is eliminated by subtracting a constant term from S(τ i:H, ), writing with c = 10, as proposed by [21]. Path segment dependent control correction -Compute cost of paths segments:

Field Belief Comparison
In this Section, the sequential GMRF regression model stated in Algorithm 1 and the STKF regression model stated in Algorithm 2 are compared for their computational performance time and prediction capabilities. Throughout the analysis, we use the empirical GMRF algorithm proposed by [23] as performance baseline.

Computational Complexity
In the following we analyze the computational complexity of our proposed field belief algorithms. The upper bounds of the belief algorithm's computational complexity are summarized in Table 1. Thereby, the original empirical GMRF algorithm as proposed in [23] already shows a drastic improvement with regard to processor and memory requirements compared to vanilla GP regression. Nonetheless, the empirical GMRF algorithm still suffers from a linearly increasing computational cost over the number of time steps k. The sequential GMRF regression algorithm utilizes the canonical form of the GP, which in combination with a predefined GMRF precision matrix enables a sequential update of the belief. Therefore, the sequential Bayesian GMRF algorithm has a constant computational time with upper bound O(Nn 3/2 ) for the two dimensional scenario. In general, the computational time of the GMRF increases with the number of dimensions as the bandwidth of the precision matrix increases [25]. The STKF's upper bound on the computational complexity is O (rnN + N 3 ). For the case of spatial hyperparameter estimation, the STKF's computational complexity is limited by the computation burden of the spatial Cholesky factorization being O(n 3 ) for dense matrices. Table 1. Computational complexity of the developed belief algorithms for two and three dimensions of the field belief with number of agents N, number of discrete time steps k, number of field grid values n, and dimension of the state space model r.

Environmental Field Estimation
The hyperparameter configurations of the individual algorithms and the corresponding acronyms are listed in Table 2. The hyperparameters of the belief model in Table 2 are tuned by hand in order to optimize the approximation result. Hereby, the size of the field lattices are tailored such that the computation time has the same value for all algorithms. Note that GMRF-2 and GMRF-3 use the same regression algorithm, but the used CAR process type, boundary condition, and lattice size differ. During the simulation, measurements are collected from the spatial field depicted in Figure 10. On each belief update step, the next measurement location is chosen to be the point with the highest predictive variance plus a Gaussian noise term with a variance of 0.5 m 2 .
In order to provide meaningful results the simulation setup is run over 50 individual cycles for each belief algorithm. Figure 11a shows the root mean square (RMS) of the predictive variance sum. We choose the predictive variance sum as belief convergence criteria as it-unlike the empirical mean-is more sensitive to outliers. The convergence behavior of the predictive variance sum's RMS is utilized as a measure for the exploration performance. Table 2. Hyperparameter configurations of the individual algorithms and corresponding acronyms. The size of the belief field grid S is defined by the total number of vertices in x-direction being n x . The second value inside the brackets depicts the total number of padding vertices in one dimension.

Acronym
Belief Algorithm Process Type Boundary Cond.  Figure A1 in Appendix B illustrates the mean and variance prediction results for the different regression models. Due to the steep correlation structure between known and unknown field values induced by a CAR(1) model, comparatively many measurements are taken in the vicinity of the field boundary. If the padding around the true field GMRF grid is chosen relatively small, distortion effects of the boundary conditions affect the estimation result. The predictive variance of GMRF-1 and GMRF-2 after one measurement shows a non-circular shape, which is induced by the Neumann boundary condition, while the predictive variance of the GMRF-3 after one measurement increases on the edges of the field due to the Torus boundary condition. While designing a GMRF, the dependencies between the field approximation, the used GMRF model, and the chosen boundary condition must be taken into account.   Figure 11b illustrates the median and inter-quarter range (IQR) of the computational time of the different belief models after fifty simulation runs. As expected, the computational time of GMRF-1 increases practically linearly over time due to the increasing number of observations. GMRF-3 has approximately 4500 lattice vertices, while GMRF-2 has approximately 2400 lattice vertices. While GMRF-2 and GMRF-3 both utilize the same regression algorithm with different lattice sizes the computational time almost equals. The same computation time can be attributed to the CAR(2) model used for GMRF-3, which results in a less sparse precision matrix than compared to the CAR(1) model of GMRF-2.

Analysis of the Exploration Algorithm
In this Section, the STKF belief model is combined with the stochastic controller and the performance of the resulting exploration algorithm, abbreviated as PI-STKF, is analyzed. First, the PI-STKF algorithm is simulated for the scenario of a spatially stationary field as introduced in Section 5.2. Afterwards, the PI-STKF algorithm is simulated in a spatio-temporal exploration scenario, demonstrating the suitability of the developed exploration algorithm for long-term field monitoring tasks. The PI-STKF simulation parameters are listed in Table 3. In order to analyze the effect of different control horizons we consider three robot agents which we abbreviate as 'Agent-4', 'Agent-9', and 'Agent-14', corresponding to their control horizons of t H = 4 s, t H = 9 s, and t H = 14 s respectively.
In order to analyze the performance of the algorithms completely isolated from external non-reproducible influences, simulations are conducted on a 2.4 GHz Dualcore-CPU 'i5-2430M' and 8 GB RAM computer where the algorithms are implemented in Python. The computer executes the exploration algorithm at a frequency of approximately 1.5-3 Hz and, thus, sufficiently fast to provide underlying low-level control schemes with the required data. Moreover, it is worth mentioning that the current Python-implementation is not yet speed-optimized such that a sufficiently fast execution time is expected on the micro robot's onboard hardware. Table 3. PI-STKF simulation parameters.

Analytical Field Exploration
At every discrete time step, the agent receives a new measurement at its current location from the underlying simulated real field which is depicted in Figure 10. The measurement is perturbed by Gaussian measurement noise with variance σ 2 y . The measurement is fed to the STKF belief model. The temporal length scale of the STKF belief model is set to 10 7 , such that the conditional variance of the belief model does not noticeably increase over time. The PI 2 algorithm utilizes the belief's conditional (predicted) variance as state cost. Figure 12 illustrates the roll-out sampling step for a control horizon length of t H = 4 s (left) as well as t H = 14 s (right). At t = 40 s the Agent-4 plans a optimal trajectory towards a local variance maximum. In contrast, Agent-14 takes a path towards the global variance maximum.   Figure 12, Agent-4 samples trajectories which are not long enough to reach to the global maximum of the predicted variance field. As a result, Agent-4 follows a rather sub-optimal trajectory from t H = 9 s until t = 30 s. Furthermore, between t = 60 s and 90 s Agent-4 moves again into the upper left corner, as the rather short sample roll-outs do not provide sufficient information regarding the location of the unknown field values. In contrast, Agent-14 shows a more exploratory behavior. On an intuitive level, one might wonder why Agent-14 first navigates to the upper field boundary, instead of directly maneuvering to the upper right corner of the field. However, even though the prediction horizon is longer than Agent-4's horizon, it is still possible that the relatively small number of sampled control roll-outs pushes the optimal path towards a temporary less informative path. When Agent-14 almost reaches the upper field boundary, symmetry breaking occurs. Symmetry breaking is a common phenomenon in stochastic optimal control that describes the sudden fixation of the controller towards one sample direction [22]. As the controller's state cost is the predicted field variance, the agents prioritize a path along the boundary of the field. After t = 90 s both agents obtained a good belief of the original field process. The effective computational time at each time instance sums up to 0.2 s when using a control horizon of t H = 4 s and 0.3 s and 0.4 s for the control horizons of t H = 9 s and t H = 14 s respectively. Hereby, it is worth mentioning that the controller is implemented in Python using mainly a non-optimized for-loop structure. In order to measure the expected average exploration performance of the PI-STKF algorithm, the simulation with the stochastic field in Figure 10 is repeated 50 times for three different control horizons (t H = 4 s, 9 s, 14 s) where each simulation episode has a length of 150 s. The agent initial position is picked uniformly random within the range of x = 0.5 m to x = 9.5 m for the x-coordinate while the y-coordinate is set to y = 0.5 m, and the robot's initial orientation is α = π/2. The obtained results are compared to a random walk exploration strategy as a baseline. Figure 14a illustrates the median and inter-quarter range of the sum of the agent's conditional variance using STKF-1 as a belief model. Hereby, the stochastic controller consistently outperforms the random walk baseline strategy. In the first 15 s of each simulation the controllers with horizon t H = 4 s and t H = 9 s drive the agent towards the upper field boundary which results in a similar exploration performance. At approximately t = 15 s the exploration performance of Agent-4 decreases in comparison to the agents with longer control horizons. The information gain per time step in an unexplored field is almost independent from the control horizon when pursuing the the first time steps. However, agents with longer control horizons begin to profit from their far-sight controller as the field exploration mission continues. As a result, when compared to agents with longer horizon Agent-4 tends to maneuver itself on a less informative trajectory. In Figure 14a, the difference in exploration performance between the controller with t H = 9 s and t H = 14 s is small. Nonetheless, exploration missions in larger fields are likely to profit from longer control horizons. In order to analyze the agent's exploration efficiency we compare the time spans the agents require reach predefined exploration levels which we represent by the summed predictive variance. Hereby, a crossing time is defined as the time it takes for the agent to drive its field belief predictive variance sum beneath a predefined value. The corresponding crossing times are illustrated in Figure 14b. It can be seen that Agent-9 (t H = 9 s) and Agent-14 (t H = 14 s) outperform the controller of Agent-4 with a horizon of t H = 4 s. Thereby, Agent-9 and Agent-4 show a similar exploration performance down to a predictive variance sum of 600. For lower predictive variances, the controller with t H = 14 s provides a better and more predictable exploration performance. Note that although the difference of the crossing times lies within a range of 20 to 30 s and, thus, comparatively small the effect on the exploration performance will scale with the size of the field and the number of agents.

Spatio-temporal Field Exploration
In this section we examine the scenario of a spatio-temporal process. The field values are defined on the GMRF grid and interpolated between each other in order to obtain a continuous field. The temporal process at time instance t = 10 s, 30 s, 60 s, and 90 s is depicted in Figure 15(left column). We use the same parameters as in the PI-STKF simulation, see Table 3, except for the temporal length scale which we set to 15 5 . The reduction of temporal length scale let the agent's belief variance increase over time if no measurement is obtained. Thus, the STKF is able to capture the evolution of the temporal process through the increase in uncertainty of the temporal process component. This is beneficial for coping with unknown stochastic components of the environmental process and hence potentially enables long-term autonomous monitoring scenarios. Figure 15 illustrates the predicted field mean and variance of the STKF belief model, while the PI-STKF explores the spatio-temporal field. The spatio-temporal process at the agent's initial position starts with a concentration value of approximately −1.5. After t = 60 s the agent has finished its initial clockwise exploration maneuver and passes the vicinity of its initial position again which concentration value has now changed to ca. 1.5. As depicted in Figure 9, the optimal control is computed by sampling potential path roll-outs and weighting them through a path probability that results from the conditioned variance as well as the control effort. As the variance of the field belief has noticeably increased during this maneuver, the obtained measurements lead to a process estimate which fairly represents the underlying original concentration field. Thereby, the increasing conditional field variance describes the loss of information in field regions which have not been visited by the agent recently. The time instance t = 90 s is a illustrative example on the controller exploration strategy: the planned trajectory first heads towards a local variance maximum and then points to the agent's starting position at the beginning of the simulation. During the numerical experiment the computational time of the PI-STKF algorithm was on average 0.7 s and remained constant.  5 resulting in a noticeable belief change over time, as represented by the increasing predicted field variance (left column). The agent collects measurements from an spatio-temporal concentration field to compute an optimal path (|) based on the predicted STKF field variance. The next position of the agent's trajectory (|) is obtained by applying the first value of the computed optimal control sequence. Left: Spatio-temporal concentration field Center: Predicted field mean Right: Predicted field variance.

Conclusions
In this work, we outline the combination of Kernel-based belief models with stochastic optimal control while ensuring an upper bound on the computational load of the algorithm. First, a sequential fully Bayesian GMRF algorithm was extended to incorporate off-grid observations through shape-functions. Second, we showed that a spacetime Kalman filter belief model combined with the PI 2 algorithm enables the exploration of spatio-temporal environmental fields. We demonstrated that the STKF algorithm can be extended by using shape functions to handle off-grid observations while remaining constant computational complexity. Extensive simulations were carried out to systematically identify bottlenecks in the exploration algorithm's performance.
Future work will address the robustness of the PI-STKF algorithm while taking into account non-Gaussian measurement noise, uncertain localization, and uncertain hyperparameters. Furthermore, it is interesting to analyze the algorithms' performance in an underwater experimental setup by using micro underwater robots as mobile sensor nodes. However, the design of meaningful experimental studies is a challenging task itself, as it requires reproducible environmental fields to provide the same scenario setup to all algorithms. Such a benchmark field can be realized by using, for instance, look-up tables of the field onboard the robots or spatial depending illumination of the fluid volumes which is then measured by the robots. Moreover, future studies will examine the influence of different field geometries (e.g., irregular grid shapes) on the exploration performance. These will also include varying control horizons and exploration noise. Another aspect is the analysis of different information theoretic criteria to redefine the state cost of the optimal control problem. Such future studies could evolve around the approximation of common information theoretic criteria for path planning such as mutual information [24] to increase exploration performance. In this context, further analysis of the belief models' computational properties should be conducted. A combination of a GMRF which approximates the spatial process component while e.g., a Kalman Filter captures the temporal process component could result in a computationally efficient spatio-temporal GP model. Furthermore, the presented algorithms could extended to a multi-agent fleet which uses a decentralized belief representation rather than sharing one central belief model in order to reduce the dependence on restrictive underwater communication systems.
Finally, the proposed approach for synthesizing an autonomous field exploration algorithm shares similarities to the problem of safely learning dynamic models, such as in [28,29]. In the aforementioned works, the authors leverage GP models to safely explore the state-space of an unknown dynamic system, while the state can only changed continuously. We believe that the findings in one of those fields could be very fruitful for the other.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: First, (20) is rewritten into a sum of vector products with Φ k,j denoting the interpolation matrix of the j-th agent at discrete time step k, such that By analyzes of the structure of (A1) an update rule is obtained that allows the sequential incorporation of new measurements as Applying the Sherman-Morrison formula on (A2) results in With (A1) and (A3), the conditional covariance matrix is obtained as Thus, the sequential update rule for the conditional variance can be written as    Table 2. Left: Predicted field mean of different regression models after fifty observations. In each prediction step the next observation was chosen to be close to the maximum predicted variance. Right: Predicted field variance of different belief models after one observation.