Exploration vs. Data Refinement via Multiple Mobile Sensors

We examine the deployment of multiple mobile sensors to explore an unknown region to map regions containing concentration of a physical quantity such as heat, electron density, and so on. The exploration trades off between two desiderata: to continue taking data in a region known to contain the quantity of interest with the intent of refining the measurements vs. taking data in unobserved areas to attempt to discover new regions where the quantity may exist. Making reasonable and practical decisions to simultaneously fulfill both goals of exploration and data refinement seem to be hard and contradictory. For this purpose, we propose a general framework that makes value-laden decisions for the trajectory of mobile sensors. The framework employs a Gaussian process regression model to predict the distribution of the physical quantity of interest at unseen locations. Then, the decision-making on the trajectories of sensors is performed using an epistemic utility controller. An example is provided to illustrate the merit and applicability of the proposed framework.


Introduction
We consider the general problem of exploration using multiple mobile sensors, or fixed sensors whose field of view is reconfigurable. The problem is to locate regions where concentration of the physical quantity of interest occurs and then, having found such regions, to expend sensor capability to refine data there while continuing to search for new interesting regions. That is, after initial discovery, there is a trade-off between increasing knowledge by taking more measurements in regions already known to be of interest and increasing knowledge by exploring in regions where concentration of the phenomenon of interest (PoI) may exist but are not known to be present. For the purpose of brevity, we refer to regions containing a high concentration of the PoI as interesting regions. Due to the uncertainties of exploration, the problem is not posed as one of optimal path (or resource) planning, but as a problem that balances the competing imperatives of refining measurements while exploring new territory. Therefore, the problem studied here is different than the problems raised in other research fields such as exploration and path planning in robotics and unmanned aerial vehicles (UAVs) [1][2][3][4][5][6][7][8].
There exist various approaches for sensor array configuration or sensor placement [9][10][11][12][13][14][15][16][17][18]. For example, Zhang provides a necessary condition for optimal sensor placement in two-dimensional space using the algebraic structure of sensors [10]. The problem of sensor array configuration in a remote sensing formulation is addressed in [19], in which a statistical optimal criterion is used to identify a solution. By contrast, in this paper, a more dynamic, exploratory stance is taken to trade off typically repeated (or related) measurements against measurements in new areas. Here, we focus on the Gaussian processes (GPs) modeling to formulate the sensor placement problem [13][14][15][16][17][18]. A Gaussian process, as a Bayesian nonparametric tool, is useful for modeling spatiotemporal data, especially in cases where the data contain random variations and so it cannot be well-represented via parametric models [20]. GPs have been used for decades as a supervised learning tool for regression problems known as Gaussian process regression (GPR) models [21,22], and are also referred to as kriging, named after the mining engineer D.G. Krige in the geostatistics literature [23][24][25]. GPR models are used here as a spatiotemporal interpolator/extrapolator tool to predict the PoI at unsampled data points. GPR models and kriging methods are applicable to a wide variety of problems such as modeling or control of robotics-related applications, the prediction and estimation of temperature, precipitation, missing pixels and unmixing of pixels in hyperspectral imaging (HSI), human head pose estimation, concentration of carbon dioxide in the atmosphere, etc. [22,[26][27][28][29][30][31][32][33][34][35]. As an example in HSI, one main objective is to unmix the spectral information to make an inference of the composing materials in the scene. Imbiriba et al. in [30] consider a nonlinear model where the underlying function is governed by a Gaussian process model to detect the nonlinearly mixed pixels. Xing et al. in [33] introduce an algorithm for dictionary learning based on a GP prior to remove the noise and infer the missing data in HSI. Another example is to predict the temperature based on the available data collected from the meteorological stations. For instance, Wu and Li [31] apply the residual kriging method to predict the average monthly temperature at 500 unknown locations in the United States.
More closely related to this research, GPs have also found been applied in sensor placement problems [13][14][15][16][17][18]. In [15], Garnett et al. propose a Bayesian optimization algorithm for sensor placement based on GP modeling. They applied their approach to place 50 sensors around the UK to measure the air temperature. A typical sensor placement technique is to use the variances associated with the maximum a posteriori (MAP) estimates of GP as a measure representing the amount of uncertainty in the region. This leads to placing the sensors at locations with the highest variance (entropy) [17,18], in order to reduce the overall entropy in the region under study. This characterization of the quality of sensor placements seems to be naive due to the following reasons. As observed by [14,18], sensor placement using only the measure of variance usually forces the sensors to be placed at the borders of the region under study, due to the fact that there is no measurement outside the region and, as a result, the borders tend to have very few measurements in their neighborhood to be used in the training set of GPs. To tackle this issue, in [16], a mutual information criterion is proposed in order to place the sensors at the locations most informative about the unseen locations. However, the optimization turns out to be an NP-hard problem. In [14], an approximate form of the mutual information optimization approach is considered to select informative sensor locations via exploiting the sub-modularity property of mutual information. However, if we continue to perform sensor placement successively using either the entropy of the region or the mutual information criterion, there is a high chance ending up with some sort of uniform sampling (equally spaced sensor placements) in the region. This is because the placements tend to occur at the locations far away from the visited locations of the sensors. These criteria only fulfill the exploration goals without taking into account refining measurements of the interesting features about the underlying phenomenon. In addition, the available budget on sensors may not allow us to have widely scattered sensor placements (or place the sensors far away from their previous positions) in applications such as the space missions that exemplify our work.
In this paper, we devise a general framework to specify the trajectory of mobile sensors in order to find and characterize the concentration of the quantity of interest for the PoI in the region under study. This work differs from previous work in this area because we seek to fulfill both exploration and data refinement desires. Here, we assume that there are some interesting phenomena occurring in the region under study. Therefore, the measurement of mutual information or entropy is not enough to reach a single objective. The proposed framework adaptively makes value-laden decisions on the sensor placement. This framework consists of two main stages. The first stage is the prediction stage, where we apply a Gaussian process regression model to predict the PoI at the unsampled locations. The GPR not only provides interpolated/extrapolated values but also the variance of the estimated values. Then, the question is how to decide on the next set of trajectories, based on the information obtained from the previous (GPR) stage. There is not sufficient information to obtain an optimal solution a priori, since the available information is local (the result of previous measurements) and may change over time. The decision between improving information in the neighborhood of known PoI and exploring new territory in the hope of discovering additional interesting locations can be hard or contradictory. In order to deal with the above issues, we set up the decision-making based on epistemic utility theory [36][37][38][39], which forms the second stage of our framework. Epistemic utility is able to provide decisions, called satisficing decisions, based on the local rather than global information in a way that specifically trades off between two different goals [40].
The remainder of this paper is organized as follows. In Section 2, some background of the GPR model is presented. Then, decision-making based on the epistemic utility controller is described. In Section 3, these tools are applied to the problem of trajectory determination of mobile sensors. Section 4 contains an example illustrating the effectiveness of the proposed approach.

Theoretical Preliminaries
This section is divided into two parts. The first part is devoted to a review on Gaussian process regression models, and the second part describes the epistemic utility theory. These two tools will serve as the backbone of our proposed framework.

Gaussian Process Regression
Gaussian processes (GPs) are widely used for modeling a feature or PoI based on observed spatiotemporal data. A GP can be used as a tool for either classification or regression, where the regression application of GP is called Gaussian process regression (GPR) [21,22]. It is assumed that the concentration of the physical quantity of interest can be evaluated via an unknown and probably nonlinear function, which we denote by f (·). The arguments of the function comprise a variable set u referred to as the input data. For example, u can be defined as u = [u x , u y , u z , t] T , where (u x , u y , u z ) and t denote the spatial and temporal information about the measurements, respectively. Unlike parametric models such as linear regression, GP is non-parametric. In GP, one defines a probability distribution as a prior over the unknown function f (·), directly. In other words, GP defines a distribution over functions in the function space and the inference is performed directly in this space [22]. This is more general than a parametric model such as Bayesian linear regression, where the prior distribution is defined over the space of parameters. The GP model treats any observation as an outcome of a Gaussian random variable, and all of these random variables are jointly Gaussian. With this setting, any well-defined GP model only needs a mean accompanied by a positive definite covariance function. Under this assumption, GP provides a posterior distribution over the unknown function f once data are observed. Therefore, for any set of N observations with the input data set {u 1 , . . . , u N }, GP assumes that the distribution p( f (u 1 ), . . . , f (u N )) is jointly Gaussian with some mean µ(U) and a covariance matrix K(U), where U := [u 1 , ..., u N ]. The entry in row i and column j of ) is a positive definite kernel function. The kernel function specifies the covariance between pairs of random variables at the corresponding data points. The GP model is defined as follows [41]: predicts the PoI at unseen data points using the available training data set. The GPR can handle noisy observations. Suppose we have access to a set of N noisy observations y = f(U)+ , where ∼ N (0, σ 2 n I N ) and y n = f (u n )+ n , ∀n = 1, . . . , N. The pair (u n , y n ) is the nth training data. Using GPR, the goal is to predict the underlying function f evaluated at some other input data set U i.e., inferring f(U ), where U := [u ,1 , . . . , u ,M ] T . The set U serves as the input test data set. Based on GP modeling, the prior joint distribution between the training and test data can be expressed as [41] where f denotes f(U ) and [K(U, U )] nm = κ(u n , u ,m ). The predictive distribution over the test data, using the existing rules for conditioning Gaussian distributions, is expressed as follows: where The point estimate for f(U ) is the mean µ f and the amount of uncertainty in the estimation is represented by the variance Σ f in Equation (4). Design of the covariance function requires incorporating some prior knowledge about the behavior of the PoI as it determines the amount of correlation between any pair of data points [21]. Some of the most widely used covariance functions are the squared exponential kernel (κ SE (u, u ) = exp {− (u−u ) 2 2l 2 }) and rational quadratic (κ RQ (u, u ) = (1 + (u−u ) 2 2αl 2 ) −α ), where l and α are hyperparameters [22]. These kernels fall in the category of stationary covariance functions. Once the structure of the covariance function is selected, the corresponding hyperparameters in the model can be chosen either empirically or using some quantified statistical methods. In the empirical approach, the selection of hyperparameters is usually achieved using the empirical features obtained from the observed data such as the smoothness or periodic behavior of the samples.

Remark 1.
Although GPs are powerful tools for regression and classification problems, they suffer from high computational complexity as the sample size of the training data set increases. This problem occurs because the estimation of the test data involves inverting the covariance matrix of the training data which grows as more data are collected. The focus of this paper is not on the computational complexity of the GPs but rather on developing a general framework for mobile sensor configuration. Regarding the complexity of GPs, there exist some approaches such as the one for truncated covariance matrices in GPs [42], online sparse matrix GPs (OSMGP) algorithm [32], sparse greedy GP (SGGP) approximation method [43], and reduced rank GP (RRGP) [44]. One can use these approaches for the GPR estimation in the first stage of the proposed framework. Furthermore, there exist some studies on estimating the covariance matrix instead of an experimentally designed kernel function. For instance, Xu and Choi provide an approach to estimate and improve the quality of covariance function for anisotropic spatiotemporal GP using mobile sensor networks [13]. The suggested sampling method for such a problem is based on minimizing the information-theoretic cost function of the Fisher information [13]. This can also be incorporated into the proposed framework.

Epistemic Utility Framework
When it comes to making a decision, the first approach choice that comes into mind is to seek only optimal decisions. However, it turns out that the demand for optimality may force us to modify the original optimization problem in an objectionable way in order to obtain a tractable computation. This demand may result in solving another problem in an optimal sense, which may not necessarily account for the original problem anymore [37]. A viable alternative to optimality seeking is to formulate satisficing decisions. A satisficing decision-making strategy looks for a satisfactory and sufficient solution for the original problem rather than seeking an optimal solution to a modified problem. This is useful when either the available information is local or insufficient, or the problem is complex enough that the optimal solution is intractable [37,40]. A simple example of this situation can be found in [37] for a routing problem, where a motorist wishes to reach a destination using a given map of city streets and information relating to the fuel costs of traversing each street under the assumption that the fuel costs are subject to change in arbitrary and unpredictable ways as time evolves. A satisficing approach operates like a rational, but not extremal-seeking, human driver would, finding a route that achieves the primary goal of arriving at the destination, reasonably take into account costs of travel, but without ensuring that a globally optimal solution is obtained.
Epistemic utility theory is a framework for making satisficing decisions. In epistemic utility theory, satisficing decisions are made via putting more emphasis on avoiding wrong decisions and favoring informationally valuable decisions rather than the more restrictive (and sometimes unachievable) task of finding the best solution. This theory employs two utility functions [37,39,45] and seeks a trade-off between them. This is different than a more conventional approach with a single utility for which a maximizing point is sought. The first utility function in the epistemic utility deals with the correctness of the decisions by characterizing the truth value of propositions being evaluated. The second utility function accounts for the importance of decisions by characterizing the informational value of rejecting propositions. The two utilities are established independently. As an example [40], the truth value for rival scientific theories can be assessed by compatibility with observations, while the informational value can be assessed by parsimony or predictive power.
To apply epistemic utility theory, each of these utility functions is equipped with the mathematical structure of a probability (through normalization if necessary), which allows these two utilities to be compared. Let P = {p 1 , . . . , p n } be a finite-dimensional decision space (propositions, or options available to an agent) and let F be the power set of P consisting of all subsets of P. Let q : F → [0, 1] be a probability measure over the measurable space (P, F ). The triple (P, F , q) defines a probability space, and the probability q can be interpreted as a measure of the truth value of the elements of F . The truth value utility function is also known as the credal probability function . Thus, for any set G ∈ F , q(G ) is a measure of the truth value (credibility) of G [37]. Similarly, let m : F → [0, 1] be another probability measure over the measurable space (P, F ). The probability m is defined such that for any set G ∈ F , m(G ) represents the informational value of rejecting G . The informational value utility function is also known as the rejectability probability function. The rejectability defines the amount of information the agent gains once some propositions are rejected from the decision space [36]. (If no options are rejected, the agent is left with the same uncertainty, so no information is gained.) After defining the credibility and rejectability probability functions, we can then formulate satisficing decisions. For this purpose, a parameter denoted as agent's index of boldness, b, is used as a weighting factor on the informational value function. The boldness parameter represents the agent's willingness, or boldness, to reject propositions. A larger boldness indicates a propensity to reject more options in the evaluation of the credal probability q vs. the informational probability m. Using Levi's rule [45], among the set of possible options, those propositions whose truth value does not outweigh the weighted information value of rejection are rejected. That is, a proposition p is not rejected if q(p ) > bm(p ). By this means, propositions are retained that are "good enough", ensuring adequate performance without the imposition of an overly restrictive unique "optimal" solution.
For a decision space containing all the possible propositions P at some time, the set of options which are informationally valuable and have a probability of being correct is P l contains all the surviving, satisficing, propositions and thus it may not be a singleton set. The notion of satisficing here means that each element of P l is both likely to be correct and valuable in terms of gathering informational value. If the cardinality of the set P l is greater than 1, then the rejectability and credal probability functions of the surviving propositions are renormalized and the test in Equation (5) is repeated. This procedure is referred to as "deliberation", and is repeated until the cardinality of P l cannot be reduced further [37]. We denote the surviving propositions of the deliberation stage as P l . When there is still more than one decision acceptable and a single decision is needed, P l is then subjected to some "tie-breaking". One way of doing this is decision rule [37]

Trajectory Determination of Mobile Sensors
In this section, we propose a general framework to determine the trajectory of mobile sensors in order to explore locations containing concentration of the physical quantity of interest in the region under study. Assume that an initial trajectory of the sensors has already been determined. As may happen in realistic situations, the measurements collected along these initial trajectories may not necessarily be very informative. Once some measurements are collected over the region, the goal becomes using the capability of sensors in follow-on trajectories to both refine the data and explore for interesting regions. Our proposed framework contains two main stages: the prediction stage and the decision stage. A single pass of the sensors only samples a small fraction of the entire region under study, so that decisions about whether to explore other unseen areas must be based on some predictions of the behavior of the interesting phenomenon. Such predictions will be conducive to decide on how to replace the mobile sensors or equivalently to determine the next set of trajectories of the existing sensors.
The prediction stage employs GPR modeling to estimate the PoI in unseen locations. In this setting, the collected data are treated as the training set, where the spatiotemporal location of the measurements determines the input training data and the corresponding measurements are the output training data. The input test data are the other unseen locations over the region under study and the output test data are unknown and are predicted using the GPR. Without loss of generality, we assume that input training data are collected into the set U s ∈ R N 1 ×d , where U s = {u 1 , . . . , u N 1 } and d is the dimension of the input data i.e., u n ∈ R d . For example, d = 4 for the spatiotemporal input data. The output training data, for the scalar output case, are accumulated in y = [y 1 , . . . , y N 1 ] T , where y n is the output corresponding to the input u n . The spatiotemporal information of the test data is collected into the set U ∈ R N 2 ×d , where U = {u ,1 , . . . , u ,N 2 }. The unknown outputs evaluated at the input test data are defined as f = [ f (u ,1 ), . . . , f (u ,N 2 )] T . As prior knowledge, we assume that the joint density function between the training and test data is zero-mean Gaussian, meaning that on average we expect interesting phenomena to occur rarely. The GPR model was defined in Equation (2). The kernel function for constructing the covariance matrices K(·, ·) in Equation (2) will be defined later. The predictive posterior distribution over f for the noisy observation case was given in Equations (3) and (4).
In Figure 1, we show an example of sampling over the region under study. In Figure 1, the dashed lines represent the trajectory of a satellite, the red circles denote the locations where the samples are taken, and the rectangular shape is the region under study. The function f (·) quantifies the physical PoI. The time frame T m is the mth time the satellite visits the region. Assume that the measurements during the time frame T m are taken much faster than the changes in the PoI. We also assume that the PoI exhibits some sort of smooth behavior in the region under study. Therefore, as the satellite is within a specific time frame T m , we may expect to see high correlation between the nearby samples. We further assume (here, for simplicity) that the time it takes for the satellite to revisit the region is less than the time for changes to occur in the PoI. Therefore, if it happens to have the same trajectory for time frames T i and T j , then a high correlation between the collected data at such time frames is expected for the case where T i is close to T j .
Under the smoothness assumption for the PoI, and in order to account for the correlation that may exist between the nearby measurements, we define the following squared exponential covariance function where l 1 and l 2 are scaling factors, u i and u j are the spatial information, coordinates, of any two arbitrary locations in the region under study. The terms T m 1 and T m 2 are the time frames corresponding to the temporal information related to u i and u j , respectively. As the available prior knowledge changes, one may define a different kernel function from Equation (7). Suppose that S = {s (1) , s (2) , . . . , s (K) } is a set of all feasible trajectories of the sensors through the region under study. We denote the locations along the trajectory s (k) at which measurements are collected by the set U s (k) = {u s (k) 1 , . . . , u s (k) P }. For the case where the trajectory along s (k) has not been taken yet, the locations in the set U s (k) belong to the set of test data i.e., U s (k) ∈ U .
Using the GPR model, we obtain a prediction of the PoI throughout the region and the amount of uncertainty (variance) associated with the predictions. The amount of uncertainty can be evaluated via the kernel function defined in Equation (7). We denotef (U s (k) p ) andΣ(U s (k) p ) as the estimate quantifying the PoI and the associated measure of variance for the pth location along the trajectory s (k) , respectively. In this setting, we assume that the locations containing a high concentration of the physical PoI have a higher value of the quantifying function f (·) compared to the other locations. Then, the goal is to decide on the next trajectory of the sensors based on the available data, the estimated data, and also the amount of uncertainty over the region in order to explore the interesting phenomenon. Although the GPR provides us with an estimate of the PoI over the whole region, we get large uncertainty at locations where there exist almost no measurements; the farther away the sensors are from the measurements, the higher the variance becomes. Therefore, if we only emphasize refining existing measurements, the sensors lose the inclination to explore. In contrast, if we put more emphasis on the variance, then the sensors are more encouraged to choose the trajectories which are far away from the previous trajectories to fulfill the exploration objective of the mission. In this case, even if interesting phenomena are found by the past trajectories, the sensors are reluctant to pass nearby again. One way of taking into account both data refinement and exploration desires can be achieved by constructing the following optimization problem to decide on the trajectories of the sensors: where k denotes the kth trajectory from the dictionary of feasible trajectories S that pass through the region under study. The parameter λ is a tuning parameter that balances between the desire to explore and refine data in regions known or predicted to be of interest. Notice that, since the set S is defined in such a way to only cover the region under study, the maximization problem for small λ will at most result in the selection of a trajectory at the boundary of the region. In Equation (8),f ave (U s (k) ) and Σ ave (U s (k) ) are defined asf The optimization problem Equation (8) results in a single solution, which represents an optimum (of a λ-weighted utility function). Notice that there exist no globally optimal solution for the optimization problem in Equation (8) based on the relatively small amount of information that can be obtained from the region under study, the tension that exists between the two desiderata of exploration and data refinement in the region, and the locality of the information that we obtain. The epistemic utility framework is well-suited to handle these types of problems by making satisficing decisions such that the selected trajectories are informationally valuable decisions. Application of epistemic utility is accomplished by defining the two probability functions so that the decision-maker meets both goals of exploration and data refinement. In this setting, each of the possible trajectories of sensors is considered as a hypothesis. In contrast to the optimization problem in Equation (8), the decision maker with Equation (5) uses the credibility and rejectability probability functions which is a comparison between these two probability functions with the emphasizing factor b, where b denotes how much emphasis we consider for the informational value that we can obtain by rejecting propositions (trajectories) from the decision space. The boldness factor b in Equation (5) represents the agent's willingness to reject propositions, which makes it totally different than the term λ in Equation (8). For b ∈ [0, 1], the higher value of b results in more rejection of possible propositions (decisions). For example, setting b = 1 corresponds to rejecting as many propositions as possible in the decision space.
Based on the discussion provided above, below we formulate the problem using the epistemic utility framework represented in Section 2.2. Specifically, we consider the two following cases with their corresponding credal and rejection probability functions. In the first case, we define the informationally valuable trajectories as the trajectories that pass through the regions with high entropy (that is, passing through those regions will resolve a high uncertainty). Correspondingly, we emphasize the rejection of those trajectories which pass through regions with low entropy. Therefore, the rejectability probability is defined to emphasize rejecting the trajectories from S that contain lower uncertainty in order to favor the desire for exploration. Since the kernel function defined in Equation (7) becomes small when the desired trajectory S (k) is far from previous trajectories, the uncertainty associated with this trajectory, as defined in Equation (9), will be larger. This can be seen in Equation (2) when the covariance K(U, U ) would be small and Σ f large as a result. The probability functions constructing the epistemic utility controller can be written as q(U s (k) ) =f n (U s (k) ), ∀k = 1, 2, ..., K, wheref n (U s (k) ) andΣ n (U s (k) ) are defined aŝ f n (U s (k) ) =f ave (U s (k) ) ∑ K k=1f ave (U s (k) ) , ∀k = 1, 2, ..., K, , ∀k = 1, 2, ..., K, and wheref ave (U s (k) ) andΣ ave (U s (k) ) were defined in Equation (9). The subscript n denotes that the functions in Equation (11) are normalized to act like probabilities. According to Equation (10), we assign credal probabilities to the estimates of each possible trajectory while the measure of uncertainties determine the rejection probabilities. Without loss of generality, we assume that the more interesting behavior the phenomenon at location u s (k) p becomes, the higher value the underlying functionf (u s (k) p ) possesses. In the second case, we relate the less interesting phenomenon (low concentration of the quantity of interest) to the rejection probability function. Particularly, the less interesting the predicted phenomenon behaves along the possible trajectory U s (k) , the higher the rejection of the corresponding hypothesis becomes. The credal probability is evaluated via the amount of uncertainty each possible trajectory may possess. The credal and rejection probabilities for case 2 are q(U s (k) ) =Σ n (U s (k) ), ∀k = 1, 2, ..., K, wheref n (U s (k) ) andΣ n (U s (k) ) were defined in Equation (11). Once the probability functions are computed for either of the two above cases, we apply Levi's rule of epistemic utility. As a result, only those trajectories that satisfy q(U s (k) ) ≥ bm(U s (k) ) are surviving hypotheses. The surviving options are defined by the following set: (10) and (12), we used the normalized version of estimates and the associated variance along each trajectory in order to make q(·) and m(·) follow the "sum to one" property of probability. However, one can remove the normalization step and define the boldness factor of b ≥ 0 instead of b ∈ [0, 1].

Remark 2. In Equations
After applying the rule defined in Equation (13), the number of surviving options may not be necessarily unique i.e., the cardinality of U srv could be greater than one. Each of the elements of U srv is a satisficing hypothesis meaning that it is both likely to be correct and possess high informational value. In order to take action, we seek to accept only one trajectory (one hypothesis). Reducing the number of elements in the set U srv is accomplished in the deliberation stage described in Section 2.2 results in reducing the number of surviving hypotheses. Once the cardinality of the set U srv reduces to a reasonable number, the tie-breaking stage comes into play to force the set U srv to a unique element in order to take an action. For this purpose, one can apply the approach with Equation (6) as described in [37], which selects one hypothesis out of the survived hypotheses as the next trajectory. We refer to this trajectory as U s (k ) . Finally, after measurement, the data obtained from U s (k ) are added to the training set U, and the whole process starts again. Figure 2 illustrates the block diagram of the proposed framework.
In order to restrain the increase in the amount of training data fed to the GPR (to reduce complexity), in the data collection block of Figure 2, we only retain the measurements obtained from the last M visited trajectories of sensors. Once a new trajectory is determined and the corresponding measurements are collected, the new information is added to the training set and the oldest set of measurements are discarded. The reason is due to the assumption that the oldest set of measurements may have a very low correlation with the new measurements and the PoI may have been changed. This avoids dealing with the inverse of a big covariance matrix of the training data as we continue collecting new measurements.

Simulation Results
We demonstrate how the proposed framework is applied via a surrogate problem using a constellation of two satellites at the low Earth orbit (LEO). In this particular problem, we are incapable of performing random sampling over the region. Instead, we are restricted to follow specific trajectories once the orbits of the satellites (or trajectories of the mobile sensors) are determined. Initially, the satellites move in predetermined orbital planes over the region under study. For simplicity, assume that the PoI remains unchanged during the sampling period. Figure 3 illustrates an example including the orbital planes of satellites, where the rectangular slab indicates the region under study. In Figure 3, the constellation is defined based on Keplerian orbital elements with semi-major axis a = {7700, 8500} (km), eccentricity e = 0, inclination of i = {π/2, π/2} (rad), and the right ascension of ascending node (R.A.A.N.) Ω = {π/4, 2π/5} (rad). Since the PoI is assumed to be unchanged during the sampling period, the fastest varying orbital elements, the true (mean) anomaly θ and the argument of perigee are ignored. However, one can also take these two orbital elements into account for the case when the PoI changes over the sampling period. The region under study shown in Figure 3 Figure 4, which is borrowed from [46].
In Figure 4, the highest and the lowest quantity of the interesting phenomenon corresponding to the PoI, the highest and lowest quantities of the electron density, are shown with red and blue colors, respectively. We further assume that the PoI has the same profile for the altitude range under study ([129 Km,629 Km]) as shown in Figure 4 along the z-axis of the rectangular region shown in Figure 3. This image can be thought of as a discretized 2D version of the region of interest defined by the pixel values. In the simulations, it is assumed that it is possible to get direct measurements about the PoI along the current trajectories of satellites. Since the electron density content profile, as the PoI for our case study, is assumed to be unchanged during the sampling period, the kernel function in Equation (7) simplifies into where we set l = 10 and u i is defined by the pixel location of the image shown in Figure 4. From the initial trajectories, shown in Figure 3, the corresponding set of measurements of the region under study is shown in Figure 5a. From the initial measurements, the GPR model defined in Equation (4) and Equation (14) is applied to measure the uncertainty and the estimate of the TEC throughout the region under study. The results are illustrated in Figure 5b,c.  The second stage of the proposed framework is applied to decide on the next trajectories of the satellites. Although the initial orbital planes were constructed directly from the Keplerian orbital planes, below we assume (for simplicity in this example) that each possible trajectory can pass through the region under study and measure the TEC with the coverage Earth's longitude resolution of 0.5 • as shown in Figure 4. The problem of designing the Keplerian orbital elements to generate the actual orbits corresponding to such trajectories is not the focus of this paper and is considered as a future work.

Measurements
The epistemic utility is set with the agent's index of boldness b = 1 and the credal and rejection probability functions in Equation (12). In other words, the rejection probability function is defined such that it tends to remove the trajectories corresponding to regions with low electron density content. The credal probability function is set to encourage trajectories willing to visit regions with higher uncertainty. Figure 6 illustrates some of the results for the measurements corresponding to the selected trajectories, measure of uncertainty, and the reconstruction profile. According to Figure 6, the selected trajectories are not willing to revisit the vicinity of the areas which seem to contain low electron density. Simultaneously, the decision maker does not allow for accepting a trajectory at the very vicinity of the already chosen trajectories even if high electron density has been detected at their neighborhood. This is shown by the measure of variance in the third row of Figure 6. More specifically, very few trajectories are selected at the subregions with low electron density content, and such trajectories demonstrate some sort of uniform sampling. In contrast, more trajectories are chosen in the interesting subregions and yet these trajectories do not tend to be next to each other.
In Figure 7, we compare the performance of the proposed framework for this example for cases 1 and 2 defined in Equation (10) and Equation (12), respectively. In Figure 7a, we show the percentage of the measurements (training data) with respect to the total number of possible data if we were able to cover all the regions under study. Even after accumulating the measurements obtained from the initial and the eleven successive trajectories, we still cover around 16% of the complete data over the region. In Figure 7b, the total variance over the region vs. the increase in the number of trajectories is illustrated. Here, the variance of visited locations are set to zero and the variance of unvisited locations are measured via the kernel function defined in Equation (14). Then, we sum over all the variances associated with the pixels. Finally, in Figure 7c, the peak-SNR evaluation between the true and the reconstructed TEC profile is demonstrated.    According to Figure 7b, the increase of boldness factor for case 1 results in the decrease of the overall uncertainty in the region under study. This is because the rejection probability function corresponds to the inverse of overall variance along the possible trajectories. Therefore, the increase of the boldness factor promotes selecting the trajectories along which higher uncertainty is predicted.
In contrast, case 2 shows a different behavior in which the increase of the boldness factor forces discarding the trajectories that are believed to result in collecting measurements in the subregions with low TEC. The peak signal to noise ratio (PSNR) evaluation depends not only on how to construct the probabilities in the epistemic utility but also on the true behavior of the PoI. Since the PoI has a smooth behavior and we have already taken this fact into account, the PSNR increases as we increase the boldness factor for case 1. For case 2, the reduction of the boldness factor usually provides better performance in terms of PSNR, but it also depends on where we sample. For example, setting b = 0.2 does not show better performance compared to b = 0.5 and b = 1. The reason is that the controller decided on a trajectory that is in the region with high TEC but close to the edge of such a phenomenon. Since the kernel function in the GPR stage assumes the smooth behavior, it expands the interesting phenomenon at the neighborhood of the selected trajectory and thus the estimated PoI exceeds the edges of the true PoI. This yields to a decrease in the PSNR.
Finally, we consider two more cases to highlight the advantage of the proposed framework. In cases 3 and 4, the trajectories are selected only based on either the variance or the quantity of the interesting phenomenon, respectively. Notice that case 3 tends to select the trajectories with highest entropy (variance). Using the variance for sensor placement has been widely used [17,18,47], where we have modified it such that it can be used for making decisions on the trajectories of mobile sensors rather than the pointwise sensor placement problems. In fact, case 3 and case 4 correspond to solving the optimization problem in Equation (8) when the termsf ave (U s (k) ) andΣ ave (U s (k) ) are discarded, respectively. Figures 8 and 9 illustrate the obtained results.  . From (left) to (right), the measurements, measure of uncertainty, and the reconstructed profile of the phenomenon based on the initial and eleven successive trajectories are shown for case 3 (first row) and case 4 (second row), respectively.
As expected, Figure 8 shows that case 3 outperforms case 4 in terms of reducing the overall uncertainty in the region. The reason is that case 3 only favors the trajectories with the highest variance. In addition, due to the smoothness of the PoI, case 3 still shows higher PSNR than case 4. Comparing the total variance and the PSNR of all cases 1-4, it is clear that cases 1 and 2 result in better performance in reducing the overall variance and the increase of PSNR, collectively. The trajectory selection of cases 3 and 4 is also shown in Figure 9 for eleven successive sets of measurements added to the initial measurements. Comparing cases 3 and 4, it is obvious that they both tend to uniformly sample the region. However, the difference is that case 3 uniformly samples the whole region, while case 4 uniformly samples the vicinity of the interesting phenomenon once some are observed. Therefore, if there was any other interesting subregion, we would not have a chance to observe it via case 4 for a low number of trajectories. In contrast, cases 1 and 2 of our proposed framework apply an adaptive sampling structure to balance between the reduction of the amount of uncertainty in the region and refining data in the vicinity of an interesting phenomenon.

Remark 3.
For the purpose of only showing how the principles of the proposed framework can be applied, the simulations were carried out by neglecting the constraint on the ∆V-budget of the satellites for the orbit transfer to reach the trajectories selected by the decision-maker. Thus, in a realistic case scenario, some of the selected trajectories may not be feasible. However, this issue can be resolved by discarding non-affordable trajectories from the dictionary of trajectories in the region under study and then applying Equation (13).

Conclusions
We developed a new framework for placing mobile sensors using Gaussian process regression and the epistemic utility controller. The proposed framework considers both desires of exploration and data refinement once some interesting phenomenon is observed. This approach is useful for problems where very little local information is available and no optimal solution exists for the sensor placement. The proposed framework is constructed in a general way and, with some further modifications, it can also handle the constraints that may exist on the budget for sensor placement.

Future Work
As a future work, we will take into account a more realistic set of constraints that may exist in the decision-maker stage of the proposed framework. For example, some common constraints on the constellation design problem are on the feasibility of the orbital planes and the available ∆V-budget of the satellites, which leads to some modifications in Equation (13) of the decision maker stage and the way the feasible trajectory dictionary S is constructed. We will also investigate the comparison of the epistemic utility-based decisions with other techniques such as reinforcement learning.