Intermittent Information-Driven Multi-Agent Area-Restricted Search

The problem is a two-dimensional area-restricted search for a target using a coordinated team of autonomous mobile sensing platforms (agents). Sensing is characterised by a range-dependent probability of detection, with a non-zero probability of false alarms. The context is underwater surveillance using a swarm of amphibious drones equipped with active sonars. The paper develops an intermittent information-driven search strategy, which alternates between two phases: a fast, non-receptive displacement phase (the ballistic phase) and a slow displacement-and-sensing phase (the diffusive phase). The proposed multi-agent search strategy is carried out in a decentralised manner, meaning that all computations (estimation and motion control) are done locally. Coordination of agents is achieved by exchanging data with neighbours only, in a manner which does not require global knowledge of the communication network topology.


Introduction
Searching strategies for finding targets using appropriate sensing modalities are of great importance in many aspects of life, from national security [1,2], rescue and recovery missions [3,4], to biological applications [5][6][7]. A taxonomy of search problems is proposed in [8]. We focus on probabilistic search, where the objective is to localise the target in the shortest time (on average), for a given search volume.
The earliest theoretical studies of search strategies [9,10] were based on systematic search along a predetermined (deterministic) path, such as the parallel sweep or the Archimedean spiral [3,11]. The search patterns of animals, on the contrary, are random rather than deterministic. An explanation for this phenomenon is that an event, such as a detection (false or true), changes the strategy and hence the behaviour of the searcher. Subsequent changes of strategy manifest themselves as a random-like motion pattern. Most current research into search strategies is directed towards the mathematical modelling and explanation of random search patterns [2,12,13].
By studying the GPS data of albatrosses, it was discovered that the search patterns of these birds consist of segments whose lengths are random draws from the Pareto-Lévy distribution [6]. This discovery led to several papers demonstrating that so-called Lévy walks/flights are the optimal search strategy for foraging animals (deer, bees, etc.), resulting in fractal geometries of search paths. An alternative to Lévy strategies is intermittent search: a combination of a fast, non-receptive displacement phase (long jumps within the search domain, with no sensing) with a slow search phase characterised by sensing and reaction [14]. Bénichou et al. provided both a theoretical study and experimental data verification of intermittent search [13,15].

Problem Formulation
For convenience, and without loss of generality, let us consider a search area A in the shape of a square with sides of length b. The area is discretised into an upright square lattice of unit mesh size, thus consisting of N = b^2 cells. The grid is specified as G = {(x_n, y_n); n = 1, . . . , N}, where (x_n, y_n) are the Cartesian coordinates of the centre of the nth cell.
The team of searching agents consists of S ≥ 1 members. Let the searching agent s ∈ {1, 2, . . . , S} at discrete time k be located in the cell l^s_k ∈ G. If the agent is in the sensing mode, it collects at time k a set of detections Z^s_k = {z^s_{k,1}, . . . , z^s_{k,|Z^s_k|}}. Each detection can originate from the true target or be a false alarm. A vector z ∈ Z^s_k consists of a range and azimuth measurement of the perceived target, relative to the agent position l^s_k. Thus, if the true target is in the grid cell x ∈ G, its corresponding measurement is a Gaussian random vector

z = [ ‖x − l^s_k‖, ∠(x − l^s_k) ]^T + w,  w ∼ N(0, R),   (1)

where ‖·‖ is the Euclidean distance, ∠(u) is the angle between u and the y-axis of the reference coordinate system and R is the measurement covariance matrix. The probability of detection P_d is a function of the distance between the agent and the target. The probability of false alarm P_fa is constant within the sensing area around the agent (and zero otherwise). For example, let the probability of detection be adopted as P_d = exp(−r/a), where r is the distance between the target and the agent and a = const is a sensor characteristic. Assuming 360° sensor coverage, the sensing area can be defined as the circular area around the sensor position, with a radius 3a (in this area P_d ≥ exp(−3) ≈ 0.05).
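As a concrete illustration of this sensing model, the following minimal sketch simulates one sensing snapshot for a single agent and target. The function names and the noise parameters sigma_r and sigma_az are illustrative assumptions, not values from the paper; false alarms are handled per-cell in the full model and are omitted here.

```python
import math
import random

def detection_probability(r, a=3.0):
    """Range-dependent probability of detection P_d = exp(-r/a)."""
    return math.exp(-r / a)

def sense(agent, target, a=3.0, sigma_r=0.5, sigma_az=0.05):
    """Simulate one sensing snapshot: if the target lies inside the sensing
    disc of radius 3a, it is detected with probability P_d(r); a detection is
    a noisy (range, azimuth) pair relative to the agent position."""
    detections = []
    dx, dy = target[0] - agent[0], target[1] - agent[1]
    r = math.hypot(dx, dy)
    if r <= 3 * a and random.random() < detection_probability(r, a):
        az = math.atan2(dx, dy)  # angle measured from the y-axis, as in the paper
        detections.append((random.gauss(r, sigma_r), random.gauss(az, sigma_az)))
    return detections
```

A target outside the sensing disc is never detected, so `sense((0, 0), (100, 100))` deterministically returns an empty list.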
Searching agents move in formation. Each agent knows its relative coordinates within the formation (for example, the offset from the centroid); however, it does not have to know the topology of the formation or its size. Communication between two agents in the formation can be established only if their mutual distance is smaller than some threshold R_max. Motion is subject to small perturbation errors, meaning that an agent whose destination during the displacement is a cell l ∈ G may end up in a cell adjacent to it. These motion errors cause the network topology to vary with time. For simplicity, we assume that communication links, when established, are error-free. Figure 1 shows a formation of 13 searching agents with two different communication graphs for two values of R_max. Green lines indicate the existence of a communication link between two nodes (agents) in the graph. The problem for the team of agents is to coordinate the search and find the target in the shortest possible time, using the information-driven intermittent search strategy. The described problem can be applied in the context of a search for a submarine, using a swarm of amphibious drones, each equipped with an active sonar system and a wireless communication device. When a drone floats on the sea surface, it turns on its sonar to collect detections of underwater targets.

Decentralised Estimation: The Probability of Occupancy Map
Let us denote the complete set of measurement data from all S agents at time k as D_k ≡ {(Z^s_k, l^s_k)}_{s=1,...,S}, where Z^s_k and l^s_k represent the measurement set and the location of the sth agent, respectively, at time k.
The current knowledge about the target location within the search area A is represented by the probability of occupancy map (POM). This is a map in which each pixel corresponds to a cell of the grid G and represents the posterior probability of target presence in that cell. For cell n and agent s ∈ {1, . . . , S}, this posterior probability at time k is expressed as p^s_{k,n} = Pr{δ_n = 1 | D_{1:k}}, where D_{1:k} ≡ D_1, D_2, . . . , D_k, and δ_n ∈ {0, 1} is a Bernoulli random variable representing the event that the target is located in cell n. The POM is then a collection P^s_k = {p^s_{k,n}; n = 1, . . . , N}. The POM is updated sequentially using the Bayes rule. Given the POM at the previous time, that is P^s_{k−1}, and the measurement data at time k from agent t ∈ {1, . . . , S}, that is (Z^t_k, l^t_k), the probability of occupancy in the nth cell is updated as [23]

p^s_{k,n} = (1 − P^{t,n}_d) p^s_{k−1,n} / [ (1 − P^{t,n}_d) p^s_{k−1,n} + (1 − P^{t,n}_fa)(1 − p^s_{k−1,n}) ]   (2)

if none of the detections in Z^t_k falls into the nth cell. The term P^{t,n}_d in Equation (2) represents the probability of detection in the nth cell given that the tth agent is located at l^t_k ∈ G. A similar explanation applies to the probability of false alarm P^{t,n}_fa. If a detection from Z^t_k falls in the nth cell, the update equation for the POM of agent s is:

p^s_{k,n} = P^{t,n}_d p^s_{k−1,n} / [ P^{t,n}_d p^s_{k−1,n} + P^{t,n}_fa (1 − p^s_{k−1,n}) ]   (3)

Initially, before any sensing information is received, that is at k = 0, the POM of each agent is set to p^s_{0,n} = 1/2, for all n = 1, . . . , N and s = 1, . . . , S. This POM corresponds to the state of complete ignorance.
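The cell-wise Bayes update of Equations (2) and (3) can be sketched as a single function; the name `update_cell` is a hypothetical helper, as the paper does not prescribe an implementation:

```python
def update_cell(p_prev, detected, P_d, P_fa):
    """Bayes update of the occupancy probability for one cell.

    p_prev   : prior occupancy probability p_{k-1,n}
    detected : True if a detection fell into this cell at time k
    P_d      : probability of detection in this cell for the sensing agent
    P_fa     : probability of false alarm in this cell
    """
    if detected:
        # Eq. (3): a detection raises the occupancy probability
        num = P_d * p_prev
        den = P_d * p_prev + P_fa * (1.0 - p_prev)
    else:
        # Eq. (2): the absence of a detection lowers it
        num = (1.0 - P_d) * p_prev
        den = (1.0 - P_d) * p_prev + (1.0 - P_fa) * (1.0 - p_prev)
    return num / den
```

Starting from complete ignorance (p = 1/2), a detection in a cell with high P_d and low P_fa drives the posterior close to 1, while repeated non-detections drive it towards 0, which is the saturation behaviour discussed below.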
Each agent in the team sequentially updates its local POM using its local measurements and the measurements it receives from other agents in the team (ideally, using the complete set of data D_k). We adopt the dissemination-based decentralised architecture [24] for this purpose, where the entire D_k is exchanged via an iterative scheme over the communication network. In the first iteration, agent s broadcasts its data-pair (Z^s_k, l^s_k) to its neighbours and receives from them their respective data-pairs. In the second, third and all subsequent iterations, agent s broadcasts its newly acquired data-pairs to its neighbours and accepts from them only the data-pairs that it has not seen before (i.e., newly acquired). Provided that the communication graph is connected, after a sufficient number of iterations (which depends on the topology and the size of the graph), the complete list of measurement data-pairs from all agents in the formation (i.e., D_k) will be available to each agent for updating its local POM.
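A minimal sketch of this flooding scheme, assuming a static connected graph given as per-agent neighbour lists (all names are illustrative; data-pairs are keyed by the originating agent's index):

```python
def disseminate(neighbours, local_data):
    """Iterative dissemination: at each round, every agent forwards only its
    newly acquired data-pairs to its neighbours, until no new data circulate.

    neighbours : list of neighbour-index lists, one per agent
    local_data : list of each agent's own data-pair (Z_k^s, l_k^s)
    Returns, per agent, a dict {origin_index: data-pair} of everything known.
    """
    S = len(local_data)
    known = [{s: local_data[s]} for s in range(S)]  # own data-pair first
    new = [dict(k) for k in known]                  # round-1 broadcast content
    while any(new):
        incoming = [dict() for _ in range(S)]
        for s in range(S):
            for t in neighbours[s]:
                for key, val in new[s].items():
                    if key not in known[t]:         # accept only unseen pairs
                        incoming[t][key] = val
        for s in range(S):
            known[s].update(incoming[s])
        new = incoming                              # forward only novelties
    return known
```

On a connected graph the loop terminates after at most diameter-many rounds, at which point every agent holds the complete set D_k.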
An illustration of the evolution of a POM and the effect of the sequential Bayes update is shown in Figure 2. A group of 13 agents, in the formation shown in Figure 1a, is placed in the lower left corner of the search area of size 100 × 100 arbitrary units (a.u.). The target is far from all agents and cannot be detected. The sensors are characterised by the parameter a = 3 a.u., while the distance between the agents is 6 a.u. The probability of false alarm within the detection (sensing) area was set to 0.005 per cell. At k = 0, see Figure 2a, all pixels of the POM are set to 1/2. At k = 10 (Figure 2b) and k = 38 (Figure 2c), the regions of the maps in the vicinity of agent locations become white, indicating a low probability of target occupancy in them. Occasional false detections increase the probability of occupancy in the affected pixels; however, with time, they all tend to zero (see Figure 2c).
By staying longer in the same position, the white areas around the agent formation grow only up to a certain saturation level, determined by the probability of detection as a function of distance. The measurements received after reaching this saturation level become increasingly uninformative. For this reason, the formation at some point in time should move to another location (as discussed in the next section).
The search is terminated when the probability of occupancy in one of the cells of the POM is higher than a given threshold, i.e., when max_{s,n} {p^s_{k,n}} > 1 − ε, with ε ≪ 1. The cell which satisfies this condition is declared to contain the target.

Formation Motion Control
The search objective is driven by two conflicting demands: exploration and exploitation [25]. Exploration drives the agents to new locations in order to investigate as much of the search volume as possible. The exploitation demand urges the agents to stay longer in one place, because this helps to determine with certainty whether a detection is false or true and improves the localisation accuracy. The balance between exploration and exploitation poses two questions: how long to stay in one position and where to move next. The intermittent search strategy [13,15] was proposed as a balance between exploration and exploitation: exploration corresponds to the ballistic flight phase, while exploitation is carried out in the diffusive phase.
In decentralised multi-agent search, each agent autonomously makes a decision about its next action. However, some form of coordination between the agents is essential, in order to collectively maintain the prescribed geometric shape of the formation and thus avoid its break-up. Section 4.1 discusses the individual decisions by agents, while the team coordination is explained in Section 4.2.

Individual Decisions
During the diffusion phase, the formation is static and agents repeatedly collect measurements. If the sensing interval (i.e., the time required to acquire a snapshot of measurements) is τ_0, then the duration of the diffusion phase is τ_d = m_0 τ_0, where m_0 ≥ 1 is the number of sensing intervals, computed as follows. Recall that, after a single sensing interval, the probability that an agent detects a target within a circle centred at its location and with radius L_0 > a is exp(−L_0/a). After an arbitrary number of sensing intervals m, the probability that the agent does not detect the target within a circle of radius L_0 is then:

P_m = (1 − exp(−L_0/a))^m.   (4)

P_m is monotonically decreasing with m. The agent should stay in one location as long as P_m > p*, where p* is a user-defined (small) probability value. Let m = m_0 be the minimal number of snapshots which satisfies P_m ≤ p*. After m_0 snapshots, the agent is certain with probability 1 − p* that the target is not within the radius L_0. Then from Equation (4) we can write:

m_0 = ceil[ ln p* / ln(1 − exp(−L_0/a)) ],   (5)

where ceil[·] is the ceiling function (defined as the smallest integer greater than or equal to its argument).
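Equation (5) can be evaluated directly; a minimal sketch (the function name and the parameterisation by the ratio L_0/a are illustrative):

```python
import math

def diffusion_snapshots(L0_over_a, p_star):
    """Smallest m such that P_m = (1 - exp(-L0/a))**m <= p_star (Eq. (5))."""
    return math.ceil(math.log(p_star) / math.log(1.0 - math.exp(-L0_over_a)))
```

For example, with L_0/a = 1 and p* = 0.005 this yields m_0 = 12: after 12 snapshots the residual miss probability P_m first drops below p*.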
After the diffusion phase, the agent wants to jump outside the explored area; that is, the length of its subsequent ballistic flight should be at least L_0. Let the speed of the ballistic flight be V_0. The ballistic time τ_b, according to Bénichou et al. [13], is:

τ_b = γ a / V_0,   (6)

where a was introduced earlier (the sensing parameter) and γ is a numerical factor dependent on the search area geometry, whose explicit form is given in [13]. Note that the value of γ slowly increases with the ratio b/a. Typically, b ≫ a and then γ > 1. The minimum length of a ballistic flight, from Equation (6) and using τ_b V_0 = L_0, equals L_0 = γa. In summary, for given a, b and p*, the count m_0 can be computed from Equation (5) as

m_0 = ceil[ ln p* / ln(1 − exp(−γ)) ].   (8)

The search starts with a diffusion phase. After collecting and exchanging all m_0 snapshots of measurements by agents in the formation, each agent compares the highest value of its local POM with a threshold, set just above 1/2, in order to test if a target has been detected. If the comparison with the threshold is positive, further investigation is required, and hence the agent moves by one step on the grid G towards the cell containing the suspected target position and repeats the diffusion phase. There are four options for this one-step move: left, right, up and down.
If the comparison is negative, the agent considers a ballistic flight, as follows. First, it creates a set of "move" actions U_k, which consists of |U_k| − 1 potential destinations for a ballistic displacement, as well as the option to remain static (no move). Let action α ∈ U_k represent the destination of the centroid of the formation (after the hypothetical ballistic flight). For each action α ∈ U_k, a reward is computed.
The reward function is defined as the information gain rate:

R_α = (H_k − E{H_α}) / (τ_b + τ_d),   (9)

where:
• H_k is the current entropy of the POM, defined as:

H_k = − Σ_{n=1}^{N} [ p^s_{k,n} ln p^s_{k,n} + (1 − p^s_{k,n}) ln(1 − p^s_{k,n}) ];   (10)

• H_α is the entropy of the POM after the hypothetical move of the agent (and the entire formation) to a new destination (commanded by action α), followed by sensing and updating its local POM during the subsequent diffusion phase;
• E is the expectation operator with respect to all possible realisations of random measurements;
• τ_b + τ_d is the time required to carry out the hypothetical move and perform sensing (the ballistic time τ_b is the time required for the agent to travel to the new location, while τ_d is the sensing time in the subsequent diffusion phase). The ballistic time τ_b is computed as the quotient of the distance to be travelled (according to action α) and the velocity of ballistic flight V_0. Recall that one of the actions in U_k is not to move; for this action, τ_b is zero.

To simplify computations, we assume that during the diffusion (sensing) phase following an action α, sensing resulted in no detections (this being the most likely scenario). In this way, we can ignore E in Equation (9).
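The entropy and reward computations can be sketched as follows, using the natural logarithm (all names are illustrative; the POM is taken as a flat list of cell occupancy probabilities):

```python
import math

def pom_entropy(pom):
    """Entropy of the POM (Eq. (10)): sum of binary entropies over all cells."""
    H = 0.0
    for p in pom:
        for q in (p, 1.0 - p):
            if q > 0.0:           # 0*log(0) is taken as 0
                H -= q * math.log(q)
    return H

def reward(H_now, H_after, tau_b, tau_d):
    """Information gain rate (Eq. (9)) for one hypothetical action."""
    return (H_now - H_after) / (tau_b + tau_d)
```

A POM of all 1/2 values (complete ignorance) attains the maximum entropy N·ln 2, while a fully resolved POM (all probabilities 0 or 1) has zero entropy, so the reward favours actions expected to shrink the uncertain (grey) regions fastest per unit time.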
Note that the rewards for all hypothetical actions are computed before the agent actually moves. The action which results in the maximum reward is selected and subsequently involved in the processing described in Section 4.2. Let us denote this action-reward pair for agent s as (α*_s, R*_s). It remains to explain how the "move" actions of U_k are created. Their number is a parameter of the algorithm. For each "move" action, the length ℓ of the ballistic flight is a random draw from the exponential distribution with mean ℓ̄ = κL_0, where κ ≥ 1 is a multiplying factor and L_0 was introduced as the minimal length of the ballistic flight. The direction of the ballistic flight is also random: it is drawn from the uniform distribution over the interval [0, 2π] rad.
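The construction of the action set U_k can be sketched as follows (the function name and the tuple representation of destinations are illustrative assumptions; angles are measured from the y-axis, consistent with the measurement model):

```python
import math
import random

def propose_actions(centroid, num_moves, L0, kappa=3.0):
    """Generate the action set U_k: the 'stay' option plus num_moves random
    ballistic destinations. Flight length ~ Exponential(mean = kappa*L0),
    flight direction ~ Uniform[0, 2*pi)."""
    actions = [centroid]  # the no-move option (tau_b = 0)
    for _ in range(num_moves):
        length = random.expovariate(1.0 / (kappa * L0))  # mean kappa*L0
        theta = random.uniform(0.0, 2.0 * math.pi)
        actions.append((centroid[0] + length * math.sin(theta),
                        centroid[1] + length * math.cos(theta)))
    return actions
```

With nine proposed moves, as in the illustrative run below, |U_k| = 10 including the static option.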

Coordination through Consensus
In a decentralised fusion architecture, each agent, independently of the other agents in the formation, makes a decision on its future action. This can result in a disagreement between the agents, unless the decided actions α*_s, s = 1, . . . , S, are all equal. The disagreement is undesirable, because it leads to a break-up of the searching formation. The consequence of the break-up is the loss of connectivity in the communication network and, ultimately, reduced effectiveness of the search. Initially, at the start of the search mission, the formation is created so as to ensure that its communication graph is connected. The goal of cooperative control is to maintain (approximately) the shape of the formation during the mission and thereby keep the communication graph connected. In addition, it prevents collisions between agents in motion. For this reason, whenever an action decision is made by an agent, it needs to engage in a protocol which ensures that all members of the formation reach an agreement on the common action to be applied by all of them.
The decentralised cooperative control is based on the consensus protocol [22,26]. Consensus algorithms are a family of iterative schemes which enable autonomous agents to achieve an agreement based on local information in a decentralised fusion architecture. Among the consensus protocols, the most widespread are the average consensus and the max-consensus [27]. Because we use both protocols in the proposed intermittent multi-agent search, they are briefly reviewed next.
Consider a graph representing the communication network used by the agents to share the data (such as those in Figure 1). Let S = {1, 2, . . . , S} be the set of vertices of the graph (representing the agents) and E ⊆ S × S the set of its edges, where an edge (s, t) ∈ E exists only if agents s and t can communicate directly. The set of neighbours of agent s ∈ S is then N(s) = {t ∈ S : (s, t) ∈ E}. Suppose all vertices (agents) in the network have locally calculated a certain scalar value or "state" (such as the action reward). The goal of the max-consensus is, for each agent, to determine the globally maximum state by communicating only with its neighbours. Let the original (initial) state of agent s ∈ S be denoted x_s(0). At each iteration τ = 1, 2, . . ., agent s communicates and updates its state according to [27]:

x_s(τ + 1) = max { x_s(τ), max_{t ∈ N(s)} x_t(τ) }.   (11)

After a sufficient number of iterations (which depends on the topology of the network), all agents will reach an agreement on the global maximum.
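A minimal synchronous sketch of the max-consensus iteration of Equation (11) (function and variable names are illustrative):

```python
def max_consensus(states, neighbours, iterations):
    """Max-consensus: at each iteration every agent replaces its state by the
    maximum over its own state and its neighbours' states (Eq. (11))."""
    x = list(states)
    for _ in range(iterations):
        # all agents update simultaneously from the previous iterate
        x = [max([x[s]] + [x[t] for t in neighbours[s]])
             for s in range(len(x))]
    return x
```

On a connected graph, the global maximum propagates one hop per iteration, so agreement is reached after at most diameter-many iterations: on a 4-agent line graph, three iterations suffice.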
Average consensus is an iterative algorithm by which agent s ∈ S computes the mean value of the collection of initial states {x_t(0); t = 1, . . . , S}, by communicating only with its neighbours. At each iteration τ, agent s updates its state according to [22]

x_s(τ + 1) = q_ss x_s(τ) + Σ_{j∈N(s)} q_sj x_j(τ),   (12)

where Q = [q_ij] is referred to as the averaging matrix. Q must be symmetric and doubly stochastic [28], and satisfies q_ij = 0 if (i, j) ∉ E. We adopt the averaging matrix as the Metropolis weight matrix [29]:

q_ij = 1 / (1 + max{d_i, d_j})  for (i, j) ∈ E, i ≠ j,
q_ii = 1 − Σ_{j∈N(i)} q_ij,
q_ij = 0  otherwise,   (13)

where d_i = |N(i)| is the degree of node i. The consensus algorithm is iterative and hence its convergence properties are very important. Although the network topology may change with time (as the agents move), during the short interval when the exchange of information takes place, the topology can be considered time-invariant. The convergence of the consensus algorithm in the adopted framework (time-invariant, undirected communication topology) is guaranteed if the graph is connected [22,29,30]. While this is a theoretical result, valid for an infinite number of iterations, in practice, due to the finite number of iterations, the consensus may not be reached and, as a consequence, one or more agents may be lost during the search (a lost agent has lost the connection with the formation). This event, however, does not mean that the search mission has failed: the target can be found eventually by the remaining (smaller) formation, albeit in a (possibly) longer interval of time.
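Both the Metropolis weight construction of Equation (13) and the averaging iteration of Equation (12) can be sketched in plain Python (names are illustrative; the graph is again given as per-agent neighbour lists):

```python
def metropolis_weights(neighbours):
    """Metropolis weight matrix (Eq. (13)) for an undirected graph:
    q_ij = 1/(1 + max(d_i, d_j)) on edges, q_ii absorbs the remainder."""
    S = len(neighbours)
    deg = [len(n) for n in neighbours]
    Q = [[0.0] * S for _ in range(S)]
    for i in range(S):
        for j in neighbours[i]:
            Q[i][j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        Q[i][i] = 1.0 - sum(Q[i])  # row sums to one (doubly stochastic)
    return Q

def average_consensus(states, neighbours, iterations):
    """Average consensus (Eq. (12)): repeated weighted averaging with the
    Metropolis matrix drives all states towards the mean of the initial ones."""
    Q = metropolis_weights(neighbours)
    x = list(states)
    for _ in range(iterations):
        x = [sum(Q[s][t] * x[t] for t in range(len(x)))
             for s in range(len(x))]
    return x
```

Because Q is symmetric and doubly stochastic, the average of the states is preserved at every iteration, and on a connected graph all states converge to that common average.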
The max-consensus is used by the agents to agree on: (1) the destination of the ballistic flight with the overall highest reward; and (2) the termination of the search. The average consensus is used to agree, by simple majority, on the direction (left, right, up or down) of the one-step move towards the closest potential target.

An Illustrative Run
The simulation setup is shown in Figure 3a. The search area is a square of size b = 100 a.u. A target is placed at coordinates (19, 75) a.u. The searching formation consists of five agents, initially arranged in a square of side d = 5, with R_max = 1.5d. Figure 3a shows the paths of the five searching agents from k = 1 to k = 375. The cyan-coloured lines connecting the agents at k = 375 indicate the established communication links (i.e., the edges of the communication graph). The parameters used in this illustrative run were a = 1 and p* = 0.005, which results in m_0 = 37. Furthermore, κ = 3, V_0 = 1 and ε = 0.0025. Each agent suggests nine ballistic flight destinations (i.e., the cardinality |U_k| = 10). The number of iterations of the consensus algorithm, both for decentralised estimation and decentralised control, was fixed at 30.

Monte Carlo Runs
Monte Carlo runs were performed to compare the performance of four different searching formations: a single agent (S = 1) and formations with S = 5, 9 and 13 agents. The initial communication graphs of the multi-agent formations are shown in Figure 4. The minimum distance between the agents is d = 5, with R_max = 1.4d. The search area is a square of size b = 100 a.u. The parameters were: a = 1, p* = 0.005, κ = 3, V_0 = 1 and ε = 0.0025. Each agent suggests nine ballistic flight destinations.
The number of Monte Carlo runs was set to 100. The obtained average search duration is shown in Figure 5, including the 5th and 95th percentile limits. The average search duration is also given in the second row of Table 1. In accordance with our expectations, the larger the formation, the shorter the search time. The law of diminishing returns is also evident: the benefit (i.e., the search time reduction) diminishes slowly with S. Other average statistics recorded from the Monte Carlo runs are shown in Table 1, in particular: (i) the average (over time and Monte Carlo runs) number of edges in the communication graph (third row); (ii) the average (over time and Monte Carlo runs) number of lost agents (fourth row); and (iii) the fraction of Monte Carlo runs in which the target was successfully located. The only surprising statistic is that the mean number of lost agents is smaller for the formation with S = 9 than for the other two. This is probably due to the shape of this formation, in which each agent is connected to at least two other nodes (see Figure 4).

Next, we discuss alternative search strategies. One option is a pure diffusion phase of search, where the agents move at most by one step on the grid and sense. This type of search would naturally take much longer to find the target, possibly an infinite length of time. Another option is the intermittent search where the choice of the ballistic flight is purely random, instead of using the ballistic flight with the maximum reward of Equation (9). The mean search time with S = 5 agents (all other parameters the same as above) was 4389 a.u. using the random choice, versus 2250 a.u. using the information reward. The benefit of the information reward is evident.

Conclusions
This report presents an intermittent information-driven search strategy by a team of coordinated agents. Both the estimation of the target occupancy map and the motion control for the entire formation were carried out in a decentralised manner, meaning that all computations were done locally. Coordination of agents was achieved using the consensus algorithm, which is an iterative scheme in which the data are exchanged with neighbours only, in a manner which does not require global knowledge of the communication network size and topology. Numerical results indicate a high success rate in finding the target, with the search duration decreasing as the size of the multi-agent formation grows.
Future work could incorporate more complex motion of the searching formation, including its rotation and scaling (growing and contracting). This has a potential to further reduce the search time, especially for larger formations.
Author Contributions: B.R., writing-original draft preparation, methodology, software and validation; A.S., conceptualization, methodology and writing-review and editing. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by DST Group under the Research Agreement with RMIT University: "Underwater Surveillance: Robust TMA and Search for Targets".

Conflicts of Interest:
The authors declare no conflict of interest.