Decentralized Policy Coordination in Mobile Sensing with Consensual Communication

In a typical mobile-sensing scenario, multiple autonomous vehicles cooperatively navigate to maximize the spatial-temporal coverage of the environment. However, as each vehicle can only make decentralized navigation decisions based on limited local observations, coordinating the vehicles for cooperation in an open, dynamic environment remains a critical challenge. In this paper, we propose a novel framework that incorporates consensual communication in multi-agent reinforcement learning for cooperative mobile sensing. At each step, the vehicles first learn to communicate with each other and then navigate based on the messages received from others. Through communication, the decentralized vehicles can share information to break through the dilemma of local observation. Moreover, we utilize mutual information as a regularizer to promote consensus among the vehicles. The mutual information enforces a positive correlation between the navigation policy and the communication message, and therefore implicitly coordinates the decentralized policies. The convergence of this regularized algorithm can be proved theoretically under certain mild assumptions. In the experiments, we show that our algorithm is scalable and converges quickly during the training phase. It also outperforms the baselines significantly in the execution phase. The results validate that consensual communication plays a very important role in coordinating the behaviors of decentralized vehicles.


Introduction
Over the past decade, the ubiquitous adoption of mobile vehicles has greatly enhanced the flexibility and convenience of environment sensing. When equipped with sensors, multiple vehicles can autonomously navigate to different locations to collect distributed environmental data. This paradigm, often referred to as mobile sensing, has attracted attention from a variety of disciplines, such as air quality sensing [1], traffic monitoring [2], fire detection [3], etc. For example, in a smart home, multiple devices (e.g., sweeping robots) can cooperate to sense the environment and perform related tasks [4], such as cleaning and tidying.
In a typical mobile sensing scenario, multiple events (e.g., fire, traffic jam, and pollution emission) may occur randomly and dynamically (depicted in Figure 1). Detecting such events in time is crucial for the mobile sensing application. However, since each vehicle can only observe the local environment within a limited radius, one central problem emerging is how to navigate the decentralized vehicles to maximize the spatial-temporal coverage of the events. As the vehicles need to make sequential navigation decisions, reinforcement learning (RL), in particular, multi-agent reinforcement learning (MARL) methods, have become a promising approach. RL methods can be model-free to optimize the navigation policies through exploration and exploitation. They are, therefore, applicable in different scenarios, even when the environmental model is not assumed [5,6].  Despite the progress made in recent years, one critical challenge that has been largely overlooked is the decentralized coordination of the vehicles. As illustrated in Figure 1, the events are mostly distributed at the left and right sides of the map. It could be better if one of the right vehicles moves to the left area for sensing. However, without coordination, the right vehicles may compete to sense nearby events, leading to wasted sensing efforts. One possible direction to tackle this challenge is to use a centralized controller that manages the policies of all vehicles. However, centralized approaches may face the problem of "single point of failure" and low scalability.
To navigate multiple vehicles in an open, dynamic environment, we adopt the MARL as the basic solution. However, in the execution phase, the vehicles may still have uncoordinated behaviors due to the lack of common consensus [7][8][9]. Inspired by the recent advances of learning to communicate [10,11], we can also introduce the communication mechanism in the cooperative navigation. On one hand, the common signal can provide global information from all the vehicles. On the other hand, the other vehicles' moving actions can also be inferred if there is positive correlation with the signal.
Our Method In this paper, we consider the decentralized management of the mobile vehicles, and introduce a communication-based framework to coordinate the behaviors of the vehicles. At each step before moving, the vehicles should first broadcast communication messages to others to share information. Afterwards, when receiving the communication messages from others, each vehicle can be conditioned on the received messages to take navigation actions. By adopting this communication framework, the vehicles can share information with each other to break through the dilemma of local observation. In particular, the communication message is also learned via reinforcement learning with the aim to maximize the spatial-temporal coverage of the events. This learning to communicate framework is flexible, and can be applicable in different dynamic environments.
One major concern in the communication framework is that the vehicles may simply ignore the communication message and focus only on local observations. To deal with this problem, in this paper, we try to maximize the mutual information between the received messages and the vehicles' navigation policies. By maximizing this term, the mobile vehicles can correlate their policies with the received messages. Intuitively, a positive correlation implies that other vehicles' policies can be inferred based on the received message. Therefore, the vehicles can achieve consensus implicitly. Theoretical analysis shows that this regularized algorithm can converge to equilibrium points under certain mild conditions.
In the experiment part, we implement and evaluate the proposed algorithm in a simulation environment built from a real-world data set. We first validate the decentralized algorithm in both the training and execution phases. The results show that the consensual communication framework can successfully coordinate the behaviors of the decentralized vehicles. The mutual information term plays an important role in the coordination. Our method can also adapt to multiple scenarios with different hyper-parameters. In different settings, our algorithm can consistently outperform other baselines. Our work can be widely adopted in different fields, such as smart homes, smart city, agriculture, etc.

Contributions
Our key contributions are listed as follows:
• We model the mobile sensing problem as a decentralized sequential optimization problem, where the vehicles navigate to maximize the spatial-temporal coverage of the events in the environment.
• A communication framework is proposed for cooperative navigation. In particular, the communication protocol is learned by model-free reinforcement learning methods.
• We explicitly correlate the vehicles' moving policies with the communication messages to promote coordination. The regularized algorithm can be proved to converge to equilibrium points under certain mild assumptions.
• Extensive experiments are conducted to show the effectiveness of our approach.

Organization
The rest of the paper is organized as follows. We first introduce the related work in Section 2. Next, we formulate the system model and the optimization objective in Section 3. Section 4.1 presents the framework of learning to communicate. We then present how to enforce positive communication in Section 5. Evaluation is given in Section 6. We conclude the paper in Section 7.

Related Work
In this section, we first introduce the recent advances in reinforcement learning, which is the main technical solution in this work. Next, we will review the related works of mobile sensing, with a focus on how to navigate the mobile vehicles in the environment to maximize the event coverage.

Reinforcement Learning
Reinforcement learning (RL) has achieved great success in a wide range of areas, such as the game of Go [12], Atari [13], StarCraft [8], etc. The problem of RL can generally be modeled as a Markov decision process (MDP) $\langle S, A, T, R, \gamma \rangle$, where $S$ is the state space, $A$ is the action space, and $T : S \times A \to S$ is the transition model for generating the next state. $R : S \times A \to \mathbb{R}$ is the reward function. $\gamma \in (0, 1]$ is a discount factor. At each step $t$, when an agent observes the state $s_t \in S$ and executes an action $a_t \in A$, it transitions into a new state $s_{t+1}$ and receives an immediate reward $r_t$ with probability $p(s_{t+1}, r_t \mid s_t, a_t) \in T$. Let $R_t$ denote the cumulative return at time $t$. In an infinite-horizon MDP, the cumulative return can be represented as
$$R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}.$$
The goal of reinforcement learning is to find the optimal policy $\mu^*$ that maximizes the return: $\mu^* = \arg\max_{\mu} E_{\mu}[R_0]$, where the policy $\mu(a_t \mid s_t)$ is a function that maps the state $s_t$ to a distribution over actions $a_t$. An MDP satisfies the Bellman equality:
$$Q(s_t, a_t) = E\left[ r_t + \gamma\, v(s_{t+1}) \mid s_t, a_t \right], \quad (1)$$
where $Q(s, a) = E[R_t \mid s_t = s, a_t = a]$ is the state-action value function and $v(s) = E[R_t \mid s_t = s]$ is the value function of state $s$.
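As a small illustration of the cumulative return, the sketch below evaluates the discounted sum on a finite reward sequence using the recursion $R_t = r_t + \gamma R_{t+1}$. Truncating the infinite horizon to a finite list and the function name `discounted_return` are assumptions made for illustration only.

```python
def discounted_return(rewards, gamma):
    """Return R_0 = sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Walk backwards so that each step applies R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Rewards 1, 0, 2 with gamma = 0.5 give 1 + 0.5 * 0 + 0.25 * 2 = 1.5.
example_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

With a discount factor below one, later rewards contribute geometrically less, which is why a long finite sequence approximates the infinite-horizon return.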
The process of RL can be generally divided into training and execution phases. In the training phase, the RL agent uses exploration and exploitation in the environment to optimize the policy. While in the execution phase, the agent will fix the policy parameters in the environment. In this paper, as the vehicles need to move in a continuous space, we focus on DDPG [14,15], which can generate continuous actions. In DDPG, there is a critic function to evaluate the state-action value by following a deterministic policy µ as Q µ (s t , a t ), and an actor function which maps the state s t to a deterministic action, a t = µ(s t ).
Recently, multi-agent reinforcement learning (MARL) has also become an active research topic. MARL models the environment as a decentralized partially observable Markov decision process (Dec-POMDP) [8,9], defined by a tuple $\langle S, T, A, R, O, I \rangle$, where $O$ is the set of local observations and $I$ is the set of agents. The agents make decisions based on their observations. Let $o_i^t \in O \subseteq S$ be the local observation of agent $i$ at step $t$. Each agent $i$ chooses an action $a_i^t \in A$, forming a joint action $a^t \in A^n$, and the environment transitions to the next state $s^{t+1} \in S$ according to the function $p(s^{t+1}, r^t \mid s^t, a^t) \in T$, where the reward $r^t \in R$ is shared by all the agents.
To optimize the policies of the agents in MARL, previous works, such as COMA [8], MADDPG [9], QMIX [7], etc., mainly adopted the "centralized training, decentralized execution" (CTDE) mechanism: during training, global state information can be used to train the policy network, whereas during execution, the agents can only condition on local observations. In the execution phase, the agents could still change their policies dynamically, leading to a lack of coordination among the decentralized policies. Moreover, we argue that such a CTDE mechanism may not be applicable in decentralized environments where the agents can only be trained separately. Recent works have considered methods of learning to communicate [16][17][18][19], where the communication policy is learned via RL. We also adopt this mechanism in our work. In comparison to previous works [16,17,19] that mostly use lazy communication, we propose to enforce positive communication so that the messages can be utilized more efficiently. Moreover, most previous works only used ungrounded, cheap talk for communication [10]. We argue that such cheap-talk communication may not be effective for coordination.

Mobile Sensing
Mobile sensing has been extensively studied with the emergence of autonomous vehicles. One of the main problems is maximizing the coverage of events in the environment. Earlier works mostly assumed that the environment model is known a priori and proposed combinatorial optimization methods. For example, Karaliopoulos et al. [20] modeled the problem as a cover problem and proposed an approximation algorithm. Hu et al. [21] also proposed mobile sensing methods with spatial-temporal awareness, adopting a combinatorial pinning zero-determinant (ZD) strategy to find a cost-efficient mobile sensing strategy. In comparison, our work addresses the dynamics of the environment, and the coordinated policies of different mobile users are learned via repeated interactions.
As the users make independent decisions, decentralized algorithms based on game theory were also considered. Rahili et al. [22] designed a rule-based communication protocol in which agents can communicate with local neighbors and use their local information to make decisions. Esch et al. [23] described a distributed algorithm where the agents can communicate with one another wirelessly within a fixed communication radius. Li et al. [24] modeled mobile crowdsourcing as a Stackelberg game, and proposed a three-party evolutionary game model for task allocation. However, most previous methods are hard to generalize to unseen scenarios. In an open environment, it is critical for the agents to adapt to dynamic environment events. Data privacy is also important in mobile sensing and has recently become an active research topic [25][26][27][28][29]. In comparison, we focus more on the navigation of the mobile vehicles than on the data-collecting process.
When the environment model is unknown, machine learning approaches attract attention [30,31]. In particular, as the environment is often dynamic [32,33], online learning or RL-based algorithms, which are sequential and model-free, are widely considered. An et al. [34] adopted the multi-armed bandit method to select users to improve service quality. However, bandit algorithms neglect the sequential behavior of agents and may not be feasible for mobile sensing problems. As RL uses deep learning to extract a representation of the environment for exploration and exploitation, it is naturally applicable in dynamic environments. For example, Zhang et al. [35] adopted RL in a coarse-to-fine deep scheme to address the aspect ratio variation in UAV tracking. Liu et al. [36,37] used deep RL for high-quality data collection. The main idea is to employ multiple mobile vehicles that schedule their paths independently to maximize the coverage of distributed points of interest (POIs). Zeng et al. [38] divided the problem into four subproblems, and used an iterative algorithm to solve the overall optimization problem. Liu et al. [5] proposed a multi-UAV mobile sensing framework based on multi-agent reinforcement learning (MARL), and utilized "centralized training, decentralized execution" (CTDE) for cooperation. Wei et al. [6] considered the multi-robot informative path planning problem and proposed independent learning through credit assignment for cooperative sensing. Samir et al. [39] leveraged unmanned aerial vehicles (UAVs) for mobile sensing and proposed an RL approach to maximize the sensing coverage. A major challenge in these works is to coordinate the policies of different mobile vehicles for cooperation. While most previous works implicitly learn the cooperation policies for each agent, in our work, we argue that coordination is crucial and explicitly propose policy coordination methods based on consensual communication.

System Model
In this paper, we consider a mobile sensing problem where a set of mobile vehicles $I = \{1, 2, \dots, N\}$ cooperate to maximize the spatial-temporal sensing coverage of the events in the environment. Suppose the time horizon is divided into an infinite sequence of discrete intervals $\{0, 1, 2, \dots\}$. At each interval $t$, each vehicle $i \in I$ at position $(x_i^t, y_i^t)$ needs to decide the moving action $a_i^t$, which can be represented as a tuple of speed $\nu_i^t$ and moving angle $\phi_i^t$:
$$a_i^t = (\nu_i^t, \phi_i^t).$$
The battery has a consumption rate $\Delta_i^t$ that is linear in the vehicle speed, i.e., $\Delta_i^t = \beta \nu_i^t + \Delta_0$, where $\beta$ is a coefficient and $\Delta_0$ is a constant intrinsic battery consumption. The battery capacity $b_i^t$ will be updated as
$$b_i^{t+1} = b_i^t - \Delta_i^t.$$
To avoid running out of power, the vehicles should regularly move to a charging station, where the battery is recharged by a fixed number of units $b_0$ at each interval.
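The battery dynamics described above can be sketched as follows. The default values are taken from the experiment setup ($\beta = 1$, $\Delta_0 = 1$, $b_0 = 20$, $b_{max} = 40$). Two details are assumptions made for this illustration: a vehicle at a charging station still pays its consumption for that interval, and the battery level is capped at the capacity. The function name `battery_step` is hypothetical.

```python
def battery_step(b, speed, at_station=False,
                 beta=1.0, delta0=1.0, b0=20.0, b_max=40.0):
    """Return the battery level after one interval.

    Consumption is linear in speed: delta = beta * speed + delta0.
    Assumption: charging adds b0 units and the level is capped at b_max.
    """
    consumption = beta * speed + delta0
    b_next = b - consumption
    if at_station:
        b_next += b0
    return min(b_next, b_max)


def is_depleted(b):
    """A vehicle can no longer move or sense once its battery is below zero."""
    return b < 0
```

For instance, a fully charged vehicle moving at the maximum speed of 2 loses 3 units per interval under these defaults.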
In the environment, random events may happen at different positions with time-varying intensities. Let $E$ be the set of events. We use $\tau_e^t$, $e \in E$, to represent the intensity of event $e$ at step $t$. The event $e$ at position $(x_e^t, y_e^t)$ is sensed (covered) by vehicle $i$ if it is within a limited radius of $i$. Let $\mathbb{1}_{ie}^t$ be an indicator function that represents whether the event is covered by vehicle $i$:
$$\mathbb{1}_{ie}^t = \begin{cases} 1, & \sqrt{(x_i^t - x_e^t)^2 + (y_i^t - y_e^t)^2} \le l_i, \\ 0, & \text{otherwise}, \end{cases}$$
where $l_i$ is the sensing radius of vehicle $i$. The benefit will be $\tau_e^t$ if the event $e$ is covered by at least one of the mobile vehicles. Note that if multiple vehicles cover the same event $e$ simultaneously, the benefit is still $\tau_e^t$. Therefore, the vehicles should cooperate to avoid repeatedly sensing the same event. We use $\mathbb{1}_e^t$ as an indicator that the event $e$ is covered by at least one vehicle at interval $t$, i.e., $\mathbb{1}_e^t = \max\{\mathbb{1}_{1e}^t, \mathbb{1}_{2e}^t, \dots, \mathbb{1}_{Ne}^t\}$. The problem can then be formulated as finding the joint moving policies for the vehicles, so that the cumulative sensing coverage of the events is maximized:
$$\max_{\mu_1, \dots, \mu_N} \sum_{t=0}^{\infty} \sum_{e \in E} \tau_e^t \mathbb{1}_e^t \quad \text{s.t.} \quad b_i^t \ge 0, \ \forall i \in I, \ \forall t. \quad (4)$$
The inequality constraint in the objective means that the mobile vehicles can no longer move or sense when running out of battery. According to the objective, the vehicles need to make sequential navigation decisions to cover the dynamic events. However, as the vehicles make decentralized decisions, it could be difficult for the vehicles to know others' observations or intentions. This brings the dilemma of local observation and will be the main focus of this paper. Table 1 summarizes the key parameters in this paper.
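The coverage rule above can be illustrated with a short sketch: an event counts once if at least one vehicle is within the sensing radius, no matter how many vehicles cover it. The data layout (position tuples, `(x, y, intensity)` event triples, a shared radius) and the function names are assumptions made for illustration.

```python
import math


def covers(vehicle_xy, event_xy, radius):
    """Indicator that an event lies within the vehicle's sensing radius."""
    return math.dist(vehicle_xy, event_xy) <= radius


def global_coverage(vehicles, events, radius):
    """Sum the intensities of events covered by at least one vehicle.

    vehicles: list of (x, y) positions.
    events:   list of (x, y, intensity) triples.
    """
    total = 0.0
    for ex, ey, tau in events:
        # max over the per-vehicle indicators is the same as `any`.
        if any(covers(v, (ex, ey), radius) for v in vehicles):
            total += tau
    return total


# Two vehicles cover the same event; it is only counted once.
vehicles = [(0.0, 0.0), (1.0, 0.0)]
events = [(0.5, 0.0, 3.0), (10.0, 10.0, 4.0)]
coverage = global_coverage(vehicles, events, radius=2.0)
```

In this toy layout both vehicles overlap on the first event and nobody reaches the second, which is the wasted-effort situation the coordination mechanism is meant to avoid.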

Learning to Communicate
To break through the dilemma of local observation, in this section, we first formulate the problem as a Markov game. Then we formally introduce the communication framework, where the vehicles can share information with each other. Finally, we will show how to optimize the moving policies of each vehicle under this framework.

Mobile Sensing as a Markov Game
According to the system model, we can formulate the mobile sensing problem as a Dec-POMDP with the tuple $\langle S, O, A, T, R, I \rangle$, where the set of agents $I$ represents the mobile vehicles. We define the other elements as follows:
• State: In the mobile sensing problem, at each interval $t$, the system state $s^t \in S$ includes the global information of the environment.
• Observation: In the environment, each vehicle $i$ can only partially observe the state. The observation $o_i^t \in O$ is a subset of the environment state: $o_i^t \subseteq s^t$. We assume that each vehicle can observe the environment information within the sensing radius $l_i$, including its own position, last moving action, remaining battery capacity and sensed events.
• Action: The action of the mobile vehicle $i$ is a continuous tuple $a_i^t = (\nu_i^t, \phi_i^t)$, where $\nu_i^t$ is the speed, and $\phi_i^t$ represents the moving angle. At each interval, all the vehicles take their moving actions to form a joint action $a^t = (a_1^t, a_2^t, \dots, a_N^t)$.
• Transition: Given the joint actions of the vehicles, the environment transitions to the next state $s^{t+1}$ according to the transition function:
$$s^{t+1} \sim p(s^{t+1} \mid s^t, a^t).$$
Note that this function is not assumed to be known, and can only be inferred through repeated interaction with the environment.
• Reward: As the mobile vehicles cooperate to maximize the spatial-temporal coverage of the environment, we define a global reward as the total intensity of the sensed events:
$$r^t = \sum_{e \in E} \tau_e^t \mathbb{1}_e^t.$$
However, it is difficult for each vehicle to infer its contribution to the global reward. Therefore, we decompose the reward function and define the individual reward for each vehicle $i$ as
$$r_i^t = \sum_{e \in E:\, \mathbb{1}_e^t = 1} \frac{\tau_e^t \mathbb{1}_{ie}^t}{\sum_{j \in I} \mathbb{1}_{je}^t}.$$
The reward function indicates that the reward of sensing event $e$ is divided equally among the vehicles that cover $e$ at this step. It is easy to see that $r^t = \sum_{i \in I} r_i^t$. To take the battery capacity into account, we relax the constraint in Equation (4) with an additional penalty term $c$ when the vehicle runs out of battery power. The vehicles receive this penalty when the capacity is below zero, i.e., $b_i^t < 0$. The value of $c$ balances the preference between the sensing reward and the penalty of battery loss. The relaxed version of the reward can be formulated as
$$\tilde{r}_i^t = r_i^t - c \cdot \mathbb{1}[b_i^t < 0].$$
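The reward decomposition described above can be sketched as follows: each covered event's intensity is split equally among the vehicles covering it, so the individual rewards sum to the global reward. Subtracting the penalty $c$ when the battery is below zero is an assumed form of the relaxed reward, and the function name and data layout are hypothetical.

```python
import math


def individual_rewards(vehicles, events, radius, batteries, penalty=40.0):
    """Split each covered event's intensity equally among its covering vehicles.

    vehicles:  list of (x, y) positions.
    events:    list of (x, y, intensity) triples.
    batteries: remaining battery level per vehicle.
    Assumption: a depleted vehicle (battery < 0) has `penalty` subtracted.
    """
    rewards = [0.0] * len(vehicles)
    for ex, ey, tau in events:
        coverers = [i for i, v in enumerate(vehicles)
                    if math.dist(v, (ex, ey)) <= radius]
        for i in coverers:
            rewards[i] += tau / len(coverers)
    return [r - penalty if b < 0 else r for r, b in zip(rewards, batteries)]


vehicles = [(0.0, 0.0), (1.0, 0.0), (20.0, 20.0)]
events = [(0.5, 0.0, 3.0), (20.0, 21.0, 4.0)]
rewards = individual_rewards(vehicles, events, radius=2.0,
                             batteries=[10.0, 10.0, -1.0])
```

Here the first two vehicles share one event, while the third covers another event alone but is penalized for a depleted battery.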

The Communication Framework
As the vehicles only have limited observation, we introduce a communication framework to share information among the vehicles. Figure 2 presents an illustration of the communication procedure. We now separately describe how to broadcast and receive the messages.
Communication Broadcasting As presented in Figure 2, at each step $t$ before moving, each vehicle $i$ first broadcasts a message $m_i^t$ to other vehicles. When broadcasting the message, an intuitive idea is to send the observation $o_i^t$ and the intended action $a_i^t$ to other vehicles. However, this is not possible, since the action $a_i^t$ is itself conditioned on the messages received from others. Moreover, the dimension of the observation may be large, incurring high transmission overhead. Instead, we introduce the mechanism of learning to communicate. Suppose vehicle $i$ uses a communication policy network $\mu_i^m(o_i^t)$ parameterized by $\theta_i^m$ to output the message content $m_i^t$, which can be a fixed-size continuous vector. In particular, the communication policy network can be optimized via the RL-based algorithm, where the goal is to maximize the long-term cumulative sensing coverage of the events. By learning to communicate, the vehicles can encode the observations and intentions into a compact embedding, which significantly reduces the transmission cost. Moreover, it is flexible enough to deal with different scenarios and environments. More details on how to optimize the communication policy network will be introduced in Section 4.3.
Communication Receiving After broadcasting, each vehicle also receives the messages from other vehicles: $m^t = (m_1^t, m_2^t, \dots, m_N^t)$. The messages can be aggregated with different operators, such as mean or max, or with neural networks, such as recurrent neural networks (RNNs). The aggregated message can be represented as $m_g = \mathrm{AGG}(m^t)$, where AGG is the aggregator of the received messages. Suppose the moving policy of vehicle $i$ is represented as $\mu_i(\cdot)$. It conditions on the local observation and the received messages for moving: $a_i^t = \mu_i(o_i^t, m_g)$.
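The receiving step can be sketched with the two simplest aggregators named above, element-wise mean and max. Representing messages as plain lists of floats and concatenating the aggregate to the observation as the policy input are assumptions for this illustration; a learned aggregator such as an RNN is not shown.

```python
def aggregate(messages, mode="mean"):
    """Aggregate fixed-size message vectors element-wise into m_g."""
    dim = len(messages[0])
    if mode == "mean":
        return [sum(m[k] for m in messages) / len(messages) for k in range(dim)]
    if mode == "max":
        return [max(m[k] for m in messages) for k in range(dim)]
    raise ValueError(f"unknown aggregation mode: {mode}")


def policy_input(observation, messages, mode="mean"):
    """Build the input of the moving policy: local observation plus m_g."""
    return list(observation) + aggregate(messages, mode)


messages = [[1.0, 2.0], [3.0, 0.0]]
mean_msg = aggregate(messages)            # element-wise mean
max_msg = aggregate(messages, "max")      # element-wise max
```

Both operators are permutation-invariant and keep the aggregate the same size regardless of how many vehicles broadcast, which is convenient when the number of vehicles varies.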

Policy Optimization
With the communication framework, we can now optimize the moving policy networks $\mu_i(\cdot)$ and communication policy networks $\mu_i^m(\cdot)$ for each vehicle $i \in I$. As the moving action of each vehicle is a continuous vector, we use DDPG for policy optimization. Let $Q_i(\cdot)$ be the action value function (critic) parameterized by $\theta_i^Q$. (We temporarily omit the time indicator $t$. The superscript $-$ indicates $t - 1$ and $'$ indicates $t + 1$.) The policy functions $\mu_i^m(\cdot)$, $\mu_i(\cdot)$ and the critic function $Q_i(\cdot)$ can all be implemented with neural networks. The parameters $\theta_i$ of the moving policy network $\mu_i(\cdot)$ can be updated according to the deterministic policy gradient theorem [14]:
$$\nabla_{\theta_i} J(\mu_i) = E_{D}\left[ \nabla_{\theta_i} \mu_i(o_i, m_g)\, \nabla_{a_i} Q_i(o_i, a_i, m_g) \big|_{a_i = \mu_i(o_i, m_g)} \right], \quad (9)$$
where $J(\cdot)$ is the return of the policy and $D$ is the set of historical data samples. Similarly, we can also update the parameters of the communication policy network $\mu_i^m$ as
$$\nabla_{\theta_i^m} J(\mu_i^m) = E_{D}\left[ \nabla_{\theta_i^m} \mu_i^m(o_i)\, \nabla_{m_i} Q_i(o_i, a_i, m_g) \big|_{m_i = \mu_i^m(o_i)} \right], \quad (10)$$
where $\theta_i^m$ represents the parameters of the communication policy network. The action value network can be updated by minimizing the temporal difference (TD) error:
$$L(\theta_i^Q) = E_{D}\left[ \left( Q_i(o_i, a_i, m_g) - y_i \right)^2 \right], \quad y_i = r_i + \gamma\, Q_i(o_i', \mu_i(o_i', m_g'), m_g'). \quad (11)$$
According to the above formulations, we can update the parameters of the policy networks and action value networks simultaneously. Compared to the CTDE framework, which requires centralized training, in our framework, the networks can be optimized independently based on the local observation and communication messages. Therefore, this framework is applicable in decentralized training scenarios.
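The critic update described above can be illustrated with a framework-free sketch of the TD target and the mean squared TD error. The critic and actor are passed in as plain callables, and the toy scalar stand-ins below are hypothetical. Target networks, which DDPG implementations commonly use, are omitted here as a simplification.

```python
def td_target(reward, gamma, critic, actor, next_obs, next_msg):
    """Bootstrap target: r_i + gamma * Q_i(o_i', mu_i(o_i', m_g'), m_g')."""
    next_action = actor(next_obs, next_msg)
    return reward + gamma * critic(next_obs, next_action, next_msg)


def td_loss(batch, gamma, critic, actor):
    """Mean squared TD error over a minibatch of transitions.

    Each transition is (obs, action, msg, reward, next_obs, next_msg).
    """
    squared_errors = []
    for obs, action, msg, reward, next_obs, next_msg in batch:
        y = td_target(reward, gamma, critic, actor, next_obs, next_msg)
        squared_errors.append((critic(obs, action, msg) - y) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Toy scalar stand-ins for the networks (hypothetical, for illustration only).
toy_critic = lambda obs, action, msg: obs + action + msg
toy_actor = lambda obs, msg: 0.0
batch = [(1.0, 0.5, 0.0, 2.0, 1.0, 0.0)]
loss = td_loss(batch, gamma=0.9, critic=toy_critic, actor=toy_actor)
```

Note that every quantity entering the target is local to vehicle $i$ (its own observation, action, and the aggregated message), which is what allows the critic to be trained without a central controller.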

Consensual Communication
By learning to communicate, the mobile vehicles can share local information with each other. However, previous works have shown that selfish agents do not learn to use this type of ungrounded, cheap talk communication channel effectively [11]. In this section, we first try to enforce the mobile vehicles to have consensual communication, i.e., the communication will indeed influence the vehicles' behaviors. Next, we show that the algorithm can converge under the communication framework.

Mutual Information for Consensual Communication
To enforce positive communication, we maximize the mutual information between the moving policy $\mu_i$ and the aggregated message from $i$'s neighbors, $m_g$. Intuitively, by maximizing the mutual information, the vehicle can correlate its moving policy with the messages from neighbors. This can also be regarded as reducing the uncertainty of the vehicle's moving policy after receiving the messages. Formally, we augment the reward function as follows:
$$\hat{r}_i = r_i + \rho\, I(\mu_i; m_g), \quad (12)$$
where $\rho \in [0, 1]$ is a hyper-parameter that controls the importance of the mutual information term $I(\mu_i; m_g)$. The mutual information term can be expressed in terms of entropy and conditional entropy:
$$I(\mu_i; m_g) = H(m_g) - H(m_g \mid \mu_i), \quad (13)$$
where $H(\cdot)$ is the entropy function. The mutual information becomes zero if the communication message does not influence the moving policy. In this case, $H(m_g)$ equals $H(m_g \mid \mu_i)$. Maximizing the mutual information means that we enforce all the vehicles to correlate their policies with the message. Thus, the vehicles can infer their neighbors' behaviors from the broadcast message, which implicitly promotes coordination among the vehicles. However, directly maximizing the mutual information is intractable. We instead introduce the variational distribution $q(m_g \mid \mu_i)$ as a proxy for the true posterior $p(m_g \mid \mu_i)$. Learning a neural network to predict the messages based on the policy $\mu_i$ provides a lower bound on the mutual information:
$$I(\mu_i; m_g) = H(m_g) + E_{m_g, \mu_i}\left[\log q(m_g \mid \mu_i)\right] + E_{\mu_i}\left[ D_{KL}\left( p(\cdot \mid \mu_i) \,\|\, q(\cdot \mid \mu_i) \right) \right] \ge H(m_g) + E_{m_g, \mu_i}\left[\log q(m_g \mid \mu_i)\right], \quad (14)$$
where $D_{KL}$ is the KL divergence between two probability distributions. The inequality holds because the KL divergence is non-negative. In practice, as the policy $\mu_i$ is a network, we use historical observation-action trajectories $traj_i$ to represent the policy. The network structure of our framework is presented in Figure 3. For each vehicle, there are four associated neural networks: one critic network, two actor networks, and an additional variation network that is used for policy coordination. The output of the critic network can be used to update the actor networks during training.
For the variation network, even though the gradient cannot be backpropagated to update the actor-critic networks, the augmented reward function can guide the mobile vehicles to generate coordinated behaviors. In the network structures, FC means fully connected, and GRU is a gated recurrent unit. The GRU is used to extract information from the sequential observations. More details of the network parameters will be introduced in the experiment part. As the network parameters for each vehicle can be optimized in a decentralized way, this framework can scale to a large number of mobile vehicles.
Algorithm We now formally present the algorithm in Algorithm 1 for an ego vehicle $i$. In this algorithm, we first initialize the parameters of the networks for the ego agent $i$. At each step, we generate the broadcast message $m_i$ based on the current observation $o_i$. The agent then receives and aggregates the messages from others and executes the action $a_i$. The tuples are stored in the replay buffer $D$. During training, we sample a minibatch of tuples from the buffer and perform gradient backpropagation to update the critic network and actor networks. Finally, the variation network is also trained by maximizing the mutual information.
Algorithm 1. Training procedure for an ego vehicle $i$.
Initialize the moving actor $\mu_i(\cdot)$, communication actor $\mu_i^m(\cdot)$, critic $Q_i(\cdot)$, variation network $q(\cdot)$ and replay buffer $D$;
for each step do
    Generate and broadcast the message $m_i$ based on the observation $o_i$;
    Receive and aggregate the messages from other vehicles;
    Execute the action $a_i$ and store the tuple in $D$;
    Sample a minibatch of tuples from $D$;
    Update moving actor $\mu_i(\cdot)$ with Equation (9);
    Update critic network $Q_i(\cdot)$ with Equation (11);
    Update communication actor $\mu_i^m(\cdot)$ with Equation (10);
    Update variation network $q(\cdot)$ by maximizing Equation (14);
end
Complexity According to the above algorithm, we give a formal analysis of the time complexity of the training phase for each ego vehicle. At each step of training, the vehicle needs to sample $K$ tuples and update the networks. The update of the networks takes $O(1)$ time per tuple for gradient descent. Suppose convergence takes $C$ steps. The time complexity of the algorithm is then $O(KC)$. In the experiments, we will show that with a batch size of $K = 256$, the algorithm takes about $C = 100{,}000$ steps to converge. In practice, this computation runs very quickly on a CUDA device. During execution, the policy can be computed in $O(1)$ time.
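The variational term used for consensual communication can be sketched as follows. The sketch assumes the variation network outputs the mean of an isotropic Gaussian $q(m_g \mid traj_i)$ with fixed standard deviation, and that the bonus added to the reward is $\rho \log q(m_g \mid traj_i)$, dropping $H(m_g)$ as a term the vehicle does not control. Both choices, as well as the function names and the default value of `rho`, are assumptions for illustration.

```python
import math


def gaussian_log_likelihood(message, predicted_mean, sigma=1.0):
    """log q(m_g | traj_i) under an isotropic Gaussian with std `sigma`."""
    log_norm = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(log_norm - 0.5 * ((m - mu) / sigma) ** 2
               for m, mu in zip(message, predicted_mean))


def augmented_reward(reward, message, predicted_mean, rho=0.1):
    """Sensing reward plus rho times the variational log-likelihood bonus."""
    return reward + rho * gaussian_log_likelihood(message, predicted_mean)


# A message that is well predicted from the trajectory earns a larger bonus.
good = gaussian_log_likelihood([0.2, -0.1], [0.2, -0.1])
poor = gaussian_log_likelihood([0.2, -0.1], [2.0, 1.5])
```

The bonus is largest when the aggregated message is predictable from the vehicle's own behavior, which is the sense in which the policy and the message become correlated.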

Convergence Analysis
Given the above algorithm, in this section, we formally show that the value functions $Q_i$, $i \in I$, can converge to an equilibrium point under certain assumptions:
Assumption 1. Every state $s \in S$ and action $a_i \in A$, for $i \in I$, is visited infinitely often.
Assumption 2. The critic learning rates $\alpha_t$ for optimizing Equation (11) satisfy $\sum_{t} \alpha_t(s, a) = \infty$ and $\sum_{t} [\alpha_t(s, a)]^2 < \infty$, which hold uniformly with probability 1.
Assumption 3. The aggregated message $m_g$ is a compact representation of the global state $s$ and the joint action $a$.
Assumption 4. The vehicles play the best response with respect to the broadcast message $m_g$.
Assumptions 1 and 2 are weak ones that are easy to meet. Assumption 3 is met if (1) the communication message $m_g$ can encode the entire state without information loss; (2) every other vehicle's policy can be inferred based on $m_g$. The two conditions are reasonable according to our communication-based framework. Assumption 4 is a strong assumption. It may not be easily met. However, our experiments suggest that this assumption is mostly satisfied in practice, since the algorithm converges in different scenarios. The convergence result mainly originates from the following lemma [40]:
Lemma 1. (Szepesvari and Littman (1999), Corollary 5) Assume $\alpha_t$ satisfies Assumption 2 and the mapping $P_t : \mathcal{Q} \to \mathcal{Q}$ satisfies the following condition: there exists a number $0 < \gamma < 1$ and a sequence $\lambda_t \ge 0$ converging to zero with probability 1 such that $\|P_t Q - P_t Q^*\|_\infty \le \gamma \|Q - Q^*\|_\infty + \lambda_t$ for all $Q \in \mathcal{Q}$ and $Q^* = E[P_t Q^*]$. Then the iteration defined by
$$Q_{t+1} = (1 - \alpha_t) Q_t + \alpha_t [P_t Q_t] \quad (15)$$
converges to $Q^*$ with probability 1.
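The kind of iteration covered by Lemma 1 can be illustrated numerically on a scalar toy problem: $Q_{t+1} = (1 - \alpha_t) Q_t + \alpha_t [P_t Q_t]$ with a noisy contraction $P_t Q = r + \gamma Q + \varepsilon_t$. The step size $\alpha_t = t^{-0.75}$ (whose sum diverges while the sum of squares converges), the zero-mean uniform noise, and all numeric values are assumptions chosen for this illustration, not the paper's setting.

```python
import random


def run_iteration(steps=20000, gamma=0.5, reward=1.0, seed=0):
    """Iterate Q <- (1 - alpha_t) Q + alpha_t * (P_t Q) on a scalar toy problem.

    P_t Q = reward + gamma * Q + noise_t is a gamma-contraction in Q whose
    expected fixed point is Q* = reward / (1 - gamma).
    """
    rng = random.Random(seed)
    q = 0.0
    for t in range(1, steps + 1):
        alpha = t ** -0.75                     # sum = inf, sum of squares < inf
        noise = rng.uniform(-0.5, 0.5)         # zero-mean sampling noise
        target = reward + gamma * q + noise    # one sample of P_t Q
        q = (1.0 - alpha) * q + alpha * target
    return q


q_final = run_iteration()
q_star = 1.0 / (1.0 - 0.5)   # fixed point of the noise-free mapping
```

Despite the noise in every update, the decaying step size averages it out and the iterate settles near the fixed point $Q^* = 2$.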
According to Assumption 3, the message $m_g$ is a compact representation of the global state $s$ and actions $a$. Therefore, $Q_i(s, a) = Q_i(o_i, a_i, m_g)$. We define the transition function $P_t$ and the convergence point $Q^*$ as follows.
Definition 1. Let $P_t : \mathcal{Q} \to \mathcal{Q}$ be a mapping on the complete metric space $\mathcal{Q}$ into $\mathcal{Q}$, $P_t Q = (P_t Q_1, P_t Q_2, \dots, P_t Q_N)$, where
$$P_t Q_i(s, a) = r_i + \gamma\, Q_i(o_i', \mu_i(o_i', \bar{m}(s')), \bar{m}(s')) \quad (16)$$
for $i \in I$, where $\bar{m}(\cdot) = (\mu_1^m, \dots, \mu_N^m)$.
With the above definitions, we show that the transition function P t is a "contraction mapping" with the fixed point at Q * .

Lemma 2. The convergence point is a fixed point: $E[P_t Q^*] = Q^*$.
Proof. Since Q * is a convergence point in the game, the vehicles will still follow the current policy µ * . According to the Bellman equation (Equation (1)), there is where the fourth line takes the expectation from p(s | s, a) and the Bellman equation.
Next, we show that P t is a "contraction mapping". According to Assumption 3, there is µ i (o i , m g ) = µ i (s). Similar to [41], the max-norm of the mapping operator can be defined as Proof. According to the transition function P t , there is The fourth line of equality comes from our Assumption 3 that the message m g is a compact representation of s. The fifth line of inequality is from Assumption 4 that the vehicles play the best response with respect to the broadcast message m g .
Summarizing the above two lemmas, it is proved that $P_t$ is a "contraction mapping" with the fixed point at $Q^*$. Thus, according to Lemma 1, we have the following.
Theorem 1. Under Assumptions 1-4, the sequence $(Q_1, \dots, Q_N)$ updated by Algorithm 1 converges to a fixed value $Q^* = (Q_1^*, \dots, Q_N^*)$.

Evaluation
In this section, we first introduce the experiment setup, including the description of the environment, the baselines, and the model parameters. Next, we will show the performance of our algorithm with comparisons with other baselines. In particular, the results validate the importance of the consensual communication framework.

Experiment Setup
The Environment. To validate the effectiveness of our algorithm, we manually construct a mobile-sensing simulation environment based on a real historical data set. The data set is collected from a road network from Google Map (Google Map: https://www.google.com/maps, accessed on 10 March 2022) and records the traffic volume on the road network across different hours (the data sets generated during the current study are available at https://www.dropbox.com/s/42cl68ns2fud5yk/GOOGLETraffic.zip?dl=0, accessed on 10 March 2022). We focus on a 10 km × 10 km square area centered at (48.16, 16.33). In this map, we uniformly sample 40 × 40 points as the locations of events. For each position, the traffic volumes are extracted as the event intensities. An illustration of the event map at a given time is presented in Figure 4. The dots represent the events happening at different locations, and dots with darker colors have higher event intensities. The events have 5 levels of intensity: 0, 1, 2, 3, and 4. We also add uniform random noise in (0, 1) to the event intensities for randomness. In this map, there are assumed to be 5 charging stations, at locations (8, 32), (32, 8), (8, 8), (32, 32), and (20, 20). By default, the max speed of each vehicle is $S_{max} = 2$, and the sensing radius is $l_i = 2$. Therefore, each vehicle can cover multiple events at the same time. The battery capacity of each vehicle is $b_{max} = 40$. During moving, the coefficients of battery consumption are $\beta = 1$ and $\Delta_0 = 1$. The vehicles can regularly navigate to a charging station, where they are recharged by $b_0 = 20$ units of battery at each time step. The penalty for running out of power is set as $c = 40$. We also try other values to validate the effectiveness of our algorithm. The replay buffer is kept small, with a size of $10^5$, since the vehicles' policies may be dynamic.
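The event map and the sensing coverage described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulator used in the experiments: the intensity levels are drawn at random here (the paper derives them from traffic volumes), coverage is taken to be a Euclidean disc on the grid, and the function names `make_event_map`, `covered_mask`, and `sensed_intensity` are hypothetical.

```python
import numpy as np

GRID = 40                 # 40 x 40 event locations
SENSING_RADIUS = 2.0      # l_i in the text
CHARGING_STATIONS = [(8, 32), (32, 8), (8, 8), (32, 32), (20, 20)]

def make_event_map(rng, grid=GRID):
    """Integer intensity levels 0..4 plus uniform noise in [0, 1).

    Assumption: levels are sampled at random for illustration only.
    """
    levels = rng.integers(0, 5, size=(grid, grid))
    noise = rng.uniform(0.0, 1.0, size=(grid, grid))
    return levels + noise

def covered_mask(position, radius=SENSING_RADIUS, grid=GRID):
    """Boolean mask of grid points within `radius` of `position`."""
    xs, ys = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    dist = np.hypot(xs - position[0], ys - position[1])
    return dist <= radius

def sensed_intensity(event_map, position):
    """Total event intensity covered by one vehicle at `position`."""
    return float(event_map[covered_mask(position)].sum())
```

With a radius of 2 on an integer grid, a vehicle located on an interior grid point covers 13 event locations under this disc assumption, which is consistent with the remark that each vehicle covers multiple events at the same time.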
Baselines. We name our algorithm ConComm (CONsensual COMMunication) and compare it with the following baselines, all of which can generate continuous actions.
• ConComm (no MI): In this algorithm, we implement the ConComm algorithm without the mutual information item. This comparison is to demonstrate the effectiveness of the mutual information item.
• DDPG [15]: In this algorithm, each mobile vehicle independently learns a policy to schedule the sensing path. The main drawback is that the multi-agent environment does not follow the Markov property, which may lead to the failure of this algorithm.
• MADDPG [9]: MADDPG uses the CTDE framework, where there is a global critic function that has access to the historical samples from all mobile vehicles. However, the policies of the vehicles are not coordinated explicitly during execution.
• MAPPO [42]: This algorithm is a multi-agent version of PPO. It has achieved state-of-the-art performance in many scenarios.

Model Parameters
For the different algorithms, we use similar critic network structures: an FC layer with 64 hidden units, followed by a ReLU activation layer for non-linearity. The output is connected to a GRU layer with 64 hidden units and then fed into another FC layer to output the critic value. The actor networks have a similar structure; the only difference is the output of the networks. The communication actor network outputs a message of size 6, followed by a sigmoid layer to restrict the message to the range (0, 1). The messages are aggregated with a MEAN operator, i.e., $m_g = \frac{1}{N} \sum_{i \in I} m_i$. The moving actor network outputs a vector of size 2, followed by a sigmoid layer to restrict the range of the speed and angle. The maximum speed and angle are used to project the outputs into new ranges. For the variation network, the input is the embedding after the FC layer. It is then fed into two FC layers with 64 hidden units to predict the aggregate message. Mean squared error is used as the loss function for the variation network. The weight of the MI item $\rho$ is set as 0.5 so that the different parts of the reward function are comparable.
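The output stages described above (sigmoid squashing, MEAN aggregation of the messages, and projection of the moving action) can be sketched as follows. The sketch operates on raw network outputs only and omits the FC and GRU layers. The maximum angle of $2\pi$ is an assumption, since the text does not state it, and the function names are hypothetical.

```python
import numpy as np

MESSAGE_SIZE = 6
S_MAX = 2.0
ANGLE_MAX = 2.0 * np.pi  # assumption: the heading angle spans a full circle

def sigmoid(x):
    """Element-wise logistic function, mapping reals to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def aggregate_messages(message_logits):
    """Map raw communication outputs of shape (N, MESSAGE_SIZE) to m_g.

    Each message is squashed to (0, 1), then averaged over the N vehicles:
    m_g = (1 / N) * sum_i m_i.
    """
    messages = sigmoid(np.asarray(message_logits, dtype=float))
    return messages.mean(axis=0)

def project_action(move_logits):
    """Map raw moving outputs of shape (2,) to a (speed, angle) pair."""
    unit = sigmoid(np.asarray(move_logits, dtype=float))
    return float(unit[0] * S_MAX), float(unit[1] * ANGLE_MAX)
```

Because the aggregation is a mean, the size of $m_g$ does not depend on the number of vehicles, so the same actor input dimension can be reused when the team grows.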

Convergence of Training
In the first experiment, we assume there are N = 12 mobile vehicles and examine the convergence of the algorithms during training in Figure 5. The average step reward is evaluated every 200 steps. We assume the vehicles share the same network parameters. Nonetheless, the vehicles can still behave differently given their local observations. The y-axis represents the average step reward over the vehicles, $r = \frac{1}{N} \sum_{i \in I} r_i$. Each of the RL-based algorithms is trained 3 times, and the shaded area represents one standard deviation. As presented, our proposed ConComm achieves the highest performance most of the time. The average step reward of ConComm converges to around 17 after only about 50,000 steps and then stabilizes around this level. Moreover, the performance of ConComm is also more stable than that of the other algorithms. This is because the vehicles are more likely to have coordinated behaviors. ConComm (no MI) is the algorithm without explicit policy coordination. Its result can be relatively high due to the communication among the mobile vehicles. However, its performance is worse than that of ConComm, which validates the effectiveness of the MI item. DDPG has the worst performance among the algorithms. This is mainly because the vehicles make decisions independently, so there may be many repeated sensing efforts among them. MADDPG and MAPPO have similar performances that are slightly better than DDPG. The main reason is that they adopt the "centralized training, decentralized execution" mechanism. However, in the execution phase, there may still be uncoordinated behaviors in unseen environment states, and different vehicles may not achieve consensus before making decisions. The above comparisons show that communication plays an important role in coordinating the vehicles' behaviors.
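The quantities plotted in the training curves can be computed as in the sketch below. It assumes that "evaluated every 200 steps" means averaging non-overlapping blocks of 200 steps, which the text does not spell out, and all function names are hypothetical.

```python
import numpy as np

def average_step_reward(rewards):
    """rewards: (steps, N) per-vehicle rewards -> (steps,) mean over vehicles.

    Implements r = (1 / N) * sum_i r_i for every step.
    """
    return np.asarray(rewards, dtype=float).mean(axis=1)

def evaluation_curve(step_rewards, interval=200):
    """Average consecutive, non-overlapping blocks of `interval` steps.

    Assumption: a trailing partial block is dropped.
    """
    step_rewards = np.asarray(step_rewards, dtype=float)
    usable = (len(step_rewards) // interval) * interval
    return step_rewards[:usable].reshape(-1, interval).mean(axis=1)

def mean_and_band(curves):
    """curves: (runs, points) -> mean curve and a one-standard-deviation band."""
    curves = np.asarray(curves, dtype=float)
    mean = curves.mean(axis=0)
    std = curves.std(axis=0)
    return mean, mean - std, mean + std
```

With only three training runs per algorithm, the band is a rough indication of run-to-run variability rather than a precise confidence interval.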
Performance during Execution. After training, we fix the network parameters and compare the performance of the different algorithms in the simulation environment without exploration. The results are shown in Figure 6. In this figure, the height of each bar represents the sensing reward, where the red part is the battery penalty and the blue part is the true average reward, which equals the sensing reward minus the battery penalty. The algorithm with the highest blue bar has the best performance. As presented, our proposed ConComm achieves the best performance (the blue part) among the algorithms. In particular, its sensing reward (the blue+red part) also outperforms that of the other algorithms significantly. This is because the vehicles in ConComm can avoid repeated sensing through communication. ConComm (no MI) can also achieve high performance. It incurs a lower battery penalty because the vehicles' behaviors are not affected by the communication messages explicitly. DDPG also performs well in charging, since each vehicle only cares about its own reward. However, its global sensing reward can be quite limited, which may be caused by the lack of coordination. As the MADDPG and MAPPO algorithms lack a coordination mechanism in the execution phase, they may not perform as well as our ConComm algorithm. In summary, to achieve high performance, the vehicles should not only try to sense more events with the limited battery, but also coordinate with others to avoid repeated sensing.
We also investigate the trajectories of the vehicles in ConComm to show its effectiveness. We collect the vehicles' trajectories in the execution phase for 1000 steps and obtain the appearance counts in the map. The appearance counts are normalized and plotted as a heatmap. The result is presented in Figure 7. In the heatmap, areas with a redder color are visited more often by the mobile vehicles, and the blue areas are visited less often. Compared with Figure 4, the areas where the event intensities are higher also have more vehicle appearances. These areas are dispersed, since the vehicles can cooperate to maximize the coverage and reduce repeated sensing. Moreover, the areas near the charging stations also have redder colors, because the vehicles regularly move to the stations for charging. Above all, the heatmap validates that the vehicles of ConComm can not only navigate back for charging, but also properly move to the areas with high event intensities. This heatmap illustrates that our proposed ConComm can properly coordinate the navigation of the vehicles.
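The appearance-count heatmap described above can be computed as in the following sketch. It assumes that positions are continuous coordinates in [0, 40) on each axis, that they are binned into unit grid cells, and that "normalized" means the counts sum to 1. None of these details are given in the text, and the function name is hypothetical.

```python
import numpy as np

def appearance_heatmap(trajectories, grid=40):
    """trajectories: (vehicles, steps, 2) positions -> (grid, grid) frequencies.

    Assumption: counts are normalized so that the heatmap sums to 1.
    """
    points = np.asarray(trajectories, dtype=float).reshape(-1, 2)
    counts, _, _ = np.histogram2d(
        points[:, 0],
        points[:, 1],
        bins=grid,
        range=[[0, grid], [0, grid]],
    )
    total = counts.sum()
    return counts / total if total > 0 else counts
```

The resulting array can be passed to any plotting routine that renders a 2-D array as colors, and it can be compared cell by cell with the event map to check whether high-intensity areas are visited more often.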

Policy Coordination via Communication
The above two experiments have already shown that explicitly coordinating the policies of different mobile vehicles is crucial for cooperative sensing. In this part, we investigate the effect of coordination by adjusting the weight of the MI item. In addition to the default value $\rho = 0.5$, we set the weight $\rho$ to different values between 0 and 1 and observe the convergence process during training. Note that when $\rho$ is 0, the algorithm degrades to the case of ConComm (no MI). When $\rho$ is 1, the vehicles neglect the sensing reward and battery penalty, and focus only on coordinating with others.
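One way to read the role of $\rho$ is as the mixing weight of a convex combination between the task reward and the MI item. The sketch below assumes exactly that form. The form is inferred from the two endpoint behaviors described above and is not quoted from the paper's reward definition, so the function `shaped_reward` and its arguments are hypothetical.

```python
def shaped_reward(sensing_reward, battery_penalty, mi_term, rho=0.5):
    """Hypothetical mixed reward: (1 - rho) * task reward + rho * MI item.

    Assumption: the task reward is the sensing reward minus the battery
    penalty, and rho lies in [0, 1].
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    task_reward = sensing_reward - battery_penalty
    return (1.0 - rho) * task_reward + rho * mi_term
```

Under this assumption, $\rho = 0$ recovers the pure task reward of ConComm (no MI), while $\rho = 1$ leaves only the MI item, which matches the two limiting cases discussed in the text.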
The results are shown in Table 2. As presented, introducing policy coordination significantly improves the performance when $\rho$ is non-zero. This validates that positive communication is necessary for coordinating the decentralized vehicles. Meanwhile, when the coefficient is too large, the performance may decrease, since the vehicles care more about coordination and less about the sensing reward. When the coefficient reaches 1, the vehicles focus only on coordination, and thus the sensing reward is very poor. The results show that the vehicles need to balance coordination and sensing; the performance degrades if they focus on only one of them.
Validating the Variation Network. In this part, we show that the communication message indeed influences the vehicles' moving policy. More concretely, we compute the cross entropy between the policy $\mu_i$ and the neighbors' aggregated message $m_g$ as $H(\mu_i, m_g)$. The policy $\mu_i$ is represented by the historical trajectories $traj_i$. Cross entropy measures the average number of bits needed to identify an event drawn from a set when the coding scheme is optimized for an estimated probability distribution rather than the true distribution. It can also be regarded as a distance between the two probability distributions. A low cross entropy indicates that the two probability distributions may be highly correlated.
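For two discrete distributions, the cross entropy in bits can be computed as in the sketch below. How the trajectories and the aggregated messages are turned into probability distributions is not described in the text, so the inputs here are plain histograms and the function name is hypothetical.

```python
import numpy as np

def cross_entropy_bits(p, q, eps=1e-12):
    """Cross entropy H(p, q) = -sum_k p_k * log2(q_k), in bits.

    p is treated as the true distribution and q as the estimated one.
    Both inputs are normalized to sum to 1, and q is clipped at `eps`
    to avoid taking the logarithm of zero.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(-np.sum(p * np.log2(np.clip(q, eps, 1.0))))
```

For a fixed $p$, the cross entropy is smallest when $q$ equals $p$, where it reduces to the entropy of $p$. A decreasing value during training is therefore consistent with the two distributions moving closer together.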
We present the dynamics of the cross entropy during training in Figure 8. As presented, the cross entropy is high at the beginning. This is because the vehicles have not yet learned to correlate with the communication message. As the training proceeds, the cross entropy becomes lower and stabilizes at about 1.0. This validates that the policy $\mu_i$ becomes more correlated with the communication.
Validating the Penalty of Battery. Next, we investigate the effect of the hyper-parameter $c$ in shaping the battery penalty. Generally, with a larger value of $c$, the vehicles navigate to the charging stations more frequently to avoid running out of battery. In practice, this parameter can be set freely by the vehicles, and our algorithm can adapt to different values of $c$. In this experiment, we train the algorithms with different values of $c$ and validate the performance with the default value $c = 40$. Figure 9 presents the results. As presented, when $c$ is 0, the vehicles do not care about the battery penalty and focus only on sensing events. Therefore, the sensing reward is high, but the battery penalty is also very high, leading to a low average step reward. When the value of $c$ increases, the vehicles become more cautious about running out of power. They incur a low battery penalty, but the sensing reward also decreases. In general, choosing a proper value of $c$ balances the preference between sensing and battery. In practice, we can set the value of $c$ as the cost of reclaiming the vehicles when they run out of power. If running out of power is unacceptable, we can also force the vehicles to navigate back to the charging station when needed.
Scalability. In this last experiment, we validate the scalability of ConComm. We increase the number of mobile vehicles to 128 and the number of charging stations to 16. The map is divided into an 80 × 80 grid, with the charging stations distributed uniformly at random. As above, we assume the mobile agents share the same network parameters. MADDPG and MAPPO would take too much time to train, so we only present the results of ConComm, ConComm (no MI), and DDPG. As shown in Figure 10, the ConComm algorithm still achieves better performance. With more agents, their sensing areas are more likely to overlap, so the average step reward is lower than before. Nonetheless, ConComm can still successfully coordinate the behaviors of the agents and achieve high performance. In this case, as there is no explicit coordination, the variance of ConComm (no MI) is larger. The result of DDPG is also unstable, since the vehicles' policies are mostly dynamic, leading to low efficiency of coordination. Moreover, the performance of DDPG even degrades after about 600,000 steps. This may result from the DDPG agents being uncoordinated and falling into local optima.

Conclusions
This paper studies the problem of mobile sensing in an open, dynamic environment. To maximize the long-term spatial-temporal coverage of the events, we propose a decentralized policy coordination framework. The main idea is to introduce a communication mechanism among the mobile vehicles. On the one hand, the vehicles can share local information with each other to break through the dilemma of decentralized execution; on the other hand, the vehicles can have coordinated behavior with enforced positive communication.
In particular, the consensual communication is achieved by maximizing the mutual information between the received message and the policy. We conduct extensive experiments to validate the performance of our algorithm. The results show that our algorithm can converge very fast in the training phase, and outperforms other baselines significantly in the execution phase. Moreover, the experiments show that the consensual communication mechanism plays an important role in coordinating the behaviors.
For future work, we aim to extend the current method in two directions. First, the battery constraints in this paper are relaxed into part of the objective, which may lead to violations. Therefore, we need to devise a method with "hard" constraints. Second, we will improve the interpretability of the communication messages to understand the internal mechanism that promotes cooperation among the vehicles.

Conflicts of Interest:
The authors declare no conflict of interest.