Attention-Shared Multi-Agent Actor–Critic-Based Deep Reinforcement Learning Approach for Mobile Charging Dynamic Scheduling in Wireless Rechargeable Sensor Networks

The breakthrough of wireless energy transmission (WET) technology has greatly promoted the wireless rechargeable sensor networks (WRSNs). A promising method to overcome the energy constraint problem in WRSNs is mobile charging by employing a mobile charger to charge sensors via WET. Recently, more and more studies have been conducted for mobile charging scheduling under dynamic charging environments, ignoring the consideration of the joint charging sequence scheduling and charging ratio control (JSSRC) optimal design. This paper will propose a novel attention-shared multi-agent actor–critic-based deep reinforcement learning approach for JSSRC (AMADRL-JSSRC). In AMADRL-JSSRC, we employ two heterogeneous agents named charging sequence scheduler and charging ratio controller with an independent actor network and critic network. Meanwhile, we design the reward function for them, respectively, by considering the tour length and the number of dead sensors. The AMADRL-JSSRC trains decentralized policies in multi-agent environments, using a centralized computing critic network to share an attention mechanism, and it selects relevant policy information for each agent at every charging decision. Simulation results demonstrate that the proposed AMADRL-JSSRC can efficiently prolong the lifetime of the network and reduce the number of death sensors compared with the baseline algorithms.


Introduction
Wireless sensor networks (WSNs) have been widely applied in target tracking, environment monitoring, intelligent medical and military monitoring, etc. [1,2], which have advantages including fast construction, self-organization, fault tolerance, and low-cost deployment [3]. Meanwhile, WSNs are usually composed of a large scale of sensors deployed in an area. However, sensors in WSN are always powered by batteries, and the capacity of these batteries is constrained by the volume of the sensor, which limits the lifetime of the sensors. Furthermore, the energy constraint problem affects the quality of service of WSN directly and greatly hinders the development of WSN. In recent years, the breakthrough of wireless energy transmission (WET) technology has greatly promoted the wireless rechargeable sensor networks (WRSNs) [4], since it provides a highly reliable and efficient energy supplement for the sensors. Particularly, a promising method to overcome the energy constraint problem in WRSNs is mobile charging by employing one or more mobile chargers (MCs) with a high capacity to charge sensors via WET. The MC can move to sensors autonomously and charge them according to a mobile charging scheduling scheme, which is formulated by MC based on the status information of sensors, including the residual energy, energy consumption rate, and position of sensors in WRSN. The status information of sensors is highly controllable and predictable. Theoretically, WRSNs could work indefinitely under a well-designed charging scheme [5]. Therefore, the design of the charging scheme in WRSN is critical, and it has drawn extensive attention from the research community.
There are plenty of works that have been presented to design the mobile charging schemes on WRSN. According to whether MC carries the determined charging scheme before starting from the base station, the existing works can be divided into two categories [6][7][8][9][10][11][12][13][14][15][16]: (1) offline methods [7][8][9][10][11][12] and (2) online methods [6,[13][14][15][16]. In offline methods, before starting from the base station, MC will formulate a transparent charging scheme according to the status of sensors, including accurate location, fixed energy consumption rate, regular information transmission rate, etc. The MC will charge sensors with a scheduled trajectory determined by the charging scheme. The offline methods ignore the dynamic change in the status of sensors. Hence, the offline method is not suitable for dealing with application scenarios where the energy consumption rate of the sensor changes in real time and the large-scale WRSN. For example, Yan et al. [17] first attempted to introduce particle swarm optimization into optical wireless sensor networks, which could optimize the positioning of nodes, reduce the energy consumption of nodes effectively and converge faster. In [18], Shu et al. made the first attempt to deal with the jointly charging energy and designing operation scheduling in WRSN. They proposed an f-Approximate algorithm to address this problem and verify that the proposed algorithm could obtain an average 39.2% improvement of network lifetime beyond the baseline approaches. In [19], Feng et al. designed a novel algorithm called the newborn particle swarm optimization algorithm for charging-scheduling in industrial rechargeable sensor networks by adding new particles to improve the particle diversity. This improvement made the algorithm achieve better global optimization ability and improved the searching speed. V.K. Chawra et al. proposed a novel algorithm for scheduling multiple mobile rechargers using the hybrid meta-heuristic technique in [20], which combined the best features of the Cuckoo Search and Genetic Algorithm to optimize the path scheduling problem to achieve shorter charging latency and more significant energy usage efficiency. To enhance the charging efficiency [21][22][23][24], Zhang et al., Liang et al., and Wu et al. proposed some hierarchical charging methods for multiple MCs to charge sensors and themselves.
Different from the offline methods, in some application scenarios, the energy consumption rate of sensors is time-variant, and there are many uncertain factors in the network, which make the offline approaches unable to obtain an acceptable charging scheduling scheme according to the information in the network, while online approaches could successfully deal with these issues. The specific implementation is that the MC does not need to know the status of sensors clearly before starting from the base station but only needs to build candidate charging queues. When the residual energy of the sensor is lower than the set threshold, it will send a charging request and its energy information to the MC. The MC accepts the charging request and inserts it into all candidate charging queues. Then, the charging sequence will update according to the status of the sensors. For example, Lin et al. aimed to maximize the charging efficiency while minimizing the number of dead sensors to achieve the purpose of prolonging the lifetime of WRSN in [16]. Therefore, they developed a temporal-spatial real-time charging scheduling algorithm (TSCA) for the on-demand charging architecture. Furthermore, they also verified that the TSCA algorithm could obtain a better charging throughput, charging efficiency, and successful charging rate than the existing online algorithms, including Nearest-Job-Next with Preemption scheme and Double Warning thresholds with Double Preemption charging scheme. Feng et al. [25] proposed a mobile energy charging scheme that can improve the charging performance in WRSN by merging the advantages of online mode and offline mode. It includes the dynamicity of sensors' energy consumption in the online mode and the benefit of lower charging consumption by optimizing the charging path of the mobile charger in offline mode. Kaswan et al. converted a charging scheduling problem to a linear programming problem and presented a gravitational search algorithm [26]. This approach presented a novel agent representation scheme and an efficient fitness function. In [27], Tomar et al. proposed a novel scheduling scheme for on-demand charging in WRSNs to address the joint consideration of multiple mobile chargers and the issue of ill-timed charging response to the nodes with variable energy consumption rates.
Unfortunately, although the online methods can address the mobile charging dynamic scheduling problem, they still have disadvantages, including short-sightedness, non-global optimization, and unfairness. Specifically, most recent works assume that the sensor closest to MC is usually inserted into the current charging queue. Meanwhile, sensors with low energy consumption rates are always ignored, resulting in their premature death and a reduction in the service quality of the WRSN. It is generally known that the mobile charging path planning problem in WRSN is a Markov decision process, which has been proved to be an NP-hard problem in [28]. Therefore, the most difficult problem is how to design an effective scheduling scheme to find the optimal or near-optimal solution more quickly and reliably when the size of network increases gradually.
It is known that Reinforcement Learning (RL) is an effective method to address the Markov decision process. As mentioned above, the charging scheduling problem in WRSN is NP-hard; thus, it is unable to provide available optimal labels for supervised learning. However, the quality of a set of charging decision can be evaluated via the reward feedback. Therefore, we need to design a reasonable reward function according to the states of WRSN for RL. During the interaction between agent and environment, the charging scheduling scheme will be found through learning strategies that can maximize the reward. There are several works that have tried to solve the charging scheduling problem with RL algorithms. For example, Wei et al. [29] and Soni and Shrivastava [30] proposed a charging path planning algorithm (CSRL), combining RL and MC to extend the network lifetime and improve the autonomy of MC. However, the proposed CSRL method only suits offline mode, where the energy consumption of sensor nodes is time-invariant. Meanwhile, this method can only be used to address small-scale networks, since the Q-learning algorithm generally fails to handle high-dimensional state space or large state space. Cao et al. [28] proposed a deep reinforcement learning-based on-demand charging algorithm to maximize the sum of rewards collected by the mobile charger in WRSN, which is subject to the energy capacity constraint on the mobile charger and the charging times of all sensor nodes. A novel charging scheme for dynamic WRSNs based on an actor-critic reinforcement learning algorithm was proposed by Yang et al. [31], which aimed to maximize the charging efficiency while minimizing the number of dead sensors to prolong the network lifetime. The above works have made significant model innovation and algorithm innovation, yet they ignore the impact of sensor charging energy on the optimization performance. Although Yang et al. [31] proposed a charging coefficient to constrain the upper charging energy threshold, they assumed that all sensors have a fixed charging coefficient during the scheduling, which cannot adjust according to the needs of the sensors. Specifically, the charging coefficient could directly determine the charging energy for the sensor. Therefore, how to select the next sensor to be charged and determining its corresponding charging energy brings novel challenges to the design of the charging scheme.
We study a joint mobile charging sequence scheduling and charging ratio control problem (JSSRC) to address the challenges mentioned above, where charging ratio is a parameter introduced to determine the charging energy for the sensor and replace on-demand charging requests with real-time changing demands. JSSRC provides timely, reliable, and global charging schemes for WRSNs in which sensors' energy changes dynamically. Meanwhile, we propose the attention-shared multi-agent actor-critic deep reinforcement learning approach for JSSRC; this approach is abbreviated as AMADRL-JSSRC. We assume that the network deployment scenarios are friendly, barrier-free, and accessible. The transmission of information about real-time changes in energy consumption is reliable and deterministic. When the residual energy of MC is insufficient, it is allowed to return to the depot to renew its battery. Table 1 highlights the performance comparison of the existing approaches and the proposed approach with respect to four key attributes. The main contributions of this work are summarized as follows.
(1) Different from the existing works, we consider both charging sequence and charging ratio optimization simultaneously in this paper. We introduce two heterogeneous agents named charging sequence scheduler and charging ratio controller. These two agents give the charging decisions separately under the dynamic changing environments, which aims to prolong the lifetime of the network and minimize the number of dead sensors. (2) We design a novel reward function with a penalty coefficient by comprehensively considering the tour length of MC and the number of dead sensors for AMADRL-JSSRC, so as to promote the agents to make better decisions. (3) We introduce the attention shared mechanism in AMADRL-JSSRC to the problem that charging sequence and charging ratio have different contributions to the reward function.
The rest of the paper is organized as follows: Section 2 describes the system models of WRSN and formulates the JSSRC problem. The proposed AMADRL-JSSRC approach is described in Section 3. Simulation results are reported in Section 4. The impacts of the parameters on the charging performance are discussed in Section 5. Conclusions and future work are given in Section 5.

System Model and Problem Formulation
In this section, we present the network structure, energy consumption model of sensors, energy analysis of MC, and the formulation of the charging scheduling problem in WRSNs.

Network Structure
In Figure 1, a WRSN with n heterogeneous isomorphic sensors SN = sn 1 , sn 2 , . . . , sn n , an MC, a base station (BS), and a depot are adopted. It is assumed that due to different information transmission tasks, all sensors have the same energy capacity E sn and sensing ability but different energy consumption rates. They are deployed in a 2D area without obstacles; the positions of all sensors are fixed and can be determined accurately, and they are recorded as (x i , y i ), i ∈ [1, n], and the position of BS is set as (x 0 , y 0 ). Therefore, a weighted undirected graph G = {BS, SN}, D sn , E 0 , E c is used to describe the network model of WRSN, where D sn is the set of distances between sensors, which is expressed as The set of initial residual energy and the energy consumption rate of each sensor are represented by E 0 and E c , respectively. (x D , y D ) is defined as the position of the depot.

Network Structure
In Figure 1, a WRSN with n heterogeneous isomorphic sensors = { , , … , }, an MC, a base station (BS), and a depot are adopted. It is assumed that due to different information transmission tasks, all sensors have the same energy capacity and sensing ability but different energy consumption rates. They are deployed in a 2D area without obstacles; the positions of all sensors are fixed and can be determined accurately, and they are recorded as ( , ), ∈ [1, ], and the position of BS is set as ( , ). Therefore, a weighted undirected graph = (  It is assumed that each sensor in WRSN collects data and communicates with BS via ad hoc communication. The BS could estimate their residual energy according to data sampling frequency and transmission flow. MC can obtain the state information of the sensor but will not interfere with the working state of the sensor. Meanwhile, the total moving distance of MC during the charging tour is defined as . Although, in theory, the lifetime of the network can be extended indefinitely with single or multiple MC. The network will shut down, since the energy modules of sensors will age. Therefore, inspired by [28,31], we define the lifetime in this article as below.

Definition 1 (Lifetime). The lifetime of WRSNs is defined as the period from the beginning of the network to the number of dead sensors reaching a threshold.
The lifetime and the threshold are described with and %, respectively. Furthermore, the abbreviations used in this paper are summarized in Table 2.  It is assumed that each sensor in WRSN collects data and communicates with BS via ad hoc communication. The BS could estimate their residual energy according to data sampling frequency and transmission flow. MC can obtain the state information of the sensor but will not interfere with the working state of the sensor. Meanwhile, the total moving distance of MC during the charging tour is defined as Dis.
Although, in theory, the lifetime of the network can be extended indefinitely with single or multiple MC. The network will shut down, since the energy modules of sensors will age. Therefore, inspired by [28,31], we define the lifetime in this article as below.
Definition 1 (Lifetime). The lifetime of WRSNs is defined as the period from the beginning of the network to the number of dead sensors reaching a threshold.
The lifetime and the threshold are described with T li f e and ω%, respectively. Furthermore, the abbreviations used in this paper are summarized in Table 2.

Energy Consumption Model of Sensors
The energy of the sensor is mainly consumed in data transmission and reception. Therefore, based on [32,33], the energy consumption model at time slot t is adopted as below: where ρ is the energy consumption for receiving or transmitting 1 kb data from sensor sn i to sensor sn j (or BS). ς t i,j = ξ 1 + ξ 2 · d r i,j represents the energy consumption for transmitting 1 kb data between each sensor, where d i,j is the distance between sn i and sn j . ξ 1 and ξ 2 represent the distance-free and distance-related energy consumption index, respectively. r is the signal attenuation coefficient. f r k,i means the data flow of receiving, f t i,j (1 ≤ j ≤ n) and f t i,B are the data flow of transmitting from sn i to sn j and BS. Hence, ρ ∑ n k=1,k =i f r k,i (t) represents the energy consumption of sn i receiving information from all sensor nodes.
is the energy consumption of sn i by sending information to other sensors and BS.

Charging Model of MC
In this paper, the sensors in WRSN are charged by MC wirelessly, and the empirical wireless charging model is defined as [34] where d ms represents the distance between the sensor and the mobile charger, P 0 is the output power, G s is the gain of the source antenna which is equipped on the mobile charger, G r is the gain of the receiver antenna, d ms is the distance between the mobile charger and the sensor, L p , and λ denote the rectifier efficiency and the parameter to adjust the Friis' free space equation for short-distance transmission, respectively. Since the MC moves to the position near the sensors, the distance can be regarded as a constant. Therefore, (2) can be simplified to (3) The moving speed of the MC is set as v ms , and the energy consumed per meter is e m J. The capacity of MC is E mc , and the target sensor will be charged with one-to-one charging mode only when the MC reaches it.

Problem Formulation
We define three labels to describe the working states of the visited point at time slot t, i ∈ [0, n], i = visit, i = visit and i = dead. They represent that sn i is selected to charge, not be selected and dead, respectively, while i = 0 represents that the visited point is a depot. The residual energy of the sensor is defined as e i r (t), the charging demand of sensor sn i is defined as e i d (t) and the residual energy of MC is defined as e r mc (t). At time slot t, the residual energy of the sensor is described as (4), and the charging demand will also be updated with (5) where ε is the charging ratio, it could decide the upper threshold of charging energy, and its value range in (0, 1].
To effectively charge the sensors, more energy in the MC should be used on charging sensors, while the energy wasted on moving between the sensors and the depot should be minimized. Hence, within the network lifetime T li f e , the JSSRC problem under WRSNs with the dynamic energy changing is defined as below.
Definition 2 (JSSRC). The joint mobile charging sequence scheduling and the charging ratio control problem, which aims to prolong the lifetime of the network and minimize the number of dead sensors in WRSNs with dynamic energy changing, is defined as the JSSRC problem.
The relevant notations are defined as follows: at time slot t, the current state of sensor i is defined as (6) according to its residual energy, if τ i (t) = 1, it indicates that the sensor is alive, and τ i (t) = 0 represents that the sensor has died.
Furthermore, the number of dead sensors is defined as N d (t), which is obtained with (7) There are three termination conditions of the JSSRC scheme, and they are described with (8): (1) The number of dead sensors reaches ω% of the total number, ω ∈ (0, 100]. (2) The remaining energy of MC is insufficient to return to the depot.
(3) The target lifetime or the base time is reached.
where d represent the distance from the MC's current location to the depot, t is the running time of the test, and T target is a given base time. Specifically, when any of the termination conditions in (8) are met, the charging process will end.
Then, within the network lifetime, the JSSRC problem can be formulated as

Details of the Attention-Shared Multi-Agent Actor-Critic Based Deep Reinforcement Learning Approach for JSSRC (AMADRL-JSSRC)
JSSRC is a joint scheduling problem with sequence scheduling and charging ratio control; it is difficult to schedule them simultaneously with the traditional single-agent reinforcement learning algorithm. Therefore, the multi-agent reinforcement learning algorithm is introduced to solve this problem. In this section, we first briefly introduce multi-agent reinforcement learning algorithms. Then, we model the provided problem and propose the AMADRL-JSSRC.

Basis of Multi-Agent Reinforcement Learning
Multi-agent reinforcement learning is developed on the basis of the reinforcement learning algorithm, which is often described as the Markov game (or stochastic game) [35]. Multi-agent reinforcement learning is also an important branch of machine learning and deep learning, which aims to improve the shortcomings of multi-objective control that cannot be achieved by a single agent. Each agent can be a cooperative, competitive or mixed relationship, and they learn how to make decisions in an environment by observing the rewards obtained after the environment performs some actions. Specifically, there are m agents; each agent first receives their own observations o ϑ (ϑ ∈ [1, m]). Then, we select an operation a ϑ from action spaces, which are subsequently sent to the environment. After that, the environment state transits from S to S , and each agent receives a reward r ϑ associated with these transitions. The purpose of training agents is to collect accumulated rewards from multiple agents as much as possible.

Learning Model Construction for JSSRC
The tuple {S, A 1 , A 2 , R, S } is used to define the JSSRC scheme, where S is the state space of two agents, A 1 and A 2 are the action spaces, R is the sum of rewards obtained by two agents after performing actions, and S is the state of the environment after executive action [36]. A state transition function is defined as T with T : where ϑ is the number of the agent, and R is the set of all possible rewards.
The time step is defined as the time slot when the scheduling decision is made. Hence, at the k-th time step, the MC visits position i and completes the charging decision, where i ∈ [0, n]. K is defined as the maximum time step when any of the termination conditions are met. The time slot corresponding to the k-th time step is defined as t(k); when the action of the k-th time step is completed, the corresponding time slot is recorded as t(k).
A scheduling example of JSSRC is shown in Figure 2. To express clearly, we omit the information communication process between sensors, leaving only the scheduling decision and the charging path. The relationship between the time slot and the time step is described in the upper part of the figure. Within the network lifetime T li f e , two agents determine two actions a 1 (k) and a 2 (k) according to their observations o 1 (k) and o 2 (k) in state S(k) at time step k. a 1 represents the decision decided by agent 1 to choose the next sensor to be charged, and a 2 represents the decision decided by agent 2 to control the charging ratio. Agents obtain their policy according to the continuous exploration and calculate the rewards R through the obtained strategies at the end of the T li f e . Then, the states, actions, policies, and rewards of the environment are defined as follows.  States of the environment: The state space of the environment in JSAAC includes the state information of the MC and sensors, which are defined as S mc and S net , respectively. An example of the information at time step k, S mc is (p mc (k), e r mc (k)), S net is p i sn (k), e i d (k), e i c (k) , where p mc (k) and e r mc (k) are the position and the residual energy of MC, p i sn (k) is the position of sn i to be visited, e i d (k) and e i c (k) are the charging demand and the energy consumption rate of sn i , where k ∈ [0, K] and i ∈ [0, n]. sn 0 represents the depot, and the value of e 0 d is 0 because the depot does not need to be charged. The state embedding is a 5 × K dimensional vector at time step k with S(k) = (S mc (k), S net (k)); only the position of the sensor is a static element, the others are dynamic.
Actions of the environment: The actions in JSSRC represent the decision of the target sensor and the charging ratio, which are determined by two agents.
Policies of the environment: The policy for a single agent is described with a = π(o), where a is an action, o is the observation of the agent, and π is the policy. In JSSRC, there are two agents; we define two agents with policies parameterized by θ = {θ 1 , θ 2 } and let The main goal of JSSRC is to learn a set of optimal policies to maximize two agents' expected discounted rewards.
Rewards of the environment: Reward is used to evaluate the action; its value is obtained by the agent after executing an action. In this paper, our goal is to improve the charging performance of WRSN, which includes minimizing the moving distance of MC and reducing the number of dead sensors. Since the total number of dead sensors is inversely proportional to the reward, if the performed actions lead to more sensors being dead, we will give a penalty for this behavior. Therefore, the expected discounted rewards for two agents are defined with (10), and the immediate reward obtained after performing the actions at the k-th time step is defined with (11).
The expected discounted rewards for two agents can be defined as where the action space of a 1 is A 1 = {a 1 |a 1 ∈ {0, 1, . . . , n}}, and the action space of a 2 is A 2 = {a 2 |a 2 ∈ {0.5, 0.6, . . . , 1}}. is a reward coefficient between 0 and 1, which can ensure the shorter the moving distance is, the greater the reward that will be obtained. The N d (k) indicates the number of new dead sensors after the actions at the k-th time step are performed, and is the penalty coefficient. In (12), Dis a 1 ∼π 1 ,a 2 ∼π 2 represents the total moving distance obtained after performing the actions when the termination conditions are met. Obviously, the decision of the charging sequence and the charging ratio have different contributions to the reward function, which brings difficulties to the design of the algorithm. State Space Update of the environment: One episode of the JSSRC can be formed as a finite sequence of decisions, observations, actions, and immediate rewards, which is described in Table 3.
To display the specific update process of states, we assume that the MC is located at the depot at time step 0. At each time step, MC decides the next charged sensor from SN and determines the corresponding charging ratio for it. It is defined that the residual energy of sensor sn i before charging and after charging are e i r (k) and e i r (k), respectively. The charging demand of each sensor and the residual energy of MC will be updated after performing the charging operation at time step k. They are shown as follows: where t m (k) is the moving duration of the MC between the k-1-th and k-th time step. It is assumed that at the k-1-th time step, the MC is located at sn j , at the k-th step, MC is located at sn i . Therefore, we have d(k, k − 1) = d ij , and t m (k) can be obtained by (17) If sn i is alive at the k-th time step, the charging time is where ε i (k) is the unique charging ratio of sn i at the k-th time step. Therefore, the charging demands of three types of working states about sn i are The residual energy of the MC before and after performing the charging operation is defined as e mc r (k) and e mc r (k), respectively; they will update with (22) and (23) e mc r (k) = max{0, e mc To speed up the training and obtain feasible solutions, we give the following constraints: (1) The MC could visit any position in the network as long as its residual energy could satisfy the charging demand of the next selected sensor or is enough to move back to the depot. (2) All sensors with a charging demand greater than 0 have a certain probability of being selected as the next one to be charged. (3) The MC does not charge the sensors whose charging demands are zero. (4) If the residual energy of MC does not satisfy the charging demand of the next selected sensor, but it is enough to return to the depot, the MC is allowed to return to the depot to charge itself, and the charging time of the MC is ignored. (5) The charging decision of two adjacent time steps cannot be the same sensor or depot.
(6) If the residual energy of the MC does not meet the charging demand for the next sensor, is not enough to return to the depot, or the preset network lifetime is reached, the charging plan will be ended no matter whether the sensors are still alive or not.

AMADRL-JSSRC Algorithm
As depicted in Figure 3, AMADRL-JSSRC's implementation consists of the environment, the experience replay buffer (D), the mini-batch (B), the obtained rewards, and the different neural networks. The environment can be partially observed by each agent, where the actor and critic networks estimate the optimal control policies for the charging sequence scheduler and the charging ratio controller. The detail of training AMADRL-JSSRC is described in Algorithm 1. Unlike the traditional methods such as MADDPG [36] and MAPPO [3 receives information from other agents without discrimination and calcul sponding Q-value. In JSSRC, the contribution of the charging sequence sche charging ratio controller to the Q-value are different. Compared with the c Unlike the traditional methods such as MADDPG [36] and MAPPO [37], each agent receives information from other agents without discrimination and calculates the corresponding Q-value. In JSSRC, the contribution of the charging sequence scheduler and the charging ratio controller to the Q-value are different. Compared with the charging ratio, the decision of the charging sequence has a greater impact on the reward. To calculate the Q-value function Q ϕ ϑ (s, a) for agent ϑ, we introduce the attention mechanism with a differentiable key-value memory model [38,39]. This kind of mechanism does not need to make any assumptions about the temporal or spatial locality of the inputs, which is more suitable to overcome the difficulty that each agent has a different action space and contributes a different reward in this article.
At each time step, the critic network in each agent will receive the observation information s = (o 1 , o 2 ) and action information a = (a 1 , a 2 ), for all ϑ ∈ [1, 2]. We define the set of all agents except for ϑ as \ϑ, and we useθ as the pointer to index the set. Q ϕ ϑ (s, a) is defined as the function of agent ϑ which is obtained by combining with the observation information, action information, and contribution from other agents: where f ϑ is a two-layer multi-layer perceptron (MLP) [40], and g ϑ is a one-layer MLP embedding function. c ϑ is the contribution from other agents, which is a weighted sum of the value of each agent with (25) c ϑ = ∑θ =ϑ κθvθ = ∑θ =ϑ κθh Vgθ oθ, aθ In (25), vθ is the embedding function of agentθ encoded with an embedding function. Then, the shared matrix V is used for linear transformation. h is an element-wise nonlinearity activation function named leaky ReLu, which could retain some negative axis values to prevent all negative axis information from being lost. h is realized by (26) h where φ is a very small constant. The attention weight κθ uses bilinear mapping (i.e., query-key system) to compare the embedded eθ with e ϑ = g ϑ (o ϑ , a ϑ ), and it passes the similarity value between these two embedding into a SoftMax function: where the e ϑ is transformed to a "query" with W q , and the eθ is transformed to a "key" with W k [41].
To prevent vanishing gradients, the matching is scaled by the dimensionality of these two matrices. The multiple attention heads mechanism is introduced in AMADRL-JSSRC, each head with a separate set of parameters (W k , W q , V), which could give rise to an aggregated contribution from another agent to the agent i. We concatenate the contributions of all heads into a vector. The most important point is that each head could focus on a different weighted mixture of agents.
In AMADRL-JSSRC, the weights for extracting selectors, keys, and values are shared between two agents, because the multi-agent value function is essentially a multi-task regression problem. This parameter sharing in the critic network enables our method to learn effectively in an environment where the action space and reward for individual agents are different but share common observation features. The structure of the critic network and the structure of the multiple head attention mechanism are clearly shown in the left part of Figure 3.

Parameters Update in AMADRL-JSSRC
The parameters ϕ and θ used in the critic networks and policies gradient will be updated, respectively, according to line 17 to line 24 and line 28 to line 32 in Algorithm 1.
Since the parameters are shared among critic networks in AMADRL-JSSRC, all critic networks are updated together to minimize a joint regression loss function: In (28), y ϑ is obtained by (29) It is worth noting that Q ϕ ϑ is used to estimate the action value for agent ϑ by receiving the observation information and action information from all agents. D is a replay buffer to store past experiences. In (29), α is a parameter that could trade off maximizing entropy and rewards.
Since the charging sequence decision has a greater impact on the expected reward than the charging ratio decision, in order to give the optimal policies objectively, we need to compare the value of a specific action to the value of the average action of the agent, with another agent fixed. We could determine whether said action will lead to an increase in expected return or whether any increase in reward is attributed to the actions of another agent. This problem is called multi-agent credit assignment. An effective solution is to introduce an advantage function [42] with a baseline that only marginalized the actions of the given agent from Q ϕ ϑ (s, a), and the form of this advantage function is shown below: In (31), b s, aθ is the multi-agent baseline used to calculate the advantage function. We calculate our baseline with the AMADRL-JSSRC algorithm in a single forward pass by outputting the expected return Q ϕ ϑ s, a ϑ , aθ for every possible action, a ϑ ∈ A ϑ . The expectation could be calculated exactly with (32) E a ϑ ∼π ϑ (o ϑ ) Q ϕ ϑ s, a ϑ , aθ = ∑ a ϑ ∈A ϑ π(a ϑ |o ϑ )Q ϑ s, a ϑ , aθ To achieve this goal, we make the following four adjustments: (1) We must remove a i from the input of Q i and output a value for every action.
(3) We also modify f ϑ to output the Q-value of all possible actions rather than the single input action. (4) To avoid overgeneralization [43], we sample all actions from the current strategies of all agents to calculate the gradient estimation of agent ϑ rather than sampling the actions of other agents from the experience replay buffer such as [36,39].

Algorithm 1 AMADRL-JSSRC
1: Initialize the number of parallel environments for two agents as N p , initialize the update time of parallel operation as T update , initialize the experience replay buffer with D and the minibatch with B, initialize the number of episodes as N e , the number of steps per episode as N pe the number of critic updates as N cu , the number of policy updates as N pu , and the number of multiple attention head as N m , initialize the critic network Q ϕ , and actor network π θ with random parameters ϕ, θ, initialize the target network, ϕ ← ϕ and θ ← θ , T update ← 0 2: for i ep = 1, . . . , N e do 3: Reset environments, and obtain the initial o env ϑ for each agent, ϑ 4: for k = 1, . . . , N pe do 5: Randomly select actions a env ϑ ∼ π ϑ (·|o env ϑ ) for each agent ϑ, in each environment (env) with greedy search strategy 6: Send actions to all parallel environments, then obtain o ϑ env and r env ϑ for all agents 7: Store transitions for all environments in D 8: T update = T update + N p 9: if T update ≥ min steps per update then 10: for j = 1, .., N cu do 11: Sample B 12: function Update target parameters: T update ← 0 31: end if 32: end for 33: end for 34: Output: The parameters of target actor Therefore, the policies of each agent will be updated by:

Experimental Setup and Results
In this section, we will conduct experiments to evaluate the performance of AMADRL-JSSRC. The simulations are divided into two phases: (1) the training phase of AMADRL-JSSRC and (2) the testing phase for a comparative study with baseline algorithms. The experiment setting and training details are described in Section 4.1. The testing details of the comparison with the baseline algorithms are described in Section 4.2.

Experimental Environment and Details
We conduct the AMADRL-JSSRC using Python 3.9.7 and TensorFlow 2.7.0 over 10,000 episodes, and each episode is divided into 100 time slots. Then, AMADRL-JSSRC has tested over ten episodes, where the average values of the important metrics are calculated.
We use the same simulation settings as described in [31], and some details are supplemented here. We assume that the locations of sensors are assigned uniformly at random in the unit square [0, 1] × [0, 1], and the residual energy of each sensor is randomly generated between 10 and 20 J. The moving speed of MC is 0.1 m/s, and the energy consumption rate of moving unit distance is 0.1 J/s. The rate at which MC charges the sensor is 1 J/s, and the time that the MC returns to the depot to charge itself is ignored here. The main simulation settings are provided in Table 4. In addition, the relevant data of the real-time energy consumption rate of the sensor are shown in Table 5. The set of initial residual energy 10~20 J After the network environment is initialized, we will conduct simulation training on the environment. Our implementation uses an experience replay buffer of 10 5 . The size of the minibatch is 1024. As for the neural networks, all networks (separate policies and those contained within the centralized critic networks) use a hidden dimension of 128, and the Leaky Rectified Linear units are used as the nonlinear activation. We train our models with the Adam optimizer [44] and set different initial learning rates when the network size is different. The key parameters used in the training stage are described in Table 6. We have trained our model for three different environment settings on four NVIDIA GeForce GTX 2080ti for 10 h, after which the observed qualitative differences between the results of consecutive training iterations were ignored. We present one set of experimental results to describe the relationship between episodes and reward, which is shown in Figure 4. We can see that obtained rewards increase slowly through episodes to reach peak values after 240 training episodes. This is mainly caused by the efficient learning of AMADRL-JSSRC to the WRSN with dynamic energy changes so that agents could make reasonable decisions to obtain a greater reward.

Comparison Results against the Baselines
In this section, we compare the performance of the AMADRL-JSSRC with that of the ACRL algorithm, the GREEDY algorithm, the dynamic programming algorithm, and two typical online charging schemes algorithms NJNP and TSCA [16]. The detailed execution process of the above algorithms is shown in [31]. It is noted that some details of the baseline algorithms need to be adjusted.  Since the reward discount factor and penalty coefficient have a great impact on the performance of the algorithm in the training and testing process, we have made two sets of experiments, and the results are shown in Tables 7 and 8, respectively. These experimental results prove that AMADRL-JSSRC will strive for a long-term reward rather than a shortsight reward when the reward discount factor approaches 1. Furthermore, with the increase in the penalty coefficient, the number of dead sensors gradually decreases. The reason is that in order to obtain a high global charging reward, AMADRL-JSSRC preferentially charges the sensors with low residual energy to avoid sensor death when a large value is assigned to . Therefore, in this paper, the penalty coefficient is set as 10, and the reward discount factor is set as 0.9.

Comparison Results against the Baselines
In this section, we compare the performance of the AMADRL-JSSRC with that of the ACRL algorithm, the GREEDY algorithm, the dynamic programming algorithm, and two typical online charging schemes algorithms NJNP and TSCA [16]. The detailed execution process of the above algorithms is shown in [31]. It is noted that some details of the baseline algorithms need to be adjusted. For example, we have replaced the reward calculation equation in line 13 of Algorithm 1, line 6 of Algorithm 2, and the line 10 of Algorithm 3 described in [31] with r(S(k), a(k)) = d(k−1,k) + (− )N d (k), and we change their seeking rule from the minimum global reward to the maximum global reward.
We consider three networks with different scales, including 50, 100, and 200 sensors; these environments are denoted as JSSRC50, JSSRC100, and JSSRC200. We have run our tests on WRSNs based on these environments, and the corresponding MC capacity is set as 50, 80, and 150 J. In addition, the base time of these three tests is set as 100 s, 200 s, and 300 s, respectively. Unless otherwise specified, these parameters are fixed during the test.
The tour length, the extra time, and the number of dead sensors obtained via different algorithms based on different JSSRC environments are shown in Table 9. It is observed that when the network size is small, such as the network with 50 sensors, the exact heuristic algorithm is better than AMADRL-JSSRC and ACRL algorithms in terms of average tour length and the average number of dead sensors. Meanwhile, the ACRL performance is slightly better than that of AMADRL-JSSRC at JSSRC 50. However, with the increase in network scale, the results of AMADRL-JSSRC and ACRL outperform the GREEDY, DP, JNJP, and TSCA significantly; the AMADRL-JSSRC and ACRL algorithms begin to show their superiority. The AMADRL-JSSRC algorithm is better than the ACRL algorithm, especially in the terms of the number of dead sensors. The reason for this phenomenon is that the charging ratio of the ACRL algorithm is fixed and cannot adjust adaptively according to the real-time charging demand, which will lead to some sensors becoming dead during the MC charging the selected sensors. The extra time comparisons are also presented in this table, where all the times are reported on one NVIDIA GeForce GTX 2080ti. We find that our proposed approach significantly improves the solution while only adding a small computational cost in runtime. Moreover, the extra time of AMADRL-JSSRC is longer than that of ACRL, verifying that multi-agent collaborative decision making consumes more computational cost.

Discussions
The impacts of the parameters on the charging performance, including the capacity of the sensor and the capacity of MC, and the performance comparison in terms of lifetime are discussed in this section. The test environment is set as 100 sensors, the baseline time is 300 s, the initial capacity of the sensor is 50 J, and the initial capacity of MC is 100 J. Meanwhile, the initial residual energy of sensors will change with the capacity of sensors. Since the baseline algorithms do not have the ability to adaptively control the charging ratio for each sensor, for a fair comparison, we introduce the optimal charging ratio from Table 4, which is named the charging coefficient in [31]. Therefore, the charging ratio of the baseline algorithms are ACRL ε = 0.7, GREEDY ε = 0.8, DP ε = 0.9, NJNP ε = 0.8, and TSCA ε = 0.8, respectively.

The Impacts of the Capacity of the Sensor
As depicted in Figure 5, NJNP has the lowest tour length, and the average tour length gradually decreases with the increase in the capacity of sensors. The reason is that with the increase in the capacity of the sensor, the number of sensors' charging requests will decrease on the premise of sufficient residual energy. Moreover, the charging for each sensor is also prolonged due to the increase in sensor capacity. Based on the fixed baseline time, the more time the MC spends on charging sensors, the less time it will spend on moving. Therefore, the average tour length decreases gradually. The fluctuation in Figure 5 is caused by the random distribution of sensor positions and the dynamic change of their energy consumption rate in each test experiment. Furthermore, the NJNP algorithm has the lowest tour length is because it preferentially charges the sensors close to MC. It is noted that the moving distance of AMADRL-JSSRC is slightly longer than that of ACRL. This is because AMADRL-JSSRC can determine different charging ratios for the selected sensors according to the real-time charging demand to avoid the punishment caused by the dead sensors. Therefore, AMADRL-JSSRC spends slightly less time on charging than ACRL. When the base time is fixed, more time will be spent on moving, resulting in a longer moving distance. Meanwhile, the initial residual energy of sensors will change with the capacity of sensors. Since the baseline algorithms do not have the ability to adaptively control the charging ratio for each sensor, for a fair comparison, we introduce the optimal charging ratio from Table 4, which is named the charging coefficient in [31]. Therefore, the charging ratio of the baseline algorithms are ACRL = 0.7, GREEDY = 0.8, DP = 0.9, NJNP = 0.8, and TSCA = 0.8, respectively.

The Impacts of the Capacity of the Sensor
As depicted in Figure 5, NJNP has the lowest tour length, and the average tour length gradually decreases with the increase in the capacity of sensors. The reason is that with the increase in the capacity of the sensor, the number of sensors' charging requests will decrease on the premise of sufficient residual energy. Moreover, the charging for each sensor is also prolonged due to the increase in sensor capacity. Based on the fixed baseline time, the more time the MC spends on charging sensors, the less time it will spend on moving. Therefore, the average tour length decreases gradually. The fluctuation in Figure  5 is caused by the random distribution of sensor positions and the dynamic change of their energy consumption rate in each test experiment. Furthermore, the NJNP algorithm has the lowest tour length is because it preferentially charges the sensors close to MC. It is noted that the moving distance of AMADRL-JSSRC is slightly longer than that of ACRL. This is because AMADRL-JSSRC can determine different charging ratios for the selected sensors according to the real-time charging demand to avoid the punishment caused by the dead sensors. Therefore, AMADRL-JSSRC spends slightly less time on charging than ACRL. When the base time is fixed, more time will be spent on moving, resulting in a longer moving distance.  Figure 6 shows that with the increase in the capacity of the sensor, the average number of dead sensors shows an opposite change to the average tour length. Obviously, the average number of dead sensors of AMADRL-JSSRC is always the smallest. This is because the optimal charging ratios of the baseline algorithms are fixed, while the baseline algorithms are fixed. With the increase in the capacity of the sensor, the charging ratio for each selected sensor will be prolonged, increasing the risk of subsequent low residual energy sensor death. This result proves that the adaptive control of the charging ratio for each sensor could improve the charging performance effectively for the network.  Figure 6 shows that with the increase in the capacity of the sensor, the average number of dead sensors shows an opposite change to the average tour length. Obviously, the average number of dead sensors of AMADRL-JSSRC is always the smallest. This is because the optimal charging ratios of the baseline algorithms are fixed, while the baseline algorithms are fixed. With the increase in the capacity of the sensor, the charging ratio for each selected sensor will be prolonged, increasing the risk of subsequent low residual energy sensor death. This result proves that the adaptive control of the charging ratio for each sensor could improve the charging performance effectively for the network.

Performance Comparison in Terms of Lifetime
We have analyzed the test results of six schemes under the fixed baseline lifetime. In this section, we explore the lifetime of the six schemes under different JSSRCs, which are JSSRC50, JSSRC100, and JSSRC200, until the termination condition is satisfied. These six algorithms have run 50 times independently, and the test results are shown in Figures 9-11. It can be seen from the figure that although the fluctuation range of network lifetime obtained by the AMADRL-JSSRC algorithm is large, the lower and the upper bounds of network lifetime are still higher than the other five algorithms significantly. Moreover, with the increase in the number of sensors, this performance is outstanding significantly. It is noted that the network lifetime obtained by the AMADRL-JSSRC is better than the ACRL, which further proves that adjusting the charging ratio adaptively for each sensor could prolong the network lifetime effectively.

Performance Comparison in Terms of Lifetime
We have analyzed the test results of six schemes under the fixed baseline lifetime. In this section, we explore the lifetime of the six schemes under different JSSRCs, which are JSSRC50, JSSRC100, and JSSRC200, until the termination condition is satisfied. These six algorithms have run 50 times independently, and the test results are shown in Figures 9-11. It can be seen from the figure that although the fluctuation range of network lifetime obtained by the AMADRL-JSSRC algorithm is large, the lower and the upper bounds of network lifetime are still higher than the other five algorithms significantly. Moreover, with the increase in the number of sensors, this performance is outstanding significantly. It is noted that the network lifetime obtained by the AMADRL-JSSRC is better than the ACRL, which further proves that adjusting the charging ratio adaptively for each sensor could prolong the network lifetime effectively.

Performance Comparison in Terms of Lifetime
We have analyzed the test results of six schemes under the fixed baseline lifetime. In this section, we explore the lifetime of the six schemes under different JSSRCs, which are JSSRC50, JSSRC100, and JSSRC200, until the termination condition is satisfied. These six algorithms have run 50 times independently, and the test results are shown in . It can be seen from the figure that although the fluctuation range of network lifetime obtained by the AMADRL-JSSRC algorithm is large, the lower and the upper bounds of network lifetime are still higher than the other five algorithms significantly. Moreover, with the increase in the number of sensors, this performance is outstanding significantly. It is noted that the network lifetime obtained by the AMADRL-JSSRC is better than the ACRL, which further proves that adjusting the charging ratio adaptively for each sensor could prolong the network lifetime effectively.

Conclusions
In this paper, a novel joint charging sequence scheduling and charging ratio control problem is studied, and an attention-shared multi-agent actor-critic-based deep reinforcement learning approach (AMADRL-JSSRC) is proposed, where a charging sequence scheduler and a charging ratio controller are employed to determine the target sensor and charging ratio by interacting with the environment. AMADRL-JSSRC trains decentralized policies in multi-agent environments, using a centralized computing critic network to share an attention mechanism, and it selects relevant policy information for each agent. Meanwhile, the AMADRL-JSSRC performance significantly prolongs the lifetime of the WRSN and minimizes the number of dead sensors, and the performance is more significant when dealing with large-scale WRSNs. In future work, the multi-agent reinforcement learning approach for multiple MCs to complete the charging tasks jointly is the key point for further study.

Conclusions
In this paper, a novel joint charging sequence scheduling and charging ratio control problem is studied, and an attention-shared multi-agent actor-critic-based deep reinforcement learning approach (AMADRL-JSSRC) is proposed, where a charging sequence scheduler and a charging ratio controller are employed to determine the target sensor and charging ratio by interacting with the environment. AMADRL-JSSRC trains decentralized policies in multi-agent environments, using a centralized computing critic network to share an attention mechanism, and it selects relevant policy information for each agent. Meanwhile, the AMADRL-JSSRC performance significantly prolongs the lifetime of the WRSN and minimizes the number of dead sensors, and the performance is more significant when dealing with large-scale WRSNs. In future work, the multi-agent reinforcement learning approach for multiple MCs to complete the charging tasks jointly is the key point for further study.

Conclusions
In this paper, a novel joint charging sequence scheduling and charging ratio control problem is studied, and an attention-shared multi-agent actor-critic-based deep reinforcement learning approach (AMADRL-JSSRC) is proposed, where a charging sequence scheduler and a charging ratio controller are employed to determine the target sensor and charging ratio by interacting with the environment. AMADRL-JSSRC trains decentralized policies in multi-agent environments, using a centralized computing critic network to share an attention mechanism, and it selects relevant policy information for each agent. Meanwhile, the AMADRL-JSSRC performance significantly prolongs the lifetime of the WRSN and minimizes the number of dead sensors, and the performance is more significant when dealing with large-scale WRSNs. In future work, the multi-agent reinforcement learning approach for multiple MCs to complete the charging tasks jointly is the key point for further study.
Author Contributions: Conceptualization, C.J. and Z.W.; methodology, J.L.; data curation, S.C. and H.W.; investigation, J.X.; project administration, W.X. All authors have read and agreed to the published version of the manuscript.