Prioritized Uplink Resource Allocation in Smart Grid Backscatter Communication Networks via Deep Reinforcement Learning

With the rapid increase in the number of wireless sensor terminals in smart grids, backscattering has become a very promising green technology. By means of backscattering, wireless sensors can either reflect ambient energy signals to exchange information with each other or capture those signals to recharge their batteries. However, the changing environment around wireless sensors, limited radio frequency resources and the various service priorities in uplink communications pose great challenges for resource allocation. In this paper, we put forward a backscatter communication model based on business priority and cognitive networking. To achieve optimal system throughput, an asynchronous advantage actor-critic (A3C) algorithm is designed to tackle the uplink resource allocation problem. The experimental results indicate that the presented scheme can significantly enhance overall system performance and satisfy the business requirements of high-priority users.


Introduction
With the rapid development of smart grids, a growing number of wireless sensors are deployed in power grids to sense and monitor transmission lines, substations, homes, etc. [1]. The devices in wireless sensor networks (WSNs) are autonomous and resource-limited. To extend battery life and reduce maintenance costs, WSNs require robust and intelligent approaches to data exchange, information security, resource allocation and strategy optimization. Specifically, each sensor node needs to select the optimal policy from a set of accessible policies. The Markov decision process (MDP) is a powerful tool for designing such adaptive algorithms and protocols, and [2] accordingly reviewed a number of applications of the MDP framework.
However, as the scale of WSNs continues to expand, wireless powered communication networks (WPCNs) have emerged to charge energy-constrained sensor devices. The authors of [3] derived the signal-to-noise-ratio (SNR) outage zone in an energy-free backscatter WPCN; simulation results indicated that the backscatter WPCN achieves a larger coverage range and a smaller SNR outage zone. It was also demonstrated that an ambient backscatter communication (AmBC) receiver can recover the desired information from a backscatter device (BD). The authors of [13] made an in-depth study of RF source signals and proposed a new cooperative AmBC system in which readers can decode the desired information from RF sources. In [14], a low-complexity evolutionary game algorithm was proposed so that secondary transmitters (STs) could select network access points and service patterns according to practical requirements.
The research on resource allocation in backscatter communication focuses on time and power resources. The authors of [15] introduced a backscattering auxiliary network composed of a hybrid access center and multiple BDs, and then proposed a transmission strategy optimization problem based on working mode selection and time scheduling. In [16], the authors proposed a two-stage Stackelberg game model to deal with the time scheduling problems in a cognitive AmBC network. In the first stage, the gateway, as the leader, adjusted the price to maximize its profit; in the second stage, the ST determined the backscattering time to maximize its utility. The authors of [17] proposed a hybrid transmission mode combining harvest-then-transmit (HTT) and backscatter communication, and achieved the optimal throughput of the secondary system by adjusting the time allocated to each mode. The authors of [18] considered power, power splitting ratio and time resources simultaneously; to ensure fairness and security, they proposed the optimization goal of maximizing the minimum throughput. The authors of [19] studied the throughput maximization problem of a full-duplex AmBC system, where the primary access point can simultaneously transmit the signal and receive the backscatter signal through full-duplex communication, and proposed an iterative method for joint time scheduling, power allocation and reflection coefficient adjustment.
Machine learning (ML) is a powerful solution to decision problems and has many applications in backscatter communication. In [20], the authors adopted a Q-learning algorithm to derive the optimal strategy of an AmBC system iteratively with only partial environmental information. The authors of [21] proposed a label signal detection method for an AmBC system, which classifies the energy characteristics of received signals with an ML classification algorithm. In [22], a dynamic spectrum access framework was designed for an RF-powered backscatter system, which maximized the system throughput through an online RL algorithm. The authors of [23] proposed a Q-learning method to explore the optimal working mode of an AmBC system in a fading channel environment. Because of energy constraints, BDs sometimes need to offload computing tasks to nearby computing servers through active transmission or low-power backscatter communication; the authors of [24] proposed a deep reinforcement learning (DRL) algorithm to implement the optimal offloading strategy in a hybrid offloading AmBC network.
However, the previous research work has not considered the priority of the BDs. The BDs are numerous and the spectrum resources and energy resources are limited. It is necessary to consider priorities of different BDs to ensure high-priority service quality in smart grid applications.

Multi-User Backscatter Communication Network in CR-Based Smart Grid
In this section, we consider a multi-user backscatter communication network in a CR-based smart grid, as shown in Figure 1. The network consists of a mobile edge computing (MEC) center, multiple primary users (PUs) and multiple RF-powered secondary users (SUs). The MEC holds the basic state information of the terminals and acts as the manager that makes the resource allocation strategy. The PUs, such as substations, are mainly responsible for transmitting control, management and other important information in the smart grid. A large number of sensors that sense information about the environment act as SUs, which have three working modes: backscatter, energy harvesting and active transmission.
Assume there are M PUs and N SUs in the system, and that the PUs access the network via frequency division multiple access to avoid interference. Both PUs and SUs need to transmit information to the receiver, i.e., the MEC. When a channel is occupied by a PU, an SU can either harvest energy or perform backscatter communication by superimposing its own signal on the RF signal of the PU. When the PU leaves the channel and it becomes idle, an SU can access the channel and consume its stored energy to transmit signals to the MEC actively. For SU n, e_n^h units of energy are captured per time slot in the energy harvesting mode, and d_n^b packets are transmitted per time slot in the backscatter mode. In the active transmission mode, SU n transmits d_n^a packets and consumes e_n^a units of energy per time slot.
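The per-slot bookkeeping of one SU under the three working modes can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the default values for e_n^h, d_n^b, d_n^a and e_n^a are our own assumptions.

```python
def step_su(queue, energy, mode, channel_busy,
            e_h=1, d_b=1, d_a=2, e_a=2, queue_max=10, energy_max=10):
    """Advance one SU by a single time slot; return (queue, energy, sent).

    Assumed illustrative parameters: e_h (energy harvested per slot),
    d_b / d_a (packets per slot in backscatter / active mode),
    e_a (energy consumed per active slot).
    """
    sent = 0
    if mode == "harvest" and channel_busy:
        energy = min(energy + e_h, energy_max)  # capture the PU RF energy
    elif mode == "backscatter" and channel_busy:
        sent = min(d_b, queue)                  # reflect the PU signal
        queue -= sent
    elif mode == "active" and not channel_busy and energy >= e_a:
        sent = min(d_a, queue)                  # spend stored energy on an idle channel
        queue -= sent
        energy -= e_a
    return queue, energy, sent
```

For example, an SU with 5 queued packets and 3 energy units that transmits actively on an idle channel sends 2 packets and drops to 1 energy unit.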

User Scheduling Based on Priority
Because the business activities in a smart grid differ, SUs are classified into priority classes j (j = 1, 2, ..., J) according to the importance of the information to be transmitted. SUs that transmit important information have a higher priority, while those that transmit delay-tolerant information have a lower priority. By prioritizing, the information of high-priority users is transmitted first and the data of low-priority users are transmitted later, so the communication delay of high-priority users is greatly reduced. The initial weight of SU n in class j is denoted by w_{o,n}^j, which satisfies w_{o,n}^1 > w_{o,n}^2 > ... > w_{o,n}^J. Moreover, when an SU senses an emergency in the smart grid, the MEC promotes its priority so that the emergency information is sent as soon as possible. Packets of emergency messages automatically move to the head of the data cache queue; that is, they are transmitted first when the user transmits data. w_e denotes the emergency weight. Therefore, the real priority weight of SU n at time t, including the initial weight and the emergency weight, is expressed as

w_n(t) = w_{o,n}^j + w_e · 1{E_n(t) = 1}, (1)

where 1{·} is an indicator function. E_n(t) = 1 indicates that an emergency has occurred; otherwise, E_n(t) = 0. Once the emergency packets have been sent, the MEC resets the real priority weight of the SU to its original priority. In particular, w_e is chosen so that an SU of priority class j + 1 sending emergency information has a higher real priority weight than an SU of class j. This priority-based user scheduling strategy ensures that important and urgent information in the smart grid is transmitted first, reduces the communication delay of these data, and is of great significance for accelerating the decision-making of the smart grid monitoring center.
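The real priority weight can be illustrated with a short sketch; the class weights and the value of w_e below are our own illustrative choices, picked only so that a lower-class SU with an emergency outranks a higher-class SU without one.

```python
def real_priority_weight(w_init, emergency, w_e=10.0):
    """w_n(t) = w_{o,n}^j + w_e * 1{E_n(t) = 1} (w_e assumed to be 10)."""
    return w_init + (w_e if emergency else 0.0)

# Assumed initial class weights satisfying w^1 > w^2 > w^3.
w_o = {1: 3.0, 2: 2.0, 3: 1.0}
```

With these values, a class-3 SU reporting an emergency has weight 1.0 + 10.0 = 11.0, which exceeds the weight 3.0 of a class-1 SU in normal operation, matching the scheduling rule in the text.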

Problem Formulation
To achieve the goal of maximizing throughput, we model the communication process as an MDP, represented by a tuple < S, A, P, R >, where S is the state space of the whole network, A is the action space of the MEC, and P and R denote the state transition function and the reward function, respectively.

The network state space consists of two parts: the SU state space and the PU state space. The state space of SU n is defined as

S_n = { (r_n, c_n, w_n(t)) ; 0 ≤ r_n ≤ R_n, 0 ≤ c_n ≤ C_n, w_n(t) ∈ { w_{o,n}^j, w_{o,n}^j + w_e } }, (2)

where r_n represents the queue state of SU n and R_n is the maximum length of the data queue of SU n; c_n represents the energy state and C_n denotes the maximum energy storage of the SU n battery; w_n(t) represents the weight of SU n and w_e is the emergency weight. The state space of PU m is

S_m^p = { b_m ; b_m ∈ {0, 1} }, (3)

where b_m = 0 represents primary user m leaving the channel; otherwise, the channel is occupied. Therefore, the network state space is expressed as

S = S_1 × ... × S_N × S_1^p × ... × S_M^p. (4)

The network action space can be viewed as the collection of the action spaces of all SUs. In each time slot, the action space of SU n can be expressed as

A_n = { (a_n, p_n) ; a_n ∈ {1, 2, 3, 4}, p_n ∈ {0, 1, 2, ..., M} }, (5)

where a_n = 1, 2, 3, 4 denotes that SU n keeps waiting, harvests energy, backscatters or transmits actively, respectively. p_n represents the result of channel allocation: p_n ≠ 0 means that channel p_n is assigned to SU n, and p_n = 0 means that no channel is assigned. To avoid interference, each idle PU channel can be allocated to only one SU for active transmission. The network action space can be expressed as

A = A_1 × A_2 × ... × A_N. (6)

The transition function is the state transition probability that the state changes from s to s' at time t when the agent takes an action,

P(s, a, s') = Pr( s_{t+1} = s' | s_t = s, a_t = a ). (7)

In particular, the environment we have established is deterministic: when the agent executes action a in state s, it transfers to s' with probability P(s, a, s') = 1.
Because the objective is to maximize the total throughput of all SUs, the design of the reward function is directly related to the number of packets sent. In addition, we introduce the real priority weight into the expression: the more important the business of an SU, the greater its real priority weight, so when primary channel resources are scarce, allocating a channel to a more important user yields a higher reward. The reward function is as follows,

R(s, a) = Σ_{n=1}^{N} w_n(t) d_n(t), (8)

where d_n(t) is the number of packets sent by SU n in time slot t (d_n^b in the backscatter mode and d_n^a in the active transmission mode). The problem in this paper is to maximize the throughput of the multi-user backscatter communication system through an optimal resource allocation policy,

max_{π ∈ Π} R(s, a), (9)

where π denotes a policy and Π is the policy space composed of all policies.
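The weighted-throughput reward and its effect on channel allocation can be illustrated as follows; the weights and packet counts are made-up examples, not values from the paper.

```python
def reward(weights, packets_sent):
    """R(s, a) = sum over all SUs of w_n(t) * d_n(t)."""
    return sum(w * d for w, d in zip(weights, packets_sent))

# With one idle channel and 2 packets per active slot (assumed values),
# serving the high-priority SU (w = 3) earns more reward than serving
# the low-priority SU (w = 1):
high_first = reward([3.0, 1.0], [2, 0])  # channel given to SU 1
low_first = reward([3.0, 1.0], [0, 2])   # channel given to SU 2
```

This is why, under scarce channel resources, the learned policy is pushed toward allocating channels to users with higher real priority weights.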

Algorithm Description
In large-scale user scenarios, users are classified by unsupervised clustering before resource allocation. Given a set of n entities X = {x_1, x_2, ..., x_n}, where each entity is represented by a feature vector, the goal of k-means clustering is to divide the n entities into k different clusters, assuming that each entity is assigned to exactly one cluster,

C(i) = l, i ∈ {1, 2, ..., n}, l ∈ {1, 2, ..., k}. (10)

The squared Euclidean distance is adopted as the distance between entities,

d(x_i, x_j) = Σ_{v=1}^{m} (x_{vi} − x_{vj})^2, (11)

where m represents the dimension of the feature vector. Define the sum of the distances between the entities and their cluster centers as the loss function,

W(C) = Σ_{l=1}^{k} Σ_{C(i)=l} || x_i − x̄_l ||^2, (12)

where x̄_l = (x̄_{1l}, x̄_{2l}, ..., x̄_{ml})^T represents the center of cluster l. The number of entities in cluster l is defined as

n_l = Σ_{i=1}^{n} I(C(i) = l), (13)

where I(C(i) = l) is the indicator function, which is 0 or 1. K-means clustering is then the optimization problem

C* = arg min_C W(C). (14)

In fact, finding the optimal solution of k-means clustering is an NP-hard problem, which is solved by an iterative method in practice. First, we select initial cluster centers (m_1, m_2, ..., m_k). Entities are assigned one by one to the cluster with the nearest center, so that the partition C minimizes the sum of the distances between the entities and the cluster centers to which they belong. The objective function of this step is

min_C Σ_{l=1}^{k} Σ_{C(i)=l} || x_i − m_l ||^2. (15)

The next step is to recalculate the mean of the entities in each cluster as the new cluster center according to the partition result of the previous step. The new cluster centers (m_1, m_2, ..., m_k) minimize the sum of the distances between the entities and their cluster centers,

min_{m_1, ..., m_k} Σ_{l=1}^{k} Σ_{C(i)=l} || x_i − m_l ||^2. (16)

The update formula for the mean m_l of cluster G_l with n_l entities is

m_l = (1 / n_l) Σ_{C(i)=l} x_i. (17)

The two steps are repeated until the partition no longer changes, which gives the clustering result. The pseudo-code for clustering is shown in Algorithm 1.

Algorithm 1: K-Means Algorithm for BDs Clustering.
Input: The set of entities to be clustered, X = {x_1, x_2, ..., x_n}; the number of clusters, k; the maximum number of iterations, T_max
Output: The clustering result C(i), i = 1, 2, ..., n
1 Select k entities randomly as the initial cluster centers m^(0) = (m_1^(0), ..., m_k^(0))
2 for t = 0, 1, ..., T_max do
3     Update the clustering result C^(t) according to formula (15)
4     Update the cluster centers m^(t+1) according to formulas (16) and (17)
5     if the partition C^(t) no longer changes then break
6 end
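The iteration in Algorithm 1 can be sketched in a few lines of NumPy. This is a simplified Lloyd-style implementation for illustration, not the authors' code; the random initialization and stopping test are as described in the algorithm.

```python
import numpy as np

def kmeans(X, k, T_max=100, seed=0):
    """Assign entities to the nearest center, then move each center to
    the mean of its cluster, until the partition stops changing or
    T_max iterations are reached."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = -np.ones(len(X), dtype=int)  # no partition yet
    for _ in range(T_max):
        # squared Euclidean distance from every entity to every center
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)      # nearest-center assignment
        if np.array_equal(new_labels, labels):
            break                             # partition unchanged: done
        labels = new_labels
        for l in range(k):
            if (labels == l).any():
                centers[l] = X[labels == l].mean(axis=0)  # mean update
    return labels, centers
```

On two well-separated point clouds, the returned labels recover the natural grouping regardless of which entities are drawn as initial centers.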

Clustering Evaluation
We use the silhouette coefficient to measure how appropriately the entities have been clustered by k-means. For entity x_i in cluster c_i, the mean distance between x_i and all other entities in the same cluster is defined as

a(i) = (1 / (|c_i| − 1)) Σ_{x_j ∈ c_i, j ≠ i} d(x_i, x_j), (18)

where d(x_i, x_j) is the squared Euclidean distance between entities x_i and x_j. Obviously, the smaller the value a(i), the better the assignment. For each entity x_i ∈ c_i, we define the smallest mean distance from x_i to all entities of any other cluster C as

b(i) = min_{C ≠ c_i} (1 / |C|) Σ_{x_j ∈ C} d(x_i, x_j). (19)

We now define the silhouette value of one entity x_i as

s(i) = (b(i) − a(i)) / max{ a(i), b(i) }, (20)

where −1 ≤ s(i) ≤ 1. As a(i) is a measure of how dissimilar x_i is to its own cluster, a small value means it is well matched; furthermore, a large b(i) implies that x_i is badly matched to its neighboring cluster. Thus, an s(i) close to 1 means that the entity is appropriately clustered; if s(i) is close to −1, entity i would be more appropriately placed in its neighboring cluster; and an s(i) near 0 means that the entity is on the border of two natural clusters. The mean of s(i) over all entities of a cluster measures how tightly grouped the points in that cluster are, and the mean over all entities measures how appropriately the entities have been clustered as a whole. The silhouette coefficient for a specific number of clusters k is therefore defined as

s̄(k) = (1 / n) Σ_{i=1}^{n} s(i), (21)

where −1 ≤ s̄(k) ≤ 1. The closer s̄(k) is to 1, the better the clustering. Therefore, we calculate the silhouette coefficient for different values of k and select the k corresponding to the maximum silhouette coefficient as the number of clusters.
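The silhouette computation can be sketched directly from the definitions of a(i), b(i), s(i) and s̄(k); this is our own small implementation using the squared Euclidean distance as in the text (a singleton cluster is given s(i) = 0 by convention).

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette over all entities: s(i) = (b - a) / max(a, b),
    where a(i) is the mean squared distance to the entity's own cluster
    and b(i) is the smallest mean squared distance to any other cluster."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    s = []
    for i in range(len(X)):
        own = (labels == labels[i])
        own[i] = False  # exclude the entity itself
        a = d[i, own].mean() if own.any() else 0.0
        b = min(d[i, labels == l].mean()
                for l in np.unique(labels) if l != labels[i])
        s.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return float(np.mean(s))
```

In the model selection loop one would evaluate `silhouette` on the k-means result for each candidate k and keep the k with the largest value.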

Deep Reinforcement Learning Algorithm
The target of our algorithm is to discover the actions that maximize the total throughput. General RL algorithms, such as Q-learning, are only applicable when the state-action space is small. To obtain the mapping from the large-scale state space S to the action space A, we adopt the A3C algorithm, which integrates the advantages of the policy-based and value-based methods. To accelerate convergence, A3C replaces experience replay with multi-threaded interaction: multiple threads learn from the environment simultaneously, the learning outcomes of all threads are aggregated in a global network, and the latest parameters are periodically retrieved from the global network to guide the next interaction with the environment. In this asynchronous way, the A3C framework eliminates the strong correlation present in experience replay data.
A3C maintains a policy function π(a_t | s_t; θ) with parameter θ and a value function approximation V(s_t; θ_v) with parameter θ_v. The agent updates the policy using the estimated value function. To reduce the variance of the estimate, the advantage estimate is adopted,

A(s_t, a_t; θ, θ_v) = U_t(θ_v) − V(s_t; θ_v), U_t(θ_v) = Σ_{i=0}^{k−1} β^i R_{t+i} + β^k V(s_{t+k}; θ_v), (22)

where U_t(θ_v) is the estimate of the state-action value, R is the reward in formula (8), and β ∈ (0, 1) is the discount coefficient. The neural network is trained to obtain approximations of the policy π(a_t | s_t; θ) and the value function V(s_t; θ_v). At the beginning, the global network parameters are the same for every actor-learner. Afterwards, multiple actor-learners use asynchronous gradient descent to train and optimize their neural networks; the learning processes are parallel and independent. Network parameters are updated with the calculated gradients, and each actor-learner sends its new parameters to the global network. In turn, the global network propagates the updated weight parameters to the actor-learners so that they share a common policy.
We define loss functions for the policy and the value estimate, respectively. The policy loss function is

f_π(θ) = log π(a_t | s_t; θ) A(s, a; θ, θ_v), (23)

and its gradient is expressed as

∇_θ f_π(θ) = ∇_θ log π(a_t | s_t; θ) A(s, a; θ, θ_v). (24)

The loss function for the value estimate is defined by

f_v(θ_v) = ( U_t(θ_v) − V(s_t; θ_v) )^2, (25)

and, similarly, the gradient of the value loss with respect to θ_v is expressed as

∇_{θ_v} f_v(θ_v) = −2 ( U_t(θ_v) − V(s_t; θ_v) ) ∇_{θ_v} V(s_t; θ_v). (26)

After the neural network is trained for thousands of episodes, the A3C algorithm gives the decision results; it then takes only seconds for the A3C algorithm to choose the best resource allocation action dynamically according to the network state. The pseudo-code for A3C is summarized in Algorithm 2.
Algorithm 2: A3C Algorithm for Resource Allocation. (Final step: perform asynchronous update of θ using dθ and of θ_v using dθ_v.)
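The n-step return and advantage behind these losses can be sketched with NumPy. This is a simplified, single-thread illustration of the computations, not the authors' TensorFlow code; β and the sample trajectory in the test are arbitrary.

```python
import numpy as np

def advantages(rewards, values, bootstrap, beta=0.9):
    """A_t = U_t - V(s_t), with U_t the discounted return bootstrapped
    by the critic's value of the state after the last step."""
    U = bootstrap
    adv = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        U = rewards[t] + beta * U   # U_t = R_t + beta * U_{t+1}
        adv[t] = U - values[t]
    return adv

def a3c_losses(log_pi, values, rewards, bootstrap, beta=0.9):
    """Policy loss -log pi(a|s) * A (minimized, i.e. ascent on f_pi)
    and squared-error value loss (U_t - V(s_t))^2, summed over the
    trajectory."""
    adv = advantages(rewards, values, bootstrap, beta)
    return -(log_pi * adv).sum(), (adv ** 2).sum()
```

In the full algorithm each worker thread accumulates the gradients of these two losses over its trajectory segment and then applies them asynchronously to the global θ and θ_v.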

Performance Evaluation
Sources [9,17,23,25] all address resource allocation in a single-user scenario. The multi-user scenario model is more complicated, as resource allocation must consider the interactions between users. The time scheduling strategies of multi-user AmBC systems were studied in [10,19,24,26]. RL and DRL have clear advantages in dealing with resource allocation problems in time-varying communication networks because they learn through interaction with the environment. One characteristic of smart grids is that services vary greatly, yet the previous literature on backscattering did not consider priorities between businesses; we therefore introduce a priority strategy into the resource allocation problem. In this section, we evaluate the performance of the proposed DRL algorithm by simulation. The backscatter scheme in [27], the HTT scheme in [17] and a random scheme are selected for comparison. The backscatter scheme uses backscattering but not active transmission, while the HTT scheme captures energy first and then transmits actively. The random scheme selects an action at random from the action space, which includes backscatter, energy harvesting and active transmission. To verify the performance of the proposed scheme, we consider convergence, adaptability to the environment and the priority policy.

Simulation Setting
We consider a network composed of an MEC and multiple users. The small-scale scenario has 3 PUs and 5 SUs, while the large-scale scenario has 18 PUs and 50 SUs. The locations of all users are randomly distributed, and there are ten time slots in a frame. The learning rate of both the actor network and the critic network is 0.001. The hardware environment is a laptop with an AMD Ryzen 5 2500U and 8 GB of memory; the software environment is PyCharm and TensorFlow 1.8 on Windows 10, with Python 3.5 as the programming language. The RMSProp optimizer is adopted to minimize the loss functions in the A3C algorithm. The simulation parameters for the proposed DRL algorithm are shown in Table 1.

Parameters | Values
The maximum length of the data queue | 10 packets
The maximum energy capacity | 10 units
The amount of data in each packet | 1 kbit
The probability of packet arrival | 0.9
The probability that the channel is idle | 0.5

Convergence Evaluation
In the convergence evaluation experiment, the probability of packet arrival is fixed at 0.9, and the probabilities of channel idleness and emergency are both 0.5. Throughput is an important performance indicator. Figure 2 shows the average throughput of the A3C algorithm and the comparison algorithms, where average throughput denotes the throughput of all SUs per frame. As shown in Figure 2, the proposed algorithm converges to a much higher value than the comparison algorithms, and its convergence is fast, reaching the convergence value in about 120 episodes. The backscatter and HTT schemes perform poorly because each relies on a single transmission mode. The random scheme performs better because, like the proposed algorithm, it uses both backscatter and active transmission.
For large-scale user scenarios, we start with clustering. Since the number of clusters must be chosen, we carry out a clustering evaluation experiment by computing the silhouette coefficient for different numbers of clusters. As shown in Figure 3, the silhouette coefficient reaches its maximum at k = 9. The maximum candidate value of k is 18, because the number of clusters must not exceed the number of PUs so that each cluster contains at least one PU. As described in the algorithm section, the larger the silhouette coefficient, the better the clustering result; therefore, k = 9 is selected for the next experiment.
As shown in Figure 4, we conducted comparative experiments in the large-scale user scenario. The experimental process has two steps: first, k-means clustering is carried out according to the location distribution of the users; then, the proposed DRL algorithm is used to find the optimal strategy within each cluster. As can be seen in the figure, the proposed DRL algorithm has an obvious advantage, with a convergence value of average throughput about 15 percent higher than that of the random scheme.

Adaptability to Environment
The dynamic and changeable power grid environment leads to great uncertainty in the generation time and quantity of information data. To evaluate the adaptability of the proposed DRL algorithm, we consider different scenarios by varying the packet arrival probability and the PU idle probability. In Figure 5, as the packet arrival probability increases, the proposed algorithm still outperforms the comparison algorithms because it allocates network resources more reasonably. In particular, when the packet arrival probability changes from 0.4 to 0.5, the growth in average throughput achieved by the DRL scheme is approximately four times that of the HTT or backscatter schemes. Figure 6 shows the effect of the PU idle probability on average throughput. The proposed DRL algorithm adapts better to the dynamic changes of the PU channel, whereas the single transmission modes are greatly affected by environmental changes. In particular, the performance of the backscatter scheme degrades as the primary user idle probability increases, because the backscatter scheme depends heavily on the presence of the primary user signal. The curve of the HTT scheme, however, first rises and then falls: energy harvesting requires the PU energy signal, while active transmission requires an idle PU channel.

Priority Policy Evaluation
In the system model, a user priority policy is introduced that gives priority to ensuring the communication needs of high-priority users. The SUs are divided into two classes: with priority and without priority. The PU idle probability is fixed at 0.5. Figure 7 shows the average throughput versus packet arrival probability for priority and non-priority SUs. As the packet arrival probability increases, the throughput of both priority and non-priority users increases; in particular, an SU with priority always maintains high throughput regardless of the packet arrival probability. Figure 8 shows the average throughput when all SUs have a fixed packet arrival probability of 0.9. The smooth trend of the two curves shows that the proposed algorithm adapts well to changes in the channel environment. The priority policy ensures that channel allocation favors high-priority users when channel resources are limited; therefore, the priority users achieve higher average throughput.

Conclusions and Future Research
In this paper, we propose a multi-user backscatter communication network architecture in a CR-based smart grid. The network architecture contains two transmission modes: backscattering, which requires a primary user energy signal, and active transmission, which requires an idle primary user channel. To improve network performance and increase system throughput, we formulate the resource allocation process as an optimization problem. In addition, we consider the business priorities of different users and introduce a priority strategy. A DRL scheme is then proposed to solve the problem, and k-means clustering is adopted as pre-processing for the DRL scheme in the large-scale user scenario, transforming the large-scale problem into several small-scale problems. The numerical results verify that the proposed scheme achieves greater system throughput with limited resources and gives priority to ensuring the throughput of high-priority users. In the future, we will study more flexible backscatter communication systems combined with device-to-device (D2D) communication. More factors will be considered, such as smart jamming attacks [20], concurrent transmission, channel conflicts [28], signal power, reflection coefficients and energy conversion efficiency. As for solution methods, communication systems match well with RL algorithms because of their interactive nature; therefore, we will give priority to DRL algorithms, which have great potential in communication scenarios.