Deep Graph Reinforcement Learning Based Intelligent Traffic Routing Control for Software-Defined Wireless Sensor Networks

Abstract: Software-defined wireless sensor networks (SDWSN), where the data and control planes are decoupled, are more suited to handling big sensor data and effectively monitoring dynamic environments and events. To overcome the limitations of using static routing tables under high traffic intensity, such as network congestion, high packet loss rate, and low throughput, it is critical to design intelligent traffic routing control for SDWSNs. In this paper we propose a deep graph reinforcement learning (DGRL) model-based intelligent traffic control scheme for SDWSNs, which combines graph convolution with deterministic policy gradient. The model fits the task of intelligent routing control for the SDWSN well, as the process of data forwarding can be regarded as sampling from a continuous action space and the traffic data has strong graph features. The intelligent control policies are made by the SDWSN controller and implemented at the sensor nodes to optimize the data forwarding process. Simulation experiments performed on the Omnet++ platform show that, compared with existing traffic routing algorithms for SDWSNs, the proposed intelligent routing control method can effectively reduce packet transmission delay, increase packet delivery ratio, and reduce the probability of network congestion.


Introduction
With major technological advances in communication, computing, and sensing, sensor networks play important roles in modern society. Sensor nodes collect data from the environment or from devices, which can be used to monitor environments and to develop and implement intelligent control systems, such as smart cities, smart factories, and intelligent surveillance systems. For example, there are about 70 million surveillance cameras in the U.S. While the fast-growing number of sensors provides the data needed for big data analytics and intelligence, the massive data traffic also presents a big challenge for data transport and networking, e.g., increased network congestion and poor network quality of service. One of the promising networking approaches to tackle these challenges is software-defined networking (SDN) [1]. In the SDN paradigm, the control plane is decoupled from the data plane to provide flexible traffic control and simplify network operation and management. An investigation of the SDN technology for the Internet of Things (IoT) was reported in [2,3].
Intelligent traffic routing control for SDWSN controllers is critical and very challenging, as it needs to be effective, adaptive, and reliable. Several studies discuss routing control for SDWSNs, such as SDN-WISE [4] and IT-SDN [5]. The performance of SDN-WISE is evaluated by considering six nodes in a linear topology, where only one node generates a data packet at a given time. The network in IT-SDN is larger, and every node transmits one packet per minute. The controllers of these approaches determine the path from one node to another using only the Dijkstra algorithm. The Dijkstra algorithm is effective when the topology is small and the network is under light traffic. However, it leads to network abnormalities such as slow response in the case of heavy traffic [6]. For example, in Figure 1, the Dijkstra algorithm will select node 5 as the relay node of multiple nodes, so node 5 must transmit data packets to the sink node on behalf of multiple nodes. Congestion will occur at node 5 under heavy traffic, and the packet loss rate and delay of the network will then rise sharply. The routing architecture proposed by Shanmugapriya and Shivakumar [7] combines context-aware and policy-based routing modules; the controller chooses the finest next hop depending on context information such as CPU load, service information, and power levels. In [8], the authors introduced reinforcement learning into SDWSNs by designing a quadruple reward function and combining it with the Q-learning algorithm. However, Q-learning is suitable only for discrete, low-dimensional action spaces: when there are many states, the Q table becomes very large, and searching and storing it consumes a great deal of memory.
In view of the research gaps, we propose in this paper a deep graph reinforcement learning (DGRL) model-based intelligent traffic control scheme for SDWSNs, which combines graph convolution and deterministic policy gradient. Reinforcement learning is a category of machine learning in which agents learn and make decisions by interacting with the environment according to the current state [9]. Compared with deep learning, reinforcement learning has better real-time performance. In a general reinforcement learning (RL) algorithm, the quality of an action is determined by the reward value received after the policy is executed. Over time, the agent becomes able to make the decision with the best reward based on experience. The model fits the task of intelligent routing control for the SDWSN well, as the process of data forwarding can be regarded as sampling from a continuous action space and the traffic data has strong graph features. In the SDWSN data forwarding and traffic control scenario studied in this paper, the controller learns the packet forwarding policy according to the current operating conditions. It then makes the corresponding routing policy, changes the routing paths, and generates the optimal forwarding policy through real-time iterations. The intelligent control policies are then implemented at the sensor nodes to optimize the data forwarding process. The main contributions of the paper can be summarized as follows:
• We propose a deep graph reinforcement learning (DGRL)-based framework for intelligent traffic control in SDWSN systems. By learning and optimizing the data forwarding policy, the SDN controller can provide adaptive and effective routing control for dynamic traffic patterns and network topologies.
• We designed an actor-critic network architecture for the DGRL model, which combines graph convolutional networks with the deterministic policy gradient. Reward functions for the reinforcement learning and a training method were developed for the DGRL model.
• Compared with traditional routing protocols, the proposed DGRL traffic control mechanism can effectively reduce the probability of network congestion, especially under high concurrent traffic intensity. Simulation experiments based on the Omnet++ platform show that, compared with existing traffic routing algorithms for SDWSNs, the proposed intelligent routing control method can effectively reduce packet transmission delay, increase the PDR (Packet Delivery Ratio), and reduce the probability of network congestion.
The rest of the paper is organized as follows. Section 2 summarizes the related literature in the research direction. Section 3 describes the system structure of the proposed deep reinforcement learning model DGRL in detail. We also discuss the specific training method and updating process when the simulation network is running. Section 4 shows detailed experimental results and presents performance evaluation. Section 5 presents discussion and challenges associated with DGRL. Section 6 concludes the paper and discusses future work.

Related Works
Several studies use traditional methods for intelligent routing control. The work in [10] introduces a node mobility prediction scheme to enhance network throughput: the controller predicts node mobility through machine learning and sends optimal route information to the data plane in preparation for upcoming link failures. In [11], a context-driven control mechanism is introduced in the SDN scenario, which improves the autonomous capability of the network. In [12], a hybrid network search path scheme is designed for intelligent traffic forwarding, where the Dijkstra and K-path forwarding algorithms are used under different network loads. The work in [13] introduces a probabilistic QoS routing mechanism for SDN to reduce bandwidth blocking, using Bayes' theorem and a Bayesian network model to determine link probabilities and select routes.
Artificial neural networks extract features from data by imitating the way neurons process input information, and they are a principal means of intelligent information processing today. With the increase of GPU computing power, end-to-end models based on deep learning have become state-of-the-art in the fields of computer vision, natural language processing, and reinforcement learning [14].
The application of deep learning and reinforcement learning to network traffic control and routing forwarding is a relatively new scenario that has received a lot of attention in recent years. Tang et al. [15] introduce a deep convolutional neural network into the Wireless Mesh Network (WMN): the traffic pattern of the sensor network is fed to the deep neural network as a multi-channel tensor for training, and the optimal routing strategy is obtained. In [16], the authors try to replace the original routing strategy by training multiple restricted Boltzmann machines, which can effectively reduce the data transmission delay and improve the overall transmission efficiency. However, this method is only suitable for small sensor networks; once the number of nodes increases, the number of neural networks to be trained also increases exponentially. In our previous work [17], we studied the combination of deep learning and WSNs with super nodes: through link reliability prediction, a routing decision algorithm is introduced to reduce the overall transmission delay and effectively improve the network lifetime. The work in [18] combines convolutional neural networks with restricted Boltzmann machines to compute routes for software-defined routers in wireless mesh sensor networks. The work in [19] trains on three-dimensional tensors formed from the time-series traffic patterns of nodes in the network, and explores the influence of deep versus shallow models on the training effect. Deep learning has also been used in scenarios such as intelligent channel allocation and traffic prediction [18,20]. However, because of the dynamic characteristics of data forwarding in software-defined sensor networks, offline methods such as pure deep learning cannot keep up with the changing network state.
In [21,22], the conversion of traditional routing rules into computational paradigms is investigated based on deep learning and reinforcement learning, respectively. Younus et al. [8] introduce reinforcement learning into the software-defined wireless sensor network: by designing a quadruple reward function and combining it with the Q-Learning algorithm [23], they effectively improve the energy efficiency of the SDWSN and prolong the lifetime of network nodes. Several routing planning algorithms have been proposed based on Q-Learning, such as the QELAR model proposed by Hu et al. [24], the SDWSN model proposed by Huang et al. [25], and the DACR routing algorithm proposed by Razzaque et al. [26]. These models are designed to improve the energy efficiency of nodes and the quality of service (QoS) of WSNs. The deterministic policy gradient (DPG) [27] applies to routing policies in continuous action spaces, while Q-Learning applies to finite state spaces in specific scenarios. Deep deterministic policy gradient (DDPG) [28] and deep Q network (DQN) [29] are products of the combination of deep learning and reinforcement learning, which effectively improves the feature expression and decision-making abilities of reinforcement learning. Liu et al. [30] introduced both DQN and DDPG into routing policy, which greatly improves the throughput of the network and makes the network load more balanced. Their experimental results show that DDPG has a better decision-making effect than DQN in continuous state spaces. Yu et al. [31] generated a routing policy by using DDPG to predict node connection weights, thereby reducing the network transmission delay. Abbasloo et al. [32] proposed the Orca model, which effectively addresses the congestion control requirements of the transport-layer TCP protocol: by integrating a deep policy gradient algorithm, they control the congestion window and the pacing rate of network data forwarding.
Due to the black box characteristics of deep reinforcement learning, Meng et al. [33] proposed the Metis framework. By introducing two different interpretation methods based on decision trees and hypergraphs, the DNN policy is transformed into an interpretable rule-based controller. To a certain extent, it demonstrates the feasibility of using deep reinforcement learning for network traffic control.
We compare the reinforcement-learning-based works discussed above and summarize them in Table 1.

System Design
In this section we describe the design of DGRL, a distributed traffic control algorithm model based on deep graph reinforcement learning. DGRL uses an experience pool for replay training. Each node in the model can optimize its own transmission path through online training and make the best next-hop decision. It is a lightweight, real-time routing control algorithm for data forwarding.

Problem Statement and Notations
An SDN-based WSN is represented by a topology adjacency matrix A. We assume that the controller receives timely updates of the network state S (e.g., channel delay, loss rate, and buffer occupancy of nodes). The controller generates a policy µ for the nodes to forward packets. The policy determines routing paths based on the network state: a_t = µ(S_t | θ^µ). The reward for taking action a_t in state S_t is represented by R_µ(a_t, S_t). The performance of policy µ is measured by the function Q(µ). The problem is defined as follows: given A and S, find a policy that determines the path for forwarding packets. Our goal is to find the optimal behavior policy µ that maximizes Q(µ): µ = argmax_µ Q(µ). Table 2 summarizes the important notations used in DGRL. Table 2. Terms and notations used in DGRL.

N — The number of nodes in the network
A — The adjacency matrix of the network
X — The feature matrix of all nodes
l — One-hot coding of the current node
d — One-hot coding of the target node
F — The number of features of nodes
µ — The output action decision vector of the Actor model
J — The number of packets received by their destination nodes
Ts_j — The time when packet j was sent by its source node
Tr_j — The time when packet j was received by its destination node
Pr_{n_i} — The number of packets received at node n_i as a destination
Pd_{n_i} — The number of packets dropped at node n_i
β_i — The ratio of the number of packets forwarded by node n_i to the total number of packets forwarded in the network
β̄ — The average of β_i

System Framework of DGRL
The system block diagram of the controller is shown in Figure 2. The Environment represents the network to which the controller is connected. The controller captures the current network status from the environment and uses the Online Policy in the Actor neural network to make decisions. The previous state, the action made by the controller, the reward value fed back by the environment, and the latest observed SDN state form a four-tuple record (LastState, Action, Reward, NewState). The four-tuple is stored in the experience pool for training multiple neural network models: the Online Q in the Critic NN and the decision network Online Policy in the Actor NN. The Online Policy is used to make decisions µ based on the state of the current environment (state → action), while the Online Q is used to fit the reward of the environment for the controller's decision ((state, action) → reward). The Online Policy and the Online Q each have a target network with exactly the same structure, named Target Policy and Target Q, respectively. The purpose is to use soft updates to help the training process converge and to avoid large gradient fluctuations during training.
In general, the Actor NN and the Critic NN each contain two parts, an online network and a target network. The Online Policy outputs real-time actions for the Actor NN to use. The Target Policy is used to update the Critic NN. The outputs of the Online Q and the Target Q are both the value of one state, but their inputs differ: the Online Q takes the actual action taken by the Actor NN and the current state as input, while the Target Q uses the output of the Target Policy. Details about the feedforward processes for the Actor NN and the Critic NN can be found in Section 3.3. The structure of the Actor neural network and the Critic neural network is shown in Figure 3, with the Actor network at the top of the figure and the Critic network at the bottom. The blue rectangles in Figure 3 represent tensor transformation operations, the golden circles represent neurons, and the gray circles represent tensors. The Actor model has four inputs: the topological adjacency matrix A of the network, the current features X of all nodes, the one-hot code l of the node where the data packet is currently located, and the one-hot code d of the destination node. The state of the current node and the states of its neighbors must both be considered when transmitting data packets. The Actor uses a graph convolutional neural network to extract and aggregate the state features of the current node and all its neighbors. Since it is highly correlated with data forwarding, the destination node must also be considered after extracting the network state. Therefore, we flatten the output of the GCN layers and concatenate the result with l and d. The vector state, which represents the state features of the network environment, is given by state = flatten(GCN(A, X)) ⊕ l ⊕ d, where GCN represents the graph convolution operation and ⊕ denotes concatenation. The detailed calculation of the GCN will be given in the next subsection.
The output µ of the Actor model is generated by a Hadamard product operation applied to the result of the fully connected layers and an inner product operation. The activation function of this layer is Softmax, and the output is the next-hop policy µ for the current data packet. The Critic model has five inputs: in addition to the four inputs of the Actor model, it also takes the next-hop action a obtained by the Actor model. The Critic model also contains GCN layers for extracting the characteristics of the controller and its neighbor nodes, but the weights of these GCN layers are not trainable; they are copied directly from the corresponding GCN layers of the Actor model. The purpose is to ensure that the state features of all nodes extracted by the two models are completely consistent. The Critic model also contains a fully connected layer, and the outermost layer uses a single neuron to fit the feedback value Q of the action policy adopted by the model.

Feedforward Calculation of Actor Neural Network
First, we give the feedforward process of the graph convolutional layers. We use A and X to denote the topological adjacency matrix of the sensor network and the state of the network observed by the controller, respectively. The number of nodes and the number of features are denoted by N and F, so A ∈ R^{N×N} and X ∈ R^{N×F}. We define the adjacency matrix with self-loops Â = I_N + A, where I_N is the identity matrix of order N, and the corresponding degree matrix D̂ ∈ R^{N×N} is the diagonal matrix with D̂_ii = Σ_j Â_ij. In the process of graph convolution, nodes with many neighbors would otherwise dominate the other nodes when aggregating features. To avoid this problem, we define the standardized adjacency matrix Ã = D̂^{-1/2} Â D̂^{-1/2}. After two graph convolutional layers, the output tensor H is obtained by H = σ(Ã σ(Ã X W_g^{(1)}) W_g^{(2)}), where σ is the activation function and W_g^{(1)} ∈ R^{F×C}, W_g^{(2)} ∈ R^{C×Z} are the weights that need to be trained in the Actor model. C and Z are the feature dimensions of the outputs of the two graph convolutions.
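The normalization step above can be sketched in a few lines of plain Python. This is an illustrative helper (not the authors' code) that builds Â = I_N + A, its degree vector, and the symmetrically normalized matrix Ã = D̂^{-1/2} Â D̂^{-1/2}:

```python
def normalized_adjacency(A):
    """Return the standardized adjacency matrix A_tilde for a 0/1 adjacency list-of-lists."""
    N = len(A)
    # A_hat = I_N + A: add a self-loop to every node
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(N)] for i in range(N)]
    # D_hat is diagonal with D_hat[i][i] = sum_j A_hat[i][j]; self-loops keep degrees >= 1
    deg = [sum(row) for row in A_hat]
    inv_sqrt = [d ** -0.5 for d in deg]
    # A_tilde[i][j] = A_hat[i][j] / sqrt(deg[i] * deg[j])
    return [[A_hat[i][j] * inv_sqrt[i] * inv_sqrt[j] for j in range(N)] for i in range(N)]
```

For a two-node line graph, every entry of Ã equals 0.5, showing how high-degree nodes are scaled down relative to their neighbors.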
In addition to the adjacency matrix A and the state matrix X, the Actor model also takes the position vector l ∈ R^{N×1} of the current node and the vector d ∈ R^{N×1} of the target node. Both vectors are one-hot codes, which represent the unique IDs of nodes.
The state vector of the current environment can be obtained by S = flatten(H) ⊕ l ⊕ d (4). In (4), the dimension of S is N × Z + 2N. The state vector S is used as the input of the fully connected neural network in the Actor model, and finally the corresponding action vector µ ∈ R^{N×1} is obtained at the output of the Actor. This vector determines the next hop after the current node receives the data packet. Since the network topology is not fully connected in most cases, it is necessary to use a mask vector to filter the action vector output by the Actor's fully connected layers. The vector mask represents the adjacency information of the node where the packet is located and can be calculated by mask = A^T l (5), where A^T is the transpose of the adjacency matrix. The purpose of the mask vector is to limit the decision space and avoid forwarding data packets to non-neighbor nodes.
The final output of the Actor model can be obtained by µ = Softmax(mask ⊙ (W_a S + b_a)), where ⊙ represents the Hadamard product, and W_a and b_a are the weights of the fully connected layer.
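The masking and softmax steps can be illustrated as follows. This is a hedged sketch with names of our choosing: mask = A^T l selects the neighbors of the current node, and the softmax is restricted so that non-neighbor entries receive zero probability:

```python
import math

def next_hop_distribution(A, l, logits):
    """Mask the Actor logits to the current node's neighbors, then softmax."""
    N = len(A)
    # mask_i = sum_j A[j][i] * l[j]: the row of A selected by the one-hot vector l
    mask = [sum(A[j][i] * l[j] for j in range(N)) for i in range(N)]
    # Hadamard product of mask and logits
    masked = [m * z for m, z in zip(mask, logits)]
    # softmax over reachable neighbors only; non-neighbors get probability 0
    exp = [math.exp(v) if m > 0 else 0.0 for v, m in zip(masked, mask)]
    total = sum(exp)
    return [e / total for e in exp]
```

With node 0 as the current node in a star topology, the distribution assigns zero probability to node 0 itself and splits the remaining mass over its neighbors.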

Feedforward Calculation of Critic Neural Network
It was mentioned in the last section that the weights of the GCN layers in the Critic model are replicated from the Actor network. The feedforward calculation of these GCN layers is identical to that in the Actor model and is not repeated here. After obtaining the output µ of the Actor model, we define a = µ + N_t, where N_t is a random disturbance term. The output y of the Critic model is then obtained by passing the concatenation S ⊕ a, where ⊕ represents the concatenate operation and S is the tensor output by the GCN layers, through the fully connected layers of the Critic. The model proposed in this paper is based on the deterministic policy gradient, and this framework is based on Q-values; the Critic neural network of DGRL is used for Q-value fitting. The quality of the policy can be expressed by the expected discounted return Q(S_t, a_t) = E[Σ_{k≥0} γ^k R_{t+k}] (8), where R_t = R(S_t, a_t) is the feedback value of the environment and γ is the time-series discount factor. Taking t = 2 as an example, the iterative formula for Q(S_2, a_2) can be obtained by approximating Equation (8) as Q(S_2, a_2) = R_2 + γQ(S_3, a_3) (9). Generalizing Equation (9) by induction gives the iterative formula Q(S_t, a_t) = R_t + γQ(S_{t+1}, a_{t+1}) (10). Define y_t = R_t + γQ*(S_{t+1}, a*_{t+1}), where a*_{t+1} is the output of the Target Policy and Q*(S_{t+1}, a*_{t+1}) is the output of the Target Q. Then the loss function of the Online Q can be defined as L(θ^Q) = (1/n) Σ_t (y_t − Q(S_t, a_t | θ^Q))², where θ^Q denotes all the weights that need to be trained in the Online Q model, Q(S_t, a_t) is the output of the Online Q model, and n is the batch size. At this point, replay can be performed by sampling the four-tuples (LastState, Action, Reward, NewState) from the experience pool of the DGRL model. The error back-propagation algorithm can then be used to train the Online Q network and update all its internal weights θ^Q. All the weights of the Online Policy are denoted by θ^µ. Its update differs slightly from that of the Critic network because the Online Policy has no explicit label data.
However, the deterministic policy gradient tells us that a good policy obtains a larger Q value. Given the gradient of the loss function of the Critic network and the output policy µ(S_t) of the Online Policy, the gradient for updating the policy weights can be obtained as ∇_{θ^µ} J ≈ (1/n) Σ_t ∇_a Q(S_t, a | θ^Q)|_{a=µ(S_t)} ∇_{θ^µ} µ(S_t | θ^µ), where a_t = µ(S_t | θ^µ) + N_t, and N_t is an Ornstein-Uhlenbeck stochastic process with zero mean [34]. It enables the agent to explore beyond the learned policy and prevents the network from falling into a locally optimal policy. The initial weights of the Target Q model and the Target Policy model, θ^{Q*} and θ^{µ*}, are copied directly from the corresponding online models: θ^{Q*} ← θ^Q; θ^{µ*} ← θ^µ. After the weights of the Critic and Actor models have been updated through training, the target models are updated with a soft update, as shown in Equation (13): θ^{Q*} ← τθ^Q + (1 − τ)θ^{Q*}; θ^{µ*} ← τθ^µ + (1 − τ)θ^{µ*}. The soft update helps the training process reach convergence.
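As a numeric illustration of the target computation, Critic loss, and soft update described above (function and variable names are ours, applied here to scalar weights rather than full networks):

```python
def td_target(reward, q_target_next, gamma=0.9):
    # y_t = R_t + gamma * Q*(S_{t+1}, a*_{t+1})
    return reward + gamma * q_target_next

def critic_loss(y_batch, q_batch):
    # L(theta_Q) = (1/n) * sum_t (y_t - Q(S_t, a_t))^2
    n = len(y_batch)
    return sum((y - q) ** 2 for y, q in zip(y_batch, q_batch)) / n

def soft_update(theta_target, theta_online, tau=0.01):
    # Equation (13): theta* <- tau * theta + (1 - tau) * theta*
    return [tau * o + (1 - tau) * t for t, o in zip(theta_target, theta_online)]
```

With a small tau (e.g., 0.01), each call moves the target weights only slightly toward the online weights, which is what damps gradient fluctuations during training.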

Design of Node Characteristics and Reward Function
In the process of deep reinforcement learning, the controller needs to observe the state of the system to obtain feedback after executing the forwarding policy. In this paper, the state matrix X_t ∈ R^{N×F} is composed of four features of all nodes: the number of connections per node, the average transmission delay of the channel (link quality), the packet loss rate of nodes, and the occupancy of node buffers.
The designed objective function guides the controller to forward the data packet towards a high-benefit goal, where high benefit is defined as shorter forwarding time, shorter forwarding path, and lower buffer occupation. The reward function in this paper therefore combines three normalized terms: delay_pre, the time taken by the data packet from the previous node to the current node; distance ∈ [0, 1], the relative distance of the data packet to its destination node, calculated as the ratio of the total hops of the shortest path from the packet to its destination to the total number of nodes (in a weighted network, the ratio of the shortest-path weight to the sum of the weights of all edges); and buffer ∈ (0, 1], the occupancy of the node buffer after receiving the data packet. The buffer of each node stores the data packets to be sent, and its occupation determines packet loss and congestion in the network. Normalizing the three items prevents the term with the largest order of magnitude from dominating the optimization, so the three objectives are optimized simultaneously. The design of R(t) ensures that R(t) ∈ [0, 1], which is equivalent to a standardization operation and beneficial to the training of the deep neural network. Note that when the data packet has reached the destination node, distance = 0 and R(t) = 1; when the data packet is discarded because the buffer queue is full or the TTL (Time to Live) expires, R(t) = 0.
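As an illustration only, the sketch below shows one normalized combination consistent with the constraints described above (R(t) ∈ [0, 1], three normalized terms, terminal cases handled explicitly). The equal 1/3 weighting is our assumption for the sketch, not the paper's coefficients:

```python
def reward(delay_pre, distance, buffer, arrived=False, dropped=False):
    """Illustrative reward: all three inputs are assumed normalized to [0, 1]."""
    if dropped:
        # packet discarded: full buffer queue or expired TTL
        return 0.0
    if arrived:
        # packet reached its destination (distance = 0)
        return 1.0
    # equal-weight combination (an assumption); lower delay/distance/buffer => higher reward
    return 1.0 - (delay_pre + distance + buffer) / 3.0
```

Because every term lies in [0, 1], no single objective can dominate the optimization, matching the standardization rationale above.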

Traffic Control Based on Deep Graph Reinforcement Learning
The traffic control model based on deep graph reinforcement learning includes two phases: a data collection stage while the simulation network is running, and a training stage. While the network is running, each node determines the next hop according to the current state of the WSN, following the policy distributed by the controller. After the forwarding task is performed, the next-hop node integrates the real-time state and reward value into a four-tuple record (LastState, Action, Reward, NewState) and stores it in the experience pool for training. For a detailed process description, see Algorithms 1 and 2.

Algorithm 1 Running phase
Input: batch size, θ^{µ*}
Load the weights θ^{µ*} of the Target Actor model
1: Generate data packet p, using θ^{µ*} and the stochastic OU process to determine its next hop a, and record it together with the current network state: p := {destination, location, state, a, data}
2: while Length(Experience Replay) < batch size do
3:   if packet p received then
4:     if p has arrived at its destination then
5:       …
     if not done then
12:    use θ^{µ*} to determine its next hop a

It should be noted in the Running Phase that θ^{µ*} is obtained by the shortest path method if the deep reinforcement learning neural network model has not yet started training. After collecting enough data, the Training Phase starts. The weights of all the deep graph neural networks in the model are updated by experience replay and gradient descent. After training, each node uses the updated weights to choose a better next hop for the packets in the Running Phase.
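The experience pool used in the Running Phase can be sketched as a bounded buffer of four-tuples from which training minibatches are sampled uniformly. Class name and capacity are illustrative choices, not the authors' implementation:

```python
import random
from collections import deque

class ExperiencePool:
    """Stores (LastState, Action, Reward, NewState) four-tuples for replay training."""

    def __init__(self, capacity=10000):
        # deque with maxlen discards the oldest records once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def store(self, last_state, action, reward, new_state):
        self.buffer.append((last_state, action, reward, new_state))

    def __len__(self):
        return len(self.buffer)

    def sample(self, batch_size):
        # uniform random minibatch for gradient-descent training
        return random.sample(self.buffer, batch_size)
```

The `while Length(Experience Replay) < batch size` guard in Algorithm 1 corresponds to checking `len(pool)` before the first call to `sample`.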
The weights of Policy network θ µ updated in the Training Phase will be stored and used for the packet forwarding in the Running Phase. At the same time, weights of Target Q model θ Q * and weights of Target Policy model θ µ * will be stored and used as the initial value for the next training phase.
After many iterations of data collection and training, the controller becomes smart enough to make the best next-hop policy for all packets passing through a node at the packet level. The policy simultaneously considers the transmission delay, the buffer occupancy, and the distance to the destination node. By combining the two phases, the data forwarding and traffic control policy based on deep graph reinforcement learning described in this paper can be trained online, and the Online Policy deep neural network used for decision-making can be updated in real time.

Evaluation
In this section, we evaluate our approach's performance under different traffic intensities and compare it with related algorithms. We first describe our simulation parameters, simulation environment, and evaluation metrics. Then, we train the model and evaluate its performance through a series of experiments.

Operation Platform
The simulation is carried out on OMNeT++ 4.6.0, and the deep learning model of DGRL is built with TensorFlow 1.14.0 in the following experimental environment: Intel(R) Core(TM) i5-8300H, 16 GB RAM, NVIDIA GeForce GTX 2060. Table 3 shows the parameter settings of the network. We evaluate several hyperparameters of the DGRL implementation to optimize the performance of the agent for our problem environment. In particular, we consider the following three hyperparameters: the soft-replacement ratio TAU, the learning rate of the Actor model a_lr, and the learning rate of the Critic model c_lr. Figure 4 presents the results of this evaluation. The settings for these three hyperparameters and some others can be found in Table 4. Table 5 gives the details of the Actor model and the Critic model.

Protocol Architecture
The protocol architecture of DGRL is shown in Figure 5. Network logic is dictated by the controller and the WISE-Visor. The adaptation layer between the WISE-Visor and the nodes is responsible for formatting messages received from the nodes so that they can be handled by the WISE-Visor, and vice versa. On top of the MAC layer, the forwarding (FWD) layer handles incoming packets as specified in the flow table, and it updates the flow table according to the configurations sent by the control plane. The In-Network Packet Processing (INPP) layer runs on top of the forwarding layer and is responsible for operations such as data aggregation or other in-network processing. The topology discovery (TD) layer uses beacon packets to help nodes discover their interconnected nodes. This part of the protocol structure is set according to [4].

1. Open Shortest Path First (OSPF): According to the directed graph of the network topology, the algorithm generates a shortest path tree as a static routing table. Nodes forward packets based on the flow table generated from this static routing table.
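For reference, the shortest-path-tree computation underlying this baseline is Dijkstra's algorithm; a compact sketch is shown below. The graph representation (dict of neighbor → weight) is an illustrative choice:

```python
import heapq

def dijkstra(graph, source):
    """Return shortest-path distances from source for a dict-of-dicts weighted graph."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path to u was already found
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

A static routing table built this way is exactly what makes the baseline inflexible: the tree is computed once from link weights and does not react to congestion at relay nodes.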

Performance Evaluation Metrics
During the experiments, the related metrics, including packet loss rate, average delay, and total number of packets forwarded, are obtained from OMNeT++. The notations involved in the following calculation formulas are explained in Table 2.

Average Network Delay
Delay refers to the average transmission time of all packets that reach their destination nodes: Delay = (1/J) Σ_{j=1}^{J} (Tr_j − Ts_j).
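Using the Table 2 notation (Ts_j and Tr_j as send and receive timestamps of the J delivered packets), the metric reduces to a one-line mean; function name is illustrative:

```python
def average_delay(send_times, recv_times):
    """Mean of (Tr_j - Ts_j) over the J packets that reached their destinations."""
    J = len(send_times)
    return sum(tr - ts for ts, tr in zip(send_times, recv_times)) / J
```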

PDR
PDR represents the ratio between the packets received by the destination nodes and the packets sent by the source nodes. This metric reflects the adaptability of a solution under different traffic intensities.
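A sketch of the metric in terms of the Table 2 counters Pr_{n_i} (received) and Pd_{n_i} (dropped). Treating "sent" as received plus dropped is an assumption of this illustration:

```python
def packet_delivery_ratio(received_per_node, dropped_per_node):
    """PDR = total packets delivered / total packets injected (delivered + dropped)."""
    received = sum(received_per_node)
    dropped = sum(dropped_per_node)
    return received / (received + dropped)
```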

Dispersion of Routing Load
Dispersion of routing load indicates the evenness of the load distribution in a network. A large dispersion indicates that the load in the network is unbalanced. A relatively fixed, single route produces this result, which leads to congestion at nodes and can even cause the network to crash.
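In terms of the Table 2 notation, this can be sketched as the spread of the per-node forwarding ratios β_i around their mean β̄. Using the population standard deviation as the dispersion measure is an assumption of this illustration:

```python
def routing_load_dispersion(betas):
    """Standard deviation of the per-node forwarding ratios beta_i around their mean."""
    mean = sum(betas) / len(betas)
    var = sum((b - mean) ** 2 for b in betas) / len(betas)
    return var ** 0.5
```

A perfectly balanced network (all β_i equal) gives a dispersion of 0; routing all traffic through one relay maximizes it.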

Training DGRL
For training purposes, we built six traffic intensity models ranging from 20% to 125%. The traffic intensity is reflected in the time interval between packet generations. Figure 6 shows the traffic model at 125% traffic intensity: Figure 6a shows the distribution of inter-packet time intervals, and Figure 6b shows the number of data packets sent. Dijkstra's shortest path algorithm is used to generate the initial packet transmission paths, as well as the relevant environment data, including the channel delay, per-node packet loss rate, buffer occupancy, and the resulting reward values.
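The training procedure combines an experience pool with mini-batch gradient updates over 200 episodes of 100 steps each. A minimal sketch of such a loop is shown below; the `env` and `agent` interfaces are hypothetical simplifications of the actual DDPG components and the OMNeT++ environment:

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-size replay buffer of (state, action, reward, next_state)."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random mini-batch; smaller if the pool is not yet full.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def train(env, agent, pool, episodes=200, steps=100, batch_size=32):
    """Experience-replay training loop: collect a transition per step,
    then update the agent's networks from a random mini-batch."""
    for _ in range(episodes):
        state = env.reset()
        for _ in range(steps):
            action = agent.act(state)
            next_state, reward = env.step(action)
            pool.store((state, action, reward, next_state))
            agent.update(pool.sample(batch_size))  # gradient descent on the batch
            state = next_state
```

Sampling uniformly from the pool breaks the temporal correlation of consecutive transitions, which is what makes the gradient updates stable despite the constantly changing environment.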
After the pre-training, the model is used to predict the transmission direction in the simulation environment. We collect the environment data and the corresponding reward values and store them in the experience pool. The weights of all neural networks in the model are updated through experience replay and gradient descent. The loop is repeated for 200 episodes of 100 steps each. Figure 7 shows the loss and reward during training. The loss curve shows an overall downward trend, though the loss occasionally increases during training. This can be explained by the difference between reinforcement learning and supervised learning on fixed data sets: the training data of reinforcement learning come from the experience pool, and the data in the experience pool are collected from a constantly changing environment. The loss curve fluctuates when the agent encounters a new region of the state space that leads to a better reward. At the beginning, the reward is low because the controller does not yet have enough knowledge about the network and is still exploring the environment. After some training episodes, the reward increases rapidly, reflecting that the routing policies made by the controller are able to guide packet forwarding toward better returns.

Figure 8a describes the packet delivery rate. With limited buffers, increasing traffic intensity makes packet loss more likely, which reduces the delivery rate and prevents the network from operating properly. Reinforcement learning helps DGRL use environmental data to select better routes and increase packet delivery. As can be seen from Figure 8a, DGRL achieves a better delivery rate than the other algorithms. At 125% traffic intensity, the PDR of DGRL is 5.1% higher than that of OSPF on average. The performance of DRL-OR is close to that of OSPF. From 75% traffic intensity onward, the PDR of DGRL-FC falls below 50%, while the PDR of the other algorithms remains above 50%.
Figure 8b shows the average transmission delay of the four algorithms. Each port of each node has a buffer of length 32. If the channel is free, packets are sent out immediately; otherwise, they are stored in the buffer and queued for transmission. In this experiment, the transmission delay of data packets mainly comes from the time they spend queued in node buffers. Under low traffic intensity, the average delays of the algorithms are similar, because the buffers are still mostly free and the queuing time is short. As the traffic intensity increases, buffer occupancy rises and buffers may fill up entirely; packets then spend more time traversing such nodes, and packet loss may occur. OSPF considers only the shortest path, so the forwarding path is fixed. With the smallest number of hops, its forwarding is efficient at low traffic intensity, and at 20% traffic intensity it has the best performance of all the algorithms. From 100% traffic intensity onward, however, its delay becomes the worst. DGRL-FC is similar to DGRL in that it considers node features, but it ignores the relationships between nodes and their neighbors, so its optimization is poor under high traffic intensity. DRL-OR takes traffic characteristics into account, alleviating congestion and reducing delay to a certain extent; however, its delay is higher than the others when the traffic intensity is low. Unlike the other algorithms, DGRL considers multiple factors in its routing decisions, including hop count and buffer occupancy, and is able to adjust forwarding routes flexibly. Under low traffic intensity, it forwards packets according to hop count; as the traffic intensity increases, buffer occupancy plays an increasingly critical role in routing decisions. Figure 8c shows the dispersion of network load.
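The tradeoff just described, with hop count dominating at low load and buffer occupancy dominating at high load, can be illustrated with a simple weighted next-hop score. The interface and weights below are hypothetical, for illustration only; DGRL learns this tradeoff through the policy network rather than using fixed weights:

```python
def choose_next_hop(neighbors, w_hop=1.0, w_buffer=1.0):
    """Pick the neighbor minimizing a weighted cost of remaining hops
    and buffer occupancy. `neighbors` maps a node id to a tuple
    (hops_to_destination, buffer_occupancy in [0, 1])."""
    return min(
        neighbors,
        key=lambda n: w_hop * neighbors[n][0] + w_buffer * neighbors[n][1],
    )
```

With a high hop weight the rule reduces to shortest-path forwarding; raising the buffer weight makes it detour around congested neighbors, mirroring the behavior observed for DGRL in Figure 8b.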
In this experiment, the source and destination nodes of the data packets are both chosen at random. Therefore, any difference in network load dispersion comes from routing. The network using DGRL clearly has the smallest dispersion, while the network using OSPF has the largest. For example, at 125% traffic intensity, the load dispersion of the network using DGRL is 0.051, whereas those of OSPF, DRL-OR, and DGRL-FC are 0.062, 0.060, and 0.058, respectively. These results confirm DGRL's effectiveness in balancing network load. The reasons behind the lower dispersion are as follows. DGRL is an algorithm with memory; it learns from experience. Suppose that in a state s, an action a forwards a packet to a next hop whose buffer occupancy is high. The reward of this action will be low. When the agent is in state s again, it remembers that an alternative action a′ yields a higher reward than a, and therefore avoids forwarding packets to highly occupied nodes. In the model using OSPF, some nodes lie on the shortest paths of multiple source-destination pairs. These nodes bear more forwarding load than edge nodes, and the situation worsens as the traffic intensity increases.

Evaluation Results
In our experiments, newly received packets are dropped when the buffer is full. PDR is the most direct metric of packet loss. We designed a new experiment to examine the relationship between PDR and buffer size: the traffic intensity is fixed at 50%, and the buffer size ranges from 24 to 48. Figure 9 shows the PDR curves of each model. As the figure shows, buffer size has little effect on DGRL. When the buffer size is 32, as in the previous experiments, the PDR of DGRL is 77.9%. When the buffer size is reduced to 28 and 24, the PDR decreases by 0.7% and 1.4%, respectively. When the buffer size increases to 48, the PDR barely increases. This can be understood as DGRL making better use of the network buffers: a buffer of size 32 already meets its requirements. The PDR of the other three models is closely tied to buffer size. As the buffer size decreases from 32 to 24, the PDR of OSPF decreases by 4.2%, that of DRL-OR by 4.4%, and that of DGRL-FC by 7.1%. Increasing the buffer size to 48 yields an improvement ranging from 3.9% to 11% for the three models. This is clearly because the larger buffers reduce packet loss caused by buffer overflow.

Figure 10 shows the reward experiment, which reflects the influence of the different parameters of the reward on the trained model. In this experiment, the traffic intensity is fixed at 50%. In addition to the trained DGRL model, three variants of DGRL are considered: the reward function of DGRL-I considers delay_pre and buffer; DGRL-II considers buffer and hop; and DGRL-III considers delay_pre and hop. The performance of DGRL is clearly better than that of the three variants. Figure 10a shows that omitting the parameter buffer has the greatest impact on the PDR, causing a 23.6% drop compared with DGRL. Compared with DGRL, DGRL-I and DGRL-II also show decreases in PDR, similar to each other but smaller than that of DGRL-III.
Figure 10b shows that removing any of the three parameters leads to an increase in mean transmission delay. Figure 10c shows that buffer and delay_pre have a greater impact on dispersion than hop. In conclusion, all three parameters are indispensable to the reward.
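The ablation above can be made concrete with a sketch of a reward combining the three components. The negative weighted sum and the weight values are illustrative assumptions, not the paper's exact reward definition; the DGRL-I/II/III variants correspond to zeroing one weight each:

```python
def reward(delay_pre, buffer_occ, hops,
           w_delay=1.0, w_buffer=1.0, w_hop=1.0):
    """Reward penalizing the three components studied in Figure 10:
    previous-hop delay (delay_pre), buffer occupancy of the chosen
    next hop, and hop count. Larger (less negative) is better."""
    return -(w_delay * delay_pre + w_buffer * buffer_occ + w_hop * hops)
```

Dropping a component (setting its weight to zero) removes that penalty from the learning signal, so the trained policy stops optimizing for it, which is the degradation Figure 10 measures.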

Discussion
DGRL significantly improves the performance of WSNs. However, some challenges still need to be addressed.
The computational complexity of DGRL mainly comes from the graph convolution and reinforcement learning computations. DGRL's neural network consists of four GCN layers and three fully connected layers. The computational complexity of the neural network can be regarded as that of the matrix operations used to compute the features of the next layer. The complexity of a single-layer neural network can be expressed as O(|V| F F′), where |V| represents the number of nodes in the network topology, F represents the feature dimension of a node, and F′ represents the embedding dimension. Meanwhile, the computational complexity of the Dijkstra algorithm used in OSPF is only O(|V|²).
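The quoted per-layer cost comes from the dense matrix products of a graph-convolution layer. A minimal sketch using the standard GCN propagation rule is shown below; this rule is assumed for illustration rather than taken from the paper's implementation:

```python
import numpy as np

def gcn_layer(A_hat, X, W):
    """One graph-convolution layer: H = ReLU(A_hat @ X @ W).
    A_hat is the normalized adjacency matrix with self-loops
    (|V| x |V|), X the node features (|V| x F), W the weights
    (F x F'). The product X @ W costs O(|V| * F * F'), which
    dominates for dense features, giving the O(|V| F F')
    per-layer complexity quoted above."""
    return np.maximum(A_hat @ X @ W, 0.0)
```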
To avoid congestion, DGRL will choose paths with more hops but lower buffer occupancy. This policy leads to more forwarding operations, increasing the overall network load; however, it relieves the load pressure on centrally located nodes and nodes with relatively high degree. As shown in Figure 11a, at the same traffic intensity of 50%, the total number of forwarding operations under DGRL is higher than under OSPF. We selected several key nodes and counted the number of packets forwarded by each, as shown in Figure 11b. Compared with the other algorithms, the key nodes under DGRL forward the fewest packets. Since the main energy consumption of nodes in wireless sensor networks comes from sending and receiving operations, reducing the forwarding load of key nodes can prolong the network lifetime.

To verify the generalization of the proposed model, we performed training and evaluation on two additional network topologies. Figure 12 visualizes the original network topology and the two additional topologies, whose numbers of nodes increase to 17 and 24, respectively. Figure 13 shows the trend of the reward value during the two training processes: the reward increases rapidly and then gradually stabilizes. Table 6 compares the performance of DGRL and OSPF at a fixed traffic intensity of 50%. The analysis shows that the proposed algorithm achieves substantial improvements over OSPF across all metrics.

Conclusions
In the framework of software-defined sensor networks, we propose a network routing control method (DGRL) based on a graph convolutional network and DDPG. The new solution extracts the characteristics of the sensor network through the graph convolutional network and controls packet forwarding from the control plane through reinforcement learning. It improves the data delivery rate and reduces both the delay and the forwarding pressure caused by multihop packet transmission in the network. In the simulation experiments, DGRL is compared with DGRL-FC, DRL-OR, and OSPF. The results show that DGRL can effectively optimize the relevant network metrics and make full use of network resources. In future studies, we will improve the training of the DGRL deep learning model and investigate its performance in more network scenarios.