Many-to-Many Data Aggregation Scheduling Based on Multi-Agent Learning for Multi-Channel WSN

Abstract: Many-to-many data aggregation has become an indispensable technique to realize the simultaneous execution of multiple applications with less data traffic load and less energy consumption in a multi-channel WSN (wireless sensor network). The problem of how to efficiently allocate a time slot and channel for each node is one of the most critical problems for many-to-many data aggregation in multi-channel WSNs, and this problem can be solved with the new conflict-free distributed scheduling method outlined in this paper. The many-to-many data aggregation scheduling process is abstracted as a decentralized partially observable Markov decision model in a multi-agent system. By embedding cooperative multi-agent learning technology, sensor nodes with group observability work in a distributed manner. These nodes cooperate and exploit local feedback information to automatically learn the optimal scheduling strategy, then select the best time slot and channel for wireless communication. Simulation results show that the new scheduling method has performance advantages when compared with existing methods.


Introduction
A WSN (wireless sensor network) is one of the most important technical means to realize IoT (Internet of Things) systems, and it is now widely applied in agriculture, industry, medicine, the military and other fields [1,2]. With the rapid development of technology, the capabilities of WSN hardware and software have been considerably enhanced, making it possible to run machine-learning-based programs on sensor nodes [3]. Meanwhile, the demand for and feasibility of deploying multiple different application tasks inside a single WSN have increased as well. In such application scenarios, multiple sinks are usually deployed in a network, and sensor data of interest are concurrently collected from multiple sources. For example, an HVAC (heating, ventilation, and air conditioning) system is a potential application of a multi-source multi-sink WSN [4]. The data collected by a certain temperature sensor may be simultaneously delivered to multiple sink nodes (heaters, air conditioning controllers), and a single sink node will possibly be interested in data from multiple source nodes. The rise of edge computing has also created demand for multiple sink nodes [5]. Many tasks, such as control, calculation, and storage, are migrated to the edge nodes closer to the local devices in the network, and this behavior helps to lighten the burden on the cloud [6].
The most common performance expectations of wireless communication in WSNs can be summarized as conflict-free communication, low latency, and low energy consumption [7,8]. Inspired by the fact that the energy consumption of sensor calculation is lower than that of wireless communication, researchers have utilized data aggregation to reduce the amount of data and the number of transmissions; this technique is helpful for achieving the performance expectations [9]. In order to support multiple sinks simultaneously collecting data with less data traffic load and less energy consumption, many-to-many data aggregation (or multi-sink data aggregation) has been developed and adopted in WSNs [10,11]. In addition, the emergence of multi-channel technology can help sensor nodes switch between different wireless channels to avoid wireless communication interference, and further improve network performance [12]. As one of the most critical problems for many-to-many data aggregation, the problem of how to efficiently allocate a time slot and channel for each node should be solved.
TDMA (time division multiple access), as a common non-competition technology, is widely applied to implement medium-access control in WSN data-collection applications [13,14]. TDMA enables conflict-free wireless communication, and it performs well in prolonging the network lifetime [15,16]. By inheriting the core concept of TDMA, both time and channel can be viewed as communication scheduling resources to be allocated to sensor nodes. The communication (data collection) period is divided into a certain number of time slots of exactly the same length, in which the specified nodes can perform wireless communication. The number of available wireless channels is determined by the adopted sensor device and the application requirements. The research problem of this paper is how to allocate a time slot and a wireless channel to each node and construct conflict-free many-to-many data aggregation scheduling for a multi-channel WSN. A cooperative multi-agent learning-based scheduling method is proposed in this paper; the main contributions can be summarized as follows:
• To the best of our knowledge, this is the first work to introduce the multi-channel WSN environment into the research of many-to-many data aggregation scheduling. The characteristics of this new type of scenario are sufficiently considered in this paper, such as the fact that an intermediate node may be assigned multiple transmission time slots and that some communication conflicts can be avoided by switching channels.
• The scheduling process of many-to-many data aggregation in a multi-channel WSN is formulated as a decentralized partially observable Markov decision process, as a result of summarizing the distinguishing features of its wireless communication. The nodes participating in wireless communication are viewed as agents, and the system state cannot be accurately obtained by the agents.
• Cooperative multi-agent learning is introduced to implement a new distributed scheduling method. Thanks to the property of group observability, a group of sensor nodes within one hop can attempt different behaviors and receive corresponding feedback. After accumulating adequate experience, the sensor nodes learn the best action strategy and select the most efficient time slot and channel for wireless communication.
To further understand the proposed method, it is necessary to clarify the mutual relationships among the aforementioned technical terms. Data aggregation scheduling allocates the time slot and channel resources to sensor nodes, while data aggregation, as a data-processing operation, is applied on the sensor nodes to reduce data traffic during transmission. Multi-agent learning is an intelligent algorithm running on the sensors that helps them learn the best scheduling policy and select the most efficient time slot and channel.
The rest of the paper is organized as follows: Section 2 compares and analyses the shortcomings of the existing research. Section 3 introduces the concerned system model and illustrates the problems to be solved in this paper. Section 4 explains the principle and components of the proposed many-to-many data aggregation scheduling method for a multi-channel WSN. Section 5 describes the simulation platform and analyses the simulation results in order to demonstrate the high performance of the proposed policy. Finally, Section 6 concludes the current work. The abbreviations of the utilized technical terms are listed in Appendix A.

Related Works
Existing data aggregation scheduling methods mainly focus on traditional WSNs with an exclusive wireless channel and a many-to-one communication mode [12,17]. Generally speaking, there are two types of existing scheduling methods in WSNs. Centralized-computing-based methods are normally operated on a sink node or a base station, which collects the global network information and computes a scheduling result with good performance [18]. Distributed methods are lightweight and deployed on sensor nodes [19]; the scheduling result is computed through the cooperation of many nodes with local information.
S. Kumar et al. propose a multi-channel TDMA scheduling algorithm with the objective of minimizing the total energy consumption in the network [20]. In order to alleviate collisions and support concurrent communications, multiple RF channels are utilized. The proposed heuristic algorithms offer computationally efficient scheduling operation, although they provide sub-optimal schedules for data gathering. J. Ma et al. study the continuous link scheduling problem in WSNs [21], in which each node is assigned continuous time slots so that the node only needs to wake up once in a scheduling cycle to complete its data-collection task. Many-to-many communication scheduling problems for battery-free WSNs were first addressed by B. Yao et al., who analysed the energy bottlenecks and proposed an energy-adaptive and bottleneck-aware scheduling algorithm [22]. Bagaa et al. proposed a cross-layer trusted data aggregation scheduling method for a multichannel WSN [23]. This method first constructs k disjoint paths from each source node to the sink node based on the aggregation tree, and then finds a conflict-free communication schedule according to the routing structure. Jiao et al. first proved that the data aggregation scheduling problem for multi-channel duty-cycle wireless sensor networks is NP-hard [24]; their research adopts the candidate-activity conflict graph and the feasible-activity conflict graph to describe the node scheduling relationships, and finally uses a coloring method to achieve efficient scheduling. Nevertheless, there are several common premises for achieving data aggregation scheduling with these centralized computing methods. First of all, a powerful base station has to take responsibility for collecting the global network information and computing a good scheduling plan. Once the network structure undergoes any change, the global network information must be collected again, and the scheduling algorithm has to be re-executed. Moreover, the time of all the network nodes must be synchronized with high precision in advance. It is difficult to meet such requirements in large-scale wireless sensor networks.
A few researchers have designed distributed data aggregation scheduling methods for multi-channel WSNs. B. Kang et al. [19] developed a distributed delay-effective scheduling method to solve the problem of time slot scheduling in duty-cycle wireless sensor networks. This method makes full use of duty-cycle technology to appropriately turn off node communication and sensing capabilities; the active time of the nodes is significantly reduced, and the lifetime of the network is considerably extended. Y. Lu et al. integrate an independent Q-learning technique into the exploration process of adaptive time slot scheduling for many-to-one applications; the scheduling gradually approaches the optimal result along with the execution of frames [25]. A cluster-based distributed data aggregation scheduling algorithm with multi-power and multi-channel is proposed by Ren M. et al. in [26], which puts the network nodes into multiple clusters and uses different power levels for intra-cluster communications and the communications among cluster heads separately. Moreover, the communication latency caused by conflicts is greatly reduced due to the allocation of multiple channels. In order to minimize the time slot length of multi-channel multi-hop wireless sensor networks, Lee et al. propose a conflict-free TDMA link scheduling method [27], using min-max optimization for the time-slot length and a sorting algorithm to minimize the end-to-end delay. Nevertheless, these scheduling methods are designed for communication patterns with a single sink, and they cannot be directly applied to many-to-many communication. Yu B. et al. consider the minimum-time aggregation scheduling problem in multi-sink sensor networks to support many-to-many data aggregation for the first time [28]; the bounds of the aggregation time are analyzed by a theoretical model, and a nearly constant approximation algorithm is proposed to solve the aforementioned problem. Saginbekov S. et al. [10] design a time-slot scheduling method with data aggregation for two sink nodes, but they do not discuss the feasibility and performance of their method in scenarios with more sinks. Meanwhile, the multi-channel environment is not taken into account in this research.
In conclusion, no existing work directly addresses many-to-many data aggregation scheduling for multi-channel WSNs; in particular, such a scheduling method needs to be implemented in a distributed manner to support a dynamic and extensible network environment.

System Model
A WSN can be abstracted as a graph G(V, L), where V and L denote the set of sensor nodes and the set of communication links (edges), respectively. Sensor nodes use a half-duplex transmission mode, in which one node cannot perform data transmission and data reception at the same time. If a pair of nodes v i ∈ V and v j ∈ V are located within the wireless communication range of each other, both links l i,j ∈ L and l j,i ∈ L exist in the network. The nodes located in the wireless communication range of v i are called its neighbors ngh(v i ). There are |CH| available wireless channels, and ch k denotes the kth channel. For simplicity, the protocol interference model is adopted in this system, and the communication radius r cm and the interference radius r it of each sensor node are set to the same value. Some sensors cannot transmit data simultaneously within the same wireless channel on account of communication conflicts. The utilized notations and variables are listed in Appendix B.
The sensing data produced on each source node are delivered to a set of sink or destination nodes; meanwhile, a sink node d i expects to collect the data from a set of source nodes. For example, sink node d 1 expects to collect sensing data from a set of source nodes, as shown in Figure 1. The intermediate nodes between source and sink nodes perform data aggregation and forward the processed results. The communication period is defined as a frame TS c , which consists of a fixed number of time slots ts. The fundamental task of the scheduling method is to allocate a time slot and wireless channel for each node without a communication conflict in order to maximize network performance. In this scenario, two kinds of potential conflict may appear in a network. The first one is a direct conflict, in which two or more links that share at least one terminal are allocated the same time slot; this overlapping terminal cannot concurrently handle two or more communication tasks in the same time slot, so a communication conflict appears. An example is depicted in Figure 2a: l i,k and l j,k have the same receiving terminal v k , so allocating the same time slot ts 1 leads to a communication conflict on v k . The second one is an indirect conflict: the receiving terminal of one link is located in the interfering range of the transmitting terminal of another link, and both links are allocated the same time slot and channel; then, an indirect communication conflict appears. An example is depicted in Figure 2b: v k is located in the interference range of v i ; once the same time slot ts 1 and channel ch 1 are allocated to both l i,h and l j,k , an indirect communication conflict happens on v k .
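The two conflict types can be expressed as simple pairwise checks over link allocations. The following Python sketch is illustrative only; the `Allocation` record and the `in_range` predicate are hypothetical names, not part of the system model:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Allocation:
    """Resource allocation for one link l_{i,j}: sender, receiver, slot, channel."""
    tx: int
    rx: int
    slot: int
    channel: int

def direct_conflict(a: Allocation, b: Allocation) -> bool:
    """Direct conflict: two links sharing a terminal in the same time slot.
    The channel is irrelevant because a node is half-duplex."""
    shares_terminal = bool({a.tx, a.rx} & {b.tx, b.rx})
    return shares_terminal and a.slot == b.slot

def indirect_conflict(a: Allocation, b: Allocation, in_range) -> bool:
    """Indirect conflict: the receiver of one link lies in the interference
    range of the other link's transmitter, with the same slot AND channel."""
    if a.slot != b.slot or a.channel != b.channel:
        return False
    return in_range(a.tx, b.rx) or in_range(b.tx, a.rx)
```

Under the protocol interference model of this section, `in_range` would simply compare the distance between two nodes against the common radius r cm = r it.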
In each time slot, the links without any conflict can perform wireless communication together. The links sharing the same time slot but having indirect conflicts can be allocated different wireless channels. Once a data packet is successfully transmitted to a receiver, the corresponding transmitter is supposed to obtain an acknowledgement (ACK) packet from this receiver within the same time slot and channel. An example is shown in Figure 1: solid lines represent the links with data transmission, and dashed lines indicate the links without data transmission, which are not on the data routing paths. Links l 1,6 and l 4,8 , which have no conflict, are allowed to communicate concurrently in time slot ts 1 on channel ch 1 . l 2,7 has to use channel ch 2 because it has an indirect conflict with l 1,6 , where v 7 is located in the interfering range of v 1 .

Optimization Objective
Even though the WSN routing protocol is not the main research content of this paper, our system model requires that each node possesses local routing information before many-to-many data aggregation scheduling. In order to maintain consistency with the scheduling optimization objective, MUSTER, a classical distributed routing protocol for many-to-many data aggregation, is adopted in our model [29], so that a routing structure with less transmission delay and less energy consumption can be constructed before allocating time slots and channels.
In this case, a node v i has knowledge of its upstream nodes US(v i ) and downstream nodes DS(v i ) in the routing structure. Thanks to the property of the data aggregation function, the data from multiple received packets towards the same sink can be combined into a single data copy; for example, v 8 in Figure 1 is able to combine the packets from v 4 and v 5 . In addition, the existence of multiple sinks may require one node to perform multiple transmission operations; for example, v 9 has to transmit two packets to different next-hop nodes. Figure 3 focuses on this data aggregation operation, where f (v 2 , v 3 ) represents the aggregation result of source nodes v 2 and v 3 ; this node receives three input packets, performs the data aggregation function, and finally generates two aggregation results as output packets. From the viewpoint of the global network, many-to-many data aggregation scheduling for an entire network in one frame allocates a time slot and channel to each link with a data transmission task, and the link-based scheduling set can be expressed as LS = { ls i,j , · · · }, which consists of the resource allocation sets ls i,j for each link, where |LS| is equal to the number of links |L|. ls i,j = ( l i,j , ts i,j , ch i,j ) denotes the resource allocation set for the link l i,j , including the allocated time slot ts i,j and the allocated channel ch i,j ; an example can be found in Figure 1, where ls 1,6 = ( l 1,6 , ts 1 , ch 1 ), ts 1,6 = ts 1 and ch 1,6 = ch 1 . There is a specified time period in one frame called the working window wd. During wd, a sensor node maintains an active state to conduct wireless communication, computation and other operations. wd is further divided into a reception slice wd r and a transmission slice wd t .
According to the feature of data aggregation, the current node switches on its radio receiver during wd r ; the data packets from upstream nodes may be received on any wireless channel, and the aggregated result is obtained at the end of wd r . After that, the current node starts to deliver the results to downstream nodes during wd t . For a node v i , the length of wd is equal to |wd r | + |wd t |, which is directly related to the number of upstream and downstream nodes. |wd r | covers the receptions from all upstream nodes US(v i ) plus an additional amount to enhance the success rate of packet reception. |wd t | is strictly equal to the number of downstream nodes |DS(v i )|, and each time slot allows one transmission. Outside its working window, the current node remains in an inactive or sleep state and temporarily switches off the power supply of its primary electronic units; this behavior helps to effectively save energy.
The allocation of many-to-many data aggregation for one node can be expressed as a scheduling tuple with two parameters (wd t .end, CH u ), where wd t .end denotes the end of the transmission slice and CH u denotes the channel usage set. Since the size of wd is fixed, wd t .end, as the last time slot, directly decides the location of wd in a frame; it also indicates which time slots are used for reception or transmission. CH u is a channel sequence {ch i,j , · · · } that specifies the channel for each transmission time slot. Figure 3 depicts an example of the scheduling operation on v 9 of Figure 2, where the scheduling tuple is (ts 5 , {ch 1 , ch 1 }).
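Given the scheduling tuple (wd t .end, CH u ) and the sizes of the upstream and downstream sets, the position of the working window can be derived mechanically. The following Python sketch illustrates this derivation; the `guard` margin stands for the additional reception amount and is an assumed parameter, not something the text quantifies:

```python
def working_window(wd_t_end, n_upstream, n_downstream, guard=1):
    """Derive the reception and transmission slots from the scheduling tuple.

    |wd_t| equals the number of downstream nodes (one transmission per slot);
    |wd_r| covers the upstream receptions plus a guard margin. wd_t.end, the
    last transmission slot, fixes the window position inside the frame.
    """
    # transmission slice: the last n_downstream slots, ending at wd_t_end
    wd_t = list(range(wd_t_end - n_downstream + 1, wd_t_end + 1))
    # reception slice: immediately before the transmission slice
    wd_r_len = n_upstream + guard
    wd_r = list(range(wd_t[0] - wd_r_len, wd_t[0]))
    return wd_r, wd_t
```

For the v 9 example with tuple (ts 5 , {ch 1 , ch 1 }), two downstream nodes place the transmission slice in slots 4 and 5, with the reception slice directly before it.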
Multiple optimization objectives of scheduling are considered in this paper, and these objectives can be alternated according to the real-life application demand; for example, communication delay should be decreased while the residual energy of nodes should be increased. Let η k represent the kth (or the last) objective function; then, the scheduling problem is expressed as argmin LS {ϕ(η 1 (LS), · · · , η k (LS))}, where ϕ denotes the overall objective function and RS denotes the routing structure (set). The solution should be subject to the following constraints:
1. Communication on each link l i,j ∈ RS is performed only once, so the allocation ls i,j of its time slot and channel is unique.
2. If l i,j ∈ RS, then ∀ l i,m ∈ RS or l n,j ∈ RS: ls i,m .ts i,m ≠ ls i,j .ts i,j and ls n,j .ts n,j ≠ ls i,j .ts i,j .
3. If l i,j ∈ RS and l n,m ∈ RS is an interfering link (its receiving terminal v m is located in the interference range of v i , or v j is located in that of v n ), then ls i,j .ts i,j ≠ ls n,m .ts n,m or ls i,j .ch i,j ≠ ls n,m .ch n,m .
4. If l i,j ∈ RS, then ∀ l n,i ∈ RS: ls i,j .ts i,j > ls n,i .ts n,i , i.e., an aggregated result is transmitted only after all expected data are received.
5. The transmission slice of the current node must be located after the transmissions of its upstream nodes and before those of its downstream nodes.
The first constraint requires that communication on each link can only be performed once, so the allocation of the slot and channel for one link ls i,j is unique. The second constraint indicates the avoidance of direct interference: when a certain link l i,j activates communication, any link sharing one of its terminals cannot perform communication in the same time slot. The third constraint indicates the avoidance of indirect interference: the links with mutual interference cannot simultaneously share the same time slot and the same channel. The fourth constraint is generated from the principle of data aggregation, in which an aggregated result is supposed to be transmitted only after all expected data are received. The last constraint states that the transmission operation of the current node should be located between those of its upstream nodes and its downstream nodes. It is evident that the essence of the scheduling problem is to find the best set of link allocations that satisfies the optimization goals and constraints. This is a typical combinatorial optimization problem that can be solved by reinforcement learning methods [30]. In addition to these constraints, transmission delay and energy consumption are selected as the optimization objectives of many-to-many data aggregation scheduling in this paper.
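A whole link-based scheduling set LS can be checked against the conflict constraints in one pairwise pass. The following Python sketch covers only the uniqueness, direct-conflict and indirect-conflict constraints (the ordering constraints additionally require routing information); the names `schedule` and `interferes` are illustrative:

```python
def valid_schedule(schedule, interferes):
    """Check a link-based scheduling set against the conflict constraints.

    `schedule` maps a link (tx, rx) to its allocation (slot, channel), so
    uniqueness of ls_{i,j} holds by construction; `interferes(u, v)` states
    whether node v lies in the interference radius of node u.
    """
    items = list(schedule.items())
    for idx, ((tx1, rx1), (ts1, ch1)) in enumerate(items):
        for (tx2, rx2), (ts2, ch2) in items[idx + 1:]:
            # direct conflict: a shared terminal scheduled in the same slot
            if ts1 == ts2 and {tx1, rx1} & {tx2, rx2}:
                return False
            # indirect conflict: same slot and channel, receiver in range
            if ts1 == ts2 and ch1 == ch2 and \
                    (interferes(tx1, rx2) or interferes(tx2, rx1)):
                return False
    return True
```

Applied to the Figure 1 example, the schedule with l 1,6 and l 4,8 on (ts 1 , ch 1 ) and l 2,7 moved to ch 2 passes, while placing l 2,7 on ch 1 fails.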

Decentralized Partially Observable Markov Decision Process
By summarizing the characteristics of many-to-many data aggregation scheduling for a multi-channel WSN, it is not difficult to find a match between this scheduling process and the decentralized partially observable Markov decision process (Dec-POMDP) [31]. A Dec-POMDP can be formulated as a tuple ⟨I, S, A, P, R, Ω, O, b, T⟩, whose components are described as follows:
• I = {1, 2, ..., |V|} is the set of agents; each sensor node participating in communication is viewed as one agent.
• S = S 1 × S 2 × ... × S |V| is a finite set of system or joint states, where s = {s 1 , s 2 , ..., s |V| }, s ∈ S; S i is the state set of the ith agent, which reflects whether the reception and transmission of packets on this node are successful. This information cannot be accurately acquired due to the environment of wireless communication.
• A = A 1 × A 2 × ... × A |V| is a finite set of joint actions, where A i is the action set of the ith agent. A change in the scheduling of the time slot and channel is realized by modifying the aforementioned tuple (wd t .end, CH u ).
• P(s′ | s, a) is the transition function, which denotes the probability of transitioning from the state s to the new state s′ when taking the joint action a.
• R(s, a) is the reward function, which denotes the immediate reward when taking the joint action a at the state s.
• Ω = Ω 1 × Ω 2 × ... × Ω |V| is a finite set of joint observations, where Ω i is the individual observation set of the ith agent and a joint observation is ω = {ω 1 , ω 2 , ..., ω |V| }, ω ∈ Ω. One observation ω i contains the size and number information of the successfully received and transmitted packets, and this information is part of the acknowledgement packet.
• O(ω | s′, a) is the observation function, which denotes the probability of observing ω when the system state transfers to s′ by taking the joint action a. Due to the wireless communication environment, the observation result may not truly reflect the system state: the reception of an ACK cannot ensure that the transmitted data contain no error, and the absence of an ACK cannot confirm that the receiving node did not obtain the data.
• b is the initial system state distribution (also called the initial belief), and T is the finite horizon, i.e., the number of time steps in which the agents can interact with the Dec-POMDP model.
In a specific system state s^t at time step t, a joint observation ω^t is generated. Each agent obtains its individual observation ω_i^t and selects its individual action a_i^t, which is a component of the joint action a^t. After the action is taken, the system transitions to the next state s^{t+1}, and each agent obtains its immediate reward r. The action-observation history of the ith agent is denoted as Φ_i^t = (a_i^0, ω_i^1, ..., a_i^{t−1}, ω_i^t), so the joint action-observation history is denoted as Φ^t = ⟨Φ_1^t, Φ_2^t, ..., Φ_|V|^t⟩. An agent policy uses the history to decide actions, denoted as π_i : Φ_i → A_i, and a joint policy π = ⟨π_1, π_2, ..., π_|V|⟩ is the combination of all individual policies. The final goal of solving a Dec-POMDP is to discover an optimal joint policy that maximizes the expected accumulated discounted reward; the state value function V^π(s) of a joint policy π from state s is defined as follows: V^π(s) = E[ Σ_{t=0}^{T−1} γ^t R(s^t, a^t) | s^0 = s, π ] (1), where γ is the discount factor that decides the importance or weight of future rewards; if γ = 0, only the current reward is considered in the value function. To obtain such a policy, a reinforcement-learning algorithm normally evaluates the action quality by the Q-function (Q-value function) Q(s^t, a^t), which is denoted as follows: Q(s^t, a^t) = R(s^t, a^t) + γ Σ_{s^{t+1}} P(s^{t+1} | s^t, a^t) max_{a^{t+1}} Q(s^{t+1}, a^{t+1}) (2). However, the agents cannot obtain the accurate system state s, so the basic edition of Q-learning cannot be directly applied to a Dec-POMDP. In this case, the action-observation history replaces the system state, and the update rule of the Q-value can be denoted as follows: Q(Φ^t, a^t) ← Q(Φ^t, a^t) + α [ r + γ max_{a∈A} Q(Φ^{t+1}, a) − Q(Φ^t, a^t) ] (3), where α is the learning rate that controls the updating speed of the Q-value. The optimal policy used to make the action decision on the agents can be expressed as follows: π*(Φ^t) = argmax_{a∈A} Q(Φ^t, a) (4).
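The history-based update rule is ordinary tabular Q-learning with the (hashable) action-observation history in place of the unobservable state. A minimal Python sketch with illustrative names and a toy two-action agent:

```python
from collections import defaultdict

def q_update(Q, hist, action, reward, next_hist, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step keyed by (history, action) instead of
    (state, action), mirroring the history-based update rule."""
    best_next = max(Q[(next_hist, a)] for a in actions)
    Q[(hist, action)] += alpha * (reward + gamma * best_next - Q[(hist, action)])

# toy usage: one step from an empty history
Q = defaultdict(float)          # unseen (history, action) pairs default to 0
actions = [0, 1]
hist = ()                       # empty history at t = 0
action, reward = 1, 1.0
next_hist = hist + ((action, "ack"),)   # append the (action, observation) pair
q_update(Q, hist, action, reward, next_hist, actions)
```

Histories grow as tuples of (action, observation) pairs, so they stay hashable and can serve directly as table keys.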

Group Cooperation
Regardless of whether one searches for the optimal scheduling set or the optimal action policy for slot and channel allocation, global information about the entire WSN is a common prerequisite. However, acquiring global information is almost impossible in such a dynamic network environment; meanwhile, the action, observation, and policy spaces are exponential in the number of agents. One feasible method is distributed independent learning, in which agents only utilize their own observations and rewards and ignore the information of other agents. However, without considering the cooperation of agents, this type of method cannot ensure the quality of the solution and thus may produce schedules with inferior performance.
To further address the mentioned issues, the core idea of multi-agent learning with group observability for Dec-POMDP in [32] can be exploited to design an efficient many-to-many data aggregation scheduling method. By building a number of agent groups, it is possible to split the global function into group functions. Due to the existence of the WSN routing structure, a group can be naturally constructed by the nodes within one hop. The downstream node of a data transmission is automatically selected as the group head; when there are multiple downstream nodes, only the node with the largest identity number is recognized as the group head of the current node. Meanwhile, the upstream nodes are group members, which are supposed to transmit their own observations to the group head. An example can be found in Figure 4. This method distributes the learning tasks by utilizing the interactions inside the routing structure RS, and it makes the learning agents cooperate in order to ensure the global performance; its feasibility is proved by Theorem 1 in Section 4.5. A decomposable Q-function Q̂(Φ^t, a^t) is designed to represent the global Q-function Q(Φ^t, a^t), and the former can be defined as the sum of the group Q-functions: Q̂(Φ^t, a^t) = Σ_{g∈RS} Q_g(Φ_g^t, a_g^t), where Q_g(Φ_g^t, a_g^t) is the expected reward for a group of agents after performing a joint group action a_g^t with a group history Φ_g^t. The relationship between Q̂(Φ^t, a^t) and Q(Φ^t, a^t) is proved by Lemma 2 in Section 4.5.
The update rule of the Q-function in Equation (3) can be rewritten as follows: Q̂(Φ^t, a^t) ← Q̂(Φ^t, a^t) + α [ r + γ max_{a∈A} Q̂(Φ^{t+1}, a) − Q̂(Φ^t, a^t) ]. For the discounted future reward, even though global information cannot be directly obtained to compute max_{a∈A} Q̂(Φ^{t+1}, a), the latter can be expressed by decomposing the optimal joint action a* = argmax_{a∈A} Q̂(Φ, a), where a* = ∪_{g∈RS} a*_g; finally, max_{a∈A} Q̂(Φ^{t+1}, a) can be rewritten as follows: max_{a∈A} Q̂(Φ^{t+1}, a) = Σ_{g∈RS} Q_g(Φ_g^{t+1}, a*_g). Benefiting from this decomposition, the update rule of the group Q-function can be formulated as follows: Q_g(Φ_g^t, a_g^t) ← Q_g(Φ_g^t, a_g^t) + α [ R(Φ_g^t, a_g^t) + γ Q_g(Φ_g^{t+1}, a*_g) − Q_g(Φ_g^t, a_g^t) ]. During the learning process of an agent group g at time step t, after the joint action a_g^t is taken, the group members transmit their own observations to the group head, and the group head receives its group reward signal R(Φ_g^t, a_g^t). After updating the action-observation history Φ_g^{t+1}, the group head computes the next optimal action a*_g for Φ_g^{t+1} using the distributed constraint optimization (DCOP) technique in [33], and then it distributes the next action to the group members, which may execute a*_g or explore other actions. In this way, the global Q-function is decomposed into multiple local Q-functions on the group heads, and the selection of a group action is computed in a distributed manner with local group information.
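The group-level learning loop on a head node can be sketched as follows. The `GroupHead` class is an illustrative stand-in: it replaces the DCOP computation of the next optimal group action with an exhaustive search over the joint group action set, which is feasible only for toy group sizes:

```python
from collections import defaultdict
from itertools import product

class GroupHead:
    """Group-level Q-learner: keeps Q_g over (group history, joint group
    action) pairs and picks the next group action greedily. The paper uses
    DCOP for the argmax step; brute force over a small joint action set is
    used here as a stand-in."""
    def __init__(self, member_actions, alpha=0.1, gamma=0.9):
        self.Q = defaultdict(float)
        self.joint_actions = list(product(*member_actions))
        self.alpha, self.gamma = alpha, gamma

    def best_action(self, hist):
        """a*_g = argmax over the joint group actions for this history."""
        return max(self.joint_actions, key=lambda a: self.Q[(hist, a)])

    def update(self, hist, action, reward, next_hist):
        """Group Q-function update driven by the group reward signal."""
        target = reward + self.gamma * self.Q[(next_hist, self.best_action(next_hist))]
        self.Q[(hist, action)] += self.alpha * (target - self.Q[(hist, action)])

def global_q(groups, hists, actions):
    """Decomposed global Q-value: the sum of the group Q-functions."""
    return sum(g.Q[(h, a)] for g, h, a in zip(groups, hists, actions))
```

`global_q` mirrors the decomposition of the global Q-function into per-group terms; no single node ever needs the full joint history or joint action.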

Reward Function
The scheduling optimization for the objectives and constraints in Section 3.2 is embodied in the reward function. The total reward of a group is the product of the rewards from the group history and the group action, where R(Φ_g^t, a_g^t) = R_hs(Φ_g^t) R_at(a_g^t). The reward from the group history R_hs(Φ_g^t) is affected by the numbers n r , n t and the size m t of the successfully received and transmitted packets; this information is attached to the ACK packet. In general, larger numbers and sizes of successfully received and transmitted packets help to reduce the energy consumption of nodes, as the probability of packet re-transmission decreases significantly and fewer conflicts appear. A signum function sgn 1 (n r , n t ) is adopted to control the value of R_hs(Φ_g^t). If n r = |US(v i )| and n t = |DS(v i )|, then sgn 1 (n r , n t ) = 1; this means that the current node receives all the expected packets from its upstream nodes and all outgoing packets are successfully transmitted to its downstream nodes. Otherwise, not all the expected packets are successfully received or transmitted, and sgn 1 (n r , n t ) = 0. Assuming the maximum packet capacity is m t max , R_hs(Φ_g^t) can be defined as follows: R_hs(Φ_g^t) = sgn 1 (n r , n t ) · m t / m t max .
The reward from the group action R_at(a_g^t) has two impacting factors: the number of overlapping transmission time slots, and the position of the last transmission time slot. The first factor is a strict constraint to avoid communication conflicts in a group, and it directly decides whether the reward value is positive. The second factor is a typical index of communication delay; the smaller this value, the higher the chance that a group achieves a smaller communication delay. Finally, according to the previous definition of the transmission window wd t , the reward from a group action can be defined as follows: R_at(a_g^t) = sgn 2 / Σ_{v i ∈g} wd t .end(v i ), where ∩_{v i ∈g} wd t (v i ) denotes the intersection of the transmission window time slot sets of the group members, and sgn 2 is another signum function. If ∩_{v i ∈g} wd t (v i ) is empty, the transmitting nodes have no overlapping time slots, and the reward value is positive with sgn 2 = 1. Otherwise, a communication conflict appears, the reward becomes a punishment, and its value is negative with sgn 2 = −1. wd t .end(v i ) indicates the final transmission delay on the current node, and its value is expected to decrease.
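Putting the two factors together, the group reward can be computed as in the Python sketch below. The closed forms of R_hs and R_at are not fully specified in the text, so the payload scaling m t / m t max and the 1/Σ wd t .end magnitude are assumptions that merely respect the stated monotonicity (all-or-nothing history reward, sign flip on conflict, smaller delay means larger reward):

```python
def history_reward(n_r, n_t, m_t, n_up, n_down, m_max):
    """R_hs: positive only when all expected packets were received
    (n_r = |US|) and transmitted (n_t = |DS|); scaled by the payload
    ratio m_t / m_max (the scaling form is an assumption)."""
    sgn1 = 1 if (n_r == n_up and n_t == n_down) else 0
    return sgn1 * m_t / m_max

def action_reward(tx_windows, wd_t_ends):
    """R_at: sgn_2 = +1 when the members' transmission windows do not
    overlap, -1 otherwise; the magnitude shrinks as the last transmission
    slots grow (the 1/sum form is an assumption)."""
    overlap = set.intersection(*tx_windows) if tx_windows else set()
    sgn2 = -1 if overlap else 1
    return sgn2 / sum(wd_t_ends)

def group_reward(n_r, n_t, m_t, n_up, n_down, m_max, tx_windows, wd_t_ends):
    """Total group reward: the product R_hs * R_at."""
    return history_reward(n_r, n_t, m_t, n_up, n_down, m_max) * \
           action_reward(tx_windows, wd_t_ends)
```

Disjoint transmission windows with all expected packets delivered yield a positive reward; any overlap turns the product into a punishment.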

Action Policy
The capability of random exploration in reinforcement learning should be maintained, so the scheduling method has some probability of choosing a random action instead of the optimal action. To match the convergent characteristic of learning, the random exploration range should be large at the earlier stages of learning, but the random-action probability should decrease as the time steps increase. This goal is achieved by making the parameter of the classic $\epsilon$-greedy policy alterable. The adopted alterable selection probability parameter $\hat{\epsilon}_t$ is correlated with the time step $t$ and can be denoted as follows, where $\sigma$ is a shrinking factor; as the time steps increase, the probability of a random action becomes very small.
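A decaying $\epsilon$-greedy selection can be sketched as follows. The exact decay formula is given by the paper's equation; the exponential-style shrink `sigma ** (-t / tau)` below is an assumed shape (with a hypothetical time constant `tau`) that reproduces the described behaviour: large exploration early, vanishing exploration as $t$ grows, and faster shrinking for larger $\sigma$.

```python
import random

def epsilon_t(t, sigma, tau=100.0):
    # Assumed decay shape for the alterable parameter eps_t; tau is hypothetical.
    return sigma ** (-t / tau)

def select_action(q_values, t, sigma, rng=random):
    """q_values: dict mapping each candidate action to its current Q-value."""
    if rng.random() < epsilon_t(t, sigma):
        return rng.choice(list(q_values))       # random exploration
    return max(q_values, key=q_values.get)      # greedy exploitation
```

With $\sigma = 4$ (the best-performing shrinking factor in the paper's tests), the random-action probability starts at 1 and decays smoothly toward 0.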

Many-to-Many Data Aggregation Scheduling Procedure
Algorithm 1 illustrates the execution process of the many-to-many data aggregation scheduling method. The horizon $T$, which controls the end of the time steps, is a finite number. At the beginning of each frame, the current node executes $a_t^i$ or a random action to set the many-to-many data aggregation scheduling set (line 2). During the working window, if the current time slot is a reception time slot, a packet is received. On lines 5-6, a group action $a_g^*$ sent to the group members is received, and the individual action $a_t^i$ is extracted from $a_g^*$. On lines 9-18, if an individual observation $\omega_t^i$ sent to a group head is received, the information is stored. Once a group head has received all observations from its members, the group observation $\omega_t^g$ is constructed from memory; the group reward $R(\Phi_t^g, a_t^g)$ is then obtained from the local environment, and the group history $\Phi_{t+1}^g$ is updated. The next optimal group action $a_g^*$ is computed using DCOP, and the group Q-value is also updated. After that, $a_g^*$ is attached to the ACK and transmitted to all group members. On lines 19-25, if the current time slot is a transmission time slot, data packets are transmitted to all downstream nodes. When $v_j$ is the last downstream node in $DS(v_i)$, the individual observation $\omega_t^i$ is obtained and attached to the data packet. Finally, the data packet is delivered to $v_j$.
Algorithm 1. Many-to-many data aggregation scheduling procedure.
 1: for frame = 1 to T do
 2:   execute a_t^i or a random action to set the many-to-many data aggregation scheduling set;
 3:   for timeslot ts = 1 to TS_c do
 4:     if ts ∈ wd_r then
 5:       if v_i receives a_g^* then
 6:         decompose a_g^* to individual action a_t^i;
 7:       else if v_i receives ω_t^j then
 8:         store ω_t^j into memory;
 9:         if all observations are received from group members then
10:           construct group observations ω_t^g;
11:           obtain group reward R(Φ_t^g, a_t^g) based on Equations (9) and (10);
12:           update group history Φ_{t+1}^g;
13:           compute optimal action a_g^* based on DCOP;
14:           update group Q-value Q_g(Φ_t^g, a_t^g) based on Equation (8);
15:           attach a_g^* on ACK and transmit ACK to group members;
16:         end if
17:       end if
18:     end if
19:     if ts ∈ wd_t then
20:       for v_j ∈ DS(v_i) do
21:         if v_j is the last node in DS(v_i) then
22:           obtain individual observation ω_t^i;
23:           attach ω_t^i on the data packet;
24:         end if
25:         transmit data packet to v_j;
26:       end for
27:     end if
28:   end for
29: end for

Theoretical Analysis
The time and space complexity of reinforcement-learning-based algorithms have already been discussed in [34]. The difference in our method lies in the group cooperation among agents, such as the selection of optimal group actions based on DCOP; in this case, the number of group members is an indispensable factor in the complexity. The upper bound of the time complexity on each agent can be expressed as $O(t|g|)$, where $t$ denotes the total number of time steps and $|g|$ denotes the number of group members, as mentioned before. Correspondingly, the upper bound of the space complexity on each agent is $O(h|\omega||a||g|)$, where $h$ denotes the number of recent observation histories used for selecting an action, and $|\omega|$ and $|a|$ represent the sizes of the observation and action spaces, respectively; both values are bounded. According to the formulation of the Dec-POMDP in Section 3.3, $|\omega|$ depends on the number of transmitted packets, and $|a|$ is decided by the fixed number of involved parameters. Thanks to the distributed nature of the method, the computation and memory overheads are scattered across nodes, so both the time complexity and the space complexity per node decline.
The theoretical feasibility of the proposed scheduling method is established only if the global Q-function $Q(\Phi_t, a_t)$ for the network system is equal to the decomposable Q-function $\sum_{g \in RS} Q_g(\Phi_t^g, a_t^g)$, and this condition has to be verified by theoretical analysis. A global Q-function with system state variables $Q(s_t, \Phi_t, a_t)$ is considered and proved to be decomposable; this result then helps to prove the above condition. For convenience of expression, the probabilities of state and observation transitions are abbreviated as $P(s_{t+1} \mid s_t, a_t)$ and $O(\omega_t \mid s_{t+1}, a_t)$, respectively. According to the definition of the Bellman equation, $Q(s_t, \Phi_t, a_t)$ can be expressed as
$$Q(s_t, \Phi_t, a_t) = R(s_t, a_t) + \gamma \sum_{s_{t+1} \in S} \sum_{\omega_t \in \Omega} P(s_{t+1} \mid s_t, a_t)\, O(\omega_t \mid s_{t+1}, a_t) \max_{a \in A} Q(s_{t+1}, \Phi_{t+1}, a),$$
where $\Phi_{t+1}$ is $\Phi_t$ appended with action $a_t$ and observation $\omega_t$, and $\max_{a \in A} Q(s_{t+1}, \Phi_{t+1}, a)$ actually denotes $Q(s_{t+1}, \Phi_{t+1}, a^*)$, with $a^*$ the global optimal joint action. For time step $t$, the belief $b_t$ is a distribution that depends completely on the initial belief $b$ and the history $\Phi_t$; then, $Q(s_t, \Phi_t, a_t)$ transforms into the global Q-function without system state, $Q(\Phi_t, a_t) = \sum_{s_t \in S} b_t(s_t)\, Q(s_t, \Phi_t, a_t)$. By the same principle, the group Q-function with group state $Q_g(s_t^g, \Phi_t^g, a_t^g)$ is defined, and the group Q-function without group state is subsequently defined as $Q_g(\Phi_t^g, a_t^g) = \sum_{s_t^g} b_t^g(s_t^g)\, Q_g(s_t^g, \Phi_t^g, a_t^g)$.
Lemma 1. For any finite time step $t$ in the Dec-POMDP model, the global Q-function with system state $Q(s_t, \Phi_t, a_t)$ is decomposable and equal to $\sum_{g \in RS} Q_g(s_t^g, \Phi_t^g, a_t^g)$.
Proof. Mathematical induction is adopted to prove this lemma. First, assume that the decomposition equation $Q(s_{t+1}, \Phi_{t+1}, a_{t+1}) = \sum_{g \in RS} Q_g(s_{t+1}^g, \Phi_{t+1}^g, a_{t+1}^g)$ holds for time step $t+1$. Then, we analyse whether the decomposition equation still holds for time step $t$; the derivation process is as follows.
Lemma 2. For any finite time step $t$ in the Dec-POMDP model, the global Q-function without system state $Q(\Phi_t, a_t)$ is decomposable and equal to $\sum_{g \in RS} Q_g(\Phi_t^g, a_t^g)$.
Proof. According to Lemma 1 and Equations (15) and (18), the derivation process is as follows.
Theorem 1. In the Dec-POMDP model, the optimal policy $\pi^*(\Phi)$ will be found by the proposed cooperative multi-agent learning method.
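The omitted derivation of Lemma 2 can be sketched in the document's notation (equation labels refer to the belief-weighted definitions of the global and group Q-functions; the marginalization of the joint belief onto each group's state is the key assumed step):

```latex
Q(\Phi_t, a_t)
  = \sum_{s_t \in S} b_t(s_t)\, Q(s_t, \Phi_t, a_t)                              % Eq. (15)
  = \sum_{s_t \in S} b_t(s_t) \sum_{g \in RS} Q_g(s_t^g, \Phi_t^g, a_t^g)        % Lemma 1
  = \sum_{g \in RS} \sum_{s_t^g} b_t^g(s_t^g)\, Q_g(s_t^g, \Phi_t^g, a_t^g)      % group belief marginal
  = \sum_{g \in RS} Q_g(\Phi_t^g, a_t^g).                                        % Eq. (18)
```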
Proof. Based on the basic property of Q-learning and Equation (8), the group Q-function without group state $Q_g(\Phi_g, a_g)$ converges to the group optimal value $Q_g^*(\Phi_g, a_g)$. According to Lemma 2, the proposed cooperative multi-agent learning method will discover the optimal value of the global Q-function $Q^*(\Phi, a)$, which is decomposable and equal to $\sum_{g \in RS} Q_g^*(\Phi_g, a_g)$. The optimal policy $\pi^*(\Phi)$ is then found according to the following equation: $\pi^*(\Phi) = \arg\max_{a \in A} \sum_{g \in RS} Q_g^*(\Phi_g, a_g)$.

Simulation Setting
To simulate a realistic wireless network environment and preserve the concurrent execution characteristics of a distributed system, OMNeT++ is adopted for the performance evaluation. The sensor node model is constructed following the OSI model on this simulation platform, and the different functionalities of sensors are implemented on the corresponding logical layers. A visualization example of this layered sensor model is depicted in Figure 5a, where the "nic" layer contains both the physical and the data link layer. In this model, the many-to-many data aggregation scheduling method is implemented as a MAC protocol. A periodic data collection event is implemented as the network application, which helps data aggregation produce a significant effect on data transmissions. The routing protocol at the network layer builds the routing structure that determines the upstream and downstream relationships of nodes. Sensor nodes are randomly deployed in the simulation scenarios, and the network always remains connected. A visualization example of node deployment in OMNeT++ is depicted in Figure 5b, where two sinks and 40 sources are located in a multi-channel WSN. The recommended settings for the important system parameters were obtained by conducting a sufficient number of preliminary tests. The proposed method is named MDS-ML (Many-to-many Data aggregation Scheduling based on Multi-agent Learning for multi-channel WSN). Three existing methods are selected as performance comparison targets. EESPG (Energy Efficient Scheduling in wireless sensor networks for Periodic data Gathering) [20], a typical centralized method, is adopted in the simulation. DASD (Data Aggregation Scheduling method for multi-channel Duty cycle WSN) [24] is implemented to support the many-to-many communication mode, and it works in a centralized way.
CDSM (Cluster-based distributed Data aggregation Scheduling algorithm with Multi-power and multi-channel) [26] is a distributed method that uses different transmission powers and channels for intra-cluster and inter-cluster communication, respectively.

Performance Evaluation
The scheduling results are optimized with respect to multiple objectives, such as communication delay and residual energy. Since each node performs the scheduling operation only once per data collection period, the number of periods represents the number of time steps for a learning method. Figure 6 depicts the comparison results on average delay for scenarios with different numbers of nodes, where a tuple (source, sink) denotes the numbers of source nodes and sink nodes. If the number of source or sink nodes increases, the average transmission delay increases as well, because the network structure becomes more complex and the packet transmission paths usually become longer. EESPG, DASD and CDSM have higher average delays than MDS-ML. One possible reason is that these methods were originally designed for many-to-one data aggregation and have to transform some components to support many-to-many data aggregation. The gap between the compared methods and MDS-ML becomes more obvious as the number of nodes increases. When there are 60 source nodes and 4 sink nodes in the application scenario, MDS-ML has about a 36% lower delay than the second-best performing method, DASD. If more channels are available in the network, all methods designed for a multi-channel WSN obtain a lower delay; the related result can be found in Figure 7. When the number of wireless channels increases from 2 to 4, the delays of MDS-ML and EESPG decrease to 77% and 74% of their original values, respectively. The impact of the number of nodes on the average residual energy is depicted in Figure 8. When there are 20 source nodes in the network, MDS-ML and EESPG have similar performance; however, their difference becomes larger as the number of source nodes increases. Especially in the scenario with 60 sources and 4 sinks, MDS-ML keeps its residual energy about 22% higher than that of EESPG.
In addition, the increase in sink nodes has a relatively limited impact on the energy value of MDS-ML. Figure 9 depicts the comparison result on average residual energy over time. DASD does not consider the reduction in energy consumption as a primary optimization objective, so it performs the worst among the four methods, and its energy level drops quickly over time. CDSM utilizes different powers and channels for different kinds of communication, but it also cannot obtain a satisfactory result due to its distributed nature. MDS-ML is hardly affected by the change in the number of periods, and its energy percentage is almost 1.5 times that of DASD. To evaluate the comprehensive performance on multiple objectives, a weighted sum of normalized objective values is adopted. The value of this weighted sum is named the scheduling quality, which represents the quality of an optimized scheduling result; the comparison on this metric with different numbers of nodes is depicted in Figure 10. With more nodes involved in data transmission, the scheduling optimization generally becomes more complex, and it is harder to obtain a high scheduling quality. MDS-ML always achieves the best scheduling quality compared with the other three methods. In the scenario with 20 source nodes and 2 sink nodes, the quality of MDS-ML is almost 1.1 times that of EESPG and DASD. When the numbers of sources and sinks become 60 and 4, the advantage increases to at most 1.5 times the quality of EESPG and DASD. Figure 11 indicates the improvement made by the proposed method by comparing the scheduling quality with different numbers of channels. CDSM always obtains the lowest scheduling quality, while EESPG and DASD have similar overall performance. In the scenario with four channels, the scheduling quality of MDS-ML is almost 1.4 times that of CDSM.
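The weighted-sum scheduling quality can be sketched as follows. The paper does not give the exact normalization or weights, so the min-max normalization and the example weights below are illustrative assumptions; the same formula handles both minimization objectives (delay) and maximization objectives (residual energy) by supplying the worst and best values per objective.

```python
def normalize(value, worst, best):
    # Map an objective value onto [0, 1], where 1 is best (assumed min-max form).
    return (worst - value) / (worst - best) if worst != best else 1.0

def scheduling_quality(objectives, weights):
    """objectives: list of (value, worst, best) tuples, one per objective
    (e.g. delay, residual energy); weights: matching non-negative weights
    summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * normalize(v, worst, best)
               for w, (v, worst, best) in zip(weights, objectives))
```

For instance, a delay of 2.0 on a 0-10 scale (lower is better) and residual energy of 80 on a 0-100 scale (higher is better), equally weighted, both normalize to 0.8 and give a quality of 0.8.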
The benefit of increasing channels on scheduling quality becomes very small when the number of channels reaches 6. Table 1 presents the impact of the learning rate $\alpha$ on scheduling quality. As the time steps increase, the proposed method uses more steps to learn a better policy and obtain a better quality. When $\alpha$ is small, the update rule keeps more of the original Q-value, so the learning speed is relatively slower. When $\alpha$ is set to 0.2, a scheduling with good quality is learned early, but it barely changes as the time steps increase. The most apparent variation with time steps happens when $\alpha = 0.1$, where the quality increases by about 2.2 times. Table 2 presents the impact of the shrinking factor $\sigma$ on scheduling quality. A smaller shrinking factor leads to a higher probability of random actions; even though this may help to explore more different schedulings, it may also slow down the convergence due to too many random actions. An example can be found when $\sigma = 2$: as the time steps increase, the quality improves by only about 1.5 times. The best performance in these tests is obtained with the shrinking factor $\sigma = 4$. The convergence of the scheduling result is an indispensable feature for learning-based methods, and a specific metric called selection consistency $SC$ is designed to observe the convergence of the proposed method. Let the current period be $t_c$; then, the selection consistency of an agent over the recently observed periods $h_{rt}$ can be defined as follows. Figure 12 depicts the result of selection consistency with different numbers of nodes; when the value is equal to 1, the selection has become stable and convergent. With the increase in nodes, MDS-ML takes more periods to reach consistency. When there are 20 sources and 2 sinks, it takes only about 120 periods, whereas with 40 sources and 2 sinks, it costs about 400 periods.
This phenomenon is probably caused by the complexity of the scheduling problem: more nodes mean more chances of conflicts, and more working time slots and channels need to be arranged.
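One plausible form of the selection consistency metric is sketched below. The paper's exact equation is not reproduced here; this assumed version measures the fraction of the last $h_{rt}$ periods in which the agent chose its most frequent recent action, so that $SC = 1$ exactly when the same action was selected in every recently observed period (the convergence condition described above).

```python
from collections import Counter

def selection_consistency(action_history, h_rt):
    """Hypothetical SC: share of the last h_rt periods spent on the agent's
    most frequent recent action. 1.0 means the selection is stable."""
    recent = action_history[-h_rt:]
    most_common_count = Counter(recent).most_common(1)[0][1]
    return most_common_count / len(recent)
```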
PLR (packet loss ratio) and PDR (packet delivery ratio), two common performance metrics, are used to evaluate the network throughput. By applying the proposed algorithm to different network scenarios, the influence of the number of nodes on PLR and PDR can be further observed; the corresponding result is depicted in Figure 13. With a fixed number of collection periods, a more complex network scenario with more nodes implies that the scheduling method spends more time to achieve convergence, so more packets are lost due to unsuccessful communications during the pre-convergence stage. For example, PLR increases by about 4 times when the source nodes increase from 20 to 60 and the sink nodes increase from 2 to 4 in the simulation scenario. When there are 20 source nodes in the simulation scenarios, the change in the number of sinks distinctly affects PLR and PDR. However, this tendency gradually diminishes as the number of source nodes reaches 60, where the gap among the scenarios with different sink numbers is only about 2.5% at most.
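For reference, the two throughput metrics reduce to simple ratios over the counted packets (the standard definitions; counting of sent and delivered packets is done by the simulator):

```python
def plr(packets_sent, packets_delivered):
    """Packet loss ratio: fraction of transmitted packets never delivered."""
    return (packets_sent - packets_delivered) / packets_sent

def pdr(packets_sent, packets_delivered):
    """Packet delivery ratio: fraction of transmitted packets delivered."""
    return packets_delivered / packets_sent
```

By construction, PLR and PDR for the same packet counts always sum to 1.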

Discussion of Simulation Results
According to the simulation results presented above, MDS-ML obtains better performance on transmission delay, energy consumption, overall scheduling quality, PLR and PDR. This new scheduling method is directly designed to support many-to-many data aggregation in a multi-channel WSN, and it also considers multiple optimization objectives. Thanks to its continuous learning capability, the new method obtains good performance for data transmission. CDSM, as a distributed scheduling method, lacks the optimization capability for the global network and therefore achieves relatively poor performance. Although EESPG and DASD also perform well, the construction and maintenance of a virtual tree structure still incur additional network overhead, because the exchanges of extra control packets among nodes are inevitable, and both methods target many-to-one data aggregation scheduling for multiple channels. When these scheduling methods are forcibly applied to multi-sink scenarios, concurrent and independent scheduling operations toward different sinks have to be executed, and these uncooperative operations lead to higher costs in the network. The performance advantage of the new method is more obvious when there are more source and sink nodes in the simulation scenarios, since the network structure becomes more complex and it is more difficult to find the optimal scheduling set.

Conclusions
To handle the many-to-many data aggregation scheduling problem for a multi-channel WSN, a cooperative multi-agent learning-based scheduling method is proposed in this paper. The optimization goal of the scheduling is first formulated and analysed. According to the characteristics of many-to-many data aggregation scheduling, the scheduling process is mapped to a decentralized partially observable Markov decision model, and cooperative multi-agent learning is embedded into the many-to-many data aggregation scheduling procedure. Nodes within one-hop distance establish a group, which is the basic cooperative unit for learning the optimal policy. Finally, performance experiments are conducted on a discrete event simulator, and the simulation results validate the advantages of the proposed method on common metrics. In future work, a more detailed system model closer to the realistic communication environment should be considered for the proposed scheduling method. In this new model, the channel fading problem will be effectively handled, the communication security will be guaranteed, and malicious and selfish nodes will be detected and prevented.

Data Availability Statement:
The datasets generated and analysed in this study are available from the corresponding author on reasonable request.

Acknowledgments:
The authors would like to thank all anonymous reviewers for their constructive comments and insightful suggestions.

Conflicts of Interest:
The authors declare no conflict of interest.

Notation
V: The set of sensor nodes
v_i: Sensor node i
L: The set of communication links
l_{j,i}: Link from node i to node j
ngh(v_i): The neighbor nodes of node i
CH: The set of available wireless channels
ch_k: Channel k
d_i: Sink node i
TS_c: Communication period (a frame)
ts: Time slot
US(v_i): The upstream nodes of node i
DS(v_i): The downstream nodes of node i
LS: The link-based scheduling set
ls_{i,j}: The resource allocation set for the link l_{i,j}
wd: Working window
wd_r: Reception slice, including the time slots for data reception
wd_t: Transmission slice, including the time slots for data transmission
η_k: The k-th objective function
ϕ: Overall objective function
RS: Routing structure (set)
I: The set of agents
S: The set of system (joint) states
A: The set of joint actions
P: The state transition function
R: Reward function
Ω: The set of joint observations
O: Observation function
b: Initial system state distribution (initial belief)
T: Horizon, or the number of time steps
Φ: The action-observation history
π: Agent policy
V_π(s): The value of a joint policy π from state s
Q: Q-function (Q-value function)
γ: Discount factor
α: Learning rate
g: Agent group
sgn: Signum function
ε̂_t: Alterable parameter of selection probability
σ: Shrinking factor
SC: Selection consistency
h_rt: Recently observed periods