Dynamic Cooperative Spectrum Sensing Based on Deep Multi-User Reinforcement Learning

Abstract: Dynamic spectrum access (DSA) is considered a promising technology for addressing spectrum scarcity and improving spectrum utilization. In practice, the channels are correlated with each other. Meanwhile, collisions are inevitable when multiple PUs or multiple SUs communicate in a real DSA environment. Considering these factors, deep multi-user reinforcement learning (DMRL) is proposed by introducing a cooperative strategy into the dueling deep Q network (DDQN). Without requiring prior information about the system dynamics, DDQN can efficiently learn the correlations between channels and reduce the computational complexity in the large state space of the multi-user environment. To reduce conflicts and further maximize the network utility, a cooperative channel strategy is explored that utilizes acknowledgment (ACK) signals without exchanging spectrum information. In each time slot, each user selects a channel and transmits a packet with a certain probability. After sending, ACK signals are used to judge whether the transmission was successful. Compared with other popular models, the simulation results show that the proposed DMRL achieves better performance in enhancing spectrum utilization and reducing the conflict rate in dynamic cooperative spectrum sensing.


Introduction
Due to the increasing demand for wireless communication, spectrum resources have become severely scarce, which has driven demand for high-efficiency dynamic spectrum access (DSA) schemes [1][2][3]. To alleviate the conflict between increasing spectrum demand and spectrum shortage, DSA allows secondary users (SUs) to opportunistically search for and use licensed frequency bands or channels that are not fully utilized by primary users (PUs). Conventional DSA schemes usually focus on model-dependent settings or short-term objectives, including myopic strategies [4], multi-armed bandit strategies [5,6], and Whittle index strategies [7,8]. They assume that the channels are independent of each other or follow the same distribution, which ignores the impact of the high correlation between channels. In recent years, many studies have begun to focus on the impact of prior knowledge and channel correlations on multichannel sensing.
Most conventional studies on channel selection strategies require network status information, but this may be impractical [9][10][11]. Machine learning has been applied to the field of dynamic multi-channel perception [12][13][14]. As an important branch of machine learning, reinforcement learning (RL) is characterized by frequently interacting with changing environments to gain knowledge, which shows excellent performance in processing dynamic systems [15]. In reinforcement learning, each SU acts as an agent; and the agent who takes actions interacts with the external environment through a reward mechanism and adjusts the action according to the reward value obtained in the environment [16,17]. In other words, RL allows inexperienced agents to continuously learn through trial-and-error, and it maximizes the reward function to obtain the best strategy. So, RL-based models do not require the complete network status information. Q-learning, one of the algorithms of RL, has been used to depict the SU behavior in cognitive networks [18][19][20][21]. However, Q-learning performs well on small scale models but becomes extremely inefficient when dealing with large state spaces [22].
Deep reinforcement learning (DRL), which uses neural networks to approximate state values, has attracted much attention. It can effectively deal with large dynamic unknown environments [23][24][25][26], including very large state and action space. In recent years, DRL has made many breakthroughs in dynamic spectrum access due to its intrinsic advantages. Reference [27] can be considered as the pioneer work of exploring the DRL. It proposes a dynamic channel access scheme based on the deep Q network (DQN) with experience replay [28]. In each slot, the user selects one of M channels to transmit its packet. As the user knows the channel state only after the channel is selected, the related optimization decision problem can be expressed as a partially observable Markov decision process (POMDP). DQN can easily search the best strategy from experience [29][30][31], so it has been widely utilized in DSA.
DQN [32] is explored to jointly address the dynamic channel access and interference management. In [28], the DQL keeps following the learned policy over time slots and stops learning a suitable policy. However, actual environments are dynamic, in which DQN needs to be re-trained. To address the problem, an adaptive DQN scheme [33] is proposed by evaluating the cumulative incentives of current policies in each period. When the reward decreases to less than a given threshold, the DQN will be retrained to find the new good policies. The simulation results show that when the channel state changes, this scheme can detect the changes and start relearning to get high returns. In [34], a DRL framework based on actor-critics is proposed, which can get a better performance in a relatively large number of channels.
Most of the methods mentioned above consider only one user rather than the real multi-user, multi-channel environment, in which collisions are prone to occur. Noticing this, a completely distributed model [35] is adopted for SUs to conduct dynamic multi-user spectrum access. A deep multi-user reinforcement learning method [36] is developed for DSA. It can effectively depict the general complex environment while overcoming the expensive computational cost caused by the large state space and partial observability. Based on Q-learning, a spectrum access scheme is proposed in [37], which can allocate idle spectrum holes without interaction between users. Multi-agent reinforcement learning (MARL), also based on Q-learning, is proposed in [38]; it can effectively resist interference and obtain good sensing results. In this model, each radio treats the behavior of other radios as a part of the environment. Each radio then attempts to evade the transmissions of other wideband autonomous cognitive radios (WACRs) as well as a jammer signal that sweeps across the whole spectrum band of interest. However, it is difficult to obtain a stable policy in a changeable and complicated environment. The multi-agent deep reinforcement learning method (MADRL) is proposed in [39]. To avoid interference to the primary users and to coordinate with other secondary users, each SU decides its own sensing policy from the sensing results of the selected spectra. The method in [39] applies a channel sorting algorithm based on RL to decrease the number of sensing operations and spectrum handoffs. Compared with traditional RL models, MADRL has advantages in reward performance and convergence speed.
In summary, [27,33] assume only one SU in the network with perfect spectrum sensing outcomes. That is, there is no error in spectrum sensing. On the other hand, [34,35] consider multi-user access, but [34] does not perform spectrum sensing for multiple users in a time-varying environment, and [35] needs to improve its convergence speed and collision rate. Naparstek et al. [36] assume no PU in the network, so collisions with PUs are not considered. MARL is no longer applicable when the number of states is large, because it neither considers the coordination among users nor reduces the computational complexity. MADRL is simulated in channels where the number of channels grows exponentially with the increasing number of users. Meanwhile, it takes only the sensing reward into account. Most of the DRL models mentioned above use "sequential" architectures. That is, a layer can only be connected to the neurons in the one layer before and the one layer after it. Utilizing a non-sequential deep learning architecture, dueling deep Q learning (DDQN) [40] carries out the Q-learning process by exploring separate estimates of the state value function and the state-dependent action advantage function.
In general, for dynamic cooperative spectrum sensing, it is necessary to consider the DSA environment where PUs and SUs appear randomly, which is much closer to the real scenarios. In this paper, deep multi-user reinforcement learning is proposed to effectively achieve dynamic cooperative spectrum sensing. After investigating communication strategies of the cooperative DSA environment, ACK signals are utilized to achieve partially observable setting. As the cooperative strategy, ACK signals are then introduced into DDQN to construct the proposed DMRL model, which allows users to learn good policies and achieve strong sensing performance. The main contributions of the paper are presented as follows.
We propose a cooperative strategy for the DSA networks, without the need to exchange user spectrum information, online coordination, or carrier sensing. In the network, we consider cognitive MAC protocols for the secondary network with N users sharing K orthogonal channels.
Due to the high complexity in DSA, we develop a dueling deep Q-network approach, which has the ability to determine the best action for each state in the large dynamic unknown environments.
Compared with MARL and MADRL, the simulation results show that the proposed DMRL model can obtain a higher average reward and a lower average collision. It means our method can effectively achieve the dynamic cooperative spectrum sensing, and significantly reduce the chance of conflict between PUs and SUs.

Cooperative Spectrum Sensing Model
In cooperative spectrum sensing, all cognitive radios measure the licensed spectrum and make independent decisions. The cooperative spectrum sensing system is illustrated in Figure 1. The cognitive radio network usually consists of several PUs and SUs, where the SUs aim to share multichannel spectrum resources with PUs without interrupting the PUs. SUs exchange their sensing information with each other, and each SU maximizes its own spectrum sensing performance. Because the channel environment is unknown, an SU can only make a decision after each attempt to sense a different channel. Therefore, the channel exchange can be modeled as a Markov chain, which addresses the problem of partial observability.
Each SU is synchronized with the slot time. Figure 2 shows the slot structure for the SUs. The time slot structure is divided into two periods, namely the sensing period and the data transmission period. At the beginning of the sensing period, all SUs in the system independently sense the slot. Then, each SU sends its own sensing data to other users and combines it with the data it receives. Finally, each SU sends its decision. Before the end of the sensing period, an ACK signal is sent to confirm whether the transmission was successful.
Appl. Sci. 2021, 11, 1884

Deep Reinforcement Learning for DSA
Developing reinforcement learning (RL) methods is a new research direction for solving DSA problems. The RL methods can adapt to the dynamic spectrum environment and do not require the prior knowledge about the system model. Q-learning is an important branch in reinforcement learning, which is widely used in various applications due to its model free nature. Q-learning is a value iteration method, and its essence is to find the Q value of each state and action pair. Each SU follows the Q-learning method, and makes the decision to sense the channel based on the previous channel occupation history and recent action-observation experience. Although Q-learning performs well when dealing with small actions and state spaces, it becomes inefficient with a large state space.
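The value-iteration idea behind Q-learning can be illustrated with a minimal tabular sketch. It is not the authors' implementation: the function name `q_learning_update`, the learning rate `alpha`, the discount `gamma`, and the two-channel toy state are all assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, n_actions,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action)."""
    # Best value reachable from the next state under the current estimates.
    best_next = max(q_table[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# Unseen (state, action) pairs start at 0 (an assumption of this sketch).
q_table = defaultdict(float)
# Toy state: last observed occupancy of two channels; action: channel index to sense.
new_q = q_learning_update(q_table, state=(1, 0), action=0, reward=1.0,
                          next_state=(0, 1), n_actions=2)
```

The table holds one entry per state-action pair, which is why this approach becomes inefficient as the state space grows.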
In recent years, DQN, which combines a deep neural network with Q-learning, has shown great potential for solving the DSA problem. Deep neural networks can represent large-scale models well, so they maintain good performance in large-scale scenarios. To model the correlation across channels, the entire system is described as a Markov chain of $2^N$ states, where N represents the number of channels. The DQN formulates the channel selection problem as <S, A, R, S'>, where S ∈ {S_1, ..., S_N} represents the state space, A ∈ {1, ..., N} represents the action space, R ∈ {0, 1} represents the instant reward, and S' represents the next state. The framework of DQN is presented in Figure 3. At the beginning of each time slot, the current state S consists of all previous decisions and observations. S is input into the Q network, and the network outputs the best action A, which indicates the channel to sense at time slot t. After performing the action A, the system obtains the next state and the instant reward. Finally, the time slot record <S, A, R, S'> is put into the experience memory D, and the process is repeated. When the experience memory reaches its capacity limit, random sampling is performed: m records (S, A, R, S') are sampled from the experience memory D, and the current target Q value is calculated as in (1):

$$y = R + \gamma \max_{a'} Q(S_{t+1}, a'; \theta_i^-), \tag{1}$$

where y is the output of the neural network based on the new experience (S, A, R, S'). $Q(S_{t+1}, a'; \theta_i^-)$ represents the approximation function formed through old experience, where $S_{t+1}$ is the new state after taking action a in state $S_t$, and t denotes the time slot. $\theta_i^-$ is the weight of the Q-network used to compute the target at iteration i, and γ represents the discount rate. The loss function is used to update the neural network parameters through gradient back propagation.
The loss function is the mean square error between the value of the objective function and the value of the actual function, which is formulated in (2) and (3):

$$L_i(\theta_i) = \mathbb{E}\left[\left(y - Q(S_t, A_t; \theta_i)\right)^2\right], \tag{2}$$

$$L_i(\theta_i) = \mathbb{E}\left[\left(R + \gamma \max_{a'} Q(S_{t+1}, a'; \theta_i^-) - Q(S_t, A_t; \theta_i)\right)^2\right], \tag{3}$$

where $Q(S_{t+1}, a'; \theta_i^-)$ is the approximation function formed through old experience and $Q(S_t, A_t; \theta_i)$ is the approximation function formed by the new experience. The specific process of the DQN algorithm is shown in Algorithm 1.
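Before turning to Algorithm 1, the target in (1) and the mean-square-error loss described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `td_targets` and `mse_loss`, the value `gamma=0.9`, and the toy numbers are hypothetical, and the Q values are passed in as lists rather than produced by a real network.

```python
def td_targets(rewards, next_q_values, gamma=0.9):
    """Compute y = R + gamma * max_a' Q(S', a'; theta^-) for each sampled record."""
    return [r + gamma * max(q_next) for r, q_next in zip(rewards, next_q_values)]


def mse_loss(targets, q_taken):
    """Mean squared error between the targets y and Q(S, A; theta)."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_taken)) / len(targets)


rewards = [1.0, 0.0]                 # instant rewards R of two sampled records
next_q = [[0.2, 0.5], [0.1, 0.3]]    # target-network outputs for each next state S'
q_taken = [1.0, 0.5]                 # online-network value of the action actually taken
targets = td_targets(rewards, next_q)
loss = mse_loss(targets, q_taken)
```

In a real DQN the gradient of this loss is taken with respect to the online weights only, while the target weights are held fixed between updates.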

Algorithm 1 Dynamic multi-channel sensing based on DQN
Input: State S
Output: Action A
BEGIN
Step 1: Initialize the experience memory D, all parameters of the neural network, and the Q value of each state-action pair
Step 2: Initialize S as the first state of the current state sequence
Step 3: Input the channel state S into the Q network, and use the ε-greedy method to select an action A from the current output Q values
Step 4: Perform action A and access the channel with the same value as A
Step 5: Obtain the next state S' and the reward value R of the channel environment according to the state S and the executed action A
Step 6: Store the sensing record <S, A, R, S'> in the experience memory D
Step 7: Randomly sample m records from the experience memory D, and update the Q neural network parameters according to (2) and (3)
Step 8: Assign the value of S' to S, and repeat from Step 3
END
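The two supporting mechanisms of Algorithm 1, the experience memory D and ε-greedy action selection, can be sketched as follows. The class name `ExperienceMemory`, the function `epsilon_greedy`, and the choice to drop the oldest record once the memory is full are assumptions of this sketch; the paper only states that records are stored and that m records are sampled at random.

```python
import random
from collections import deque


class ExperienceMemory:
    """Fixed-capacity store of <S, A, R, S'> records."""

    def __init__(self, capacity):
        # Assumption: when the memory is full, the oldest record is discarded.
        self.records = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.records.append((state, action, reward, next_state))

    def sample(self, m, rng=random):
        """Return up to m records drawn uniformly without replacement."""
        return rng.sample(list(self.records), min(m, len(self.records)))


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-Q action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

A typical use is to call `epsilon_greedy` on the network output in Step 3, `store` in Step 6, and `sample` in Step 7.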

The Proposed Deep Multi-User Reinforcement Learning
To solve the multi-user conflict problem, the multi-user DDQN framework is proposed by integrating cooperative sensing with DDQN. Consider an environment with K correlated channels, where each channel has two possible states: idle (1) or busy (0). An SU can dynamically access these K authorized channels and select one channel for communication. Based on this, channel exchange can be modeled as a Markov chain of up to $2^K$ states. Because the environment is unknown to the user, the user can only make a decision after each attempt to sense a different channel. In the proposed model, N users can dynamically access K shared channels. At the beginning of each time slot, each user selects a channel and sends a data packet with a certain trial probability. After each time slot, each user who has sent a data packet receives an observation indicating whether its data packet was successfully sent (i.e., the ACK indicator). If the data packet is delivered correctly, the reward is 1. Otherwise, the transmission has failed because of a collision, and the reward is set to 0. The model parameters are as follows.

• S stands for the state space. It is a matrix with dimension N × (2K + 2), which expresses the state set of N users:

$$S = [s_1, s_2, \ldots, s_N]^T, \tag{4}$$

where $s_i$ (1 ≤ i ≤ N) represents the status of the i-th user, and the status of each user consists of 2K + 2 elements. The first K + 1 elements represent the user's transmission status (if the user does not send, the first element is 1 and the other elements are 0; if the user selects channel k for transmission, the (k + 1)-th element is 1 and the other elements are 0). The elements $c_1$ to $c_K$ represent the remaining capacity of the K channels. The last element represents the ACK signal (when the ACK signal is received, the value is 1; otherwise, the value is 0). Therefore, $s_i$ is expressed as:

$$s_i = [\xi_1, \ldots, \xi_{K+1}, c_1, \ldots, c_K, p_i], \tag{5}$$

where $\xi_j$ indicates the selection of the (j − 1)-th channel for transmission, $c_j$ represents the remaining capacity of the j-th channel, and $p_i$ represents the ACK signal.

• A represents the action set. The actions of N users in each time slot form the action matrix, which is expressed as:

$$A = [a_1^t, a_2^t, \ldots, a_N^t], \tag{6}$$

where $a_i^t \in \{1, \ldots, j, \ldots, K\}$ indicates that user i selects channel j (1 ≤ j ≤ K) for access.

• O represents the observation space. It is mainly composed of the ACK signal, the instant reward, and the remaining capacity of the channel:

$$O = [(p_1, r_1), \ldots, (p_N, r_N), c_1, \ldots, c_K], \tag{7}$$

where $(p_i, r_i)$ represents the ACK signal and the instant reward obtained by user i after accessing the channel $a_i^t$, and $c_j$ represents the remaining capacity of channel j.

• R represents the reward space. It is a matrix with a dimension of 1 × N, representing the reward set of N users:

$$R = [r_1, r_2, \ldots, r_N], \tag{8}$$

where $r_i$ represents the reward obtained by the i-th user.
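To make the per-user state layout concrete, the following sketch assembles one state vector of length 2K + 2 from its three parts. The function name `build_user_state` and the convention that `selected_channel = 0` means "no transmission" are assumptions chosen to match the description above; they are not taken from the authors' code.

```python
def build_user_state(selected_channel, remaining_capacity, ack):
    """Build one user's state: one-hot transmission status, capacities, ACK.

    selected_channel: 0 for no transmission, k in 1..K for channel k.
    remaining_capacity: list of K values c_1..c_K.
    ack: 1 if an ACK was received, otherwise 0.
    """
    num_channels = len(remaining_capacity)
    if not 0 <= selected_channel <= num_channels:
        raise ValueError("selected_channel must be in 0..K")
    # First K + 1 elements: one-hot encoding of the transmission status.
    transmission = [0] * (num_channels + 1)
    transmission[selected_channel] = 1
    return transmission + list(remaining_capacity) + [ack]


# K = 3 channels, user transmits on channel 2 and receives an ACK.
state = build_user_state(selected_channel=2, remaining_capacity=[1, 0, 1], ack=1)
```

Stacking one such vector per user gives the N × (2K + 2) state matrix S.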
In this paper, dueling DQN network (DDQN) is utilized to understand which states are valuable and get better sensing strategies. The channel state with the cooperative signal is learned as the input of the network. Then the dynamic multi-channel cooperative strategy is obtained by training the DDQN network.

Architecture of the Proposed DMRL
The network structure of DMRL is illustrated in Figure 4. The input of the network is the state of each user, which is a vector of size 2K + 2. The first K + 1 elements represent the user's transmission status. The next K elements represent the remaining capacity of the channels. The last element represents the ACK signal. In the hidden layers, the outputs are calculated with the Matmul function and the ReLU function. The Matmul function [41] computes the weighted sum of the nodes at each level. ReLU [41] is a piecewise-linear activation function that outputs the input directly if it is positive and outputs zero otherwise. In the neural network, the activation function is responsible for transforming the summed weighted input of a node into the activation, or output, of that node. It is formulated as:

$$h_j = \mathrm{ReLU}(\mathrm{Matmul}(h_{j-1}, \omega_j) + b_j), \tag{9}$$

where $h_j$ is the output of the j-th hidden layer (with $h_0$ the input state), $\omega_j$ represents the weight matrix of the j-th hidden layer in the network structure, and $b_j$ represents the offset matrix of the j-th hidden layer. The DMRL structure contains a value layer and an advantage layer. $V(s_n(t))$ represents the value of the static state environment itself, $s_n(t)$ represents the state in time slot t, $A(a_n(t))$ represents the additional value of selecting an action, and $a_n(t)$ represents the action in time slot t. The two are finally combined to obtain the Q value for each action. Therefore, all the Q values in a certain state can be changed just by changing the value of $V(s_n(t))$. The value $V(s_n(t))$ and the advantage $A(a_n(t))$ are calculated separately from the output $h_j$ of the hidden layer, as shown in (10) and (11):

$$V(s_n(t)) = \mathrm{Matmul}(h_j, \omega_{j,1}) + b_{j,1}, \tag{10}$$

$$A(a_n(t)) = \mathrm{Matmul}(h_j, \omega_{j,2}) + b_{j,2}. \tag{11}$$

To avoid the case that only the parameters $\omega_{j,2}$ and $b_{j,2}$ are updated in the proposed DMRL network, (11) is rewritten as:

$$A(a_n(t)) = \mathrm{Matmul}(h_j, \omega_{j,2}) + b_{j,2} - \mathrm{mean}\left(\mathrm{Matmul}(h_j, \omega_{j,2}) + b_{j,2}\right), \tag{12}$$

where mean(·) denotes the average over all actions. From (12), it can be found that the matrix $A(a_n(t))$ has a zero-sum characteristic. That is, the sum of the matrix $A(a_n(t))$ is 0. Finally, adding the value $V(s_n(t))$ and the advantage $A(a_n(t))$ at the output layer, the Q-value matrix is obtained:

$$Q(a_n(t)) = V(s_n(t)) + A(a_n(t)). \tag{13}$$

If the value of the first element in the Q-value matrix is the largest, the current user does not send data packets. If the value of the (k + 1)-th element (1 ≤ k ≤ K) in the Q-value matrix is the largest, the current user sends data packets on channel k. To facilitate the channel selection for the current users, the Q-value matrix in (13) is normalized to a probability matrix, as given in (14). Therefore, for channel access, the subscript with the highest probability value is selected to transmit packets.
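The way the value and advantage streams combine into Q values, and how an action is then chosen, can be sketched as below. Two points are assumptions rather than statements from the paper: the zero-sum property is obtained here by subtracting the mean advantage, as in the standard dueling architecture, and the normalization to probabilities is implemented as a softmax. The function names and the toy numbers are hypothetical.

```python
import math


def dueling_q_values(value, advantages):
    """Combine a scalar state value with zero-mean advantages into Q values."""
    mean_adv = sum(advantages) / len(advantages)
    # Centering makes the advantages sum to zero (assumed mean subtraction).
    centered = [a - mean_adv for a in advantages]
    return [value + a for a in centered]


def softmax(q_values):
    """Normalize Q values to probabilities (softmax is an assumption here)."""
    shift = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp(q - shift) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


# K = 2 channels, so K + 1 = 3 actions: index 0 means "do not transmit".
q = dueling_q_values(value=0.5, advantages=[0.2, 1.0, -0.3])
probs = softmax(q)
action = max(range(len(probs)), key=probs.__getitem__)
```

Because softmax is monotonic, the index with the highest probability is also the index with the highest Q value, so the normalization does not change which channel is selected.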

Channel Cooperative Module
According to the previous step, each user obtains the channel number to be accessed by selecting the subscript with the largest probability value, then the access channels of the N users form the action matrix A, as shown in (6).
By observing the action matrix A, the observation matrix obs can be calculated. The observation matrix is mainly composed of the ACK signal, instant reward, and the remaining capacity of the channel. If the channel selected by the current user to access is the same channel selected by other users, the ACK signal and instant rewards received by these users are both 0. If they are not the same, it means that there is no collision. Then the ACK signal and instant reward are both set as 1. When the value of ACK signal is 1, the remaining capacity of channel is set to 0, otherwise it is set to 1. Then, ACK signals and action matrix are generated according to (5) and (6). By this, the next channel state next_state can be calculated. That is, the cooperation information is integrated into the channel state to form channel cooperation.
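The collision rule described in this module can be written as a small function that maps the action vector to ACK signals, rewards, and remaining channel capacity. This is a sketch under several assumptions: action 0 denotes "no transmission", a user that does not transmit gets ACK 0 and reward 0, PU activity is not modeled, and the name `cooperative_observation` is hypothetical.

```python
from collections import Counter


def cooperative_observation(actions, num_channels):
    """Derive ACKs, rewards and remaining capacity from the users' actions.

    actions[i] is 0 (no transmission) or k in 1..num_channels (channel k).
    """
    counts = Counter(a for a in actions if a != 0)
    # A transmission succeeds only if no other user chose the same channel.
    acks = [1 if a != 0 and counts[a] == 1 else 0 for a in actions]
    rewards = list(acks)  # the instant reward equals the ACK signal
    # Capacity is 0 where a transmission succeeded (ACK = 1), otherwise 1.
    capacity = [0 if counts[k] == 1 else 1 for k in range(1, num_channels + 1)]
    return acks, rewards, capacity


# Four users, three channels: users 0 and 1 collide on channel 1,
# user 2 succeeds on channel 2, user 3 stays silent.
acks, rewards, capacity = cooperative_observation([1, 1, 2, 0], num_channels=3)
```

The returned values are exactly the pieces that are written back into each user's next state, which is how the cooperation information enters the input of the network.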

DMRL Training
In each time slot, we record <S, A, R, S'> into the experience memory, where S represents the state space, A represents the action, R represents the reward, and S' represents the next state space. After a batch of records is extracted from the experience cache, S and S' are normalized to calculate $Q_{pred}$ and $Q_{next}$ through the Q value, and then $Q_{actual}$ can be calculated according to:

$$Q_{actual} = R + \gamma \max Q_{next}. \tag{15}$$

Utilizing the functions provided by TensorFlow, the loss can be calculated and its gradient optimization can be performed to update the network parameters. The specific formulas are listed as:

$$Q_{loss} = \mathrm{reduce\_mean}\left(\mathrm{square}(Q_{actual} - Q_{pred})\right), \tag{16}$$

$$opt = \mathrm{AdamOptimizer.minimize}(Q_{loss}). \tag{17}$$

To sum up, the specific process of the proposed deep multi-user reinforcement learning algorithm is presented in Algorithm 2.
Algorithm 2 Deep multi-user reinforcement learning for dynamic cooperative spectrum sensing
Input: State S and parameter values, including channel number K, user number N, action space size K + 1, state space size 2 × (K + 1), exploration probability ε, buffer space size, hidden layer size, experience memory D, and channel environment, etc.
Output: Q value, action A
BEGIN
Step 1: Initialize the parameter values and the channel environment
Step 2: Randomly sample an action A from the channel environment, and obtain the observation obs and the reward R after executing A. From these, the model obtains the next state S'
Step 3: Store the record <S, A, R, S'> in the experience memory D
Step 4: Assign S' to S and generate a random number ε_i.
(1) If the random number ε_i is less than the exploration probability ε, randomly sample A from the channel environment.
(2) If not, input the current state S into the neural network and obtain Q according to Equation (13); the Q value is then normalized according to Equation (14). A is selected according to the maximum value, and the corresponding channel is selected for sensing according to the action.
Step 5: Obtain S' and R according to the selected A, and store the record <S, A, R, S'> in the experience cache
Step 6: Sample from the experience cache and train the network according to the loss function and the gradient descent strategy; the specific operations are shown in (16) and (17)
Step 7: If the iteration is not over, continue from Step 4
END

Experiment
In this section, we evaluate the performance of the proposed cooperative spectrum sensing scheme. The performance metrics include the total reward, the cumulative collision volume and the cumulative reward of all users. To verify the performance on dynamic cooperative spectrum sensing, we compare the proposed DMRL with two popular methods: multi-agent reinforcement learning (MARL) and multi-agent deep reinforcement learning (MADRL).

Simulation Setup
In the experiments, decentralized cognitive MAC protocols are considered for a secondary network in which N users share K orthogonal channels; these protocols allow SUs to independently search for spectrum opportunities without a central coordinator or a dedicated communication channel. Each cognitive user can obtain slot information from the PU, so each cognitive user pair is allocated to each slot. In each slot, spectrum sensing, data transmission and ACK transmission are completed in sequence. In the spectrum sensing stage, the CR sender and receiver, following the POMDP formulation, take the effective throughput as the benefit function to decide which frequency band to sense and which frequency band to use for data transmission.
The proposed DMRL consists of an input layer, three hidden layers, and an output layer, as shown in Figure 4. The last hidden layer is divided into a value sub-network and an advantage sub-network. We follow the setting of Dueling DQN [40] with appropriate adjustments, and set the number of neurons in each of the first two hidden layers to 150. The states of the proposed DMRL are defined as the combination of the user's state on each channel, the related capacity, and the ACK signals. The exploration rate ε is set to 0.15, which balances exploration and exploitation well. When the weights of DMRL are updated, a mini-batch M of 25 samples is randomly selected from the historical data to calculate the loss function, and the Adam algorithm is used to carry out stochastic gradient descent. To observe different time slots, the total number of training time slots is set to 100,000. The detailed simulation parameters are shown in Table 1.
To evaluate sensing in multi-user scenarios, two networks with different numbers of users and channels are simulated. To achieve good results with low time consumption, the number of iterations is set to 20 and the step size is set to 5000.
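To make the state definition and the value/advantage split more concrete, the following sketch builds one possible state vector and combines the two sub-network outputs. Both parts are assumptions: the layout of the state (one-hot last action, per-channel capacity, ACK) is simply one arrangement that yields the stated size 2 × (K + 1), and the aggregation Q = V + (A - mean(A)) is the usual one from Dueling DQN [40], which may differ in detail from Equation (13). The function names are hypothetical.

```python
def encode_state(action, capacities, ack, num_channels):
    """Build a 2 * (K + 1) vector: one-hot action (K + 1), capacities (K), ACK (1)."""
    if len(capacities) != num_channels:
        raise ValueError("expected one capacity value per channel")
    one_hot = [0.0] * (num_channels + 1)
    one_hot[action] = 1.0  # action 0 means "no transmission" (assumed convention)
    return one_hot + [float(c) for c in capacities] + [float(ack)]


def dueling_q_values(value, advantages):
    """Combine V(s) and A(s, a) as Q = V + (A - mean(A))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]
```

For K = 3, `encode_state(2, [1, 0, 1], 1, 3)` yields a vector of length 8, and subtracting the mean advantage keeps the value and advantage streams identifiable.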
In the multi-user case, many studies [36] consider the case of 2 users and 3 channels. To verify the case with more users, we also consider the case of 10 users and 6 channels in the experiments. The specific training settings are shown in Table 2.
Three quantitative indicators are applied to evaluate the performance of the dynamic cooperative spectrum sensing models: the total reward, the cumulative collision volume and the cumulative reward of all users.
The total reward represents the reward obtained after the users sense the channels in each time slot. If an SU successfully senses a channel, its reward is set to 1, so the total reward equals the number of SUs that sense successfully. For example, a total reward of 0 means that no SU successfully senses a channel, and a total reward of 4 means that 4 SUs do. With $r_i(t)$ denoting the reward of user i in time slot t, the total reward is:
$R_{total}(t) = \sum_{i=1}^{N} r_i(t)$
The cumulative collision indicates the accumulated number of collisions as the time slot increases. A collision happens when two or more SUs access the same authorized channel in the same time slot. In general, the number of collisions is the number of users who access the same authorized channel at the same time; for example, if 3 SUs sense channel 1 at the same time, the number of collisions is 3. The cumulative collision rate is defined as the ratio of the number of cumulative collisions to the cumulative sensing attempts. Another way to express the collisions is the total number of channels minus the total reward value: if the number of channels is 5 and the total reward value is 3, the number of collisions is 2. With $c(t)$ denoting the number of collisions in time slot t, the cumulative collision up to time slot T is:
$C(T) = \sum_{t=1}^{T} c(t)$
The last indicator is the cumulative reward, which represents the accumulated total reward as the time slot increases. It can be calculated by:
$R_{cum}(T) = \sum_{t=1}^{T} R_{total}(t)$
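As a rough illustration of how the three indicators could be computed from logged channel choices, consider the sketch below. It is an assumption-laden example rather than the evaluation code used in the experiments: the encoding (0 = the SU stays silent, 1..K = chosen channel), the function names, and the simplification that every chosen channel is free of PU traffic are all hypothetical. Collisions are counted as the number of users sharing a channel, following the "3 SUs on channel 1 gives 3 collisions" example.

```python
from collections import Counter


def slot_metrics(choices):
    """Return (total_reward, collisions) for one time slot.

    choices[i] is the channel picked by SU i, or 0 if it does not transmit.
    """
    counts = Counter(c for c in choices if c != 0)
    # An SU earns reward 1 only if it is alone on its channel.
    total_reward = sum(1 for c in choices if c != 0 and counts[c] == 1)
    # Every SU on a shared channel counts as one collision.
    collisions = sum(n for n in counts.values() if n >= 2)
    return total_reward, collisions


def cumulative_metrics(slots):
    """Accumulate reward and collisions over all slots and compute the collision rate."""
    cum_reward = cum_collisions = attempts = 0
    for choices in slots:
        reward, collisions = slot_metrics(choices)
        cum_reward += reward
        cum_collisions += collisions
        attempts += sum(1 for c in choices if c != 0)
    rate = cum_collisions / attempts if attempts else 0.0
    return cum_reward, cum_collisions, rate
```

For instance, `cumulative_metrics([[1, 1, 1], [1, 2, 0], [1, 2, 2]])` accumulates a reward of 3 and 5 collisions over 8 sensing attempts, giving a cumulative collision rate of 0.625.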

Simulation Results
In this section, the proposed framework is simulated and analyzed under different experimental settings. Firstly, with a fixed number of authorized channels, the total number of successful packet transmissions is observed as the number of training time slots increases. Secondly, the total number of successful packet transmissions is analyzed when the number of users increases significantly. Finally, MARL and MADRL are chosen as comparison approaches to verify the performance of the proposed DMRL.
Figure 5 shows the total reward, the cumulative collision and the cumulative reward obtained by DMRL at the beginning of training, with 2 authorized channels and 3 users. From Figure 5a, the total reward is mainly distributed between 0 and 1, which shows that in most cases either no user or only one user successfully accesses a channel. As shown in Figure 5b, when the number of authorized channels and the number of users remain unchanged, the distribution of the total reward changes as training proceeds, and most values of the total reward are 1. Comparing Figure 5a,b, the cumulative collision rate decreases and the cumulative reward increases. In the first 5000 time slots, the cumulative collision volume is high and the cumulative reward is low, mainly because a neural network with little training cannot efficiently learn the channel access strategy.
After training for about 100,000 time slots, the experimental results are shown in Figure 6. Most values of the total reward are 2, which means that the two authorized channels are well allocated among the three users with few conflicts. The cumulative collision rate is also reduced from 0.7 to 0.02, which greatly reduces the probability of conflict.
The cumulative reward has also increased from 2500 to 10,000. These indicators show that the proposed DMRL can effectively solve the multi-user DSA problem, reducing the collision rate while maximizing the success rate of channel access.
Figure 6. When K = 2 and N = 3, the total reward, cumulative collision, and cumulative reward for the last 5000 time slots.
Figure 7 shows the scenario with 6 authorized channels and 10 users. The total reward increases with the number of training slots, i.e., more authorized channels are reasonably allocated, but the rate of increase is slow.
Figure 7a shows that, at the beginning of training, the total reward is distributed around a low value and the cumulative collision volume remains in a high range, while the cumulative reward remains in a low range. The reason is the same as before, i.e., too few channel sensing records are available to train the neural network. As can be seen in Figure 7b, as training proceeds, the multi-user cooperative sensing strategy learned by the neural network becomes more accurate and the total reward gradually increases, although the improvement is relatively slow and the cumulative collision volume is still not low. This is because increasing the number of authorized channels and the number of secondary users enlarges the state space, and a more difficult problem takes longer to solve. When the amount of training reaches a certain level, the experimental results are as shown in Figure 7c,d. Although the total reward distribution remains in the upper middle range, fluctuations and unstable points appear. With the cumulative collision remaining at 8000 and the cumulative reward at 20,000, it can be seen that learning has stagnated because of the complexity of the problem. This will be one of the problems to be solved in future work.
To verify the performance of the cooperative strategy, the proposed DMRL is compared with two popular methods: MARL and MADRL. MARL is a multi-agent reinforcement learning method based on Q-learning, proposed to avoid sweeping interference signals and accidental interference from other wideband autonomous cognitive radios (WACRs). MADRL allows multiple WACRs to operate simultaneously over the same wide spectrum band; during the learning process, it more easily achieves faster convergence and better reward performance by balancing exploration and exploitation. The same data are used to train MARL and MADRL, and the average collision and average reward are computed. The specific data are shown in Table 3. According to Table 3, the average collision of each policy, listed in descending order, is 0.37 (MARL), 0.24 (MADRL), and 0.06 (DMRL). It can be seen that DMRL performs best in the complicated real scenario.
The average reward indicator also shows that our method achieves a higher average reward. This is because the dynamic multi-channel cooperative sensing method based on DMRL adaptively learns multi-user sensing signals and allocates the channel resources equally and reasonably while maximizing the objective function. Although MARL and MADRL can learn the sensing strategy from user history information, their users tend to settle on a fixed channel in the multi-channel case. Therefore, when two users transmit on the same channel, they easily fall into a persistent conflict.
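To put the average collision values quoted from Table 3 into perspective, the relative reduction achieved by DMRL can be worked out directly. The short sketch below uses only the three values stated in the text; the helper name `relative_reduction` is hypothetical.

```python
# Average collision values quoted in the text from Table 3.
avg_collision = {"MARL": 0.37, "MADRL": 0.24, "DMRL": 0.06}


def relative_reduction(baseline, proposed):
    """Fraction by which `proposed` lowers `baseline`."""
    return (baseline - proposed) / baseline


for name in ("MARL", "MADRL"):
    drop = relative_reduction(avg_collision[name], avg_collision["DMRL"])
    print(f"DMRL vs {name}: {drop:.1%} lower average collision")
```

Under these figures, the average collision of DMRL is roughly 84% lower than that of MARL and 75% lower than that of MADRL.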

Conclusions
This paper studies the multichannel access problem by proposing deep multi-user reinforcement learning. To reduce the conflicts among users accessing channels and the high complexity caused by the large state space, we consider a cooperative strategy and propose a dynamic multi-channel cooperative sensing algorithm based on DDQN. The proposed algorithm achieves better reward performance with faster convergence than other algorithms based on Q-learning and DQN, especially in large networks. In our algorithm, DDQN is trained for each user individually, and each user judges whether there is a conflict through the ACK signal. Compared with other DSA methods, the proposed DMRL considers multi-user access and a cooperative DSA network in the presence of spectrum sensing errors. We conducted simulation experiments using publicly available real communication data sets. The simulation results verify the superior performance of the proposed algorithm, which has a lower conflict rate and obtains higher rewards in the multi-user case, and which can promote the development of CR technology toward more efficient spectrum utilization.