Multi-Agent Reinforcement Learning for Joint Cooperative Spectrum Sensing and Channel Access in Cognitive UAV Networks

This paper studies the problem of distributed spectrum/channel access for cognitive radio-enabled unmanned aerial vehicles (CUAVs) that overlay upon primary channels. Under the framework of cooperative spectrum sensing and opportunistic transmission, a one-shot optimization problem for channel allocation, aiming to maximize the expected cumulative weighted reward of multiple CUAVs, is formulated. To handle the uncertainty due to the lack of prior knowledge about the primary user activities as well as the lack of the channel-access coordinator, the original problem is cast into a competition and cooperation hybrid multi-agent reinforcement learning (CCH-MARL) problem in the framework of Markov game (MG). Then, a value-iteration-based RL algorithm, which features upper confidence bound-Hoeffding (UCB-H) strategy searching, is proposed by treating each CUAV as an independent learner (IL). To address the curse of dimensionality, the UCB-H strategy is further extended with a double deep Q-network (DDQN). Numerical simulations show that the proposed algorithms are able to efficiently converge to stable strategies, and significantly improve the network performance when compared with the benchmark algorithms such as the vanilla Q-learning and DDQN algorithms.


Introduction
Recent years have witnessed remarkable success of unmanned aerial vehicle (UAV) clusters in a variety of scenarios ranging from disaster relief to commercial applications of unmanned swarm operations [1,2]. As a backbone technology for UAV systems, communication protocol design for UAVs naturally receives intensive attention from both academia and industry [3,4]. However, due to the ad hoc nature of UAV networks, directly applying off-the-shelf wireless access protocols for vehicle-to-vehicle (V2V) communication becomes a difficult task, especially when the UAVs have to overlay upon the spectrum occupied by existing infrastructure and ensure zero interference. In this regard, the adoption of cognitive radio (CR) technologies [5,6] into UAV systems becomes a tempting solution, since it not only avoids a series of problems caused by the rigid fixed-spectrum authorization model [5,6], but also has the potential to adapt to a complex and time-varying radio environment. Nevertheless, UAVs are typically constrained by their on-device computation capabilities, yet are required to quickly respond to radio environment changes with limited coordination. Therefore, designing an intelligent mechanism to efficiently perform spectrum sensing and distributed channel access becomes a challenge of vital importance.
So far, pioneering studies have established a number of different frameworks for spectrum sensing in CR networks [7][8][9]. For instance, in [7], an iterative signal compression filtering scheme is proposed to improve the spectrum sensing performance of CR-enabled UAV (CUAV) networks. Its core idea is to adaptively eliminate the primary user (PU) component in the identified sub-channel, and directly update the measured value to detect other active users. In [8], the space-time spectrum sensing problem for CUAV networks in three-dimensional heterogeneous spectrum space is discussed. Spectrum detection is improved based on the fusion of sensing results over both the time domain and the space domain. In [9], aiming to reflect the dynamic topology changes of the CUAV network, a clustering method based on the maximum and minimum distances of nodes is proposed to improve the performance of cooperative spectrum sensing. With these spectrum sensing methods, channel access schemes, such as channel rendezvous for opportunistic channel reservation [10] and one-shot optimization-based channel allocation [11], can be deployed for throughput-optimal allocation for CUAVs.
The above studies tackle the channel sensing and allocation problem in CUAV networks by assuming that the radio environment is static and that a centralized information aggregator (e.g., a leader UAV) exists. However, more often than not, a practical CUAV network not only faces a time-varying channel environment, e.g., UAV-ground communication with multiple antennas incurs a 3D non-stationary geometry-based stochastic channel [12] and ultra-wideband communication follows the Saleh-Valenzuela time-varying statistical channel model [13], but is also deployed in an ad hoc manner. Therefore, it is necessary to develop a distributed sensing-allocation mechanism that incurs only an affordable level of V2V information-exchange overhead. For this reason, a series of distributed allocation mechanisms, in particular based on reinforcement learning (RL), have been proposed to replace traditional self-organizing schemes for spectrum sensing or channel access [6,[14][15][16][17][18]. In [6], a novel Q-learning-based method is proposed for secondary users (SUs) to select cooperative sensing nodes, using the discounted upper confidence bound (D-UCB) for strategy exploration and to reduce the number of sensing samples. In [14], a neighbor-based cooperative sensing mechanism using Q-learning is proposed for collaborative channel sensing by SUs. In [15], a robust joint sensing-allocation scheme based on RL is proposed to counter the impact of adversarial SUs (e.g., spectrum sensing data falsification attackers). In contrast to these tabular-search-based RL methods, deep neural networks (e.g., deep Q-networks) are adopted in [16] for state-value approximation.
Still, for the cooperative spectrum sensing problem [17], a multi-agent deep reinforcement learning method was adopted, in which each secondary user learns an efficient sensing strategy from the sensing results to avoid interference to the primary users, and the upper confidence bound with a Hoeffding-style bonus is used to improve the efficiency of exploration. Furthermore, cooperative multi-agent RL (MARL) methods are proposed for dynamic spectrum sensing and aggregation [18], typically with the aim of maximizing the number of successful transmissions without interrupting PUs.
In summary, most of the existing studies treat the problems of high-precision spectrum sensing and dynamic channel allocation separately. However, how to jointly optimize the cooperative channel sensing and spectrum access processes remains an open issue, especially in a time-varying radio environment. In addition, although a plethora of distributed algorithms (some based on RL [19][20][21]) have been proposed in the literature, most of them are subject to rigid assumptions and cannot be directly adopted by CUAV applications, which, for example, usually emphasize network/spectrum scalability or face real-world constraints such as limited sensing/signaling capabilities and limited energy/computation resources. These concerns naturally lead to the consideration of formulating the joint sensing-and-access problem from the perspective of the UAVs. As a result, the decision process of the CUAVs may face more complex coupling between sensing and access strategies when compared with purely cooperative methods. Based on these considerations, this paper investigates the semi-competitive channel-sensing-and-access problem in CUAV networks, where the spectrum sensing phase is organized cooperatively based on the exchange of binary sensing results. An MARL-based framework of strategy searching is proposed in the form of two distributed execution algorithms that address state-value representation differently. The main contributions of this paper are summarized as follows:

• To coordinate the behaviors of multiple CUAVs for efficient utilization of the idle spectrum resources of PUs, a CUAV channel exploration and utilization protocol framework based on sensing-fusion-transmission is proposed.
• A problem maximizing the expected cumulative weighted rewards of the CUAVs is formulated. Considering the practical constraints, i.e., the lack of prior knowledge about the dynamics of PU activities and the lack of a centralized access coordinator, the original one-shot optimization problem is reformulated as a Markov game (MG). A weighted composite reward function combining both the cost and the utility of spectrum sensing and channel access is designed to transform the considered problem into a competition and cooperation hybrid multi-agent reinforcement learning (CCH-MARL) problem.

• To tackle the CCH-MARL problem in a decentralized manner, UCB-Hoeffding (UCB-H) strategy searching and independent learner (IL) based Q-learning are introduced. More specifically, UCB-H is introduced to achieve a trade-off between exploration and exploitation during Q-value updating. Two decentralized algorithms with limited information exchange among the CUAVs, namely, IL-based Q-learning with UCB-H (IL-Q-UCB-H) and double deep Q-learning with UCB-H (IL-DDQN-UCB-H), are proposed. Numerical simulation results indicate that the proposed algorithms are able to improve the network performance in terms of both sensing accuracy and channel utilization.
The rest of this paper is organized as follows. Section 2 presents the network model and formulates the problem from a centralized perspective. Section 3 casts the problem into the context of MARL, and Section 4 proposes the RL-based solutions for joint spectrum sensing and channel access. Simulation results and analyses are presented in Section 5. Section 6 concludes the paper. Throughout, the occupancy of each PU channel is modeled as a Markov process with two states (i.e., busy and idle) [18], as shown in Figure 2.

Framework of Channel Sensing and Access
To enable the coexistence of multiple CUAVs over a limited number of PU frequency bands, we need a protocol framework to coordinate the channel sensing-and-access behaviors of these CUAVs. We assume that the CUAVs are able to access and synchronize over a dedicated common control channel (CCC), i.e., CH 0 in Figure 3, over which the spectrum sensing results and channel selection decisions can be shared among the CUAVs. We also assume that the CUAVs operating on the same PU channel transmit with non-orthogonal spectrum sharing techniques. The processes of spectrum sensing and channel access are organized in time slots (see Figure 3). More specifically, each time slot of PU channel sensing and utilization by the CUAVs is divided into three consecutive sub-frames of sensing (τ_s), cooperation (τ_c), and access/transmission (τ_t). At the beginning of the sensing sub-frame, the CUAVs decide which channels to sense and access by switching their transceivers to the corresponding channels. Note that in this sub-frame, some of the CUAVs may stay idle and select no channel. In the subsequent cooperation sub-frame, each CUAV broadcasts its own sensing result over the CCC in an orderly manner. Based on the received sensing results, each CUAV is able to perform local sensing-result fusion and obtain the same vector of state observations as the other CUAVs. The local fusion results are then used to decide whether to access the channel in the last sub-frame.
We assume that the messages exchanged over the CCC are reliable (cf. [17]), and for cooperation, we assume that the same fusion rule, such as the "K-out-of-N" or "AND" rule [26,27], is adopted by all the CUAVs. This ensures that all the CUAVs obtain a consistent observation of the status (i.e., busy or idle) of the PU channels. Obviously, the more CUAVs participate in sensing the same channel, the higher the accuracy of the sensing result [28]. However, since the CUAVs access the same channel that they sense, this also leads to a higher congestion level over that PU channel. Therefore, the CUAVs need to develop a proper channel selection strategy to balance the spectrum sensing accuracy (i.e., to reduce the transmission failure probability) against the quality of transmissions (i.e., to avoid severe congestion over the selected channel).

Figure 3. Structure of one time slot for the joint channel sensing and access protocol (sensing, cooperation, and access sub-frames).

Problem Formulation
Given the presented network model and the proposed access protocol, we know that the network performance is determined by the channel selection strategies of the CUAVs for joint channel sensing-and-access. Our goal is to find an appropriate approach to jointly reflect the system cost of cooperative spectrum sensing and the utility of successful transmissions. Furthermore, we aim to derive an optimal joint channel selection strategy for the CUAVs in the time-varying radio environment, such that the utilization of the PU channels is maximized. Therefore, from a Genie's perspective, we can formulate the following centralized optimization problem for the considered CUAV network:

max_{c^t_{n,m}} E[ ∑_{t=0}^{T} γ^t ∑_{n=1}^{N} ∑_{m=1}^{M} c^t_{n,m} r^t_{n,m} ],
s.t. ∑_{m=1}^{M} c^t_{n,m} ≤ 1, c^t_{n,m} ∈ {0, 1}, ∀n ∈ N, ∀t, (1)

where T is the total number of time slots of CUAV network operation (in practice, T is typically not known in advance). c^t_{n,m} is the binary decision variable of CUAV n on PU channel m in time slot t, with c^t_{n,m} = 1 if CUAV n selects PU channel m to sense and access in time slot t. r^t_{n,m} is the reward of CUAV n on PU channel m in time slot t, determined by the weighted sum of the user's sensing-access cost and utility; for ease of discussion, we defer the detailed definition of r^t_{n,m} to Section 3. Obviously, r^t_{n,m} = 0 if c^t_{n,m} = 0. γ ∈ [0, 1) is the reward discount factor that translates future rewards into rewards at t = 0 [29]. In addition, the expectation E[·] is taken over the PU channel evolution model (see also Figure 2). Without the expectation E[·], (1) degrades to a one-shot, NP-hard binary programming problem. In the real world, however, the PU channel evolution model is not known in advance, and it is impractical to assign a centralized coordinator in the CUAV cluster due to the constraints on on-device computation/signaling capabilities.
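To make the combinatorial nature of (1) concrete, the following minimal Python sketch enumerates all joint channel selections for a toy one-shot instance. The per-CUAV reward table here is hypothetical and, unlike in (1), assumed decoupled across CUAVs (no congestion coupling), purely for illustration; the point is that the search space of size (M + 1)^N is why a centralized one-shot solution does not scale.

```python
import itertools

def one_shot_allocation(reward, N, M):
    """Brute-force search of a toy one-shot channel-allocation problem.

    reward[n][m] is a hypothetical per-slot reward of CUAV n on channel m,
    with m = 0 meaning 'stay idle' (reward 0). Each CUAV picks at most one
    channel, so a joint decision is one channel index per CUAV and the
    search space grows as (M + 1)^N.
    """
    best_value, best_choice = float("-inf"), None
    for choice in itertools.product(range(M + 1), repeat=N):  # all (M+1)^N joint decisions
        value = sum(reward[n][m] for n, m in enumerate(choice))
        if value > best_value:
            best_value, best_choice = value, choice
    return best_value, best_choice
```

Even for a handful of CUAVs and channels the enumeration is manageable, but the exponential growth in N rules this out as an online, per-slot procedure.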
Therefore, in the following, we reformulate the static problem as described in (1) into a CCH-MARL problem based on MG, and then resort to the MARL-based algorithms for deriving the channel selection strategies of the CUAVs.

Markov Game-Based Problem Formulation
Before proceeding to the reformulation of the considered problem, we provide the definition of MG as follows.
Definition 1 (Markov game [30]). An MG is defined by a six-tuple (N, S, {A_n}_{n∈N}, P, {r_n}_{n∈N}, γ), where:
• N is the set of agents;
• S is the state space observed consistently by all agents;
• A_n is the action space of agent n, and the joint action space of all the agents is A = A_1 × A_2 × ... × A_N;
• P : S × A × S → [0, 1] is the transition probability from any state s ∈ S to any state s′ ∈ S for any given joint action a = (a_1, a_2, ..., a_N) ∈ A;
• r_n : S × A × S → R is the reward function determining the instant reward received by agent n in the controlled Markov process from (s, a) to s′;
• γ ∈ [0, 1) is the reward discount factor.
Based on Definition 1, we are able to map the considered optimization problem from (1) into the following MG: • Agent Set N consists of the N CUAVs (agents), i.e., N = {1, ..., N}.

•
State space S of the MG is defined such that the state at time slot t is

s^t = (s^t_0, s^t_1, ..., s^t_M, o^t_1, ..., o^t_M), (2)

where s^t_m ∈ {0, 1, ..., N} is the number of CUAVs that selected PU channel m to sense and access in the previous time slot. In particular, s^t_0 is the number of CUAVs that did not select any PU channel. Since each CUAV can select at most one PU channel for sensing-and-access, ∑_{m=0}^{M} s^t_m = N. o^t_m ∈ {0, 1} is the observed occupancy state of PU channel m in the previous time slot. Following (2), the size of the state space is upper-bounded by 2^M (M + 1)^N.
• Action space: let a^t_n ∈ A_n denote the PU channel selected by agent n at time slot t, where a^t_n = 0 indicates that no channel is selected. The joint action space A = ∏_{n=1}^{N} A_n is defined as the Cartesian product of the action spaces of all the CUAVs, and the joint action at time slot t is a^t = (a^t_1, ..., a^t_N) ∈ A.
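The state construction in (2) can be sketched as follows; the helper name and the tuple encoding are illustrative choices, not prescribed by the paper.

```python
def build_state(joint_action, occupancy, M):
    """Build the MG state s = (s_0, ..., s_M, o_1, ..., o_M) of (2).

    joint_action[n] in {0, ..., M} is the channel picked by CUAV n (0 = none);
    occupancy[m-1] in {0, 1} is the observed busy/idle flag of channel m.
    s_m counts how many CUAVs selected channel m, so the first M + 1 entries
    always sum to the number of CUAVs N.
    """
    counts = [0] * (M + 1)
    for a in joint_action:
        counts[a] += 1           # tally per-channel selections (index 0 = idle)
    return tuple(counts) + tuple(occupancy)
```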

Definition of CUAVs' Reward Function
Let m and N^{t+1}_m denote the PU channel selected by CUAV n (i.e., a^t_n = m) and the set of CUAVs selecting the same channel at time slot t + 1, respectively. For the considered CUAV network, the reward of each CUAV is defined by the weighted sum of the cost of its spectrum exploration (spectrum sensing) and the utility obtained from channel utilization (channel access). The reward r^{t+1}_n for CUAV n is defined as

r^{t+1}_n =
  −η E^{t+1}_{ss,n}, if o^{t+1}_m = 1 and d^{t+1}_m = 1,
  −η E^{t+1}_{ss,n} − μ E^{t+1}_{dt,n}, if o^{t+1}_m = 1 and d^{t+1}_m = 0,
  −η E^{t+1}_{ss,n} − μ E^{t+1}_{dt,n} + R^{t+1}_n, if o^{t+1}_m = 0 and d^{t+1}_m = 0,
  −η E^{t+1}_{ss,n} − R^{t+1}_n, if o^{t+1}_m = 0 and d^{t+1}_m = 1,
  0, if a^t_n = 0, (3)

where d^{t+1}_m ∈ {0, 1} is the sensing fusion result of the cooperating CUAVs over PU channel m at time slot t + 1. d^{t+1}_m is a function of a^t, i.e., d^{t+1}_m = f(a^t), where the form of f(·) is determined by the adopted sensing fusion rule. Due to inevitable missed detections and false alarms [14], the real PU channel state o^{t+1}_m may not be consistent with the sensing fusion result d^{t+1}_m, which gives rise to the first four cases in (3). In (3), E^{t+1}_{ss,n} and E^{t+1}_{dt,n} are the spectrum sensing and channel access costs of CUAV n, respectively; these costs are mainly incurred by the energy consumption of the transceiver for spectrum sensing and data transmission. R^{t+1}_n is the reward corresponding to the amount of successfully transmitted data during time slot t + 1 for CUAV n. η ∈ (0, 1) and μ ∈ (0, 1) are the weighting factors of the spectrum sensing and channel access costs, respectively. The five cases in (3) are further explained as follows: (i) if PU channel m is busy and the sensing fusion result agrees, i.e., o^{t+1}_m = d^{t+1}_m = 1, the reward of CUAV n is solely determined by the spectrum sensing cost −E^{t+1}_{ss,n}; (ii) if PU channel m is busy but the sensing fusion result leads to a missed detection, i.e., o^{t+1}_m = 1, d^{t+1}_m = 0, CUAV n's reward is determined by the weighted sum of the spectrum sensing cost −E^{t+1}_{ss,n} and the cost of the failed data transmission −E^{t+1}_{dt,n}; (iii) if PU channel m is idle and the sensing fusion result agrees, i.e., o^{t+1}_m = d^{t+1}_m = 0, CUAV n's reward is determined by the weighted sum of the sensing cost −E^{t+1}_{ss,n}, the data transmission cost −E^{t+1}_{dt,n}, and the utility of the successful transmission R^{t+1}_n; (iv) if PU channel m is idle but the fusion result leads to a false alarm, i.e., o^{t+1}_m = 0, d^{t+1}_m = 1, the reward of CUAV n is determined by the weighted sum of the spectrum sensing cost −E^{t+1}_{ss,n} and the lost transmission utility −R^{t+1}_n; (v) if CUAV n does not select any PU channel, i.e., a^t_n = 0, the reward is 0.

Furthermore, we adopt the following forms of E^{t+1}_{ss,n}, E^{t+1}_{dt,n}, and R^{t+1}_n in (3):
• Spectrum sensing cost E^{t+1}_{ss,n} of CUAV n at time slot t + 1 is defined as the energy consumed for spectrum sensing, namely, a quantity proportional to the working voltage V_DD of the receiver, the bandwidth B of the sensed channel, and the sensing duration τ_{s,n} [31].
• Data transmission cost E^{t+1}_{dt,n} of CUAV n in time slot t + 1 is defined as the energy consumed for data transmission during the time slot, E^{t+1}_{dt,n} = τ_{t,n} p_{t,n}, where τ_{t,n} and p_{t,n} are the data transmission duration and transmit power, respectively. τ_{s,n}, τ_{t,n}, and p_{t,n} are assumed to be the same for all the CUAVs, i.e., τ_{s,n} = τ_s, τ_{t,n} = τ_t, p_{t,n} = p_t, ∀n ∈ N.
• Transmission utility R^{t+1}_n of CUAV n in time slot t + 1 (cf. cases (iii) and (iv)) is measured as the amount of data transmitted over the time slot. We evaluate the quality of transmission by the throughput over the selected channel under co-channel interference:

R^{t+1}_n = τ_t B log_2(1 + SINR^{t+1}_{n,m}),

where SINR^{t+1}_{n,m} is the received signal-to-interference-plus-noise ratio (SINR) of CUAV n over its selected PU channel m, expressed as

SINR^{t+1}_{n,m} = g_{n,m} p_t / (σ² + ∑_{j∈N^{t+1}_m, j≠n} g_{nj,m} p_t),

where σ² is the noise power, g_{n,m} is the channel gain of CUAV n on PU channel m, and g_{nj,m} is the channel gain between CUAV j and CUAV n on PU channel m.
As mentioned earlier, with platooning of the CUAV cluster, the channel gains among the CUAVs can be considered quasi-static over the period of interest. The term ∑_{j∈N^{t+1}_m, j≠n} g_{nj,m} p_t is the co-channel interference from the other CUAVs sharing the same PU channel m. Since the spatial positions and the transmitting-receiving relationships of the CUAVs over the same channel are not necessarily identical, the channel gains between different CUAV pairs differ, and thus the received SINRs of the CUAVs are different.
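The reward structure described above can be sketched in Python as follows. Since the exact weighting in the paper's Equation (3) is reconstructed here from the verbal description of cases (i)-(v), the particular signs and weight placements in this sketch should be read as an assumption rather than the paper's definitive formula.

```python
import math

def transmission_utility(tau_t, bandwidth, sinr):
    """Throughput-style utility R = tau_t * B * log2(1 + SINR)."""
    return tau_t * bandwidth * math.log2(1.0 + sinr)

def cuav_reward(o_m, d_m, selected, e_ss, e_dt, r_n, eta, mu):
    """Five-case weighted reward of one CUAV, following cases (i)-(v).

    o_m: true PU channel state (1 = busy), d_m: fused sensing decision,
    selected: whether the CUAV picked a channel at all. eta weights the
    sensing cost and mu the access cost (assumed placement of the weights).
    """
    if not selected:                       # case (v): no channel selected
        return 0.0
    if o_m == 1 and d_m == 1:              # case (i): busy and detected -> sensing cost only
        return -eta * e_ss
    if o_m == 1 and d_m == 0:              # case (ii): missed detection -> failed transmission
        return -eta * e_ss - mu * e_dt
    if o_m == 0 and d_m == 0:              # case (iii): idle and detected -> successful transmission
        return -eta * e_ss - mu * e_dt + r_n
    return -eta * e_ss - r_n               # case (iv): false alarm -> lost utility
```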
Finally, we examine the impact of fusion rules on the sensing fusion result d^{t+1}_m = f(a^t) in (3). In this paper, the "K-out-of-N" spectrum sensing fusion rule [26] is adopted to obtain the final spectrum sensing fusion result, namely,

d^{t+1}_m = 1 if ∑_{n∈N^{t+1}_m} 1_{{d^{t+1}_{n,m} = 1}} ≥ K, and d^{t+1}_m = 0 otherwise, (8)

where d^{t+1}_{n,m} is the local sensing decision of CUAV n on channel m, and 1_{{·}} is the indicator function taking the value 1 if the enclosed condition is true and 0 otherwise. In particular, for (8), if K = 1, the "K-out-of-N" rule degrades to the "OR" rule, while if K = N, it becomes the "AND" rule [26]. We assume that the observation of each CUAV follows an independent, stationary observation process on the binary Markov process in Figure 2.
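A minimal sketch of the "K-out-of-N" fusion rule in (8), with the local busy/idle reports of the cooperating CUAVs passed as a list:

```python
def fuse_k_out_of_n(local_decisions, k):
    """'K-out-of-N' fusion: declare the channel busy (1) iff at least k of
    the cooperating CUAVs report busy. k = 1 recovers the 'OR' rule and
    k = len(local_decisions) recovers the 'AND' rule."""
    return 1 if sum(local_decisions) >= k else 0
```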

MARL Algorithm Framework
When the model of the state transitions in the established MG is unknown to the CUAVs, we aim to learn to optimize the long-term statistical performance of the CUAV network. From the perspective of a single CUAV n, the problem of social optimization in (1) is transformed into the following local optimization problem, ∀n ∈ N:

max_{π_n} v_n(s^0, (π_n, π_{−n})) = E[ ∑_{t=0}^{+∞} γ^t r^{t+1}_n | s^0, (π_n, π_{−n}) ], (9)

where the discount factor γ reflects the effect of future rewards on optimal decision-making. π_{−n} denotes the joint policy of all the CUAVs except CUAV n, and v_n(s^0, π_n, π_{−n}) is the value function for the given initial state s^0 and joint policy (π_n, π_{−n}). Herein, the policy of CUAV n is defined as π_n : S → Δ(A_n), where Δ(A_n) is the collection of probability distributions over CUAV n's action space A_n, and π_n(a^t_n|s^t) ∈ [0, 1] in π_n(s^t) = {π_n(a^t_n|s^t) | a^t_n ∈ A_n} is the probability that CUAV n chooses action a^t_n in state s^t during time slot t. In this MARL process, each CUAV aims to find a policy π_n that maximizes its expected cumulative discounted reward, given the (implicit) impact of the adversary policies of the other CUAVs.
It is known that, without considering the influence of the other CUAVs' actions, the solution to (9) is a fixed point of the following Bellman equation, which can be found by iterative search:

v_n(s^t) = max_{a^t_n} [ r^{t+1}_n(s^t, a^t_n) + γ ∑_{s^{t+1}∈S} P(s^{t+1}|s^t, a^t_n) v_n(s^{t+1}) ], (10)

where r^{t+1}_n(s^t, a^t_n) is the instant reward of CUAV n if it takes action a^t_n in system state s^t at time slot t, and P(s^{t+1}|s^t, a^t_n) is the state transition probability described in Section 3.1. Based on (10), the classical Q-learning method [29] can be adopted by each CUAV to approximate the solution to (10) by treating the adversary CUAVs as part of a stationary environment. The Q-function is then updated as

q^{t+1}_n(s^t, a^t_n) ← (1 − α_t) q^t_n(s^t, a^t_n) + α_t [ r^{t+1}_n + γ max_a q^t_n(s^{t+1}, a) ], (11)

where q^{t+1}_n(s^t, a^t_n) is the estimated state-action value at t + 1 if CUAV n takes action a^t_n in state s^t, and α_t ∈ [0, 1) is the time-varying learning rate. It is proved in [32] that if ∑_{t=1}^{∞} α_t = ∞, ∑_{t=1}^{∞} (α_t)² < ∞, and the assumption of a stationary environment holds, then the iterative sequence based on (11) converges to the optimal state-action value as each state-action pair is visited sufficiently often. Based on (10), we now explicitly consider the impact of the adversary policies on the performance of CUAV n. Let π = (π_n, π_{−n}), and let a^t_{−n} denote the actions of all the CUAVs except CUAV n in time slot t. Then, (9) can be rewritten as

max_{π_n} v_n(s^0, (π_n, π_{−n})) = max_{π_n} ∑_{t=0}^{+∞} γ^t E( r^{t+1}_n(s^t, (π_n, π_{−n})) | s^0, (π_n, π_{−n}) ). (12)
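A minimal tabular sketch of the update in (11), treating the other CUAVs as part of a stationary environment; the dictionary-based Q-table with a default value of 0 is an implementation choice for illustration, not from the paper.

```python
def q_update(q, s, a, r, s_next, actions, alpha, gamma):
    """One tabular Q-learning step for an independent learner.

    q maps (state, action) pairs to values (missing entries default to 0).
    Returns the updated Q-value of (s, a)."""
    best_next = max(q.get((s_next, b), 0.0) for b in actions)  # greedy bootstrap
    old = q.get((s, a), 0.0)
    q[(s, a)] = (1 - alpha) * old + alpha * (r + gamma * best_next)
    return q[(s, a)]
```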
With (12), for any s^0 ∈ S, each CUAV searches for the optimal π_n that maximizes its value function v_n(s^0, (π_n, π_{−n})), given the stationary adversary policy π_{−n}. The joint solution to (12) for all n ∈ N leads to a Nash equilibrium (NE) solution, which can be mathematically defined as follows.

Definition 2 (Nash equilibrium). A joint policy π* = (π*_1, ..., π*_N) is an NE of the MG if, for all n ∈ N and all s^0 ∈ S,

v_n(s^0, (π*_n, π*_{−n})) ≥ v_n(s^0, (π_n, π*_{−n})), ∀π_n.
Although there always exists an NE for discounted MGs [33], guaranteeing the convergence to an NE through decentralized learning without exchanging the reward/policy information still remains an open problem. To tackle our considered problem in a decentralized manner, we leverage the idea of IL [22], and propose a Q-learning-based algorithm and a DDQN-based algorithm in Section 4. Fortunately, we are able to show the convergence of the proposed algorithms through numerical simulations in Section 5.

Algorithm Design Based on Independent Learner
In this section, we introduce an exploration strategy based on UCB-H, with which we develop two MARL algorithms in the IL framework. The information-exchange overhead and execution complexity of the proposed algorithms are also discussed.

UCB-H Strategy
The main aim of introducing a UCB-based action exploration strategy is to avoid the drawbacks of the traditional ε-greedy strategy, which imposes no preference for actions that are nearly greedy or particularly uncertain [29]. The original UCB strategy was proposed for the multi-armed bandit scenario without discerning the underlying state evolution [29]:

a^t = arg max_a [ q^t(a) + c √( ln t / N^t_n(a) ) ], (14)

where N^t_n(a) is the number of times that action a has been selected prior to time slot t, and c > 0 controls the degree of exploration. With (14), actions with lower estimated values or that have already been selected frequently are selected with decreasing frequency over time [29]. For our considered channel selection problem, a modification is needed to replace N^t_n(a) by the number of times the state-action pair (s^t, a^t) has been visited. For the studied problem, we introduce the UCB-H strategy to achieve a trade-off between action exploration and exploitation (cf. [17,34]). Specifically, it also helps a CUAV balance its strategy between preferring cooperation during sensing and incurring more interference from competition during channel access. Based on (14), the corresponding Q-value updating rule becomes, from (11),

Q^{t+1}_n(s^t, a^t_n) ← (1 − α_t) Q^t_n(s^t, a^t_n) + α_t [ r^{t+1}_n + γ max_a Q^t_n(s^{t+1}, a) + b_t ], (15)

with the confidence bonus

b_t = c √( H³ ln(|S||A|T/p) / N^t_n(s^t, a^t) ). (16)

In (15) and (16), α_t is the time-varying learning rate. b_t is the confidence bonus indicating how certain the algorithm is about the current state-action pair. N^t_n(s^t, a^t) is the number of times that the state-action pair (s^t, a^t) has been visited prior to time slot t. T is the total number of time slots of the CUAV network operation. p is an arbitrarily small value that ensures the total regret of the learning process is upper-bounded by O(√(H⁴|S||A|T ln(|S||A|T/p))) with probability 1 − p. H is the number of steps in each episode of an episodic Markov decision process (MDP), with H = 1 in a general MDP, i.e., our considered scenario [17,34].
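The bonus term b_t of (16) can be sketched as below; the constant c and the logarithmic argument follow the UCB-H form discussed above and should be read as a reconstruction rather than the paper's exact tuning.

```python
import math

def ucb_h_bonus(c, visits, H, S, A, T, p):
    """Hoeffding-style confidence bonus b_t = c * sqrt(H^3 * ln(S*A*T/p) / N(s,a)).

    With H = 1 (one step per episode, as in the considered scenario) the bonus
    decays as 1/sqrt(visits), so rarely tried state-action pairs keep a large
    optimistic bonus and are explored preferentially."""
    return c * math.sqrt(H ** 3 * math.log(S * A * T / p) / max(visits, 1))
```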

IL-Q-UCB-H Algorithm
By treating the other CUAVs as part of the environment, the IL-Q-UCB-H algorithm can be developed based on standard Q-learning with UCB-H. This essentially approximates the original MARL problem in the MG by a group of single-agent RL problems, as shown in Figure 4. For ease of generalization, we provide in (17) the traditional IL-Q update that adopts an ε-greedy strategy for action selection:

Q^{t+1}_n(s^t, a^t_n) ← (1 − α_t) Q^t_n(s^t, a^t_n) + α_t [ r^{t+1}_n + γ max_a Q^t_n(s^{t+1}, a) ]. (17)

UCB-H can be conveniently incorporated into (17) by modifying the temporal-difference term therein, as in (15).
For n ∈ N, we set the learning rate of IL-Q uniformly as [24]

α_t = c_α / t^{ϕ_α}, (18)

where c_α > 0 and ϕ_α ∈ (0.5, 1]. With either (15) or (17), the action update is obtained through tabular search:

a^{t+1}_n = arg max_a Q^{t+1}_n(s^{t+1}, a). (19)

In summary, the IL-Q-UCB-H algorithm based on standard IL-Q learning is described in Algorithm 1.

Algorithm 1: IL-Q-UCB-H algorithm.
1: Initialize: set t = 0; choose p ∈ (0, 1), c > 0, c_α > 0, ϕ_α ∈ (0.5, 1]; set the maximum number of time slots T;
2: for all agents n ∈ N do
3:   Initialize Q^t_n(s^t, a^t) = 0 and s^0;
4: end for
5: while t < T do
6:   for all agents n ∈ N do
7:     Update the learning rate α_t according to (18);
8:     Select an action a^t_n at s^t according to (19);
9:     Take action a^t_n to select a channel for spectrum sensing and produce the local sensing decision d^{t+1}_{n,m};
10:    Broadcast the sensing information D^t_n = {n, a^t_n, d^{t+1}_{n,m}} on the CCC;
11:    Receive the sensing fusion decision d^{t+1}_m according to (8);
12:    Access the channel based on the sensing fusion decision, receive the reward r^{t+1}_n according to (3), and observe s^{t+1};
13:    Update Q^{t+1}_n(s^t, a^t) according to (15);
14:  end for
15:  t = t + 1 and s^t ← s^{t+1};
16: end while
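Putting the pieces together, one learning iteration of an independent learner in Algorithm 1 might look like the following sketch. The learning-rate schedule, the constant inside the bonus, and the class interface are illustrative assumptions; the environment (sensing, fusion, and reward feedback) is assumed to be supplied externally.

```python
import math

class ILQUCBHAgent:
    """Minimal sketch of one independent learner running Algorithm 1."""

    def __init__(self, actions, gamma=0.9, c=1.0, ln_term=10.0):
        self.q = {}        # Q-table: (state, action) -> value, default 0
        self.visits = {}   # N(s, a): visit counts driving the UCB-H bonus
        self.actions = actions
        self.gamma, self.c, self.ln_term = gamma, c, ln_term

    def select(self, state):
        # Greedy selection as in (19); exploration is driven by the optimistic
        # bonus folded into the Q-values during update().
        return max(self.actions, key=lambda a: self.q.get((state, a), 0.0))

    def update(self, s, a, r, s_next):
        n = self.visits[(s, a)] = self.visits.get((s, a), 0) + 1
        alpha = 1.0 / n ** 0.6                        # polynomial rate, phi in (0.5, 1]
        bonus = self.c * math.sqrt(self.ln_term / n)  # UCB-H bonus with H = 1
        best_next = max(self.q.get((s_next, b), 0.0) for b in self.actions)
        old = self.q.get((s, a), 0.0)
        self.q[(s, a)] = (1 - alpha) * old + alpha * (r + self.gamma * best_next + bonus)
```

In a full simulation, each CUAV would hold one such agent and the environment loop would perform the broadcast/fusion/access steps between select() and update().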

IL-DDQN-UCB-H Algorithm
The proposed IL-Q-UCB-H algorithm requires each CUAV to construct a Q-table of size |S| × |A_n|. With an increasing number of PU channels, the IL-Q-UCB-H algorithm thus faces the curse of dimensionality. To handle this problem, we adopt the framework of DDQN [35] for value-space approximation with deep neural networks, replacing the IL-Q-UCB-H algorithm with the IL-DDQN-UCB-H algorithm. Compared with the vanilla DQN algorithm, the core of the IL-DDQN-UCB-H algorithm is to decompose the maximization operation into a current neural network for action selection and a target neural network for action evaluation [35]. The main functional components [18] are illustrated in Figure 5 and described in detail as follows.

Output Layer: The output of the DDQN is a vector of size (M + 1), corresponding to the Q-value estimates of all optional actions given the current system state, i.e., Q^t_n = [Q^t_{n,0}, Q^t_{n,1}, ..., Q^t_{n,M}].

Experience Replay: In DDQN, the experience replay component stores the accumulated historical samples in the form of experience tuples (s^t, a^t_n, r^{t+1}_n, s^{t+1}), composed of the current state s^t, action a^t_n, reward r^{t+1}_n, and the next state s^{t+1}. During the learning process, the agent randomly samples a batch of B experience tuples from the experience replay to fit the deep network to the Q-values, aiming to eliminate the temporal correlation among historical samples.
Current Q-Network: The current Q-network (i.e., the deep neural network fitting the Q-table) maps the input state s^t to the Q-value Q^t_n(s^t, a; θ^t_n) of each action a, where θ^t_n denotes the parameters of the current Q-network. The experience tuples are mainly used to train the current Q-network, updating its parameters θ^t_n until convergence. After training, an action is selected based on the output Q-values.
Target Q-Network: The target Q-network has the same structure as the current Q-network, as well as the same initial parameters. Its output target Q-value Q^t_n(s^{t+1}, a; θ̄^t_n) is mainly used to supervise the iterative training of the current Q-network, where θ̄^t_n denotes the parameters of the target Q-network. In DDQN, θ̄^t_n is updated after every fixed number F of training rounds by directly assigning the value of θ^t_n to θ̄^t_n, which is known as the fixed Q-target technique.
Action Selection Strategy: To prevent the actions from falling into a local optimum while the deep neural network has not yet converged, the ε-greedy strategy is introduced during action selection (cf. (19)):

a^t_n = arg max_a Q^t_n(s^t, a; θ^t_n) with probability 1 − ε, and a random action in A_n with probability ε. (20)

Loss Function: The loss function used in training the current Q-network is defined as

L^t_n(θ^t_n) = (1/B) ∑_{i=1}^{B} ( y_{n,i} − Q^t_{n,i}(s^t, a^t; θ^t_n) )², (21)

where B is the batch size and y_{n,i} is the target Q-value. With UCB-H, the target Q-value is updated as

a^{t,max}_n = arg max_a Q^t_{n,i}(s^{t+1}, a; θ^t_n), (22)
y_{n,i} = r^{t+1}_{n,i} + γ Q^t_{n,i}(s^{t+1}, a^{t,max}_n; θ̄^t_n) + b_t. (23)
We note that the loss function is the mean square error between the output of the target Q-network and that of the current Q-network. After evaluating the loss, the gradient descent method is used to update θ^t_n iteratively, i.e.,

θ^{t+1}_n ← θ^t_n − ζ ∇_{θ^t_n} L^t_n(θ^t_n), (24)

where ζ is the learning rate and the gradient ∇_{θ^t_n} L^t_n(θ^t_n) is calculated following (25). For the considered CUAV network, the IL-DDQN-UCB-H algorithm is given in Algorithm 2 based on the aforementioned functional components.

Algorithm 2: IL-DDQN-UCB-H algorithm.
1: Initialize: set t = 0; choose p ∈ (0, 1), c > 0, ε ∈ (0, 1), batch size B, target update period F, replay capacity C, and learning rate ζ; set the maximum number of time slots T;
2: for all agents n ∈ N do
3:   Randomly initialize the current Q-network parameters θ^t_n, the target Q-network parameters θ̄^t_n, and s^0;
4: end for
5: while t < T do
6:   for all agents n ∈ N do
7:     Select an action a^t_n at s^t according to (20);
8:     Take action a^t_n to select a channel for spectrum sensing and produce the local sensing decision d^{t+1}_{n,m};
9:     Broadcast the sensing information D^t_n = {n, a^t_n, d^{t+1}_{n,m}} on the CCC;
10:    Receive the sensing fusion decision d^{t+1}_m according to (8);
11:    Access the channel based on the sensing fusion decision, receive the reward r^{t+1}_n according to (3), and observe s^{t+1};
12:    Store (s^t, a^t_n, r^{t+1}_n, s^{t+1}) into the experience replay;
13:    if t > C then
14:      Remove the oldest experience tuples from the experience replay;
15:    end if
16:    Randomly select a batch of B experience tuples (s^t, a^t_n, r^{t+1}_n, s^{t+1}) from the experience replay;
17:    Calculate the loss function L^t_n(θ^t_n) according to (21) and (25);
18:    Update the parameters θ^t_n according to (24);
19:    if t mod F = 0 then
20:      θ̄^t_n ← θ^t_n;
21:    end if
22:  end for
23:  t = t + 1 and s^t ← s^{t+1};
24: end while
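The double-Q target of (22)-(23) and the loss of (21) can be sketched with plain dictionaries standing in for the two Q-networks; the dict-based stand-in is an illustrative assumption in place of actual neural networks.

```python
def ddqn_target(r, s_next, q_current, q_target, actions, gamma, bonus):
    """Double-DQN target y with a UCB-H bonus: the *current* network selects
    the best next action, while the *target* network evaluates it, which
    reduces the overestimation bias of vanilla DQN."""
    a_max = max(actions, key=lambda a: q_current.get((s_next, a), 0.0))
    return r + gamma * q_target.get((s_next, a_max), 0.0) + bonus

def mse_loss(targets, predictions):
    """Mean-square-error loss over a batch, as in (21)."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)
```

Note how decoupling selection (q_current) from evaluation (q_target) can yield a smaller target than a single-network max would.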

Algorithm Complexity Analysis
• IL-Q-UCB-H algorithm: Since each CUAV executes the IL-Q-UCB-H algorithm independently, its information interaction overhead is mainly caused by broadcasting its own sensing decision information, and the amount of exchanged information increases linearly with the number of CUAVs. For algorithm execution, each CUAV needs to store a Q-table of size $N \cdot 2^M (M+1)^N$ according to the numbers of states and actions, which increases exponentially with the numbers of CUAVs and PU channels. The computational cost for each CUAV is dominated by the linear update of the Q-table and the search for the optimal action, both of which take constant time per iteration.
• IL-DDQN-UCB-H algorithm: The cost of information exchange is the same as that of the IL-Q-UCB-H algorithm. For algorithm execution, since a deep neural network is used to fit the Q-values, the storage cost mainly depends on the structure of the deep neural network. Since the IL-DDQN-UCB-H algorithm involves updating two Q-networks, the computational complexity at the training stage depends on the neural network structure (i.e., the number of network parameters).
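To make the storage comparison concrete, the sketch below tabulates the Q-table entry count $N \cdot 2^M (M+1)^N$ from the analysis above for a few network sizes (the helper function is our own, purely for illustration):

```python
def q_table_size(n_cuavs: int, m_channels: int) -> int:
    """Q-table entries per the complexity analysis: N * 2^M * (M+1)^N."""
    return n_cuavs * (2 ** m_channels) * (m_channels + 1) ** n_cuavs

# The count grows exponentially in both N and M, which is precisely
# why the tabular IL-Q-UCB-H algorithm does not scale and a DDQN
# function approximator is needed:
for n, m in [(4, 5), (6, 5), (10, 5)]:
    print(f"N={n}, M={m}: {q_table_size(n, m):,} entries")
```

Even with only M = 5 channels, moving from N = 4 to N = 10 CUAVs inflates the table by five orders of magnitude, whereas the DDQN's parameter count is fixed by the chosen architecture.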

Simulation and Analysis
In this section, the performance of the proposed algorithms is evaluated in the same CUAV network through numerical simulations. Specifically, the experiments are carried out with respect to several indicators, including the average reward, sensing accuracy, and channel utilization. The average reward is evaluated as the average instant reward of all the CUAVs, $\bar{r}^{t+1} = N^{-1}\sum_{n=1}^{N} r_n^{t+1}$. The sensing accuracy is evaluated as $acc = (N_{acc}^t/M) \times 100\%$, where $N_{acc}^t$ is the number of PU channels over which the sensing fusion produces a correct observation of the channel state. The channel utilization is evaluated as $uti = (N_{uti}^t/M) \times 100\%$, where $N_{uti}^t$ is the number of PU channels selected by the CUAVs in time slot t. The main parameters used throughout the simulations are given in Table 1. The binary Markov models for the PU activities are randomly initialized as $(\alpha_m, \beta_m), \forall m = 1, \dots, M$. The hyperparameters of all the RL algorithms are given in Table 2. The learning rate $\alpha^t$ is initialized as 0.9. To demonstrate that the proposed algorithms are able to handle network congestion, the simulations in Figures 6 and 7 evaluate the average reward and sensing accuracy for the two cases of N = 4, M = 5 and N = 6, M = 5.
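The three indicators can be computed from a single slot's simulation trace as follows (a minimal sketch; the argument names and data layout are our own assumptions):

```python
import numpy as np

def average_reward(rewards):
    """Average instant reward over the N CUAVs in one slot."""
    return float(np.mean(rewards))

def sensing_accuracy(fused_decisions, true_states):
    """Percentage of the M PU channels whose fused sensing decision is correct."""
    fused = np.asarray(fused_decisions)
    true = np.asarray(true_states)
    return 100.0 * float(np.mean(fused == true))

def channel_utilization(selected_channels, m_channels):
    """Percentage of the M PU channels selected by at least one CUAV this slot."""
    return 100.0 * len(set(selected_channels)) / m_channels
```

In practice each indicator would be averaged over many slots and simulation runs before being plotted, as in Figures 6-9.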
We observe from Figure 6a that all four algorithms are able to converge with sufficient training epochs. We note that the two IL-DDQN algorithms obtain a higher average reward than the two IL-Q algorithms. The reason lies in that DDQN not only reduces the correlation of the sampled data, but also mitigates the overestimation of Q-values, and thereby handles the excessive state-action space more efficiently. At the same time, the UCB-H-enabled algorithms achieve higher average rewards than their ε-greedy counterparts. This indicates that the UCB-H strategy is able to avoid both the performance degradation caused by the randomness of ε-greedy exploration and the local optimality caused by insufficient exploration when a myopic strategy is used to select actions. Similar to Figure 6a, Figure 6b shows that the performance of IL-Q-UCB-H and IL-DDQN-UCB-H is also better than that of the ε-greedy IL-Q and IL-DDQN. In addition, the ε-greedy-enabled algorithms fluctuate more severely in the early stage of training. The reason is that the Q-values under the ε-greedy strategy bear little difference at the early stage, which makes the agents select actions randomly. The UCB-H-enabled algorithms are relatively smooth in the early stage of training, thanks to the confidence bonus, which makes the Q-values discernible. In summary, Figure 6 shows that the proposed IL-DDQN-UCB-H algorithm achieves the best performance, in terms of the average reward and the sensing accuracy, when the number of CUAVs is less than that of the PU channels and congestion does not exist. Figure 7 shows the performance in terms of the average reward and sensing accuracy of the four algorithms with N = 6, M = 5. As can be seen from the figure, the UCB-H-enabled algorithms achieve better performance under congestion. In addition, comparing Figures 6b and 7b, we note that when there are more CUAVs, the sensing accuracy can be increased by 10% to 15%. This demonstrates the efficiency of the cooperative sensing mechanism.
A further illustration of the trade-off between the sensing accuracy and network congestion is provided by Figure 8 with N = 10. It can be seen that the performance of CUAV cooperation is significantly better than that of non-cooperation. In particular, the sensing accuracy of the IL-DDQN-UCB-H algorithm in the cooperative scenario can reach 97%. At the same time, the achieved average reward of cooperation is less than in the cases of N = 4 or N = 6, which indicates that the improved accuracy may not fully compensate for the degradation of transmission due to congestion. Considering the cases where some CUAVs do not select a channel for sensing and access in every time slot, another simulation is performed with channel utilization as a performance indicator. Figure 9 shows the channel utilization performance of the four algorithms. It can be seen that all four algorithms achieve a channel utilization of more than 42%; in particular, the IL-DDQN-UCB-H algorithm reaches a channel utilization of 49%. This shows that the proposed cooperative sensing and access algorithms can find idle PU channels in time and significantly improve the channel utilization. We note from Section 3.2 that there are four situations for CUAVs to sense and access PU channels, and the obtained reward depends on the channel bandwidth in all four situations. This is mainly reflected in the spectrum sensing cost and the available data transmission volume (utility). By the definition of the reward function, the spectrum sensing cost $-E_{s,n}^{t+1}$ is a negative reward whose magnitude grows with the channel bandwidth, while the utility $R_n^{t+1} > 0$ is positively correlated with the channel bandwidth. As the channel bandwidth increases, the absolute values of both the cost and the utility increase, so the net effect on the system reward depends on which of the two dominates. The following simulation analyzes the relationship between the average reward and the PU channel bandwidth.
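The bandwidth dependence of the reward can be sketched as a toy model. The linear scaling coefficients `c_sense` and `c_util` below are illustrative assumptions, not the paper's reward parameters; the point is only that both terms scale with bandwidth with opposite signs:

```python
def slot_reward(bandwidth_mhz, sensed, transmitted,
                c_sense=0.01, c_util=0.02):
    """Toy per-slot reward: the utility R grows with bandwidth when a
    transmission succeeds, while the sensing cost -E grows (in magnitude)
    with the sensed bandwidth. Coefficients are illustrative assumptions."""
    cost = -c_sense * bandwidth_mhz if sensed else 0.0
    utility = c_util * bandwidth_mhz if transmitted else 0.0
    return cost + utility

# When c_util > c_sense, a wider channel yields a larger net reward on a
# successful transmission, matching the trend reported for Figure 10; if a
# CUAV senses but cannot transmit, a wider channel only deepens the loss.
```

This makes the observed monotonic trend plausible: as long as successful transmissions are frequent enough, the utility term dominates and the average reward rises with bandwidth.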
The PU channel bandwidth is taken as $B_m \in \{50, 60, 70, 80, 90, 100\}$ MHz, and the result is shown in Figure 10. It can be found that as the channel bandwidth increases, the system average reward also increases. This indicates that the cost of sensing a larger bandwidth can be compensated by the utility gained from channel utilization. Namely, choosing PU channels with large bandwidths to construct the set of candidate sensing channels generally leads to better performance of the CUAV network.
The average reward of the four algorithms under different PU channel state transition probabilities is analyzed with $(\alpha_m, \beta_m)$ varying as $\alpha_m = \beta_m \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$. Figure 11 shows that when the state transition probabilities $(\alpha_m, \beta_m)$ increase from 0.1 to 0.5, the average reward decreases; comparatively, when they increase further from 0.5 to 0.9, the average reward increases. As shown in Figure 2, the randomness of the PU channel state is small when $(\alpha_m, \beta_m)$ is either very large or very small. In this situation, the CUAVs estimate the PU channel states more accurately based on their historical experience, and greater rewards can be obtained from the corresponding decisions. However, the PU channel state transition is highly random when $(\alpha_m, \beta_m)$ is around 0.5. In this situation, decisions based on the CUAVs' historical experience become less reliable, so both the reward and the sensing accuracy decrease.
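The PU activity model underlying this experiment is a two-state Markov chain per channel. A minimal simulator sketch follows; the state labeling (0 = idle, 1 = busy) and the reading of $\alpha_m$ as the idle-to-busy and $\beta_m$ as the busy-to-idle probability are our assumptions for illustration:

```python
import numpy as np

def simulate_pu_channel(alpha, beta, horizon, seed=0):
    """Two-state Markov chain for one PU channel.

    alpha: P(idle -> busy), beta: P(busy -> idle) per slot (labeling assumed).
    Returns a 0/1 state sequence of length `horizon`.
    """
    rng = np.random.default_rng(seed)
    states = np.empty(horizon, dtype=int)
    state = 0  # start idle
    for t in range(horizon):
        states[t] = state
        # Flip the state with the transition probability of the current state.
        if rng.random() < (alpha if state == 0 else beta):
            state = 1 - state
    return states

# When alpha = beta = 0.5, the next state is independent of the current one,
# which is the hardest case for history-based prediction (cf. Figure 11).
```

At $\alpha_m = \beta_m = 0.5$ the chain degenerates to an i.i.d. coin flip, so no amount of learned history improves prediction, which is consistent with the reward dip observed in Figure 11.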

Conclusions
In this paper, the problem of joint spectrum sensing and channel access for a CUAV communication network in a time-varying radio environment was studied. In a situation where the information about the primary network dynamics is not known in advance, a competition-cooperation protocol framework was proposed for CUAVs to implicitly cooperate over the channels to sense and access. An MG-based model was introduced to translate the centralized one-shot network optimization problem into a group of MARL problems that locally optimize the cumulative sensing-transmission reward of each CUAV.
To avoid excessive information exchange overhead for channel cooperation, an independent Q-learning algorithm and an independent DDQN algorithm were proposed to approximate the equilibrium strategies of the MG. The proposed learning algorithms were further improved with the UCB-H-based action-exploration strategy. Numerical simulation results showed that the proposed algorithms can effectively improve the system average reward, sensing accuracy, and channel utilization.