Learning Frameworks for Cooperative Spectrum Sensing and Energy-Efficient Data Protection in Cognitive Radio Networks

This paper studies learning frameworks for energy-efficient data communications in an energy-harvesting cognitive radio network in which secondary users (SUs) harvest energy from solar power while opportunistically accessing a licensed channel for data transmission. The SUs perform spectrum sensing individually, and send local decisions about the presence of the primary user (PU) on the channel to a fusion center (FC). We first design a new cooperative spectrum-sensing technique based on a convolutional neural network in which the FC uses historical sensing data to train the network for the classification problem. The system is assumed to operate in a time-slotted manner. At the beginning of each time slot, the FC uses the current local decisions as input for the trained network to decide whether the PU is active or not in that time slot. In addition, legitimate transmissions can be vulnerable to a hidden eavesdropper, which always passively listens to the communication. Therefore, we further propose a transfer learning actor-critic algorithm for an SU to decide its operation mode to increase the security level under the constraint of limited energy. In this approach, the SU directly interacts with the environment to learn its dynamics (i.e., the arrival of harvested energy); then, the SU can either stay idle to save energy or transmit to the FC secured data that are encrypted using a suitable private-key encryption method to maximize the long-term effective security level of the network. We finally present numerical simulation results under various configurations to evaluate our proposed schemes.


Introduction
Cognitive radio is one of the effective solutions to the problem of spectrum scarcity in wireless communications networks. Secondary users (SUs) with cognitive capability can utilize the spectrum bands licensed to primary users (PUs) for reliable and effective data transmission [1]. To achieve this, the SU modifies its parameters to adapt to the time-slotted operation of the PU on the channel of interest, and then senses the presence of the PU on that channel in every time slot. When the PU is sensed as inactive in a particular time slot, the SU can use the licensed channel during that time slot to transmit data. In this paper, the SU uses its limited-capacity battery, powered by a non-radio frequency (non-RF) energy harvester, for spectrum sensing, data encryption, and data transmission.

Motivation
Many studies concerning energy management problems for energy-harvesting nodes have been conducted, primarily to maximize a system's throughput [2][3][4][5][6]. For example, Park and Hong [2] proposed a joint design of a spectrum sensing policy and a detection threshold to maximize total expected throughput under energy constraints. Pappas et al. [4] examined the two-dimensional maximum stable throughput region for a simple cognitive system comprising two source-destination pairs. Razaque and Elleithy [6] designed an intelligent decision-making (IDM) model for wireless sensor networks, which allows the sensor node to obtain energy from the Sun, and thus preserves its battery energy in an outdoor environment. Liang et al. [7] studied the optimal sensing duration to maximize achievable throughput for a secondary network while sufficiently protecting primary users. There is research that analyzes optimal transmission power and density of secondary transmitters to maximize secondary network throughput under the constraints of a given outage-probability [8].
In addition, the work in [9] explored a multiple-input multiple-output (MIMO) technique for collaborative spectrum sensing for the distributed detection framework in cognitive-radio scenarios; this paper focuses on the reporting channel in a spectrum-sensing context and exploits the results from decision fusion to improve probability of detection.
In addition, cognitive radio networks (CRNs), like any modern communications system, should guarantee the privacy of the data traveling through the network [10]. However, due to their open and random access nature, wireless communications in CRNs are susceptible to security threats targeting the physical or media access control layers (e.g., passive eavesdropping or radio frequency (RF) jamming). For that reason, a remarkable number of contributions focus mainly on security technologies for CRNs [11]. In particular, Wen et al. [12] presented physical layer approaches to defend against security threats in CRNs. The authors first introduced a MIMO technique that guarantees a low probability of interception and enhances the confidentiality of the network; then, they proposed an identification scheme based on channel responses to defend against primary user-emulation attacks. Ciuonzo et al. [13] studied channel-aware decision fusion rules to classify the presence of a (either distributed or co-located) multi-antenna jamming device in wireless sensor networks.
Moreover, physical layer security in CRNs has been widely studied to secure wireless transmissions, especially in the presence of a hidden eavesdropper [14,15]. Besides this, keeping the data classified from prying eyes by using encryption techniques is one of the most feasible solutions to maintain security; but, in reality, it is not easy to implement conventional encryption techniques in CRNs, since the networks have constrained resources (e.g., limited energy or memory). As a consequence, encryption techniques such as symmetric and asymmetric key algorithms are not preferred for data protection in CRNs. Nevertheless, in modern CRNs, wireless energy harvesting technology can ensure the energy autonomy of the network by using a small rechargeable battery integrated with an energy harvester, thus providing the SUs with redundant energy to improve data security. Therefore, protecting data using encryption methods still attracts a lot of interest in the research community [16][17][18]. To illustrate, Sen [19] identified numerous security threats to cognitive wireless sensor networks and the defense mechanisms against these vulnerabilities by selecting the most appropriate cryptography algorithm for each class of attack.
Recent work proposes an energy-efficient data encryption scheme for an SU powered by an energy harvester to decide its operation mode (e.g., stay silent or transmit encrypted data) in the current time slot [20]. This scheme aims to find an optimal policy for the data encryption decision to maximize the long-term security level of the system. More specifically, the scheme uses a well-known symmetric encryption method called the Advanced Encryption Standard (AES) [21] for the same data block length with different key sizes (AES-128, AES-192, AES-256). The SU can encrypt data using an algorithm with longer key lengths to enhance security, and then transmits the encrypted data on an idle licensed channel. Furthermore, the SU determines the encryption key length based on the impact of spectrum sensing error, the energy causality constraint, and the effect of the current decision on future time slots. The problem is first formulated as the framework of a partially observable Markov decision process (POMDP), and is then solved by using value iteration-based dynamic programming to find the optimal policy. However, this solution is rarely directly useful in reality. It is akin to an exhaustive search, looking ahead at all possibilities, computing the probabilities of occurrence and their desirability in terms of expected rewards (i.e., security levels) [22]. The solution relies on the assumption that we know in advance the dynamics of the environment (i.e., an arrival of harvested energy), which is rarely true in practice. Consequently, this paper is going to investigate the problem from a different point of view in which the solution does not require prior information about the environment's dynamics.

Contributions
Our focus in this paper is to solve the problem of reaching a data encryption decision that aims to maximize the security of data transmissions in CRNs by using model-free reinforcement learning [22], namely, an actor-critic algorithm. The main advantage of the actor-critic solution over the POMDP-based approach is that it does not require complex computations or information about the arrival of harvested energy. In this work, we model the arrival of harvested energy and the primary traffic as a Poisson point process and a time-homogeneous discrete Markov process, respectively. At the beginning of a time slot, the SU does not have the exact information about the energy harvesting model and the spectrum occupancy status of the PU, except for the average value of harvested energy and the transition probabilities for the PU state. Thus, the SU needs to carry out spectrum sensing to identify whether the primary channel is busy or not; then, it either stays idle or transmits data on the free channel. Accordingly, to increase the chances for the SU to transmit data on the primary channel and to reduce the probability of collision with the primary user, we propose a new cooperative spectrum sensing technique using a convolutional neural network (CNN) and historical sensing data.
Moreover, the primary purpose of this paper is to find an optimal data encryption decision policy that fits into the framework of a Markov decision process (MDP). During this process, we employ an actor-critic sequential learning model so the SU can interact with the environment in a stochastic way to acquire information on the environment's dynamics. Based on this method, the SU can learn the energy harvesting model and the primary traffic variations from the learning practice. Afterwards, it can either stay idle or select an appropriate key length for data encryption (also known as an action in this paper), and then verify the effect of the decision based on the returned rewards. By repeating this kind of action over time, the SU can establish the policy to make determinations in the future. However, it would take time for the actor-critic learning procedure to converge to an optimal policy, especially with a large state space [23]. To deal with this issue, we employ the idea of transfer learning, which exploits the historical relevance of the harvested energy model and the primary user's activity in order to speed up the learning process of the conventional actor-critic algorithm [24]. In this paper, we call this method a transfer learning actor-critic (TLAC) algorithm. Compared with previous work, the main contributions of this paper are summarized as follows:
• We first introduce a new energy harvesting model, which is represented by a transformed Poisson distribution proven to give the nearest fit to the empirical measurements of a solar energy harvesting node for time-slotted operation [25].
• We also introduce a new CNN-based technique for cooperative spectrum sensing to enhance the performance of spectrum sensing by increasing the probability of detection while guaranteeing a low probability of false alarm.
• We then formulate the stochastic problem of the data encryption decision policy as the framework of a constrained MDP, and solve the problem by using the transfer learning actor-critic algorithm.
The rest of this paper is organized as follows. In Section 2, we introduce the system model of the proposed schemes. A new energy harvesting model based on transformed Poisson distribution is introduced in Section 3. Section 4 presents the new CNN-based cooperative spectrum sensing (CBCSS) technique. Section 5 focuses on the transfer learning actor-critic algorithm for data protection in CRNs. In Section 6, we evaluate the performance of the proposed schemes through numerical simulation results. Finally, we present a conclusion in Section 7. To make it clear, the most commonly used notations in this paper are listed in Table 1.

System Model
The system considered in this paper comprises a pair of licensed primary users, several secondary transmitters (denoted as SUs), a secondary receiver equipped with a fusion center, and an eavesdropper (E), as shown in Figure 1. From now on, we refer to the secondary receiver as the fusion center (FC) for simplicity. In this work, the SUs are assumed to always have data to transmit to the fusion center. Thus, they try to access the licensed channel of the PUs for data transmission by carrying out cooperative spectrum sensing.
The primary user's states (active [$A$] and not active [$\bar{A}$]) are assumed to follow a two-state discrete-time Markov process, in which the transition probabilities between the states are denoted $P_{i,j}$, $i, j \in \{A, \bar{A}\}$, as illustrated in Figure 2.
Figure 2. Two-state discrete-time Markov model for the primary user's states. $P_{i,j}$, $i, j \in \{A, \bar{A}\}$: the transition probabilities between the states.
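The two-state model above can be illustrated with a short simulation. This is only a sketch: the function names, the state labels `"A"` and `"I"`, and the choice to start in the active state are assumptions made for illustration and do not come from the paper.

```python
import random


def simulate_pu_states(p_stay_active, p_stay_inactive, n_slots, seed=0):
    """Simulate PU activity over n_slots; 'A' = active, 'I' = inactive.

    p_stay_active   plays the role of P_{A,A}
    p_stay_inactive plays the role of P_{A-bar,A-bar}
    The chain is started in the active state (an arbitrary assumption).
    """
    rng = random.Random(seed)
    state = "A"
    states = []
    for _ in range(n_slots):
        states.append(state)
        stay = p_stay_active if state == "A" else p_stay_inactive
        if rng.random() >= stay:
            state = "I" if state == "A" else "A"
    return states


def stationary_active_probability(p_stay_active, p_stay_inactive):
    """Long-run fraction of slots in which the PU is active.

    Undefined (division by zero) if both states are absorbing.
    """
    leave_active = 1.0 - p_stay_active      # P_{A,A-bar}
    leave_inactive = 1.0 - p_stay_inactive  # P_{A-bar,A}
    return leave_inactive / (leave_active + leave_inactive)
```

For example, `stationary_active_probability(0.9, 0.7)` gives the long-run share of busy slots that an SU would face under those transition probabilities.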
The performance of the sensing scheme can be evaluated by using the probability of correct detection $P_d$ and the probability of false alarm $P_f$. The former represents the probability of detecting the active state ($A$) of the PU accurately, whereas the latter indicates the probability that the PU is identified as active when it is actually inactive ($\bar{A}$). They are given by

$$P_d = \Pr(H = A \mid A)$$

and

$$P_f = \Pr(H = A \mid \bar{A}),$$

respectively, where $H$ denotes the state of the primary user as determined by spectrum sensing. Although the PU state transition probabilities are unknown in practical situations, the historical statistics of the primary channel can be used to estimate the state transition probabilities based on the Markov model [26]. Therefore, we assume that the SU has prior information about the PU state transition probabilities based on the historical sensing results, and that the global information of the network (e.g., channel state information, probabilities of detection and false alarm) is available to all nodes in the network. The system's operation proceeds as follows. The system is assumed to operate in a time-slotted manner. At the beginning of each time slot, the SUs perform spectrum sensing separately and send the sensing outcomes to the fusion center, where the data are fused together using a certain rule to decide the state of the primary user. The final sensing result is then broadcast to the SUs. If the channel is free, it is allocated to one of the SUs for data transmission. The SUs take turns using the channel, based on the arrival order of their transmission requests. Each SU can occupy the channel over many time slots until it finishes transmitting data. Meanwhile, the eavesdropper is listening to the communication quietly. Therefore, we investigate learning frameworks for cooperative spectrum sensing and energy-efficient data protection against the hidden eavesdropper for the communication between one SU and the fusion center.
We first present a simple but effective cooperative spectrum sensing method based on a CNN to improve the sensing performance. The CNN is constructed and trained to predict the PU states by using individual sensing data as inputs, which leads to specific target outputs. Hence, the fusion center can make global decisions about the PU state based on the outputs of the neural network. Relying on the final decision, if the channel is free, it is allocated to an SU (denoted as SU1) to transmit data. Furthermore, the SU is assumed to have a finite-capacity battery regularly recharged by a non-RF energy harvester. In addition to that, under energy constraint, the SU encrypts data using the AES algorithm with an appropriate key length to maximize the long-term security level of the system.
Regarding data protection techniques, there are two primary types of cryptography: symmetric (or private key) and asymmetric (or public key) algorithms. In general, using private-key cryptography for data encryption is not a time-consuming process, and thus expends less energy than public-key cryptography. For example, the experimental results from Kim et al. [27] showed that a public-key algorithm named the elliptic curve integrated encryption scheme (ECIES) consumes a thousand times more energy during the encryption process than the popular AES-128 private-key method. Even though a public-key algorithm can increase the security level by sacrificing a huge amount of energy, it is not a favorable choice for many wireless systems like CRNs. Consequently, in this paper, we focus on using the AES algorithm to secure the communications between SU1 and the FC. Specifically, the SU can use one of the three key sizes (128, 192, and 256 bits) to encrypt data using the AES algorithm.
In this paper, the security level is defined by the number of repetitions of the transformation rounds that convert the input data into encrypted data [21]. Therefore, the security level $S_{Nk}$ is dependent on the key length $Nk$ of the AES algorithm, as follows:

$$S_{Nk} = \begin{cases} 10, & Nk = 128 \\ 12, & Nk = 192 \\ 14, & Nk = 256 \end{cases}$$

Using the longer key lengths provides the SU with better data security but consumes more energy [28]. As a result, at the beginning of each time slot, the SU needs to decide its operation mode based on the sensing result and the remaining energy to maximize the long-term security level while efficiently using the limited energy. For example, it can stay silent to save energy for future use, or it can encrypt information using the AES algorithm with a proper key length and transmit the data to the FC. Therefore, in this paper, we additionally design an actor-critic learning framework for SU1 to find the optimal operation mode decision policy. More specifically, when the primary user is determined to be inactive and the remaining energy is sufficient for data transmission, the SU can decide to stay idle to save energy or to transmit data encrypted by the AES algorithm with a suitable key length by calculating the total expected reward in future time slots according to the proposed actor-critic learning algorithm.
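As a small illustration of the round-count definition of the security level, the mapping from AES key length to the number of transformation rounds can be written as a lookup. The round counts (10, 12, 14) are those of the AES standard; the function name and the convention that staying idle yields a level of 0 are assumptions made here for illustration.

```python
# Number of AES transformation rounds for each key length (in bits).
AES_ROUNDS = {128: 10, 192: 12, 256: 14}


def security_level(key_length_bits):
    """Return the round count used as the security level.

    Passing None models the idle mode; treating it as level 0 is an
    assumption of this sketch.
    """
    if key_length_bits is None:
        return 0
    if key_length_bits not in AES_ROUNDS:
        raise ValueError("AES supports only 128-, 192-, and 256-bit keys")
    return AES_ROUNDS[key_length_bits]
```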

Energy Harvesting Model
Recent advances in energy harvesting technologies allow small, low-cost devices such as wireless sensor nodes to operate based solely on wireless harvested energy that is stored in a finite-capacity battery. Hence, in designing network protocols, it is essential to obtain a reliable energy-harvesting model to guarantee energy autonomy in the network. In many studies, the arrival of harvested energy is assumed to be independent and identically distributed [29], to follow a deterministic Markov model [30], or to follow a normal Poisson point process [20], all of which are discrete-time models. In [31], the authors considered the problem of decentralized hypothesis testing in energy harvesting wireless sensor networks, where the arrival energy during a time interval is assumed to be drawn from a Bernoulli distribution. The extensive experimental results from Lee et al. [25] showed that the transformed Poisson distribution model produces the nearest fit for most of the empirical datasets.
In this paper, the number of energy packets that an SU can harvest during a particular time slot, $e_h$, is given as

$$e_h \in \{e_{h,1}, e_{h,2}, \ldots, e_{h,max}\},$$

where $0 < e_{h,1} < e_{h,2} < \cdots < e_{h,max} < E_{ca}$, and $E_{ca}$ is the maximum battery capacity of the SU. We assume that $e_h$ follows a Poisson distribution with mean $e_{h,avg}$. Furthermore, the fit with the Poisson distribution can be improved by using a transformation $x = e_h - e_{h,min}$, where $e_{h,min}$ is the minimum harvested energy. The probability mass function (PMF) of $e_h$ is then given by

$$\Pr(e_h) = \frac{x_{avg}^{\,x}\, e^{-x_{avg}}}{x!}, \quad x = e_h - e_{h,min},$$

where $x_{avg} = e_{h,avg} - e_{h,min}$ is the sample average of the new variable $x$. This new distribution is called the transformed Poisson distribution (TPD). This transformation of the original variable can improve the fitting to the empirical datasets, as proven in [25]. In practice, although it is not easy to measure the exact amount of harvested energy in a time-slot interval, we can always estimate the average, the minimum, and the maximum values of the harvested energy. Meanwhile, if the normal Poisson point process is used, the minimum harvested energy is assumed to be zero by default, which is rarely true in practical scenarios. Figure 3 shows the difference in the PMF between the normal Poisson distribution and the transformed Poisson distribution when the average harvested energy is $e_{h,avg} = 8$ packets, with different values of minimum harvested energy: $e_{h,min} \in \{1, 2, 3, 4\}$ packets. As can be seen from the figure, under the transformed Poisson model the SU harvests energy values located near the mean with a higher probability. As a consequence, we can also improve the learning rate of the actor-critic algorithm because the SU can focus on learning the variations of the energy values that are adjacent to the mean.
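The effect of the transformation can be checked numerically with a few lines of code. This is a minimal sketch that assumes the shifted variable $x = e_h - e_{h,min}$ is Poisson with mean $e_{h,avg} - e_{h,min}$ and that ignores any truncation at $e_{h,max}$; the function names are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Standard Poisson PMF; zero for negative counts."""
    if k < 0:
        return 0.0
    return math.exp(-mean) * mean ** k / math.factorial(k)


def transformed_poisson_pmf(e_h, e_h_avg, e_h_min):
    """PMF of e_h when x = e_h - e_h_min is Poisson with mean e_h_avg - e_h_min.

    Truncation at the maximum harvested energy is not modeled here.
    """
    return poisson_pmf(e_h - e_h_min, e_h_avg - e_h_min)


if __name__ == "__main__":
    # Compare the mass placed at the mean (8 packets) for several minima.
    print("normal Poisson:", poisson_pmf(8, 8))
    for e_min in (1, 2, 3, 4):
        print("e_h_min =", e_min, "->", transformed_poisson_pmf(8, 8, e_min))
```

With an average of 8 packets, the transformed model places more probability mass at the mean than the normal Poisson model does, which is consistent with the concentration around the mean described for Figure 3.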

Convolutional Neural Network-Based Cooperative Spectrum Sensing
In this paper, we exploit the strength of the convolutional neural network, a particular type of deep neural network, to design a new cooperative spectrum sensing solution for the FC to determine the state of the PU on the primary channel. The process of cooperative spectrum sensing is illustrated with the following steps:
1. The FC trains the CNN using historical sensing data represented by the local spectrum decisions provided by the SUs.
2. At the beginning of each time slot, all the SUs perform local spectrum sensing by using an energy detection method and report their sensing outcomes to the FC via a control channel.
3. The FC uses the new sensing data as input for the trained CNN to make a global decision about the PU state on the channel of interest, and then feeds back the final decision to the SUs.
Accordingly, the problem of neural network-based cooperative spectrum sensing is divided into two important parts: local spectrum sensing by the SUs and global decision making by the FC using the trained CNN.

Local Spectrum Sensing
The considered CRN is assumed to be composed of $K$ SUs. Each of them performs spectrum sensing independently using an energy detection algorithm, and then sends the outcome to the FC. Moreover, we assume that the status of the PU remains unchanged during the time slot. The hypothesis test for local spectrum sensing at SU $i$ can be formulated as follows [32]:

$$x_i(t) = \begin{cases} w_i(t), & \bar{A} \\ h_i s(t) + w_i(t), & A \end{cases}$$

where $x_i(t)$ is the signal received by the $i$th SU in time slot $t$, $h_i$ denotes the channel gain of the link between the PU and the $i$th SU, $s(t)$ denotes the PU signal, and $w_i(t)$ is zero-mean, unit-variance additive white Gaussian noise (AWGN). Regarding energy detection, the observed energy at the $i$th SU is expressed as follows [33]:

$$x_{E_i} = \sum_{j=1}^{N_i} \left| x_i(j) \right|^2$$

where $x_i(j)$ is the $j$th sample of the received PU signal at the $i$th SU, and $N_i$ is the number of sensing samples during each sensing period. For simplicity, we assume that the number of sensing samples is the same for all the SUs. When $N_i$ is sufficiently large (e.g., $N_i \geq 200$), $x_{E_i}$ can be approximated by a Gaussian random variable under the two hypotheses ($A$ and $\bar{A}$) with means $\mu_A$, $\mu_{\bar{A}}$ and variances $\sigma_A^2$, $\sigma_{\bar{A}}^2$, given as follows [34]:

$$\mu_{\bar{A}} = N_i, \quad \sigma_{\bar{A}}^2 = 2N_i, \qquad \mu_A = N_i(\gamma_i + 1), \quad \sigma_A^2 = 2N_i(2\gamma_i + 1)$$

where $\gamma_i$ is the average gain of the sensed channel in terms of signal-to-noise ratio (SNR). In this paper, we assume that $\gamma_i$ follows a Gaussian distribution with mean $\mu_i$ and variance $\sigma_i^2$, i.e., $\gamma_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$. For a single-SU spectrum-sensing scheme, the local decision, $D_i$, is given by

$$D_i = \begin{cases} 1, & x_{E_i} \geq \lambda_i \\ 0, & \text{otherwise} \end{cases}$$

where 1 and 0 are single-bit data that represent states $A$ and $\bar{A}$ of the primary user, respectively, and $\lambda_i$ is a predefined decision threshold.
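The local sensing step can be sketched as follows. The energy statistic and the threshold test follow the description above; the sample generator is a simplified, real-valued stand-in in which the PU signal is assumed to be zero-mean Gaussian with power equal to the SNR, which is an assumption of this sketch rather than the paper's signal model.

```python
import math
import random


def observed_energy(samples):
    """Sum of squared magnitudes of the received samples."""
    return sum(abs(x) ** 2 for x in samples)


def local_decision(samples, threshold):
    """Return 1 (PU sensed active) if the energy reaches the threshold, else 0."""
    return 1 if observed_energy(samples) >= threshold else 0


def simulate_received_samples(n_samples, pu_active, snr, rng):
    """Hypothetical sample generator: unit-variance Gaussian noise, plus a
    Gaussian PU signal of power `snr` when the PU is active."""
    samples = []
    for _ in range(n_samples):
        noise = rng.gauss(0.0, 1.0)
        signal = rng.gauss(0.0, math.sqrt(snr)) if pu_active else 0.0
        samples.append(signal + noise)
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    n = 200
    # Illustrative threshold: noise-only mean plus three standard deviations.
    lam = n + 3 * math.sqrt(2 * n)
    print(local_decision(simulate_received_samples(n, True, 1.0, rng), lam))
    print(local_decision(simulate_received_samples(n, False, 0.0, rng), lam))
```

The threshold used in the demo is one illustrative choice based on the noise-only Gaussian approximation; in practice it would be set from a target false-alarm probability.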

Convolutional Neural Network-Based Cooperative Spectrum Sensing
In deep-learning research, CNNs are widely used in fields such as image classification, speech recognition, and handwriting recognition, where they make use of spatial characteristics of the data. In this section, we present the process of creating and training a CNN for PU state prediction.

Network Configuration
The first step in designing a CNN is to define the network layers that specify the structure of the CNN, as depicted in Figure 4. This network consists of the following layers [35].
• The input layer stores the input sensing data in the form of a grayscale image with size $1 \times K \times 1$, where $K$ is the number of secondary users.
• The convolutional (CONV2D) layer contains $K$ neurons (filters) that connect to the local subregions of the input image to learn its features by scanning through it. In this work, each region has a size of $1 \times 2$.
• The rectified linear unit (ReLU) layer uses the ReLU function to introduce nonlinearity to the CNN by performing a threshold operation on each input element, simply defined as $f(x) = \max(0, x)$.
• The fully connected layer combines all the local information from the original image (e.g., the results of feature extraction) determined in the previous layers to classify the status of the PU, which is active ($A$) or inactive ($\bar{A}$). Consequently, the size of the output data is equal to the number of states of the primary user.
• The softmax and output layers follow right after the fully connected layer for the classification problem. The softmax layer uses an output unit activation function, also known as a normalized exponential function, to create a categorical probability distribution for the two input elements ($A$ and $\bar{A}$), as follows: where $P(H_i)$ is the class prior probability; $H_i \in \{A, \bar{A}\}$ is an element class; and $q(H_i)$ is the output value from the previous layer of the sample given class $H_i$. Thereafter, the output (or classification) layer takes the values from the softmax function and assigns each input to one of the two classes.
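The layer stack listed above can be traced with a small forward pass in plain Python. This is a sketch under several assumptions: no padding is applied in the $1 \times 2$ convolution, the softmax is the standard normalized exponential without class priors, and all weights in the usage example are made-up values rather than trained parameters.

```python
import math


def conv1x2(inputs, filters):
    """Slide each (w0, w1, bias) filter over adjacent pairs of the input row."""
    return [
        [w0 * inputs[i] + w1 * inputs[i + 1] + b for i in range(len(inputs) - 1)]
        for (w0, w1, b) in filters
    ]


def relu(feature_maps):
    """Element-wise max(0, x)."""
    return [[max(0.0, v) for v in fmap] for fmap in feature_maps]


def fully_connected(feature_maps, weights, biases):
    """Flatten the feature maps and compute one logit per PU state."""
    flat = [v for fmap in feature_maps for v in fmap]
    return [sum(w * v for w, v in zip(row, flat)) + b for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable normalized exponential."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(decisions, filters, fc_weights, fc_biases):
    """Return (label, probabilities) for a vector of K local decisions."""
    logits = fully_connected(relu(conv1x2(decisions, filters)), fc_weights, fc_biases)
    probs = softmax(logits)
    return ("A" if probs[0] >= probs[1] else "A_bar"), probs


if __name__ == "__main__":
    # Hypothetical hand-picked weights for K = 4 SUs and a single filter.
    filters = [(1.0, 1.0, 0.0)]
    fc_weights = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
    fc_biases = [-2.0, 2.0]
    print(classify([1, 1, 0, 1], filters, fc_weights, fc_biases))
```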
It should be noted that the original image with size 1 × K × 1 is a vector containing the local decisions from K SUs; thus, a one-dimensional (1D) convolution layer can be used in the CNN to solve the problem of PU state classification instead of using a two-dimensional (2D) convolution layer. However, using a 2D CNN is more useful than 1D CNN in image classification. Furthermore, it would be easier to further develop the current approach to deal with three-dimensional data without making many changes in the current architecture of the CNN. For this reason, the size of the input image is generalized as 1 × K × 1; thus, if the number of secondary users cooperating in spectrum sensing is large enough, the image size could be changed to M × N × 1, where M × N = 1 × K. Moreover, we can enhance the sensing accuracy by placing other information (e.g., the channel SNRs, the distances between the SUs and the PU) in the second and the third layers of the image, and performing some modifications (e.g., permutation, repetition) to the original data structure to provide the CNN with more features.

Network Training and PU Status Prediction
The local sensing decisions from the SUs, $D_i$, $\forall i \in \{1, 2, \ldots, K\}$, are used as input for the CNN. Because a CNN is mostly used for image classification, the local decisions from $K$ secondary users are rearranged to form a grayscale image with the size of $1 \times K \times 1$, where the last number denotes the number of color channels in the image. A stochastic gradient descent (SGD) optimizer with an adaptive learning rate is used in training the network. With this algorithm, the initial learning rate of 0.01 is later reduced based on a pre-defined schedule. For instance, it can be multiplied by a factor of 0.1 after every 10 epochs. The training set is a collection of local decisions from $K$ SUs under different environmental conditions (i.e., a wide range in the sensed channel gain).
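The learning-rate schedule mentioned above (initial rate 0.01, multiplied by 0.1 after every 10 epochs) can be written as a step-decay function. The function and parameter names are hypothetical; the default values are the ones quoted in the text.

```python
def step_decay_learning_rate(epoch, initial_rate=0.01, drop_factor=0.1, drop_period=10):
    """Learning rate for a zero-based epoch index under a step-decay schedule."""
    return initial_rate * drop_factor ** (epoch // drop_period)
```

Epochs 0 to 9 use the initial rate, epochs 10 to 19 use one tenth of it, and so on.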
The FC uses the historical sensing data to train the CNN for the classification problem in advance. Thereafter, the FC determines the presence of the primary user on the licensed channel in every time slot by using the new individual sensing outcomes received at the beginning of each time slot as input for the trained network.

Transfer Learning Actor-Critic Framework for Data Protection in Cognitive Radio Networks
In this section, we present an optimal operation mode-decision policy based on an actor-critic learning framework so the SU can maximize the system's security level and energy utilization. Subsequently, the SU can encrypt data using the AES algorithm with a suitable key length before transmitting the secured information to the FC; or it could stay inactive in a time slot to save energy. In particular, if the SU does not have enough energy to transmit data, or if the sensing result indicates the PU is in state A, the SU will stay silent during the remainder of the time slot. Otherwise, it can decide to transmit the data encrypted by the AES algorithm with one of the three key lengths, Nk ∈ {128, 192, 256}, considering the effect of the decision on the long-term security level of the system.

Markov Decision Process
The problem of the operation mode decision in this paper is first formulated as a Markov decision process that is defined as a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P} : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is a transition probability function, and $\mathcal{R}$ is the reward space. The state of the SU at the $t$th time slot is defined as $s(t) = (e_r(t), \rho(t))$, where $e_r(t)$ is the remaining energy of the SU, and $\rho(t)$ is the probability (also called belief) that the PU is inactive in that time slot. The action space is defined as $\mathcal{A} = \{ID, TR_{Nk}\}$. At the $t$th time slot, the SU can choose to stay idle (action $a(t) = ID$) or it can choose to transmit data encrypted by the AES algorithm with key length $Nk \in \{128, 192, 256\}$ (action $a(t) = TR_{Nk}$). This action provides an immediate reward, and causes the SU to transit into a new state, $s'$, with the following transition probability: We denote as $R(s(t), a(t))$ the reward (i.e., security level) achieved at the $t$th time slot when the SU is in state $s(t)$ and taking action $a(t) \in \mathcal{A}$, which is defined as The value function is defined as the total discounted reward from the $t$th time slot, when the SU's state is $s(t) = s$, which is given as follows [22]:

$$V(s) = \mathbb{E}\left[ \sum_{k=t}^{\infty} \eta^{k-t} R(s(k), a(k)) \,\middle|\, s(t) = s \right]$$

where $\eta$ is the discount factor. The objective of this paper is to find an optimal action for the SU in the $t$th time slot to maximize the value function as The solution to the problem of the operation mode decision can be found by solving this equation.

Transfer Learning Actor-Critic Algorithm
Previous work proposed a POMDP-based approach to solving the problem in Equation (14) on the assumption that the SU already has information about the harvested energy model. In this paper, we introduce a new solution to the problem based on the actor-critic learning framework, which does not require the SU to already know the dynamics of energy harvesting. Instead, the SU determines those dynamics by directly interacting with the environment. A regular actor-critic model comprises three main elements: an actor (related to a learning policy), a critic (related to a learning value function), and the environment, as shown in Figure 5. At time step $t$, the actor selects action $a(t)$ based on the current state, $s(t)$, and the policy, $\pi(s(t))$, which is defined by using a Gibbs softmax function as follows [22]:

$$\pi(s, a) = \Pr(a(t) = a \mid s(t) = s) = \frac{e^{\theta(s, a)}}{\sum_{a' \in \mathcal{A}} e^{\theta(s, a')}}$$

where $\theta(s, a)$ is the tendency to select action $a$ when the SU is in state $s$. The final objective of this paper is now to find an optimal mode decision policy for the SU at the $t$th time slot, and the problem in Equation (14) can be rewritten as where $P(s' \mid s, a)$ is the transition probability from state $s$ to state $s'$ after taking action $a$.
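Action selection with a Gibbs softmax policy can be sketched as below. The representation of the tendencies as a dictionary keyed by action name, and the action labels themselves, are assumptions made for illustration.

```python
import math
import random


def gibbs_policy(tendencies):
    """Map {action: theta(s, a)} for one state to {action: probability}."""
    m = max(tendencies.values())  # subtract the maximum for numerical stability
    exps = {a: math.exp(th - m) for a, th in tendencies.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def select_action(tendencies, rng):
    """Sample one action according to the Gibbs softmax probabilities."""
    probs = gibbs_policy(tendencies)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


if __name__ == "__main__":
    # Hypothetical tendencies for one state; labels are illustrative only.
    theta_s = {"ID": 0.0, "TR128": 0.5, "TR192": 0.2, "TR256": -0.3}
    print(gibbs_policy(theta_s))
    print(select_action(theta_s, random.Random(0)))
```

Actions with a larger tendency are selected more often, while every action keeps a nonzero probability, which preserves exploration during learning.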
After that, the SU transits into a new state, s(t + 1), and receives an instant reward R(s(t), a(t)). The critic evaluates the new state and computes a temporal difference (TD) error as δ(t) = R(s(t), a(t)) + ηV(s(t + 1)) − V(s(t)).
The critic uses the TD error to improve the estimate of the value function as well as the policy. The value function is updated as

$$V(s(t)) \leftarrow V(s(t)) + \alpha_c \delta(t)$$

where $\alpha_c$ is a positive parameter of the critic. The action resulting in a positive TD error is favorable, since the state value is better than expected. Hence, the probability of selecting action $a(t) = a$ in state $s(t) = s$ in the future should increase, and vice versa. Following that, the tendency to select this action is updated as

$$\theta(s(t), a(t)) \leftarrow \theta(s(t), a(t)) + \alpha_a \delta(t) \qquad (19)$$

where $\alpha_a$ is a positive parameter of the actor. Furthermore, we exploit the idea of transfer learning to increase the convergence speed to the optimal solution by making use of historical learning data, as depicted in Figure 6. The obtained information is transferred to the new actor-critic algorithm for real-time training in which the initialized value function is the same as the transferred function while the overall policy, $\theta_o(s(t), a(t))$, for choosing an action at time step $t$ is given as where $\theta_l(s(t), a(t))$ is the transferred policy; $\theta_n(s(t), a(t))$ is the new policy, which will be updated in every time slot by using Equation (19); and $\varepsilon(t)$ is the transfer rate, which will be reduced after each time step to gradually remove the effect of the transferred policy on the new one. The training process of the actor-critic learning framework for the SU to decide its operation mode is illustrated as follows. At the beginning of the $t$th time slot, the SU chooses an action according to policy $\pi$ considering the sensing result and the remaining energy in its battery. The SU can decide to stay idle, $a(t) = ID$, to save energy, or it can transmit the encrypted data, $a(t) = TR_{Nk}$, to the FC. The immediate reward, $R(s(t), a(t))$, and the next state, $s(t + 1)$, are updated at the end of the time slot based on the following cases.
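One learning step and the transfer mechanism can be sketched as follows. Several parts are assumptions rather than the paper's stated forms: the critic and actor use the plain additive tabular updates driven by the TD error, the overall tendency is taken to be a convex combination of the transferred and new tendencies weighted by the transfer rate, the transfer rate decays geometrically, and states are assumed to be discretized so they can serve as dictionary keys.

```python
def td_error(reward, value_next, value_current, discount):
    """delta(t) = R + eta * V(s(t+1)) - V(s(t))."""
    return reward + discount * value_next - value_current


def actor_critic_update(V, theta, s, a, reward, s_next, discount, alpha_c, alpha_a):
    """Apply one tabular critic and actor update in place; return the TD error.

    V maps state -> value; theta maps (state, action) -> tendency.
    Unseen entries are treated as 0.0 (an assumption of this sketch).
    """
    delta = td_error(reward, V.get(s_next, 0.0), V.get(s, 0.0), discount)
    V[s] = V.get(s, 0.0) + alpha_c * delta
    theta[(s, a)] = theta.get((s, a), 0.0) + alpha_a * delta
    return delta


def overall_tendency(theta_transferred, theta_new, transfer_rate):
    """Assumed blend: the transferred tendency is weighted by the transfer rate."""
    return (1.0 - transfer_rate) * theta_new + transfer_rate * theta_transferred


def decayed_transfer_rate(initial_rate, decay, t):
    """Assumed geometric decay of the transfer rate with the time step t."""
    return initial_rate * decay ** t
```

Under these assumptions the transferred policy dominates early decisions and its influence fades as the transfer rate approaches zero, leaving the newly learned tendencies in control.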

Case 1
The sensing result shows that the PU is active (state A) on the primary channel, so the SU has to stay idle. Thus, no reward is achieved: R(s(t), ID) = 0. The belief that the PU is inactive in the current time slot is updated using Bayes' rule [36]. The belief for the next time slot and the remaining energy that the SU can use for the next time slot are then given by Equations (22) and (23), respectively, where E_s is the total energy consumption for spectrum sensing, including the energy consumed by local spectrum sensing and by sending the sensing outcomes to the fusion center.
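The Bayes-rule belief update used in these cases can be sketched as follows. This is a minimal illustration, not the paper's own equation: it assumes that the prior belief is the probability that the PU is inactive, and that the sensing outcome is characterized only by the detection probability P_d and the false-alarm probability P_f. The function name `posterior_inactive` is hypothetical.

```python
def posterior_inactive(p_inactive, sensed_active, p_d, p_f):
    """Posterior probability that the PU is inactive, given the sensing outcome.

    Assumed observation model (an illustration, not taken from the paper):
      P(sensed active | PU active)   = p_d
      P(sensed active | PU inactive) = p_f
    """
    if sensed_active:
        like_inactive = p_f          # a false alarm occurred
        like_active = p_d            # the PU was correctly detected
    else:
        like_inactive = 1.0 - p_f    # the free channel was correctly identified
        like_active = 1.0 - p_d      # a missed detection occurred
    numerator = p_inactive * like_inactive
    denominator = numerator + (1.0 - p_inactive) * like_active
    return numerator / denominator


# Example with P_d = 0.9, P_f = 0.1 and an uninformative prior of 0.5.
b_sensed_active = posterior_inactive(0.5, True, 0.9, 0.1)    # PU sensed as active
b_sensed_inactive = posterior_inactive(0.5, False, 0.9, 0.1) # PU sensed as inactive
```

Under these assumptions, a sensing result of "active" lowers the belief that the channel is free, whereas a result of "inactive" raises it.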

Case 2
The sensing result indicates that the PU is absent from the primary channel. There are two possible occurrences:
(1) The SU decides to stay idle to save energy for the next time slot.
(2) The SU transmits encrypted information to the fusion center.
In the first occurrence, there is no reward: R(s(t), ID) = 0. The probability that the PU is truly inactive in the current time slot is updated using Bayes' rule, and the belief and the remaining energy for the next time slot are then calculated by using Equations (22) and (23), respectively.
Regarding the second occurrence, the SU uses e_tr(t) packets of energy to transmit the encrypted data to the FC. The remaining energy of the SU for the next time slot then also accounts for E_Nk, the energy consumption of the encryption process, which depends on the key length Nk of the encryption algorithm. If the SU does not receive an acknowledgement (ACK) from the FC, which means the transmission was unsuccessful, there is no reward: R(s(t), TR_Nk | no ACK) = 0. In that case, the probability that the channel will be free of the PU signal in the next time slot is updated accordingly. On the other hand, if the SU receives an ACK from the FC, indicating that the transmission was successful, the SU receives the corresponding reward, and the belief that the PU will be absent from the channel in the next time slot is updated accordingly.
Thereafter, the value function and the new policy are updated based on the received reward and the new state. This process repeats until it converges to the optimal solution that maximizes the long-term reward of the system, which means that value function V(s) and policy π(s) will finally converge to V*(s) and π*(s) as k → ∞ [37].
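One step of the training loop described above can be sketched as follows. This is a minimal tabular illustration under stated assumptions, not the authors' implementation: the additive critic and actor updates, the softmax (Gibbs) action selection, and the linear blending of the transferred and new preferences by the transfer rate are all assumed forms, and every function name and numeric value is hypothetical. Only the TD error follows the definition given in the text.

```python
import math


def softmax_policy(prefs):
    """Map action preferences to selection probabilities (assumed Gibbs form)."""
    top = max(prefs.values())
    exps = {a: math.exp(v - top) for a, v in prefs.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def overall_preferences(theta_n, theta_l, eps):
    """Blend new and transferred preferences; linear mixing is an assumption."""
    return {a: (1.0 - eps) * theta_n[a] + eps * theta_l[a] for a in theta_n}


def actor_critic_step(V, theta_n, s, a, reward, s_next, eta, alpha_c, alpha_a):
    """Apply one TD update to the critic and the actor; return the TD error."""
    delta = reward + eta * V[s_next] - V[s]   # TD error, as defined in the text
    V[s] += alpha_c * delta                   # critic update (assumed form)
    theta_n[s][a] += alpha_a * delta          # actor update (assumed form)
    return delta


# Hypothetical two-state example with actions ID (idle) and TR (transmit).
V = {0: 0.0, 1: 0.0}
theta_new = {0: {"ID": 0.0, "TR": 0.0}, 1: {"ID": 0.0, "TR": 0.0}}
theta_transferred = {0: {"ID": 0.0, "TR": 1.0}, 1: {"ID": 0.0, "TR": 0.0}}

delta = actor_critic_step(V, theta_new, s=0, a="TR", reward=1.0, s_next=1,
                          eta=0.9, alpha_c=0.1, alpha_a=0.1)
policy = softmax_policy(
    overall_preferences(theta_new[0], theta_transferred[0], eps=0.5)
)
```

In this sketch, a positive TD error raises both the value of the visited state and the preference for the chosen action, and decreasing `eps` toward zero over time would leave only the newly learned preferences in effect.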

Results and Discussion
In this section, we present simulation results to demonstrate the efficiency of the proposed CBCSS and TLAC algorithms for energy-efficient data protection in CRNs. We first evaluate the performance of the proposed CBCSS technique compared with other sensing techniques, namely a half-voting rule [38], an energy detection (ED) method performed by a single secondary user, and the Chair-Varshney rule [39]. We then investigate the potential of the TLAC solution for establishing an operation mode decision policy by comparing it with the POMDP-based solution from earlier work [20], the myopic scheme, and the fixed encryption methods, which will be described in detail later.

Convolutional Neural Network-Based Cooperative Spectrum Sensing
The proposed CBCSS for the two-state classification problem was implemented using the Neural Network Toolbox in MATLAB (R2017a, The MathWorks Inc., Natick, MA, USA, 2017). Unless stated otherwise, the simulation parameters were as listed in Table 2. The average SNR of the sensed channel, γ_i, that was used for training the CNN ranged from −16 dB to −6 dB, and the number of training samples for each SNR was 2000. In this work, we consider three performance metrics: probability of detection P_d, probability of false alarm P_f, and sensing error P_e. The total number of time slots for testing the performance of the proposed CBCSS was 10,000, and the process was performed 10 times to get average values for P_d, P_f, and P_e. The first two metrics are calculated by using Equations (1) and (2), whereas the sensing error is defined as the sum of the probability of false alarm (P_f) and the probability of missed detection (1 − P_d), as follows: P_e = P_f + (1 − P_d).
In Figures 7 and 8, we compare the performance of the proposed CBCSS with those of the conventional half-voting fusion rule for cooperative spectrum sensing, the local sensing result based on the energy detection method from one of the K = 10 secondary users, and the Chair-Varshney fusion rule.
Regarding the half-voting rule, the fusion center makes a global decision based on the local sensing data. Specifically, the FC decides that the PU is active (A) if at least half of the K SUs report the decision D_i = 1. With respect to the energy detection method, the local decision from SU1 was obtained for comparison. Under the Chair-Varshney rule, the detection statistic is expressed as the weighted sum of the local decisions, and the weights are functions of the probabilities of detection and false alarm [40]. The Chair-Varshney rule is the optimal decision fusion rule, but it requires prior knowledge of the PU's activities and of the local sensing performance of the secondary users.
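The two baseline fusion rules and the sensing-error metric described above can be sketched as follows. This is a minimal illustration, assuming binary local decisions D_i ∈ {0, 1} and the standard log-likelihood-ratio form of the Chair-Varshney rule; the function names and the per-SU probabilities in the example are hypothetical and are not taken from the simulations.

```python
import math


def half_voting(decisions):
    """Return 1 (PU active) if at least half of the SUs report 1, else 0."""
    return 1 if 2 * sum(decisions) >= len(decisions) else 0


def chair_varshney(decisions, p_d, p_f, prior_active=0.5):
    """Fuse local decisions with a weighted sum (log-likelihood ratio).

    Each SU contributes a weight that depends on its own detection and
    false-alarm probabilities, which must be known in advance.
    """
    llr = math.log(prior_active / (1.0 - prior_active))
    for d, pd_i, pf_i in zip(decisions, p_d, p_f):
        if d == 1:
            llr += math.log(pd_i / pf_i)
        else:
            llr += math.log((1.0 - pd_i) / (1.0 - pf_i))
    return 1 if llr > 0 else 0


def sensing_error(p_d, p_f):
    """Sensing error: false-alarm probability plus missed-detection probability."""
    return p_f + (1.0 - p_d)


# Hypothetical example: two reliable SUs report 1, three unreliable SUs report 0.
decisions = [1, 0, 0, 1, 0]
p_d = [0.95, 0.60, 0.60, 0.95, 0.60]
p_f = [0.05, 0.30, 0.30, 0.05, 0.30]

vote = half_voting(decisions)
weighted = chair_varshney(decisions, p_d, p_f)
p_e = sensing_error(0.9, 0.1)
```

In this example the half-voting rule declares the channel free because only two of five SUs report activity, while the weighted rule declares the PU active because the two reporting SUs are far more reliable. This illustrates why the Chair-Varshney rule needs the local sensing performance as prior knowledge.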
From the figures, we can confirm that the proposed CBCSS outperforms the other conventional methods, except for the Chair-Varshney optimal fusion rule, in terms of detection probability and sensing error. We can also see that as the average SNR increases, the probability of detection increases while the probability of false alarm and the sensing error decrease. This is because the effect of AWGN on the local decisions, and thus on the training accuracy, diminishes as the SNR increases. Accordingly, larger sensed channel SNRs at the SUs provide better detection performance and fewer false alarms. Although the probability of false alarm with the proposed scheme is a little higher than with the half-voting and the Chair-Varshney rules, the total sensing error of the proposed CBCSS is close to that of the Chair-Varshney optimal fusion rule and is lower than those of the other conventional methods.
In Figures 9 and 10, we examine the effect of the number of secondary users, K, on the performance of the proposed CBCSS. To this end, we evaluated the outputs of three distinct CNNs that were trained with K ∈ {5, 10, 20}, while keeping the number of sensing samples unchanged at N_i = 300. For each value of K, the performance metrics were calculated again for comparison purposes. As can be seen from the figures, increasing the number of SUs that cooperate in spectrum sensing can significantly improve the performance of the CBCSS. This is caused by the increase in spatial diversity when using more SUs, which can help the CNN to extract more information from the sensing data. Moreover, in Figure 10, there is almost no sensing error at SNR = −10 dB with K = 20 sensing nodes.
Finally, we measured the performance of the CBCSS by varying the number of sensing samples, N_i, as shown in Figures 11 and 12, for K = 10 secondary users. The training process is the same as in the previous experiment, but now the number of sensing samples is varied instead of K: N_i ∈ {200, 300, 400}. The results indicate that the effectiveness of the new cooperative spectrum sensing system can be improved by increasing the number of sensing samples that are collected by the SUs for individual spectrum sensing using the energy detection method. Again, a larger value of γ_i provides better detection accuracy as well as a lower sensing error.
Since this paper focuses on developing a new CNN-based cooperative spectrum sensing technique, we use a simple energy detection method for local spectrum sensing for the sake of simplicity. However, the sensing efficiency can be further enhanced by improving the local spectrum sensing. That is, if the local sensing outcomes provide more accurate sensing data, the CNN can learn the features of the data with higher accuracy, which will produce more precise classification results.
From the simulation results, we can observe that larger values of the channel SNR yield better local sensing results, which leads to better overall sensing performance of the system.
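The trend for local energy detection, namely that more sensing samples and a higher SNR both improve detection, can be illustrated with a short sketch. It uses a common Gaussian (central limit) approximation for an energy detector with real-valued Gaussian signal and noise and unit noise power. These modeling choices, the target false-alarm probability, and the function names are assumptions for illustration, and the sketch does not reproduce Equations (1) and (2) or the CNN-based fusion.

```python
import math
from statistics import NormalDist


def q_func(x):
    """Tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def detection_probability(snr_db, n_samples, p_f_target=0.1):
    """Approximate P_d of a local energy detector at a fixed false-alarm rate.

    Assumed model: the test statistic is the average energy of n_samples
    real-valued samples, approximately Gaussian with mean 1 (noise only) or
    1 + gamma (signal present), and unit noise power.
    """
    gamma = 10.0 ** (snr_db / 10.0)              # linear SNR
    spread = math.sqrt(2.0 / n_samples)          # std of the statistic under noise only
    threshold = 1.0 + NormalDist().inv_cdf(1.0 - p_f_target) * spread
    return q_func((threshold - (1.0 + gamma)) / ((1.0 + gamma) * spread))


# Vary the number of samples at -10 dB, then the SNR at 300 samples.
pd_by_n = {n: detection_probability(-10.0, n) for n in (200, 300, 400)}
pd_by_snr = {snr: detection_probability(snr, 300) for snr in (-16.0, -10.0, -6.0)}
```

Under this approximation, the detection probability grows with both the number of samples and the SNR when the false-alarm target is held fixed.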

Transfer Learning Actor-Critic Solution for Energy-Efficient Data Protection Scheme
This section verifies the performance of the proposed actor-critic framework in comparison to the myopic solution and the POMDP-based solution from earlier work. With regard to the myopic scheme, if the PU is found absent from the channel, the SU will sacrifice its energy to maximize data security [41]. Previous work proposed an optimal decision policy for a CRN to maximize the security level based on a POMDP framework, which requires complex numerical computations as well as prior information about the arrival of harvested energy [20]. The complexity of the problem depends on the required amount of computation space (e.g., the sizes of the input states, actions, transition probabilities, and observations). In a POMDP, an agent controls the process by choosing the action at each time step based on the observation history to maximize the expected long-term reward. The optimal policy for the agent to choose an action can be found by solving the Bellman equation using value iteration-based dynamic programming. Each iteration requires O(|A||S|²) operations to compute all the probabilities of transitioning from one state, s ∈ S, to another state, s′ ∈ S, after taking an action, a ∈ A. The actor-critic method, on the other hand, does not require the agent to compute all the occurrence probabilities to find the solution in advance. Instead, the agent learns the optimal policy from actual experienced transitions by directly interacting with the stochastic environment.
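The per-iteration cost of value iteration can be seen in a small sketch. For simplicity, it uses a fully observable toy model rather than the belief-state POMDP of [20]; the two-state energy model, its transition probabilities, and its rewards are hypothetical values chosen only to show the nested loops over states, actions, and successor states.

```python
def value_iteration(P, R, eta, tol=1e-8, max_iter=10_000):
    """Solve the Bellman equation by value iteration.

    P[a][s][s2] is the probability of moving from state s to s2 under action a,
    and R[s][a] is the immediate reward. Each sweep costs O(|A||S|^2).
    """
    n_actions, n_states = len(P), len(P[0])

    def q_value(V, s, a):
        # Inner sum over successor states: |S| operations.
        return R[s][a] + eta * sum(P[a][s][s2] * V[s2] for s2 in range(n_states))

    V = [0.0] * n_states
    for _ in range(max_iter):
        # Loops over states and actions: |S| * |A| evaluations of q_value.
        V_new = [max(q_value(V, s, a) for a in range(n_actions))
                 for s in range(n_states)]
        diff = max(abs(x - y) for x, y in zip(V, V_new))
        V = V_new
        if diff < tol:
            break
    policy = [max(range(n_actions), key=lambda a: q_value(V, s, a))
              for s in range(n_states)]
    return V, policy


# Hypothetical model: state 0 = low battery, state 1 = high battery;
# action 0 = stay idle, action 1 = transmit.
P = [
    [[0.2, 0.8], [0.0, 1.0]],   # idle: harvesting may recharge the battery
    [[1.0, 0.0], [1.0, 0.0]],   # transmit: the battery ends up low
]
R = [[0.0, 0.0], [0.0, 1.0]]    # only transmitting with a high battery pays off

V_opt, policy_opt = value_iteration(P, R, eta=0.9)
```

The sketch needs the full transition model `P` before any decision is made, which is exactly the prior information that the actor-critic approach avoids.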
The basic simulation parameters for this evaluation are shown in Table 3. For analytic convenience, we fixed the SNR value of the sensed channel at −10 dB, and thus the probabilities of detection and false alarm are approximated as P_d ≈ 0.9 and P_f ≈ 0.1, respectively (based on the results of the proposed CBCSS method). We assume that the SU transmits a 16-byte data packet in every time slot, which is equal to the encryption block length in AES cryptography, and that the transmission channel gain is unchanged during a time slot. It is worth noting that one packet of energy is equivalent to 25 µJ, and each simulation was run over a thousand time slots for several iterations to obtain average values.
We first examined the convergence speed of the TLAC algorithm during the training process by calculating the average reward received after every 1000 time slots. The average harvested energy was fixed at e_h,avg = 4 packets. As can be seen from Figure 13, the average reward rises sharply during the first 10,000 time slots of the training process; after that, the reward keeps increasing, but more slowly. Finally, the algorithm converges to an optimal policy for the SU to determine its operation mode after 20,000 time slots, when the reward is about 0.91.
In Figure 14, we show the efficiency of the proposed scheme compared with the POMDP-based and myopic schemes under the effect of harvested energy. As can be seen from the figure, a larger harvested energy yields a higher reward, indicating that data are better protected. The reason is that, if the SU can harvest more energy, it has a greater chance to operate in transmission mode and can transmit more data to the FC. Furthermore, the result of the proposed TLAC algorithm is better than that of the myopic scheme and a little lower than that of the POMDP method.
To explain this, in the myopic scheme, the SU decides on its operation mode without considering the effect of this action on the future reward. In particular, if the primary channel is found free via spectrum sensing, the SU spends too much energy on data encryption to enhance data protection, which forces the SU to stay in idle mode over many subsequent time slots because of limited remaining energy. Regarding the POMDP-based solution, the SU is assumed to already have information about the harvested energy model, which is hardly ever true in practice. Under that assumption, value iteration-based programming can compute all possible states and the corresponding occurrence probabilities to find the optimal policy beforehand. Consequently, the SU can predict the next state of the primary user and the upcoming harvested energy, and can then distribute the energy effectively over future time slots. Meanwhile, the TLAC algorithm requires the SU to interact frequently with the environment to learn the dynamics of the arrival of harvested energy, which can result in a locally optimal policy [22]. In particular, the SU makes decisions based on a predefined policy (i.e., local or immediate consideration), which is updated at the end of every time slot to improve future behavior without needing any information about the environment's dynamics.
Figure 15 illustrates the channel utilization by the SU for its data communications, computed as the ratio of the total number of successful data transmissions to the total number of time slots in which the primary user is sensed as inactive. From the figure, we can see that the primary channel is utilized more effectively when the harvested energy e_h,avg increases. In addition, the proposed TLAC algorithm utilizes the free channel better than the myopic scheme, by about 2% of the total successful transmissions. We can also see that the POMDP technique provides an optimal solution to the problem of the operation mode decision.
However, without requiring heavy mathematical computation or prior information about the environment's dynamics, the TLAC solution can provide the SU with a locally optimal policy that comes close to the result of the POMDP scheme, especially when the amount of harvested energy is large. This is because the SU can encrypt data with a longer key size (e.g., Nk = 256) by utilizing extra energy in the battery when the average harvested energy increases. Therefore, the policy would be updated to favor the action that gives a better reward in the future.
Figure 16 depicts the total number of data packets transmitted from the SU to the fusion center based on harvested energy under three different data protection schemes. As can be seen from the figure, the SU can transmit more packets of data when using the TLAC algorithm, compared to the myopic scheme. The reason is that the proposed learning scheme can allocate the harvested energy more efficiently than the myopic one. Consequently, the SU can operate in transmission mode in more time slots, and thus can transmit more encrypted data packets to the FC. Meanwhile, using the myopic scheme can cause the SU to be inactive due to a lack of energy for future use. For that reason, the proposed TLAC framework can guarantee the security level and can effectively utilize the limited energy resource.
More specifically, in Figure 17, we present the detailed number of successfully transmitted data packets that are encrypted using the AES algorithm with different key lengths. We can see from the figure that the total number of data packets delivered under the TLAC algorithm is 10% higher than when using the myopic scheme. In particular, more packets are encrypted with longer key sizes (i.e., AES-192 or AES-256) as the arrival of harvested energy increases.
Finally, we examine the performance of the proposed TLAC by comparing it with that of AES algorithms with fixed key length.
In the fixed key length schemes, the SU uses only one key size to encrypt data at each time step, even when it has enough energy for a longer key. In Figure 18, the rewards under the proposed TLAC and the other schemes grow steadily as the harvested energy increases. While the proposed solution provides the highest average reward, the fixed encryption method with the shortest cipher key (AES-128) shows the lowest security level. The reason is that the proposed method can efficiently allocate the energy to every time slot by estimating the expected reward in future time slots. Meanwhile, the AES-128 algorithm always uses the lowest amount of energy for data encryption, and thus does not utilize the redundant energy in the SU's battery to enhance the security level as the arrival rate of the harvested energy increases.
On the other hand, AES-256 uses the maximum energy to encrypt data whenever sufficient energy is available, in order to increase data security. However, this action reduces the chance for the SU to operate in transmission mode, which leads to fewer successful transmissions, as shown in Figure 19. From the figure, we can see that the proposed TLAC provides the SU with the highest channel utilization, since the SU can transmit more data packets in comparison to the other methods. This is because the fixed encryption techniques do not utilize the energy effectively for future use. Among the fixed encryption methods, AES-128, with its lower energy consumption, allows the SU to transmit more data packets than AES-192 and AES-256, but provides the SU with the lowest reward. Consequently, we can verify that the proposed TLAC algorithm can ensure effective data communications between the SU and the fusion center in terms of security level and channel utilization.

Conclusions
In this paper, we propose learning-based techniques for cooperative spectrum sensing and energy-efficient data protection in CRNs, by which the SUs can effectively utilize the primary channel under the constraint of limited harvested energy. We first design a new CNN-based cooperative spectrum sensing method. In this approach, the CNN is trained by using historical sensing data collected from secondary users under various environmental conditions. At the beginning of each time slot, the SUs individually perform spectrum sensing using an energy detection method, and then send the local decisions to a fusion center, which makes a global decision about the state of the primary user. The proposed CBCSS can increase the detection probability and remarkably reduce the sensing error, which can also contribute to effective communications between the SUs and the fusion center. Regarding the proposed TLAC scheme, the SU determines its operation mode based on the remaining energy and the sensing result, considering the effect of this decision on future time slots. By calculating the expected accumulated reward from the current time slot, the SU can decide to stay in idle mode to save energy for future use, or operate in transmission mode and transmit data that are encrypted using the AES algorithm with an appropriate key length. We then present simulation results to evaluate the performance of the proposed solutions, which show that the proposed schemes can guarantee energy-efficient data communications in cognitive radio networks.
However, some aspects of the proposed learning frameworks remain of interest for future research. First, it is possible to improve the performance of the current CNN-based cooperative spectrum sensing technique by modifying the structure of the input data. In addition to the local sensing results from K secondary users, other information, such as the sensed channel SNRs, the sensing duration, and even the distances between the SUs and the PU, can be used as input data for the CNN to predict the state of the primary user. Those information sources could provide the CNN with more useful features for reducing the negative effect of the noise on the local sensing outcomes. Secondly, with respect to the TLAC framework, it is essential to choose good learning-rate parameters that balance the convergence speed against the computational resources. Furthermore, the current actor-critic framework could be extended to problems with large or continuous domains instead of discrete state and action spaces.