Sensors
  • Article
  • Open Access

10 January 2023

Deep Reinforcement Learning for Physical Layer Security Enhancement in Energy Harvesting Based Cognitive Radio Networks

1 College of Electrical Engineering and Automation, Fuzhou University, Fuzhou 350108, China
2 College of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore
* Author to whom correspondence should be addressed.
This article belongs to the Section Communications

Abstract

The paper studies secrecy communication threatened by a single eavesdropper in Energy Harvesting (EH)-based cognitive radio networks, where both the Secure User (SU) and the jammer harvest, store, and utilize RF energy from the Primary Transmitter (PT). Our main goal is to optimize the time slots for energy harvesting and wireless communication of both the secure user and the jammer so as to maximize the long-term performance of secrecy communication. A multi-agent Deep Reinforcement Learning (DRL) method is proposed to solve this resource allocation problem. Specifically, each sub-channel of the Secure Transmitter (ST) to Secure Receiver (SR) link, as well as of the jammer to eavesdropper link, is regarded as an agent that is responsible for exploring the optimal power allocation strategy, while a time allocation network is established to obtain the optimal EH time allocation strategy. Every agent dynamically interacts with the wireless communication environment. Simulation results demonstrate that the proposed DRL-based resource allocation method outperforms existing schemes in terms of secrecy rate, convergence speed, and the average number of transition steps.

1. Introduction

Cognitive Radio (CR) is regarded as a potential solution to the spectrum resource scarcity caused by the extensive use of wireless technology, the growing demand for high-speed data transmission, and traditional static spectrum allocation strategies [1]. In Cognitive Radio Networks (CRNs), cognitive users are able to utilize licensed spectrum resources in underlay, overlay, or interweave modes. In the underlay mode, cognitive users are allowed to access the licensed spectrum occupied by the Primary Users (PUs) only when the interference temperature at the PUs is lower than a desired level [2].
However, battery-limited devices bring great inconvenience to CRNs [3]: for devices embedded inside human bodies or wireless sensors operating in extreme environments, replacing or recharging batteries is often infeasible. Fortunately, the Energy Harvesting (EH) technique has emerged as a promising solution to this issue. EH refers to harvesting energy from the environment (e.g., from thermal, wind, solar, and wireless Radio Frequency (RF) sources) and converting it into electric power for self-sustaining circuits [4]. Compared with natural energy sources, RF energy is capable of providing continuous, stable, and clean power to CRN terminals. Therefore, using RF energy to power cognitive wireless networks is a significant technology for raising spectrum utilization and energy efficiency in CRNs.
Despite the aforementioned advantages, CRN systems are often subjected to illegal wiretapping because of the broadcast nature of wireless channels [5]. In recent years, owing to the rapid growth of computational capacity, traditional cryptographic encryption techniques have become easier for illegitimate users to break. Therefore, the Physical Layer Security (PLS) approach has become an alternative technology for secure transmission [6]. The PLS technique aims to guarantee secrecy performance; for example, the secrecy rate is the rate at which confidential signals can be transmitted from a transmitter to its intended receiver [7]. In information security theory, the secrecy capacity refers to the maximum achievable secrecy rate. Once the secrecy capacity drops to zero or below, the communication between the transmitter and the receiver is at risk, and eavesdroppers are able to wiretap the confidential signals transmitted by the transmitter [8]. PLS improves communication security by diminishing the wiretapping capacity of eavesdroppers [7]. The broadcast characteristics of wireless channels also provide the opportunity to introduce interference into transmissions, reducing the wiretapping ability of an eavesdropper while enhancing the ability of legitimate users to communicate securely [9]. To this end, Artificial Noise (AN) and Cooperative Jamming (CJ) have emerged as promising approaches for enhancing secrecy performance. The former mixes an AN signal, acting as a jamming signal, into the confidential information signal to interfere with eavesdroppers, whereas the latter directly sends jamming signals from a cooperative jammer to degrade the wiretapping channel and weaken the eavesdropper's ability to decode the wiretapped information [10]. If the legal receiver supports full-duplex communication, it is technically feasible for the receiver itself to transmit jamming signals to raise the system performance [11]; furthermore, CJ becomes more effective as the eavesdropper gets closer to the legal receiver [12]. Beamforming and precoding are also possible secure transmission methods; however, their extremely high computational complexity makes them difficult to apply in practical wireless communication systems. In the existing literature, CJ is one of the most important ways to achieve secure PLS transmission. In a CJ secure transmission scheme, the jammer can design the jamming-signal beamforming vector using only the statistical Channel State Information (CSI) of the illegal channels, which is more suitable for actual wireless communication scenarios. Considering the above points, in this paper we apply the CJ method to our proposed network. Research on physical layer security is usually divided into two cases: one in which the CSI of the eavesdropper is known, and one in which the channel state information is imperfect. In most practical cases, the accurate location and CSI of the eavesdropper are unknown to the network. Our work considers the second case, which is a common assumption in the field of physical layer security.

3. Motivation and Contributions

DRL methods show great advantages in dealing with various resource allocation problems in secrecy communication scenarios. However, to the best of our knowledge, few papers have studied secrecy communication in EH-CRNs by using DRL methods for the joint scheduling of EH time slots and transmission power. Motivated by the prior works, we focus our study in this paper on PLS enhancement in EH-CRN systems in the presence of a potential eavesdropper by combining the CR, EH, and PLS techniques, for the following main reasons. Firstly, the CR technique allows SUs to utilize the licensed spectrum of the PU. Secondly, the EH technique provides a sustainable energy supply for the energy-constrained nodes, i.e., the Secure Transmitter (ST) and the jammer. More importantly, the throughput of the proposed network can be significantly improved through the EH technique, and the jammer also becomes more aggressive in defending against eavesdropping. Thirdly, the PLS technique is used to achieve secure communication for the SU to transmit secrecy information. Although the authors in [20] consider the combination of CR, EH, and PLS techniques, the traditional reinforcement learning algorithm in [20] can only deal with a discrete state space, which is too idealistic to reflect the real situation; it is therefore necessary to consider using a DRL method to study the more reasonable and practical continuous state space.
Moreover, Refs. [21,22,23] do not consider the EH technique for their systems. Furthermore, the proposed DRL-based resource allocation method for PLS enhancement fundamentally differs from these existing papers [20,21,22,23] in the following aspects: a multi-agent DRL framework for the proposed EH-CRN is modeled; the classical DRL algorithm is combined with the Long Short-Term Memory (LSTM) network to acquire further performance improvements; and the proposed algorithm exhibits favorable stability and a fast convergence speed.
The main contributions of this paper are listed in the following.
  • We consider an EH-CRN, where the communication security of a legitimate user is under threat, and a cooperative jammer is deployed to improve the system’s secrecy performance. The main goal is to enhance the physical layer security by achieving the optimal resource allocation for our proposed network.
  • To tackle the joint EH time-slot and power allocation problem, we propose a multi-agent DRL-based resource allocation framework for the considered network. Specifically, we model two types of agents, and each agent interacts with the dynamic environment through state, action, and reward mechanisms.
  • To improve the performance of the classical DRL algorithm, we propose a new DRL network architecture in which the LSTM architecture is incorporated into the Dueling Double Deep Q-Network (D3QN) algorithm to overcome the negative effects of the dynamic characteristics of the network caused by time-varying channels and random noise. Moreover, the construction of the loss function in the proposed method differs from the previously mentioned algorithms, and it therefore effectively avoids overestimation of the state-action value and makes the training process more stable.
  • The presented experimental results show that the proposed scheme efficiently improves the long-term achievable secrecy rate with a small training-episode overhead.

4. Paper Organization

The remainder of the paper is organized as follows. Section 5 presents an EH-based CRN system model and a related optimization problem. Section 6 proposes a multi-agent DRL framework to obtain a solution to this joint EH time and power allocation problem. Section 7 presents simulation results to evaluate the system performance via our proposed method as compared to benchmark schemes. Finally, Section 8 concludes this paper.

5. System Model and Problem Formulation

5.1. System Model

As shown in Figure 1, we examine secrecy communication in a CRN consisting of a PU (a PT and a Primary Receiver (PR)), an SU (an ST and a Secure Receiver (SR)), a cooperative jammer that transmits jamming signals, and a potential eavesdropper who attempts to eavesdrop on the secrecy data transmitted by the ST. The SU is entitled to utilize the licensed spectrum of the PU in underlay mode. The ST and the jammer are each equipped with an energy harvester and a battery. The energy harvester collects RF energy signals from the PT and stores this energy in the battery. We adopt a block-based quasi-static model; that is, the wireless CSI remains unchanged over each transmission block but may vary from one block to another [7]. The information link refers to the PT-PR and ST-SR channels, while the wiretapping link refers to the ST-Eavesdropper channel. The EH link refers to the PT-ST and PT-Jammer channels. The interfering link refers to the PT-SR and ST-PR channels. The jamming link refers to the Jammer-Eavesdropper channel. Each node is equipped with a single antenna.
Figure 1. A CRN structure.
To degrade the eavesdropping capability, the SU and the jammer are assumed to have perfect knowledge of the CSIs of the wiretapping link and the jamming link over each transmission block [6,22,24]. It is assumed that the jamming signal from the jammer can be eliminated by the SR but cannot be removed at the eavesdropper [25,26]. This can be realized as follows: a large number of stochastic Gaussian-distributed sequences (jamming signals) are pre-stored at the jammer, and their indices act as keys. The jammer stochastically picks a sequence (jamming signal) and transmits its key to the SR. The key can be transmitted secretly by exploiting channel independence and reciprocity. As the stochastic sequence is known only to the SR, the eavesdropper is unable to extract it. We further suppose that the ST is given an indicator signal that signifies whether the Quality of Service (QoS) received by the PR is satisfied [27].
A transmission link is made up of multiple subcarriers, and we let $\mathcal{M} \triangleq \{1, 2, \ldots, M\}$ denote the set of these subcarriers. We denote $\mathbf{G}_{\mathrm{PP}} \triangleq \{g_{\mathrm{PP}}^m \mid m \in \mathcal{M}\}$, $\mathbf{G}_{\mathrm{SS}} \triangleq \{g_{\mathrm{SS}}^m \mid m \in \mathcal{M}\}$, $\mathbf{I}_{\mathrm{SP}} \triangleq \{i_{\mathrm{SP}}^m \mid m \in \mathcal{M}\}$, $\mathbf{I}_{\mathrm{PS}} \triangleq \{i_{\mathrm{PS}}^m \mid m \in \mathcal{M}\}$, $\mathbf{W}_{\mathrm{SE}} \triangleq \{w_{\mathrm{SE}}^m \mid m \in \mathcal{M}\}$, $\mathbf{H}_{\mathrm{PS}} \triangleq \{h_{\mathrm{PS}}^m \mid m \in \mathcal{M}\}$, $\mathbf{H}_{\mathrm{PJ}} \triangleq \{h_{\mathrm{PJ}}^m \mid m \in \mathcal{M}\}$, and $\mathbf{J}_{\mathrm{JE}} \triangleq \{j_{\mathrm{JE}}^m \mid m \in \mathcal{M}\}$ as the sets of channel gain coefficients of the PT-PR, ST-SR, ST-PR, PT-SR, ST-Eavesdropper, PT-ST, PT-Jammer, and Jammer-Eavesdropper links, respectively. The fading subchannels in each link are independent and identically distributed Rayleigh random variables with zero mean and unit variance. Over the m-th subcarrier, let $p_{\mathrm{PT}}^m$, $p_{\mathrm{ST}}^m$, and $p_{\mathrm{J}}^m$ denote the RF power of the PT and the transmission powers of the ST and the jammer, respectively. Denote the RF energy signal, the secrecy information signal from the ST, and the jamming signal from the jammer by $S_{\mathrm{PT}}^m$, $S_{\mathrm{ST}}^m$, and $S_{\mathrm{J}}^m$, respectively, which are independent circularly symmetric complex Gaussian random variables with zero mean and variances $E(|S_{\mathrm{PT}}^m|^2) = p_{\mathrm{PT}}^m$, $E(|S_{\mathrm{ST}}^m|^2) = p_{\mathrm{ST}}^m$, and $E(|S_{\mathrm{J}}^m|^2) = p_{\mathrm{J}}^m$, where $E(\cdot) = \int (\cdot) f(x)\,dx$ denotes statistical expectation.
As shown in Figure 2, we consider a two-phase transmission scheme for both the ST and the jammer by dividing each transmission block into two time slots. In the first time slot $T_1$, namely the EH phase, the PT broadcasts wireless RF energy signals to its receiver; meanwhile, the ST and the jammer harvest and store RF energy in their respective batteries. In the second time slot $T_2$, namely the data transmission phase, the PU instantaneously updates its transmission power based on a power control strategy that is unknown to the SU [27] and transmits public broadcast signals to the PR. Because the public broadcast signal is meaningless to eavesdroppers, this paper assumes that the signal received by the eavesdropper includes only the secrecy signal of the SU. The ST transmits secrecy information to the SR while the eavesdropper attempts to wiretap it. To ensure secure communication, the jammer instantaneously sends jamming signals to disrupt the wiretapping. The length of each transmission block is $T$ and comprises an EH duration and a data transmission duration. We denote by $\alpha_1^t$ and $\beta_1^t$ the portions of the two time slots of the ST over a transmission block, and by $\alpha_2^t$ and $\beta_2^t$ those of the jammer. Therefore, at the ST and the jammer, we have
$$\alpha_1^t + \beta_1^t = 1, \quad 0 \le \alpha_1^t \le 1, \quad 0 \le \beta_1^t \le 1$$
and
$$\alpha_2^t + \beta_2^t = 1, \quad 0 \le \alpha_2^t \le 1, \quad 0 \le \beta_2^t \le 1.$$
Figure 2. A two-phase EH transmission scheme.
The RF powers received by the ST and the jammer over the m-th subcarrier are given by
$$p_{\mathrm{ST,received}}^{m} = \alpha_1^t \eta_1 p_{\mathrm{PT}}^m |h_{\mathrm{PS}}^m|^2, \quad \alpha_1^t < \frac{T_1}{T}$$
and
$$p_{\mathrm{J,received}}^{m} = \alpha_2^t \eta_2 p_{\mathrm{PT}}^m |h_{\mathrm{PJ}}^m|^2, \quad \alpha_2^t < \frac{T_1}{T},$$
respectively. Here, $\eta_1$ and $\eta_2$ represent the EH efficiencies at the ST and the jammer, respectively. According to the traditional non-linear EH model, the harvested energy of the ST and the jammer is, respectively, expressed as
$$E_{\mathrm{NL},l}^m = \frac{\Gamma_l^m - A_l \Psi_l}{1 - \Psi_l},$$
$$\Psi_l = \frac{1}{1 + \exp(a_l b_l)}$$
and
$$\Gamma_l^m = \frac{A_l}{1 + \exp\left(-a_l (p_{l,\mathrm{received}}^m - b_l)\right)},$$
where $\Gamma_l^m$, $l \in \{\mathrm{ST}, \mathrm{J}\}$, is a logistic function of the received RF power $p_{l,\mathrm{received}}^m$. The parameters $a_l$ and $b_l$ are determined by the specification of the specific EH circuit, and $A_l$ represents the maximum power harvested by the energy receiver when the EH process reaches saturation [17,28]. Since an ideal linear EH model is unable to reflect practical EH behavior, this paper adopts a non-linear EH model for the proposed network.
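For concreteness, the non-linear EH model above can be evaluated numerically as sketched below. The circuit parameters match those reported later in the simulation section ($a_l = 150.0$, $b_l = 0.014$, $A_l = 1.5$ W), while the EH time coefficient, efficiency, PT power, and channel gain in the usage example are illustrative assumptions.

```python
import numpy as np

def harvested_power(alpha, eta, p_pt, h_gain, a_l=150.0, b_l=0.014, A_l=1.5):
    """Non-linear EH model: received RF power -> harvested power E_NL (a sketch)."""
    p_received = alpha * eta * p_pt * np.abs(h_gain) ** 2        # received RF power on one subcarrier
    psi = 1.0 / (1.0 + np.exp(a_l * b_l))                        # offset term of the logistic model
    gamma = A_l / (1.0 + np.exp(-a_l * (p_received - b_l)))      # logistic function of the received power
    return (gamma - A_l * psi) / (1.0 - psi)                     # normalized non-linear harvested power

# Illustrative (assumed) inputs: 30% EH time, 80% efficiency, 2 W PT power, unit channel gain
print(harvested_power(alpha=0.3, eta=0.8, p_pt=2.0, h_gain=1.0))  # saturates near A_l = 1.5 W
```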
Over the m-th subcarrier, the received signals by the PR, the SR, and the eavesdropper are, respectively, denoted by
$$y_{\mathrm{PR}}^m = g_{\mathrm{PP}}^m S_{\mathrm{PT}}^m + i_{\mathrm{SP}}^m S_{\mathrm{ST}}^m + n_{\mathrm{PR}}^m,$$
$$y_{\mathrm{SR}}^m = g_{\mathrm{SS}}^m S_{\mathrm{ST}}^m + i_{\mathrm{PS}}^m S_{\mathrm{PT}}^m + n_{\mathrm{SR}}^m$$
and
$$y_{\mathrm{E}}^m = w_{\mathrm{SE}}^m S_{\mathrm{ST}}^m + j_{\mathrm{JE}}^m S_{\mathrm{J}}^m + n_{\mathrm{E}}^m,$$
where $n_{\mathrm{PR}}^m$, $n_{\mathrm{SR}}^m$, and $n_{\mathrm{E}}^m$ denote the Gaussian noise signals received by the PR, the SR, and the eavesdropper, with zero mean and variances $E(|n_{\mathrm{PR}}^m|^2) = E(|n_{\mathrm{SR}}^m|^2) = E(|n_{\mathrm{E}}^m|^2) = 1$, respectively.
At the ST and the jammer, we have maximum transmission power constraints
$$0 \le \sum_{m \in \mathcal{M}} p_{\mathrm{ST}}^m \le p_{\mathrm{ST,max}}$$
and
$$0 \le \sum_{m \in \mathcal{M}} p_{\mathrm{J}}^m \le p_{\mathrm{J,max}},$$
where $p_{\mathrm{ST,max}}$ and $p_{\mathrm{J,max}}$ denote the maximum tolerable transmission powers at the ST and the jammer, respectively.
The QoS constraints at the receivers, namely that the Signal-to-Interference-plus-Noise Ratios (SINRs) received by the SR and the PR are no lower than their respective minimum levels $\lambda_1$ and $\lambda_2$, can be represented by
$$\mathrm{SINR}_{\mathrm{SR}} = \sum_{m \in \mathcal{M}} \frac{p_{\mathrm{ST}}^m |g_{\mathrm{SS}}^m|^2}{p_{\mathrm{PT}}^m |i_{\mathrm{PS}}^m|^2 + 1} \ge \lambda_1$$
and
$$\mathrm{SINR}_{\mathrm{PR}} = \sum_{m \in \mathcal{M}} \frac{p_{\mathrm{PT}}^m |g_{\mathrm{PP}}^m|^2}{p_{\mathrm{ST}}^m |i_{\mathrm{SP}}^m|^2 + 1} \ge \lambda_2.$$
The energy causality constraints for the ST and the jammer, namely that the energy consumed for transmitting or jamming in the second time slot cannot exceed the currently available battery capacity, are given as
$$E_{\mathrm{NL,ST}}^t + B_{\mathrm{ST}}^{t-1} - \beta_1^t T \sum_{m \in \mathcal{M}} p_{\mathrm{ST}}^m \ge 0,$$
$$E_{\mathrm{NL,J}}^t + B_{\mathrm{J}}^{t-1} - \beta_2^t T \sum_{m \in \mathcal{M}} p_{\mathrm{J}}^m \ge 0,$$
$$0 \le B_{\mathrm{ST}}^t \le B_{\mathrm{ST,max}},$$
$$0 \le B_{\mathrm{J}}^t \le B_{\mathrm{J,max}},$$
where $B_{\mathrm{ST,max}}$ and $B_{\mathrm{J,max}}$ are the maximum battery capacities, and $B_{\mathrm{ST}}^{t-1}$ and $B_{\mathrm{J}}^{t-1}$ stand for the residual battery capacities of the ST and the jammer at transmission block $t-1$, respectively.
Accordingly, the secrecy rate $R_{\sec}[t]$ at each transmission block $t$ is defined as
$$R_{\sec}[t] = \sum_{m \in \mathcal{M}} \left[ R_{\mathrm{SR}}^{(m)}[t] - R_{\mathrm{E}}^{(m)}[t] \right]^+,$$
where $[\cdot]^+ \triangleq \max(0, \cdot)$, and the achievable rate $R_{\mathrm{SR}}^{(m)}[t]$ on the ST-SR link and the wiretapping rate $R_{\mathrm{E}}^{(m)}[t]$ on the ST-Eavesdropper and Jammer-Eavesdropper links over transmission block $t$ and the $m$-th subcarrier are, respectively, expressed as
$$R_{\mathrm{SR}}^{(m)}[t] = \beta_1^t \log_2\left(1 + \frac{p_{\mathrm{ST}}^m |g_{\mathrm{SS}}^m|^2}{p_{\mathrm{PT}}^m |i_{\mathrm{PS}}^m|^2 + 1}\right)$$
and
$$R_{\mathrm{E}}^{(m)}[t] = \begin{cases} R_m^{(1)}[t], & \alpha_1^t \ge \alpha_2^t \\ R_m^{(2)}[t], & \text{otherwise}, \end{cases}$$
where
$$R_m^{(1)}[t] = (1 - \alpha_1^t) \log_2\left(1 + \frac{p_{\mathrm{ST}}^m |w_{\mathrm{SE}}^m|^2}{p_{\mathrm{J}}^m |j_{\mathrm{JE}}^m|^2 + 1}\right)$$
and
$$R_m^{(2)}[t] = (\alpha_2^t - \alpha_1^t) \log_2\left(1 + p_{\mathrm{ST}}^m |w_{\mathrm{SE}}^m|^2\right) + (1 - \alpha_2^t) \log_2\left(1 + \frac{p_{\mathrm{ST}}^m |w_{\mathrm{SE}}^m|^2}{p_{\mathrm{J}}^m |j_{\mathrm{JE}}^m|^2 + 1}\right).$$
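To make the secrecy-rate expressions above concrete, the sketch below evaluates the per-subcarrier secrecy rate and sums it over subcarriers. All numerical inputs (time coefficients, powers, and the randomly drawn Rayleigh channel magnitudes) are illustrative assumptions.

```python
import numpy as np

def secrecy_rate_subcarrier(alpha1, alpha2, beta1, p_st, p_pt, p_j, g_ss, i_ps, w_se, j_je):
    """Per-subcarrier secrecy rate: achievable ST-SR rate minus the wiretapping rate, floored at zero."""
    r_sr = beta1 * np.log2(1.0 + p_st * abs(g_ss) ** 2 / (p_pt * abs(i_ps) ** 2 + 1.0))
    if alpha1 >= alpha2:
        # jammer starts transmitting no later than the ST: the whole WIT phase is jammed
        r_e = (1.0 - alpha1) * np.log2(1.0 + p_st * abs(w_se) ** 2 / (p_j * abs(j_je) ** 2 + 1.0))
    else:
        # the ST transmits un-jammed for a fraction (alpha2 - alpha1) of the block
        r_e = ((alpha2 - alpha1) * np.log2(1.0 + p_st * abs(w_se) ** 2)
               + (1.0 - alpha2) * np.log2(1.0 + p_st * abs(w_se) ** 2 / (p_j * abs(j_je) ** 2 + 1.0)))
    return max(0.0, r_sr - r_e)

# Illustrative usage: 32 subcarriers with i.i.d. Rayleigh channel magnitudes (assumed values)
rng = np.random.default_rng(0)
gains = rng.rayleigh(scale=np.sqrt(0.5), size=(4, 32))   # rows: g_SS, i_PS, w_SE, j_JE
r_sec = sum(secrecy_rate_subcarrier(0.3, 0.4, 0.7, 0.05, 2.0, 0.06, *gains[:, m]) for m in range(32))
print(f"R_sec[t] ~ {r_sec:.3f} bit/s/Hz")
```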

5.2. Problem Formulation

In general, there exists a tradeoff between EH and Wireless Information Transmission (WIT). For the jammer, a longer EH duration allows it to harvest more energy and thus increase the jamming power for fighting illegal eavesdropping; for the ST, in contrast, a shorter EH duration yields a longer WIT duration for delivering confidential information in the next transmission block. Our goal is to seek an optimal joint EH time coefficient and transmission power allocation strategy over each transmission block that maximizes the long-term achievable secrecy rate while satisfying the other constraint requirements. The overall problem can be formulated as
$$\max_{\alpha_1^t, \alpha_2^t, p_{\mathrm{ST}}^m, p_{\mathrm{J}}^m} \ E\left[\sum_{t=1}^{\infty} R_{\sec}[t]\right] \quad \text{s.t.} \ (1), (2), (3), (4), (9), (10), (11), (12), (13), (14), (15), (16).$$
It is observed that the proposed problem is non-convex as the objective function is non-concave, and, thus, an effective solution is needed.

6. Deep Reinforcement Learning for Joint Time and Power Allocation

6.1. DRL Framework

To solve the proposed problem, this paper proposes a multi-agent DRL framework, as shown in Figure 3. The DRL framework is composed of an environment and multiple agents. Each sub-channel of the ST-SR and Jammer-Eavesdropper links is regarded as an agent that aims to explore the optimal transmission power allocation strategy. In addition, a time allocation network is established as a "time" agent to obtain the optimal EH time coefficients $\alpha_1^t$ and $\alpha_2^t$. Let $\mathcal{K} \triangleq \{0, 1, 2, \ldots, 2M\}$ denote the set of the $2M+1$ agents. Everything else in this framework is regarded as the environment. The agents collectively interact with the dynamic environment to acquire a large number of learning experiences for seeking the optimal resource allocation policy. This interactive process can be modeled as an MDP $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{R}$ is the reward function, $\mathcal{P}$ is the state transition probability, and $\gamma \in [0, 1)$ is the reward discount factor. A transmission block is regarded as a time step. At each time step $t$, given the current environment state $S_t$, each agent $k \in \mathcal{K}$ obtains a local observation $Z_t^k = O(S_t, k)$ and then takes an action $a_t^k$. As a result, the agent receives a reward $r_t^k$ from the environment, and the current state $S_t$ transfers to the next state $S_{t+1}$ according to the probability $\mathcal{P}$.
Figure 3. A DRL framework.
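The interaction loop of this multi-agent framework can be sketched as follows. The environment interface (observe, step) and the agent objects are hypothetical placeholders used only to illustrate the observe-act-store pattern described above; they are not part of the paper.

```python
from collections import namedtuple

# One experience (Z_t^k, a_t^k, Z_{t+1}^k, r_t) collected by agent k at time step t
Transition = namedtuple("Transition", ["obs", "action", "next_obs", "reward"])

def run_time_step(env, agents, replay_buffers):
    """One time step of the multi-agent MDP: every agent observes, acts, and stores its transition."""
    observations = [env.observe(k) for k in range(len(agents))]      # Z_t^k = O(S_t, k)
    actions = [agent.act(obs) for agent, obs in zip(agents, observations)]
    reward, next_observations = env.step(actions)                    # environment moves from S_t to S_{t+1}
    for k, buffer in enumerate(replay_buffers):
        buffer.append(Transition(observations[k], actions[k], next_observations[k], reward))
```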

6.2. Observation Space

To enable the agents to learn an effective strategy $\pi(A_t \mid S_t)$, the current environment state $S_t$ must reflect the environment characteristics as much as possible. The joint instantaneous CSIs of the different transmission links can be expressed as
$$G_t = \left\{ \{g_{\mathrm{SS}}^m\}_{m \in \mathcal{M}}, \{i_{\mathrm{PS}}^m\}_{m \in \mathcal{M}}, \{h_{\mathrm{PS}}^m\}_{m \in \mathcal{M}}, \{g_{\mathrm{PP}}^m\}_{m \in \mathcal{M}}, \{i_{\mathrm{SP}}^m\}_{m \in \mathcal{M}}, \{h_{\mathrm{PJ}}^m\}_{m \in \mathcal{M}}, \{w_{\mathrm{SE}}^m\}_{m \in \mathcal{M}}, \{j_{\mathrm{JE}}^m\}_{m \in \mathcal{M}} \right\}.$$
At transmission block $t-1$, we denote
$$I_{t-1} = \left\{ \sum_{m \in \mathcal{M}} p_{\mathrm{PT}}^m |i_{\mathrm{PS}}^m|^2, \ \sum_{m \in \mathcal{M}} p_{\mathrm{ST}}^m |i_{\mathrm{SP}}^m|^2, \ \sum_{m \in \mathcal{M}} p_{\mathrm{J}}^m |j_{\mathrm{JE}}^m|^2 \right\},$$
$$\mathrm{SINR}_{t-1} = \left\{ \mathrm{SINR}_{\mathrm{PR}}, \ \mathrm{SINR}_{\mathrm{SR}} \right\},$$
and
$$B_{t-1} = \left\{ B_{\mathrm{ST}}^{t-1}, \ B_{\mathrm{J}}^{t-1} \right\}$$
as the joint interference powers, the joint SINRs, and the joint residual battery capacities, respectively. The instantaneous CSIs are included to reflect the current channel state. The joint interference power $I_{t-1}$ is related to the agents' transmission powers, which have a direct impact on the environment. For instance, the transmission power of the ST may cause strong interference to the primary link, in which case the SU may be prohibited from accessing the licensed spectrum of the PU, which directly leads to a temporary secrecy rate reduction. The joint SINRs at the previous transmission block represent the QoS received by the PR and the SR and facilitate the improvement of the power strategy of the SU. Because the current battery capacity $B_{\mathrm{ST}}^t$ depends on the residual capacity $B_{\mathrm{ST}}^{t-1}$, the transmission power at the ST is influenced by $B_{\mathrm{ST}}^{t-1}$. Furthermore, the previous reward $r_{t-1}$ can act as feedback to motivate the agents, and, thus, it is included in the observation $Z_t^k$.
In summary, the observation $Z_t^k$ of each agent $k$ at time step $t$ is given as
$$Z_t^k = \left\{ G_t, I_{t-1}, \mathrm{SINR}_{t-1}, B_{t-1}, r_{t-1} \right\}.$$
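A minimal sketch of how this observation might be flattened into a real-valued feature vector for the Q-network input is given below; the argument names and the use of channel-gain magnitudes are assumptions, since the paper does not specify the exact encoding.

```python
import numpy as np

def build_observation(channel_gains, interference, sinr, battery, prev_reward):
    """Concatenate G_t, I_{t-1}, SINR_{t-1}, B_{t-1}, and r_{t-1} into one flat vector."""
    return np.concatenate([
        np.abs(np.asarray(channel_gains, dtype=complex)),   # magnitudes of the instantaneous CSIs (G_t)
        np.asarray(interference, dtype=float),              # three aggregate interference powers (I_{t-1})
        np.asarray(sinr, dtype=float),                      # SINR_PR and SINR_SR at block t-1
        np.asarray(battery, dtype=float),                   # residual battery capacities B_ST, B_J
        np.asarray([prev_reward], dtype=float),             # previous reward r_{t-1}
    ])
```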

6.3. Action Space

We denote $a_t^k$ as the action of agent $k$ at time step $t$. The joint action of the agents is formulated as
$$A_t = \left\{ a_t^0, a_t^1, \ldots, a_t^M, \ldots, a_t^{2M} \right\}.$$
The action of the "time" agent is set as the EH time coefficients $C_t = \{\alpha_1^t, \alpha_2^t\}$, and the actions of the other agents are set as their transmission powers. Therefore, the joint action of all agents is expressed as
$$A_t = \left\{ \alpha_1^t, \alpha_2^t, p_{\mathrm{ST}}^1, \ldots, p_{\mathrm{ST}}^M, p_{\mathrm{J}}^1, \ldots, p_{\mathrm{J}}^M \right\}.$$
To keep the learning process tractable, the optional EH time coefficients $\alpha_1^t$ and $\alpha_2^t$ are discretized into $L_1$ time levels, i.e., $\{c_1, c_2, \ldots, c_{L_1}\}$, and the optional transmission powers are discretized into $L_2$ power levels, i.e., $\{p_1, p_2, \ldots, p_{L_2}\}$.
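Each agent therefore works with a small discrete action set. The sketch below decodes an agent's discrete action index into either a pair of EH time coefficients (for the "time" agent) or a transmission power level; the level counts $L_1$ and $L_2$ and the power range are assumed values, since the paper does not fix them in this section.

```python
import numpy as np

L1, L2 = 10, 8                                    # assumed numbers of time and power levels
time_levels = np.linspace(0.0, 1.0, L1)           # candidate EH time coefficients c_1, ..., c_L1
power_levels = np.linspace(0.0, 2.0, L2)          # candidate transmission powers p_1, ..., p_L2 (W, assumed range)

def decode_action(agent_index, action_index):
    """Map an agent's discrete action index to the physical quantity it controls."""
    if agent_index == 0:                           # the "time" agent chooses (alpha_1^t, alpha_2^t)
        i, j = divmod(action_index, L1)            # action_index in {0, ..., L1*L1 - 1}
        return time_levels[i], time_levels[j]
    return power_levels[action_index]              # power agents choose one subcarrier power
```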

6.4. Reward Design

We convert some of the constraint requirements in the proposed problem into parts of the reward. The reward consists of the instantaneous secrecy rate $R_{\sec}[t]$ at the current transmission block $t$, the joint SINRs at the previous transmission block $t-1$, and the battery capacities $B_{\mathrm{ST}}^t$ and $B_{\mathrm{J}}^t$. Accordingly, the reward for each agent $k$ is expressed as
$$r_t = \eta_1 R_{\sec}[t] + \eta_2 R_{\mathrm{SINR}}[t] + \eta_3 R_{\mathrm{Bac}}[t],$$
where
$$R_{\mathrm{SINR}}[t] = \left( \sum_{m \in \mathcal{M}} \frac{p_{\mathrm{ST}}^m |g_{\mathrm{SS}}^m|^2}{p_{\mathrm{PT}}^m |i_{\mathrm{PS}}^m|^2 + 1} - \lambda_1 \right) + \left( \sum_{m \in \mathcal{M}} \frac{p_{\mathrm{PT}}^m |g_{\mathrm{PP}}^m|^2}{p_{\mathrm{ST}}^m |i_{\mathrm{SP}}^m|^2 + 1} - \lambda_2 \right),$$
$$R_{\mathrm{Bac}}[t] = \left( B_{\mathrm{ST}}^{t-1} - 0.1 B_{\mathrm{ST,max}} \right) + \left( B_{\mathrm{J}}^{t-1} - 0.1 B_{\mathrm{J,max}} \right),$$
$$\eta_1 + \eta_2 + \eta_3 = 1, \quad 0 \le \eta_1, \eta_2, \eta_3 \le 1,$$
and 0.1 in (28b) is a critical threshold of the battery capacity.
In this reward $r_t$, the first entry $R_{\sec}[t]$ is the performance-oriented part, which directs an agent's learning. The second entry $R_{\mathrm{SINR}}[t]$ is related to the QoS of the SR and the PR. There is a balance between the instantaneous secrecy rate and the available battery capacity; therefore, the third entry $R_{\mathrm{Bac}}[t]$ is also added to $r_t$. Considering their different impacts on system performance, each entry is endowed with a positive weight between zero and one.
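The weighted reward above can be computed as sketched below; the SINR thresholds and battery arguments in the usage example are illustrative assumptions, while the weights (0.6, 0.2, 0.2) are those used later in the simulation section.

```python
def compute_reward(r_sec, sinr_sr, sinr_pr, b_st, b_j, b_st_max, b_j_max,
                   lam1=1.0, lam2=1.0, eta=(0.6, 0.2, 0.2)):
    """Weighted reward r_t = eta1*R_sec + eta2*R_SINR + eta3*R_Bac (a sketch)."""
    r_sinr = (sinr_sr - lam1) + (sinr_pr - lam2)             # positive when both QoS targets are met
    r_bac = (b_st - 0.1 * b_st_max) + (b_j - 0.1 * b_j_max)  # positive above the 10% battery threshold
    return eta[0] * r_sec + eta[1] * r_sinr + eta[2] * r_bac

# Illustrative (assumed) inputs
print(compute_reward(r_sec=3.2, sinr_sr=2.5, sinr_pr=4.0, b_st=40.0, b_j=30.0, b_st_max=80.0, b_j_max=80.0))
```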
At each time step $t$, the long-term expected return $R_t$ for an agent $k$ is defined as
$$R_t = E\left[\sum_{l=0}^{\infty} w^l r_{t+l}\right],$$
where $w \in [0, 1]$ is a discount factor. In DRL, the main goal is to maximize the return $R_t$ by seeking an optimal strategy $\pi$.

6.5. LSTM-D3QN Algorithm

In our proposed system, the dynamic characteristics are primarily reflected in the time-varying channels and the random noise received by the receivers. To overcome this issue, we resort to a combination of a classical DRL algorithm and the LSTM network, namely LSTM-D3QN, which is used to capture the temporal variation regularity of our proposed system. The LSTM-D3QN network architecture is presented in Figure 4. Each time step of the proposed algorithm is divided into two phases, i.e., the training phase and the implementation phase.
Figure 4. LSTM-D3QN network architecture.
(1)
Implementation phase
At the beginning of each episode, the environment state is randomly initialized. At each time step t, each agent k takes an action
$$a_t^k = \begin{cases} \mathrm{random}(\mathcal{A}), & 0 \le p < \epsilon \\ \arg\max_{a_t^k \in \mathcal{A}} Q(Z_t^k, a_t^k), & \epsilon \le p \le 1 \end{cases}$$
based on the $\epsilon$-greedy policy; that is, the optimal action $a_t^k$ is selected from the action space $\mathcal{A}$ with probability $1-\epsilon$ according to the maximal estimated action-value function $Q(Z_t^k, a_t^k)$, while a random action is taken with probability $\epsilon$. The collected transition $(Z_t^k, a_t^k, Z_{t+1}^k, r_t)$ of agent $k$ is stored in the prioritized experience replay buffer $D_k$ once the environment has evolved from the current state $S_t$ to the next state $S_{t+1}$.
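A minimal sketch of the $\epsilon$-greedy rule in (32) is given below, assuming a q_network callable that returns the estimated Q-values of all discrete actions for a given observation.

```python
import numpy as np

def epsilon_greedy(q_network, observation, num_actions, epsilon, rng=None):
    """Explore with probability epsilon, otherwise pick the greedy action."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))     # random exploration over the action space
    q_values = q_network(observation)             # estimated Q(Z_t^k, a) for every action a
    return int(np.argmax(q_values))               # greedy exploitation
```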
(2)
Training Phase
Each agent $k$ is a DRL model and, thus, has an LSTM-D3QN network architecture with a decision Q-network $Q$ and a target Q-network $\hat{Q}$, which are initialized with parameters $\theta_k$ and $\hat{\theta}_k$, respectively. The action-value function of the decision Q-network is expressed as
$$Q(S_t, a_t^k; \theta_1, \theta_2) = V(S_t; \theta_1) + A(S_t, a_t^k; \theta_2) - \frac{1}{|\mathcal{A}|} \sum_{a^* \in \mathcal{A}} A(S_t, a^*; \theta_2),$$
where $V$ is the state-value function with parameter $\theta_1$ and $A$ is the advantage function with parameter $\theta_2$. The relationship between the value of taking action $a$ in the current state $s$ and the value of taking action $a'$ in the next state $s'$ is described by the Bellman expectation equation
$$Q(s, a) = E_\pi \left[ r_{t+1} + \gamma Q(S_{t+1} = s', A_{t+1} = a') \mid S_t = s, A_t = a \right].$$
The structure of the target Q-network $\hat{Q}$ is the same as that of the decision Q-network $Q$. During prioritized experience replay, the agent samples a mini-batch of transitions $\{(s, a, s', r)\}_{i=1}^{K}$ from the replay buffer $D_k$ to update the decision Q-network. The prioritized experience replay mechanism can accelerate the learning process by endowing each transition with a different priority. We define the TD-error for the replay buffer as
$$e = r_t + \gamma \hat{Q}(S', a'; \hat{\theta}_k) - Q(S_t, a_t^k; \theta_k),$$
where $S'$, $a'$, $S_t$, and $a_t^k$ are taken from $D_k$. A transition $(S_t, a_t^k, S_{t+1}, r_t)$ with a larger absolute value $|e|$ is more likely to be selected from $D_k$ as a training sample. The loss function of the decision Q-network $Q$ is defined as a sum-squared error, that is,
$$L(\theta_k) = \sum_{(S_t, a_t^k) \in D_k} \left( y - Q(S_t, a_t^k; \theta_k) \right)^2,$$
where
$$y = r_t + \gamma \hat{Q}(S_{t+1}, a'; \hat{\theta}_k),$$
$$a' = \arg\max_{a \in \mathcal{A}} Q(S_{t+1}, a; \theta_k).$$
We apply the Adam optimizer with learning rate $\delta$ to minimize the loss $L(\theta_k)$ and update the decision Q-network. For the target Q-network, the parameter $\hat{\theta}_k$ is renewed every $N_{\hat{Q}}$ time steps per episode by assigning the current parameter $\theta_k$ to $\hat{\theta}_k$. The specific training procedure is described in Algorithm 1.
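The dueling aggregation, the double-DQN target, and the sum-squared loss described above can be sketched as follows. Plain callables stand in for the decision and target LSTM-D3QN networks, so this is only a functional illustration of the update rule, not the authors' implementation.

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine the state value V and the advantages A into Q-values (dueling aggregation)."""
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()

def double_dqn_target(reward, next_obs, q_decision, q_target, gamma=0.99):
    """Target y: the decision network selects the next action, the target network evaluates it."""
    a_star = int(np.argmax(q_decision(next_obs)))
    return reward + gamma * q_target(next_obs)[a_star]

def td_loss(batch, q_decision, q_target, gamma=0.99):
    """Sum-squared TD error over a sampled mini-batch of transitions (obs, action, next_obs, reward)."""
    loss = 0.0
    for obs, action, next_obs, reward in batch:
        y = double_dqn_target(reward, next_obs, q_decision, q_target, gamma)
        loss += (y - q_decision(obs)[action]) ** 2
    return loss
```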
Algorithm 1 LSTM-D3QN with prioritized experience replay
1: Start the environment simulator and generate the network topology
2: Initialize the channel gain of each link
3: Initialize the neural network parameters randomly
4: Initialize the capacity of each replay buffer $D_k$
5: for each episode $e \in \{0, 1, 2, \ldots, E_{\max} - 1\}$ do
6:   Update the locations of all nodes and the channel gains
7:   Randomly select a joint EH time coefficient $C_t$
8:   Randomly initialize the transmission powers
9:   for each step $t \in \{0, 1, 2, \ldots, L_{\max} - 1\}$ do
10:     for each agent $k$ do
11:       Observe an observation $Z_t^k = O(S_t, k)$ from the current environment
12:       Choose an action $a_t^k$ according to the $\epsilon$-greedy policy
13:     end for
14:     Update the channel gains
15:     for each agent $k$ do
16:       Observe the next observation $Z_{t+1}^k$, receive the reward $r_t$, and store the transition $(Z_t^k, a_t^k, Z_{t+1}^k, r_t)$ into $D_k$
17:     end for
18:   end for
19:   for each agent $k$ do
20:     Sample a mini-batch of transitions from the replay buffer $D_k$ and update the decision Q-network
21:     Renew the parameter $\hat{\theta}_k$ of the target Q-network every $N_{\hat{Q}}$ time steps by assigning the current parameter $\theta_k$ to $\hat{\theta}_k$
22:   end for
23: end for

6.6. Computational Complexity Analysis

The computational complexity of our proposed algorithm is mainly determined by the number of multiplications performed in the networks $Q$ and $\hat{Q}$ [29]. We calculate the computational complexity of the implementation and training phases at each time step, respectively.
In the implementation phase of each time step, for an input $s$ of the environment state, the network $Q$ computes its corresponding output. Based on the connection and computation structure of deep neural networks, the computational complexity of this forward pass can be calculated as $O(\Omega)$, where $\Omega \triangleq f_1(I_1 + W_1 W_2) + f_2 W_2 + \sum_{i=2}^{3} f_i f_{i+1} + f_5(f_2 + f_6) + O_1(f_6 + 1)$, $f_l$ is the number of neurons in the $l$-th ($l = 1, 2, \ldots, 6$) fully connected layer, $W_1$ is the number of LSTM units, $W_2$ is the number of neurons in an LSTM unit, $I_1$ is the dimension of the input environment state, and $O_1$ is the number of neurons of the output layer, which equals the size of the action space.
In the training phase of each time step, a mini-batch of transition tuples $\{(s, a, s', r)\}_{i=1}^{K}$ is sampled to update the network $Q$. Each episode contains $L$ time steps, and the number of agents is $2M+1$. As the target network $\hat{Q}$ is updated every $N_{\hat{Q}}$ time steps, the computational complexity of the $2M+1$ agents in the networks $Q$ and $\hat{Q}$ during the training process of each episode is $O((1 + 1/N_{\hat{Q}})(2M+1)KL\Omega)$. During the prioritized experience replay, the state transition tuples $(s, a, s', r)$ stored in buffer $D_k$ are sorted by priority, and the corresponding computational complexity is $O(\sum_{k=1}^{2M+1} |D_k| \log_2(|D_k|))$, where $|D_k|$ represents the capacity of buffer $D_k$. The total computational complexity of the whole training is thus $O((1 + 1/N_{\hat{Q}})(2M+1)KL\Omega + KL\sum_{k=1}^{2M+1} |D_k| \log_2(|D_k|))$.
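As a rough numerical illustration of the complexity expression above, the sketch below evaluates $\Omega$ with the layer sizes reported later in the simulation section; the input-state dimension $I_1$ and the action-space size $O_1$ are assumed values, since they depend on the number of subcarriers and the action discretization.

```python
# Evaluate the forward-pass complexity term Omega for assumed network dimensions.
f = [64, 64, 128, 128, 128, 128]     # neurons in the six fully connected layers (from Section 7)
W1, W2 = 5, 100                      # LSTM units and neurons per unit (from Section 7)
I1, O1 = 300, 80                     # assumed input-state dimension and action-space size

omega = (f[0] * (I1 + W1 * W2)                        # first layer fed by the state and the LSTM output
         + f[1] * W2                                  # f_2 * W_2 term
         + sum(f[i] * f[i + 1] for i in range(1, 3))  # f_2*f_3 + f_3*f_4
         + f[4] * (f[1] + f[5])                       # f_5 * (f_2 + f_6)
         + O1 * (f[5] + 1))                           # output layer term
print(f"Omega ~ {omega:,} multiplications per forward pass")
```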

6.7. Convergence Analysis

The convergence of the Double Q-learning algorithm is the prerequisite for the convergence of the proposed algorithm. The Double Q-learning algorithm includes two value functions: $Q^A$ and $Q^B$.
Theorem 1.
If the Double Q-learning based DRL algorithm meets the following conditions: (C1) a large number of state transition tuples $\{(s, a, r, s')\}_{i=1}^{K}$ can be acquired by a proper learning policy; (C2) the reward discount factor $\gamma \in [0, 1]$; (C3) the learning rate $\delta_t$ satisfies
$$0 \le \delta_t \le 1, \quad \sum_{t=1}^{\infty} \delta_t = \infty, \quad \sum_{t=1}^{\infty} (\delta_t)^2 < \infty;$$
(C4) the reward function $r_t$ in the EH-CRN, which is defined by Equation (30), is bounded; (C5) the Q-values $Q^A$ and $Q^B$ are stored in a lookup table; and (C6) both $Q^A$ and $Q^B$ are updated infinitely many times; then the Q-values $Q^A$ and $Q^B$ will converge to the optimal value function $Q^*$, i.e., $Q^A, Q^B \rightarrow Q^*$.
Proof. 
In (C1), the $\epsilon$-greedy policy in Equation (32) can be used as the proper learning policy to collect a large number of state transitions. In (C2), $\gamma$ is easily chosen as a value between 0 and 1. In (C3), the learning rate can be set as $\delta_t = \frac{1}{t+1}$, and then
$$0 \le \frac{1}{t+1} \le 1, \quad \sum_{t=1}^{\infty} \delta_t = \sum_{t=1}^{\infty} \frac{1}{t+1} = \infty, \quad \sum_{t=1}^{\infty} \left(\frac{1}{t+1}\right)^2 < \sum_{t=1}^{\infty} \frac{1}{t(t+1)} = \sum_{t=1}^{\infty} \left( \frac{1}{t} - \frac{1}{t+1} \right) < 1 < \infty$$
holds. In (C4), the reward function $r_t$ in Equation (30) includes three parts: (1) $R_{\sec}[t]$; (2) $R_{\mathrm{SINR}}[t]$; and (3) $R_{\mathrm{Bac}}[t]$. For finite transmission powers $p_{\mathrm{ST}}^m$ and $p_{\mathrm{PT}}^m$, finite channel coefficients, finite battery capacities, and constant reward weights $\eta_1$, $\eta_2$, $\eta_3$, all three parts $R_{\sec}[t]$, $R_{\mathrm{SINR}}[t]$, and $R_{\mathrm{Bac}}[t]$ are also finite, and, thus, the reward function $r_t$ is bounded. In (C5), we can create a Q-table in the same way as the Q-learning algorithm and store the Q-values $Q^A$ and $Q^B$ in it. In (C6), $Q^A$ and $Q^B$ can be updated infinitely many times by (36) as long as the number of time steps is large enough, i.e., $t \rightarrow \infty$. Consequently, $Q^A$ and $Q^B$ converge to the optimal Q-value function $Q^*$. The proof is completed. □

7. Simulation Results

In this part, we conduct some numerical experiments to verify the proposed DRL-based joint EH time and power allocation scheme for our proposed system. Main simulation parameters are listed in Table 1.
Table 1. Main Simulation Parameters.
In the simulation setup, the number of subcarriers is initially set to $N = 32$. We set the EH circuit parameters to $a_l = 150.0$, $b_l = 0.014$, and $A_l = 1.5$ W [17]. As shown in Figure 4, the decision Q-network consists of six fully connected layers, an LSTM layer, an output layer, and a Softmax layer. The LSTM layer contains five LSTM units, each consisting of 100 neurons. The numbers of neurons in the fully connected layers are 64, 64, 128, 128, 128, and 128, respectively. Rectified Linear Units (ReLUs), defined as $f(x) = \max(0, x)$, are employed as the activation function of these six fully connected layers. The output layer generates a vector containing the Q-values of all actions, and the Softmax layer then normalizes the Q-values to the range between zero and one. In the reward, we set the positive weights as $\eta_1 = 0.6$, $\eta_2 = 0.2$, and $\eta_3 = 0.2$. At the start of each episode, all nodes are distributed randomly within a square area with a side length of 300 m.
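To make the architecture description concrete, a possible PyTorch realization is sketched below. Interpreting "five LSTM units of 100 neurons each" as a five-layer LSTM with hidden size 100, placing the LSTM ahead of the fully connected stack, using dueling value/advantage heads for the D3QN output, and choosing the state dimension (300) and action count (80) are all assumptions; the paper fixes only the layer sizes, the ReLU activations, and the final Softmax.

```python
import torch
import torch.nn as nn

class LSTMD3QN(nn.Module):
    """Sketch of the LSTM-D3QN decision Q-network with the layer sizes given in the text."""

    def __init__(self, state_dim=300, num_actions=80):
        super().__init__()
        # Assumed interpretation: 5 stacked LSTM layers, each with a hidden size of 100
        self.lstm = nn.LSTM(input_size=state_dim, hidden_size=100, num_layers=5, batch_first=True)
        sizes = [100, 64, 64, 128, 128, 128, 128]          # six fully connected layers with ReLU
        layers = []
        for in_f, out_f in zip(sizes[:-1], sizes[1:]):
            layers += [nn.Linear(in_f, out_f), nn.ReLU()]
        self.fc = nn.Sequential(*layers)
        self.value = nn.Linear(128, 1)                     # dueling heads: state value V(s)
        self.advantage = nn.Linear(128, num_actions)       # and advantages A(s, a)

    def forward(self, state_seq):
        out, _ = self.lstm(state_seq)                      # state_seq: (batch, time, state_dim)
        h = self.fc(out[:, -1, :])                         # features of the last time step
        v, a = self.value(h), self.advantage(h)
        q = v + a - a.mean(dim=1, keepdim=True)            # dueling aggregation into Q-values
        return torch.softmax(q, dim=1)                     # Softmax layer normalizes the Q-values

# Illustrative forward pass on a random state sequence (assumed dimensions)
net = LSTMD3QN()
print(net(torch.randn(1, 5, 300)).shape)                   # -> torch.Size([1, 80])
```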
To verify the performance of our proposed method, we compare it with the following benchmark resource allocation schemes.
(1) JOEHTS-TP-QL (Joint Optimization of EH Time-Slot and Transmission Power Based on Q-Learning) in [20]: This method is based on a traditional reinforcement learning algorithm. To apply it to solve our problem, the state space is required to be discretized. It aims to maximize the achievable secrecy rate by optimizing transmission power.
(2) MADDPG Based Resource Allocation in [21]: It aims to maximize achievable secrecy rate by jointly optimizing EH time slot and transmission power.
(3) C-DQN scheme in [22]: This method is a combination of a curiosity-driven mechanism and Deep Q-network (DQN), and the agent is reinforced by an extrinsic reward supplied by the environment and an intrinsic reward.
Figure 5 shows the secrecy rate at each episode during the training phase. In this figure, the secrecy rate under the proposed method shows a growing trend despite continuous fluctuations during the first 1000 episodes and later converges to a steady rate, which demonstrates its effectiveness in improving secrecy performance. After 1400 training episodes, the performance gaps between our proposed method and the other schemes become distinctive. The MADDPG method shows a wide range of fluctuation and requires more training episodes to converge, while our proposed method converges quickly and steadily, demonstrating its effectiveness in improving the secrecy rate and overcoming the influence of this instability. Compared with the previous schemes, the C-DQN scheme steadily converges to a lower secrecy rate due to its non-use of the EH technique. The secrecy rate under the JOEHTS-TP-QL scheme increases slowly and surpasses the C-DQN scheme after 1280 training episodes.
Figure 5. Secrecy rate versus the number of episodes.
The proposed method addresses the basic instability and overestimation problems of function approximation in reinforcement learning through prioritized experience replay and a target network. Prioritized experience replay enables RL agents to sample and train offline from previously observed data; this not only greatly reduces the amount of interaction required with the environment, but also allows a batch of experiences to be sampled to reduce the variance of learning updates. In addition, sampling from a large replay memory breaks the temporal correlations that may adversely affect the RL algorithm, thereby improving throughput. Neither the JOEHTS-TP-QL scheme nor the C-DQN scheme is equipped with these advantages. MADDPG has the drawback that each critic network must observe the states and actions of all agents, which is impractical when the number of agents is large or uncertain, since the joint state space becomes too large. For these reasons, the benchmark schemes fall behind the proposed method.
Figure 6 and Figure 7 show the secrecy rates with respect to increasing transmission powers of the ST and the jammer under different schemes, respectively. In Figure 6, the transmission power of the jammer is fixed at 2.0 W. The secrecy rate under our proposed method increases as the maximum transmission power of the ST becomes greater, and the maximum secrecy rate is obtained when it lies between 1.5 W and 2.5 W. Increasing the maximum transmission power contributes to a greater secrecy rate to some extent, but a greater transmission power is likely to cause strong interference to the PU and, thus, jeopardize the spectrum access of the SU for gaining further performance enhancement. The C-DQN and JOEHTS-TP-QL schemes have similarly low secrecy rates. In Figure 7, the maximum transmission power of the ST is fixed at 2.0 W. It can be observed that when the maximum transmission power of the jammer is kept below a certain power level, the jamming signal of the jammer has little impact on the eavesdropper. This indicates that the jammer must harvest enough energy to increase its transmission power so that the secrecy performance can be improved further.
Figure 6. Secrecy rate versus maximum transmission power of ST.
Figure 7. Secrecy rate versus maximum transmission power of jammer.
We study the effect of different discount rates on the cumulative reward per episode. In Figure 8, the more the discount rate deviates from 1.0, the more dramatically the trajectory of the reward fluctuates. When γ is set as 1.0, the reward converges in the fastest and most stable way. Theoretically, the agent is likely to focus on short-term returns when the discount rate γ is lower than 1.0. During the training process, the secrecy rate as a main part of the reward dominates the learning direction of the agent, hence maximizing the cumulative rewards encourages more instantaneous secrecy information to be delivered. However, agents cannot transmit secrecy data for long periods of time due to the limited energy. Therefore, γ = 1.0 acts as an optimal balance factor between the EH and WIT phases.
Figure 8. Reward under different discount rates.
Figure 9 and Figure 10 show the secrecy rates under all the algorithms versus the maximum battery capacity of jammer and ST, respectively. With the increase in the maximum battery capacity, the secrecy rates under different schemes also increase. Of all the algorithms, the proposed algorithm gains the best performance at each given value of maximum battery capacity. When the maximum battery capacity is beyond 60 mJ , all algorithms start to converge to different performance levels; this is mainly because the jammer and ST are influenced by the limited RF energy that can be harvested.
Figure 9. Secrecy rate versus maximum battery capacity B J , max of jammer.
Figure 10. Secrecy rate versus maximum battery capacity B ST , max of ST.
Figure 11 shows the secrecy rate versus the number of subcarriers m. The proposed method is capable of greatly enhancing the secrecy rate and outperforms other schemes with the highest secrecy rate. The MADDPG and the C-DQN schemes have performance gaps of approximately 28.8% and 80.5% with our proposed method, respectively, when the number of subcarriers is 128. The secrecy rate under the JOEHTS-TP-QL scheme degrades dramatically when the number of subcarriers is beyond 96. It is observed that the secrecy rates under all schemes start to deteriorate when the number of subcarriers is beyond 128. It is mainly because increasing the number of subcarriers will increase the size of the action space, and it is harder to find the optimal strategy, which brings a decline in performance.
Figure 11. Secrecy rate versus the number of subcarriers m.
With knowledge of the environment regularity, agents can intelligently adjust their decision strategies so that a target state can be reached from any initial state in a small number of transition steps. Here, a target state is defined as a given state in which all constraint requirements in the proposed problem are satisfied. Similar to [27], we use the indicator "average number of transition steps", defined as the average number of continuous transition steps agents take from an initial state to a target state, to measure the robustness of our proposed algorithm.
Figure 12 shows the average number of transition steps in each test. The multiple agents are tested over five hundred time steps at the end of each training episode. Our proposed scheme requires the smallest number of transition steps to reach the target state, while the other schemes need more transition steps. Moreover, by the 25th test the proposed method has essentially converged, whereas the other schemes need at least 65 tests or even more; thus, the convergence speed of the proposed method is 160% faster than that of the benchmark algorithms.
Figure 12. Average number of transition steps in each test.
This corroborates the robustness and rapidity of the proposed method.

8. Conclusions

In this paper, we have developed a multi-agent DRL framework for the proposed EH-based CRN with a wireless-powered cooperative jammer and formulated the corresponding resource allocation problem. The D3QN algorithm is combined with an LSTM network to improve the system's secrecy performance. The proposed method is divided into training and implementation phases. The numerical results demonstrate that, compared with the benchmark algorithms, the proposed method can increase the long-term achievable secrecy rate by 30.1% and the convergence speed by 160% with the minimum average number of transition steps. In future work, we will further explore secure and energy-efficient resource allocation for the proposed network.

Author Contributions

Conceptualization, J.W. and W.J.; methodology, R.L. and H.Q.; software, H.Q., Z.J. and Z.L.; validation, J.W., Z.J. and Z.L.; formal analysis, R.L., W.J., Z.J. and Z.L.; investigation, R.L., H.Q., Z.J. and Z.L.; resources, R.L., J.W. and H.Q.; data curation, R.L., J.W. and H.Q.; writing—original draft preparation, H.Q.; writing—review and editing, H.Q., W.J. and J.W.; visualization, H.Q., Z.J. and Z.L.; supervision, R.L. and W.J.; project administration, R.L., H.Q. and W.J.; funding acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of China under Grant No. 61871133 and in part by the Industry-Academia Collaboration Program of Fujian Universities under Grant No. 2020H6006.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ma, Y.; Lv, T.; Liu, H.; Li, T.; Zeng, J.; Pan, G. Secrecy Outage Analysis of CR-SWIPT Networks With Artificial Noise and Spatially Random Secondary Terminals. IEEE Trans. Cogn. Commun. Netw. 2022, 8, 931–945. [Google Scholar] [CrossRef]
  2. Luo, L.; Li, Q.; Cheng, J. Performance Analysis of Overlay Cognitive NOMA Systems With Imperfect Successive Interference Cancellation. IEEE Trans. Commun. 2020, 68, 4709–4722. [Google Scholar] [CrossRef]
  3. Wang, J.; Ge, Y. A Radio Frequency Energy Harvesting-Based Multihop Clustering Routing Protocol for Cognitive Radio Sensor Networks. IEEE Sens. J. 2022, 22, 7142–71562. [Google Scholar] [CrossRef]
  4. Hu, H.; Da, X.; Ni, L.; Huang, Y.; Zhang, H. Green Energy Powered Cognitive Sensor Network With Cooperative Sensing. IEEE Access. 2019, 7, 17354–17364. [Google Scholar] [CrossRef]
  5. Thanh, P.D.; Hoan, T.N.K.; Koo, I. Joint Resource Allocation and Transmission Mode Selection Using a POMDP-Based Hybrid Half-Duplex/Full-Duplex Scheme for Secrecy Rate Maximization in Multi-Channel Cognitive Radio Networks. IEEE Sens. J. 2020, 20, 3930–3945. [Google Scholar] [CrossRef]
  6. Wu, X.; Ma, J.; Xing, Z.; Gu, C.; Xue, X.; Zeng, X. Secure and Energy Efficient Transmission for IRS-Assisted Cognitive Radio Networks. IEEE Trans. Cogn. Commun. Netw. 2022, 8, 170–185. [Google Scholar] [CrossRef]
  7. Zhang, G.; Xu, J.; Wu, Q.; Cui, M.; Li, X.; Lin, F. Wireless Powered Cooperative Jamming for Secure OFDM System. IEEE Trans. Veh. Technol. 2018, 67, 1331–1346. [Google Scholar] [CrossRef]
  8. Gu, X.; Zhang, G.; Wang, M.; Duan, W.; Wen, M.; Ho, P.-H. UAV-Aided Energy-Efficient Edge Computing Networks: Security Offloading Optimization. IEEE Internet Things J. 2022, 9, 4245–4258. [Google Scholar] [CrossRef]
  9. Xu, H.; Sun, L.; Ren, P.; Du, Q.; Wang, Y. Cooperative Privacy Preserving Scheme for Downlink Transmission in Multiuser Relay Networks. IEEE Trans. Inf. Forensic Secur. 2017, 12, 825–839. [Google Scholar] [CrossRef]
  10. Tashman, D.H.; Hamouda, W.; Moualeu, J.M. On Securing Cognitive Radio Networks-Enabled SWIPT Over Cascaded κ-μ Fading Channels with Multiple Eavesdroppers. IEEE Veh. Technol. Mag. 2022, 71, 478–488. [Google Scholar] [CrossRef]
  11. Abedi, M.R.; Mokari, N.; Saeedi, H.; Yanikomeroglu, H. Robust Resource Allocation to Enhance Physical Layer Security in Systems with Full-Duplex Receivers: Active Adversary. IEEE Trans. Wirel. Commun. 2017, 16, 885–899. [Google Scholar] [CrossRef]
  12. Xu, D.; Zhu, H. Secure Transmission for SWIPT IoT Systems with Full-Duplex IoT Devices. IEEE Internet Things J. 2019, 6, 10915–10933. [Google Scholar] [CrossRef]
  13. Yan, P.; Zou, Y.; Ding, X.; Zhu, J. Energy Aware Relay Selection Improves Security-Reliability Tradeoff in Energy Harvesting Cooperative Cognitive Radio Systems. IEEE Trans. Veh. Technol. 2020, 69, 5115–5128. [Google Scholar] [CrossRef]
  14. Li, M.; Yin, H.; Huang, Y.; Wang, Y.; Yu, R. Physical Layer Security in Overlay Cognitive Radio Networks with Energy Harvesting. IEEE Trans. Veh. Technol. 2018, 67, 11274–11279. [Google Scholar] [CrossRef]
  15. Ding, X.; Zou, Y.; Zhang, G.; Chen, X.; Wang, X.; Hanzo, L. The Security–Reliability Tradeoff of Multiuser Scheduling-Aided Energy Harvesting Cognitive Radio Networks. IEEE Trans. Commun. 2019, 67, 3890–3904. [Google Scholar] [CrossRef]
  16. Xiao, H.; Jiang, H.; Deng, L.-P.; Luo, Y.; Zhang, Q.-Y. Outage Energy Efficiency Maximization for UAV-Assisted Energy Harvesting Cognitive Radio Networks. IEEE Sens. J. 2022, 22, 7094–7105. [Google Scholar] [CrossRef]
  17. Wang, Y.; Wang, Y.; Zhou, F.; Wu, Y.; Zhou, H. Resource Allocation in Wireless Powered Cognitive Radio Networks Based on a Practical Non-Linear Energy Harvesting Model. IEEE Access 2017, 5, 17618–17626. [Google Scholar] [CrossRef]
  18. Zhou, F.; Chu, Z.; Sun, H.; Hu, R.Q.; Hanzo, L. Artificial Noise Aided Secure Cognitive Beamforming for Cooperative MISO-NOMA Using SWIPT. IEEE J. Sel. Areas Commun. 2018, 36, 918–931. [Google Scholar] [CrossRef]
  19. Hua, Y.; Li, R.; Zhao, Z.; Chen, X.; Zhang, H. GAN Powered Deep Distributional Reinforcement Learning for Resource Management in Network Slicing. IEEE J. Sel. Areas Commun. 2020, 38, 334–349. [Google Scholar] [CrossRef]
  20. Alnagar, S.I.; Salhab, A.M.; Zummo, S.A. Q-Learning-Based Power Allocation for Secure Wireless Communication in UAV-Aided Relay Network. IEEE Access 2021, 9, 33169–33180. [Google Scholar] [CrossRef]
  21. Zhang, Y.; Mou, Z.; Gao, F.; Jiang, J.; Ding, R.; Han, Z. UAV Enabled Secure Communications by Multi-Agent Deep Reinforcement Learning. IEEE Trans. Veh. Technol. 2020, 69, 11599–11611. [Google Scholar] [CrossRef]
  22. Fu, F.; Jiao, Q.; Yu, F.R.; Zhang, Z.; Du, J. Securing UAV-to-Vehicle Communications: A Curiosity-Driven Deep Q-learning Network (C-DQN) Approach. In Proceedings of the 2021 IEEE International Conference on Communications Workshops (ICC Workshops), Montreal, QC, Canada, 14–23 June 2021; pp. 1–6. [Google Scholar]
  23. Mamaghani, M.T.; Hong, Y. Intelligent Trajectory Design for Secure Full- Duplex MIMO-UAV Relaying Against Active Eavesdroppers: A Model-Free Reinforcement Learning Approach. IEEE Access 2020, 9, 4447–4465. [Google Scholar] [CrossRef]
  24. Karachontzitis, S.; Timotheou, S.; Krikidis, I.; Berberidis, K. Security Aware Max–Min Resource Allocation in Multiuser OFDMA Downlink. IEEE Trans. Inf. Forensic Secur. 2015, 10, 529–542. [Google Scholar] [CrossRef]
  25. Nguyen, P.X.; Nguyen, V.-D.; Nguyen, H.V.; Shin, O.-S. UAV-Assisted Secure Communications in Terrestrial Cognitive Radio Networks: Joint Power Control and 3D Trajectory Optimization. IEEE Trans. Veh. Technol. 2021, 70, 3298–3313. [Google Scholar] [CrossRef]
  26. Bouabdellah, M.; Bouanani, F.E. A PHY Layer Security of a Jamming-Based Underlay Cognitive Satellite-Terrestrial Network. IEEE Trans. Cogn. Commun. Netw. 2021, 7, 1266–12791. [Google Scholar] [CrossRef]
  27. Li, X.; Fang, J.; Cheng, W.; Duan, H.; Chen, Z.; Li, H. Intelligent Power Control for Spectrum Sharing in Cognitive Radios: A Deep Reinforcement Learning Approach. IEEE Access 2018, 6, 25463–25473. [Google Scholar] [CrossRef]
  28. Boshkovska, E.; Ng, D.W.K.; Dai, L.; Schober, R. Power Efficient and Secure WPCNs With Hardware Impairments and Non-Linear EH Circuit. IEEE Trans. Commun. 2018, 66, 2642–2657. [Google Scholar] [CrossRef]
  29. Shi, Z.; Xie, X.; Lu, H.; Yang, H.; Kadoch, M.; Cheriet, M. Deep-Reinforcement-Learning-Based Spectrum Resource Management for Industrial Internet of Things. IEEE Internet Things J. 2021, 8, 3476–3489. [Google Scholar] [CrossRef]
