Anti-Jamming Resource-Allocation Method in the EH-CIoT Network through LWDDPG Algorithm

In the Energy-Harvesting (EH) Cognitive Internet of Things (EH-CIoT) network, due to the broadcast nature of wireless communication, the EH-CIoT network is susceptible to jamming attacks, which leads to a serious decrease in throughput. Therefore, this paper investigates an anti-jamming resource-allocation method, aiming to maximize the Long-Term Throughput (LTT) of the EH-CIoT network. Specifically, the resource-allocation problem is modeled as a Markov Decision Process (MDP) without prior knowledge. On this basis, this paper carefully designs a two-dimensional reward function that includes throughput and energy rewards. On the one hand, the Agent Base Station (ABS) intuitively evaluates the effectiveness of its actions through throughput rewards to maximize the LTT. On the other hand, considering the EH characteristics and battery capacity limitations, this paper proposes energy rewards to guide the ABS to reasonably allocate channels for Secondary Users (SUs) with insufficient power to harvest more energy for transmission, which can indirectly improve the LTT. In the case where the activity states of Primary Users (PUs), channel information and the jamming strategies of the jammer are not available in advance, this paper proposes a Linearly Weighted Deep Deterministic Policy Gradient (LWDDPG) algorithm to maximize the LTT. The LWDDPG is extended from DDPG to adapt to the design of the two-dimensional reward function, which enables the ABS to reasonably allocate transmission channels, continuous power and work modes to the SUs, and to let the SUs not only transmit on unjammed channels, but also harvest more RF energy to supplement the battery power. Finally, the simulation results demonstrate the validity and superiority of the proposed method compared with traditional methods under multiple jamming attacks.


Introduction
With the development of information technology and machine-to-machine communication, more and more physical devices are connecting to the Internet; thus, the Internet of Things (IoT) came into being. The IoT plays an important role in fields such as transportation, healthcare, industrial automation and disaster response [1]. However, the huge number of IoT devices generates large amounts of exchanged data, which increases the demand for spectrum resources.
Due to the scarcity of spectrum resources, improving spectrum utilization is an important issue [2]. Cognitive Radio (CR) is an effective tool to alleviate this problem. CR allows the Secondary Users (SUs) to opportunistically access the spectrum band for data transmission by sensing the activities of the Primary Users (PUs) [3]. In the CR system, the SUs have two access modes: interweave and underlay. In interweave mode, the SUs can only access the spectrum bands that are not occupied by the PUs. In underlay mode, the SUs can access the spectrum bands occupied by the PUs, but they must limit their transmission power to protect the PU communications. Reinforcement Learning (RL) enables wireless devices to interact with the environment without prior knowledge to learn strategies. In [20], the authors proposed a novel channel-management method based on Q-learning to defend against jamming attacks. In [21], the authors proposed a joint power and channel-management method based on event-driven Q-learning to adaptively reduce jamming and increase throughput. In wireless communication networks, the problems of resource allocation and jamming mitigation produce a large state and action space, in which traditional Q-learning algorithms are ineffective and hard to converge. In [22], the authors proposed a fast Deep Q-Network (DQN)-based anti-jamming method, which effectively improved the transmission signal-to-interference-plus-noise ratio. In [23], the authors applied the double DQN algorithm to a multi-user model and proposed a frequency-hopping method to defend against jamming attacks. In [24], to solve the problem of continuous power control in continuous state and action spaces, the authors proposed an Actor-Critic DQN method to allocate transmission power more accurately. In [25], the authors proposed a multi-agent DRL cooperative resource-allocation method to manage the member grouping and power for the EH-CIoT network. In [26], the authors proposed a Deep Deterministic Policy Gradient (DDPG) method to improve the Long-Term Throughput (LTT) of the energy-constrained CR-NOMA network by optimizing the time-sharing coefficient and the SUs' transmission power. In [27], the authors studied an optimal transmission algorithm in the EH-CIoT network; they used a DDPG algorithm to handle dynamic uplink access and continuous power control, which could effectively improve the LTT of the EH-CIoT network. In [28], the authors tackled the tradeoff between the active probability and the spectrum access opportunity; they derived the optimal final decision threshold that maximizes the expected achievable throughput of the EH cognitive radio network. In [29], the authors exploited the optimal time allocation between PUs and SUs, and balanced the tradeoff between energy harvesting and packet transmission to obtain the maximum total achievable throughput. In [30], the authors proposed the Adjusted-Deep Deterministic Policy Gradient (A-DDPG) and a combination of A-DDPG and convex optimization to effectively improve the long-term throughput of the ambient backscatter communications and radio frequency-powered cognitive radio network by jointly controlling the time scheduling and energy management of the SUs.
With the upgrading of jammers, wireless communication devices are prone to several types of intelligent jamming attacks. Therefore, some studies focus on defending against intelligent jamming attacks. In [31], the authors proposed a hierarchical reinforcement learning-based hybrid hidden strategy to defend against intelligent reactive jamming attacks. In [32], the authors considered intelligent full-duplex jammers, which can maximize the utility of eavesdropping and jamming by optimizing the jamming power. The authors proposed a Bayesian-Stackelberg Game method to defend against the intelligent full-duplex jammers and effectively improve the utility of legitimate communication devices. In [33], the authors proposed a two-layer reception strategy to defend against the jamming attacks launched by attackers using a reconfigurable intelligent surface on multiple users.

Contribution and Organization
The EH-CIoT network is widely studied because it can effectively alleviate the problems of spectrum scarcity and energy constraints. In addition, the broadcast nature of the EH-CIoT network makes it vulnerable to jamming attacks. Therefore, studying an anti-jamming method for the EH-CIoT network is an important and practical problem. The main contributions of this paper are summarized as follows: this paper carefully designs the reward function. Specifically, this paper proposes to take the throughput and the RF energy harvested by the SUs as two-dimensional rewards, which enables the ABS to evaluate its actions more effectively and guides the ABS to make strategies that are beneficial for maximizing the LTT of the EH-CIoT network.
The remainder of this paper is organized as follows. First, Sections 4 and 5 propose the EH-CIoT model and the jamming models, respectively. Second, an MDP-based optimization problem is formulated in Section 6. Third, Section 7 proposes the LWDDPG RL-based algorithm for an interweave EH-CIoT network under jamming attacks and gives the detailed steps of the algorithm. Finally, Section 8 analyzes the simulation results and Section 9 presents the conclusion.

System Model
In specific application scenarios, the system model shown in Figure 1 can be a wireless sensor network used for environmental monitoring. The nodes with energy-harvesting capability monitor real-time environmental data and transmit them to base stations.

This paper considers the interweave mode of the CR [27]. Figure 1 shows the multi-user EH-CIoT network under jamming attacks. The EH-CIoT network is composed of a Primary Users Network (PUN), which has M PUs and a Primary Base Station (PBS); one ABS and N SUs; and a Jamming Attack Node (JAN). Since the SUs in the EH-CIoT network are IoT devices with an EH function, this paper refers to the SUs as EH-CIoT nodes. In this EH-CIoT network, the PUN has K licensed channels, which the PUs use to communicate with the PBS. At the beginning of each timeslot, the ABS performs spectrum sensing to identify the idle licensed channels. This paper assumes that the ABS can always obtain perfect spectrum-sensing results [34]. Due to the uncertainty of the PU activities, the number of idle licensed channels I_K(t) may vary in each timeslot t. I_k(t) ∈ {1 (idle), 0 (busy)} represents the state of the k-th licensed channel sensed by the ABS at timeslot t, and I_K(t) = Σ_{k=1}^{K} I_k(t).
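The per-timeslot idle-channel count I_K(t) = Σ_{k=1}^{K} I_k(t) can be sketched as follows (a minimal sketch; the function and variable names are illustrative, not from the paper):

```python
def sense_channels(channel_states):
    """Count idle licensed channels: I_K(t) is the sum of I_k(t) over k.

    channel_states: list of I_k(t) values in {1 (idle), 0 (busy)}, k = 1..K,
    i.e. the ABS's (assumed perfect) spectrum-sensing result for one timeslot.
    """
    return sum(channel_states)

# Example: K = 5 licensed channels; channels 1, 3 and 4 are idle.
print(sense_channels([1, 0, 1, 1, 0]))  # -> 3
```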

Channel Gain
In the EH-CIoT network, both small-block Rayleigh fading and large-scale path-loss fading are considered [35]. This paper also considers that the channel gain varies between timeslots. Therefore, the channel gain g_xy(t) at the t-th timeslot is described as:
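The extracted text omits the expression for g_xy(t). A common form consistent with block Rayleigh fading plus large-scale path loss is sketched below; the distance d_xy and path-loss exponent α are assumptions here, not symbols taken from the paper:

```latex
g_{xy}(t) = \left| h_{xy}(t) \right|^{2} \, d_{xy}^{-\alpha},
\qquad h_{xy}(t) \sim \mathcal{CN}(0,\, 1),
```

where h_xy(t) is the small-scale Rayleigh fading coefficient, redrawn independently in each timeslot (block fading), and d_xy^{-α} models the large-scale path loss between transmitter x and receiver y.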

Two Mode Selections for EH-CIoT Nodes
In the EH-CIoT network, each EH-CIoT node has the same configuration: a single antenna, a rechargeable battery, a transmitter, a receiver, and an energy-acquisition device. The EH-CIoT nodes can only perform RF energy harvesting or transmission in each timeslot. The energy harvested in the current timeslot is not used immediately but stored in the rechargeable battery. Due to the limitation of spectrum resources, one EH-CIoT node can only access one idle licensed channel. Due to the scarcity of spectrum resources and the fact that massive numbers of IoT devices access wireless networks [29], like [27], this paper considers that all the EH-CIoT nodes always have data that need to be transmitted.
Assume that all the EH-CIoT nodes are managed by the ABS. Each EH-CIoT node sends its battery level state (see Section 4.3 for details) to the ABS through a dedicated control channel at the beginning of each timeslot t. As the core of the EH-CIoT network, the ABS determines the work mode (harvesting mode or transmission mode), channel access and transmission power of all EH-CIoT nodes in the current timeslot t according to I_K(t), the channel-gain information set G(t), the received battery level state set B(t) and the previous-slot reception result set Y(t − 1) (see Section 6 for details). Then, the resource-allocation decisions are broadcast to all EH-CIoT nodes. Let P^C_i(t) represent the transmission power of the i-th EH-CIoT node (i = 1, 2, ..., N) in the t-th timeslot, and let P^C_max denote the maximum transmission power of the EH-CIoT nodes. Therefore, in the t-th timeslot, the continuous power-allocation set of all EH-CIoT nodes is expressed as P(t) = {P^C_1(t), P^C_2(t), ..., P^C_N(t)}. If P^C_i(t) > 0, the i-th EH-CIoT node works in transmission mode with power P^C_i(t). If P^C_i(t) = 0, the i-th EH-CIoT node works in harvesting mode. The work mode M^C_i(t) of the i-th EH-CIoT node in the t-th timeslot can therefore be described as M^C_i(t) = 1 if P^C_i(t) > 0, and M^C_i(t) = 0 if P^C_i(t) = 0. This paper lets M(t) = {M^C_1(t), M^C_2(t), ..., M^C_N(t)} denote the work mode set of all EH-CIoT nodes in the t-th timeslot. The number of EH-CIoT nodes that select the transmission mode is I_C(t), and I_C(t) ≤ I_K(t). The timeslot structure is shown in Figure 2, where T is the length of a timeslot and τ is the length of the information-exchange phase in a timeslot.
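The mapping from the power allocation P(t) to the work modes M(t), together with the constraint that the number of transmitting nodes does not exceed the idle-channel count, can be sketched as follows (a minimal sketch; all names are illustrative):

```python
def assign_work_modes(powers, num_idle_channels):
    """Map the ABS's power allocation P(t) to work modes M(t).

    powers: list of P^C_i(t) >= 0 for the N EH-CIoT nodes.
    A node with positive power works in transmission mode (mode 1);
    zero power means harvesting mode (mode 0). The number of
    transmitting nodes IC(t) must not exceed the idle count IK(t).
    """
    modes = [1 if p > 0 else 0 for p in powers]
    ic = sum(modes)
    if ic > num_idle_channels:
        raise ValueError("IC(t) must not exceed IK(t)")
    return modes, ic

modes, ic = assign_work_modes([0.5, 0.0, 1.2], num_idle_channels=2)
print(modes, ic)  # -> [1, 0, 1] 2
```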

In the EH-CIoT network, it is assumed that the energy consumed in the information-exchange phase is fixed and represented by e_f. If the remaining energy of an EH-CIoT node is insufficient to complete the information exchange, it will not be able to send its battery state. When the ABS does not receive the battery state information of an EH-CIoT node in the information-exchange phase, it considers that EH-CIoT node to be in a low-power state and controls it to enter harvesting mode.

Energy Harvesting and Renewal
In the EH-CIoT network, the PBS, the JAN, and the ABS are powered by the grid. The EH-CIoT nodes are powered by rechargeable batteries, and they use EH technology to charge the batteries.
(1) Energy Harvesting. This paper considers that the EH-CIoT nodes can harvest energy from three kinds of transmission signals: those of the PUs, the JAN, and other EH-CIoT nodes. Since an RF energy harvester typically operates over a range of frequencies [37], this paper considers that each EH-CIoT node can only harvest energy from one channel. The PU transmits in the k-th licensed channel with a given power, and the jamming power of the JAN in the k-th licensed channel is P^J_k(t). This paper uses the linear energy-harvesting model. Therefore, the RF energy harvested by the i-th EH-CIoT node in the t-th timeslot can be expressed as: where η is the energy-conversion rate, P(t) represents the transmission power of the corresponding source, and, correspondingly, g(t) represents the channel gain, with g(t) ∈ {g_Pi(t), g_Ji(t), g_si(t)}. Due to the different transmission powers and channel gains in different channels, the EH-CIoT nodes obtain different amounts of energy through EH on different channels. The amount of energy harvested by an EH-CIoT node depends on the channel and work mode allocated to it by the ABS. The harvested energy of all EH-CIoT nodes in the t-th timeslot forms a set. It is worth noting that even if multiple EH-CIoT nodes harvest energy on the same channel, the energy harvested by each EH-CIoT node is not discounted.
(2) Battery Update. In the EH-CIoT network, the maximum battery capacity of the EH-CIoT nodes is B_max. The rechargeable batteries are assumed to be ideal, so there is no energy loss during energy storage or recovery. The battery power cannot exceed the maximum capacity. The battery state set of all EH-CIoT nodes in the t-th timeslot is expressed as B(t) = {B^C_1(t), B^C_2(t), ..., B^C_N(t)}. The evolution of the battery state of the i-th EH-CIoT node from timeslot t to timeslot t + 1 can be expressed as: where the indicator term represents whether the i-th EH-CIoT node has enough energy to report B^C_i(t) to the ABS in the information-exchange phase.
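The linear EH model and the battery evolution above can be sketched as follows. This is a minimal sketch: the harvesting duration, the exact form of the reporting indicator, and all names are assumptions, since the paper's equations are not recoverable from this extraction.

```python
def harvested_energy(eta, power, gain, duration):
    """Linear EH model: energy = eta * P(t) * g(t) * duration.

    eta: energy-conversion rate in (0, 1].
    power: transmit power of the RF source on the harvested channel
           (a PU, the JAN, or another EH-CIoT node).
    gain: channel gain g(t) from that source to the harvesting node.
    duration: harvesting time within the timeslot (assumed here).
    """
    return eta * power * gain * duration

def update_battery(b, harvested, consumed, b_max, e_f, reported):
    """Battery evolution from timeslot t to t+1 for one EH-CIoT node.

    b: current battery level B^C_i(t).
    harvested: energy gained in harvesting mode (0 in transmission mode).
    consumed: transmission energy spent this timeslot (0 when harvesting).
    e_f: fixed cost of the information-exchange phase, paid only if the
         node had enough energy to report its battery state.
    The result is clipped to [0, B_max] (ideal battery, no storage loss).
    """
    cost = consumed + (e_f if reported else 0.0)
    return min(max(b + harvested - cost, 0.0), b_max)

print(harvested_energy(eta=0.6, power=2.0, gain=0.25, duration=1.0))  # -> 0.3
print(update_battery(b=0.8, harvested=0.5, consumed=0.0,
                     b_max=1.0, e_f=0.1, reported=True))  # -> 1.0
```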

Jamming Attack Models
In the EH-CIoT network, the JAN aims to reduce the throughput of the EH-CIoT nodes. This paper considers three types of jamming attacks: random jamming attacks, scanning jamming attacks, and intelligent reactive-scanning jamming attacks. In addition, this paper describes the strategy of the JAN as G(p^J(t), P^J(t)), where p^J(t) is the jamming probability and P^J(t) is the jamming power. This paper then describes p^J(t) and P^J(t) under the different jamming attacks, giving them appropriate subscripts.
(1) Random Jamming Attack. The JAN jams the k-th channel with probability p^J_k(t) in the t-th timeslot, and the jamming power is P^J_k(t).
(2) Scanning Jamming Attack. The JAN jams K_N channels with probability p^J_{K_N}(t) in the t-th timeslot (K_N ≤ K), and the jamming power is P^J_{K_N}(t)/K_N, where P^J_{K_N}(t) is the jamming power when the JAN jams one channel. In the next timeslot, the JAN chooses another K_N channels to jam without repetition, i.e., the JAN needs K/K_N timeslots to traverse all K channels. This paper defines the JAN finishing jamming all K channels without repetition as a scanning period. Therefore, each scanning period contains K/K_N timeslots.
(3) Reactive-Scanning Jamming Attack. Unlike the scanning jamming attack, after the JAN allocates its jamming power P^J_{K_N}(t)/K_N to K_N channels, it senses the ACK/NACK messages of the jammed channels (the ACK/NACK is a feedback message with which the receiver acknowledges the channel access result to the transmitter; an ACK message represents a successful transmission, otherwise the transmission fails [38]). If an ACK is sensed, it means that although the JAN and the EH-CIoT node access the same channel, the jamming power is insufficient to prevent the EH-CIoT node's data transmission. At this time, the JAN changes its jamming strategy from a scanning attack to a centralized attack and directs all its jamming power to the channel where the ACK is located, i.e., P^J_{K_N}(t)/K_N → P^J_{K_N}(t). Once a NACK is sensed, the JAN has successfully prevented the EH-CIoT node's data transmission. At this point, the JAN leaves the channel and begins the next timeslot's jamming attack. If the current scanning period ends, the JAN starts a new scanning period. Figure 3 shows the flowchart of the reactive-scanning jamming attack.
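The channel traversal of the scanning attack in (2) can be sketched as follows (a minimal sketch with illustrative names; channels are indexed from 0 and K is assumed divisible by K_N):

```python
def scanning_schedule(K, K_N):
    """One scanning period of the scanning jammer.

    The JAN jams K_N distinct channels per timeslot and traverses all
    K channels without repetition, so one scanning period spans
    K / K_N timeslots.
    """
    assert K % K_N == 0, "assume K_N divides K"
    channels = list(range(K))
    return [channels[t * K_N:(t + 1) * K_N] for t in range(K // K_N)]

# K = 6 channels, K_N = 2 jammed per timeslot -> 3-timeslot period.
print(scanning_schedule(K=6, K_N=2))  # -> [[0, 1], [2, 3], [4, 5]]
```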

In each timeslot, the maximum jamming power of the JAN is P^J_max. There is also a constraint on the time-averaged jamming power P^J_avg = P^J_{K_N}(t)/K_N, where P^J_avg < P^J_max. The JAN can select its jamming power from a set of power levels P^J(t) ∈ {P_J0, P_J1, ..., P_JM}. Therefore, in the EH-CIoT network, the Signal-to-Interference-plus-Noise Ratio (SINR) of the i-th EH-CIoT node received by the ABS in a jamming-attack environment can be expressed as: where f^k_i(t) = f^k_J(t) represents the case in which the channel accessed by the i-th EH-CIoT node is the same as the channel attacked by the JAN, and n ∼ N(0, ω²) represents the additive white Gaussian noise. In (6), when the EH-CIoT node is in harvesting mode, no transmission power is generated and the SINR is 0. When the EH-CIoT node is in transmission mode and accesses the same channel as the JAN, its transmission will be severely jammed or even interrupted. Therefore, the ABS needs to reasonably allocate transmission channels, continuous power and work modes to the EH-CIoT nodes.
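The SINR logic described above can be sketched as follows (a minimal sketch with illustrative names; the noise variance ω² is passed as noise_var, and jamming power enters the denominator only when the JAN hits the node's channel):

```python
def sinr(p_tx, g_tx, jammed, p_jam, g_jam, noise_var):
    """SINR of the i-th EH-CIoT node's signal at the ABS.

    If the node is harvesting (p_tx == 0) the SINR is 0; the JAN's
    power contributes interference only when it attacks the same
    channel the node transmits on (jammed == True).
    """
    if p_tx == 0:
        return 0.0
    interference = p_jam * g_jam if jammed else 0.0
    return (p_tx * g_tx) / (interference + noise_var)

print(sinr(p_tx=1.0, g_tx=0.5, jammed=True,
           p_jam=2.0, g_jam=0.2, noise_var=0.1))  # -> 1.0
```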

Problem Formulation
In the EH-CIoT network under jamming attacks, the ABS learns the attack strategy of the JAN from the signal-reception status of the previous timeslot, predicts the channels that will be attacked, and arranges for the EH-CIoT nodes with low battery power to harvest energy while the other EH-CIoT nodes transmit on unjammed channels. The signal-reception result of the previous timeslot in the information-exchange phase is expressed as follows: ACK means the ABS successfully received the EH-CIoT node's transmission data; NACK means the ABS failed to receive the EH-CIoT node's transmission data or the EH-CIoT node was performing energy harvesting.
Considering the long-term system performance of the EH-CIoT network, the goal of this paper is to maximize the LTT of the EH-CIoT network under jamming attacks. The instant throughput of the EH-CIoT network in a timeslot can be calculated by Shannon's capacity formula [39,40]: where r^A_t represents the instant throughput of the EH-CIoT network at the t-th timeslot and W represents the spectrum bandwidth.
Since the EH-CIoT network is energy-constrained, it is not appropriate to simply maximize the instant throughput of the current timeslot; future timeslots should also be considered. Therefore, the LTT of the EH-CIoT network at the t-th timeslot is calculated by the following formula: where 0 < γ < 1 represents the future discount rate. The problem of maximizing the LTT of the EH-CIoT network can be formulated as maximizing the expected LTT subject to constraints (12)–(16), where E[·] represents the expected value. Constraints (12) and (13) ensure that the states of the channel k accessed by the i-th EH-CIoT node and by the JAN are binary values. Constraint (14) ensures that the energy of an EH-CIoT node working in transmission mode does not exceed its available remaining energy. Constraint (15) ensures that the SINR is not less than the threshold SINR_threshold. Constraint (16) ensures that the number of EH-CIoT nodes that select the transmission mode, I_C(t), does not exceed the number of idle channels.
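The instant throughput in (9) and the discounted LTT in (10) can be sketched as follows (a minimal sketch; names are illustrative and rates are per unit bandwidth):

```python
import math

def instant_throughput(W, sinrs):
    """Shannon sum-rate over the transmitting EH-CIoT nodes:
    r^A_t = sum_i W * log2(1 + SINR_i(t))."""
    return sum(W * math.log2(1.0 + s) for s in sinrs)

def long_term_throughput(instant_rates, gamma):
    """Discounted LTT starting at timeslot t:
    R^A(t) = sum_j gamma**j * r^A_{t+j}, with 0 < gamma < 1."""
    return sum((gamma ** j) * r for j, r in enumerate(instant_rates))

print(instant_throughput(W=1.0, sinrs=[1.0, 3.0]))          # -> 3.0
print(long_term_throughput([2.0, 1.0, 1.0], gamma=0.5))     # -> 2.75
```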
In the EH-CIoT network, if the ABS could obtain in advance the transition probabilities of the active states of the PUN together with the attack probability and jamming power of the JAN, Equation (11) could be solved using an offline method. However, it is unrealistic for the ABS to obtain such complete information. Deep Reinforcement Learning (DRL) can solve this intractable resource-allocation problem of the ABS [41].

DRL-Based Transmission-Optimization Algorithm
In this section, this paper analyzes the state parameters of the EH-CIoT network, builds the RL framework, and then briefly introduces the necessary principles of RL. Finally, the LWDDPG resource-allocation algorithm for an interweave EH-CIoT network under jamming attacks is proposed.

Framework of RL-Based EH-CIoT Network
The ABS is responsible for allocating transmission channels, continuous power, and work modes to the EH-CIoT nodes to maximize the LTT of the EH-CIoT network. The aim is to enable the ABS to effectively learn these strategies and make optimal decisions without prior knowledge. This paper constructs an environment model that maps the system model to the MDP's interactive environment [42]. The MDP consists of a quintuple, namely MDP = (S, A, P_sa, R, γ), where S is the state space, A is the action space, and P_sa is the state-transition probability. Considering the proposed EH-CIoT dynamic environment model without prior knowledge, P_sa is unknown. R is the reward function, and γ is a discount factor that exponentially discounts the value of future rewards in (10). The agent uses the discount factor to adjust the value placed on future rewards. The settings of the state space, action space, and reward in the model are explained as follows.
Agent: the ABS in the EH-CIoT network. It interacts with the EH-CIoT environment under jamming attacks without prior knowledge to discover which actions yield the greatest rewards in given situations. Compared with self-management by the EH-CIoT nodes, setting the ABS as the agent to manage all EH-CIoT nodes is conducive to making globally optimal decisions.
State space S: the ABS collects real-time information from all EH-CIoT nodes at the beginning of each timeslot, performs spectrum sensing, obtains the channel gains, and counts the information reception. The state at the t-th timeslot is then defined as S_t = {I(t), G(t), B(t), Y(t − 1)}, where I(t) = {I_1(t), I_2(t), ..., I_K(t)} represents the channel state set of the PUN, G(t) represents the channel-gain set, B(t) represents the battery level set of all EH-CIoT nodes and Y(t − 1) represents the information-reception set of the previous timeslot.
Action space A: the action at the t-th timeslot is defined as A_t = {M(t), P(t), C(t)}, where C(t) = {C_1(t), C_2(t), ..., C_N(t)} represents the transmission-channel set and C_i(t) represents the channel that the ABS allocates to the i-th EH-CIoT node. Overall, the ABS allocates transmission channels, continuous power, and work modes to all EH-CIoT nodes.
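The assembly of the state S_t = {I(t), G(t), B(t), Y(t − 1)} can be sketched as follows (a minimal sketch; the container and field names are illustrative, not from the paper):

```python
def build_state(I, G, B, Y_prev):
    """Assemble the MDP state S_t from the ABS's per-timeslot inputs:
    channel states I(t), channel gains G(t), battery levels B(t),
    and the previous timeslot's reception results Y(t-1)."""
    return {"I": list(I), "G": list(G), "B": list(B), "Y_prev": list(Y_prev)}

s = build_state(I=[1, 0, 1], G=[0.4, 0.2, 0.9], B=[0.7, 0.1], Y_prev=[1, 0])
print(sorted(s.keys()))  # -> ['B', 'G', 'I', 'Y_prev']
```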
Reward R: when the ABS interacts with the environment, it uses the reward to evaluate the effectiveness of its actions, estimate the distribution of states, and comprehend the surroundings. Therefore, a well-thought-out reward is essential for the ABS to learn efficiently. After taking an action in a state, the ABS receives a reward R(S_t, A_t). Then, the system moves to the next state S_t+1.
To better guide the ABS to rationally allocate resources to the EH-CIoT nodes, maximize the LTT of the EH-CIoT network and realize anti-jamming, this paper sets the reward R as a two-dimensional vector:

R(S_t, A_t) = [r^A_t, r^E_t],
where r^A_t denotes the instant throughput reward of the system at the t-th timeslot in (9). Whether the transmission channel is jammed, the channel-gain situation, and the allocation of the EH-CIoT nodes' transmission power all affect the size of r^A_t. Therefore, the ABS can evaluate the effectiveness of its actions through r^A_t. In addition, the battery power of the EH-CIoT nodes is also closely related to the LTT: adequate battery power can increase the transmission throughput of the EH-CIoT nodes. As mentioned earlier, owing to the differences in transmission power and channel gain on different channels, the energy harvested by the EH-CIoT nodes on different channels also varies. Thus, this paper sets r^E_t as the energy reward, which represents the energy harvested by the EH-CIoT nodes in the t-th timeslot. Compared to r^A_t, r^E_t is small, which effectively prevents the ABS from excessively attempting to obtain rewards through EH. Although it cannot be guaranteed that the battery of an EH-CIoT node working in transmission mode is always full, the proposed method enables the EH-CIoT nodes to harvest more energy for transmission than traditional methods during the EH phase through the design of the energy reward. Through the two-dimensional reward function, the ABS can continuously interact with the environment to evaluate the effectiveness of its actions and find the optimal actions to maximize the LTT.
Unlike the scalar reward of traditional RL algorithms, this paper designs the reward as a two-dimensional vector that includes both the instant throughput reward and the energy reward. To handle this two-dimensional vector, this paper uses the Linear Weighted (LW) method [43]. The method allocates appropriate weight coefficients to the elements of the vector according to their importance, and the sum of their products is used as a new objective function. Formally, the method aims to maximize the weighted sum Σ_{i=1}^{m} w_i r_i, where w_i is a non-negative weight and Σ_{i=1}^{m} w_i = 1. The weights w_i are often taken as constants. The proposed reward consists of the instant throughput reward and the energy reward; they have different weights, so the weighted sum of the reward vector elements can be expressed as r = Rω^T = r^A_t ω_a + r^E_t ω_e, where ω = [ω_a, ω_e]. In our setup, the weight parameter values are in the range [0, 1]. This method converts the two-dimensional reward vector into a scalar so that the ABS can intuitively evaluate the effectiveness of its actions.
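The LW scalarization r = ω_a r^A_t + ω_e r^E_t can be sketched as follows (a minimal sketch; names are illustrative):

```python
def scalarize_reward(r_throughput, r_energy, w_a, w_e):
    """Linear Weighted (LW) scalarization of the 2-D reward:
    r = w_a * r^A_t + w_e * r^E_t, with non-negative weights
    in [0, 1] summing to 1."""
    assert w_a >= 0 and w_e >= 0 and abs(w_a + w_e - 1.0) < 1e-9
    return w_a * r_throughput + w_e * r_energy

print(scalarize_reward(4.0, 0.5, w_a=0.9, w_e=0.1))  # -> 3.65
```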
In RL, after choosing action a according to the policy π, the expected return in state s is usually represented by the state-action value or Q-value Q^π(s, a) = E[R_A(t) | s, a], where R_A(t) is the future cumulative reward (10). Since the ABS obtains the action in a given state according to the policy π, the agent's target is to constantly learn the optimal policy π* = arg max_π Q^π(s, a). To obtain the optimal policy π*, the agent uses the Bellman equation to recursively calculate the Q-value. The Bellman equation separates the Q-function into the instant reward and the discounted future reward [44], that is, Q^π(s_t, a_t) = E[r_t + γ Q^π(s_{t+1}, a_{t+1})]. Traditional RL algorithms are not suitable for huge state and action spaces. For this reason, DRL introduces Deep Neural Networks (DNNs) for better nonlinear fitting of the value function. DQN uses the function Q(s, a | θ) to approximate the value function, where θ is the parameter vector of the neural network. Although DQN can make good decisions for discrete and low-dimensional action spaces, it is not suitable for continuous action control problems. To solve this problem, the Deterministic Policy Gradient (DPG) algorithm was adopted and an Actor-Critic (AC) method was proposed. The DQN and DPG algorithms were then combined in the DDPG algorithm, which is based on the AC framework. To handle the high-dimensional continuous optimization problem in the proposed system model, this paper provides an LWDDPG resource-allocation algorithm for an interweave EH-CIoT network under jamming attacks.

Linearly Weighted Deep Deterministic Policy Gradient-Based Power-Allocation Algorithm
The framework of the LWDDPG algorithm is shown in Figure 4. It consists of three parts: the actor policy network, the critic value network, and the experience buffer. The AC network includes four DNNs: an Online Critic Network (OCN) with parameters θ^Q, an Online Actor Network (OAN) with parameters θ^µ, a Target Critic Network (TCN) with parameters θ^Q′, and a Target Actor Network (TAN) with parameters θ^µ′. The OAN µ(s | θ^µ) builds the mapping from states to actions, and the OCN Q(s, a | θ^Q) estimates the value of actions. In the initialization phase, the TAN µ′(s | θ^µ′) and the TCN Q′(s, a | θ^Q′) are created by copying the parameters of the corresponding online networks.
When updating the network parameters, mini-batches are sampled from the experience buffer D with capacity C. These mini-batches train the parameters through the gradient-descent method, updating the actor network, the critic network, and their corresponding target networks in turn. Any sampled tuple can be denoted as (s_x, a_x, r_x, s_{x+1}). The OCN is optimized by minimizing the loss between the target value and the Q-function. The loss function of the OCN can be formulated as the Mean Squared Error (MSE) of the difference as follows: L(θ^Q) = (1/N_B) ∑_x (y_x − Q(s_x, a_x | θ^Q))².
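The experience buffer D described above can be sketched as a fixed-capacity FIFO container; the class name and tuple layout here are illustrative, following the (s, a, r, s′) tuples in the text.

```python
import random
from collections import deque

# Minimal sketch of the experience buffer D with capacity C. Old tuples are
# evicted first once the deque is full; mini-batches of size N_B are drawn
# uniformly at random for the gradient updates.
class ExperienceBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def save(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def is_full(self):
        return len(self.buffer) == self.buffer.maxlen

    def sample(self, n_b):
        # Uniform random mini-batch (s_x, a_x, r_x, s_{x+1}) of size N_B.
        return random.sample(list(self.buffer), n_b)

# Example: capacity 3, five insertions, so the two oldest tuples are evicted.
buf = ExperienceBuffer(capacity=3)
for i in range(5):
    buf.save(i, i, 0.0, i + 1)
batch = buf.sample(2)
```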

The target value y_x is calculated as y_x = r_x + γ Q′(s_{x+1}, µ′(s_{x+1} | θ^µ′) | θ^Q′); the gradient-descent method is then used to minimize L(θ^Q) and update the parameters in the OCN.
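The target-value computation and the MSE critic loss can be checked numerically; the Q-values below are stand-in numbers rather than outputs of real networks.

```python
import numpy as np

# Numerical sketch of the TD target y_x = r_x + gamma * Q'(s_{x+1}, mu'(s_{x+1}))
# and the MSE critic loss L(theta_Q). All Q-values are illustrative scalars.
def td_targets(rewards, q_next, gamma=0.99):
    # q_next stands for Q' evaluated at the target actor's action.
    return rewards + gamma * q_next

def critic_loss(q_pred, targets):
    # Mean squared error between the OCN's predictions and the targets.
    return np.mean((targets - q_pred) ** 2)

rewards = np.array([1.0, 0.5])
q_next = np.array([2.0, 1.0])
y = td_targets(rewards, q_next)              # [2.98, 1.49]
loss = critic_loss(np.array([3.0, 1.5]), y)  # small residual error
```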
For OAN optimization, the loss function is obtained by averaging the Q-functions over the sampled states. The OCN is used to calculate the evaluation value of the OAN's state-action pairs (the cumulative expected return), that is, L(θ^µ) = (1/N_B) ∑_x Q(s_x, µ(s_x | θ^µ) | θ^Q); the gradient-ascent method is then used to maximize L(θ^µ) and update the parameters in the OAN.
For the update of the two target networks, the DDPG algorithm adopts the soft update method, also called the exponential moving average: θ^Q′ ← ξθ^Q + (1 − ξ)θ^Q′ and θ^µ′ ← ξθ^µ + (1 − ξ)θ^µ′, where ξ ∈ (0, 1] represents the update rate of the target networks. By including noise in the actor policy, the exploration problem of ABS learning in continuous action spaces can be solved. Specifically, at each decision step, actions are chosen from a random process with expectation µ(S_t | θ^µ) and variance εσ², namely A_t ∼ N(µ(S_t | θ^µ), εσ²), where ε is a parameter that attenuates the randomness of actions during training. The randomness of actions may lead to IC(t) > IK(t), which does not satisfy constraint (16). To handle this instability, the ABS sorts the IC(t) nodes by SINR, allocates transmission mode to the nodes with the largest SINR, and allocates harvesting mode to the others. The proposed LWDDPG resource-allocation algorithm for an interweave EH-CIoT network under jamming attacks is given in Algorithm 1. This paper introduces energy rewards to enable the EH-CIoT nodes to harvest more energy and increase throughput, and uses the learning ability of the LWDDPG algorithm to let the ABS allocate transmission channels, continuous power, and work modes more reasonably for the EH-CIoT nodes, so that the EH-CIoT nodes avoid transmitting on jammed channels and thereby achieve anti-jamming. In the resource-allocation process of Algorithm 1, when all EH-CIoT nodes in the local area have insufficient energy and cannot harvest energy from the RF signals of other EH-CIoT nodes, they can still harvest the RF energy of the PUs and the JAN; moreover, the proposed method can avoid this extreme situation through energy rewards and reasonable allocation of transmission power.
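The soft (exponential-moving-average) target-network update above can be sketched directly; ξ = 0.005 matches the simulation settings, and the parameter shapes are illustrative.

```python
import numpy as np

# Sketch of the soft update theta' <- xi*theta + (1 - xi)*theta' applied to
# a list of parameter arrays; xi = 0.005 follows the simulation settings.
def soft_update(online_params, target_params, xi=0.005):
    return [xi * w + (1.0 - xi) * w_t
            for w, w_t in zip(online_params, target_params)]

# Example: each target entry moves from 0 toward the online value 1 by xi.
online = [np.ones((2, 2))]
target = [np.zeros((2, 2))]
target = soft_update(online, target)
```

With ξ close to 0 the target networks change slowly, which is what stabilizes the moving TD targets in DDPG-style training.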

Algorithm 1: LWDDPG resource-allocation algorithm for the interweave EH-CIoT network under jamming attacks.
1: Initialize the OCN Q(s, a | θ^Q) and the OAN µ(s | θ^µ); copy their parameters to the TCN and the TAN; initialize the experience buffer D.
2: for each episode do
3:   Initialize the environment.
4:   Get the initial state S_0.
5:   for each timeslot t do
6:     Choose action A_t ∼ N(µ(S_t | θ^µ), εσ²).
7:     Execute A_t, get the scalar reward through r = r_t^A ω_a + r_t^E ω_e and the next state S_{t+1}.
8:     Save the data (S_t, A_t, R_t, S_{t+1}) to the experience buffer D.
9:     if D is full, do
10:      Randomly sample transition data (s_x, a_x, r_x, s_{x+1}) of size N_B from D.
11:      Update the OCN by minimizing L(θ^Q) in Equation (23).
12:      Update the OAN by maximizing L(θ^µ) in Equation (25).
13:      Soft update the TCN and the TAN by Equation (26).
14:    end if
15:  end for
16: end for
17: Output: Optimal action A_t of each timeslot.

Simulation Settings
Through computer simulations, this paper evaluates the performance of the proposed LWDDPG resource-allocation algorithm by simulating a multi-user EH-CIoT model in a jamming-attack environment. In realistic scenarios, a base station provides wireless communication services to users within its coverage area, so its service range is the coverage area centered on the base station, and network nodes are usually randomly distributed within this range. In addition, according to 3GPP rules, under the existing 5G background, the coverage radius of macro base stations is over 200 m. Based on the above, the network is set up as follows. In an area of 1 km × 1 km, the PBS is located at [500, 500] and the ABS at [250, 250]; the PUs are distributed within a radius of 500 m centered on the PBS; the EH-CIoT nodes and the JAN are distributed within a radius of 250 m centered on the ABS. The node positions follow a Poisson distribution.
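The network layout described above can be reproduced with a short script; for simplicity this sketch drops nodes uniformly at random inside the stated radii (an illustrative simplification of the Poisson placement), and the node counts are assumptions.

```python
import numpy as np

# Illustrative reproduction of the simulated layout: PBS at [500, 500],
# ABS at [250, 250], PUs within 500 m of the PBS, EH-CIoT nodes within
# 250 m of the ABS. Uniform placement over each disk is a simplification.
def drop_nodes(center, radius, n, rng):
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    # sqrt on the radial draw keeps the density uniform over the disk area
    radii = radius * np.sqrt(rng.uniform(0.0, 1.0, n))
    return np.stack([center[0] + radii * np.cos(angles),
                     center[1] + radii * np.sin(angles)], axis=1)

rng = np.random.default_rng(0)
pus = drop_nodes([500.0, 500.0], 500.0, 4, rng)       # PUs around the PBS
eh_nodes = drop_nodes([250.0, 250.0], 250.0, 6, rng)  # EH-CIoT nodes around the ABS
```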
For a fair comparison, this paper uniformly adopts the channel bandwidth used in [27], i.e., 1 MHz. In addition, to verify the effectiveness of the proposed method, this paper considers the types of jamming attacks studied in existing work [45]: random jamming, scanning jamming, and reactive-scanning jamming. The random jammer randomly chooses a channel on which to inject jamming signals. The scanning jammer improves on the random jammer by randomly jamming multiple channels simultaneously, and thus has a greater impact on throughput. Compared to the scanning jammer, the reactive-scanning jammer is more intelligent, since it senses channel activity before jamming, so its harm is stronger. Considering the energy consumption of the jammer and the effectiveness of the jamming attack, the maximum jamming power is usually set slightly greater than the transmission power of the EH-CIoT nodes; therefore, this paper sets the maximum jamming power of the jammer to 0.2 W. Finally, this paper compares the proposed LWDDPG resource-allocation algorithm with the Greedy algorithm, the ACDQN algorithm in [24], and the DDPG algorithm in [27].
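The three jammer behaviors can be contrasted with a small sketch; the channel counts and the exact sensing mechanism are illustrative assumptions rather than the paper's implementation.

```python
import random

# Hedged sketch of the three jammer models compared in the simulations.
def random_jammer(n_channels, rng):
    # Picks a single channel uniformly at random.
    return [rng.randrange(n_channels)]

def scanning_jammer(n_channels, n_jam, rng):
    # Jams several distinct channels at once, so more transmissions suffer.
    return rng.sample(range(n_channels), n_jam)

def reactive_scanning_jammer(active_channels, n_jam, rng):
    # Concentrates jamming on channels where ACK/NACK traffic was sensed,
    # so only channels actually in use are targeted.
    k = min(n_jam, len(active_channels))
    return rng.sample(active_channels, k)

rng = random.Random(0)
scanned = scanning_jammer(n_channels=10, n_jam=3, rng=rng)
jammed = reactive_scanning_jammer(active_channels=[1, 3, 5], n_jam=2, rng=rng)
```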
The simulation uses Python 3.6 to develop the RL environment, and the results were obtained with the TensorFlow deep learning framework [46]. In the simulation, all networks of the LWDDPG-based algorithm have two hidden layers with L1 = 256 and L2 = 256 neurons, respectively. To reduce computational complexity, the activation functions of the hidden and output layers of the critic network and of the hidden layers of the actor network are set to rectified linear units. To limit the range of actions, the activation function of the output layer of the actor network is set to tanh. The optimizer of both the critic and actor networks is Adam [47]. The learning rates of the critic and the actor are set to 0.003. The soft update rate ξ is set to 0.005. The maximum number of episodes is set to 500, and the number of steps per episode is 10∼100. Each episode represents a complete RL process, and a step represents the action performed by the ABS in each episode. The parameters of the neural networks are initialized randomly at the beginning of each experiment. At the beginning of each episode, the battery of the EH-CIoT nodes is reset to E_max. The other simulation parameters are provided in Table 1.
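Because the actor's output layer uses tanh, its raw output in [-1, 1] must be rescaled to a valid transmit power. A minimal sketch of that rescaling, with an illustrative maximum power p_max:

```python
import numpy as np

# Sketch of mapping the actor's tanh output in [-1, 1] to a transmit power
# in [0, p_max]; p_max = 0.1 is an illustrative value, not the paper's.
def tanh_to_power(raw_action, p_max=0.1):
    return (np.tanh(raw_action) + 1.0) / 2.0 * p_max

# The extremes of tanh map to roughly 0 and p_max; zero maps to p_max/2.
powers = tanh_to_power(np.array([-10.0, 0.0, 10.0]))
```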

Statistical Results and Analysis
In this subsection, this paper compares the performance of four algorithms, the LWDDPG (proposed method), DDPG (method in [27]), ACDQN (method in [24]), and Greedy algorithms, under four different environments: no jamming, random jamming, scanning jamming, and reactive-scanning jamming, and briefly analyzes the simulation results.
First, this paper compares the performance of the different methods without jamming. As shown in Figure 5, all the RL algorithms improve with the number of episodes and eventually converge, which demonstrates their convergence. The convergence speed of the proposed method is similar to that of the DDPG algorithm; however, their average throughputs after convergence are about 1.62 and 1.4, respectively (for convenience, this paper uses only the mantissa of the scientific notation to express the throughput values of the different methods). The proposed method is clearly superior to the DDPG algorithm, improving on it by 15.7%. This is because the energy reward ensures that the ABS always allocates EH-CIoT nodes with insufficient power to harvest energy on the channels where they can obtain the most energy, so the EH-CIoT nodes have sufficient power for transmission in the next timeslot. In addition, the average throughputs of the ACDQN algorithm and the Greedy algorithm after convergence are about 1.1 and 0.9, respectively; the proposed method improves on them by 47.2% and 80%. The Greedy algorithm performs worst because its decisions are not based on long-term goals, so it cannot reasonably allocate limited energy to achieve a high LTT.

Figures 6 and 7 compare the performance of the different methods under random jamming and scanning jamming. Compared with the no-jamming case, the convergence throughput of all algorithms decreases under random jamming and scanning jamming attacks, and the impact of scanning jamming is greater. This is because the scanning jamming attack jams multiple channels at once and can therefore jam the transmissions of multiple EH-CIoT nodes simultaneously, causing a greater decrease in the throughput of the different algorithms. It can also be seen that the convergence throughput of the proposed method remains superior to the other algorithms under random jamming and scanning jamming attacks. On the one hand, the proposed method allocates continuous power to the EH-CIoT nodes to achieve more accurate power allocation, and the ABS continuously interacts with the environment to learn, comprehensively considering channel jamming, channel gain, and the battery power of the EH-CIoT nodes so that they transmit on the most suitable channels. On the other hand, since the reward function of the proposed method takes energy rewards into account, the EH-CIoT nodes have more power available for transmission at each timeslot. Under random jamming attacks, the convergence throughputs of the proposed method, the DDPG algorithm, the ACDQN algorithm, and the Greedy algorithm are about 1.59, 1.35, 1, and 0.8, respectively; the proposed method improves on them by 17.8%, 59%, and 98.8%. Under scanning jamming attacks, the corresponding convergence throughputs are about 1.4, 1.18, 0.8, and 0.5; the proposed method improves on them by 18.6%, 75%, and 180%. In Figure 8, this paper compares the performance of the different methods under the reactive-scanning jamming attack. Unlike the scanning jamming attack, in the reactive-scanning jamming attack the JAN achieves more accurate jamming by sensing ACK/NACK messages in the jammed channel. Therefore, compared with scanning jamming attacks, the convergence throughput of the different methods decreases further under reactive-scanning jamming attacks. Under reactive-scanning jamming attacks, the convergence throughputs of the proposed method, the DDPG algorithm, the ACDQN algorithm, and the Greedy algorithm are about 1.38, 1.1, 0.7, and 0.4,
respectively; the proposed method improves on the DDPG algorithm, the ACDQN algorithm, and the Greedy algorithm by 25.4%, 97.1%, and 245%.
Table 2 shows the percentage decrease in throughput of the different methods under the different jamming attacks compared with the no-jamming case. The smaller the value in Table 2, the better the corresponding method defends against that jamming attack. Table 2 shows that under random jamming, scanning jamming, and reactive-scanning jamming attacks, the throughput decrease percentages of the proposed method are 1.85%, 13.6%, and 14.8%, respectively, the best among all methods. This verifies that the proposed method is more effective than traditional methods in defending against different jamming attacks. Figure 9 shows the relationship between the total throughput of the different methods and the number of timeslots under the different jamming situations. As shown in Figure 9, when the battery capacity is fixed at 1 J, the total throughput of all methods gradually increases with the timeslots and tends to converge. This is because at the beginning the ABS knows nothing, or has inaccurate information, about the environment; as the timeslots increase, the ABS continuously interacts with the environment to learn and makes better decisions. The difference is that the maximum converged total throughput of each method differs significantly. Under no jamming, random jamming, scanning jamming, and reactive-scanning jamming attacks, the convergence throughputs of the proposed method are about 1.6, 1.55, 1.4, and 1.3, respectively, and the proposed method is superior to the other methods. This is because the proposed method can effectively harvest energy and enable the EH-CIoT nodes to fully utilize the limited battery capacity for transmission. Figure 10 shows the relationship between the average throughput of the different methods and the maximum battery capacity under the different jamming attacks. In Figure 10, the simulation results show that the average throughput of all methods increases with the maximum battery capacity B_max. This is because the larger the battery capacity, the more energy the EH-CIoT nodes can use to transmit, and the higher the power at which they can transmit in each timeslot. Under reactive-scanning jamming attacks, when the maximum battery capacity is 2 J, the convergence throughputs of the proposed method, the DDPG algorithm, the ACDQN algorithm, and the Greedy algorithm are about 2.4, 2.2, 1.4, and 0.93, respectively; the proposed method improves on them by 9.1%, 71%, and 158%. When the maximum battery capacity is 1 J, the improvement rates of the proposed method over the traditional methods are 25.4%, 97.1%, and 245%, respectively. Evidently, as the maximum battery capacity increases, the advantage of the proposed method shrinks, which indicates that the proposed method is particularly suitable for the energy-constrained EH-CIoT network.
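The Table 2 metric, the percentage decrease in converged throughput relative to the no-jamming baseline, can be checked directly from the throughput values quoted in the text.

```python
# Sketch of the Table 2 metric: percentage decrease in converged throughput
# under jamming relative to the no-jamming baseline. The throughput values
# are the ones quoted in the text for the proposed method.
def throughput_decrease_pct(no_jamming, jammed):
    return (no_jamming - jammed) / no_jamming * 100.0

baseline = 1.62                                         # no jamming
random_jam = throughput_decrease_pct(baseline, 1.59)    # about 1.85%
scanning_jam = throughput_decrease_pct(baseline, 1.4)   # about 13.6%
reactive_jam = throughput_decrease_pct(baseline, 1.38)  # about 14.8%
```

The computed percentages reproduce the 1.85%, 13.6%, and 14.8% figures reported for the proposed method.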

System Model
In specific application scenarios, the system model shown in Figure 1 can be a wireless sensor network used for environmental monitoring, in which nodes with energy-harvesting capability monitor real-time environmental data and transmit them to base stations. This paper considers the interweave mode of CR [27]. Figure 1 shows the multi-user EH-CIoT network under jamming attacks. The network is composed of a Primary Users Network (PUN), which has M PUs and a Primary Base Station (PBS), one ABS with N SUs, and a Jamming Attack Node (JAN). Since the SUs in the EH-CIoT network are IoT devices with the EH function, this paper refers to the SUs as the EH-CIoT nodes.


Figure 3. The flowchart of the reactive-scanning jamming attack.


Figure 4. Framework of the LWDDPG algorithm.

where d_xy [36] represents the distance between transmitter x and receiver y, x ∈ {1, 2, ..., N, P, J} denotes the x-th transmitter (the N EH-CIoT nodes, the PBS, and the JAN), y ∈ {1, 2, ..., N, b} denotes the y-th receiver (the N EH-CIoT nodes and the ABS), α represents the path-loss exponent, and d_0 represents the reference distance (d_xy > d_0) [36]. Therefore, g_ib (i = 1, 2, ..., N) represents the channel gain from the i-th EH-CIoT node to the ABS; g_Pi represents the channel gain from the PBS to the i-th EH-CIoT node; g_si (s = 1, 2, ..., N, s ≠ i) represents the channel gain between the s-th and i-th EH-CIoT nodes; g_Ji represents the channel gain from the JAN to the i-th EH-CIoT node; and G(t) = {g_ib(t), g_Pi(t), g_si(t), g_Ji(t)} is the set of the channel gains.
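A log-distance path-loss gain consistent with the quantities above (distance d_xy, reference distance d_0, path-loss exponent α) can be sketched as follows; the paper's exact gain model may include additional fading terms, so this is an illustrative form only.

```python
# Sketch of a log-distance path-loss channel gain g = (d/d0)^(-alpha),
# valid for d > d0. The values of d0 and alpha are illustrative.
def channel_gain(d, d0=1.0, alpha=3.0):
    assert d > d0, "model only applies beyond the reference distance"
    return (d / d0) ** (-alpha)

# Nearby links see a much stronger gain than distant ones.
g_near = channel_gain(10.0)
g_far = channel_gain(100.0)
```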

Table 2. Percentage decrease in throughput.