Outage Probability Minimization for Energy Harvesting Cognitive Radio Sensor Networks

The incorporation of cognitive radio (CR) capability in wireless sensor networks yields a promising network paradigm known as CR sensor networks (CRSNs), which can provide spectrum-efficient data communication. However, due to the high energy consumption resulting from spectrum sensing and subsequent data transmission, the energy supply of conventional battery-powered sensor nodes is regarded as a severe bottleneck for sustainable operation. The energy harvesting (EH) technique, which gathers energy from the ambient environment, is regarded as a promising solution for perpetually powering energy-limited devices with a continual source of energy. Therefore, applying the EH technique in CRSNs can facilitate the self-sustainability of energy-limited sensors. The primary concern of this study is to design sensing-transmission policies that minimize the long-term outage probability of EH-powered CR sensor nodes. We formulate this problem as an infinite-horizon discounted Markov decision process and propose an ϵ-optimal sensing-transmission (ST) policy using the value iteration algorithm, where ϵ is the error bound between the ST policy and the optimal policy and can be pre-defined according to the actual need. Moreover, for the special case in which the signal-to-noise ratio (SNR) is sufficiently high, we present an efficient transmission (ET) policy and prove that the ET policy achieves the same performance as the ST policy. Finally, extensive simulations are conducted to evaluate the performance of the proposed policies and the impact of various network parameters.


Introduction
During the last decade, bandwidth demand for the limited spectrum has been greatly increasing due to the explosive growth of wireless services. The current static frequency allocation schemes, with a severe underutilization of the licensed spectrum over vast temporal and geographic expanses [1], cannot support numerous emerging wireless services. This motivates the concept of cognitive radio (CR) [2][3][4], which has been envisioned as an intelligent and promising approach to alleviate the problem of spectrum utilization inefficiency. In CR networks (CRNs), unlicensed secondary users (SUs) opportunistically access the spectrum dedicated to some licensed primary users (PUs) without interfering with the PU operation [5]. Through enabling the CR users to dynamically access the available bands in the licensed spectrum, spectrum efficiency can be improved significantly.
The wireless sensor network (WSN), which is capable of performing event monitoring and data gathering, has been applied to various fields, including environment monitoring, military surveillance, smart homes and other industrial applications [6,7]. Currently, most WSNs work in the license-free band and are expected to suffer from heavy interference caused by other applications sharing the same spectrum. It is therefore imperative to employ CR in WSNs to exploit dynamic spectrum access techniques, hence giving birth to CR sensor networks (CRSNs) [8,9]. In CRSNs, in order
1. Considering the status of primary channels, the diversity of channel conditions, the energy replenishment process, as well as the imperfection of spectrum sensing, we investigate the joint optimization of channel sensing and adaptive transmit power allocation to minimize the SU's long-term outage probability. The above design problem is formulated as a discounted Markov decision process (MDP).
2. We theoretically prove the existence of an optimal stationary deterministic policy and obtain the ϵ-optimal sensing-transmission (ST) policy, which specifies the allocation of sensing time and transmission power, using value iteration in the MDP. Moreover, an interesting structural property regarding the optimal transmission policy is obtained. It is proven that the optimal long-term outage probability is non-increasing with the amount of available energy in the battery.
3. For a special case where the signal-to-noise ratio (SNR) is sufficiently high, we propose an efficient transmission (ET) policy with reduced computational complexity. It is theoretically proven that the efficient transmission policy achieves the same performance as the proposed sensing-transmission policy when the SNR is sufficiently high, which has also been validated through computer simulations.
4. We provide extensive simulation results to compare the performance of the sensing-transmission policy and the efficient transmission policy with that of a benchmark policy. It is shown that the proposed sensing-transmission policy achieves significant gains with respect to the benchmark policy, and both the sensing-transmission and the efficient transmission policies converge to the same value in high SNR regions. In addition, the impacts of various system parameters on the performance of the proposed policies are also investigated.
The rest of the paper is organized as follows. The related work is reviewed in Section 2. The network model and the related assumptions are presented in Section 3. We formulate the outage probability minimization problem as an MDP in Section 4. The proposed policies and the related theorems are illustrated in Section 5. The performance and characteristics of the proposed policies are evaluated through numerical results in Section 6. Finally, we conclude this paper in Section 7.

Related Work
In the literature, the topics of energy harvesting and cognitive radio have received increasing attention. Three groups of existing works are most related. First, the CR technique has received significant attention during the past few years [18][19][20][21][22]. In [18], the authors focus on designing a database access strategy that allows the SUs to jointly consider the requirements of the existing rules, as well as the maximization of the expected communication opportunities through on-demand database access. The optimal strategy introduced in [18], which is computationally infeasible with the brute-force approach, can be found by the efficient algorithm proposed in [19]. In [19], by proving that the optimal strategy has a threshold structure, an efficient algorithm that exploits the threshold property is introduced. In [20], the authors investigate the achievable throughput of an unlicensed sensor network operating over the TV white space spectrum. The achievable throughput is analytically derived as a function of the channel ordering. Additionally, the closed-form expression of the maximum expected throughput is given. The work in [21] studies the problem of coexistence interference among multiple secondary networks without secondary cooperation. Under a reasonable assumption, a computationally-efficient algorithm for finding the optimal strategy is presented. The work in [22] develops robust power control strategies for cognitive radios in the case of sensing delay and model parameter uncertainty. A robust power control framework that optimizes the worst-case system performance is proposed. All of the problems considered in the above works are formulated as Markov decision process problems, and the technical contributions are very important and valuable.
However, due to the unique features of the EH CRSNs, such as the dynamic energy replenishment process, which stipulates a new design constraint on energy usage in the time axis, there is a need to revisit resource allocation policies so that the energy expenditure can efficiently adapt to the dynamics of energy arrivals.
Second, the energy harvesting technique has been widely studied in wireless communication systems [23][24][25][26][27][28][29][30]. The works in [23][24][25][26] consider the point-to-point wireless communications. In [23], through optimizing the time sequence of transmit powers, the authors focus on maximizing the throughput by a deadline and minimizing the transmission completion time. For the offline policy, a directional water-filling algorithm is introduced to find the optimal power allocation. For the online policy, dynamic programming is applied to solve the optimal power allocation. In [24], the authors consider the problem of energy allocation over a finite horizon to maximize the throughput. A water-filling energy allocation where the water-level follows a staircase function is introduced. The work in [25] studies the problem of energy allocation for sensing and transmission to maximize the throughput in an energy harvesting wireless sensor network. The problem studied in [25] considers the finite horizon case, which is extended in [26] to an infinite-horizon case. In [26], the authors study the energy allocation for sensing and transmission for an energy harvesting sensor node. An optimal energy allocation algorithm and an optimal transmission energy allocation algorithm are introduced. The works in [27,28] consider the problem of hybrid energy supply. In [27], the authors investigate the minimization of the power consumption stemming from the constant energy source for transmitting a given number of data packets. In [28], for a hybrid energy supply system employing a save-then-transmit protocol, the authors explore the transmission scheduling problem. In [29], the authors study the transmission power allocation strategy to achieve the energy-efficient transmission. The harvest-use technique is adopted, which means that the harvested energy cannot be stored and must be used immediately. 
In [30], for a solar-powered wireless sensor network, the authors present an optimal transmission policy based on a data-driven approach. However, due to the distinctive operation of cognitive radio, such as spectrum sensing, spectrum management, etc., directly applying the strategies mentioned above to EH CRSNs can be ineffective or inefficient.
Third, much recent research has been tightly focused on CR systems powered by energy harvesting. The work in [11] focuses on an energy harvesting cognitive radio network with the save-then-transmit protocol; the authors mainly investigate the joint optimization of saving factor, sensing duration, sensing threshold and fusion rules to maximize the achievable throughput. In [31], for a single-user multi-channel setting, jointly considering probabilistic arrival energy, channel conditions and the probability of PU's occupation, the authors propose a channel selection criterion. In [32], jointly considering the battery replenish process and the secondary belief regarding the primary activities, the authors introduce an energy allocation for sensing and transmission to maximize the long-term throughput. Different from [32], a suboptimal energy allocation algorithm that allocates energy in an online approach is introduced in [33]. In [34], in order to maximize the throughput, the authors derive an optimal sensing strategy through optimizing the access probabilities of idle channels and busy channels. In [35], a joint design of the spectrum sensing and detection threshold to maximize the long-term throughput is studied. Furthermore, the upper bound of the achievable throughput is derived as a function of the energy arrival rate, the statistical behavior of the primary network traffic and the detection threshold in [36]. In [10], the authors propose a spectrum and energy-efficient heterogeneous cognitive radio sensor network (HCRSNs), where EH-enabled spectrum sensors cooperatively detect the status of the licensed channels, while the data sensors transmit data to the sink. 
Compared with these works, the salient feature of this paper is that, according to the current knowledge of the battery state, channel fading, as well as the arrival energy based on EH, we jointly optimize the action of channel sensing and transmission power allocation for an energy harvesting cognitive sensor node, to minimize the long-term outage probability.

Primary Network Model
We consider a primary network where a primary user (PU) owns the usage right of a channel with bandwidth $B$. The PU is assumed to employ synchronous slotted communication with a time slot duration $T$. The primary traffic is modeled as a two-state time-homogeneous random process, in which the channel randomly switches its state between idle and occupied, as assumed in [34]. Let $\theta_t$ represent the status of the channel in time slot $t$, with $\theta_t = 0$ or $1$ indicating that the channel is occupied or idle, respectively. The probability that the channel is occupied by the PU is denoted as $p_o \triangleq \Pr(\theta_t = 0)$. Correspondingly, the idle probability of the channel is defined as $p_i \triangleq \Pr(\theta_t = 1) = 1 - p_o$. It is assumed that $p_o$ and $p_i$ are available to the secondary users based on long-term spectrum measurements [35].

Energy Model and Opportunistic Spectrum Access
We consider a point-to-point communication link between two secondary sensor nodes, which are also referred to as secondary users (SUs). An EH-enabled SU opportunistically accesses the primary channel to convey data to its receiver. The EH SU is powered exclusively by energy harvested from the ambient environment (e.g., solar, wind, thermal, ambient radio power) and stores the energy in a rechargeable battery with finite energy storage capacity. A correlated time process following a first-order discrete-time Markovian model is adopted for modeling the energy arrivals [26,37]. According to the harvest-store-use model, the harvested energy in the current time slot can only be used in the next time slot.
Since the PU has priority in utilizing the channel, the SU has to monitor the channel in real time to avoid collisions with the PU when it opportunistically uses the channel. Thus, for each time slot, the overall transmission process consists of two phases, namely the channel acquisition phase and the transmission phase. The allocation of time durations for the two phases is illustrated in Figure 1, where the channel acquisition phase and the transmission phase consume $\alpha_t$ and $1 - \alpha_t$ fractions of one time slot, respectively. $\alpha_t$ is referred to as the spectrum sensing overhead for time slot $t$ and can be adjusted to optimize the performance of the system. In the channel acquisition phase, the SU senses the status of the spectrum for a duration $\alpha_t T$ using the energy detection technique [38]. As the complexity is roughly linear in the sensing duration, we can assume that the energy consumption $e_s$ for sensing is proportional to $\alpha_t T$ with a constant sensing power $p_s$ [32], namely:

$$e_s = p_s \alpha_t T.$$

If the channel is sensed to be idle, the SU starts data transmission using energy drawn from the battery. Let $P_t$ be the transmit power of the SU; then, the energy consumption for the transmission phase can be expressed as:

$$e_t = P_t (1 - \alpha_t) T.$$

If the channel is sensed to be occupied, the SU stays in the idle state with a constant idle power $p_c$, which is considerably less than the transmit power [39]; therefore, the energy consumption for the data transmission phase is:

$$e_t = p_c (1 - \alpha_t) T.$$
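The per-slot energy accounting described above can be illustrated with a short sketch. It assumes that the sensing energy is the sensing power times the sensing duration and that the remaining fraction of the slot is spent either transmitting or idling; the function name `slot_energy` and its argument names are hypothetical and chosen only for illustration.

```python
def slot_energy(alpha, T, p_s, P_t, p_c, sensed_idle):
    """Energy spent in one slot, split into (sensing, second-phase) parts.

    alpha: sensing overhead in [0, 1]; T: slot duration;
    p_s, P_t, p_c: sensing, transmit and idle power (assumed constant in a slot).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    e_sense = p_s * alpha * T              # channel acquisition phase
    power = P_t if sensed_idle else p_c    # transmit if sensed idle, else stay idle
    e_second = power * (1.0 - alpha) * T   # remaining (1 - alpha) fraction of the slot
    return e_sense, e_second
```

For instance, `slot_energy(0.2, 1.0, 0.5, 2.0, 0.1, True)` returns the sensing and transmission energies for a slot in which 20% of the time is spent sensing.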

Spectrum Sensing and Transmission Data Rate
During the channel acquisition phase mentioned above, the SU acquires the status of the channel by performing a binary hypothesis test to decide between idle $H_0$ (i.e., $\theta_t = 1$) and occupied $H_1$ (i.e., $\theta_t = 0$). Due to the existence of sensing errors, the reliability of spectrum sensing is evaluated by two indicators, namely the false alarm probability $P_f$ and the detection probability $P_d$, which are defined as follows:

$$P_f \triangleq \Pr(\hat{\theta}_t = 0 \mid \theta_t = 1), \qquad P_d \triangleq \Pr(\hat{\theta}_t = 0 \mid \theta_t = 0),$$

where $\hat{\theta}_t$ is the binary decision on the primary channel, with $\hat{\theta}_t = 0$ or $1$ representing that the primary channel is determined to be occupied or idle, respectively. To ensure sufficient protection of the PU, the SU should satisfy a target detection probability $\bar{P}_d$ on the primary channel. For the case of a complex-valued primary signal and circularly symmetric complex Gaussian (CSCG) noise, the probability of a false alarm is given by [40]:

$$P_f(\alpha_t) = Q\left(\sqrt{2\beta + 1}\, Q^{-1}(\bar{P}_d) + \sqrt{\alpha_t T f_s}\, \beta\right),$$

where $\beta$ is the received signal-to-noise ratio (SNR) of the primary signal at the SU, $f_s$ is the sampling frequency and $Q(\cdot)$ is the complementary distribution function of the standard Gaussian.

For time slot $t$, after acquiring the status of the channel, the SU performs channel estimation to obtain the channel condition. Specifically, the SU sends pilot signals to the receiver and acquires the channel power gain, denoted as $\gamma_t$, through an error-free and dedicated feedback channel [31]. Since channel estimation takes very little time and power, we assume, similar to [42,43], that the time and energy it consumes are negligible compared to the sensing time, and hence, we ignore it in our analysis. (For example, if the PUs operate in TV bands where each channel occupies 6 MHz, as in the IEEE 802.22 wireless regional area network (WRAN), the typical sensing time is a few milliseconds, which results in thousands of samples [40]. For channel estimation, however, a few pilot symbols are enough; in IEEE 802.11a, for example, only four pilot symbols are used for channel estimation [41].)
Then, the transmission data rate of the SU is:

$$r(\alpha_t, P_t, \gamma_t) = (1 - \alpha_t) B \log_2\left(1 + \frac{P_t \gamma_t}{N_0}\right),$$

where $N_0$ is the destination noise power. The coefficient $1 - \alpha_t$ is due to the fact that only a $1 - \alpha_t$ fraction of a time slot is utilized for the SU's data transmission phase. If, on the other hand, the channel is sensed to be occupied, the sensor node abstains from transmission and stays idle for the rest of the time slot; thus, the transmission data rate $r(\alpha_t, P_t, \gamma_t)$ is zero. The overall objective of this paper is to design optimal policies that jointly consider the sensing overhead and the transmit power allocation to minimize the long-term outage probability of the EH-enabled cognitive sensor nodes. In the following section, we formulate the problem of outage probability minimization as a Markov decision process in detail.
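The sensing-throughput tradeoff in this section can be explored numerically. The sketch below assumes the widely used energy-detection expression for the false alarm probability under a target detection probability (complex-valued primary signal, CSCG noise) and a Shannon-type rate scaled by the transmission fraction of the slot; both closed forms, and all function and parameter names, are assumptions made here for illustration rather than expressions confirmed by the text.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def q_func(x):
    """Tail probability of the standard Gaussian, Q(x) = Pr(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def q_inv(p):
    """Inverse of Q for 0 < p < 1."""
    return _STD_NORMAL.inv_cdf(1.0 - p)

def false_alarm_prob(alpha, T, f_s, beta, pd_target):
    """Assumed form: Q(sqrt(2*beta + 1) * Qinv(pd_target) + sqrt(alpha*T*f_s) * beta)."""
    n_samples = alpha * T * f_s  # number of sensing samples in the slot
    arg = math.sqrt(2.0 * beta + 1.0) * q_inv(pd_target) + math.sqrt(n_samples) * beta
    return q_func(arg)

def data_rate(alpha, P_t, gamma, B, N0):
    """Assumed Shannon-type rate, scaled by the transmission fraction 1 - alpha."""
    return (1.0 - alpha) * B * math.log2(1.0 + P_t * gamma / N0)
```

Under these assumptions, a longer sensing time lowers the false alarm probability but shrinks the fraction of the slot left for transmission, which is the tradeoff the sensing overhead controls.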

Problem Formulation
In this section, we formulate the problem of long-term outage probability minimization as an MDP. The MDP model is mainly composed of decision epochs, states, actions, state transition probabilities and rewards. The decision epoch is the time slot $t \in \mathcal{T} = \{0, 1, 2, \cdots\}$. The state of the system is denoted as $s = (b, g, h)$, where $b$ indicates the battery energy state, $g$ indicates the channel state and $h$ indicates the state of the arrival energy based on EH. We assume that $b$, $g$ and $h$ take values from the discrete finite sets $\mathcal{B} = \{0, 1, 2, \cdots, N_B - 1\}$, $\mathcal{G} = \{0, 1, 2, \cdots, N_G - 1\}$ and $\mathcal{H} = \{0, 1, 2, \cdots, N_H - 1\}$, respectively. Thus, the state space can be expressed as $\mathcal{S} = \mathcal{B} \times \mathcal{G} \times \mathcal{H}$, where $\times$ denotes the Cartesian product. We assume the battery is quantized in units of $e_u$, which is referred to as one energy quantum. Battery energy state 0 corresponds to the energy $B_0 \triangleq \lceil p_c T / e_u \rceil e_u$, which is the energy consumption when the SU stays in the idle state for the entire time slot, and for battery state $b \in \mathcal{B} \setminus \{0\}$, the total energy in the battery is $B_0 + b e_u$. As for the arrival energy, if the arrival energy state is $h \in \mathcal{H}$, then the actual arrival energy is $Q_h e_u$, where $Q_h \in \mathbb{N}$. It should be noted that, as the channel state and the arrival energy state can only be acquired causally, at the beginning of time slot $t$ the SU only knows the exact channel state and arrival energy state of the previous time slot. Therefore, the system state for time slot $t$ can be represented as $s_t = (b_t, g_{t-1}, h_{t-1})$, where $b_t \in \mathcal{B}$ is the energy state for the current time slot, whereas $g_{t-1} \in \mathcal{G}$ and $h_{t-1} \in \mathcal{H}$ are the states of the channel and the arrival energy in the previous time slot.
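The discrete state space described above is small enough to enumerate directly. The following sketch builds the Cartesian product of the battery, channel and arrival energy states and maps a battery state to the stored energy; it assumes that the reserved energy for state 0 is the idle consumption of one slot rounded up to a whole number of quanta, and the function names are hypothetical.

```python
import math
from itertools import product

def build_state_space(n_b, n_g, n_h):
    """Enumerate S = B x G x H as (b, g, h) tuples."""
    return list(product(range(n_b), range(n_g), range(n_h)))

def stored_energy(b, e_u, p_c, T):
    """Energy in the battery at state b, i.e. B0 + b * e_u.

    Assumption: B0 = ceil(p_c * T / e_u) * e_u, the idle consumption of a full
    slot rounded up to an integer number of energy quanta.
    """
    b0 = math.ceil(p_c * T / e_u) * e_u
    return b0 + b * e_u
```

A look-up table indexed by these tuples is one simple way to store a pre-computed policy.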
The evolution of the arrival energy $h_t$ is assumed to follow the first-order discrete-time Markovian model introduced in Section 3.2; hence, in the following, we first introduce the update process of the battery energy state $b_t$ along with the SU's channel capacity. Then, the evolution of the channel state $g_t$ is presented.
First, as to the battery energy state update process, a combination of sensing overhead $\alpha_t$ and transmit power $P_t$ leads to one of the following four possible consequences:
1. Idle detection with probability $p_i (1 - P_f(\alpha_t))$: the primary channel is idle and the sensing result is correct. Then, the channel capacity $R$ is gained, and the battery energy state is updated according to (9).
2. False alarm with probability $p_i P_f(\alpha_t)$: the primary channel is idle but the sensing result is wrong. The SU abstains from transmission, and the channel capacity $R$ is zero. The battery energy state is updated according to (10).
3. Occupied detection with probability $p_o \bar{P}_d$: the primary channel is occupied and the sensing result is correct. The SU abstains from transmission, and the channel capacity $R$ is zero; the battery energy state is updated as in (10).
4. Misdetection with probability $p_o (1 - \bar{P}_d)$: the primary channel is occupied but the sensing result is wrong. The channel capacity $R$ is zero due to the collision with the PU, and the battery energy state is updated as in (9).
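The four consequences listed above partition the sample space, so their probabilities must sum to one. A minimal sketch that computes them from the idle probability, the false alarm probability and the target detection probability is given below; the function name and dictionary keys are hypothetical.

```python
def sensing_outcomes(p_i, p_f, p_d_target):
    """Probabilities of the four sensing outcomes in one slot.

    p_i: probability that the primary channel is idle;
    p_f: false alarm probability for the chosen sensing overhead;
    p_d_target: target detection probability.
    """
    p_o = 1.0 - p_i
    return {
        "idle_detection": p_i * (1.0 - p_f),          # idle, sensed idle
        "false_alarm": p_i * p_f,                     # idle, sensed occupied
        "occupied_detection": p_o * p_d_target,       # occupied, sensed occupied
        "misdetection": p_o * (1.0 - p_d_target),     # occupied, sensed idle
    }
```

Only the first outcome yields a nonzero capacity, which is why the false alarm probability enters the outage probability directly.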
Second, we formulate the evolution of channel states. The channel fading process can be modeled as a time-homogeneous finite-state Markov chain (FSMC), which has been widely used to model the block fading channel [44][45][46][47]. Specifically, the channel power gain is quantized using a finite number of thresholds $G_0 < G_1 < \cdots < G_{N_G}$, and the channel is in state $g \in \mathcal{G}$ if $G_g \le \gamma_t < G_{g+1}$. We consider that the wireless channel fluctuates slowly over time slots and remains constant within a time slot, as assumed in [48,49]. Hence, the channel state transition occurs only from the current state to its neighboring states at the beginning of each time slot [30]. Considering the Rayleigh fading channel, the channel state transition probability is determined by [50]:

$$P(g_t = i+1 \mid g_{t-1} = i) = \frac{h(G_{i+1})}{P(g = i)}, \qquad P(g_t = i-1 \mid g_{t-1} = i) = \frac{h(G_i)}{P(g = i)},$$
$$P(g_t = i \mid g_{t-1} = i) = 1 - P(g_t = i+1 \mid g_{t-1} = i) - P(g_t = i-1 \mid g_{t-1} = i),$$

where $P(g = i) = \exp(-G_i / G_a) - \exp(-G_{i+1} / G_a)$ is the stationary probability that the channel state is $i$, $G_a$ is the average channel power gain and $h(\beta) = \sqrt{2\pi\beta / G_a}\, f_D \exp(-\beta / G_a)$ is the level crossing rate, where $f_D$ is the maximum Doppler frequency, normalized by $1/T$. The boundary transition probabilities for channel states are:

$$P(g_t = 0 \mid g_{t-1} = 0) = 1 - P(g_t = 1 \mid g_{t-1} = 0),$$
$$P(g_t = N_G - 1 \mid g_{t-1} = N_G - 1) = 1 - P(g_t = N_G - 2 \mid g_{t-1} = N_G - 1).$$

According to the current system state $s_t = (b_t, g_{t-1}, h_{t-1})$, we introduce the action set of the SU. The sensing overhead $\alpha_t$ is quantized in units of $\alpha_u = e_u / (p_s T)$, and the action set of sensing overhead, $\mathcal{A}_\alpha^{s_t}$, contains only the action 0 when $b_t = 0$ and is otherwise bounded above by $\min\{\lfloor p_s T / e_u \rfloor, b_t\}$, where $\lfloor \cdot \rfloor$ is the floor function. $b_t = 0$ indicates that the energy level in the battery is so low (the energy stored in the battery is $B_0$) that the available energy is merely enough to cover the energy expenditure when the SU stays in the idle state for the entire time slot. In this case, the SU stops sensing and transmitting and keeps on harvesting energy. Regarding the bound $\min\{\lfloor p_s T / e_u \rfloor, b_t\}$, the first term indicates that the sensing duration should not exceed the time slot $T$; the second term indicates that the energy consumption for sensing should not exceed the available energy $b_t e_u$.
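The FSMC channel model in this section can be turned into a transition matrix with a few lines of code. The sketch below assumes the standard Rayleigh-fading construction in which the gain is exponentially distributed, the upward and downward transition probabilities equal the normalized level crossing rate at the shared threshold divided by the stationary probability of the current state, and the self-transition takes the remaining mass. The function name and the convention that the last threshold may be infinite are choices made here for illustration.

```python
import math

def fsmc_transition_matrix(thresholds, g_avg, f_d_norm):
    """Build (stationary, P) for a Rayleigh FSMC with neighbor-only transitions.

    thresholds: increasing list G_0 < ... < G_N (the last may be math.inf);
    g_avg: average channel power gain; f_d_norm: Doppler frequency times slot length.
    """
    n = len(thresholds) - 1

    def tail(x):  # Pr(gain >= x) for an exponentially distributed gain
        return math.exp(-x / g_avg)

    def lcr(level):  # level crossing rate per slot at a finite threshold
        return math.sqrt(2.0 * math.pi * level / g_avg) * f_d_norm * tail(level)

    stationary = [tail(thresholds[i]) - tail(thresholds[i + 1]) for i in range(n)]
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        up = lcr(thresholds[i + 1]) / stationary[i] if i < n - 1 else 0.0
        down = lcr(thresholds[i]) / stationary[i] if i > 0 else 0.0
        if i < n - 1:
            P[i][i + 1] = up
        if i > 0:
            P[i][i - 1] = down
        P[i][i] = 1.0 - up - down  # remaining mass stays in the current state
    return stationary, P
```

The construction is only meaningful for slow fading, where `f_d_norm` is small enough that every self-transition probability stays nonnegative.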
When an action $a_\alpha \in \mathcal{A}_\alpha^{s_t}$ is taken, the sensing overhead is $a_\alpha \alpha_u$, the sensing time is $a_\alpha \alpha_u T = a_\alpha e_u / p_s$ and the energy consumption for sensing is $e_s(a_\alpha \alpha_u) = a_\alpha \alpha_u T p_s = a_\alpha e_u$. Given the action of sensing overhead, the transmit power is quantized in units of $P_u = e_u / (T - a_\alpha \alpha_u T)$, and the corresponding action set of transmit power is denoted as $\mathcal{A}_p^{(s_t, a_\alpha)}$. For an action $a_p \in \mathcal{A}_p^{(s_t, a_\alpha)}$, the SU consumes $a_p e_u$ energy for data transmission. Therefore, given a system state $s_t = (b_t, g_{t-1}, h_{t-1})$, the action set can be represented as $\mathcal{A}^{s_t} = \{(a_\alpha, a_p) \mid a_\alpha \in \mathcal{A}_\alpha^{s_t}, a_p \in \mathcal{A}_p^{(s_t, a_\alpha)}\}$.

We use $P(s_{t+1} \mid s_t, a)$ to denote the system state transition probability, i.e., the probability that the system enters state $s_{t+1} = (b_{t+1} = b', g_t = g', h_t = h')$ given that the current system state is $s_t = (b_t = b, g_{t-1} = g, h_{t-1} = h)$ and the SU takes an action $a = (a_\alpha, a_p) \in \mathcal{A}^{s_t}$. The state transition probability is derived from the transition probabilities of the channel state and the arrival energy state together with the battery update, in which the battery term reduces to an indicator, since $b'$ is a certain value determined by $b$, $h$ and the actions $a_\alpha$, $a_p$. $I_x$ denotes the indicator function, which takes the value of one if $x$ is true and zero otherwise.
The reward function is defined as the outage probability regarding the system state $s_t = (b_t, g_{t-1}, h_{t-1})$ and the corresponding action $a = (a_\alpha, a_p)$, as given by (19) [51].

In the following section, we first study the existence of the optimal transmission policy. Then, the ϵ-optimal sensing-transmission policy, which specifies the actions concerning the sensing overhead and the transmit power to minimize the long-term outage probability, is introduced. Last, for a special case where the signal-to-noise ratio is sufficiently high, we introduce an efficient transmission policy, which achieves the same performance as the ϵ-optimal sensing-transmission policy.
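A building block of the outage-probability reward is the probability that the channel gain falls below a threshold, conditioned on the quantized channel state. The sketch below assumes Rayleigh fading, so that the power gain is exponentially distributed with mean `g_avg`, and takes the gain threshold as an input rather than deriving it from a rate requirement; the function and parameter names are hypothetical.

```python
import math

def conditional_outage(gamma_th, g_low, g_high, g_avg):
    """Pr(gain < gamma_th | g_low <= gain < g_high) for an exponential gain.

    gamma_th: gain threshold below which the link is in outage;
    g_low, g_high: quantization thresholds of the channel state (g_high may be inf);
    g_avg: average channel power gain.
    """
    if gamma_th <= g_low:
        return 0.0  # every gain in the state clears the threshold
    if gamma_th >= g_high:
        return 1.0  # every gain in the state is below the threshold

    def tail(x):  # Pr(gain >= x)
        return math.exp(-x / g_avg)

    return (tail(g_low) - tail(gamma_th)) / (tail(g_low) - tail(g_high))
```

Because a larger transmit power lowers the required gain threshold, this quantity can only stay the same or decrease as the transmit power action grows.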

Proposed Transmission Policies
In this section, we focus on deriving policies that specify the actions regarding the sensing overhead and transmit power, with the goal of minimizing the long-term outage probability. First, we introduce the concept of the stationary deterministic policy. Second, we prove the convergence and the existence of the stationary deterministic policy. Then, based on the Bellman equation, we propose an ϵ-optimal stationary deterministic policy, named the sensing-transmission policy, through the value iteration approach. Last, for the special case where the signal-to-noise ratio (SNR) is sufficiently high, we introduce an efficient transmission policy.
Denote $\pi(s) = \{d_0(s_0), d_1(s_1), d_2(s_2), \cdots\}$ as the decision policy that specifies the decision rules to be used at each time slot, where $d_t$ is the decision rule that prescribes a procedure for action selection in time slot $t$. A policy is stationary deterministic if $d_t$ is deterministic Markovian and $d_t = d$ for all $t \in \mathcal{T}$ [26]; therefore, the stationary deterministic policy can be represented as $\pi(s) = \{d(s_0), d(s_1), d(s_2), \cdots\}$. For an infinite-horizon MDP, our primary focus is on stationary deterministic policies because their decision rules do not change over time, and they are the easiest to implement and evaluate [52]. We denote the feasible set of stationary deterministic policies as $\Pi^{SD}$. Given the initial state $s_0 = (b_0, g_{-1}, h_{-1})$ and the policy $\pi \in \Pi^{SD}$, the expected discounted infinite-horizon reward that represents the long-term outage probability is defined to be [52]:

$$V^{\pi}(s_0) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \lambda^t R(s_t, a)\right], \qquad (20)$$

where $V^{\pi}(s_0)$ is the long-term expected reward with respect to the initial state $s_0$, $0 \le \lambda < 1$ is the discount factor, $R(s_t, a)$ is the reward function defined by (19) and $a$ is the action determined by the policy $\pi$. Varying $\lambda$ yields a wide range of performance characteristics, so it can be chosen according to the actual needs. The objective of the SU is to find the optimal stationary deterministic policy $\pi^*$ that minimizes the long-term expected reward defined in (20), that is:

$$\pi^* = \arg\min_{\pi \in \Pi^{SD}} V^{\pi}(s_0). \qquad (21)$$

First, we prove Lemma 1: the long-term expected reward $V^{\pi}(s_0)$, where $\pi \in \Pi^{SD}$, is finite.
Proof of Lemma 1. To prove that $|V^{\pi}(s_0)|$ is bounded, according to [52], we only need to prove that $\sup_{a \in \mathcal{A}^s, s \in \mathcal{S}} |R(s, a)| < \infty$. As $\Pr(\gamma_t < \gamma_{th} \mid G_{g_t} \le \gamma_t < G_{g_t + 1}) \le 1$, $P(g_t \mid g_{t-1}) \le 1$ and $\mathcal{G}$ is discrete and finite, we can deduce that $\sum_{g_t \in \mathcal{G}} P(g_t \mid g_{t-1}) \Pr(\gamma_t < \gamma_{th} \mid G_{g_t} \le \gamma_t < G_{g_t + 1})$ is finite. Since $p_i \le 1$ and $P_f \le 1$, it follows that $|R(s, a)|$ is bounded. Thus, we can conclude that $\sup_{a \in \mathcal{A}^s, s \in \mathcal{S}} |R(s, a)| < \infty$, and therefore, $V^{\pi}(s_0)$ is finite.
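The step from a bounded reward to a finite discounted sum can be spelled out with a geometric series. The following is a sketch, assuming only that $0 \le \lambda < 1$ and writing $R_{\max}$ (a symbol introduced here for illustration) for the supremum of $|R(s, a)|$ over all states and actions:

```latex
\left| V^{\pi}(s_0) \right|
  \le \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \lambda^{t} \left| R(s_t, a) \right| \right]
  \le R_{\max} \sum_{t=0}^{\infty} \lambda^{t}
  = \frac{R_{\max}}{1 - \lambda}
```

Since the reward is an outage probability, $R_{\max} \le 1$, so the long-term expected reward is at most $1/(1 - \lambda)$ for every initial state.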
Lemma 1 indicates that for any initial system state, the value of $V^{\pi}(s_0)$ converges to a certain value. Next, we explain the existence of the optimal stationary deterministic policy $\pi^*$.

Theorem 1. There exists an optimal stationary deterministic policy $\pi^*$ that minimizes the long-term expected reward in Equation (20).
Proof of Theorem 1. Since the system state space $\mathcal{S} = \mathcal{B} \times \mathcal{G} \times \mathcal{H}$ is discrete and finite and, for an arbitrary $s \in \mathcal{S}$, the corresponding action space $\mathcal{A}^s$ is also discrete and finite, there exists an optimal stationary deterministic policy [52].

Given an arbitrary system state $s$, the optimal long-term expected reward $V^{\pi^*}(s)$ should satisfy the following Bellman optimality equation:

$$V^{\pi^*}(s) = \min_{a \in \mathcal{A}^s} \left\{ R(s, a) + \lambda \sum_{s' \in \mathcal{S}} P(s' \mid s, a) V^{\pi^*}(s') \right\}. \qquad (22)$$

The first term on the right-hand side of Equation (22) is the immediate reward for the current time slot, and the second term is the expected total discounted future reward if the SU chooses action $a$. The well-known value iteration approach is then applied to find the ϵ-optimal stationary deterministic policy, as shown in Algorithm 1, whose final step obtains the ϵ-optimal transmission policy $\pi_{\epsilon}^* = \{d, d, \cdots\}$.

In Algorithm 1, the SU iteratively finds the optimal policy. Specifically, in Step 1, $V_0(s)$ is initialized to zero for all $s \in \mathcal{S}$, the error bound ϵ is specified and the iteration index $i$ is set to zero. In Step 2, we compute $V_{i+1}(s)$ for each $s \in \mathcal{S}$ from $V_i(s)$. Then, in Step 3, the SU checks whether the stopping inequality on the difference between $V_{i+1}$ and $V_i$ holds. If the inequality holds, which means that the value iteration algorithm has converged, we proceed to Step 4 to obtain the decision rule and then formulate the sensing-transmission policy. Otherwise, we go back to Step 2 and continue the iteration. According to Algorithm 1, the SU can pre-compute the policy and record it in a look-up table. Then, based on the specific system state, the SU can check the look-up table to find the corresponding action.
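The value iteration procedure described above can be sketched generically for any finite MDP with a cost to be minimized. This is a minimal illustration rather than the paper's Algorithm 1: the stopping threshold ϵ(1 − λ)/(2λ) is the common textbook choice for obtaining an ϵ-optimal policy and is an assumption here, as are the function signature and the callback names.

```python
def value_iteration(states, actions, reward, transition, lam, eps):
    """Return (values, policy) for a finite discounted-cost MDP.

    states: list of hashable states; actions(s): feasible actions in s;
    reward(s, a): immediate cost; transition(s, a): dict next_state -> probability;
    lam: discount factor in (0, 1); eps: target error bound.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("this sketch needs 0 < lam < 1")
    threshold = eps * (1.0 - lam) / (2.0 * lam)  # assumed textbook stopping rule

    def q_value(s, a, values):
        future = sum(p * values[s2] for s2, p in transition(s, a).items())
        return reward(s, a) + lam * future

    values = {s: 0.0 for s in states}            # Step 1: initialize to zero
    while True:
        new_values = {s: min(q_value(s, a, values) for a in actions(s))
                      for s in states}           # Step 2: Bellman update
        gap = max(abs(new_values[s] - values[s]) for s in states)
        values = new_values
        if gap < threshold:                      # Step 3: convergence check
            break
    # Final steps: greedy decision rule with respect to the converged values.
    policy = {s: min(actions(s), key=lambda a: q_value(s, a, values))
              for s in states}
    return values, policy
```

The returned `policy` dictionary plays the role of the look-up table: it is computed once offline and then consulted by state at run time.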
As to the convergence, $V_i(s)$ computed in Step 2 converges to $V^{\pi^*}(s)$ for all $s \in \mathcal{S}$. Once the inequality condition in Step 3 is satisfied, the obtained policy ensures that $\| V^{\pi_{\epsilon}^*} - V^{\pi^*} \| < \epsilon$, where $V^{\pi_{\epsilon}^*} = \{V^{\pi_{\epsilon}^*}(s), \forall s \in \mathcal{S}\}$ is the long-term expected reward achieved by the ϵ-optimal policy obtained in Step 5 of Algorithm 1. In practice, the SU can predefine the value of ϵ according to the actual needs to control the accuracy of convergence. Choosing ϵ small enough ensures that the algorithm stops with a policy that is very close to optimal.

Next, we discuss the complexity of Algorithm 1. The complexity of each iteration in the value iteration algorithm is $O(N_{state} N'_{state} N_{action})$ [53], where $N_{state}$ represents the total number of states in the state space, $N'_{state}$ indicates the total number of states that the system can possibly transition to and $N_{action}$ represents the total number of actions in the action space. For our MDP problem, the total number of states in the state space $\mathcal{S}$ is $N_B \cdot N_G \cdot N_H$. As the battery state of the next time slot is deterministic and the channel can only move to a neighboring state or remain in its current state, the total number of states that the current system state can transition to is $3 N_H$. The maximum number of actions regarding the sensing overhead and the transmit power is $(N_B + 1) N_B / 2$. Multiplying these three factors, the complexity of each iteration in Algorithm 1 is $O(N_B^3 N_H^2 N_G)$.

Next, we study the structural property of the proposed sensing-transmission policy. Regarding the reward function, we have the following lemma:

Lemma 2. Given a system state $s$, for an arbitrary fixed action $a_\alpha$, the immediate reward $R(s, a_\alpha, a_p)$ is non-increasing with $a_p$, namely $R(s, a_\alpha, a_p + 1) \le R(s, a_\alpha, a_p)$, where $a_\alpha \in \mathcal{A}_\alpha^{s}$ and $a_p, a_p + 1 \in \mathcal{A}_p^{(s, a_\alpha)}$.
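Proof of Lemma 2. First, we prove that for a fixed action $a_\alpha$, $\Pr(\gamma_t < \gamma_{th}(a_p) \mid G_{g'} \le \gamma_t < G_{g'+1})$ defined in Equation (19) is non-increasing with the transmit action $a_p$, namely $\Pr(\gamma_t < \gamma_{th}(a_p + 1) \mid G_{g'} \le \gamma_t < G_{g'+1}) \le \Pr(\gamma_t < \gamma_{th}(a_p) \mid G_{g'} \le \gamma_t < G_{g'+1})$. Because a larger transmit power lowers the gain threshold, $\gamma_{th}(a_p + 1) \le \gamma_{th}(a_p)$. If both thresholds lie below $G_{g'}$, or both lie at or above $G_{g'+1}$, the two conditional probabilities are equal. Otherwise, it can be derived that $\Pr(\gamma_t < \gamma_{th}(a_p + 1) \mid G_{g'} \le \gamma_t < G_{g'+1}) < \Pr(\gamma_t < \gamma_{th}(a_p) \mid G_{g'} \le \gamma_t < G_{g'+1})$. Therefore, we can conclude that $\Pr(\gamma_t < \gamma_{th}(a_p) \mid G_{g'} \le \gamma_t < G_{g'+1})$ is non-increasing with the transmit action $a_p$. Next, we consider the difference between $R(s, a_\alpha, a_p)$ and $R(s, a_\alpha, a_p + 1)$. Since $\Pr(\gamma_t < \gamma_{th}(a_p) \mid G_{g'} \le \gamma_t < G_{g'+1})$ is non-increasing with $a_p$, we can derive that $R(s, a_\alpha, a_p) - R(s, a_\alpha, a_p + 1) \ge 0$, that is, $R(s, a_\alpha, a_p + 1) \le R(s, a_\alpha, a_p)$.

... and the action set for $s$ can be expressed as $\mathcal{A}^{s} = \{(a_\alpha, a_p) \mid a_\alpha \in \mathcal{A}_\alpha^{s}, a_p \in \mathcal{A}_p^{(s, a_\alpha)}\}$.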
, b}, thus $A^{s^+}_{\alpha} \supseteq A^{s}_{\alpha}$; therefore, with $s^+ = (b + 1, g, h)$ and $s = (b, g, h)$, we have $\min_{a \in A_{s^+}} R(s^+, a) \le \min_{a \in A_s} R(s, a)$.

Based on Lemma 3, we have the following lemma:

Lemma 4. For any given channel state $g \in G$ and arrival energy state $h \in H$, we have that $V_i(b + 1, g, h) \le V_i(b, g, h)$ for all $b \in B \setminus \{N_B - 1\}$.

Proof of Lemma 4. We prove this lemma by induction. When $i = 1$, since the initial condition is $V_0(s) = 0$ for all $s \in S$, we have $V_1(s) = \min_{a \in A_s} R(s, a)$. According to Lemma 3, we have $V_1(b + 1, g, h) \le V_1(b, g, h)$. Assume that when $i = k$, for any given $g \in G$, $h \in H$ and $\forall b \in B \setminus \{N_B - 1\}$, $V_k(b + 1, g, h) \le V_k(b, g, h)$ holds. When $i = k + 1$, we use $s^+$ to denote the system state $(b + 1, g, h)$ and $s$ to denote the system state $(b, g, h)$. The action sets for $s^+$ and $s$ are $A_{s^+} = \{(a^+_\alpha, a^+_p) \mid a^+_\alpha \in A^{s^+}_{\alpha}, a^+_p \in A^{(s^+,a^+_\alpha)}_{p}\}$ and $A_s = \{(a_\alpha, a_p) \mid a_\alpha \in A^{s}_{\alpha}, a_p \in A^{(s,a_\alpha)}_{p}\}$, respectively. When $a^+_\alpha = a_\alpha = w$, for arbitrary $a^+_p = a_p = m$, we have $R(b + 1, g, h, w, m) = R(b, g, h, w, m)$.
for any $g \in G$, $h \in H$, we have that:

Theorem 3. For any given channel state $g \in G$ and arrival energy state $h \in H$, the optimal long-term expected reward achieved by the optimal policy $\pi^*$ is non-increasing in battery state $b$, that is, $V^{\pi^*}(b + 1, g, h) \le V^{\pi^*}(b, g, h)$.

Proof of Theorem 3. According to Theorem 2, the $\epsilon$-optimal policy is non-increasing in battery state $b$; therefore, the optimal long-term expected reward is non-increasing in $b$ for any given $g$ and $h$.
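The $\epsilon$-stopping rule and the monotonicity in battery state can be illustrated on a toy problem. The following Python sketch is not Algorithm 1 itself. It assumes a hypothetical battery-only model (no channel, sensing or arrival-energy state) in which one energy quantum arrives with probability `q` per slot, and spending `p` quanta gives an assumed outage probability of 1 for `p = 0` and `0.5**p` otherwise. The stopping threshold $\epsilon(1-\lambda)/(2\lambda)$ is the standard textbook rule for discounted value iteration and is assumed here in place of the Step 3 condition.

```python
import numpy as np


def value_iteration(n_b=8, q=0.4, lam=0.9, eps=1e-2):
    """Cost-minimizing value iteration on a toy battery-only MDP.

    All model details (arrival probability q, cost 0.5**p) are
    illustrative assumptions, not the system model of the paper.
    """

    def cost(p):
        # Assumed immediate outage probability when spending p quanta.
        return 1.0 if p == 0 else 0.5 ** p

    v = np.zeros(n_b)                        # V_0(s) = 0 for all states
    threshold = eps * (1 - lam) / (2 * lam)  # textbook stopping rule
    while True:
        v_new = np.empty(n_b)
        policy = np.zeros(n_b, dtype=int)
        for b in range(n_b):
            best = np.inf
            for p in range(b + 1):           # cannot spend more than b
                b_no = b - p                 # next battery, no arrival
                b_yes = min(b - p + 1, n_b - 1)  # next battery, one arrival
                total = cost(p) + lam * ((1 - q) * v[b_no] + q * v[b_yes])
                if total < best:
                    best, policy[b] = total, p
            v_new[b] = best
        if np.max(np.abs(v_new - v)) < threshold:
            return v_new, policy
        v = v_new


values, policy = value_iteration()
print(np.round(values, 3))
print(policy)
```

For this toy model, the returned values are non-increasing in the battery level, which mirrors the statement of Theorem 3; a smaller `eps` tightens the stopping threshold at the cost of more iterations.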
In the following, we consider a special case where the signal-to-noise ratio (SNR) is sufficiently high. When the SNR is sufficiently high, namely $N_0 \to 0$, the reward function for the system state $s = (b, g, h)$ and the corresponding action $a = (a_\alpha, a_p)$ degenerates to:

For the $i$-th iteration, denote the long-term expected reward function with respect to action $a = (a_\alpha, a_p)$ as $V_i^{(a_\alpha, a_p)}$. Then, we have the following theorem.

Theorem 4.
When the SNR is sufficiently high, for any iteration $i$, the expected reward with action $a = (a_\alpha, 1)$ is no greater than the expected reward with action $a = (a_\alpha, a_p)$, where $a_p \ge 1$. That is, $V_i^{(a_\alpha, 1)}(s) \le V_i^{(a_\alpha, a_p)}(s)$.

Proof of Theorem 4. The value difference of the two long-term expected rewards with actions $a = (a_\alpha, 1)$ and $a = (a_\alpha, a_p)$ can be calculated as: where

Based on Theorem 4, we can deduce the following theorem:

Theorem 5. When the SNR is sufficiently high, for any iteration $i$ with a fixed sensing overhead action $a_\alpha$, the action set of transmit power to minimize the expected reward is $A^{(s,a_\alpha)}_{p\,new}$.

Proof of Theorem 5. When $b \le 1$, the available transmit power set is $\{0\}$, so $A^{(s,a_\alpha)}_{p\,new} = \{0\}$. When $b \ge 3$, we have two cases: where $a_p \ge 1$; therefore, the transmit power set to minimize the long-term value $V(s)$ where $a_p \ge 1$; therefore, the action set to minimize the long-term value Thus, we can derive that the action set to minimize the long-term reward is $A^{(s,a_\alpha)}_{p\,new}$.

Based on Theorem 5, we present an efficient transmission policy with reduced computational complexity, which is suitable for the case where the SNR is sufficiently high, as shown in Algorithm 2.

Algorithm 2 Efficient transmission (ET) policy.
1: Set $V_0(s) = 0$ for all $s \in S$, set $i = 0$, specify $\epsilon > 0$.
2: For each $s = (b, g, h) \in S$, formulate the new action space:

According to the property $V_i^{(a_\alpha, 1)}(s) \le V_i^{(a_\alpha, a_p)}(s)$ illustrated in Theorem 4, we can ignore the actions with $a_p > 1$ and formulate the new action space $A_s^{new}$ with fewer candidate actions, which reduces the computational complexity significantly. The total number of states in the state space is $N_B \cdot N_G \cdot N_H$. Similar to the analysis of Algorithm 1, the total number of states that the current system state can transition to is $3N_H$. The maximum number of actions regarding the sensing overhead is $N_B$, and the maximum number of actions regarding the transmit power is two. Therefore, the complexity of each iteration in Algorithm 2 is $O(N_B^2 N_H^2 N_G)$.
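The two per-iteration bounds can be compared numerically. The short Python sketch below simply multiplies the three factors used in the complexity analysis (number of states, number of reachable next states, number of actions). The parameter values are assumptions for illustration: $N_B = 20$, $N_H = 4$ for a four-state arrival-energy chain, and $N_G = 6$ on the assumption that each listed channel quantization level is one channel state. The results are operation-count bounds, not measured run times.

```python
def per_iteration_cost(n_b: int, n_g: int, n_h: int, n_actions: int) -> int:
    """Upper bound on backups per value-iteration sweep."""
    n_states = n_b * n_g * n_h   # size of the state space
    n_next = 3 * n_h             # reachable next states per state
    return n_states * n_next * n_actions


def actions_full(n_b: int) -> int:
    """Maximum number of (sensing, power) actions in Algorithm 1."""
    return (n_b + 1) * n_b // 2


def actions_reduced(n_b: int) -> int:
    """Maximum number of actions in Algorithm 2: N_B sensing x 2 power."""
    return 2 * n_b


# Assumed parameter values (see the lead-in above).
N_B, N_G, N_H = 20, 6, 4

cost_alg1 = per_iteration_cost(N_B, N_G, N_H, actions_full(N_B))
cost_alg2 = per_iteration_cost(N_B, N_G, N_H, actions_reduced(N_B))

print(cost_alg1)              # 1209600
print(cost_alg2)              # 230400
print(cost_alg1 / cost_alg2)  # 5.25
```

Under these assumed values, the reduced action space cuts the per-iteration bound by a factor of about five, and the gap grows linearly with $N_B$, consistent with $O(N_B^3 N_H^2 N_G)$ versus $O(N_B^2 N_H^2 N_G)$.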

Numerical Results and Discussion
In this section, we evaluate the performance and characteristics of the proposed policies by extensive simulations in MATLAB R2012a. Unless otherwise stated, the system parameters employed in the simulation are summarized in Table 1, which draws mainly from [26,30,31,42]. The unit of the energy quantum is $e_u = 0.5$ mJ, and $N_B = 20$. The quantization levels of the channel power are $G = \{0, 0.3, 0.6, 1.0, 2.0, 3.0\}$. The arrival energy takes values from the finite set $\{0, 4e_u, 6e_u, 8e_u\}$ per time slot, namely $Q_0 = 0$, $Q_1 = 4$, $Q_2 = 6$, $Q_3 = 8$, and evolves according to a four-state Markov chain with the state transition probability given by:

A normalized SNR $\gamma_c$ (i.e., $\gamma_c = 1/N_0$) is defined with respect to the transmit power of 1 mW throughout the simulation. We choose $\epsilon$ to be $10^{-2}$. The initial energy state is $b_0 = 6$; the initial channel state is $g_{-1} = 1$; and the initial arrival energy state is $h_{-1} = 1$. The total simulation duration is 500 time slots. All of the numerical results are averaged over 500 independent runs.

We compare the proposed sensing-transmission (ST) and efficient transmission (ET) policies with a benchmark named the shortsighted policy [32,54] in Figures 3-6. The shortsighted policy minimizes the immediate reward of the current time slot without considering the impact of the current action on the future reward, i.e., $\lambda = 0$. In contrast, the policies proposed in this paper take into account not only the current immediate reward, but also the future expected reward. Therefore, by comparing with the shortsighted policy, we can evaluate the benefit of the proposed policies.

Figure 3 depicts the outage probability of the ST, ET and shortsighted policies under different normalized SNRs and channel idle probabilities. First, it can be seen that the ST policy outperforms the shortsighted policy for all settings of normalized SNR.
This can be explained by the fact that the ST policy considers a tradeoff between the current immediate reward and the future achievable reward, while the shortsighted policy only focuses on minimizing the current immediate reward, ignoring the impact of the current action on the future reward. It should be noted that despite its better performance, the ST policy is much more computationally intensive than the shortsighted policy. Second, when $\gamma_c$ is sufficiently high, the curves of the ST and ET policies almost overlap, and a saturation effect is observed, namely the outage probability gradually converges to the same value. This phenomenon coincides with Theorem 5: when $\gamma_c$ is sufficiently high, the transmit action set that the SU needs to consider is $A^{(s,a_\alpha)}_{p\,new}$, and the ET policy is equivalent to the ST policy in high $\gamma_c$ regions. Third, we also observe that the saturation outage probability of the three policies in high SNR regions becomes smaller when $p_i$ gets larger. This is because a larger $p_i$ indicates a higher probability of employing the licensed channel for data transmission, resulting in lower outage probability.

Figure 4 plots the outage probability of the three policies versus the channel idle probability for different values of normalized SNR, where the performance curves plotted correspond to $\gamma_c = 0$ dB and $\gamma_c = 10$ dB, respectively. It can be seen that the ST policy outperforms the other two policies for all settings of $p_i$. Besides, the outage probability of all three policies decreases as the channel idle probability increases, since a higher value of $p_i$ results in a higher probability of successful data transmission and therefore reduces the outage probability.
We can also observe that when $\gamma_c$ is small ($\gamma_c = 0$ dB), the gap between the ST and ET policies becomes larger as $p_i$ increases, and the shortsighted policy achieves better performance than the ET policy. When $\gamma_c$ is large ($\gamma_c = 10$ dB), there is only a tiny difference between the ST and ET policies, and the ET policy achieves better performance than the shortsighted policy.

Figure 5 illustrates the outage probability of the three policies as a function of the average channel gain $G_a$ for different $\gamma_c$. It can be observed that the outage probability goes down as $G_a$ increases. This is because as $G_a$ increases, the data transmission is more efficient when the primary channel is idle, resulting in lower outage probability. Besides, the ST policy outperforms the other two policies for all of the settings of $G_a$. It is also shown that when $\gamma_c$ is small ($\gamma_c = 0$ dB), the shortsighted policy outperforms the ET policy, while when $\gamma_c$ is large ($\gamma_c = 10$ dB), the ET policy achieves better performance than the shortsighted policy when $G_a \ge 1.2$. Thus, we can conclude that when $\gamma_c$ is small or the channel quality is poor, the shortsighted policy outperforms the ET policy.

Figure 6 plots the outage probability of the three policies with different settings of battery energy state and normalized SNR. It can be seen that the outage probability of the ST and ET policies decreases as $N_B$ increases, while the outage probability of the shortsighted policy remains almost unchanged under different values of $N_B$. This indicates that increasing the capacity of the battery efficiently decreases the outage probability of the ST and ET policies, whereas the performance of the shortsighted policy is almost independent of the battery capacity. Besides, we can also observe that for lower $\gamma_c$, the shortsighted policy outperforms the ET policy.
For higher $\gamma_c$, the ET policy achieves better performance than the shortsighted policy when $N_B \ge 14$.

Figure 7 shows the outage probability of the ST and ET policies as a function of $\gamma_c$ for different data rate thresholds $R_{th}$. We can see that for lower $\gamma_c$, a lower data rate threshold leads to a lower outage probability, and the curves with $R_{th} = 2$ outperform the curves with $R_{th} = 4$ and $R_{th} = 6$. However, when $\gamma_c$ is sufficiently high, the curves corresponding to $R_{th} = 2$, $R_{th} = 4$ and $R_{th} = 6$ all converge to the same value. This is because when $\gamma_c$ is sufficiently high, according to Equation (29), the reward functions do not depend on $R_{th}$, so the curves with different $R_{th}$ achieve the same outage probability.

The outage probability of the ST and ET policies with different settings of battery energy state $N_B$ and idle probability $p_i$ is shown in Figure 8. It can be seen that the outage probability of the ST and ET policies decreases as the battery storage capacity $N_B$ increases. This is because with a higher $N_B$, the SU can allocate the energy more efficiently: if the expected channel condition of the next time slot is good and the channel is estimated to be idle with high probability, the SU can allocate more energy for data transmission; otherwise, the SU can allocate less energy for data transmission and save more energy for future use.
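The convergence of the curves for different $R_{th}$ at high $\gamma_c$ can be made concrete with a short calculation. The sketch below assumes a Shannon-rate outage criterion with transmit power $P > 0$, channel power gain $g$ and a fraction $\theta \in (0, 1]$ of the slot left for transmission after sensing. These symbols and this rate model are assumptions introduced for illustration and are not taken from Equation (29).

```latex
\begin{align}
\text{outage}
  &\iff \theta \log_2\!\left(1 + \gamma_c P g\right) < R_{th} \\
  &\iff g < g_{th}(\gamma_c) := \frac{2^{R_{th}/\theta} - 1}{\gamma_c P}, \\
\lim_{\gamma_c \to \infty} g_{th}(\gamma_c) &= 0
  \quad \text{for every finite } R_{th} \text{ and every } P > 0 .
\end{align}
```

Under this assumed model, for a fading distribution with no probability mass at $g = 0$, the probability $\Pr(g < g_{th}(\gamma_c))$ vanishes as $\gamma_c$ grows, whatever the value of $R_{th}$. The remaining outage then comes only from slots in which no transmission takes place, which is consistent with the observation that one unit of transmit power suffices in the high-SNR regime.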

Conclusions
In this paper, we have considered a time-slotted energy harvesting cognitive radio sensor network, where the cognitive sensor nodes rely solely on harvested energy for spectrum sensing and data transmission. Our goal is to minimize the long-term outage probability of the sensor node by adapting the sensing time and transmission power to the sensor node's current knowledge of battery energy, channel fading and harvested energy. This problem has been formulated as an infinite-horizon discounted MDP. The existence of the optimal stationary deterministic policy has been proven, and an $\epsilon$-optimal sensing-transmission policy has been presented using value iteration. Here, $\epsilon$ is the error bound between the ST policy and the optimal policy, which can be pre-defined according to the actual need. Moreover, for the special case where the signal-to-noise ratio (SNR) is sufficiently high, we have introduced an efficient transmission policy with reduced computational complexity. It has been shown that the efficient transmission policy is equivalent to the sensing-transmission policy in high-SNR regions. Finally, we have conducted extensive simulations to verify the performance of the proposed policies, and the impacts of system parameters have also been investigated.