Optimal Update for Energy Harvesting Sensor with Reliable Backup Energy

In this paper, we consider an information update system in which a wireless sensor sends timely updates to a destination over an erasure channel, powered by harvested energy and reliable backup energy. The Age of Information (AoI) metric is adopted to measure the timeliness of the updates received at the destination. We aim to find the optimal information updating policy that minimizes the time-average weighted sum of the AoI and the reliable backup energy cost by formulating an infinite-state Markov decision process (MDP). The optimal information updating policy is proved to have a threshold structure. Based on this special structure, an algorithm for efficiently computing the optimal policy is proposed. Numerical results show that the proposed optimal updating policy outperforms baseline policies.


I. INTRODUCTION
Timely information updates from wireless sensors to a destination are critical in real-time monitoring and control systems. To describe the timeliness of information updates, the metric Age of Information (AoI) was proposed [1]. Different from conventional performance metrics such as delay and throughput, the AoI is the time elapsed since the generation of the latest received update. A lower AoI generally indicates that the destination holds fresher information. Therefore, AoI-minimal status updating policies in sensor networks have been widely studied [2]-[4].
In sensor-based information updating systems, energy is consumed in sensing and transmitting updates. If the sensor's energy comes from the grid, it pays an electricity bill. If the sensor's energy comes from its own non-rechargeable battery, the price of sensing and transmitting updates is the cost of frequent battery replacement. We call these sources reliable energy, since they enable the sensor to operate reliably until the power grid is cut off or the battery is exhausted. There is clearly a price to be paid for using reliable energy to update.
To reduce reliable energy consumption, a natural idea is to introduce energy harvesting technology [5]. Energy harvesting can continuously replenish the sensor by extracting energy from solar power, ambient RF, and thermal sources. The harvested energy is stored in the sensor's rechargeable battery and, being renewable, can be used for free. Hence, reliable energy can serve as backup energy. The coexistence of reliable backup energy and harvested energy has been researched and promoted in both academia and industry [6], [7]. However, because the harvested energy arrives sporadically and irregularly, and the capacity of the rechargeable battery is limited, we still need to schedule the use of energy properly to reduce the cost of reliable backup energy while maintaining the timeliness of information updates (i.e., the average AoI).
Intuitively, the average AoI and the cost of using reliable energy cannot be minimized simultaneously. On the one hand, a lower average AoI means that the sensor senses and transmits updates more frequently, which increases the consumption of reliable backup energy since the harvested energy is limited. On the other hand, to reduce the cost of reliable backup energy, the sensor would exploit only the harvested energy; due to the uncertainty of the energy harvesting process, the average AoI of the system then inevitably increases.
Therefore, in this paper, we focus on achieving the best trade-off between the average AoI and the cost of reliable backup energy in a sensor-based information update system where an energy harvesting sensor with reliable backup energy sends timely updates to the destination through an erasure channel. Related work includes [8]-[13]. The works [8]-[10] investigate AoI-minimal status updating policies for sensor networks that rely solely on harvested energy. In [11]-[13], although the sensors can use both harvested energy and reliable energy, the authors optimize only for delay or throughput and ignore the timeliness of the system. Under our settings, we minimize the long-term average weighted sum of the AoI and the paid reliable energy cost to find the optimal information updating policy. The structure of the optimal policy is analyzed theoretically, and its performance is demonstrated through simulation.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. System Model Overview
In this paper, we consider a point-to-point information update system where a wireless sensor and a destination are connected by an erasure channel, as shown in Fig. 1. The sensor can use both the free harvested energy stored in its rechargeable battery and the paid reliable backup energy to generate and send real-time environmental status updates. The destination keeps track of the environment status through the received updates. We apply the Age of Information metric to measure the freshness of the status information available at the destination.
Without loss of generality, time is slotted with equal length and indexed by t ∈ N. At the beginning of each time slot, the sensor decides whether to generate and transmit an update to the destination or to stay idle. The decision at slot t, denoted by a[t], takes values in the action set A = {0, 1}, where a[t] = 1 means that the sensor generates and transmits an update, and a[t] = 0 means the sensor stays idle. The channel between the sensor and the destination is assumed to be noisy and time-invariant, and each update is corrupted with probability p during transmission, where p ∈ (0, 1). The destination feeds back an instantaneous ACK to the sensor through an error-free channel when it successfully receives an update, and a NACK otherwise. We assume the above processes are completed within one time slot.

B. Age of Information
Age of Information (AoI) is defined in this paper as the time elapsed since the generation of the latest successfully received update. Let U[t] be the generation time of the most recently received update before time slot t, and let ∆[t] denote the AoI at the destination in time slot t. Then the AoI is given by

∆[t] = t − U[t]. (1)

In particular, the AoI drops to one if a new update is successfully received; otherwise, it increases by one. To summarize, the evolution of the AoI can be expressed as

∆[t + 1] = 1 if an update is successfully received in slot t, and ∆[t + 1] = ∆[t] + 1 otherwise. (2)

A sample path of the AoI is depicted in Fig. 2.
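As a concrete illustration, the AoI recursion (2) can be sketched in code (a minimal sketch of ours, not part of the original paper; the function and variable names are our own):

```python
def aoi_next(aoi, delivered):
    """AoI recursion (2): reset to 1 on a successful reception,
    otherwise age by one slot."""
    return 1 if delivered else aoi + 1

# A short sample path: successful receptions in the first and fourth slots.
aoi, trace = 1, []
for delivered in [True, False, False, True, False]:
    aoi = aoi_next(aoi, delivered)
    trace.append(aoi)
# trace == [1, 2, 3, 1, 2]
```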

C. Description of Energy Supply
We assume that only the sensor's measurement and transmission process consumes energy, and all other energy consumption is ignored. The energy unit is normalized so that the generation and transmission of each update consumes one energy unit. As described above, the energy sources of the sensor include energy harvested from nature and reliable backup energy. The sensor stores the harvested energy in a rechargeable battery for later use. The maximum capacity of the rechargeable battery is B units (B > 1). Let b(t) be the harvested energy accumulated in time slot t. Since the energy available for harvesting is relatively limited, b(t) sometimes does not reach one energy unit. We therefore use a Bernoulli process with parameter λ to approximately capture the arrival process of harvested energy, as also adopted in [14]-[16]. That is, Pr{b(t) = 1} = λ and Pr{b(t) = 0} = 1 − λ in each time slot t.
For the reliable backup energy, we assume that it contains far more energy units than the rechargeable battery can store, so it is modeled as infinite; however, it must be used for a fee. Therefore, when the charge of the rechargeable battery is nonzero, the sensor prioritizes the energy in the rechargeable battery for status updates; otherwise, it automatically switches to the reliable backup energy until new energy is harvested. Defining the charge of the rechargeable battery at the beginning of time slot t as the battery state q[t], the evolution of the battery state from slot t to slot t + 1 can be summarized as

q[t + 1] = min{q[t] − a[t]u(q[t]) + b(t), B}, (3)

where u(·) is the unit step function, defined as

u(x) = 1 if x > 0, and u(x) = 0 otherwise. (4)

Suppose that, under the paid reliable energy supply, the cost of generating and transmitting one update is a non-negative value C_r. Defining E[t] as the paid reliable energy cost in time slot t, we have

E[t] = C_r a[t](1 − u(q[t])). (5)
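The battery recursion (3) and the paid cost (5) can be sketched in code (our illustration, not from the paper; the constants B and C_R follow the simulation values quoted in Section IV):

```python
B = 20      # rechargeable battery capacity (simulation value from Sec. IV)
C_R = 2.0   # price of one update drawn from reliable backup energy

def battery_step(q, a, b):
    """Battery recursion (3): an update (a=1) drains one unit only if
    the rechargeable battery is non-empty; the Bernoulli arrival b is
    then added, capped at the capacity B."""
    drain = 1 if (a == 1 and q > 0) else 0
    return min(q - drain + b, B)

def reliable_cost(q, a):
    """Paid cost (5): an update sent with an empty rechargeable battery
    falls back to the reliable backup energy at price C_R."""
    return C_R if (a == 1 and q == 0) else 0.0
```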

D. Problem Formulation
Let Π denote the set of non-anticipative policies, in which the scheduling decision a[t] is made based on the history of actions and observations up to slot t as well as the system parameters (i.e., p, λ, etc.). To keep the information at the destination fresh, the sensor needs to send updates. However, due to the randomness of harvested energy arrivals, the battery may sometimes be insufficient to support an update, and the sensor has to draw on the reliable backup energy. To balance the information freshness and the paid reliable backup energy cost, we aim to find the optimal information updating policy π ∈ Π that minimizes the time-average weighted sum of the AoI and the paid reliable backup energy cost. The problem is formulated as

min_{π∈Π} limsup_{T→∞} (1/T) E[ Σ_{t=1}^{T} (∆[t] + ω E[t]) ] (6)
s.t. a[t] ∈ A, ∀t ∈ N,

where ω > 0 is a weighting factor that trades the average AoI off against the paid energy cost.
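Under our reading of objective (6) (the weight ω multiplies the paid energy cost, consistent with the later observation that the zero-wait policy becomes optimal as ω → 0), the time-average cost of any stationary policy can be estimated by Monte Carlo simulation. The following sketch is our own illustration, not the paper's code; all parameter defaults follow the simulation section:

```python
import random

def average_cost(policy, p=0.2, lam=0.5, B=20, c_r=2.0, omega=10.0,
                 horizon=100_000, seed=0):
    """Empirical time-average of AoI + omega * E[t] under a given
    policy(aoi, q) -> {0, 1}: a Monte Carlo sketch of objective (6)."""
    rng = random.Random(seed)
    aoi, q, total = 1, 0, 0.0
    for _ in range(horizon):
        a = policy(aoi, q)
        paid = c_r if (a == 1 and q == 0) else 0.0   # reliable-energy charge
        delivered = a == 1 and rng.random() > p       # erased w.p. p
        drain = 1 if (a == 1 and q > 0) else 0
        b = 1 if rng.random() < lam else 0            # Bernoulli arrival
        total += aoi + omega * paid
        aoi = 1 if delivered else aoi + 1             # AoI recursion (2)
        q = min(q - drain + b, B)                     # battery recursion (3)
    return total / horizon
```

For instance, `average_cost(lambda aoi, q: 1)` evaluates the zero-wait policy, which transmits in every slot.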

III. OPTIMAL POLICY ANALYSIS
In this section, we aim to solve problem (6) and obtain the optimal policy. It is difficult to solve the original problem directly due to the random erasures and the temporal dependency in both the AoI evolution and the battery state evolution. We therefore reformulate the original problem as an average-cost MDP with infinite state space and analyze the structure of the optimal policy.

A. Markov Decision Process Formulation
According to the system description in the previous section, the MDP is formulated as follows:
• States. The state in slot t is x[t] = (∆[t], q[t]) ∈ S, i.e., the current AoI together with the battery state.
• Actions. The action a[t] takes values in A = {0, 1}, indicating whether the sensor generates and transmits an update.
• Transition probabilities. If a[t] = 1, the update is successfully received with probability 1 − p (7) and erased with probability p (8). In both cases, the evolution of the AoI still follows equation (2) and the evolution of the battery state follows (3).
• One-step cost. For the current state x = (∆, q), the one-step cost C(x, a) of taking action a is expressed by

C(x, a) = ∆ + ω C_r a (1 − u(q)). (9)

After the above modeling, the original problem (6) is transformed into obtaining the optimal policy of the MDP that minimizes the infinite-horizon average cost:

min_{π∈Π} limsup_{T→∞} (1/T) E[ Σ_{t=1}^{T} C(x[t], a[t]) ]. (10)

Denote by Π_SD the set of stationary deterministic policies. Given the observation (∆[t], q[t]) = (∆, q), a policy π ∈ Π_SD selects action a[t] = π(∆, q), where π(·) : (∆, q) → {0, 1} is a deterministic function from the state space S to the action space A. According to [17], under certain verifiable conditions there exists a stationary deterministic policy that minimizes the above unconstrained MDP with countably infinite state space and finite action space. In the next subsection, the structural properties of the optimal policy are investigated.
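The one-step cost (9) can be written as a small helper (our sketch; the default values of omega and c_r follow the simulation section):

```python
def one_step_cost(aoi, q, a, omega=10.0, c_r=2.0):
    """One-step cost C(x, a): the current AoI plus the weighted
    reliable-energy charge, incurred only when an update is sent
    from an empty battery."""
    return aoi + (omega * c_r if (a == 1 and q == 0) else 0.0)
```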

B. Structure Analysis of Optimal Policy
According to [18], there exists a value function V(x) satisfying the following Bellman equation for the infinite-horizon average-cost MDP:

V(x) + λ̄ = min_{a∈A} Q(x, a), (11)

where λ̄ denotes the average cost under the optimal policy (not to be confused with the energy arrival rate λ), and Q(x, a) is the state-action value function, i.e., the value of taking action a in state x:

Q(x, a) = C(x, a) + Σ_{x'∈S} Pr(x' | x, a)V(x'). (12)

The optimal policy π⋆ ∈ Π_SD in state x can thus be expressed as

π⋆(x) = arg min_{a∈A} Q(x, a). (13)

We first prove the monotonicity of the value function along different dimensions, which is summarized in the following lemma.
Lemma 1. For any fixed battery state q, the value function V(∆, q) is non-decreasing in the AoI ∆ (14); for any fixed ∆, V(∆, q) is non-increasing in the battery state q (15).

Proof: See Appendix VI-A in the Supplementary Material [19].
Based on Lemma 1, we then establish the incremental property of the value function, shown in the following lemma.

Lemma 2. For any ∆ ∈ Z+ and q ∈ {0, ..., B − 1}, the value function satisfies

V(∆ + 1, q + 1) − V(∆, q + 1) ≥ p[V(∆ + 1, q) − V(∆, q)]. (17)

With Lemma 1 and Lemma 2, we can now state our main result.
Theorem 1. Assume that the channel erasure probability p is fixed. For each battery state q, there exists a threshold ∆_q such that when ∆ < ∆_q the optimal action is π⋆(∆, q) = 0, i.e., the sensor stays idle, and when ∆ ≥ ∆_q the optimal action is π⋆(∆, q) = 1, i.e., the sensor generates and transmits a new update.
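The structure in Theorem 1 makes the optimal policy a per-battery-level threshold rule, which can be sketched as a simple lookup (our illustration; the threshold values below are hypothetical, purely for demonstration):

```python
def threshold_policy(thresholds):
    """Theorem 1's structure: for battery level q, transmit iff the
    current AoI has reached the threshold Delta_q = thresholds[q]."""
    def policy(aoi, q):
        return 1 if aoi >= thresholds[q] else 0
    return policy

# Hypothetical thresholds, decreasing in the battery level.
pi = threshold_policy({0: 8, 1: 3, 2: 1})
```

With an empty battery this hypothetical sensor waits until the age reaches 8; with two units stored it updates immediately.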
Proof: The optimal policy has a threshold structure if Q(x, a) is sub-modular, that is,

Q(∆ + 1, q, 1) + Q(∆, q, 0) ≤ Q(∆, q, 1) + Q(∆ + 1, q, 0). (18)

We divide the proof into the following three cases.

Case 1. When q = 0, for any ∆ ∈ Z+ we obtain (18), where the last inequality (a) is due to the monotonicity property revealed by (14) in Lemma 1.

Case 2. When q ∈ {1, ..., B − 1}, for any ∆ ∈ Z+ the inequality follows in the same way.

Case 3. When q = B, for any ∆ ∈ Z+ we obtain (18), where the last inequality (a) is due to the incremental property revealed by (17) in Lemma 2. This completes the proof.

Theorem 1 reveals the threshold structure of the optimal policy: if the optimal action in a certain state is to generate and transmit an update, then in any state with the same battery level and a larger AoI, the optimal action must be the same.

Algorithm 1 Modified Value Iteration Algorithm
Input: iteration number K and iteration threshold ε.
Output: optimal policy π⋆(x) for all states x.
1: Initialization: V_0(x) = 0.
2: for k = 0, 1, 2, ..., K do
3:   for each state x ∈ S do
4:     for each action a ∈ A do
       ...
     end for

Based on this unique threshold structure, we propose a modified value iteration algorithm to compute the optimal policy, shown in Algorithm 1. Specifically, we first iterate the Bellman equation (11) to obtain the value function. Then, based on the threshold structure, the optimal policy can be obtained without evaluating (13) in every state, which reduces the computational complexity.
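A relative value iteration in the spirit of Algorithm 1 can be sketched as follows. This is our own illustration, not the paper's listing: the AoI axis is truncated at d_max (an implementation choice the paper does not specify), and the final loop exploits the threshold structure by scanning each battery row only until the first state where transmitting is preferred:

```python
def modified_value_iteration(p=0.2, lam=0.5, B=4, c_r=2.0, omega=10.0,
                             d_max=50, K=1000, eps=1e-6):
    """Relative value iteration for the (AoI, battery) MDP.
    V[d-1][q] stores the value of state (AoI=d, battery=q)."""
    V = [[0.0] * (B + 1) for _ in range(d_max)]

    def q_value(V, d, q, a):
        # One-step cost: AoI plus weighted reliable-energy charge.
        cost = d + (omega * c_r if (a == 1 and q == 0) else 0.0)
        drain = 1 if (a == 1 and q > 0) else 0
        exp_v = 0.0
        for b, pb in ((1, lam), (0, 1 - lam)):   # Bernoulli arrival
            q2 = min(q - drain + b, B)
            nxt = min(d, d_max - 1)              # index of AoI d+1, capped
            if a == 1:
                # Delivered w.p. 1-p (AoI resets to 1), erased w.p. p.
                exp_v += pb * ((1 - p) * V[0][q2] + p * V[nxt][q2])
            else:
                exp_v += pb * V[nxt][q2]
        return cost + exp_v

    for _ in range(K):
        ref = min(q_value(V, 1, 0, 0), q_value(V, 1, 0, 1))  # reference state
        newV = [[min(q_value(V, d + 1, q, 0), q_value(V, d + 1, q, 1)) - ref
                 for q in range(B + 1)] for d in range(d_max)]
        delta = max(abs(newV[d][q] - V[d][q])
                    for d in range(d_max) for q in range(B + 1))
        V = newV
        if delta < eps:
            break

    # Threshold extraction: for each battery level, scan the AoI upward and
    # stop at the first state where transmitting is (weakly) preferred.
    thresholds = {}
    for q in range(B + 1):
        thresholds[q] = d_max + 1   # "never transmit" within the truncation
        for d in range(1, d_max + 1):
            if q_value(V, d, q, 1) <= q_value(V, d, q, 0):
                thresholds[q] = d
                break
    return V, thresholds
```

The early-exit scan is where the threshold structure saves work: once the first transmitting state in a battery row is found, all larger AoI values in that row inherit the same action.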

IV. NUMERICAL RESULT
In this section, we first illustrate the threshold structure of the optimal policy through simulation results. We then compare the performance of the optimal policy with the zero-wait policy, the periodic policy, the randomized policy, and the energy-first policy under different system parameters, namely the weighting factor ω, the energy harvesting probability λ, and the erasure probability p. The zero-wait policy generates and transmits an update in every time slot [2], while the periodic policy periodically generates and sends updates to the destination. The randomized policy chooses, with equal probability in each time slot, either to send an update or to stay idle. The energy-first policy uses only the harvested energy: as long as the battery state is nonzero, it senses and sends an update; otherwise it remains idle. Obviously, the energy-first policy never incurs any reliable energy cost. In our simulations, the cost of reliable energy per update is C_r = 2 and the maximum battery capacity is B = 20.
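The four baseline policies can be written as simple decision rules (our sketches, not the paper's code; the periodic policy needs an internal slot counter, and the randomized policy transmits with probability 1/2 per the "equal probability" description):

```python
import random

def zero_wait(aoi, q):
    """Zero-wait: generate and transmit in every slot."""
    return 1

def periodic(period):
    """Periodic: transmit every `period`-th slot via an internal counter."""
    state = {"t": 0}
    def policy(aoi, q):
        state["t"] += 1
        return 1 if state["t"] % period == 0 else 0
    return policy

def randomized(aoi, q, _rng=random.Random(1)):
    """Randomized: transmit or stay idle with equal probability."""
    return 1 if _rng.random() < 0.5 else 0

def energy_first(aoi, q):
    """Energy-first: transmit only when harvested energy is available."""
    return 1 if q > 0 else 0
```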
Fig. 3 shows the optimal policy under different system parameters. All the subfigures in Fig. 3 exhibit the threshold structure described in Theorem 1. Note that the weighting factor ω is set to 10, which is neither too small nor too large: intuitively, when ω is very small the optimal action in every state is to update, and when ω is very large the sensor should avoid any update that draws on the paid backup energy. Fig. 3 shows that when the AoI is small, even if the battery state is nonzero, the optimal action is to stay idle; when the AoI or the battery state is large, the optimal action is to measure and send updates. Fig. 4 shows the time-average cost under the different policies, i.e., the zero-wait policy, the periodic policy, the randomized policy, the energy-first policy, and the proposed optimal policy. Here, without loss of generality, we set the period of the periodic policy to 5 and to 10 for comparison. Under different weighting factors ω, the optimal policy proposed in this paper obtains the minimum long-term average cost among all policies, which indicates the best trade-off between the average AoI and the cost of reliable energy. When ω tends to 0, the zero-wait policy tends to be optimal: since there is no need to consider the update cost brought by the paid reliable backup energy, the optimal policy should maximize the use of updating opportunities.

Fig. 5 reveals the impact of the energy harvesting probability λ. The channel erasure probability p is set to 0.2 and the weighting factor ω to 10. Again, the proposed optimal update policy outperforms all other policies under the different energy harvesting probabilities. Interestingly, when the energy harvesting probability tends to 1, i.e., energy arrives in every time slot, the performance of the zero-wait policy and the energy-first policy equals that of the optimal policy, while a performance gap remains between the optimal policy and the other two policies. This is intuitive: when the free harvested energy is sufficient, the optimal policy must generate and transmit updates in every time slot, whereas the periodic policy and the randomized policy still stay idle in many time slots, leading to a higher average AoI and thus a higher average cost. In Fig. 6, we compare the above five policies under different erasure probabilities p, with energy harvesting probability λ = 0.5 and weighting factor ω = 10. As the erasure probability increases from 0 to 0.9, the proposed optimal update policy always performs better than the baseline policies. Note that when p = 1 all updates are erased by the noisy channel, so it is meaningless to discuss this case.

V. CONCLUSION
In this paper, we have studied the optimal updating policy for an information update system in which a wireless sensor sends updates over an erasure channel using both harvested energy and reliable backup energy. Theoretical analysis establishes the threshold structure of the optimal policy, and simulation results verify its performance.

VI. APPENDIX
A. Proof of Lemma 1

To analyze Q_k(x, a), we first write it out. The state-action value function Q_k(x, a) at state x = (∆, q) is as follows: (29) Due to the complexity of the transition probabilities and the one-step cost function, we discuss the following three cases.

Case 1. q = 0. In this case, according to the transition probabilities (7) and (8), we have the state-action value functions Q_k(∆, q, 0) and Q_k(∆, q, 1) as follows: Since V_k(∆, q) is assumed to be non-decreasing in ∆ for any fixed q, it is clear that both Q_k(∆, q, 0) and Q_k(∆, q, 1) are non-decreasing in ∆. Therefore, for any ∆_1 ≤ ∆_2 we have: As a result, by induction, V_k(∆, q) is a non-decreasing function of ∆ for any k when q = 0. By taking limits on both sides of (27), we prove that (14) holds in the case q = 0.
Case 2. 0 < q < B. In this case, according to the transition probabilities (7) and (8), we have the state-action value functions Q_k(∆, q, 0) and Q_k(∆, q, 1) as follows: Since V_k(∆, q) is assumed to be non-decreasing in ∆ for any fixed q, it is clear that both Q_k(∆, q, 0) and Q_k(∆, q, 1) are non-decreasing in ∆. Therefore, for any ∆_1 ≤ ∆_2 we have: As a result, by induction, V_k(∆, q) is a non-decreasing function of ∆ for any k and any q ∈ {1, ..., B − 1}. By taking limits on both sides of (27), we prove that (14) holds in the case 0 < q < B.

Case 3. q = B. In this case, according to the transition probabilities (7) and (8), we have the state-action value functions Q_k(∆, q, 0) and Q_k(∆, q, 1) as follows: Since V_k(∆, q) is assumed to be non-decreasing in ∆ for any fixed q, it is clear that both Q_k(∆, q, 0) and Q_k(∆, q, 1) are non-decreasing in ∆. Therefore, for any ∆_1 ≤ ∆_2 we have: As a result, by induction, V_k(∆, q) is a non-decreasing function of ∆ for any k when q = B. By taking limits on both sides of (27), we prove that (14) holds in the case q = B.
To sum up, (14) holds, and we complete the proof of the first part of Lemma 1.
By exactly the same mathematical induction, we can verify that formula (15) also holds. Due to limited space, the detailed steps are omitted here. This completes the proof of Lemma 1.
Second, we tackle formula (17). The proof applies the VIA together with mathematical induction. For convenience of exposition, we make an equivalent transformation of formula (17) as follows:

V(∆ + 1, q + 1) + pV(∆, q) ≥ V(∆, q + 1) + pV(∆ + 1, q). (41)

Since the value function of each state x is the minimum of the two state-action values, every value function in (41) has two possible values. To prove formula (41), we would in principle need to discuss 2^4 = 16 cases, which is rather cumbersome. Instead, we use a small trick: it suffices to show that, for each of the 2^2 = 4 possible combinations on the left-hand side of the inequality, there exists a combination on the right-hand side that makes "≥" hold. We introduce a mapping that uses four binary digits to represent, in order, the actions attaining the minimum in the four value functions in formula (41); for example, "1010" represents

Q(∆ + 1, q + 1, 1) + pQ(∆, q, 0) ≥ Q(∆, q + 1, 1) + pQ(∆ + 1, q, 0).

According to the trick above, we only need to verify "0000", "1010", "0101", and "1111" to prove formula (41). Due to limited space, we show only the verification of "1010"; the other three cases can be proved by the same steps.